Patterns of errors in texts written by Costa Rican university English learners: A corpus-aided study

Bonilla López, Marisela

doi:10.15517/aie.v23i1.51485


Actualidades Investigativas en Educación

On-line version ISSN 1409-4703; print version ISSN 1409-4703

Rev. Actual. Investig. Educ vol.23 no.1 San José ene./abr. 2023

http://dx.doi.org/10.15517/aie.v23i1.51485

Artículos

Patterns of errors in texts written by Costa Rican university English learners: A corpus-aided study

Patrones de errores en textos escritos por aprendices universitarios de inglés en Costa Rica: Un estudio asistido por corpus

Marisela Bonilla López¹
http://orcid.org/0000-0002-1194-7721

^¹Tenured professor and researcher at the School of Modern Languages, University of Costa Rica, San José, Costa Rica. PhD in Linguistics from KU Leuven, Belgium. ORCID: https://orcid.org/0000-0002-1194-7721. Email: marisela.bonilla@ucr.ac.cr

Abstract

The present corpus-aided study sought to identify the grammatical and non-grammatical second language (L2) error patterns of Costa Rican university English learners at all academic levels of a public university. Specifically, a total of 360 English as a foreign language learners, who were enrolled in the B.A. in English or the B.A. in English Teaching during the second semester of 2019, took the Quick Oxford Placement Test to ascertain their English proficiency level and composed an argumentative text to elicit their written errors. Results from the placement test showed that the participants' proficiency level ranged between B1 (low intermediate) and C1 (low advanced). In addition, the quantitative nature of the study required not only converting the handwritten compositions with speech recognition software but also identifying and tagging all L2 errors with a tagging system. Analyses run with statistical software for data management revealed that the learner corpus contained a total of 33 L2 error patterns, which were classified as follows: 17 grammatical, 10 stylistic, and 6 lexical. Main descriptive findings indicated that although some error frequencies dropped to zero as learners advanced in the major (e.g., capitalization and superlatives), other linguistic problem areas persisted throughout (e.g., word form errors, fragments, and word order). Concluding remarks highlight that because the error frequencies of some L2 error categories still ranked high over time, learners' L2 knowledge of lexical, syntactic, morphological, and stylistic domains may need more expert input (in the form of explicit instruction and/or feedback) depending on the complexity of the target structure.

Keywords: foreign languages; university students; linguistic research; writing

Resumen

El presente estudio asistido por corpus buscó identificar los patrones de error gramaticales y no gramaticales de la segunda lengua (L2) de los aprendices de inglés en todos los niveles académicos de una universidad pública. Específicamente, para determinar el nivel de competencia en inglés y para obtener los errores escritos, un total de 360 estudiantes de inglés como lengua extranjera, matriculados en el Bachillerato en inglés o el Bachillerato en la Enseñanza de Inglés durante el segundo semestre de 2019, completaron el Quick Oxford Placement Test y escribieron un texto argumentativo, respectivamente. Los resultados de la prueba de ubicación mostraron que el nivel de competencia de las personas participantes osciló entre B1 (intermedio bajo) y C1 (avanzado bajo). La naturaleza cuantitativa del estudio requirió no solo convertir las composiciones escritas a mano con un software de reconocimiento de voz, sino también identificar y etiquetar todos los errores L2 con un sistema de etiquetado. Los análisis de un software estadístico para la gestión de datos revelaron que el corpus contenía un total de 33 patrones de error de L2, los cuales se clasificaron de la siguiente manera: 17 gramaticales, 10 estilísticos, y 6 léxicos. Los hallazgos descriptivos principales indicaron que, aunque algunas frecuencias de error se redujeron hasta el punto de no tener ninguno a medida que el estudiantado participante avanzaba en la carrera (por ejemplo, mayúsculas, superlativos, modales y cuantificadores), otras áreas de problemas lingüísticos persistieron independientemente del nivel académico (por ejemplo, errores de forma de palabras, fragmentos, y orden de las palabras). 
Las observaciones finales destacan que debido a que las frecuencias de error de algunas categorías no disminuyeron con el tiempo, el conocimiento de los dominios léxicos, sintácticos, morfológicos y estilísticos de L2 de los estudiantes podría necesitar más aportes de expertos (en forma de instrucción explícita y/o retroalimentación) dependiendo de la complejidad de la estructura de meta.

Palabras clave Lengua extranjera; Estudiante universitario; Investigación lingüística; Expresión escrita

1. Introduction

The latest edition of the world's largest ranking of countries by English skills, carried out by EF EPI (^{Education First English Proficiency Index, 2021}), indicates that out of 112 countries, Costa Rica ranks 44th and has moderate English proficiency with a score of 553 (vis-à-vis the Netherlands, which has very high proficiency and ranks 1st on the list with a score of 663). Some may argue that a survey of this type cannot by any stretch paint a completely accurate picture due to sampling procedures^⁽¹⁾, yet it certainly offers a glimpse of a larger reality: that the English proficiency level of Costa Rican pupils and youngsters seems to be stagnant. In fact, recent news reports pose a problem that authorities of the Ministry of Public Education (MEP in Spanish) have yet to grapple with. In 2021, the Foreign Language Assessment and Training Program (PELEX in Spanish) from the School of Modern Languages of the University of Costa Rica administered a language competence test in all public high schools nationwide. The results were not encouraging: 64% of the students were placed in the A1 or A2 band of the Common European Framework of Reference for Languages. Such results fell short of the B1 minimum that MEP was hoping for, and they certainly do not bode well for reaching bilingualism by 2040 (^{Ruiz, 2022}).

The foregoing implies that teaching English could represent a daily challenge in the life of Costa Rican second language (L2) practitioners generally and L2 writing teachers particularly, especially considering that any L2 issue that high schoolers carry over may surface at the university level. Hence, this scenario calls for one action that could be useful in furthering knowledge of English teaching in a context with a clear educational need: learner corpus research. This line of inquiry “has primarily relied on collecting and analyzing second language … learner writings” (Granger, 2008 cited in ^{Alexopoulou et al., 2017}, p. 1) with the purpose of, among other things, identifying frequency of use of given L2 structures (^{Neff et al., 2004}) and ascertaining areas of L2 struggle (^{Arjan et al., 2013}). Indeed, the fact that learners' language collections can be computerized has made it possible to have large learner corpora of over 40 million words (e.g., the Cambridge Learner Corpus) as well as smaller ones collected with a specific research purpose in mind (^{Díaz-Negrillo, 2009}).

However, corpus data about the linguistic problems of EFL learners from various first language (L1) Spanish backgrounds are limited: available knowledge emerges mainly from EFL learners in Spain, be it from large (^{Díaz-Negrillo and Valera, 2010}) or small learner corpora (^{Díez-Bedmar, 2005}). What is more, to the best of the researcher's knowledge, no major college-wide study has been conducted in the context of this investigation. Hence, in an attempt to assist in the understanding of EFL in Costa Rica generally and from an undergraduate standpoint specifically, the present corpus-aided investigation seeks to identify the L2 error patterns of Costa Rican university writers at all academic levels of an English major. Specifically, the research question that guided this study was the following: What are the grammatical and non-grammatical L2 error patterns of university writers across academic levels of an English major of a public university?

2. Theoretical background

With the advent of Contrastive Analysis (CA) in the late 1950s and Error Analysis (EA) in the 1960s, researchers sought to analyze learners' L2 errors by looking for differences between learners' L2 and first language (L1) (i.e., CA) and to classify L2 learners' errors to explain what caused them (i.e., EA) (for a review, see ^{Bitchener and Ferris, 2012}). From these studies (^{Bhela, 1999}), it was possible to determine that learners' L1 may have an influence on L2 written inaccuracies. Specifically related to Spanish L1, different researchers have shed light on the nature of errors of speakers learning English as a FL. One such example is ^{Alonso (1997)}, who conducted a study with twenty-eight first year EFL high school students in Spain. According to the author, errors from compositions about the last film the participants had seen were mostly interlingual errors, that is, those “that reflect the learner's first language structures” (^{Dulay et al., 1982}, p. 23). Similar to ^{Alonso (1997)}, Calsín (2011, cited in ^{Vargaya, 2019}) analyzed the participants' texts—in this case, 4th and 5th year Linguistics and English students—and found Spanish L1 influence on written errors related to the absence of the –s for the third person conjugation in simple present tense (omission error), the unnecessary addition of –s in adjectives (addition error), and the lack of accuracy in placing the adverbs of frequency in the correct order (lack of sentence order).

Nevertheless, criticism of EA and CA theories for being too limited in their focus (^{Bitchener, 2016}; ^{Ellis, 1994}), on the one hand, and the incorporation of computers in data collection, on the other hand, shifted empirical efforts to a line of inquiry with a methodology that studies language use beyond the causes of L2 errors and L1 comparisons to understand them: that is, corpus linguistics. ^{Lindquist (2009)} defines corpus as “a collection of texts which is stored on some kind of digital medium and used by linguists to retrieve linguistic items for research or by lexicographers for dictionary-making” (p. 3). As a result, there are large native corpora that contain all sorts of samples of English, which is the most studied language thus far (^{Granger, 1998a}). Some of these are the Brown/Frown Corpus, the London-Lund Corpus of Spoken English (LLC), the Bank of English (BoE), the British National Corpus (BNC), the Corpus of Contemporary American English (COCA), and the International Corpus of English (ICE). Interestingly, the emergence of native English corpora made it clear that there was also a need for corpora that studied English as used by L2 learners, hence the term learner corpora (^{Díaz-Negrillo, 2009}; ^{Nesselhauf, 2004}). Among the most prominent learner corpora are the International Corpus of Learner English (ICLE), the Longman Learners' Corpus (LLC), and the Hong Kong University of Science and Technology (HKUST) Learner Corpus (for a comprehensive list, see ^{Lindquist, 2009}; ^{Pravec, 2002}). Currently, learner corpora “give us access not only to errors but also to learners' total interlanguage” (^{Granger, 1998b}, p. 6)^⁽²⁾.
One gap, however, is that much of the understanding of English errors at a university level comes from seminal work on native-speaker corpora (Connors and Lunsford, 1988; ^{Hodges, 1941}; ^{Johnson, 1917}; ^{Lunsford and Lunsford, 2008}; ^{Witty and Green, 1930}), and when studies with EFL university learners have been conducted, the context is situated mainly in Europe (e.g., ^{Dagneaux et al., 1998}) and, to a lesser extent, Asia (^{Narita, 2013}).

Thus, among the few investigations on the overall written production of Spanish L1 EFL university writers are Díaz-Negrillo and Valera (2010), ^{Neff et al. (2004)}, and Díez-Bedmar (2005), out of which just two explore learners' errors. To illustrate, Neff et al. investigated fourth-year university learners' lexico-grammatical patterns of writer stance (e.g., it is + (adverb) adjective + that; it is + (adverb) said/thought + that) and compared them with those of professional writers and English L1 university students. The participants were EFL writers whose first languages were Dutch, Belgian-French, Italian, and peninsular Spanish, and the language samples were extracted from ICLE. Main results showed an overuse of it is + (adverb) adjective + that and the agentless passive by the EFL learners, whereas the it is + adjective pattern showed no significant differences. Unlike Neff et al., Díez-Bedmar (2005) analyzed first-year students' essays to identify L2 learners' written errors at a morphological, syntactic, semantic, and pragmatic level. Overall, the findings revealed that some of the most problematic areas were punctuation, spelling conventions, verb tenses, and articles. Then, as an error frequency study, Díaz-Negrillo and Valera (2010) examined a sample of the Non-native Corpus of English (NOCE, ^{Díaz-Negrillo, 2009}) and found a complex picture where comma usage, for example, seemed highly problematic along with lexical issues such as wrong word choice.

Clearly, despite their significant findings, previous investigations are not enough to gain sound insight into Spanish L1 EFL learners' interlanguage and to inform, in turn, L2 educators and researchers alike. Consequently, the need to further broaden current knowledge of Spanish L1 EFL university writers at different academic levels in Costa Rica inspired this study.

3. Methodology

3.1 Approach

Different researchers agree that the word corpus speaks of a methodology being used rather than a topic in linguistics being studied (e.g., ^{Díaz-Negrillo, 2009}; ^{Lindquist, 2009}; ^{Nesselhauf, 2004}). For instance, currently “corpus is almost always synonymous of electronic corpus, i.e., a collection of texts which is stored on some sort of digital medium and used by linguists to retrieve linguistic items for research or by lexicographers for dictionary-making” (^{Lindquist, 2009}, p. 3). Against this background, the present quantitative study used corpus methods both to create the learner corpus from the participants' written samples of the second semester of 2019 (i.e., IIC2019) and to display the ensuing descriptive findings (see 3.4 for a detailed description). Indeed, in terms of current distinctions in corpus linguistics (i.e., corpus-based, corpus-driven, and corpus-aided), this study is corpus-aided (also known as corpus-supported) because corpora are used to find illustrative examples of, in this case, L2 error patterns (see ^{Lindquist, 2009}, p. 26 for a description).

3.2 Participants and context

This study took place in IIC2019 at the School of Modern Languages of the University of Costa Rica (UCR), a public university, at its Rodrigo Facio Branch in San José. Specifically, to create the written learner corpus, only courses with a writing component were visited across all academic levels of the English major: first year (Integrated English I and Integrated English II), second year (English Composition I and English Composition II), third year (English Rhetoric I and English Rhetoric II), and fourth year (English Rhetoric III and English Rhetoric IV). Hence, the selection criterion was purposive. In the initial stage (see 3.3.2), consent forms from 383 individuals were gathered, but after discarding those whose data were not complete due to absenteeism (n = 20) and those whose L1 was not Spanish (n = 3), the total number of participants was 360 (male = 61.9%, female = 38.1%), distributed as follows: first year (n = 78), second year (n = 123), third year (n = 95), and fourth year (n = 64). The large majority of the EFL participants (Mage = 23, SD = 5.52) were Costa Rican (n = 355). The rest came from El Salvador (n = 1), Venezuela (n = 1), Nicaragua (n = 2), and Colombia (n = 1). Thus, in all cases, the participants' L1 was Spanish. As for their English proficiency, it differed by academic level: first year (low intermediate; SD = .807), second year (low intermediate; SD = .741), third year (high intermediate; SD = .805), and fourth year (low advanced; SD = .889).

3.3 Design

3.3.1 Instruments

3.3.1.1 Learner Profile Sheet

The participants completed a learner profile sheet to provide not only their general personal information but also their specific background information related to their L1 and L2 history (See Appendix A).

3.3.1.2 Placement Test

To ascertain learners' proficiency level, the Oxford's Quick Placement Test (OQPT) was administered (see results in 3.2). The exam could be completed in one of two versions: online, if a language laboratory was available (based on the course schedule) at the time of administering the instrument, or in print, if no laboratory was available.

3.3.1.3 Argumentative texts

To create the learner corpus, the participants were provided with a list of six prompts (See Appendix B). Opinion writing (i.e., argumentation) was chosen because it was the only rhetorical pattern that all learners had had some exposure to across all academic levels. Any other rhetorical pattern (e.g., comparison/contrast or cause/effect) would not have given learners equal writing conditions. With this in mind, a specific number of words was also not required. They were, however, encouraged (irrespective of the prompt of their choice) to explain their reasons clearly and use examples from their own experience to support their ideas. This was done to maximize the chances of a similar text length across levels. All compositions were written on paper since no language labs were available at the time of writing the texts. After conversion of the texts to an editable format (see 3.4), the total number of words in the learner corpus was 57 054 (M = 158.4, SD = 43.6). As for average length per year, it was as follows: first (Sum = 8871, M = 113.7, SD = 32.8), second (Sum = 19831, M = 161.2, SD = 38.5), third (Sum = 17094, M = 179.9, SD = 35.7), and fourth (Sum = 11258, M = 175.9, SD = 35.7).
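The corpus-size figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the study's own analysis: it takes the per-year participant counts reported in 3.2 and the per-year word totals reported here, and recomputes the corpus total and the mean text length per year. The variable names are illustrative assumptions.

```python
# Per-year figures as reported in the text: (number of texts, total words).
# One text per participant is assumed, since each learner wrote one composition.
corpus = {
    "first": (78, 8871),
    "second": (123, 19831),
    "third": (95, 17094),
    "fourth": (64, 11258),
}

# Corpus-wide totals.
total_texts = sum(n for n, _ in corpus.values())
total_words = sum(words for _, words in corpus.values())

# Mean text length per academic year, rounded to one decimal place.
mean_by_year = {year: round(words / n, 1) for year, (n, words) in corpus.items()}

# Mean text length over the whole corpus.
overall_mean = total_words / total_texts

print(total_texts, total_words)
print(mean_by_year)
print(overall_mean)
```

Recomputed this way, the totals come to 360 texts and 57 054 words, and the per-year means match the reported values to one decimal place.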

3.3.2 Procedures

Conversations with course instructors preceded the two-week data collection process. Those meetings were necessary to discuss logistics, namely, the schedule, class time availability, and number of students in the course. Then, Week 1 was spent asking for the participants' consent as well as administering the learner profile sheet and the placement test. On the one hand, the consent form part (i.e., the explanation of the research objective, the summary of both the benefits and implications of participating, and the wait for the signatures in class) took approximately 10 minutes. On the other hand, the allotted time for completing the learner profile sheet and the placement test was 30 minutes.

A week later (Week 2), learners chose one writing prompt and developed their answer on the sheets provided. They had 30 minutes to complete the task. Because no language lab was available at the time of writing, all argumentative compositions were pen-and-paper texts. By contrast, whenever a lab was available during the placement test session, learners were able to take the online version of the proficiency test.

3.4 Data coding and analysis

After the two-week data collection period, all handwritten compositions (N = 360) were converted into a digital document. To transcribe all texts, the speech recognition software Dragon Naturally Speaking was used. Whenever the software was not able to transcribe an error, it was inserted manually. Then, drawing on Bonilla et al. (2017), each converted text was assigned a code that contained the following information: the setting, the year of data collection, the native language, the target language, the proficiency level, and the participant number (e.g., UCR-20-SP-EN-B1-92). The purpose of coding each text was to keep the data anonymous.
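The text-coding scheme just described lends itself to a small parsing routine. The sketch below is only an illustration of how such codes could be split into their six fields; the study does not describe any script for this step, and the names `TextCode`, `parse_text_code`, and `build_text_code` are hypothetical.

```python
from typing import NamedTuple


class TextCode(NamedTuple):
    """The six fields the text lists for each code (hypothetical container)."""
    setting: str          # e.g. "UCR"
    year: str             # year field, kept as written in the code
    native_language: str  # e.g. "SP"
    target_language: str  # e.g. "EN"
    proficiency: str      # e.g. "B1"
    participant: int      # participant number


def parse_text_code(code: str) -> TextCode:
    """Split a hyphen-separated text code into its six fields."""
    parts = code.split("-")
    if len(parts) != 6:
        raise ValueError(f"expected 6 hyphen-separated fields, got {len(parts)}: {code!r}")
    setting, year, l1, l2, level, number = parts
    return TextCode(setting, year, l1, l2, level, int(number))


def build_text_code(fields: TextCode) -> str:
    """Rebuild the hyphen-separated code from its fields."""
    return "-".join(str(value) for value in fields)


example = parse_text_code("UCR-20-SP-EN-B1-92")
print(example)
print(build_text_code(example))
```

Keeping the proficiency level inside the identifier means that texts can be grouped by level from the code alone, without consulting the learner profile sheets.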

Specifically, as in previous reports on college writing errors (^{Lunsford and Lunsford, 2008}), all errors present in the text were marked, meaning that error types “emerge(d) out of the data rather than being imposed on them prior to data collection and analysis” (^{Patton, 1990}, p. 306). Thus, after having traced all existing error types and confirmed acceptable interrater and intrarater reliability (see Cronbach's alpha values in Table 1)^⁽³⁾, thirty-three error types were identified, all of which belonged to either the grammatical (n = 17) or the non-grammatical (n = 16) error category. The latter was then further subdivided into stylistics (n = 10) and lexis (n = 6) for a more fine-grained analysis. Throughout, the reference manual was A Comprehensive Grammar of the English Language (^{Quirk et al., 1985}).

Table 1 Reliability (Cronbach's alpha) for interrater and intrarater consistency per error type

Index	Grammar	Stylistics	Lexis
Interrater	.88	.91	.85
Intrarater	.92	.96	.94
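For readers unfamiliar with the statistic reported in Table 1, the following minimal sketch shows the standard formula for Cronbach's alpha, treating each rater (or rating occasion) as an item and each text as a case. It is an assumption-laden illustration: the study does not describe how alpha was computed beyond the use of statistical software, and the ratings below are invented toy values, not data from the corpus.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of rows (one row per text,
    one column per rater or rating occasion)."""
    k = len(ratings[0])
    if k < 2 or len(ratings) < 2:
        raise ValueError("need at least two columns and two rows")
    # Sample variance of each column (each rater's scores across texts).
    columns = list(zip(*ratings))
    sum_item_variances = sum(variance(col) for col in columns)
    # Sample variance of the row totals (summed scores per text).
    total_variance = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum_item_variances / total_variance)


# Toy error counts per text from two hypothetical raters (invented values).
perfect_agreement = [(3, 3), (5, 5), (2, 2), (4, 4)]
partial_agreement = [(3, 4), (5, 5), (2, 2), (4, 3)]

print(cronbach_alpha(perfect_agreement))
print(cronbach_alpha(partial_agreement))
```

With identical ratings the coefficient reaches its maximum of 1, and it falls as the raters diverge; the function raises a `ZeroDivisionError` if all row totals are identical, since the total variance is then zero.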