Pace, Emily Adriana
Using Technology to Enhance Second Language Literacy
(Subtitle: Enabling Students of Arabic to Find..)
Faculty Mentor: R. Kirk Belnap, Department of Asian and Near Eastern Languages
This project begins a revision of RAFT (Readable Article Finding Tool), a previous research project of the BYU Arabic department. The tool aims to help second-language learners find suitable reading material: it takes a list of vocabulary known by the user and returns a set of articles that contain a high percentage of those known words.
Recognizing a high percentage of the vocabulary in a text aids readers’ comprehension and enjoyment. Finding authentic texts with a high readability index for students of Arabic who have a rather limited vocabulary is a time-consuming task. RAFT is a first step toward reducing that effort.
Literacy leads to greater fluency for learners of second languages. Reading is critical in building one’s vocabulary, which is an important step for L2 learners in achieving higher levels of fluency. ESL learners have the benefit of a generous supply of graded readers. However, there are no such resources for students of languages such as Arabic, Hebrew, Persian, and Turkish. Thus, finding authentic texts in these languages with a high readability index for readers with a rather limited vocabulary is a time-consuming yet highly valuable task.
RAFT allows registered users to identify the readability level of a particular text through the percentage of word forms known in the text. The texts must first be morphologically analyzed to identify the lemmas of the words in the text; then the readability level is computed against lists of vocabulary known to the learner or texts with which they are already familiar.
The focus over the past year has been to re-create RAFT in a programming language that allows for more user-friendly editing in the future, and to learn to run files through MADAMIRA to obtain parsed Arabic that can be compared to other files in a more complete way.
A trial version of RAFT was written in Python, which provides the foundation for an updated program. The Python script is packaged as a small executable program. After the program creates a directory of text files retrieved from the BBC Arabic website, the user supplies their vocabulary by browsing to a directory of text files that contain it, either as word lists or simply as previously mastered articles. The files need no preprocessing other than being UTF-8 encoded.
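The vocabulary-loading step described above might look something like the following minimal sketch. The function name `load_vocabulary`, the `.txt` extension, and the regular expression used to pick out Arabic words are assumptions made for illustration; they are not taken from the RAFT source.

```python
import re
from pathlib import Path

# Assumption: a "word" is a run of characters in the basic Arabic letter range.
# Diacritics and punctuation are treated as separators in this sketch.
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")


def load_vocabulary(directory):
    """Collect the distinct word forms found in every .txt file of a directory.

    Word lists and previously mastered articles are handled the same way:
    every Arabic word form that appears in a file counts as known.
    """
    vocabulary = set()
    for path in sorted(Path(directory).glob("*.txt")):
        # The only requirement placed on the files is UTF-8 encoding.
        text = path.read_text(encoding="utf-8")
        vocabulary.update(ARABIC_WORD.findall(text))
    return vocabulary
```

Because the files are only tokenized, a plain word list and a full article can sit side by side in the same directory.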
The program compares the vocabulary in the given directory to the content of the articles that it retrieved from the BBC Arabic website. In its current form the project compares only the exact forms of the words, as it has not yet been linked with the lemmatizer that would allow it to compare words in their base form. The program then opens a window stating the percentage of known words in each news article, along with a reference number that corresponds to the articles retrieved when the program is opened. This allows the user to easily select texts to read based on the percentage of known words.
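As an illustration of the exact-form comparison, the sketch below computes the percentage of known words for each article and ranks the articles by that percentage. It assumes the percentage is taken over running words (tokens) rather than distinct word forms, which the description above leaves open; the names `percent_known` and `rank_articles` are hypothetical.

```python
import re

# Assumption: words are runs of basic Arabic letters, as in a simple tokenizer.
ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")


def percent_known(article_text, known_words):
    """Return the percentage of running words whose exact form is known."""
    tokens = ARABIC_WORD.findall(article_text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return 100.0 * known / len(tokens)


def rank_articles(articles, known_words):
    """Score each article and sort from most to least readable.

    `articles` maps a reference number to the article text.
    """
    scores = [(ref, percent_known(text, known_words))
              for ref, text in articles.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Because only exact forms are matched, a learner who knows كتاب gets no credit for الكتاب; this is the gap that lemmatization is meant to close.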
The second part of this project involves working with the MADAMIRA files acquired from Columbia University. The program runs from a local server, parsing a preprocessed XML file that contains the necessary headers and the input text. A key element of the preprocessing is breaking the text into segments, which are used later when processing the output.
The MADAMIRA server processes the input file and produces an output XML file that describes each lemma in each segment, including linguistic information such as word form, gender, aspect, case, person, and an English translation.
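The segmentation step of the preprocessing could be sketched roughly as follows. The element names `input` and `segment`, the `SENT` identifiers, and the sentence-splitting rule are placeholders chosen for illustration only; the real input file must follow the schema and headers that MADAMIRA itself requires, which are not reproduced here.

```python
import re
import xml.etree.ElementTree as ET


def split_into_segments(text):
    """Split text after sentence-final punctuation (., !, ?, or Arabic ؟)."""
    parts = re.split(r"(?<=[.!?؟])\s+", text.strip())
    return [part for part in parts if part]


def build_input_xml(text):
    """Wrap each segment in a numbered element (hypothetical layout)."""
    root = ET.Element("input")
    for number, sentence in enumerate(split_into_segments(text), start=1):
        segment = ET.SubElement(root, "segment", id=f"SENT{number}")
        segment.text = sentence
    return ET.tostring(root, encoding="unicode")
```

Giving every segment a stable identifier is what lets the output be matched back to the original sentences later.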
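Reading lemmas out of such an output file might look like the sketch below. The XML layout in `SAMPLE_OUTPUT` is a simplified, hypothetical stand-in: the element and attribute names (`output`, `segment`, `word`, `form`, `lemma`, `gloss`) are assumptions and should be replaced with the names found in actual MADAMIRA output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified output; not MADAMIRA's actual schema.
SAMPLE_OUTPUT = """<output>
  <segment id="SENT1">
    <word form="الكتاب" lemma="كتاب" gloss="book"/>
    <word form="جديد" lemma="جديد" gloss="new"/>
  </segment>
  <segment id="SENT2">
    <word form="كتب" lemma="كتاب" gloss="books"/>
  </segment>
</output>"""


def lemmas_by_segment(xml_text):
    """Map each segment id to the list of lemmas it contains, in order."""
    root = ET.fromstring(xml_text)
    result = {}
    for segment in root.iter("segment"):
        result[segment.get("id")] = [word.get("lemma")
                                     for word in segment.iter("word")]
    return result
```

In this sample, the surface forms الكتاب and كتب both reduce to the lemma كتاب, which is the kind of match the exact-form comparison cannot make.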
The next step will be to link the RAFT program with the MADAMIRA parser. The largest challenge in this process will be the preprocessing of the input files in order to facilitate compatibility between RAFT and MADAMIRA.
When parsing the output, the program will need to determine how similar a word in the news article is to the words in the user’s vocabulary files, and whether that word should count toward the percentage of known words. Additionally, a basic list of words that the user is assumed to know, such as connectors, prepositions, pronouns, and articles, will be included in the program so that they are not counted as unknown when they are missing from the user’s vocabulary list.
The two programs will be merged in the form of a website, read-arabic.org, which will allow users to run the program without directly accessing or obtaining a copy of the MADAMIRA files, thus respecting the copyrights associated with the MADAMIRA program.
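One possible way to combine lemma matching with a built-in list of assumed-known words is sketched below. The contents of `ASSUMED_KNOWN` are a small illustrative sample, not the list the project will ship, and the function name `percent_known_lemmas` is hypothetical. Treating two words as matching when their lemmas are identical is also an assumption, since the similarity test is still to be decided.

```python
# Illustrative sample of function words a learner is assumed to know.
ASSUMED_KNOWN = frozenset({"في", "من", "على", "و", "هو", "هي"})


def percent_known_lemmas(article_lemmas, user_lemmas,
                         assumed_known=ASSUMED_KNOWN):
    """Return the percentage of article lemmas the user is taken to know.

    A lemma counts as known if it appears in the user's vocabulary or in
    the built-in list of assumed-known function words.
    """
    if not article_lemmas:
        return 0.0
    known = set(user_lemmas) | set(assumed_known)
    hits = sum(1 for lemma in article_lemmas if lemma in known)
    return 100.0 * hits / len(article_lemmas)
```

For example, with an article reduced to the lemmas كتاب, في, بيت, جديد and a user vocabulary containing only كتاب, two of the four lemmas count as known (one from the vocabulary, one from the built-in list), giving 50 percent.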
This readability tool currently considers only the percentage of vocabulary that the user has learned; it does not take into account other traditional readability measurements used in L2 acquisition, such as content, word frequency, and sentence complexity. Content is controlled in this prototype by retrieving articles only from the BBC Arabic news site, but future versions of the program could allow users to point the search at a wider set of websites, or even to draw on a stored set of articles containing material such as short stories, culturally historic narrations, or other literary materials.
Future applications of this project could facilitate the learning of Arabic as a second language by giving discouraged students a tool that offers them more control over their language acquisition. When used with news websites, it is also a powerful tool not only for maintaining and expanding vocabulary, but also for doing so in a context that is relevant to current events in the world. It can also create a collection of readers that students can use not only while learning the language, but also to maintain their language capabilities far past the beginning and intermediate levels, as the reading material does not stagnate.
While this program is specific to Arabic, it could be extended to any language for which a language parser and lemmatizer are available. This could provide a potentially endless supply of classroom material for students in virtually any language classroom.