Ross Hendrickson and Dr. Deryle Lonsdale, Linguistics Department
Elicited imitation (EI) was originally used to study L1 acquisition in children and has since been applied to L2 oral proficiency testing (Chaudron, 1994). The PSST research group I work with has been investigating EI as a viable method of language proficiency testing for several years. An EI test consists of a series of sentence-length items that a subject must repeat verbatim after hearing each one. The sentences must be carefully chosen for their lexical, morphological, syntactic, and semantic content, as well as for their length (usually counted in syllables). The primary impetus for this research project lies in EI test item creation and analysis.
Creating an automatic sentence analysis application to aid in the creation of EI test items was a challenging yet rewarding task. The final program is essentially an automatic corpus creation application with the specific goal of generating large numbers of candidate test items for EI exams. I combined several open-source, proprietary, and home-grown resources into a unified application that can take any English text, identify approximately 25 different features at the sentence level, and then store these “tagged” sentences in a sentence-level annotated corpus. Each sentence can then be retrieved based on its individual features. The desire to create the application came after spending countless hours analyzing previous test items by hand and creating new ones to match specific criteria. I have named the application “Sendo”.
Sendo is a Java-based GUI application. I used the Stanford parser and part-of-speech tagger for syntactic parsing. For morphological analysis I used a dictionary look-up method that extracts information from the CELEX 2.0 database. I created another look-up method to compute the average lexical frequency of a sentence, using the British National Corpus as its source for how frequent a particular word is. I also created a syllabification method that first checks the CELEX 2.0 database for a word and, if the word is not found, falls back on a heuristic I devised for estimating the number of syllables in the word. Each non-numerical feature is encoded as a complex parse-tree-based regular expression: a way of programmatically encoding a syntactic pattern so that parse trees matching that pattern can be found. Features were chosen for their relevance to a prior research paper (Hendrickson, 2008) and for the ease with which they could be translated into parse-tree-based regular expressions.
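The syllabification fallback can be illustrated with a minimal sketch. The class and method names here are illustrative, and the heuristic shown (counting vowel groups and discounting a word-final silent “e”) is a common simple approach, not necessarily the exact heuristic used in Sendo; the CELEX look-up step is omitted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SyllableCounter {
    // Maximal runs of vowel letters (including 'y') approximate syllable nuclei.
    private static final Pattern VOWEL_GROUP = Pattern.compile("[aeiouy]+");

    // Illustrative heuristic: count vowel groups, then discount a word-final
    // silent 'e' (but keep it in "-le" endings such as "syllable").
    public static int countSyllables(String word) {
        String w = word.toLowerCase();
        Matcher m = VOWEL_GROUP.matcher(w);
        int count = 0;
        while (m.find()) {
            count++;
        }
        if (w.endsWith("e") && !w.endsWith("le") && count > 1) {
            count--;
        }
        return Math.max(count, 1);
    }

    public static void main(String[] args) {
        System.out.println(countSyllables("parse"));       // 1
        System.out.println(countSyllables("linguistics")); // 3
    }
}
```

A heuristic like this is only an approximation (English orthography guarantees exceptions), which is exactly why a dictionary resource such as CELEX is consulted first.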
The mechanics described above are combined to discover specific sentence-level features. One example of a feature is whether the sentence contains the past tense; this is detected via a parse-tree-based regular expression that simply searches the parse tree for the part-of-speech tag VBD (verb, past tense) or VBN (verb, past participle). Another feature is a morphological complexity measure, derived by breaking each word into its discrete morphemes and calculating an assignable value. Yet another is the average frequency of the content words in the sentence. I built the application around a scalable feature engineering system so that I or my colleagues could easily add features identified as interesting in the future; one PSST colleague is currently extending the feature set. I designed the application to combine these mechanics and features to perform four specific tasks. First, I wanted users to be able to analyze a single sentence visually, so I created a graphical representation of the parse tree built from the sentence, accompanied by a detailed listing of the features found in it. Second, I wanted users to be able to feed the application papers, blogs, or essentially any collection of multiple sentences at once.
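The past-tense feature can be sketched in a few lines. This is not Sendo's actual pattern-matching engine; it is a minimal string-level illustration over Penn Treebank bracketing (the format produced by the Stanford parser), and the class name is hypothetical.

```java
import java.util.regex.Pattern;

public class PastTenseFeature {
    // Matches an opening bracket immediately followed by the VBD or VBN tag
    // in a bracketed parse string, e.g. "(VP (VBD ran) ...)".
    private static final Pattern PAST = Pattern.compile("\\((?:VBD|VBN)\\b");

    public static boolean hasPastTense(String parseTree) {
        return PAST.matcher(parseTree).find();
    }

    public static void main(String[] args) {
        System.out.println(hasPastTense("(S (NP (PRP He)) (VP (VBD ran)))"));  // true
        System.out.println(hasPastTense("(S (NP (PRP He)) (VP (VBZ runs)))")); // false
    }
}
```

Tag-level features like this one are the simplest case; patterns that depend on tree structure (e.g., a possessive inside a subject NP) require matching over the tree itself rather than the flat string.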
For this second task, I designed the application to perform sentence boundary disambiguation on the document provided and then automatically parse each sentence into the corpus. The third task was the ability to read an entire Penn Treebank-style corpus and run feature recognition across all the sentences encountered. This whole-corpus analysis allows tens of thousands of sentences to be analyzed, annotated appropriately, and inserted into the corpus of test items. That ability combines powerfully with the final task: sentence retrieval. The retrieval process allows the user to select features and set thresholds for numeric features (e.g., find items of 15 syllables or fewer that are in the past tense and contain a possessive). This deceptively simple capability allows us as researchers to assemble a 60-item test of natural-language items that adhere to the very tight constraints of EI exams in minutes, whereas it used to take weeks. The corpus also allows us to consider each sentence as a potential test item; with a corpus of thousands of candidate items we have begun exploring adaptive testing.
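The retrieval step amounts to filtering annotated sentences against feature thresholds. The sketch below shows the idea with an in-memory list and illustrative field names; Sendo's actual corpus storage and query mechanism are not reproduced here.

```java
import java.util.List;
import java.util.stream.Collectors;

public class SentenceRetrieval {
    // Minimal stand-in for an annotated corpus entry; field names are illustrative.
    static class Item {
        final String text;
        final int syllables;
        final boolean pastTense;
        final boolean possessive;

        Item(String text, int syllables, boolean pastTense, boolean possessive) {
            this.text = text;
            this.syllables = syllables;
            this.pastTense = pastTense;
            this.possessive = possessive;
        }
    }

    // Example query from the text: items of maxSyllables syllables or fewer
    // that are past tense and contain a possessive.
    static List<Item> query(List<Item> corpus, int maxSyllables) {
        return corpus.stream()
                .filter(i -> i.syllables <= maxSyllables && i.pastTense && i.possessive)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Item> corpus = List.of(
                new Item("The boy's dog barked all night.", 8, true, true),
                new Item("She runs every morning.", 6, false, false));
        System.out.println(query(corpus, 15).size()); // 1
    }
}
```

Because every sentence carries its full feature annotation, adding a new constraint to a query is just another predicate in the filter, which is what makes assembling a tightly constrained 60-item test fast.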
I presented Sendo as primary author, with Dr. Lonsdale as co-author, at a pre-CALICO workshop on automatic analysis of learner language in Phoenix on March 10, 2009. The experience was incredible: I was able to meet many respected researchers and get their feedback and suggestions about Sendo. My method of using parse-tree-based regular expressions was not unique at the conference, so I had the opportunity to discuss specific features and how to encode them with Xiao Fei Lu of the University of Pennsylvania. The experience in Phoenix played a large part in my decision to pursue graduate school in computational linguistics. Members of the PSST research group here on campus are currently using the application for both item analysis and item generation. A colleague and I are designing a test to be administered in December 2009, with all test items drawn from a corpus annotated by Sendo. I am also evaluating Sendo's accuracy by comparing human-annotated and computer-annotated data.
Sendo pushed my linguistic and programming abilities. One of the major challenges was devising a framework that would let me plug in information derived from different applications and save it all in a format that allowed for easy retrieval. My interactions with my mentor were an essential part of my success with Sendo: my mentor directed me toward appropriate resources and helped me resolve design concerns. Another member of the group is already building on Sendo, using its extensible framework to discover more features. Sendo will continue to be a focus of my efforts as I work to improve its accuracy, efficiency, and scope. I believe the experience of designing, developing, and presenting Sendo will be invaluable as I pursue higher degrees, and I am grateful for how the ORCA grant has empowered me.
Works Cited
- Chaudron, C. (1994). Elicited imitation as a measure of second language competence. Research Methodology in Second-Language Acquisition, 245–261.
- Hendrickson, R., Eckerson, M., Johnson, A., & McGhee, J. (2008). What makes an item difficult? A syntactic, lexical, and morphological study of elicited imitation test items. 2008 Second Language Research Forum.