Jonathan Dehdari and Dr. Deryle Lonsdale, Linguistics
The Persian language, or Farsi, stands in a unique position in the international scene for political, religious, and historical reasons. Farsi has over 60 million speakers in Iran, Afghanistan, and Tajikistan, and interest has been growing in the West in using current technologies to better manipulate information in Farsi. In order to utilize Farsi data effectively, computers must be able to understand its morphology, or the individual components of a word. This understanding of morphology is necessary because of the language’s productive usage of affixes, as well as the substantial interaction between its morphology and its phonology and orthography. I have been developing a morphology engine for Farsi that can both recognize and generate inflected Persian words.
This project has been designed from the outset to provide a set of tools to allow anyone to manipulate Farsi data for any purpose, including research, pedagogy, or integration with other linguistic software. A two-level approach allows maximum flexibility for this objective. A two-level morphology works by having two manifestations of any given word, a lexical and a surface manifestation. Rules define the relationship between the two.
For example, in English one might encounter in any text, such as a newspaper, the word typed. However, a simple dictionary lookup of such a word would not prove successful since the morpheme ed is attached to the end of the word type. Thus we could say that the lexical, or dictionary, manifestation of the word is type+ed while the surface, or written, manifestation is typed. The two-level approach would reconcile the two, describing them as type+ed : type00d, where the lexical morpheme boundary + is manifested on the surface as nothing, or 0. Another example is fly+0s : fli0es where environmental rules would reconcile y to i, + to 0, and 0 to e.
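The pairing described above can be sketched in a few lines of Python. This is only an illustration of the lexical : surface notation, not part of PC-Kimmo; the function names pair_up, changed_pairs, and strip_nulls are hypothetical, and the sketch assumes both forms have already been padded with 0 to the same length.

```python
def pair_up(lexical: str, surface: str) -> list[tuple[str, str]]:
    """Zip two equal-length forms into (lexical, surface) character pairs."""
    if len(lexical) != len(surface):
        raise ValueError("two-level forms must be padded with 0 to equal length")
    return list(zip(lexical, surface))


def changed_pairs(lexical: str, surface: str) -> list[str]:
    """List only the pairs whose lexical and surface characters differ."""
    return [f"{lex}:{surf}" for lex, surf in pair_up(lexical, surface) if lex != surf]


def strip_nulls(form: str) -> str:
    """Drop the null symbol 0 to obtain the string as it is actually written."""
    return form.replace("0", "")


print(changed_pairs("type+ed", "type00d"))  # ['+:0', 'e:0']
print(changed_pairs("fly+0s", "fli0es"))    # ['y:i', '+:0', '0:e']
print(strip_nulls("fli0es"))                # flies
```

Every pair that is not listed is an identity pair such as t:t, so the rules only have to license the few pairs that differ.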
The Persian implementation works this same way, using Farsi morphemes and morphophonemic rules. An actual, voweled word like mifahmam, ‘I understand’, would be recognized as mi+fahmi+am, where the surface morpheme fahm corresponds to the lexical entry fahmi. Likewise na+mi+xor+am would generate nemixoram, with na manifesting itself on the surface as ne in this environment.
Note that despite the fact that some morphemes manifest themselves differently on the surface (e.g. na : ne), the lexical entries in no way transform into surface entries. That is, the lexical form is always present, just manifested differently as defined by a corresponding rule. This is important because no information gets lost through multiple changes of a word. In practice, this means that applications that use this project, such as machine translation, could potentially translate in both directions.
The two-level morphological framework I used was PC-Kimmo, which uses finite-state automata to analyze and generate words based on a lexicon and a rules file. A rule that defines a particular morphophonemic relationship is written for PC-Kimmo and can be compiled into a finite-state table using K-Gen. One such rule is ‘e:o => # b __ +:0 C o’, which states that the imperative prefix be may be pronounced bo, but only when the following syllable nucleus is o.
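To make the rule above concrete, here is a minimal Python sketch that checks its context restriction directly on an aligned pair of forms. It is not how PC-Kimmo works internally (PC-Kimmo compiles rules into finite-state tables), and it rests on assumptions: a bare symbol such as b or o in the rule is read as the identity pair b:b or o:o, C is taken to be any letter outside a hypothetical vowel set, and the forms be+xor and be+bin are illustrative inputs rather than entries taken from the lexicon files.

```python
VOWELS = set("aeiou")  # assumed vowel symbols; the real rules file defines its own sets


def is_consonant_pair(pair: tuple[str, str]) -> bool:
    """True for an identity pair C:C whose symbol is a non-vowel letter."""
    lex, surf = pair
    return lex == surf and lex.isalpha() and lex not in VOWELS


def obeys_be_bo_rule(lexical: str, surface: str) -> bool:
    """Check that every e:o pair occurs only in the context # b __ +:0 C o."""
    if len(lexical) != len(surface):
        raise ValueError("forms must be padded with 0 to equal length")
    pairs = list(zip(lexical, surface))
    for i, pair in enumerate(pairs):
        if pair != ("e", "o"):
            continue
        # Left context: word boundary, then b:b, immediately before the pair.
        left_ok = i == 1 and pairs[0] == ("b", "b")
        # Right context: +:0, a consonant, then o:o.
        right = pairs[i + 1 : i + 4]
        right_ok = (
            len(right) == 3
            and right[0] == ("+", "0")
            and is_consonant_pair(right[1])
            and right[2] == ("o", "o")
        )
        if not (left_ok and right_ok):
            return False
    return True


print(obeys_be_bo_rule("be+xor", "bo0xor"))  # True: e:o is in the licensed context
print(obeys_be_bo_rule("be+xor", "be0xor"))  # True: the rule permits e:o but does not force it
print(obeys_be_bo_rule("be+bin", "bo0bin"))  # False: the next nucleus is i, not o
```

The second call shows why the rule is a restriction rather than a requirement: the => operator only says where e:o may appear, so a form that keeps e:e in the same context is still accepted.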
While the rules file is used for both recognizing and generating, the lexicon files are used only for recognizing. Each lexical entry includes a lexeme, a part of speech marker, a morphophonemic alternation (that which is allowed to follow the lexeme), and optional glosses. I used a Romanized alphabet for the Farsi lexicon files due to the lack of full Unicode support in PC-Kimmo, Perl 5.6, and several shell tools. However, a strict 1:1 correspondence to the orthography of Farsi allows easy transfer between the two alphabets.
To date there are 1120 fully voweled lexical entries with English glosses and languages of origin for foreign words. Most of the morphophonemic rules have been written, but refinement of these rules and of the alternations still needs to be addressed.
I put this project to the test using two written corpora: the June 12, 2002 edition of the Kayhan online newspaper consisting of 16404 tokens, and the book “Jurassic Park” in Farsi consisting of 21184 tokens. Because both of these corpora are written in native Persian orthography, I first wrote a Perl script to convert the Farsi writing to Romanized format. Then I wrote a Perl script to remove the short vowels from the lexicon files so that PC-Kimmo could recognize words from a written corpus, where short vowels are normally left unwritten. This is currently the default arrangement for the files available to download.1
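The vowel-stripping step can be illustrated with a short Python sketch. The original scripts were written in Perl and are not reproduced here; this version assumes, hypothetically, that the short vowels are Romanized as a, e, and o, and it ignores any positions where the real orthography does write a vowel letter.

```python
SHORT_VOWELS = "aeo"  # assumption: the project's Romanization may use other symbols

# Translation table that deletes every assumed short-vowel symbol.
_DELETE_SHORT_VOWELS = str.maketrans("", "", SHORT_VOWELS)


def strip_short_vowels(romanized: str) -> str:
    """Remove the assumed short-vowel symbols from a Romanized form."""
    return romanized.translate(_DELETE_SHORT_VOWELS)


print(strip_short_vowels("mifahmam"))  # mifhmm
print(strip_short_vowels("fahmi"))     # fhmi
```

Morpheme boundaries and long vowels pass through unchanged, so the same function could be applied to whole lexicon lines as long as the other fields are handled separately.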
How does this project stand up in real world usage? The first six chapters of “Jurassic Park” achieved 72% recognition after testing with the corpus for several weeks. Chapters 7-15 achieved 64% recognition with no prior testing of the corpus. The Kayhan corpus reached 60% recognition with one previous test run.
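A recognition figure of this kind can be computed as the share of corpus tokens for which the recognizer returns at least one analysis. The Python sketch below assumes that per-token definition, which the report does not spell out, and uses a hypothetical recognizes callback with placeholder tokens in place of a real PC-Kimmo run.

```python
from typing import Callable, Iterable


def recognition_rate(tokens: Iterable[str], recognizes: Callable[[str], bool]) -> float:
    """Fraction of tokens for which the recognizer reports success."""
    token_list = list(tokens)
    if not token_list:
        return 0.0
    hits = sum(1 for token in token_list if recognizes(token))
    return hits / len(token_list)


# Placeholder data: a set lookup stands in for the morphological recognizer.
known_forms = {"t1", "t2", "t3"}
corpus = ["t1", "t2", "t3", "t4"]
print(f"{recognition_rate(corpus, known_forms.__contains__):.0%}")  # 75%
```

Counting distinct word types instead of running tokens would give a different figure, so the choice of unit matters when comparing results across corpora.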
While much remains to be done towards a flawless two-level morphological engine for Persian, the existing system can prove beneficial for many different applications. PC-Kimmo was designed to be easily integrated into other software. Likewise the scripts and other tools that I have written were designed with this same goal in mind. This project can be used for Persian pedagogy, integration with search engines, corpus tools, online dictionaries, speech recognizers and generators, and machine translation programs. It is my sincere hope that this research will help open doors of opportunity and understanding for the linguistic community, as well as Farsi speakers.2
___________________________________________
1 http://home.byu.net/~jmd56
2 I would like to thank Deryle Lonsdale and the BYU Linguistics Department for their much valued support.