Jonathan Dehdari and Dr. Deryle Lonsdale, Linguistics
Soar is a general cognitive architecture for developing systems that exhibit intelligent behavior. Researchers all over the world, both from the fields of artificial intelligence and cognitive science, are using Soar for a variety of tasks. Psychologists, AI game developers, linguists, and others interested in cognitive modeling use and develop this open architecture. LG-Soar is a project within the general Soar framework that deals with the processing and modeling of natural languages, using the Link Grammar syntactic parser. LG-Soar has been excellent in its ability to accurately and robustly parse English sentences, but the ability to handle morphology, or the individual pieces of a word, had not been adequately addressed. Workarounds had been pieced together for LG-Soar’s handling of English, but a major limitation existed for the internationalization of Soar without a morphological parser. My research addressed this issue by developing an interface to integrate a morphological parser into LG-Soar.
The morphological parser uses two-level finite state automata to determine correct boundaries between parts of a word and accordingly divides these morphemes. These morphemes and their boundaries are then transfered to the link grammar parser for syntactic analysis. Morpheme boundaries, ‘+’, are of little use to the syntactic parser, so they become word boundaries, which are white-spaces under the current implementation.
The PC-Kimmo morphological engine currently does not handle non-Romanized alphabets, so one of the first hurdles was to devise a direct correspondence between the native Arabic alphabet and a Romanized one. The test corpora was encoded in various character sets, including ISIRI, ISO, and Unicode. I wrote Perl scripts to translate among the alphabets and character sets.
Due to the the relatively light use of morphology in the English language, I developed a Persian (Farsi) link grammar in 2003 and a two level morphology engine in 2002. Since this was the first fully developed link grammar of a language other than English, certain challenges had to be addressed along the way. Romanization and morpheme boundaries had already been worked out, but the question of directionality remained. Link grammar syntax has no hierarchy, but simply links between categories of words. Thus in English a verb looks to its left for a subject, and looks to its right for an optional object. Conversely a noun looks to its right to complete the subject link, or to its left to complete the object link. But Persian syntax is ‘Subject Object Verb’ (SOV), so the verb looks to the same direction for both subject and object links. Likewise subjects and objects look to the right to form their respective links. The challenge came with how the verb would know which link to make with a given noun. I subsequently discovered that distance can be used in forming links, in addition to direction. So the verb could look for a noun to its left to form a subject link, and then look for another optional noun even closer to the verb to form an object link. While oversimplified, this description adequately explains the general concept.
Future work remains for a more expanded lexicon, in addition to better morphological and syntactic coverage. The character set converter may also be integrated, removing the need for pre-parsing. Structures for semantic and discourse representation may be added in future releases as well. This would allow full integration from the morphophonemic level to the discourse level.
This project has helped the Artificial Intelligence research community expand beyond English in a general sense with the integration of the two-level engine, as well as having a specific language implementation to evaluate the system. The project has also helped the author better understand two-level morphology and link grammar syntax.