Japanese Sentence Parsing in an Artificial Intelligence System

Nathan Glenn and Dr. Deryle Lonsdale, Linguistics

Abstract

In this project, we created a Japanese sentence parser that functions within a cognitive modeling architecture known as Soar. It was created by modifying an existing English parser called XNL-Soar, which implements minimalist principles in syntactic parsing. Japanese lexical access is performed via GoSen and a Java interface to the Japanese WordNet, while syntactic and semantic processing is done within Soar. The ease with which the parser was modified to process Japanese input is encouraging and lends support both to its use as a model of human sentence processing and to the Minimalist Program in linguistics. The system is in its beta stage and we plan to continue development.

Background

Soar is a problem-solving architecture meant to model all aspects of human cognition (Newell, 1990). Given a set of rules of behavior, Soar goes about solving a given problem in a way similar to humans. This allows researchers to test a theory of behavior by specifying it within Soar and then observing Soar’s operation. XNL-Soar is a language processing module implemented in Soar, and it is intended to model sentence processing in humans (Lonsdale, 2003). It employs principles of Minimalism (Chomsky, 1995) in syntactic processing and Lexical Conceptual Structure (Jackendoff, 1992) in semantic processing, and it accesses WordNet (Fellbaum, 1998) for lexical information. XNL-Soar has theoretical and practical uses: it allows empirical testing of the Minimalist Program and various ideas about human sentence processing, and it can provide a human-computer interface for any application.

Project Description

In this project, we created JXNL-Soar, a version of XNL-Soar that parses Japanese input. Japanese syntax differs significantly from that of English, so the project provided an opportunity to experiment with the various parameters used in Minimalist theory and implemented in XNL-Soar. The project sought to answer two basic questions: how well does current Minimalist theory account for Japanese syntax (and for languages in general), and what are the unique architectural characteristics of Japanese sentence processing?

The project consisted of two main parts: lexical access and syntactic processing. (J)XNL-Soar does not attempt to model lexical access, which is considered to be its own process; therefore, while syntactic processing is done within Soar, lexical access is done in a separate application.

Lexical access involves two processes: parsing the input into separate words (tokenizing), and retrieving information about each word. Most of this work was done with a tool called GoSen (Francis, 2010). When a Japanese sentence is entered into the system, it is first passed to GoSen, which tokenizes it and provides information such as part of speech, conjugation, and lemma (the dictionary form of the word). During later processing, JXNL-Soar queries another tool, the Japanese WordNet, a large electronic dictionary in which word information is organized by meaning (sense) (Isahara et al., 2008). The Japanese WordNet provides sense IDs that can be matched with senses in the English WordNet; this information will be used in the future for modeling translation. It also provides semantic classes (animal, artifact, body verb, etc.) that can be used to create semantic sentence models.
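The two-step lexical access described above (tokenize, then retrieve information for each word) can be sketched as follows. This is only an illustration under stated assumptions: it does not call GoSen or the Japanese WordNet, the names Token, TOY_LEXICON, tokenize and lexical_access are hypothetical, the lexicon entries are written by hand, and the input is romanized with spaces and hyphens, whereas real Japanese text has no such delimiters.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Token:
    """Information attached to one word (hypothetical record layout)."""
    surface: str  # form as it appears in the input
    pos: str      # part of speech
    lemma: str    # dictionary form


# Hand-written toy lexicon; NOT output from GoSen or the Japanese WordNet.
TOY_LEXICON: Dict[str, Token] = {
    "chiisai": Token("chiisai", "adjective", "chiisai"),
    "inu": Token("inu", "noun", "inu"),
    "ga": Token("ga", "particle", "ga"),
    "tabemono": Token("tabemono", "noun", "tabemono"),
    "o": Token("o", "particle", "o"),
    "taberareteinakatta": Token("taberareteinakatta", "verb", "taberu"),
}


def tokenize(sentence: str) -> List[str]:
    """Step 1: split romanized input on spaces and hyphens (a stand-in for a real tokenizer)."""
    return [piece for word in sentence.split() for piece in word.split("-") if piece]


def lexical_access(sentence: str) -> List[Token]:
    """Step 2: look up each token in the toy lexicon."""
    return [TOY_LEXICON[surface] for surface in tokenize(sentence)]
```

Keeping lexical access in its own function mirrors the separation described above, in which lexical access runs outside the syntactic processor and only hands it finished word records.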

Syntactic processing is modeled within Soar, which uses reasoning and various sources of knowledge to build a complete sentence model in a manner similar to human sentence processing. Soar is based on the hypothesis that purposeful behavior can be represented as operator selection and application to a state (a state is a representation of the current problem-solving situation; an operator is an action taken to achieve a goal). The operators in (J)XNL-Soar which build linguistic structure are:

• sentence2inputlink: “hears” one word from the sentence

• getword: accesses part of speech, lemma and sense information

• graft: connects syntactic structures (also known as “merge”)

• adjoin: extends and connects linguistic structures; used for modifiers

• project: creates hierarchical structure as stipulated by the hierarchy of projections

These operations are posited to be used cross-linguistically, and both the English and Japanese versions of XNL-Soar use them.
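The idea of behavior as operator selection and application to a state can be illustrated with a very small sketch. It is not Soar code and does not reproduce Soar's actual decision procedure: the State fields, the Operator class, the run loop and the rule that the first proposed operator wins are all simplifying assumptions. Only the operator names sentence2inputlink and getword come from the list above, and the structure-building operators are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class State:
    """Toy problem-solving state (hypothetical layout)."""
    remaining: List[str]                               # words not yet heard
    heard: Optional[str] = None                        # word currently on the input link
    lexical: List[str] = field(default_factory=list)   # words whose lexical info was accessed


@dataclass
class Operator:
    name: str
    proposes: Callable[[State], bool]   # is the operator applicable in this state?
    apply: Callable[[State], None]      # change the state


def hear(state: State) -> None:
    state.heard = state.remaining.pop(0)


def getword(state: State) -> None:
    state.lexical.append(state.heard)
    state.heard = None


# Listed in priority order: an assumption standing in for operator preferences.
OPERATORS = [
    Operator("getword", lambda s: s.heard is not None, getword),
    Operator("sentence2inputlink", lambda s: s.heard is None and bool(s.remaining), hear),
]


def run(state: State) -> List[str]:
    """Repeat propose, select, apply until no operator is proposed; return the trace."""
    trace = []
    while True:
        proposed = [op for op in OPERATORS if op.proposes(state)]
        if not proposed:
            return trace
        selected = proposed[0]
        selected.apply(state)
        trace.append(selected.name)
```

Running the loop on a state holding two unheard words alternates between the two operators until the input is exhausted.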

Though the system is only in its beta stage and is not yet fully developed, the ease with which it was modified to account for a certain level of Japanese syntax is a promising finding for research in human sentence processing. Three types of changes were made:

1. Syntax-local lexemes. Pronouns, determiners and prepositions for Japanese needed to be coded directly into the system; these are theorized to exist within the syntactic processor instead of in the lexicon.

2. Left-branching. Japanese complements come to the left of the head, instead of to the right as in English. This required a minor change to the operation of one of the graft operators. This difference is well known and supported by psycholinguistic experimentation.

3. Bottom-up processing, to accommodate the head-final nature of Japanese. The following example illustrates this third change:

• English: The man who is being eaten (DET noun CP PROG PASS verb)

• Japanese: (verb PASS PROG [null CP] noun)

Because English is head-initial, the head of the CP, “who”, is processed first, followed by the progressive auxiliary, the passive auxiliary, and the verb. In Japanese, the order is reversed: first the verb, then the passive auxiliary, then the progressive, a null complementizer, and finally “man”. In English, once a progressive auxiliary is encountered, we know that there is no chance of encountering a perfective auxiliary, modal verb or higher clause later in the sentence; in Japanese, there is no such guarantee. Therefore, while in English parsing a progressive auxiliary can project all the way up to (existent or non-existent) TP and CP, in Japanese projection must wait until later words have been processed. In XNL-Soar, this is simply a change of operator preferences, which determine the priority of operator execution. To parse Japanese instead of English, getword had to be given priority over hop when licensed by morphology (non-final forms signal that more morphemes will follow).
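The effect of this preference change on the order of operations can be sketched with a toy trace generator. This is an illustration under assumptions, not the XNL-Soar rule set: the function operator_trace and its flag getword_beats_hop are hypothetical, each morpheme is reduced to a label plus a hand-set flag saying whether its form is non-final, and hop is treated as a single projection step per morpheme. The labels follow the gloss eat-PASS-PROG-NEG-PAST.

```python
from typing import List, Tuple


def operator_trace(morphemes: List[Tuple[str, bool]], getword_beats_hop: bool) -> List[str]:
    """Return the order in which getword and hop would fire in this toy model.

    morphemes: (label, is_nonfinal_form) pairs in input order.
    getword_beats_hop: if True, a non-final form makes getword win over hop,
    so projection is postponed until a final form has been read.
    """
    trace: List[str] = []
    pending: List[str] = []  # morphemes read but not yet projected
    for label, nonfinal in morphemes:
        trace.append(f"getword({label})")
        pending.append(label)
        if getword_beats_hop and nonfinal:
            continue  # more morphemes are expected: read on before projecting
        trace.extend(f"hop({p})" for p in pending)
        pending.clear()
    # Project anything still waiting once the input ends.
    trace.extend(f"hop({p})" for p in pending)
    return trace


VERB_COMPLEX = [("eat", True), ("PASS", True), ("PROG", True), ("NEG", True), ("PAST", False)]

japanese_like = operator_trace(VERB_COMPLEX, getword_beats_hop=True)
english_like = operator_trace(VERB_COMPLEX, getword_beats_hop=False)
```

With the preference switched on, all five getword steps come before any hop step; with it switched off, each getword is followed immediately by its hop. The only difference between the two runs is the preference flag, which is the sense in which the change is one of operator preferences rather than of the operators themselves.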

Theoretical Implications and Future Work

Mazuka (1998) argues for a similar method to efficiently parse Japanese in a cognitively plausible manner. She argues that there exists a parameter in Universal Grammar which causes parsing to be done in a top-down or bottom-up manner, and that this accounts for efficiency in human parsing across left-branching and right-branching languages. She presents data from both child and adult speakers and shows that this hypothesis correctly predicts the parsing difficulty of various types of sentences. So far, our work in modeling human sentence processing in Japanese and English is consistent with her account.

Richard Lewis (1993) also listed types of unproblematic ambiguities in Japanese and applied the theory of NL-Soar (XNL-Soar’s GB-based predecessor) to them, using his assigners-receivers set to explain the data. However, he did not do any empirical testing of his theory. Now that we have a working Japanese parser available, we would like to further test Lewis’ hypotheses about Soar and Mazuka’s hypotheses about Japanese sentence processing. In the future, as we continue to develop JXNL-Soar, we plan to test its behavior when given both garden-path sentences and ambiguities which are not problematic for humans.

Example Walkthrough

Below is an outline of JXNL-Soar’s operation while parsing the following sentence:

chiisai inu-ga tabemono-o taberareteinakatta

small dog-NOM food-ACC eat-PASS-PROG-NEG-PAST

(“The small dog was not having its food be eaten”)

Each line represents one operator. The resulting syntactic tree is shown in figure 1.

References

Chomsky, N. (1995). The Minimalist Program. MIT Press.
Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press.
Francis, M. J. (2010). GoSen. Retrieved from litadaki.org/wiki/index.php/GoSen.
Isahara, H., Bond, F., Uchimoto, K., Utiyama, M., & Kanzaki, K. (2008). Development of the Japanese WordNet. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, & D. Tapias (Eds.), Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). Marrakech, Morocco: European Language Resources Association (ELRA).
Jackendoff, R. (1992). Semantic Structures. MIT Press.
Lewis, R. (1993). An Architecturally-based Theory of Human Sentence Comprehension. Carnegie Mellon University.
Lonsdale, D. (2003, June). Progress on NL-Soar, and Introducing XNL-Soar. University of Michigan.
Mazuka, R. (1998). The development of language processing strategies: a cross-linguistic study between Japanese and English. Lawrence Erlbaum Associates.
Newell, A. (1990). Unified Theories of Cognition. Harvard University Press.

Brigham Young University

Journal of Undergraduate Research