Anton Rytting and Dr. Deryle Lonsdale, Linguistics
Soar1 is a unified theory of human cognition implemented as a computer program. It posits that the interaction of a few fundamental mechanisms can account for any aspect of cognition. The theory’s computer implementation allows the creation of testable computer simulations of various cognitive tasks.
Natural-Language Soar (NL-Soar) is a specialization of this theory designed to simulate certain aspects of human language comprehension and production. In addition to the assumptions of the Soar paradigm, the NL-Soar model is constrained by several human language-processing limits discovered through psycholinguistic experiments. Its goal is to explain as many aspects of language as possible in the context of general cognition. This will eventually allow integration of natural language in more general models of human behavior within the Soar paradigm.
Although NL-Soar contains the infrastructure for a principled lexicon, a robust large-scale lexicon was never implemented. The first goal of this project is to address that gap by linking NL-Soar with a preexisting lexicon that meets these criteria. WordNet,2 perhaps the largest freely available lexicon built on psycholinguistic principles, was chosen to serve as the lexicon within NL-Soar. WordNet defines over 91,000 concepts in American English and links each to the set of synonyms that express it. It also lists, in order of frequency, all the meanings or senses of each polysemous word in its word list. In addition, it links all these words and concepts into semantic structures such as hierarchical ontologies and antonym pairs.
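The two WordNet properties this project relies on can be pictured with a toy lexicon. This is only an illustrative sketch, not NL-Soar's or WordNet's actual data structures; the words and sense labels are invented:

```python
# Toy lexicon illustrating two WordNet properties used in this project:
# each word maps to its senses in descending order of frequency, and each
# sense names the concept it expresses. All entries here are invented.
LEXICON = {
    "bank": ["financial_institution", "river_edge", "tier_of_objects"],
    "run": ["move_fast", "operate", "flow"],
}

def senses(word):
    """Return a word's senses, most frequent first (empty if unknown)."""
    return LEXICON.get(word, [])

print(senses("bank")[0])  # the most frequent sense is considered first
```

The frequency ordering matters later: when NL-Soar must choose among a word's senses, it tries them in exactly this order.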
In such a complete lexicon as WordNet, nearly every word is polysemous. In order to properly understand and model the semantics of a sentence, the correct meaning of each word must be distinguished. Hence, the second goal of this project is to decide which meaning of a given word is most appropriate in the context of the sentence in which it appears. This task, commonly called word sense disambiguation (WSD), is considered a difficult task within the natural language processing community, particularly when distinguishing between fine shades of meaning like those found in WordNet’s senses. Furthermore, psycholinguistic data suggest that WordNet’s senses may actually be more fine-grained than the distinctions in most people’s mental lexicon.3
In accordance with these data, I have chosen to model a coarse-grained disambiguation task, here called semantic class disambiguation (SCD), using 45 general semantic classes taken from natural divisions in the WordNet database — 26 for nouns, 15 for verbs, and four for other parts of speech. SCD of nouns and verbs has been modeled using a few key aspects of the linguistic context (morphology, syntax, and word-class-based semantics) within the bounds of one sentence. Although other contextual clues (prosody, word stress, discourse, and extra-sentential context, to name a few) are also important factors, they are outside the scope of this project.
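The core idea of SCD can be sketched as collapsing fine-grained senses onto the coarse class inventory. The class names below follow WordNet's lexicographer-file conventions (e.g., noun.person, verb.body), but the word-sense entries themselves are invented for illustration:

```python
# Sketch of the SCD idea: many fine-grained senses collapse onto a small
# inventory of semantic classes, so the disambiguator only has to choose
# a class, not an exact sense. Class names follow WordNet's lexicographer
# files; the (word, sense) entries are invented examples.
SENSE_CLASS = {
    ("nurse", "health_worker"): "noun.person",
    ("nurse", "breastfeed"): "verb.body",
    ("board", "plank"): "noun.artifact",
    ("board", "committee"): "noun.group",
}

def candidate_classes(word):
    """Distinct semantic classes a word may fall into, in entry order."""
    out = []
    for (w, _sense), cls in SENSE_CLASS.items():
        if w == word and cls not in out:
            out.append(cls)
    return out

print(candidate_classes("board"))  # SCD chooses between these two classes
```

With 45 classes instead of tens of thousands of senses, the search space for each ambiguous word shrinks dramatically.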
WordNet provides a wealth of useful data for the SCD task, including baseline frequency rankings for each word sense. WordNet lacks direct collocational data to show which semantic-class pairings make sense and which do not, but a portion of the Brown corpus4 annotated for word senses provides a data source for discovering likely pairings of semantic classes within specific semantic relationships (subject-verb, verb-object, etc.). Five of the fifteen verb classes were sampled to find common noun-class pairings for external (subject) and internal (direct object) relationships. Pairings that accounted for more than 5% of the total sample were considered canonical pairings and were preferred over other pairings through semantic collocation constraints.
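The extraction of canonical pairings can be sketched as a simple frequency cutoff. The sample below is invented; the real counts would come from the sense-tagged portion of the Brown corpus:

```python
from collections import Counter

# Sketch of deriving "canonical pairings": count (verb-class, noun-class)
# co-occurrences in a sense-tagged sample and keep those that exceed 5%
# of the total. The observations below are invented stand-ins for counts
# drawn from the sense-tagged Brown corpus.
def canonical_pairings(observations, threshold=0.05):
    counts = Counter(observations)
    total = sum(counts.values())
    return {pair for pair, n in counts.items() if n / total > threshold}

sample = (
    [("verb.body", "noun.person")] * 40      # 80% of the sample
    + [("verb.body", "noun.food")] * 8       # 16%
    + [("verb.body", "noun.artifact")] * 2   # 4% -- below the cutoff
)
print(canonical_pairings(sample))
```

Only the pairings surviving the cutoff act as semantic collocation constraints during parsing; rarer pairings are not ruled out in principle, merely dispreferred.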
When NL-Soar “hears” a sentence, it receives each word one at a time, just as a person would. For each verb-noun pair in the sentence, the most likely grammatical relationship between them is determined, and an appropriate semantic relationship is then chosen. All combinations of possible semantic-class pairings are considered; those semantic classes corresponding to the most frequent word senses are tried first. The first canonical pairing (i.e., the first pairing to pass the semantic collocation constraints) is accepted and fitted into the sentence’s semantic model.
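The selection step described above amounts to a nested search over candidate classes in frequency order. This is a minimal sketch, not NL-Soar's implementation; note that the outer loop privileges one word's frequency ordering over the other's, which is one of several possible orderings:

```python
# Sketch of the selection step: candidate classes for the verb and the
# noun are tried in order of word-sense frequency, and the first pairing
# that passes the collocation constraint (i.e., appears in the canonical
# set) is accepted. All class names are illustrative.
def first_canonical(verb_classes, noun_classes, canonical):
    """Both argument lists are ordered most-frequent-sense first."""
    for vc in verb_classes:          # outer loop privileges the verb
        for nc in noun_classes:
            if (vc, nc) in canonical:
                return (vc, nc)
    return None  # no pairing passed the constraints

canonical = {("verb.body", "noun.person")}
# The verb's most frequent class (verb.contact) yields no canonical
# pairing, so the search falls through to its verb.body sense:
print(first_canonical(["verb.contact", "verb.body"],
                      ["noun.person", "noun.plant"],
                      canonical))
```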
In order to test this method of building a semantic model, ten sentences containing verbs from the semantic class v-body were collected from the word-sense-tagged Brown corpus. These sentences were simplified syntactically, and all pronouns were replaced with nouns from the appropriate semantic class. NL-Soar was then given these ten simplified sentences, and its predictions were compared with the original word-sense tags. This preliminary testing resulted in exact matches for three of the ten sentences and plausible answers for two more. NL-Soar gave implausible responses for two sentences and failed to process the remaining three for reasons unrelated to semantic processing, mostly imperfections in the syntax module. These results are inconclusive; further testing is necessary to determine the effectiveness of this approach once semantic collocation data have been collected from all fifteen verb classes and interference from the syntactic module is eliminated.
Meanwhile, NL-Soar’s model of SCD raises an intriguing question for psycholinguistic study. As mentioned above, the semantic classes for each of the two words in a pair are tested in order of word-sense frequency. But for which of the two words does word-sense frequency matter more? The semantic model currently privileges the word which appears first in the sentence, but this assumption is one of several possibilities whose validity should be tested empirically. This question — the role of word order and word-sense frequency in sentences with multiple polysemous words — may be a promising topic for further study in my upcoming psycholinguistics course at The Ohio State University.
References
1. Newell, A. (1990). Unified theories of cognition. Cambridge, MA: Harvard University Press.
2. Miller, G. A. (1990). WordNet: An on-line lexical database. International Journal of Lexicography, 3(4). (Special issue).
3. Williams, J. N. (1992). Processing polysemous words in context: Evidence for interrelated meanings. Journal of Psycholinguistic Research, 21, 193-218.
4. Francis, W. N., & Kučera, H. (1982). Frequency analysis of English usage: Lexicon and grammar. Boston: Houghton Mifflin.