Steve Hulet and Dr. Sean Warnick, Computer Science
Despite the ever-increasing abilities of computers, natural language analysis is still a challenge. The intricacies of natural language are far too many to enumerate, giving rise to automated algorithms that learn how the language is used from large text corpora. Many current methods use complex statistical approaches involving multi-dimensional vectors and factor analysis, yet fail to take into account valuable contextual information such as word order. This project continued to develop a novel, simple, unsupervised learning approach for determining word similarity based on context in recurring word sequences; previous work confirms that similar words are used in similar contexts. The algorithm reads raw, untagged text, recording repeated word-use patterns and their frequencies. These patterns preserve the context of words by remembering where each word appears in relation to those surrounding it. By examining the context of a target word, other words or phrases with similar meaning can be found within similar contexts.
Our learning model was developed in cooperation with Sandia National Laboratories. The underlying algorithm is known as S-learning; applied to natural language processing, the method is known as Synonymity of Memorized Patterns (SMP). While our implementation of the process was complete and preliminary results had been obtained, we lacked formal methods to compare the performance of our learner with previously existing learners. Developing such methods was the goal of this project. We proposed to develop a mathematical formulation of SMP to allow for easier critique, comprehension, and comparison of the algorithm. Our plans to compare SMP head-to-head with Latent Semantic Analysis (LSA), a well-known statistically based word-similarity prediction algorithm, were foiled when the BYU Computer Science Department removed critical software from their lab machines halfway through the development process.
We formulated the synonym detection problem thus: given a large unparsed text corpus and a target word, automatically find synonyms of the target word and order them according to relevancy. The concept of word similarity is somewhat subjective, but testing methods do exist, and in this application intuition is often a reasonable guide.
We model natural language as a Markov process. Each state S1, S2, …, Sn represents a unique sequence of k words; thus for a language of w words there are w^k possible states. Each edge from Si to Sj has transition probability pi(j) = P(Sj | Si) and generates the word wj when taken. The sequence of words a state represents serves to remember the k words most recently seen prior to arriving at that state. With such a model it is possible to calculate the probability that any given sequence of words will appear. To do this we find the path following the edges which generate the desired word sequence and multiply the transition probabilities. For the string wi, wi+1, . . . , wi+n, we first find the state Sj which represents the first k words of the string. We then follow the edges which generate the remaining words in the string, multiplying the transition probabilities, until we end in the state representing the last k words of the string. A string of n words is represented by n − k + 1 states. Thus the probability of any arbitrary path [Sj, Sj+1, . . . , Sj+(n−k)] can be expressed as the product of the transition probabilities along it, P(Sj+1 | Sj) · P(Sj+2 | Sj+1) · · · P(Sj+(n−k) | Sj+(n−k)−1).
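As an illustration only, the following Python sketch estimates the transition probabilities of such a kth-order model from a list of tokens and scores a word string by multiplying the transition probabilities along its path; the class and method names (MarkovModel, string_probability) are ours and are not taken from the SMP implementation.

```python
from collections import defaultdict

class MarkovModel:
    """kth-order Markov model: each state is the tuple of the k most recent words."""

    def __init__(self, k):
        self.k = k
        self.counts = defaultdict(lambda: defaultdict(int))  # state -> next word -> count

    def train(self, words):
        """Record transition counts from a tokenized corpus (a list of words)."""
        for i in range(len(words) - self.k):
            state = tuple(words[i:i + self.k])          # S_i: the k words just seen
            self.counts[state][words[i + self.k]] += 1

    def transition_prob(self, state, word):
        """Estimate P(word | state) as a relative frequency of observed transitions."""
        nexts = self.counts.get(state, {})
        total = sum(nexts.values())
        return nexts.get(word, 0) / total if total else 0.0

    def string_probability(self, words):
        """Probability of a word string, conditioned on its first k words (the initial state)."""
        if len(words) <= self.k:
            return 0.0                                  # strings of k or fewer words have no path
        p = 1.0
        for i in range(len(words) - self.k):
            state = tuple(words[i:i + self.k])
            p *= self.transition_prob(state, words[i + self.k])
        return p
```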
One common measure of word similarity is the degree to which one word can be substituted for another, that is, the extent to which two words can be found in identical contexts. A context is defined as the states in the path on either side of a word. For convenience in notation, the state immediately preceding the word of interest is labeled the initial state, S0, and the state immediately following the word of interest is labeled the final state, Sf. The set of words that can be substituted into a given context can be found by identifying alternate paths with nonzero path probabilities between the initial and final states. Including the initial and final states, the length of the path between S0 and Sf is k + 2 states. Thus, for a context S0 and Sf, where Sf = wf1, wf2, . . . , wfk, the "replacement set" consists of every word wj for which the path probability P(wj, wf1, . . . , wfk | S0) is greater than zero. That is, we find the probability of that phrase occurring for every possible word wj.
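Under the same assumptions, the replacement-set computation can be sketched as follows: for each word observed to follow the initial state, score the path that passes through that word and then through the k words of the final state, keeping the words whose path probability is nonzero. The function name find_replacements is illustrative, not part of the SMP code.

```python
def find_replacements(model, initial_state, final_state):
    """Words that can fill the slot between initial_state (the k words before)
    and final_state (the k words after), with their path probabilities."""
    replacements = {}
    for candidate in model.counts.get(initial_state, {}):      # words observed to follow S_0
        path = list(initial_state) + [candidate] + list(final_state)
        p = model.string_probability(path)                     # k + 2 states, k + 1 transitions
        if p > 0:
            replacements[candidate] = p
    return replacements
```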
Synonyms for a specific word ws are obtained by finding all contexts in which ws occurs, calculating the replacement set for each of those contexts, and combining the replacement sets into a synonym set. Factoring in the frequency with which both ws and each replacement word appears in a context allows for ordering of the synonym set. All words ranking high in such an ordering may be said to be similar to each other.
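A possible sketch of this aggregation step, reusing the helpers above (contexts_of and rank_synonyms are again names chosen only for illustration), collects every context of the target word, accumulates the replacement-set probabilities across those contexts, and returns the candidates in ranked order.

```python
from collections import Counter

def contexts_of(words, target, k):
    """Yield (initial_state, final_state) pairs around each occurrence of target."""
    for i in range(k, len(words) - k):
        if words[i] == target:
            yield tuple(words[i - k:i]), tuple(words[i + 1:i + 1 + k])

def rank_synonyms(model, words, target):
    """Combine replacement sets over all contexts of target, weighting each candidate
    by how strongly it fills those contexts, and return candidates best-ranked first."""
    scores = Counter()
    for initial_state, final_state in contexts_of(words, target, model.k):
        for word, p in find_replacements(model, initial_state, final_state).items():
            if word != target:
                scores[word] += p
    return scores.most_common()
```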
When k = 1 the Markov model becomes first order. This model has the property that when calculating the occurrence probabilities of state paths that differ by a single word wj, the paths traversing the state space are identical through wj−1 and are again identical starting with wj+1. First-order models allow only a limited sense of context; only the current word influences the decision of what word will follow. To allow more of the context to affect each transition, we increase the order of the model. Each transition probability in an nth-order model is influenced by the preceding n words. Since word selection in natural language is not based solely on the word immediately preceding, but on a larger context, a model with a larger k seems more appropriate.
Given such a model, knowing more context around a target word ws will help us find better synonyms of ws. Suppose we are looking for synonyms of wb in the string wa, wb, wc. We may learn that P(wa, wi, wc) = P(wa, wj, wc), and P(wa, wi, wc) > P(wa, wk, wc) for all k ≠ i, j, thus providing no information as to which is the best synonym. If we are then able to learn that the string we are interested in, wa, wb, wc, comes from the greater context wz, wa, wb, wc, wd, this provides additional information. If P(wz, wa, wi, wc, wd) ≠ P(wz, wa, wj, wc, wd) then one of the candidates can be judged a better synonym than the other. If P(wz, wa, wi, wc, wd) is high and P(wz, wa, wj, wc, wd) is low, then wi is the clear winner; this simply indicates that the substring wa, wj, wc is common within some other context. In the case P(wz, wa, wi, wc, wd) = P(wz, wa, wj, wc, wd), the model is affirming the equal probability of wi and wj.
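The argument can be checked with the sketches above on a toy corpus; the tokens (wz, wa, wi, and so on) are placeholders standing in for real words, and the corpus below is fabricated solely to exercise the comparison.

```python
corpus_tokens = "wz wa wb wc wd wz wa wi wc wd wx wa wj wc wy".split()

model = MarkovModel(k=2)
model.train(corpus_tokens)

# In the short context wa _ wc the two candidates tie ...
p_short_i = model.string_probability(["wa", "wi", "wc"])
p_short_j = model.string_probability(["wa", "wj", "wc"])
assert p_short_i == p_short_j

# ... but the wider context wz wa _ wc wd separates them.
p_wide_i = model.string_probability(["wz", "wa", "wi", "wc", "wd"])
p_wide_j = model.string_probability(["wz", "wa", "wj", "wc", "wd"])
best = "wi" if p_wide_i > p_wide_j else "wj"   # here: wi
```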
This model is still incomplete and suffers from a number of flaws. For example, there are problems when trying to analyze strings with fewer than k words; such a sequence loses probabilistic meaning in this model. A large question remains in determining the optimal order to use when modeling natural language. To circumvent this, as well as to optimally use all available information in any situation, an interesting line of future research may include multi-order Markov models, which compute and then combine results from n jth-order Markov models, 1 ≤ j ≤ n.