Deryl K. Hatch and Dr. Deryle Lonsdale, Linguistics and English Language
Analogical Modeling (AM) is an algorithm for systematically comparing related sets of data (‘vectors’) in search of patterns of recurrence or dominance, whether apparent or not, in order to predict the likely outcome of novel occurrences. As an alternative to rule-based explanations of language, for example, it allows analogy to predict behavior instead of rules prescribing that behavior. The motivation behind this approach is the idea that the human mind learns and processes language more by analogy than by applying hierarchical rules. AM has successfully modeled many tendencies in language: historical change in English, irregular and “exceptional” behavior in Finnish grammar, and even sociolinguistic behavior in the choice of formal and familiar terms of address in Arabic.
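To make the prediction step concrete, the following toy Perl sketch compares a novel feature vector against a handful of stored exemplars. It uses the common operational simplification of Skousen’s algorithm (a supracontext contributes only if all of its members share one outcome or one subcontext); the feature values and outcomes are hypothetical, and a real implementation involves considerably more machinery.

    use strict;
    use warnings;

    # Hypothetical exemplars: fixed-length feature vectors with known outcomes.
    my @exemplars = (
        { vec => [qw(a b c)], outcome => 'X' },
        { vec => [qw(a b d)], outcome => 'X' },
        { vec => [qw(e b c)], outcome => 'Y' },
        { vec => [qw(e f d)], outcome => 'Y' },
    );
    my @test = qw(e b d);    # novel vector whose outcome we want to predict
    my $n    = scalar @test;

    # Each exemplar's subcontext: its pattern of matches (1) and
    # mismatches (0) against the test vector.
    for my $ex (@exemplars) {
        $ex->{sub} = join '',
            map { $ex->{vec}[$_] eq $test[$_] ? 1 : 0 } 0 .. $n - 1;
    }

    # Enumerate all 2^n supracontexts; a bitmask marks the positions
    # that must match the test vector.
    my %support;    # outcome => accumulated pointers
    for my $mask ( 0 .. 2**$n - 1 ) {
        my @members = grep {
            my $ok = 1;
            for my $i ( 0 .. $n - 1 ) {
                $ok = 0 if ( $mask >> $i & 1 ) && substr( $_->{sub}, $i, 1 ) ne '1';
            }
            $ok;
        } @exemplars;
        next unless @members;

        # Homogeneous iff all members share one outcome or one subcontext.
        my %out = map { $_->{outcome} => 1 } @members;
        my %sub = map { $_->{sub}     => 1 } @members;
        next unless keys(%out) == 1 || keys(%sub) == 1;

        # Quadratic pointer count: each member points to every member.
        $support{ $_->{outcome} } += scalar @members for @members;
    }

    # The analogical set: relative support for each candidate outcome.
    my $total = 0;
    $total += $_ for values %support;
    printf "%s: %.1f%%\n", $_, 100 * $support{$_} / $total
        for sort { $support{$b} <=> $support{$a} } keys %support;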
It was Daniel Jones of the Centre for Computational Linguistics in Manchester, England who, in a brief treatise published in 1996, first proposed AM as a viable tool for translation.1 By looking for patterns in the way ideas are encapsulated in sub-sentential units, similar groupings can be found across languages regardless of their surface grammatical makeup. For example, in business correspondence, the prepositional phrase (PP) “from America” is associated with only so many predicate frames, such as “import” and “export”. Given the encoding of this PP in two languages, associated with its corresponding predicate frames from various contexts, a novel sentence in either language that includes the PP “from America” can be found by analogy to be associated with only a certain number of predicate frames. In this way the PP, together with the sentence’s determiner phrases (DP’s), noun phrases (NP’s), and other ‘x’ phrases (XP’s), can each be searched by analogy, each depending independently on the same predicate frame, and then recombined to create a novel sentence built up from sub-sentential units. A translation is thus achieved based on the “cloning” of these individual phrasal units. My work with Dr. Lonsdale has been to realize a larger-scale test of Jones’ proposal, moving from hand-picked language samples to existing language corpora.
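As a rough illustration of the pairing Jones envisioned, the Perl fragment below keys sub-sentential units to predicate frames in two languages. The frames, phrases, and French glosses are hypothetical stand-ins for exposition, not data from our experiments.

    use strict;
    use warnings;

    # Each (hypothetical) predicate frame indexes the sub-sentential
    # units observed with it in each language.
    my %frames = (
        import => {
            en => { pp => 'from America', np => 'the goods' },
            fr => { pp => "d'Amérique",   np => 'les marchandises' },
        },
        export => {
            en => { pp => 'from America', np => 'the machinery' },
            fr => { pp => "d'Amérique",   np => 'les machines' },
        },
    );

    # Suppose analogy over the English vectors selects 'import' as the
    # most probable frame for a novel sentence containing "from America"...
    my $frame = 'import';

    # ...then each XP is cloned independently from what the target
    # language associates with the same frame, and recombined.
    my $clone = $frames{$frame}{fr};
    print "frame=$frame  pp=$clone->{pp}  np=$clone->{np}\n";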
The Analogical Modeling algorithm is an essentially transparent tool of data analysis, with only a few variables to adjust for a given situation; rather, it is the nature and structure of the data set that determines the nature of the results. Items (words, phrases, or terms in the case of language studies) are ‘vectorized’ according to their salient features. These feature vectors can be made up of virtually any type of encoding, from binary flags to multi-character descriptors, depending on the quantity and quality of the differentiators between items. However, because the computational power needed to execute the algorithm grows exponentially with the number of distinguishing features, the number of features is generally limited to about 22 per vector, even on a supercomputer. The number of vectors in a dataset needs to be large enough to be representative, but not so large that it becomes cumbersome.
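A minimal sketch of what vectorization amounts to in practice, under assumed conventions (six feature slots, ‘=’ as a null placeholder); real feature sets and their encodings depend entirely on the data being modeled:

    use strict;
    use warnings;

    my $NFEATURES = 6;      # fixed vector length, well under the ~22 ceiling
    my $NULL      = '=';    # placeholder for an absent feature

    sub vectorize {
        my @features = @_;
        die "too many features\n" if @features > $NFEATURES;
        push @features, $NULL while @features < $NFEATURES;   # pad to fixed length
        return join ',', @features;
    }

    # The same item under a binary encoding and under a richer
    # multi-character encoding (both hypothetical):
    print vectorize(qw(1 1 0 0 1 0)), "\n";        # 1,1,0,0,1,0
    print vectorize(qw(NP PP VP import)), "\n";    # NP,PP,VP,import,=,=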
The fundamental question in applying AM to machine translation is how to assign each phrasal unit a unique identifier in the form of a feature vector. Our experiments sought the best such vector, or at least one smart enough to clone correctly despite its imperfections.
Jones’ proposal was limited to experiments on a small body of hand-coded, carefully selected prepositional phrases chosen to illustrate his methodology. A large-scale realization of his experiment would preclude hand-coding and would require arbitrarily chosen phrases. The choice of a language corpus was not difficult: the BABEL corpus of business correspondence provided an appropriately sized, pre-translated set of data for experimentation.2 To analyze the selected corpus, Dr. Lonsdale proposed using the Link Grammar Parser (developed at Carnegie Mellon University) because of its robustness and its highly detailed description of the syntactic inventory of sentences.3 These parses became the first step in all of our subsequent experiments.
In our first experiments the parsed phrases were processed with a Perl program that dissects them by extracting the verb and then inventorying each phrase’s syntactic content. Initially we followed a binary style of encoding, as proposed by Jones, with marginal results. The sparseness of distinguishing features, and interference from anomalous vectors associated with auxiliary verbs, made no clear prediction possible. The number of possible outcomes in the ‘analogical set’ was unwieldy, with no clear frontrunner in percentage terms. Binary inventories were not enough to distinguish among so many types of XP’s. We needed more description.
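In outline, the first-round encoding worked roughly as follows (a simplified sketch; the actual program reads Link Grammar parses, and the phrase inventory here is illustrative):

    use strict;
    use warnings;

    # Phrase types inventoried (an illustrative list, not the full set).
    my @inventory = qw(NP PP VP ADVP);

    sub binary_vector {
        my ($verb, @phrases) = @_;
        my %seen = map { $_ => 1 } @phrases;
        # The outcome (verb) followed by presence/absence flags.
        return join ',', $verb, map { $seen{$_} ? 1 : 0 } @inventory;
    }

    # "We import the goods from America" -> NP and PP present:
    print binary_vector('import', 'NP', 'PP'), "\n";    # import,1,1,0,0
    # "We export the machinery from America" yields the identical feature
    # pattern -- the sparseness that left no clear frontrunner:
    print binary_vector('export', 'NP', 'PP'), "\n";    # export,1,1,0,0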
I suggested to Dr. Lonsdale that instead of simply describing the phrases in binary terms of presence or absence, we encode the vectors with the name of the ‘node’ of the sentential unit itself. He made the necessary modifications to the Perl vector builder, and in our second round of experiments the Link Grammar Parser’s fine-grained description provided differentiators rich enough to produce notable success: the correct outcome was consistently among the top five probable outcomes for each test vector, and the analogical set of all probabilities was reduced by over 80%. This was encouraging progress, but though the vectors were now more precise, they often described entire sentences, made up of any number of NP’s, PP’s, and even subordinate verb phrases (VP’s), and thus lacked the proper division at the ‘border’ of these predicate frames. Such units were clearly too large to eventually clone cross-linguistically. We needed to parse more locally.
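The second-round change, in schematic form: each slot now carries a node label rather than a bare 1 or 0. The labels below are illustrative placeholders, not the parser’s actual node names.

    use strict;
    use warnings;

    my $NSLOTS = 4;    # assumed fixed number of node slots for this sketch

    sub node_vector {
        my ($verb, @nodes) = @_;
        push @nodes, '=' while @nodes < $NSLOTS;    # pad absent slots
        return join ',', $verb, @nodes;
    }

    # "We import the goods from America", encoded with node labels
    # rather than bare presence flags:
    print node_vector('import', 'NP-object', 'PP-source'), "\n";
    # -> import,NP-object,PP-source,=,=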
Again the Link Grammar Parser provided the solution. Dr. Lonsdale modified the Perl code once more to dissect the parses at a finer grain. Instead of only inventorying the surface structure of a sentence, the new vector builder scanned its inner groupings, looking for the critical XP’s at which it could be dissected. Now not only was the verb associated with each sub-sentential unit, but so was the verb’s relation to that unit, approximating the thematic roles originally proposed by Jones. In this third round of experiments the correct outcome was ranked first for 85% of all test vectors, with a significantly larger probability margin separating it from the rest of the analogical set. We had successfully cloned the XP’s, at least within the same language. The next step was to see whether we could clone XP’s from another corpus against our dataset of vectors.
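Schematically, the third-round vector builder emits one vector per sub-sentential unit, pairing the verb with its relation to that unit; the relation names here are hypothetical approximations of thematic roles.

    use strict;
    use warnings;

    # Hypothetical dissection of "We import the goods from America":
    my @units = (
        { xp => 'NP', relation => 'object', text => 'the goods' },
        { xp => 'PP', relation => 'source', text => 'from America' },
    );

    # One vector per sub-sentential unit: the verb, its relation to the
    # unit, and the unit's type.
    for my $u (@units) {
        print join(',', 'import', $u->{relation}, $u->{xp}), "   # $u->{text}\n";
    }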
Because we do not yet have a parser for another language that parallels the one we have for English, our attempts to create a set of parallel vectors have not been encouraging. Given this limitation, we thought we might instead test the system against different corpora in English, to establish at least a baseline against which to measure future experiments in other languages. But we have not yet found a suitable corpus whose linguistic style is close enough to business correspondence to give reliable, pertinent results.
Insofar as we have made significant progress in testing the scalability of Jones’ proposals, I am happy to have mostly achieved the aims of my original proposal. But in order to continue we will need to (1) further perfect Dr. Lonsdale’s French link grammar parser, or find a suitable substitute for a target language, and (2) develop the recombinatory mechanisms to go from probable clones to complete sentences. To further ensure accuracy, we have also discussed ways to encode more semantic content into our vectorization process. In this way we hope to finally achieve exemplar-based translation through phrasal cloning.
1 Jones, Daniel. Analogical Natural Language Processing. London: UCL Press, 1996.
2 BABEL business correspondence corpus: European Corpus Initiative Multilingual Corpus 1.
3 http://www.link.cs.cmu.edu/link/