Nicholas A. Stetich and Dr. Deryle Lonsdale, Linguistics and English Language
Although it is often repeated, perhaps to the point of cliché, we live in an age of information. Information is readily available in a variety of formats, especially electronic ones. Our ability to digitize data, making it accessible to computers, has increased dramatically over the last several decades thanks to the exponential growth of computers' processing power and storage capacity. Much of this data, however, is unorganized, spread across many formats (text, audio, video) and many different languages. Expert linguists can often identify the language being spoken in a given piece of audio, but human time is expensive and even experts make mistakes. Automating spoken language identification would be a tremendous boon, especially where large numbers of language samples must be identified.
In previous research conducted at BYU, Dr. Deryle Lonsdale successfully used the Analogical Modeling (AM) system, an exemplar-based system developed by another BYU faculty member, Royal Skousen [1], to identify textual samples of various languages. We chose to test AM's ability to distinguish spoken samples of language, and thereby to prototype an automated spoken language identification system.
AM is a system capable of "learning" from training data, then applying the relationships it has observed to new data and determining which category that data falls into. The difficult part of using AM for identification is deciding which features of the data to supply. For voice samples, candidates include pitch, formant frequencies, intensity, or any combination of these.
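AM's actual mechanics (building analogical sets over "supracontexts" of matching features, per Skousen [1]) are beyond the scope of this summary, but the general exemplar-based setup it shares with other memory-based learners can be sketched in a few lines of Python. The sketch below is not Skousen's algorithm; it simply stores labeled exemplars and labels a new vector by feature overlap, and all names in it are ours.

```python
# A minimal sketch of exemplar-based classification (NOT Skousen's AM
# algorithm, which builds analogical sets over supracontexts): store
# labeled exemplars, then label a new vector by feature overlap.
from collections import Counter

def overlap(a, b):
    """Number of feature positions where two vectors agree."""
    return sum(1 for x, y in zip(a, b) if x == y)

def classify(exemplars, novel):
    """exemplars: list of (label, feature_tuple); novel: feature_tuple.
    Returns the majority label among the best-matching exemplars."""
    best = max(overlap(feats, novel) for _, feats in exemplars)
    votes = Counter(label for label, feats in exemplars
                    if overlap(feats, novel) == best)
    return votes.most_common(1)[0][0]

# Hypothetical two-window vectors: (F1, F2) bands per window.
train = [("e", (600, 2300, 500, 2400)), ("m", (400, 1100, 500, 1200))]
print(classify(train, (600, 2400, 500, 2300)))  # -> 'e'
```

The real AM engine weighs exemplars through its analogical set rather than a simple best-match vote, but the train-then-classify interface is the same.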
For spoken sample data, we used the OGI Multilanguage Corpus, which contains telephone calls from speakers of eleven different languages. We used the phonetic analysis program Praat [2] to extract data from the telephone calls in 15 ms windows. We expected the first four formant frequencies to be a good indicator of the language being spoken: formant frequencies are determined by the placement of the articulatory organs in the mouth and throat, and generally indicate which vowels or consonants are being produced. Since some sounds are particular to certain languages, we felt this was at least a good place to start.
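As an illustration of the extraction step, the following sketch uses parselmouth, a Python wrapper around Praat (we drove Praat directly, so this is an equivalent path rather than our exact procedure); the file name and analysis settings are placeholders.

```python
# Illustrative formant extraction via parselmouth, a Python wrapper
# around Praat; the file name and settings are placeholders, not the
# study's exact configuration.
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("call.wav")  # hypothetical telephone sample
# Praat's Burg formant tracker: one analysis frame every 15 ms.
formant = call(snd, "To Formant (burg)", 0.015, 5, 5500, 0.025, 50)

n_frames = call(formant, "Get number of frames")
for i in range(1, int(n_frames) + 1):
    t = call(formant, "Get time from frame number", i)
    # First four formant values (Hz) for this window.
    values = [call(formant, "Get value at time", f, t, "Hertz", "Linear")
              for f in (1, 2, 3, 4)]
    print(round(t, 3), values)
```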
Each feature vector we gave to AM covered six 15 ms windows of speech data. Because AM relates categories well but has no sense of the numerical closeness of individual values, we placed formant frequencies into bands by rounding each frequency to the nearest 100 Hz. Here is an example of a feature vector describing six 15 ms windows from a sample of spoken English:
e, 600 600 2300 3000 700 700 2300 3200 400 400 2400 3100 400 400 2300 2900 500 500 2300 3000 500 500 2300 2900 ,
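Concretely, the banding and vector assembly described above can be sketched as follows; the helper names are ours, and the output reproduces the format of the example vector.

```python
# Sketch of the frequency banding and vector assembly described above;
# helper names are ours, and the output line mimics the example format.
def band(freq_hz):
    """Round a frequency to the nearest 100 Hz band."""
    return int(round(freq_hz / 100.0)) * 100

def am_exemplar(label, windows):
    """windows: six lists of four per-window frequency values (Hz).
    Returns one AM-style training line: 'label, f f f f ... ,'."""
    assert len(windows) == 6, "one exemplar spans six 15 ms windows"
    feats = " ".join(str(band(f)) for w in windows for f in w)
    return f"{label}, {feats} ,"

windows = [[612, 627, 2285, 3016], [688, 705, 2304, 3178],
           [401, 395, 2377, 3099], [409, 388, 2341, 2902],
           [504, 498, 2311, 2961], [497, 511, 2288, 2874]]
print(am_exemplar("e", windows))
# e, 600 600 2300 3000 700 700 2300 3200 400 400 2400 3100 ... ,
```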
The results were promising. In a test involving 4,318 feature vectors, each covering 90 ms of speech (1,400 English, 1,536 Mandarin, and 1,382 Tamil), the AM system correctly identified the language being spoken 62.92% of the time. It was particularly successful with English: 75.07% of the English samples in the test data were correctly identified. In another test between German, Farsi, and English samples, the system chose the correct language 64.22% of the time. Although 64% might leave something to be desired as a student's grade on a final exam, it is certainly an improvement on the 33% baseline the system would have achieved had its decisions been entirely random. In fact, our results compare reasonably well with previous research we studied, in which automated systems typically achieved between 65% and 75% correct recognition of spoken language samples.
To see how AM's results compared with those of similar systems, we ran a test with TiMBL [3], a memory-based learning system built on the k-nearest-neighbor algorithm, using the same English, Mandarin, and Tamil training and test data as in the earlier AM test. TiMBL did not fare quite as well as AM under the same circumstances: it guessed correctly 55.41% of the time. TiMBL was nevertheless helpful to our research; as part of evaluating its training data, it reports which features in the vectors were most useful for distinguishing between outcomes. When trained on our data, TiMBL ranked the first feature of each of the six windows (F0, the fundamental frequency of the voice) as the most important, with the second and third formants next. This could inform future iterations of AM testing; for example, a feature vector could include eight 15 ms windows with only F0 and the second and third formants.
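TiMBL derives this ranking from information-theoretic feature weighting (information gain, or gain ratio, computed over the training data). A minimal version of the information-gain calculation, with our own helper names and toy data, might look like this:

```python
# Minimal information-gain ranking of symbolic features, the kind of
# weighting TiMBL reports for its training data; helper names are ours.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(data, i):
    """data: list of (label, feature_tuple); i: feature index.
    IG(i) = H(labels) - sum_v P(v) * H(labels | feature_i = v)."""
    base = entropy([lab for lab, _ in data])
    by_value = {}
    for lab, feats in data:
        by_value.setdefault(feats[i], []).append(lab)
    remainder = sum(len(labs) / len(data) * entropy(labs)
                    for labs in by_value.values())
    return base - remainder

data = [("e", (600, 2300)), ("e", (500, 2300)),
        ("m", (600, 1100)), ("m", (500, 1200))]
ranked = sorted(range(2), key=lambda i: information_gain(data, i),
                reverse=True)
print(ranked)  # feature 1 (second column) separates the labels best
```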
There is still much research to be done on the use of Analogical Modeling for spoken language identification; my work has only scratched the surface. I would like to classify the training data further, perhaps sorting the data sets to see whether the system performs better with female or male speakers (since formant values differ between the two), and examining which individual sound segments occur in the samples AM checks; this might help us discover which sounds most strongly induce the system to pick one language over another. Another possibility would be to find a way to represent the transitions between sound segments in the training data, since these transitions should be a very good indicator of the sound combinations of a language.
This project has been a valuable experience for me. We have shown that AM can be useful for distinguishing not only the written forms of languages but their spoken forms as well. I believe that further research in this area will increase the system's accuracy and could contribute significantly to solving the problem of automated language identification. ORCA funding has given me the means to investigate a state-of-the-art speech processing problem, study machine learning approaches, and develop programs and data sets that illustrate possible solutions.
References
1. Skousen, Royal. Analogical Modeling of Language. Dordrecht: Kluwer Academic Publishers, 1989.
2. See www.praat.org.
3. See http://ilk.kub.nl/software.html.