Jeremiah McGhee and Dr. Deryle Lonsdale, Linguistics and English Language
The use of computers to perform speech recognition on human speech is becoming commonplace in today's world. Less well known are attempts to apply speech recognition to other animal species. Current projects are investigating the use of speech recognition technologies to gain insight into the communication of dolphins [1], elephants [2], birds [3], and even crickets [4]. All of these projects are showing very promising results, with recognition accuracies approaching 90%. The goal of this project is to apply speech recognition technology to identify the distinct calls of vervet monkeys.
The subjects of our research are East African vervet monkeys (Cercopithecus aethiops), Old World monkeys and close relatives of humans, having diverged from a common ancestor roughly 20 million years ago. Vervets are semiterrestrial, spending nearly equal amounts of time on the ground and in the trees, and the majority of their social interactions happen while they are on the ground. Vervets have three distinct types of vocalizations: grunts, chutters, and alarms. Chutters, the least understood of the calls, are soft, murmuring calls that occur during grooming and other everyday social interactions. Grunts have a more specialized purpose and are used to indicate dominance or inferiority; for instance, when an animal approaches a more dominant individual, a grunt serves as a warning or alert of its approach. Grunts are also used as contact calls when tracking predators and other vervets. They are harsh, raspy signals that sound like a human clearing his throat with his mouth open.

There are six distinct alarm calls falling into two sub-categories: major alarms and minor alarms. The minor alarms are classified as minor predator, baboon, and unfamiliar human. They are very quiet calls and therefore difficult to observe and quantify; moreover, because they are minor alarms, the vervets display a much more reserved, or subtle, response to them. The major alarms are classified as snake, eagle, and leopard. These are much louder calls, and the vervets' responses can be easily identified. A snake alarm causes the vervets to rise up on their hind legs and survey the surrounding territory in an attempt to locate the predator; once it is located, the vervets will mob the snake and generally harass it until it has left the area. An eagle alarm causes the vervets to look up, again in an attempt to locate the predator (using their acute eyesight), and to enter bushes or low brush, usually leaving the trees.
Both the eagle and snake alarms are short, multiple-syllable, cough-like calls. A leopard alarm, a loud barking call, causes the vervets to enter the trees, where they can use their speed and agility to evade the predator.
For our research we are using the TalkBank Ethology Corpus: Field Recordings of Vervet Monkey Call [5]. This data publication contains digitized audio files of field recordings of vervets collected by Robert M. Seyfarth and Dorothy L. Cheney in 1977 and 1978.
We first took the annotations provided by Seyfarth and segmented the continuous recordings into individual vocalizations, removing noise and extraneous data from the signal so that we could focus on the semantic content. After classifying the recordings by call type, we carried out acoustic analyses of the individual calls to extract the features that uniquely identify the different types of calls. For our acoustic analysis we used the freely available Praat software [6], which allows us to analyze features such as pitch, frequency, and intensity, and to quantify them for further use in our system.
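The segmentation step described above can be sketched in a few lines of Python using only the standard library's `wave` module. The file names and the start/end times are hypothetical stand-ins for the corpus annotations; this is a minimal illustration, not the actual processing pipeline we used.

```python
import wave

def extract_segment(in_path, out_path, start_s, end_s):
    """Copy the samples between start_s and end_s (in seconds) from a
    continuous field recording into a new WAV file holding one call."""
    with wave.open(in_path, "rb") as src:
        rate = src.getframerate()
        # Seek to the annotated start time and read the annotated span.
        src.setpos(int(start_s * rate))
        frames = src.readframes(int((end_s - start_s) * rate))
        with wave.open(out_path, "wb") as dst:
            dst.setparams(src.getparams())  # same channels/width/rate
            dst.writeframes(frames)
```

A real pipeline would loop this over every annotated span in a recording, writing one file per vocalization.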
After quantifying the vocalizations' features, the next step was to prepare them for the machine learning system. Again we used a freely available system: TiMBL, the Tilburg Memory-Based Learner, a fast, decision-tree-based implementation of k-nearest neighbor classification [7]. To use TiMBL we encoded the previously quantified features as vectors. This required processing the Praat output files and merging the feature quantifications so that each vector contained all the pertinent data for a single vocalization. Each vector consists of an outcome label (e.g., leopard) and numbers representing the features we extracted in Praat, including pitch, frequency, and intensity.
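The two steps above, merging per-call measurements into labeled vectors and classifying by nearest neighbor, can be sketched as follows. The feature names are illustrative only, and the 1-nearest-neighbor function is a deliberately minimal stand-in for TiMBL, which adds feature weighting and an efficient tree-based index.

```python
import math

def make_vector(label, features,
                order=("mean_pitch", "mean_intensity", "duration")):
    """Merge one call's feature quantifications (as parsed from Praat
    output) into a (label, numeric vector) training example."""
    return label, [features[name] for name in order]

def classify_1nn(train, vec):
    """Memory-based classification in miniature: return the label of
    the stored training vector closest to vec (k = 1, Euclidean)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(train, key=lambda example: dist(example[1], vec))[0]
```

For example, a new call whose measurements fall nearest a stored snake-alarm vector would be labeled `snake`.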
For our first major iteration of the project we created vectors containing segment-level features only: average pitch, average frequency, segment length, and others. This iteration had six possible outcomes: snake, eagle, leopard, grunt, chutter, and unknown (none of the above). Our best result on this first iteration achieved 69% accuracy. TiMBL also reports how much each feature contributes to discriminating between the possible outcomes, and this information showed us that many of the vocalization-level features were simply not informative enough to help the computer distinguish vocalizations. Our second major iteration therefore broke each vocalization into pieces to obtain more finely tuned features, and we limited the possible outcomes to snake and eagle alarms. The second iteration's best test achieved 83% accuracy.
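The idea behind the second iteration, replacing one whole-call average with several per-piece measurements, can be sketched as below. The sample list and the mean-absolute-amplitude feature are hypothetical stand-ins for the pitch and intensity measures we actually took from Praat.

```python
def piecewise_features(samples, n_pieces=3):
    """Split one vocalization into equal-length pieces and compute a
    simple per-piece feature (mean absolute amplitude here), yielding
    n_pieces numbers instead of a single whole-call average."""
    size = len(samples) // n_pieces
    pieces = [samples[i * size:(i + 1) * size] for i in range(n_pieces)]
    return [sum(abs(s) for s in p) / len(p) for p in pieces]
```

Two calls with the same overall average can still differ piece by piece, which is what gives the finer-grained vectors their extra discriminating power.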
We believe we can improve the system still further and achieve accuracies of 90% or better. Extending the system to differentiate between snake, eagle, and leopard alarms would be the next step; those three alarms make up nearly 70% of the corpus. At that point, rather than extending the system to handle the 'non-semantic' calls (grunts and chutters), it would probably be more useful to adapt the system to distinguish individual vervets. The corpus is annotated with the identity of the vocalizing vervet, so this is a natural extension of our research. One other possibility exists: Seyfarth and Cheney are now doing extensive research with baboons and are in the process of compiling a similar set of data. Extending the system to another species would require considerable adaptation but presents an intriguing research opportunity.
References
- [1] Shulz, Tanja. 2003. Towards dolphin recognition. Paper presented at the third project meeting, Towards Communication with Dolphins, West Palm Beach, Florida, March 29-30, 2003.
- [2] Clemins, P.J., M.T. Johnson, K.M. Leong, and A. Savage. 2005. Automatic classification and speaker identification of African elephant (Loxodonta africana) vocalizations. Journal of the Acoustical Society of America 117:1-8.
- [3] Fox, E.J.S., J.D. Roberts, and M. Bennamoun. 2006. Text-independent speaker identification in birds. International Conference on Spoken Language Processing 2006:2122-2125.
- [4] Potamitis, I., T. Ganchev, and N. Fakotakis. 2006. Automatic acoustic identification of insects inspired by the speaker recognition paradigm. International Conference on Spoken Language Processing 2006:2126-2129.
- [5] Seyfarth, Robert M. and Dorothy L. Cheney. 2004. TalkBank Ethology Data: Field Recordings of Vervet Monkey Call. Philadelphia: Linguistic Data Consortium.
- [6] Boersma, Paul and David Weenink. 2007. Praat: doing phonetics by computer (version 4.5.08). http://www.praat.org (accessed October 1, 2005).
- [7] Van der Sloot, K. 2007. TiMBL: Tilburg Memory-Based Learner (version 6.0) API Guide. ILK Research Group Technical Report Series. http://ilk.uvt.nl/timbl (accessed September 15, 2006).