Eric Roundy and Dr. Deryle Lonsdale, Linguistics
The first step in meeting the goals outlined in my proposal was to download the source code for the Festival speech synthesis engine. Festival is an open-source, modular synthesis engine built on a Scheme backbone. I ported the makefiles to build with my current version of the GNU Compiler Collection (GCC), then built and installed the engine.
Next we wrote wrappers in Tcl and Java to provide an abstract, high-level interface for third-party code. These wrappers ran Festival as a child process; then, using Festival's client/server API and pipe interface, they issued Scheme commands to exert low-level control over the program. Initially these wrappers took plain text from the user and returned the synthesized speech in the form of WAVE files. Later implementations were expanded to accept text marked up in Sable, an XML-based markup language.
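A minimal sketch of the Java side of such a wrapper, assuming Festival's pipe mode and the standard Scheme idiom for saving a synthesized utterance as a WAVE file; the class and method names here are illustrative, not the original wrapper's:

    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.PrintWriter;

    // Hypothetical wrapper: runs Festival as a child process in pipe
    // mode and drives it with Scheme commands over standard input.
    public class FestivalWrapper {
        private final Process festival;
        private final PrintWriter scheme;

        public FestivalWrapper() throws IOException {
            festival = new ProcessBuilder("festival", "--pipe").start();
            scheme = new PrintWriter(
                new OutputStreamWriter(festival.getOutputStream()), true);
        }

        // Synthesize plain text and save it as a RIFF/WAVE file.
        public void textToWave(String text, String wavPath) {
            scheme.printf(
                "(utt.save.wave (utt.synth (Utterance Text \"%s\")) \"%s\" 'riff)%n",
                text.replace("\"", "\\\""), wavPath);
        }

        public void close() {
            scheme.println("(quit)");
            scheme.close();
        }
    }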
After the wrappers were in place, we were able to write scripts which extracted and then synthesized two-party conversations from the 2001 Communicator Corpus, reaching our first milestone. These scripts worked by generating and synthesizing Sable-marked text, as in the fragment below. This first pass was implemented with the rather limited set of Sable tags already supported by Festival. In the future we want to expand the breadth and quality of the supported tags; in particular, we want better control over intonation, duration, and speech rate.
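For illustration, here is the kind of Sable fragment such a script might emit for one exchange. The tag names follow Festival's documented Sable support, but the speaker names and attribute values are invented, and the document type header is omitted:

    <SABLE>
      <SPEAKER NAME="male1">
        Hello. How can I help you?
        <BREAK LEVEL="large"/>
      </SPEAKER>
      <SPEAKER NAME="female1">
        I need a flight to <EMPH>Boston</EMPH>,
        <RATE SPEED="-10%"> leaving Tuesday morning. </RATE>
      </SPEAKER>
    </SABLE>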
We also used the Festival wrappers to provide speech synthesis for a round-trip conversation between a robot and a human. In this conversation the human and robot communicated across a hard-coded TCP connection.
The second and third milestones required a more versatile communication network that would allow for more scalable conversations. We decided on a star topology, with a central server responsible for facilitating communication among all of the clients in a conversation. Clients would feed the server audio in the format specified by the server, and the server would then multicast the merged audio. We chose Java to implement the server to allow more trouble-free integration with existing third-party software.
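The multicast leg of this design can be sketched with standard Java sockets; the group address and port below are placeholders, not the values we actually used:

    import java.io.IOException;
    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.net.MulticastSocket;

    // Both halves of the multicast leg; 230.0.0.1:4446 is an
    // illustrative group address, not the project's actual one.
    public class MulticastLeg {
        static final String GROUP = "230.0.0.1";
        static final int PORT = 4446;

        // Server side: push one merged audio frame to the group.
        static void sendMerged(DatagramSocket out, byte[] frame)
                throws IOException {
            out.send(new DatagramPacket(frame, frame.length,
                                        InetAddress.getByName(GROUP), PORT));
        }

        // Client side: join the group, then pull composite frames from it.
        static MulticastSocket joinComposite() throws IOException {
            MulticastSocket in = new MulticastSocket(PORT);
            in.joinGroup(InetAddress.getByName(GROUP));
            return in;
        }
    }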
In our first implementation we tried to stream audio across sockets using the TCP protocol. However, the overhead associated with this approach gave results that were simply unacceptable in what was supposed to be a real-time application. In subsequent implementations we used UDP, a protocol better suited to streaming data. This produced very good results once the timing of the threads was tuned properly. Unfortunately, on some machines a small amount of static was introduced into the incoming audio signal because we were unwilling to buffer the audio; buffering would have increased the lag between when an utterance was spoken and when it was heard.
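A sketch of the UDP sending path that replaced the TCP stream, including thread pacing of the kind we had to tune; the host, port, frame size, and audio format here are illustrative assumptions:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;

    // Hypothetical client-side streaming loop. At 8 kHz, 16-bit mono,
    // a 1024-byte frame holds 64 ms of audio, so the thread sleeps
    // 64 ms per frame to pace itself to roughly real time.
    public class UdpAudioSender {
        public static void stream(byte[] audio, String host, int port)
                throws Exception {
            DatagramSocket socket = new DatagramSocket();
            InetAddress server = InetAddress.getByName(host);
            final int FRAME = 1024;
            for (int off = 0; off < audio.length; off += FRAME) {
                int len = Math.min(FRAME, audio.length - off);
                socket.send(new DatagramPacket(audio, off, len, server, port));
                Thread.sleep(64);   // pacing; mistimed threads cause dropouts
            }
            socket.close();
        }
    }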
A disadvantage of using Java to implement our server was that it does not natively provide a software mixer. This problem was compounded by our inability to locate a third-party mixer with an acceptable license, so we built our own mixer capable of merging the incoming audio. The lines of our mixer drew their input from the UDP ports to which the clients were sending their data. These lines read and buffered all of the data sent across the connection. The mixing itself was performed by summing the next frame of data from each of the available lines. The resulting frames were then inserted into new packets and multicast.
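The core of that mixing step can be sketched as follows, assuming 16-bit signed little-endian PCM. The clamping of summed samples shown here is one standard way to avoid wrap-around distortion, not necessarily a detail of our original code:

    // Sum the corresponding sample from each line's next frame,
    // clamping the result to the signed 16-bit range.
    public class FrameMixer {
        public static byte[] mix(byte[][] frames, int frameBytes) {
            byte[] merged = new byte[frameBytes];
            for (int i = 0; i < frameBytes; i += 2) {
                int sum = 0;
                for (byte[] f : frames) {
                    // decode one little-endian 16-bit sample
                    sum += (short) ((f[i] & 0xff) | (f[i + 1] << 8));
                }
                sum = Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, sum));
                merged[i] = (byte) sum;            // low byte
                merged[i + 1] = (byte) (sum >> 8); // high byte
            }
            return merged;
        }
    }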
Since our purpose was to provide an architecture facilitating dynamic conversations involving both human and automated participants, we needed separate clients that could handle both types of users. The human client simply captured audio from a microphone and fed it to the server, then played the audio generated by the server. The synthetic client took marked-up text and passed it to Festival, then converted the generated WAVE files to the appropriate audio format and fed them to the server. Both clients took the address of the composite multicast from the server and either played it locally or passed the address on to another component which would.
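The capture side of the human client can be sketched with the standard javax.sound.sampled API; the audio format below is an illustrative choice, not necessarily the one the server specified:

    import javax.sound.sampled.AudioFormat;
    import javax.sound.sampled.AudioSystem;
    import javax.sound.sampled.DataLine;
    import javax.sound.sampled.TargetDataLine;

    // Hypothetical capture loop: open the microphone in a fixed format
    // and read raw frames off the line for streaming to the server.
    public class HumanClientCapture {
        public static void main(String[] args) throws Exception {
            AudioFormat fmt = new AudioFormat(8000f, 16, 1, true, false);
            DataLine.Info info = new DataLine.Info(TargetDataLine.class, fmt);
            TargetDataLine mic = (TargetDataLine) AudioSystem.getLine(info);
            mic.open(fmt);
            mic.start();

            byte[] buf = new byte[1024];
            while (true) {
                int n = mic.read(buf, 0, buf.length);
                // ...send buf[0..n) to the server (see the UDP sketch above)...
            }
        }
    }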
This project shows that text-to-speech can be effectively extended to dynamic multi-party conversations by building on existing, unrelated technologies. The modular nature of this approach will allow for more work to be done in the future. The most fruitful areas for development lie in Festival itself: we will need to add new Sable tags and make existing tags more efficient. As support for markup languages improves, we will be able to move more of the decision-making process into external cognitive components, providing speech synthesis capabilities which are more variable, flexible, and effective. There is also more to be done on the mixing component of this software. First, Java is not the ideal language for what should be a high-speed, high-precision component; a more efficient module could be written in C or C++. Second, the current mixing component only supports signed PCM audio formats; future implementations should be extended to handle a wider array of audio encodings. The ideal solution would be to find an existing module whose license allows its integration into this project.
Through the course of my research, I have learned a great deal about low-level digital sound manipulation and audio streaming. I have had an opportunity to learn Tcl/Tk and to improve my Scheme and Java programming skills. Additionally, I have learned a great deal about synthetic speech and have even been able to participate in some of the dialogue surrounding the technology, taking part in presentations at the Deseret Language and Linguistics Symposium (DLLS) and the Foreign Language Education and Technology Conference (FLEAT 5). More importantly, I am now better equipped to formulate project ideas, plan an approach, teach myself new skills and programming languages, and ultimately carry through with set short- and long-term goals. I feel that this has been an excellent opportunity for me as an individual.