Steven L. Tait, Jr. and Dr. William J. Strong, Physics and Astronomy
Many sound pairs in the English language appear ambiguous to lip readers, such as /pa/ and /ba/ or /ki/ and /gi/. In previous studies, a lip reading aid was developed to overcome this obstacle by processing sound signals and displaying them as polygons (see Hunter, 1997 and Tait, 2000). We extended that work in the research for my Honors Thesis, experimenting with different display formats and demonstrating the potential of a visual display of speech spectra. That research has been submitted to the Journal of the Acoustical Society of America for publication.
In that research, we tested a “pie slice” format and found that subjects preferred it to the previously used decagon format. An example of a speech spectrum represented in the “pie slice” format is shown in Figure 1, along with the decagon format used in the previous studies. For details on how the speech frequency spectra are calculated and how the graphical representations are generated, the reader is referred to the author's Honors Thesis, available in the BYU library (Tait, 2000).
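The thesis gives the full signal-processing details; as a rough illustration only, one frame of speech could be reduced to a small set of band magnitudes and mapped to the radii of a pie-slice polygon. The band layout, frame size, and scaling below are assumptions for the sketch, not the parameters used in the actual study.

```python
import numpy as np

def pie_slice_radii(frame, sample_rate=16000, n_bands=10):
    """Map one frame of audio to n_bands radii for a pie-slice display.

    Illustrative sketch only: the band edges and normalization here
    are assumptions, not the processing used in the study.
    """
    # Windowed FFT magnitude spectrum of the frame.
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    # Divide 0 to Nyquist into equal-width bands; take mean magnitude per band.
    edges = np.linspace(0, sample_rate / 2, n_bands + 1)
    radii = np.array([
        spectrum[(freqs >= lo) & (freqs < hi)].mean()
        for lo, hi in zip(edges[:-1], edges[1:])
    ])
    # Normalize so the strongest band fills the display.
    return radii / radii.max() if radii.max() > 0 else radii

# Example: a 440 Hz tone concentrates its energy in the lowest band.
frame = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)
radii = pie_slice_radii(frame)
```

Each radius would then set the extent of one slice of the polygon, so the display changes shape as the spectrum changes from frame to frame.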
The present research focused on two key objectives that we wished to explore further. In our previous research we noticed a learning trend among the test subjects. Subjects took four tests, each with a different display format. The order of the tests was randomized, but on average subjects scored 5 percent better on the second test they took than on the first, and another 5 percent better on the third. In this study, subjects took the same test three times, allowing us to test for learning when subjects used only one display format.
To this point in the research, all of our testing had been done with one recorded set of the 18 sound pairs. We tested the display using different utterances of the sounds by the same speaker, to see whether visual representations of different utterances of the same sound looked more alike than representations of two different sounds. The 18 sound pairs used in testing comprised 22 different syllables. We therefore tested 18 “different” pairs and 22 “same” pairs, where each “same” pair consisted of two different utterances of the same syllable. The order in which the pairs were presented to each subject was randomized, using a computer program that ensured that the correlation between the test orders of any two subjects was minimal.
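The study's actual randomization program is not described here; a hypothetical sketch of the idea is to shuffle the 40 pairs (18 “different” plus 22 “same”) for each subject and reject any candidate order that agrees with an earlier subject's order in too many positions. The `max_overlap` threshold below is an illustrative assumption.

```python
import random

def make_subject_orders(n_pairs=40, n_subjects=5, max_overlap=0.2, seed=1):
    """Generate one shuffled presentation order per subject, rejecting
    candidates that match any earlier order in too many positions.

    Hypothetical sketch: the study's program and its exact correlation
    criterion are not specified in this report.
    """
    rng = random.Random(seed)
    orders = []
    while len(orders) < n_subjects:
        candidate = list(range(n_pairs))
        rng.shuffle(candidate)
        # Fraction of positions where the candidate agrees with a prior order.
        overlap = max(
            (sum(a == b for a, b in zip(candidate, prev)) / n_pairs
             for prev in orders),
            default=0.0,
        )
        if overlap <= max_overlap:
            orders.append(candidate)
    return orders

orders = make_subject_orders()
```

Because two independent shuffles of 40 items agree in only about one position on average, almost every candidate passes, and the rejection step simply guards against the rare highly similar order.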
Subjects were shown pairs of graphical representations of sound sequences and asked to determine whether the two were the same. Each subject took the same test three times. Test scores were averaged across subjects for each subject's first test, second test, and third test, so that learning could be observed. Responses to the individual syllable pairs were also tracked in order to accomplish the second objective of the study.
Test Results and Discussion
The learning trend was not as visible in this study as we had hoped. The average score was 80% on the first test taken, 86% on the second, and 87% on the third. The increase from the first to the second test indicates that subjects require some time to become accustomed to the rapid changes that take place in the display. However, we had hoped to see a similar increase from the second to the third test. As noted in the previous study, test subjects may have become bored with the test. In that study, subjects took four tests, each with a different display; scores increased from the first to the second and from the second to the third test taken, but leveled off on the fourth. We concluded that this may have been because the subjects were not asked to interpret any meaningful information, just to compare shapes (Tait, 2000). The same may have been true in this study, but it is also possible that limits of human response time and perception were reached.
The pair-by-pair results were very positive. On average, subjects identified the “same” pairs as the same 90% of the time, while incorrectly identifying representations of different syllables as the same only 23% of the time. We were pleased with this result, since it indicates that different renditions of the same sound look much more similar in the visual representation of our display than do representations of different sounds.
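The two percentages above are a hit rate on truly same pairs and a false-alarm rate on truly different pairs. As a small illustration, with a hypothetical record format of (pair was truly same, subject judged same) booleans, the two rates could be computed like this:

```python
def same_judgment_rates(responses):
    """Return (hit_rate, false_alarm_rate) for 'same' judgments.

    `responses` is a list of (is_same_pair, judged_same) booleans --
    a hypothetical record format, for illustration only.
    """
    same = [judged for truth, judged in responses if truth]
    diff = [judged for truth, judged in responses if not truth]
    hit_rate = sum(same) / len(same) if same else 0.0
    false_alarm_rate = sum(diff) / len(diff) if diff else 0.0
    return hit_rate, false_alarm_rate

# Toy data: 9 of 10 same pairs judged same; 2 of 10 different pairs judged same.
toy = [(True, i < 9) for i in range(10)] + [(False, i < 2) for i in range(10)]
hit, fa = same_judgment_rates(toy)  # hit == 0.9, fa == 0.2
```

A large gap between the two rates, as found in the study, is what indicates that same-syllable representations look more alike than different-syllable ones.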
Pair 12, /da/ and /na/, gave us some difficulty in our prior testing. Analysis of the spectrograms of the rendition of /da/ used in that research revealed unusual voicing that made it quite similar to the /na/ syllable, so discrimination scores for this pair were very low (Tait, 2000). We hypothesized that different renditions of the syllable /da/ would produce better results, and our testing confirmed this. The average score for the “different” /da/-/na/ pairs was 77% in the present testing, which used different renditions of both /da/ and /na/. This is much better than the 65% correct score in the previous study, which averaged both “same” and “different” pair responses.
It is hoped that this research will someday lead to the development of visual aids for lip readers. The frequency spectra of different sounds are quite distinct, even for sounds that are visually ambiguous to a lip reader. A simplified display of these spectra, as tested in this study, shows promise as an effective aid to lip readers in interpreting speech.
- Hunter, E. J. “Geometrical display of speech spectra as an aid to lipreading.” M.S. Thesis, Brigham Young University, 1997.
- Tait, S. L., Jr. “A Comparative Study of Various Display Formats of Speech Frequency Spectra to Aid Lip Readers.” Honors Thesis, Brigham Young University, 2000.