AUTOMATIC SPEECH RECOGNITION
at CSLU



What is Automatic Speech Recognition?
Automatic Speech Recognition (ASR) is technology that allows a computer to identify the words that a person speaks into a microphone or telephone. The "holy grail" of ASR research is to allow a computer to recognize, in real time and with 100% accuracy, all words that are intelligibly spoken by any person, independent of vocabulary size, noise, speaker characteristics and accent, or channel conditions. Despite several decades of research in this area, accuracy greater than 90% is attained only when the task is constrained in some way. Depending on how the task is constrained, different levels of performance can be attained; for example, accuracy for recognition of continuous digits over a microphone channel (small vocabulary, no noise) can be greater than 99%. If the system is trained to learn an individual speaker's voice, then much larger vocabularies are possible, although accuracy drops to somewhere between 90% and 95% for commercially available systems. For large-vocabulary speech recognition of different speakers over different channels, accuracy is no greater than 87%, and processing can take hundreds of times real time.

The dominant technology used in ASR is called the Hidden Markov Model, or HMM. This technology recognizes speech by estimating the likelihood of each phoneme at contiguous, small regions (frames) of the speech signal. Each word in a vocabulary list is specified in terms of its component phonemes. A search procedure is used to determine the sequence of phonemes with the highest likelihood. This search is constrained to only look for phoneme sequences that correspond to words in the vocabulary list, and the phoneme sequence with the highest total likelihood is identified with the word that was spoken. In standard HMMs, the likelihoods are computed using a Gaussian Mixture Model; in the HMM/ANN framework, these values are computed using an artificial neural network (ANN). For more details about HMM technology, as used in the HMM/ANN framework, see our tutorial.
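
The constrained search described above can be illustrated with a small sketch. This is not the CSLU Toolkit's implementation; it is a minimal isolated-word example under simplifying assumptions: the per-frame phoneme log-likelihoods are already available (from a Gaussian Mixture Model or a neural network), each phoneme must cover at least one frame, and transitions between phonemes carry no extra weight. The names `align_score`, `recognize`, `frames`, and `lexicon`, and all of the numbers, are hypothetical.

```python
import math


def align_score(frame_loglik, phonemes):
    """Best total log-likelihood of aligning `phonemes`, in order, to all frames.

    frame_loglik: list of dicts, one per frame, mapping phoneme -> log-likelihood.
    Each phoneme must cover at least one frame. Transitions are unweighted,
    which is an assumption made to keep this sketch short.
    """
    n_states = len(phonemes)
    if len(frame_loglik) < n_states:
        return -math.inf  # too few frames to visit every phoneme
    # best[s]: best score of any path that is in phoneme s at the current frame
    best = [-math.inf] * n_states
    best[0] = frame_loglik[0][phonemes[0]]
    for frame in frame_loglik[1:]:
        new_best = [-math.inf] * n_states
        for s, ph in enumerate(phonemes):
            stay = best[s]                                   # remain in phoneme s
            advance = best[s - 1] if s > 0 else -math.inf    # enter s from s-1
            new_best[s] = max(stay, advance) + frame[ph]
        best = new_best
    return best[-1]  # a valid path must end in the final phoneme


def recognize(frame_loglik, lexicon):
    """Score every vocabulary word and return (best word, all scores)."""
    scores = {word: align_score(frame_loglik, phs) for word, phs in lexicon.items()}
    return max(scores, key=scores.get), scores


# Hypothetical vocabulary: each word is listed as its component phonemes.
lexicon = {"yes": ["y", "eh", "s"], "no": ["n", "ow"]}

# Hypothetical per-frame phoneme probabilities for a four-frame utterance.
frame_probs = [
    {"y": 0.70, "eh": 0.10, "s": 0.05, "n": 0.10, "ow": 0.05},
    {"y": 0.20, "eh": 0.60, "s": 0.05, "n": 0.05, "ow": 0.10},
    {"y": 0.05, "eh": 0.60, "s": 0.20, "n": 0.05, "ow": 0.10},
    {"y": 0.05, "eh": 0.10, "s": 0.70, "n": 0.05, "ow": 0.10},
]
frames = [{ph: math.log(p) for ph, p in f.items()} for f in frame_probs]

word, scores = recognize(frames, lexicon)
print(word)  # prints "yes"
```

Working in log-likelihoods turns the product of frame likelihoods into a sum, which avoids numerical underflow on long utterances. A full recognizer would also search over sequences of words and would weight the transitions, which this sketch leaves out.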


Tools
The CSLU Toolkit is used for research activities related to speech recognition at CSLU. This toolkit allows the development of speech recognizers based on the standard HMM framework as well as the HMM/ANN framework.

The Toolkit is written at two levels: the Tcl script level, and the C level. The Tcl level allows for easy implementation and modification of high-level procedures such as speech recognition, file selection, and recognizer evaluation. Computation-intensive procedures, such as wave I/O, neural network training and classification, and Viterbi search, are implemented at the C level. An intermediate level allows the C-level routines to be called from the Tcl-level scripts.

The CSLU Toolkit includes several tutorials for developing speech-recognition systems for custom or general-purpose vocabularies, as well as tutorials that explain how speech recognition is done using the HMM/ANN framework, how to read spectrograms, and how robust parsing can be done. Recognizers that are developed using the Toolkit can be quickly plugged in to the Toolkit's "Rapid Application Developer", and used in real time in a variety of interactive applications.

The Toolkit also includes the SpeechView program for displaying, editing, playing, recording, and annotating waveforms and waveform information. We have found this tool extremely useful in the debugging process, as frame-based outputs from recognition scripts can be displayed in SpeechView in synchrony with the waveform and spectrogram.


Research
Research on automatic speech recognition at CSLU is focused on attaining high accuracy under real-life conditions. We have worked on large-vocabulary recognition as well as real-time, robust recognition of small vocabularies.

CSLU also has a long history of research on the extraction and integration of specific features into the speech classifier, where the features are motivated by theories of human speech recognition. One result of research in this area is the EAR system, a competitive recognizer for digits and letters of the alphabet. We have also developed and integrated new techniques for voicing detection and burst detection, and we have investigated the use of phonetic transition information in the classification process.

We also stress the easy development of new recognizers for different languages, channel conditions, or vocabularies. Toward this end, we have developed a tutorial for developing HMM/ANN-based speech recognizers, and we have used the procedures in this tutorial to develop recognizers in English, Mexican Spanish, Italian, and Brazilian Portuguese. Others have used the tutorial to develop recognizers in Vietnamese, Slovenian, and Swedish. If you have used our tutorial to develop a recognizer in a different language, let us know.


Past Projects


Current and Ongoing Projects

People

The faculty in CSLU's speech recognition group are, in alphabetical order:

Peter Heeman, assistant professor.

John-Paul Hosom, assistant professor.

Yonghong Yan, associate professor.


Selected Publications
Here are some ASR-related publications from CSLU. For a more extensive listing, see our publications page.

 
J.P. Hosom, R.A. Cole, and P. Cosi. "Improvements in neural-network training and search techniques for continuous digit recognition." Australian Journal of Intelligent Information Processing Systems, vol. 5, no. 4, pp. 277-284, Summer 1998 (by invitation).

P. Cosi, J.P. Hosom, J. Schalkwyk, S. Sutton, and R. A. Cole. "Connected digit recognition experiments with the OGI Toolkit's neural network and HMM-based recognizers." In Proceedings, 4th IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA-ETWR98), pp. 135-140, Turin, Italy, September 1998.

J.P. Hosom, P. Cosi, and R.A. Cole. "Evaluation and integration of neural-network training techniques for continuous digit recognition." In Proceedings of the International Conference on Spoken Language Processing (ICSLP), volume 3, pp. 731-734, Sydney, Australia, November 1998.

Wei Wei and Sarel van Vuuren. "Improved neural network training of inter-word context units for connected digit recognition." In ICASSP'98, pp. 497-500, May 1998.

Yonghong Yan, Xintian Wu, Johan Schalkwyk, and Ron Cole. "Development of CSLU LVCSR: The 1997 DARPA HUB4 evaluation system." In DARPA Broadcast News Transcription and Understanding Workshop, 1998.

J.P. Hosom and R. A. Cole. "A diphone-based digit recognition system using neural networks." In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, pp. 3369-3372, Munich, Germany, April 1997.

Zhihong Hu, Johan Schalkwyk, Etienne Barnard, and Ronald Cole. "Speech recognition using syllable-like units." In Proceedings: International Conference on Spoken Language Processing (ICSLP), pp. 1117-1120, Philadelphia, USA, October 1996.

M. Fanty, E. Barnard, and R. A. Cole. "Alphabet recognition." Handbook of Neural Computation, 1995. (by invitation)

P. Schmid and E. Barnard. "Robust, N-best formant tracking." In Proceedings of the Fourth European Conference on Speech Communication and Technology, Madrid, Spain, September 1995.

M. Fanty, R. A. Cole, and K. Roginski. "English alphabet recognition with telephone speech." In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann, 1992.

E. Barnard, R. A. Cole, M. Vea, and F. Alleva. "Pitch detection with a neural net classifier." IEEE Transactions ASSP, vol. 39, pp. 298-307, 1991.

M. Fanty and R. A. Cole. "Spoken letter recognition." In Richard Lippmann, John Moody, and David Touretzky, editors, Advances in Neural Information Processing Systems 3, pages 220-226. Morgan Kaufmann, San Mateo, CA, 1991.

M. Fanty, R. A. Cole, and M. Slaney. "A comparison of DFT, PLP and Cochleagram for alphabet recognition." In Proc. 25th Asilomar Conference on Signals, Systems and Computers, pages 326-329, Pacific Grove, CA, November 1991.



Education
OGI offers the following classes related to automatic speech recognition:
CSE540:    Neural Network Algorithms and Architectures (Leen)
CSE544:    Introduction to Probability and Statistical Inference (Yang)
CSE547:    Statistical Pattern Recognition (Leen)
CSE548:    Modern Applied Statistics (Yang)
CSE550:    Spoken Language Systems (Heeman)
CSE551:    Structure of Spoken Language (Staff)
CSE552:    Hidden Markov Models for Speech Recognition (Hosom)
CSE58X:   Statistical Natural Language Processing (Heeman)
CSE561:    Dialogue (Cohen)
CSE562:    Natural Language Processing (Staff)
ECE 540:  Auditory and Visual Processing by Human and Machine (Hermansky, Pavel)
ECE 541:  Speech Processing (Hermansky)
ECE 545:  Speech Systems (Hermansky)
ECE 547:  Signals for Multimedia Engineering (Hermansky)

 
 
Links
Here is a small collection of speech-related links that we have found useful.
CSLU (Center for Spoken Language Understanding)
CSLR (Center for Speech and Language Research)
CMU's Speech Group
ICSI (International Computer Science Institute)
ISIP (Institute for Signal and Information Processing)
MIT Spoken Language Systems Group
PSL (the Perceptual Sciences Laboratory at UCSC)
Bell Labs: Statistical Models for Speech Recognition
Tony Robinson / Cambridge
Tlatoa
LDC (Linguistic Data Consortium)

 
 
Sponsors
Speech recognition research at CSLU has been sponsored by the CSLU Member Companies, as well as

The National Science Foundation (NSF)

Defense Advanced Research Projects Agency (DARPA)

Istituto di Scienze e Tecnologie della Cognizione, Sezione di Padova "Fonetica e Dialettologia", Consiglio Nazionale delle Ricerche (ISTC-SPFD, CNR, Italy)

Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq, National Council for Scientific and Technological Development of Brazil)


Last updated in November of 2003.
For questions or comments about this page, contact John-Paul Hosom.