Research in Speech Recognition & Understanding – B. JUANG – Georgia Tech

Speech Recognition and Understanding

We have pioneered work and accumulated experience and knowledge in the area of automatic speech recognition and understanding over the past two to three decades. The following figure illustrates the technical paths that we have helped the research community walk through. This ensemble of techniques and technologies represents the foundation of most, if not all, automatic speech recognition systems in use today.

Development of fundamental techniques for automatic speech recognition

We continue to conduct research to lead the field by extending the technology along the following directions:

Robust speech recognition

A major challenge to the deployment of an automatic speech recognition system is how to maintain satisfactory recognition accuracy under all operating conditions. It is well known that the current technology experiences serious performance degradation when it is operated in a mismatched condition (i.e., a condition for which the recognizer was not designed). We will focus on feature transformation and model adaptation techniques that respond rapidly to changes in the operating condition or mode in order to achieve robust performance.
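As a concrete illustration of what a feature transformation can do against one kind of mismatch, the sketch below applies cepstral mean normalization (CMN). CMN is a well-known technique chosen here only as an example; the text above does not say which transformations are being pursued. A fixed channel (such as a different microphone) shows up as a roughly constant additive offset in the cepstral domain, so subtracting the per-utterance mean removes it. The frame values and the offset are made-up toy numbers.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean taken over all frames of one utterance."""
    n_frames = len(frames)
    n_dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n_frames for d in range(n_dims)]
    return [[f[d] - means[d] for d in range(n_dims)] for f in frames]


# Toy cepstral frames from the "matched" condition (hypothetical values).
clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]

# The same utterance through a different channel: a constant offset per coefficient.
channel_offset = [0.5, -1.0]
mismatched = [[c + o for c, o in zip(frame, channel_offset)] for frame in clean]

clean_norm = cepstral_mean_normalize(clean)
mismatched_norm = cepstral_mean_normalize(mismatched)
```

After normalization, `clean_norm` and `mismatched_norm` agree, so a recognizer trained on normalized features would not see this particular channel change. Additive noise and time-varying conditions are not handled by this simple transform, which is part of why faster-adapting methods are of interest.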

Discrete sequence representation

One major factor that influences the performance of an automatic speech recognition system is the embedded knowledge of the language itself: its grammar, its word sequence structure, and the associated semantics. The grammar, traditionally called the language model under the current technology framework, is often expressed as a finite-state automaton for its computational advantages. Natural language, however, is not a finite-state machine. The mathematical representation of a discrete sequence with arbitrary inter-symbol relationships is a hard problem that has intrigued researchers for some time. We will explore new ideas in non-Markovian processes as candidates for language representation.
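The limitation of a finite-state (Markov) language model can be seen in a minimal sketch of a bigram model, where each word is a state and the next word depends only on the current one. The toy corpus and function names below are hypothetical and serve only to illustrate the point; they are not taken from any system described here.

```python
from collections import defaultdict


def train_bigram(sentences):
    """Count word-to-word transitions; each word acts as one state."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts


def sentence_prob(counts, sentence):
    """Product of maximum-likelihood bigram probabilities (no smoothing)."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        total = sum(counts[prev].values())
        if total == 0:
            return 0.0  # previous word never seen in training
        prob *= counts[prev].get(cur, 0) / total
    return prob


# Subject-verb agreement spans several words in both training sentences.
corpus = ["the dog near the cats barks", "the dogs near the cat bark"]
model = train_bigram(corpus)

ungrammatical = sentence_prob(model, "the dogs near the cats barks")
grammatical = sentence_prob(model, "the dogs near the cats bark")
```

Because the model only remembers the previous word, it gives the ungrammatical sentence a nonzero probability (the verb merely has to follow "cats") and gives the grammatical one zero. Raising the model order postpones the problem but does not remove it, since the dependency can be made arbitrarily long.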

Semantic and emotional state detection

In many applications, the goal of the automatic speech recognition and understanding system is to identify the intent or intended action of the talker, rather than the exact word sequence. For example, in CRM (Customer Relationship Management) systems, the use of an automatic speech recognition and understanding system ranges from routing a customer’s call to the right person to help resolve an issue, to recognizing the emotional state of a customer by detecting relevant keywords or prosodic information. We plan to investigate latent semantic indexing (LSI) for semantic decoding as a supplement to language modeling, as well as a means of emotional state detection, by creating a mathematical association between an emotional state and a set of relevant words as organized by the LSI scheme.
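One possible reading of the LSI idea can be sketched as follows, assuming NumPy is available. A term-document matrix is factored with a truncated SVD, each emotional state is represented by a seed word, and an utterance is assigned to the state whose seed lies closest in the latent space. The vocabulary, counts, state names, and seed words are all invented for illustration, and prosodic cues are ignored.

```python
import numpy as np

# Hypothetical toy data: rows are words, columns are four call transcripts.
VOCAB = ["refund", "angry", "terrible", "thanks", "great", "helpful"]
TERM_DOC = np.array([
    [2, 1, 0, 0],   # refund
    [1, 2, 0, 0],   # angry
    [1, 1, 0, 0],   # terrible
    [0, 0, 3, 1],   # thanks
    [0, 0, 1, 2],   # great
    [0, 0, 1, 1],   # helpful
], dtype=float)


def bag_of_words(text):
    """Count vocabulary words in a whitespace-tokenized utterance."""
    words = text.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)


def fit_lsi(term_doc, k):
    """Truncated SVD: keep the k strongest latent dimensions."""
    u, s, _ = np.linalg.svd(term_doc, full_matrices=False)
    return u[:, :k], s[:k]


def project(counts, u_k, s_k):
    """Fold a word-count vector into the latent space."""
    return counts @ u_k / s_k


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


U_K, S_K = fit_lsi(TERM_DOC, k=2)
STATE_SEEDS = {"upset": "angry", "satisfied": "great"}  # assumed labels


def detect_state(utterance):
    """Return the best-matching state and all similarity scores."""
    query = project(bag_of_words(utterance), U_K, S_K)
    scores = {
        state: cosine(query, project(bag_of_words(seed), U_K, S_K))
        for state, seed in STATE_SEEDS.items()
    }
    return max(scores, key=scores.get), scores


state, scores = detect_state("this is terrible")
```

The utterance contains "terrible" but not the seed word "angry", so a direct keyword match would miss it. In the latent space the two words fall on the same dimension because they co-occur in the same transcripts, and the utterance is assigned to the "upset" state.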

Natural dialog with referential semantics

The ability to invoke pre-existing references in semantic expressions is a major factor that contributes to the naturalness of human speech communication. Our conversations would become unwieldy and unnatural if we had to define every notion each time it arose in the exchange. We have demonstrated that incorporating deep referential semantics in the dialog management design helps substantially in creating a natural language interface for the task of personal calendar management. We will extend the use of referential semantics to the speech decoding process, as a way to reduce recognition errors through the implied semantic constraints, and to other tasks such as school course enrollment.

A natural language speech server for multi-channel multi-modal communications