Demystifying Speech Recognition – How it Works

A look under the hood of speech recognition engine technology…

After you have used computer-based speech recognition for a while, it is natural to wonder, “How does it work?” And when it does something unexpected, you will likely ask, “Where did that come from?” In a new white paper titled “Demystifying Speech Recognition,” I have identified four key steps that describe how speech recognition works in an ideal world. These steps also provide clues about what goes wrong in real-world situations and what you can do to improve accuracy and get the most out of the system.

Let’s take some of the mystery out of speech recognition by stepping through the process by which a computer turns your voice into text.

There are four steps in the conversion of voice to text:

  1. Audio to phonemes
  2. Phonemes to words
  3. Words to phrases
  4. Raw transcribed text to formatted text

The first step turns a continuous audio stream into the basic sounds of English, which are called phonemes.

US English has forty different sounds from which all native words are built (see US English Phonemes). I say “native” because English has adopted words from other languages, and those words may contain non-native sounds. For example, the “ch” in “chutzpah” is a phoneme well known to New Yorkers, but it is not native to English. Similarly, “ö” and “ü” are common phonemes in German but do not exist in US English. To identify the sequence of phonemes in continuous speech, the computer divides the incoming audio into short time-slices called “frames”, and each frame is then matched against the engine’s models of the phonemes.
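
To make the framing step concrete, here is a minimal Python sketch of how audio might be cut into overlapping time-slices. The 25-millisecond frame length and 10-millisecond step are typical values rather than anything specific to a particular engine, and the scoring of each frame against phoneme models is left out.

    import numpy as np

    def frame_audio(samples, sample_rate, frame_ms=25.0, step_ms=10.0):
        """Slice a mono audio signal into short, overlapping time-slices ("frames")."""
        frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
        step_len = int(sample_rate * step_ms / 1000)     # samples between frame starts
        n_frames = 1 + max(0, (len(samples) - frame_len) // step_len)
        return np.stack([samples[i * step_len : i * step_len + frame_len]
                         for i in range(n_frames)])

    # One second of 16 kHz audio yields 98 frames of 400 samples each;
    # a real engine would score every frame against its phoneme models.
    frames = frame_audio(np.zeros(16000), sample_rate=16000)
    print(frames.shape)   # (98, 400)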

The second step translates phonemes into words. The recognizer uses a lexicon, which contains all the words it knows about together with their pronunciations – each pronunciation is described as a sequence of phonemes. For example, the pronunciation of the word “cat” has three phonemes: ‘k’, ‘ae’, ‘t’ (see US English Phonemes). Some words have multiple pronunciations, and the lexicon has these too. Different words may also share the same pronunciation – a simple example is “there” and “their”. After the engine has identified its “candidate” word sequences, it still has to sort out which one is correct.
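
To illustrate, the sketch below shows a toy lexicon in Python. The words, phoneme symbols, and lookup function are purely illustrative (they are not the format any particular recognizer uses), but they show why homophones such as “there” and “their” leave the engine with more than one candidate.

    # Toy lexicon: each word maps to one or more pronunciations,
    # and each pronunciation is a sequence of phonemes.
    lexicon = {
        "cat":   [("k", "ae", "t")],
        "there": [("dh", "eh", "r")],
        "their": [("dh", "eh", "r")],
        "the":   [("dh", "ah"), ("dh", "iy")],   # a word with multiple pronunciations
    }

    def words_matching(phonemes):
        """Return every word whose pronunciation matches the given phoneme sequence."""
        return [word for word, prons in lexicon.items() if tuple(phonemes) in prons]

    # The same sounds map to two different words, so a later step must decide.
    print(words_matching(("dh", "eh", "r")))   # ['there', 'their']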

The third step identifies the best sequence of words by using a language model. A language model describes speech patterns in terms of which words are likely to be seen together. It helps the engine choose among the competing word sequences produced when phonemes were converted into possible words and phrases. For example, suppose the recognition so far has yielded two possible fragments – “over there” and “over their”. If the next word identified is “heads”, the language model would help the engine choose “over their heads” as opposed to “over there heads”.
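
As a rough sketch of this idea, the following Python fragment scores the two competing fragments with a toy bigram (word-pair) model. The probabilities are made-up numbers for illustration; a real language model is estimated from very large amounts of text.

    import math

    # Toy word-pair probabilities (illustrative numbers only).
    bigram_prob = {
        ("over", "their"): 0.004,
        ("over", "there"): 0.006,
        ("their", "heads"): 0.020,
        ("there", "heads"): 0.0001,
    }

    def sequence_score(words):
        """Sum of log-probabilities of consecutive word pairs (higher is better)."""
        return sum(math.log(bigram_prob.get(pair, 1e-8))
                   for pair in zip(words, words[1:]))

    candidates = [["over", "their", "heads"], ["over", "there", "heads"]]
    print(max(candidates, key=sequence_score))   # ['over', 'their', 'heads']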

This brings us to the fourth and final step in the recognition process: formatting, or normalization. This is where the clean-up happens – substitutions that make the text appear in the form that is most comfortable to read: punctuation; capital letters at the beginning of sentences; formatted dates, times, and monetary amounts; standard abbreviations; common acronyms; and so forth. For the most part, formatting is handled by simple substitutions, which work like ‘search-and-replace’ in a word processor. One of SayIt’s features is that you can add your own formatting rules and substitutions. For example, in customer care, some clients replace “customer called in” with “CCI”, and in health care, some users replace “alert and oriented times three” with “A&Ox3”.
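
A simple way to picture such rules is the Python sketch below. The rule table and function are hypothetical stand-ins (SayIt’s own configuration will differ), but they show how search-and-replace style substitutions turn raw transcribed text into the formatted result.

    import re

    # User-defined substitution rules, applied like search-and-replace.
    rules = [
        (r"\bcustomer called in\b", "CCI"),
        (r"\balert and oriented times three\b", "A&Ox3"),
    ]

    def apply_formatting(raw_text):
        """Apply each substitution rule, in order, to the raw transcribed text."""
        text = raw_text
        for pattern, replacement in rules:
            text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
        return text

    print(apply_formatting("the customer called in about a billing question"))
    # -> "the CCI about a billing question"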

Next time, I will discuss how these four steps can help troubleshoot the most common problems encountered with speech recognition. 
