Spectrogram Reading

Spectral Cues for the Broad Categories of Speech Sounds

Utterances in English and other languages can be analyzed into a sequence of abstract units called phonemes, like those in the table to the left, which includes a complete set of phonemes for American English. Subsets of these phonemes can be grouped into categories according to the type of phonation involved, that is, what the speaker is doing with the vocal organs to create the speech sound in question.

There are many ways in which speech sounds can be grouped. Our primary-school categories of vowel and consonant capture the most basic contrast in speech sounds: can a phoneme serve as a syllable nucleus or not? Another fundamental criterion for separating speech sounds into categories concerns the mode of phonation, in which case we have four categories:

  1. Sonorants - Speech sounds characterized mainly by voicing, the repetitive opening and closing of the vocal cords. They make up a majority of all speech sounds, 25 of the 41 phonemes in a minimal set of American English.
  2. Fricatives - Friction sounds, 9 of the 41 phonemes.
  3. Plosives - Explosion sounds (including affricates, which also contain a frication component), 8 of the 41 phonemes.
  4. Silence - Phrase markers, breathing, and even mini-silences which are phonemic, such as the closure before a plosive.

According to this classification scheme, as a linguist once pointed out, human language is a sequence of buzzes (voicing), hisses (frication), and pops (plosion), though of course we humans do not hear it that way.

For the purposes of spectrogram reading, I prefer to divide speech sounds into nine categories, set forth below. Each has a distinctive signature in the spectrogram, which is a representation of the phonation types and ultimately of the phonemes present in the utterance. We speak by using sound to produce complex three-dimensional patterns: energy distributed over the two-dimensional plane of time and frequency. The patterns are so complex, in fact, that it takes some time and practice to learn how to read them. But it is well worth the effort, because only in this way can one appreciate something of the complexity and beauty of the patterns which we code and decode so easily with our ears and the neuronal structures "attached" to them, including some of the highest cortical processing areas.
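To make the idea of energy laid out over time and frequency concrete, here is a minimal sketch of how a spectrogram can be computed. It is not the method used to produce the figures on this page. The 8000 Hz sample rate, the 64-sample frames, the Hann window, and the synthetic 1000 Hz test tone are all assumptions chosen for illustration, and the function name `spectrogram` is hypothetical.

```python
import cmath
import math

def spectrogram(samples, frame_len, hop):
    """Return a list of frames; each frame holds one magnitude per frequency bin."""
    # Hann window to taper the edges of each frame
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        chunk = [samples[start + n] * window[n] for n in range(frame_len)]
        mags = []
        # Naive discrete Fourier transform, bins 0 .. frame_len/2
        for k in range(frame_len // 2 + 1):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames

SAMPLE_RATE = 8000   # assumed sample rate in Hz
FRAME_LEN = 64       # assumed frame length; bin spacing is 8000 / 64 = 125 Hz

# Synthetic stand-in for speech: a steady 1000 Hz tone
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(512)]

spec = spectrogram(tone, frame_len=FRAME_LEN, hop=32)

# For each time frame, find the frequency bin with the most energy
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
peak_hz = [b * SAMPLE_RATE / FRAME_LEN for b in peak_bins]
```

Each row of `spec` is one moment in time and each column is one frequency band, so a steady tone shows up as a horizontal line, which is the same geometry the stable formants of a monophthong have in the figures below.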

Below are the nine categories, with a description of the spectral characteristics and example spectrograms for each category. The categories are listed in the same order as the phonemes in the panel to the left; to see spectrograms for each American English phoneme in the phoneme list, click on the appropriate symbol or word.

     

  1. Monophthong vowels - These are characterized by strong, stable voicing, as represented in Figure 1 below. Each formant in the vowel is visible as a grouping of three components: (1) a red band of increasing energy, (2) a maximum in green and yellow, and (3) decreasing energy in blue. The formants are stable in time. Geometrically this means that they are horizontal, showing no motion along the y-axis of frequency. In Figure 1, we are seeing F1 and F3, since F2 has been absorbed into F1.

    Figure 1 - Monophthong /A/ from the utterance "ah."


     

  2. Diphthong vowels - The diphthongs have strong, moving voicing, as represented in Figure 2 below. The formants are not horizontal throughout the life of the vowel, as they were in the monophthong vowels, but move from a beginning configuration to a target configuration.

    Figure 2 - Diphthong /aI/ from the utterance "eye" with /a/ passing into /I/.
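The contrast between horizontal and moving formants can be expressed as a simple measurement. The sketch below assumes that formant tracks have already been extracted as lists of frequencies over time. The track values are made-up illustrations rather than measurements from Figures 1 and 2, and the 200 Hz threshold and the function names are hypothetical.

```python
def excursion_hz(track_hz):
    """Total frequency range covered by one formant track, in Hz."""
    return max(track_hz) - min(track_hz)

def classify_vowel(formant_tracks_hz, threshold_hz=200.0):
    """Label a vowel by whether any formant moves more than threshold_hz.

    threshold_hz is an illustrative cutoff, not an established constant.
    """
    largest = max(excursion_hz(track) for track in formant_tracks_hz)
    return "diphthong" if largest > threshold_hz else "monophthong"

# Made-up tracks (Hz, one value per time step): nearly horizontal formants
steady = [[700, 710, 705, 700, 695],
          [1100, 1090, 1105, 1100, 1095]]

# Made-up tracks: formants gliding from a start configuration to a target
gliding = [[750, 700, 600, 500, 400],
           [1200, 1400, 1700, 1900, 2100]]

steady_label = classify_vowel(steady)
gliding_label = classify_vowel(gliding)
```

In practice the hard part is extracting reliable formant tracks in the first place; once they exist, stability is just a question of how far each track strays from horizontal.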


     

  3. Approximants - The liquids (/9r/ and /l/) and glides (/j/ and /w/) have formants which are less pronounced than those of vowels, because of a slight obstruction placed somewhere along the vocal tract. This obstruction creates a unique signature for each approximant, as represented by Figure 3 below.

    Figure 3 - Two liquids and a vowel from the utterance "real", with /9r/ symbolized by an F3 lower than 2000 Hz at the beginning, the stable vowel /i:/ in the center, and /l/ symbolized by the wide jaw-like opening of a gap between F2 and F3 at the end of the spectrogram.


     

  4. Nasals - The nasals have much less energy than any of the previous phonation categories, because the oral tract is completely blocked and sound waves radiate principally from the nose. There is also a characteristic nasal "zero", a region of extremely low energy.

    Figure 4 - Two nasals and a vowel from the utterance "mean", with /m/ symbolized by the rapid rise of F2 from 900 Hz at the beginning, the stable vowel /i:/ in the center, and /n/ symbolized by a less dramatic fall toward 1800 Hz at the end.


     

  5. Fricatives - The fricatives do not necessarily involve any voicing, although the voiced fricatives may have a very low voice bar, as in the /v/ in Figure 5 below. The signature of fricatives is in their high-frequency regions, where the energy is distributed more randomly than in voicing.

    Figure 5 - Two fricatives and a vowel from the utterance "save", with /s/ symbolized by the opening frication rectangle, the diphthong vowel /ei/ in the center, and /v/ symbolized by a drop in voicing and the high-energy plume of frication at the end.
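One simple way to see the difference between noisy frication and periodic voicing without a spectrogram is the zero-crossing rate, a swapped-in time-domain measure rather than anything used on this page. The sketch below assumes an 8000 Hz sample rate and uses synthetic stand-ins: a 120 Hz sine for voicing and seeded Gaussian noise for frication. The function name is hypothetical.

```python
import math
import random

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a < 0) != (b < 0))
    return crossings / (len(samples) - 1)

SAMPLE_RATE = 8000  # assumed sample rate in Hz
N = 800             # 100 ms of signal

# Stand-in for voicing: a low-frequency periodic buzz at 120 Hz
voiced = [math.sin(2 * math.pi * 120 * n / SAMPLE_RATE) for n in range(N)]

# Stand-in for frication: random noise (seeded so the run is repeatable)
rng = random.Random(0)
frication = [rng.gauss(0.0, 1.0) for _ in range(N)]

zcr_voiced = zero_crossing_rate(voiced)
zcr_frication = zero_crossing_rate(frication)
```

The noise changes sign far more often than the low-pitched periodic signal, which mirrors the way frication fills the upper part of the spectrogram while the voice bar sits at the bottom.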


     

  6. Plosives - The plosives involve an explosive burst of acoustic energy following a short period of silence. Because the vocal tract is completely blocked during the silence, these phonemes are also called stops. The signature of plosives is an almost instantaneous passage from little or no acoustic energy to a short burst of high energy in a wide frequency band. The plosives, like the fricatives, may be accompanied by voicing.

    Figure 6 - Two plosives and a vowel from the utterance "tide", with the burst of /t/ followed by aspiration, the diphthong vowel /aI/ in the center, and voicing continuing through the closure of /d/, whose release is at the right.
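The plosive signature, a jump from near silence to a burst of energy, can be sketched as a search over frame energies. Everything below is an assumption for illustration: the 8000 Hz sample rate, the 5 ms frames, the two thresholds, the function names, and the synthetic signal (50 ms of silence, a 10 ms noise burst, then a vowel-like tone).

```python
import math
import random

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [sum(x * x for x in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def find_burst_onsets(energies, silence_max, burst_min):
    """Indices of frames that are loud immediately after a silent frame."""
    return [i for i in range(1, len(energies))
            if energies[i - 1] <= silence_max and energies[i] >= burst_min]

SAMPLE_RATE = 8000  # assumed sample rate in Hz
FRAME_LEN = 40      # 5 ms frames at 8000 Hz

rng = random.Random(1)
closure = [0.0] * 400                                  # 50 ms of silence
burst = [rng.uniform(-1.0, 1.0) for _ in range(80)]    # 10 ms wide-band burst
vowel = [0.5 * math.sin(2 * math.pi * 120 * n / SAMPLE_RATE)
         for n in range(400)]                          # vowel-like tone
signal = closure + burst + vowel

energies = frame_energies(signal, FRAME_LEN)
# Thresholds are illustrative; real recordings are never perfectly silent
onsets = find_burst_onsets(energies, silence_max=0.001, burst_min=0.1)
onset_ms = [i * FRAME_LEN * 1000 / SAMPLE_RATE for i in onsets]
```

A voiced plosive such as the /d/ in Figure 6 keeps a low voice bar through its closure, so a detector like this would need band-limited energies rather than total energy to handle it.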


     

  7. Flaps - Flaps are abbreviated forms of the alveolar plosives /t/ and /d/ and the alveolar nasal /n/. In a normal alveolar plosive closure, the vocal tract is blocked for some 50 ms, but in the flap, produced by one rapid tap of the tongue against the alveolar ridge, the duration is very short, on the order of 10-20 ms. The flap is very common in American English.

    Figure 7 - The word "rider" with the initial /9r/, the diphthong /aI/, the central flap /d_(/, and the final r-flavored reduced vowel /&r/.
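Since the difference between a flap and a full plosive closure is mainly one of duration, it can be sketched as a single comparison. The 30 ms cutoff below is a hypothetical midpoint between the 10-20 ms and 50 ms figures given above, not an established boundary, and the function name is made up for illustration.

```python
def classify_closure(duration_ms, cutoff_ms=30.0):
    """Label an alveolar closure by its duration.

    cutoff_ms is an illustrative value lying between the typical flap
    duration (10-20 ms) and the typical plosive closure (about 50 ms).
    """
    return "flap" if duration_ms < cutoff_ms else "plosive closure"

short_label = classify_closure(15.0)   # within the typical flap range
long_label = classify_closure(50.0)    # typical full closure
```

Real durations vary with speaking rate, so a fixed cutoff is only a rough first guess.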


     

  8. Affricates - The affricates /tS/ and /dZ/, as their Worldbet symbols show, are compounds of a plosive and a fricative. The plosive is much reduced from the full /t/ or /d/, usually showing as one or more thin bars to the left of the large rectangle of frication.

    Figure 8 - Two examples of the same affricate and a vowel from the utterance "church". The affricates are /tS/, while the central vowel is /3r/.


     

  9. Syllabics - When liquids or nasals occur in an unstressed syllable, the vowel is often merged into the liquid or nasal, which becomes syllabic in that it bears the weight of the syllable. The spectral appearance of the syllabics is midway between that of a vowel and that of a liquid or nasal.

    Figure 9 - The word "button" with the initial weak plosive /b/, the back vowel /^/, the flap /th_(/, and the final syllabic nasal /n_=/.



Maintained by Tim Carmell
carmell@cse.ogi.edu