Messenger - Vol. 2, No. 3, Page 13
Summer 1993
On Research
Transitional sound improves synthetic voice

     World-renowned physicist Stephen William Hawking, like many people
with amyotrophic lateral sclerosis, a progressive disease of the nervous
system, uses a special device to communicate in a synthetic voice.
     In the past, unfortunately, people like Hawking have been forced to
choose between a pleasant, pre-recorded voice with a very limited
vocabulary and a jarring, computerized voice that could say anything.
     To address this problem, researchers at the Center for Applied Science
and Engineering in Rehabilitation, a joint program of the University of
Delaware and the A.I. du Pont Institute, developed a way to create
natural-sounding voices that don't limit what the speaker can say. The new
technology works so well, in fact, that it was recently licensed to ACS
Technologies of Pittsburgh, Pa., and Echo Speech Products of Carpinteria,
Calif., which plan to market it to the public.
     Thanks to the efforts of center researchers and the University's
patent and research office, improved synthetic voices will soon be
available for men, women and children who speak either English or Spanish,
reports center director Richard Foulds, who also is a research professor in
the Department of Computer and Information Sciences.
     How does the new speech synthesis technology work?
     Traditionally, Foulds explains, "natural-sounding" synthetic voices
have been based on audio recordings of words that can be strung together in
a limited number of sentences. Telephone companies often use such
recordings to let a caller know, for instance, when a number has been
changed.
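     As a rough illustration of that recorded-phrase approach, consider the
following Python sketch. It assumes a hypothetical table of stored audio
clips; the names are invented for illustration, not any company's actual
software.

    # Playback from a fixed table of prerecorded phrases: pleasant to
    # hear, but nothing outside the table can ever be said.
    RECORDED_PHRASES = {
        "number_changed": b"<audio samples>",  # "That number has changed."
        "please_hold": b"<audio samples>",
    }

    def play(phrase_id):
        clip = RECORDED_PHRASES.get(phrase_id)
        if clip is None:
            raise KeyError("no recording for this message")
        return clip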
     More intelligent voice systems use synthetic versions of the sounds of
the 44 phonemes in the English language. A phoneme is a class of closely
related speech sounds that are represented linguistically by the same
symbol.
     By typing each phoneme of a word, such as the sounds k-a-t for "cat,"
a speaker can say anything.
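     In code, the idea might look like the following Python sketch: one
stored sound per phoneme, joined end to end. The phoneme symbols and the
synthesize function are assumptions made for illustration, not the center's
software.

    # One stored clip per phoneme; joining clips can say any word a
    # speaker spells out phonetically, but the joins are abrupt.
    PHONEME_CLIPS = {
        "k": b"<samples for k>",
        "ae": b"<samples for the vowel in cat>",
        "t": b"<samples for t>",
    }

    def synthesize(phonemes):
        return b"".join(PHONEME_CLIPS[p] for p in phonemes)

    audio = synthesize(["k", "ae", "t"])  # "cat"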
     Synthetic speech based on phonemes is versatile, Foulds points out,
but it's so hard on the ears that it can create emotional barriers to
one-on-one communication. This is because human speech actually flows from
one sound to the next. Between two phonemes like "ee" and "oo," there is a
transitional sound, and the recorded unit that captures it is known to
researchers as a diphone.
     "Saying that each of the 44 phonemes is completely separate from each
other is like saying that dance is a bunch of static positions," Foulds
explains. "Obviously, what makes dance beautiful is the flowing of one
movement into the next. Speech is the same way."
     Using diphones instead of phonemes, center researchers produced
natural-sounding voices without compromising vocabulary. Diphones were
first extracted from audio recordings, then "digitized," or converted into
data that a computer can store and replay.
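     Continuing the earlier sketch, the difference might look like this:
the stored units are now recorded transitions between adjacent phonemes, so
each join lands inside a steady sound rather than between two of them. The
pair table and the silence marker are illustrative assumptions.

    # One stored clip per phoneme pair (a diphone), with silence ("_")
    # padding the edges of the word; joins fall mid-phoneme, where the
    # sound is stable, so the result flows like natural speech.
    DIPHONE_CLIPS = {
        ("_", "k"): b"<silence into k>",
        ("k", "ae"): b"<k flowing into ae>",
        ("ae", "t"): b"<ae flowing into t>",
        ("t", "_"): b"<t into silence>",
    }

    def synthesize_diphones(phonemes):
        padded = ["_"] + phonemes + ["_"]
        return b"".join(DIPHONE_CLIPS[pair]
                        for pair in zip(padded, padded[1:]))

    audio = synthesize_diphones(["k", "ae", "t"])  # a smoother "cat"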
     In the future, Foulds says, researchers would like to generate a more
diverse menu of voices, as well as individualized voices for those with
amyotrophic lateral sclerosis and others who slowly lose their speech. The
center is also developing a method to let non-speakers control the tone and
inflection of a synthetic voice, expressing anger, perhaps, by typing in red
letters or underlining certain passages.
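     One way such markup might be translated into voice settings is
sketched below, again in Python and under invented names; the article does
not describe the center's actual mapping.

    # Hypothetical mapping from typed markup to prosody controls
    # (relative pitch, speaking rate and volume) applied at synthesis.
    PROSODY = {
        "red": {"pitch": 1.2, "rate": 1.1, "volume": 1.3},  # anger
        "underline": {"pitch": 1.1, "rate": 0.9, "volume": 1.1},  # emphasis
        "plain": {"pitch": 1.0, "rate": 1.0, "volume": 1.0},
    }

    def settings_for(markup):
        return PROSODY.get(markup, PROSODY["plain"])

    settings_for("red")  # an angry passage typed in red letters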
     Better synthetic speech is just one of the successful technologies
produced at the center, which is housed at A.I. du Pont Institute in
Wilmington, Del. Also under way are computer programs that speed the rate
of synthetic speech by 30 to 40 percent, camera systems that someday could
help a non-speaker type by simply gazing at a keyboard, "intuitive" robots,
and programs to transmit sign language pictures over standard telephone
lines.
     With private, state and federal support, the center is removing
barriers that prevent individuals with disabilities from achieving their
full potential.
                                   -Ginger Pinholster