Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!ames!ptsfa!ihnp4!vax135!cjp
From: cjp@vax135.UUCP
Newsgroups: comp.sys.amiga
Subject: Re: Phonemes: why not just digitize them?
Message-ID: <1761@vax135.UUCP>
Date: Wed, 11-Feb-87 20:06:06 EST
Article-I.D.: vax135.1761
Posted: Wed Feb 11 20:06:06 1987
Date-Received: Fri, 13-Feb-87 03:20:33 EST
References: <663@goanna.oz> <5832@ukmj.ukma.ms.uky.csnet> <4111@utcsri.UUCP>
Reply-To: cjp@vax135.UUCP (Charles Poirier)
Organization: AT&T Bell Labs, Holmdel, NJ
Lines: 50
Keywords: voice synthesis intonation expressive
Summary: Expressor: I'd like more narrator flexibility


While this subject is current, I'd like to post a few ramblings.  First
let me say that I value the accomplishments of the current Narrator,
and to some extent appreciate the difficulty of the enhancements I
propose here.  I would, however, like to see something better.

In playing with Narrator through "say" in phoneme mode (i.e. not
invoking Translator), I have found it difficult to achieve a
satisfactorily natural-sounding voice.  Impossible, really.  One
problem is that there is not enough control available.  The phoneme
syntax gives you only one parameter, called "stress", that you can
adjust (ignoring parameters such as "pitch" or "male", which affect the
whole utterance).

I feel there is a need for control of, say, three factors: volume,
duration, and pitch.  These things need to be controllable at the
resolution *at least* of phonemes.  I argue that even finer control is
necessary for a fully expressive, natural-sounding voice.  It is not just
Thai that needs the ability to change pitch during a vowel (or other
voiced) sound.  You need to be able to slide the pitch and volume of
the phoneme between two limits from its start to its end.  Perhaps you
even need to say what the "shape" of that slide is, chosen from a few
options such as exponential, negative exponential, or linear.  Try listening
critically to the pitch and speed of someone talking -- using a tape
recording (or digitized voice) helps -- and notice how much of a
person's attitude and intent are communicated through intonation.
I'm sure you've already noticed how little of it comes through in
Narrator's voice.
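
To make the idea of a slide "shape" concrete, here is a minimal sketch
in Python.  Everything in it is hypothetical: the function name slide,
the shape names, and the steepness constant k are my own assumptions
for illustration, not part of Narrator or of any existing interface.
It only shows how one parameter (pitch or volume) might move between
two limits over the course of a single phoneme.

```python
import math

def slide(start, end, t, shape="linear", k=4.0):
    """Return a parameter value at fraction t (0.0 to 1.0) through a phoneme.

    start, end: the two limits (e.g. pitch in Hz at phoneme start and end).
    shape: hypothetical shape name, one of "linear", "exponential",
           "negative_exponential".
    k: assumed steepness constant for the two curved shapes.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be between 0 and 1")
    if shape == "linear":
        w = t
    elif shape == "exponential":
        # Slow at first, then most of the change happens near the end.
        w = (math.exp(k * t) - 1.0) / (math.exp(k) - 1.0)
    elif shape == "negative_exponential":
        # Most of the change happens early, then it levels off.
        w = (1.0 - math.exp(-k * t)) / (1.0 - math.exp(-k))
    else:
        raise ValueError("unknown shape: " + shape)
    return start + (end - start) * w

# Example: a vowel whose pitch rises from 100 Hz to 200 Hz,
# sampled at five points for each shape.
if __name__ == "__main__":
    for shape in ("linear", "exponential", "negative_exponential"):
        points = [round(slide(100.0, 200.0, i / 4, shape), 1) for i in range(5)]
        print(shape, points)
```

All three shapes start and end at the same limits; they differ only in
where in the phoneme the change is concentrated, which is the kind of
distinction I suspect the ear picks up on.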

Let me call this hypothetical voice generator the Expressor.  Now
clearly, this type of voice generator is not meant to be driven by an
automatic text translator.  There is generally not enough information
in text for even humans to derive accurate intentions and attitudes,
let alone the problem of generating parameters which re-evoke those
attitudes.  But I think there would be many good and impressive uses
for "canned" strings of phonemes, generated manually.  I estimate that
even a fully parameterized, inflected, modulated, and warbled word,
expressed as a string of phonemes in Expressor syntax, would require a
tiny fraction of the storage of a digitized sound sample saying more or
less the same thing.  One could store maybe hours of expressive,
intelligible talk on a single disk, instead of the less than a minute
(I forget the exact time) of sampled sound.  If done properly, if the
parameters are given enough range and resolution (*much* more than 1 to
9), one could even take a good shot at synthesized singing.
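
The storage estimate above can be checked with some back-of-envelope
arithmetic, sketched here in Python.  Every number is an assumption of
mine and not a measurement: an 880 KB floppy, 8-bit samples at 22050
per second, about 12 phonemes per second of speech, and a generous 16
bytes per phoneme for a hypothetical Expressor encoding.  Change the
constants to taste.

```python
# All constants are assumptions for a rough estimate, not measured values.
DISK_BYTES = 880 * 1024        # assumed capacity of one floppy
SAMPLE_RATE_HZ = 22050         # assumed sampling rate for digitized speech
BYTES_PER_SAMPLE = 1           # assumed 8-bit samples
PHONEMES_PER_SECOND = 12       # assumed speaking rate
BYTES_PER_PHONEME = 16         # assumed size of one parameterized phoneme

def seconds_of_sampled_sound(disk_bytes):
    """Seconds of raw digitized sound that fit in disk_bytes."""
    return disk_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)

def seconds_of_phoneme_speech(disk_bytes):
    """Seconds of speech that fit when stored as parameterized phonemes."""
    return disk_bytes / (PHONEMES_PER_SECOND * BYTES_PER_PHONEME)

if __name__ == "__main__":
    sampled = seconds_of_sampled_sound(DISK_BYTES)
    phonemes = seconds_of_phoneme_speech(DISK_BYTES)
    print("sampled sound: %.1f seconds" % sampled)
    print("phoneme strings: %.1f minutes" % (phonemes / 60.0))
    print("ratio: about %.0f to 1" % (phonemes / sampled))
```

Under these assumptions the sampled sound comes to well under a minute
and the phoneme strings to more than an hour, a ratio on the order of a
hundred to one.  The exact figures depend entirely on the constants
chosen.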

Well, enough of me talking through my hat.  I certainly don't know how
hard it would be to implement.  It would be neat though.  Comments,
especially informed comments, are requested.

Charles Poirier   USENET vax135!cjp