Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watnot!watmath!clyde!rutgers!ames!ptsfa!ihnp4!vax135!cjp
From: cjp@vax135.UUCP
Newsgroups: comp.sys.amiga
Subject: Re: Phonemes: why not just digitize them?
Message-ID: <1761@vax135.UUCP>
Date: Wed, 11-Feb-87 20:06:06 EST
Article-I.D.: vax135.1761
Posted: Wed Feb 11 20:06:06 1987
Date-Received: Fri, 13-Feb-87 03:20:33 EST
References: <663@goanna.oz> <5832@ukmj.ukma.ms.uky.csnet> <4111@utcsri.UUCP>
Reply-To: cjp@vax135.UUCP (Charles Poirier)
Organization: AT&T Bell Labs, Holmdel, NJ
Lines: 50
Keywords: voice synthesis intonation expressive
Summary: Expressor: I'd like more narrator flexibility

While this subject is current, I'd like to post a few ramblings. First let me say that I value the accomplishments of the current Narrator, and to some extent appreciate the difficulty of the enhancements I propose here. I would, however, like to see something better.

In playing with Narrator through "say" in phoneme mode (i.e. not invoking Translator), I have found it difficult to achieve a satisfactorily natural-sounding voice. Impossible, really. One problem is that there is not enough control available. The phoneme syntax gives you only one parameter, called "stress", that you can adjust (ignoring the parameters like "pitch" or "male" which affect the whole utterance). I feel there is a need for control of, say, three factors: volume, duration, and pitch. These things need to be controllable at the resolution *at least* of phonemes.

I argue that even finer control is necessary for fully expressive, natural sounding voice. It is not just Thai that needs the ability to change pitch during a vowel (or other voiced) sound. You need to be able to slide the pitch and volume of the phoneme between two limits from its start to its end. Perhaps you even need to say what the "shape" of that slide is, chosen from a few such as exponential, negative exponential, linear.
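
To make the slide idea a little more concrete, here is a rough sketch in Python of what "slide between two limits with a chosen shape" might mean. None of this is anything Narrator offers: the function name `slide`, the shape names, and the steepness constant `k` are all made up for illustration, and the pitch figures are arbitrary.

```python
import math


def slide(start, end, t, shape="linear", k=3.0):
    """Return a parameter value at fraction t (0.0 to 1.0) through a phoneme.

    Hypothetical helper: start and end are the two limits (pitch, volume,
    whatever), shape picks the curve, and k is an assumed steepness constant.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie between 0 and 1")
    if shape == "linear":
        w = t
    elif shape == "exponential":
        # Slow change at first, fast change toward the end.
        w = (math.exp(k * t) - 1.0) / (math.exp(k) - 1.0)
    elif shape == "negative_exponential":
        # Fast change at first, levelling off toward the end.
        w = (1.0 - math.exp(-k * t)) / (1.0 - math.exp(-k))
    else:
        raise ValueError("unknown shape: " + shape)
    return start + (end - start) * w


if __name__ == "__main__":
    # Sample a rising pitch contour at five points across one phoneme.
    for shape in ("linear", "exponential", "negative_exponential"):
        contour = [round(slide(120.0, 180.0, i / 4, shape), 1) for i in range(5)]
        print(shape, contour)
```

All three shapes start and end at the same limits; they differ only in where within the phoneme most of the change happens, which is the part I suspect matters to the ear.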

Try listening critically to the pitch and speed of someone talking -- using a tape recording (or digitized voice) helps -- and notice how much of a person's attitude and intent are communicated through intonation. I'm sure you've already noticed how little of it comes through in Narrator's voice.

Let me call this hypothetical voice generator the Expressor. Now clearly, this type of voice generator is not meant to be driven by an automatic text translator. There is generally not enough information in text for even humans to derive accurate intentions and attitudes, let alone the problem of generating parameters which re-evoke those attitudes. But I think there would be many good and impressive uses for "canned" strings of phonemes, generated manually.

I estimate that even a fully parameterized, inflected, modulated, and warbled word, expressed as a string of phonemes in Expressor syntax, would require a tiny fraction of the storage of a digitized sound sample saying more or less the same thing. One could store maybe hours of expressive, intelligible talk on a single disk instead of the (I forget the exact time) less than a minute of sampled sound. If done properly, if the parameters are given enough range and resolution (*much* more than 1 to 9), one could even take a good shot at synthesized singing.

Well, enough of me talking through my hat. I certainly don't know how hard it would be to implement. It would be neat though. Comments, especially informed comments, are requested.

	Charles Poirier   USENET vax135!cjp