Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!usc!elroy.jpl.nasa.gov!lll-winken!ncis.tis.llnl.gov!lance.tis.llnl.gov!turner
From: turner@lance.tis.llnl.gov (Michael Turner)
Newsgroups: comp.dsp
Subject: Re: Compression Techniques for Speech
Message-ID: <1192@ncis.tis.llnl.gov>
Date: 13 Dec 90 23:15:58 GMT
References: <77352@sgi.sgi.com> <wilf.660603845@rigel.sce.carleton.ca> <9508@pitt.UUCP> <367@rufus.UUCP>
Sender: news@ncis.tis.llnl.gov
Organization: University of California, Berkeley
Lines: 59

In article <367@rufus.UUCP> drake@drake.almaden.ibm.com writes:
>In article <9508@pitt.UUCP> dcollins@pittslug.sug.org.UUCP (Daniel Collins) writes:
>>I am working on a project where I have to compress a speech signal
>>by a 20-to-1 ratio in real-time.  If 20-to-1 is not practical, what are
>>the limiting factors, and what is achievable?  I will be using an 8-bit
>>ADC with an 8 kHz conversion rate.
>
>So from 64 Kbits/second you want to do 20:1 compression, down to 3200 bits
>per second?  Pretty aggressive.  There's a company with a product that
>hooks to a standard serial port that claims to be able to do speech at
>1100 bits per second; the product is from Digispeech, called the DS201.  
>The compression algorithm seems to be proprietary.
>
>Sam Drake / IBM Almaden Research Center 
>Internet:  drake@ibm.com            BITNET:  DRAKE at ALMADEN
>Usenet:    ...!uunet!ibmarc!drake   Phone:   (408) 927-1861

And that's with all the time in the world to compress.  The best REAL-TIME
compression I've heard about that preserves the signal significantly is some
CELP (code-excited linear prediction, large codebook) technique (see recent
IEEE AS&SP issues) that gets you down to 9600 bits per second.  However, you
need a significant fraction of a Cray to run at that rate, according to the
author.
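
For scale, the arithmetic behind the figures in this thread can be checked
in a few lines.  This is only a sketch in Python; the function names are
made up for illustration, and the rates are the ones quoted above (8-bit
samples at 8 kHz, the 3200 and 1100 bits-per-second figures, and the 9600
figure for CELP).

```python
def raw_bit_rate(bits_per_sample, samples_per_second):
    """Uncompressed bit rate of a PCM stream, in bits per second."""
    return bits_per_sample * samples_per_second


def compression_ratio(raw_bps, coded_bps):
    """How many times smaller the coded stream is than the raw one."""
    return raw_bps / coded_bps


raw = raw_bit_rate(8, 8000)   # 8-bit ADC at 8 kHz, as in the original question
ratios = {bps: compression_ratio(raw, bps) for bps in (9600, 3200, 1100)}

if __name__ == "__main__":
    print("raw rate:", raw, "bits/second")
    for bps, ratio in ratios.items():
        print(f"{bps:5d} bits/second is about {ratio:.1f}:1")
```

On these numbers the 9600 figure is under 7:1, a long way short of the
20:1 being asked for.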

Of course, there's the wonderful Apple Macintosh compression technique that
runs in sublinear time: just throw out samples.  But I assume you want to
be able to understand the speech when it's played back.
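
To see why, here is a minimal sketch in Python of what throwing out samples
at 20:1 would do to 8 kHz input.  The helper names (decimate, expand) and
the 1 kHz test tone are assumptions chosen for illustration; this is not
any particular product's algorithm.

```python
import math


def decimate(samples, factor):
    """Keep every `factor`-th sample and discard the rest (no filtering)."""
    return samples[::factor]


def expand(kept, factor):
    """Crude playback: repeat each kept sample `factor` times."""
    out = []
    for s in kept:
        out.extend([s] * factor)
    return out


SAMPLE_RATE = 8000   # Hz, from the original question
FACTOR = 20          # the 20:1 target

# Hypothetical test signal: one second of a 1 kHz tone, inside the speech band.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

kept = decimate(tone, FACTOR)
new_rate = SAMPLE_RATE / FACTOR      # effective sample rate after decimation
nyquist = new_rate / 2               # highest frequency that rate can represent
```

At 20:1 the effective sample rate is 400 Hz, so nothing above 200 Hz can be
represented.  With this particular tone every kept sample lands on a zero
crossing, and the signal disappears altogether.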

I suggest you revise your constraints, either the compression ratio or
the real-time response, or both.  I'm no expert (yet) at speech compression,
but I think you're out of luck.

On the subject, however: I'm always on the look-out for NON-real-time
compression algorithms (similar sampling rates, accuracy and compression
ratio to the above problem).  I know about Moser, etc.  I'm most interested
in techniques that exploit knowledge of perceptual limitations in hearing
and production limitations in speech to figure out what parts of the
raw signal can be thrown out.  Assume that the speech has already been
"recognized" down to something like the phoneme level, and that this
information can be used in the compression algorithm.  Assume also a
single non-singing speaker with little background noise.  I'm interested
in good extraction and reproduction of nasal antiresonances, subglottal
coupling, pitch-pulse shape, etc.  For higher (16KHz) rates, getting
believable sibilance is high on my list as well.*  A parametric
representation that allows control of variation of stress factors
(duration, pitch, amplitude) is important.
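
As an illustration of that last requirement only, a parametric
representation might look something like the sketch below, in Python.
Everything in it is an assumption: the Segment record, its field names, and
apply_stress are invented for the example, the values are made up, and the
spectral detail (antiresonances, pulse shape, sibilance) is left out
entirely.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Segment:
    """One phoneme-sized unit.  All field names are hypothetical."""
    phoneme: str         # label from the assumed recognition step
    duration_ms: float
    pitch_hz: float      # 0.0 for unvoiced segments
    amplitude: float     # linear gain, 1.0 = nominal


def apply_stress(seg, duration=1.0, pitch=1.0, amplitude=1.0):
    """Return a copy of `seg` with the three stress factors scaled."""
    return replace(
        seg,
        duration_ms=seg.duration_ms * duration,
        pitch_hz=seg.pitch_hz * pitch,
        amplitude=seg.amplitude * amplitude,
    )


# Made-up values for a single vowel; not measured data.
vowel = Segment(phoneme="u", duration_ms=120.0, pitch_hz=110.0, amplitude=1.0)
stressed = apply_stress(vowel, duration=1.5, pitch=1.2, amplitude=2.0)
```

The point of keeping the stress factors as separate fields is that each can
be varied on playback without touching the others or the phoneme label.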

I see that the relevant techniques are out there, but I'm having trouble
finding them all and putting them together.  It doesn't help that I'm
a nearly-total neophyte with DSP, which seems to be the dialect of Greek that
most of the relevant literature is written in.  On the other hand, I do know
something about phonetics and phonology, which is Swahili to a lot of
DSP folk.
---
Michael Turner
turner@tis.llnl.gov

* Just today I was talking on some bandwidth-limited line to a friend
  who couldn't understand that I was talking about our mutual friend
  BRUCE, not somebody I'd never met named RUTH.