Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!julius.cs.uiuc.edu!apple!agate!pasteur!galileo.berkeley.edu!jbuck
From: jbuck@galileo.berkeley.edu (Joe Buck)
Newsgroups: comp.dsp
Subject: Re: Compression Techniques for Speech
Message-ID: <9746@pasteur.Berkeley.EDU>
Date: 14 Dec 90 00:55:47 GMT
References: <77352@sgi.sgi.com> <wilf.660603845@rigel.sce.carleton.ca> <9508@pitt.UUCP> <367@rufus.UUCP> <1192@ncis.tis.llnl.gov>
Sender: news@pasteur.Berkeley.EDU
Reply-To: jbuck@galileo.berkeley.edu (Joe Buck)
Lines: 60

In article <1192@ncis.tis.llnl.gov>, turner@lance.tis.llnl.gov (Michael Turner) writes:
> And that's with all the time in the world to compress.  The best REAL-TIME
> compression I've heard about that preserves the signal significantly is some
> CELP (code-excited linear prediction, large codebook) technique (see recent
> IEEE AS&SP issues) that gets you down to 9600 baud.  However, you need a
> significant fraction of a Cray to run at that rate, according to the author.

There are several tricks to make CELP substantially faster without much loss
in quality.  Most involve imposing some structure on the codebook of vectors
so you can find the best match without computing distortions for every vector
(there are some connections with the theory of error-correcting codes here;
I don't understand them all).  Look in the proceedings of recent ICASSP
conferences for details.
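
To make "imposing some structure on the codebook" concrete, here is a rough
sketch of one generic kind of structure, a binary tree over the codebook.
It is my own illustration, not a scheme taken from any particular ICASSP
paper.  All names (build_tree, tree_search, full_search) are made up for the
example, the codebook is random, and the match is done on raw vectors rather
than through a synthesis filter as a real CELP coder would do.

```python
import random

def sq_dist(a, b):
    """Squared-error distortion between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def build_tree(indices, codebook):
    """Recursively halve a set of codebook indices into a binary tree.

    Assumption for illustration: the split is a crude median cut along
    the dimension with the largest spread, not a trained clustering.
    """
    node = {"centroid": mean([codebook[i] for i in indices]),
            "indices": indices}
    if len(indices) > 1:
        def spread(d):
            values = [codebook[i][d] for i in indices]
            return max(values) - min(values)
        d = max(range(len(codebook[0])), key=spread)
        ordered = sorted(indices, key=lambda i: codebook[i][d])
        half = len(ordered) // 2
        node["children"] = (build_tree(ordered[:half], codebook),
                            build_tree(ordered[half:], codebook))
    return node

def tree_search(target, node, codebook):
    """Descend the tree, comparing against two child centroids per level."""
    evals = 0
    while "children" in node:
        left, right = node["children"]
        d_left = sq_dist(target, left["centroid"])
        d_right = sq_dist(target, right["centroid"])
        evals += 2
        node = left if d_left <= d_right else right
    return node["indices"][0], evals

def full_search(target, codebook):
    """Exhaustive search: one distortion computation per codebook vector."""
    best = min(range(len(codebook)), key=lambda i: sq_dist(target, codebook[i]))
    return best, len(codebook)

if __name__ == "__main__":
    random.seed(0)
    codebook = [[random.gauss(0, 1) for _ in range(8)] for _ in range(1024)]
    tree = build_tree(list(range(len(codebook))), codebook)
    target = [random.gauss(0, 1) for _ in range(8)]
    i_tree, n_tree = tree_search(target, tree, codebook)
    i_full, n_full = full_search(target, codebook)
    print("tree search:", n_tree, "distortions, error", sq_dist(target, codebook[i_tree]))
    print("full search:", n_full, "distortions, error", sq_dist(target, codebook[i_full]))
```

For a codebook whose size is a power of two, the tree search computes two
distortions per level instead of one per vector.  It is not guaranteed to
find the true best match, which is the "without much loss in quality" part
of the tradeoff.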

> On the subject, however: I'm always on the look-out for NON-real-time
> compression algorithms (similar sampling rates, accuracy and compression
> ratio to the above problem).  I know about Moser, etc.  I'm most interested
> in techniques that exploit knowledge of perceptual limitations in hearing
> and production limitations in speech to figure out what parts of the
> raw signal can be thrown out.  Assume that the speech has already been
> "recognized" down to something like the phoneme level, and that this
> information can be used in the compression algorithm.  Assume also a
> single non-singing speaker with little background noise.  I'm interested
> in good extraction and reproduction of nasal antiresonances, subglottal
> coupling, pitch-pulse shape, etc.

Unfortunately, compression techniques that match up that specifically to
assumptions about the human speech production system tend to make large,
bad-sounding errors when things don't match the model.  That's why there's
a general movement away from, for instance, models that depend strongly
on a voicing decision (like LPC).  You get big croaks when it screws up.
Still, for speech recognition purposes, LPC parameters contain valuable
information.

>  For higher (16KHz) rates, getting believable sibilance is high on my list
> as well.*  

George Kang of Naval Research Laboratory, who had a lot to do with the
design of the government LPC-10 algorithm, did a good deal of work on
this.  He argues that the old-fashioned carbon microphone's nonlinearities
are actually beneficial in the telephone system, because they map the
high frequencies of sibilants (especially for female speakers) down into
the passband of the phone system where they can be heard.  He did some
research on various types of nonlinear distortions to apply to speech
sampled at 16 KHz before downsampling to 8 KHz, so that female speakers
would sound better when processed by LPC (female speakers generally
sound a good deal worse in LPC because of their smaller pitch periods
and higher-frequency sibilants).  He gave a very amusing talk at
an ICASSP about six years ago, using the test sentence

"Her purse was full of useless trash"

as a source of sibilants. :-)
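
A small sketch of why a nonlinearity can move sibilant energy down into the
passband.  This is only an illustration of the general effect: the square-law
term and its 0.5 weight are my assumptions, not the distortions Kang actually
studied, and the two tones at 5 and 6 KHz are a crude stand-in for a real
sibilant.  The lowpass filtering and downsampling to 8 KHz are left out.

```python
import math

FS = 16000   # sampling rate in Hz, as in the discussion above
N = 1600     # 0.1 s of signal; DFT bins are 10 Hz apart, so all tones are on-bin

def tone_power(signal, freq_hz, fs=FS):
    """Power of the component at freq_hz, measured with a single-bin DFT."""
    n = len(signal)
    re = sum(s * math.cos(2 * math.pi * freq_hz * k / fs)
             for k, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq_hz * k / fs)
             for k, s in enumerate(signal))
    return (re * re + im * im) / (n * n)

# Stand-in for a sibilant: energy only above 4 KHz (tones at 5 and 6 KHz).
x = [math.sin(2 * math.pi * 5000 * k / FS) + math.sin(2 * math.pi * 6000 * k / FS)
     for k in range(N)]

# Hypothetical memoryless nonlinearity: linear term plus a square-law term.
# The square of two tones contains their difference frequency, here 1 KHz.
y = [s + 0.5 * s * s for s in x]

if __name__ == "__main__":
    print("power at 1 KHz before nonlinearity:", tone_power(x, 1000))
    print("power at 1 KHz after nonlinearity: ", tone_power(y, 1000))
```

Before the nonlinearity there is essentially nothing at 1 KHz, so an
anti-aliasing filter ahead of the 8 KHz downsampler would remove the whole
signal.  After it, the difference tone at 1 KHz sits well inside the passband
and survives.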

It appears that the main problem with sibilants is the anti-aliasing
filter; there just isn't much energy in sibilants below 3.2 KHz.

--
Joe Buck
jbuck@galileo.berkeley.edu	 {uunet,ucbvax}!galileo.berkeley.edu!jbuck