Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!swrinde!zaphod.mps.ohio-state.edu!julius.cs.uiuc.edu!apple!agate!pasteur!galileo.berkeley.edu!jbuck
From: jbuck@galileo.berkeley.edu (Joe Buck)
Newsgroups: comp.dsp
Subject: Re: Compression Techniques for Speech
Message-ID: <9746@pasteur.Berkeley.EDU>
Date: 14 Dec 90 00:55:47 GMT
References: <77352@sgi.sgi.com> <9508@pitt.UUCP> <367@rufus.UUCP> <1192@ncis.tis.llnl.gov>
Sender: news@pasteur.Berkeley.EDU
Reply-To: jbuck@galileo.berkeley.edu (Joe Buck)
Lines: 60

In article <1192@ncis.tis.llnl.gov>, turner@lance.tis.llnl.gov (Michael Turner) writes:
> And that's with all the time in the world to compress. The best REAL-TIME
> compression I've heard about that preserves the signal significantly is some
> CELP (code-excited linear prediction, large codebook) technique (see recent
> IEEE AS&SP issues) that gets you down to 9600 baud. However, you need a
> significant fraction of a Cray to run at that rate, according to the author.

There are several tricks to make CELP substantially faster without much
loss in quality. Most involve imposing some structure on the codebook
of vectors, so that you can find the best match without computing the
distortion for every vector (there are some connections with the theory
of error-correcting codes here; I don't understand them all). Look in
the proceedings of recent ICASSP conferences for details.

> On the subject, however: I'm always on the look-out for NON-real-time
> compression algorithms (similar sampling rates, accuracy and compression
> ratio to the above problem). I know about Moser, etc. I'm most interested
> in techniques that exploit knowledge of perceptual limitations in hearing
> and production limitations in speech to figure out what parts of the
> raw signal can be thrown out. Assume that the speech has already been
> "recognized" down to something like the phoneme level, and that this
> information can be used in the compression algorithm.
> Assume also a single non-singing speaker with little background noise.
> I'm interested in good extraction and reproduction of nasal
> antiresonances, subglottal coupling, pitch-pulse shape, etc.

Unfortunately, compression techniques that are tied that specifically
to assumptions about the human speech production system tend to make
large, bad-sounding errors when the input doesn't match the model.
That's why there's a general movement away from, for instance, models
that depend strongly on a voicing decision (like LPC). You get big
croaks when the voicing decision goes wrong. Still, for speech
recognition purposes, LPC parameters contain valuable information.

> For higher (16KHz) rates, getting believable sibilance is high on my list
> as well.*

George Kang of the Naval Research Laboratory, who had a lot to do with
the design of the government LPC-10 algorithm, did a good deal of work
on this. He argues that the old-fashioned carbon microphone's
nonlinearities are actually beneficial in the telephone system, because
they map the high frequencies of sibilants (especially for female
speakers) down into the passband of the phone system, where they can be
heard. He did some research on various types of nonlinear distortion
to apply to speech sampled at 16 kHz before downsampling to 8 kHz, so
that female speakers would sound better when processed by LPC (female
speakers generally sound a good deal worse in LPC because of their
shorter pitch periods and higher-frequency sibilants). He gave a very
amusing talk at an ICASSP about six years ago, using the test sentence
"Her purse was full of useless trash" as a source of sibilants. :-)

It appears that the main problem with sibilants is the anti-aliasing
filter; there just isn't much energy in sibilants below 3.2 kHz.

-- 
Joe Buck
jbuck@galileo.berkeley.edu	 {uunet,ucbvax}!galileo.berkeley.edu!jbuck
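
A rough illustration of the point above about applying a nonlinear
distortion at 16 kHz before downsampling. This is only a sketch under
assumptions of mine, not Kang's actual method: the "sibilant" is faked
with two tones at 5 kHz and 6 kHz, and the nonlinearity is a plain
square-law term (the post does not say which distortions were studied).
The names tone_power, FS and N are made up for the example. The
square-law term creates a difference tone at 1 kHz, which lies inside
the telephone passband.

```python
import math

FS = 16000   # sampling rate in Hz, as in the post
N = 1600     # 0.1 s of signal, so every tone used here sits on an exact DFT bin

def tone_power(x, freq, fs=FS):
    """Power of x at a single frequency, by correlating with cos and sin."""
    re = sum(v * math.cos(2 * math.pi * freq * n / fs) for n, v in enumerate(x))
    im = sum(v * math.sin(2 * math.pi * freq * n / fs) for n, v in enumerate(x))
    return (re * re + im * im) / (len(x) ** 2)

# Stand-in for a sibilant (assumption): energy only at 5 kHz and 6 kHz,
# both above the 3.2 kHz edge mentioned in the post.
x = [math.cos(2 * math.pi * 5000 * n / FS) + math.cos(2 * math.pi * 6000 * n / FS)
     for n in range(N)]

# Assumed nonlinearity: input plus a square-law term. The product of the
# two tones contains a component at 6000 - 5000 = 1000 Hz.
y = [v + 0.5 * v * v for v in x]

before = tone_power(x, 1000)   # essentially zero: nothing at 1 kHz in the input
after = tone_power(y, 1000)    # about 0.0625: a 1 kHz tone of amplitude 0.5

print(f"power at 1 kHz before: {before:.3e}")
print(f"power at 1 kHz after:  {after:.3e}")
```

After the anti-aliasing filter and decimation to 8 kHz, the original
5 kHz and 6 kHz tones are gone, but the 1 kHz difference tone survives,
so some trace of the high-frequency energy stays audible. A real
sibilant is noise-like rather than two tones, and a usable scheme would
have to choose the nonlinearity carefully to avoid distorting voiced
speech.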