Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!cmcl2!kramden.acf.nyu.edu!brnstnd
From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
Newsgroups: comp.compression
Subject: Re: 1) decompression speeds, 2) rounding.
Message-ID: <9872.Jun2720.37.4091@kramden.acf.nyu.edu>
Date: 27 Jun 91 20:37:40 GMT
References: <894@spam.ua.oz>
Organization: IR
Lines: 114

The ``average speed'' taken by a compressor over different types of
input files is meaningless. Do you know anybody who compresses exactly
as many bytes of ``book1'' per day as of ``obj1'', and exactly as many
of ``news'' and ``trans'' and ``progp''? I don't. (Worse than that, it's
a harmonic average, so even if someone *does* compress exactly as much
of each type of file, the ``average speed'' will not equal his real
speed. ``Average speed'' means if you spend the same amount of *time* on
each type of text, how much do you get compressed overall. Ugh.)

Most people deal with some combination of ``book1'' and ``progc''
(archives) and ``news''. Their average will be very different from the
overall average on any given text corpus. The *only* meaningful
statements you can make are about the boundary cases: ``The worst
observed speed from this compressor is 30K/sec, the best observed speed
is 60K/sec,'' and then how it does on each file type.

In fact, one excellent way to display benchmarks with varying parameters
(the parameter in this case being the input file type) is the following:
(1) Select the benchmarks to be displayed: in this case, compression
effectiveness and compression time, on a two-dimensional chart. (2) For
each value of the parameter, plot a labelled point at the result. (3)
Draw the convex hull of the points; i.e., connect the outer dots. Repeat
for each compression method under consideration.

Now what do the statistics mean on this chart? Each point reflects a
result for one method on one file type. The top of the convex hull for
one method is the maximum speed; the bottom is the minimum speed; the
left is the maximum effectiveness; the right is the minimum
effectiveness. The ``spread'' of the convex hull, i.e., its area, shows
how sensitive the method is to the file type.

No matter what kind of files someone compresses, his results will always
be a convex combination of some of those outer points. Plotting the
``average'' speed or ``average'' effectiveness means that you're taking
the center of mass of the points---as if everyone compressed just as
much of every kind of file---and ignoring the interesting regions on the
edge of the convex hull.

End of sermon A. Begin sermon B.

Even without meaningless averages there's way too much information for
one person to take in all at once. Another dose of reality: People don't
really want to get some abstract numeric measure of ``how good'' a
compressor is on all types of text. They know beforehand that they're
only interested in news, say, or progc.

SO SHOW THE RESULTS FOR *ALL* COMPRESSORS ON *ONE* FILE IN *ONE* PLACE!

Someone who's deciding what to do with his newsfeed wants to see a
single row or column showing how *every* compressor does on a sample of
news. Witness this excerpt from my Y coding paper:

     __bib__book1__book2____geo_makelog___news___obj1___obj2_paper1_paper2_
Z12  54094 394419 325147  78026   34757 230765  16528 160099  29433  40881
Z14  46817 357387 281354  77696   25647 202594  14048 138521  25077  37196
Z16__46528_332056_250759__77777___25647_182121__14048_128659__25077__36161_
MW___41040_336793_230862__80296____8299_168652__13273_109266__22749__34234_
AP2  47056 389702 297205  84582   20925 219665  13824 134547  26937  39415
AP6  40770 338046 261270  79471   14762 190502  13824 123323  22413  34637
AP3__40311_322178_228978__80106____8821_167896__13825_113296__22414__33320_
Y2   46882 363339 287110  80817   28159 212617  13858 141783  26131  38037
Y6   40874 320622 256578  76275   21220 185097  13858 125900  22452  33671
Y3___40456_306813_229851__76695___14411_168287__13859_114323__22453__32733_

See how easy it is to choose a news compressor? You can see immediately
that MW (squeeze, 300000-string table), Y3 (yabba -m300000), and AP3
(whap -m300000) have approximately the same effectiveness, while Z16
(compress -b16) is only slightly better than Y6 (yabba -m65533). This
table could be extended to any length for any number of compressors and
it would still be usable. You could even add compression and
decompression speeds, rankings between the compressors, etc. What's
important is that all the information for a single file type be
collected in one place.

Now does anyone really see a need for average compressor speeds or
average performance?

In article <894@spam.ua.oz> ross@spam.ua.oz.au (Ross Williams) writes:
> When quoting speeds of decompression, is everyone agreed that the
> speed should be given relative to the decompressed data?

Yes. Otherwise the number cannot be compared directly to the compression
speed. Also, speed is a measure independent of the actual compression
method; speed per compressed byte is not.

> For each file, it records the compression
> performance (e.g. 3.48 bits per byte) in a floating point variable.

Why do you persist in that ugly bits-per-bytes measure? People care
about how much space they save. Take out the factor of 8 already.

  [ do you average before rounding or after? ]

Never do anything with rounded numbers unless you absolutely have to.

> I did some analysis and worked out that the "deep" average (the
> rounding of the average of the unrounded values) and the "shallow"
> average (the rounding of the average of the rounded values) can differ
> by up to one rounded digit unit value (Example: try 1.5 and 2.5
> rounding to 1dp).

1.5 and 2.5 both round to 2, so the average is the same either way.

> Choosing the deep average means that if anyone attacks my statistics
> with a calculator at a later date, they might conclude that I made a
> mistake.

Why do you think charts published in newspapers always have a disclaimer
like ``Percentages may not add up to 100% because of rounding''? Anyway,
if you do the sensible thing and include the original data (how many
bytes of output? how much time did it take to your machine's clock
precision?), nobody can possibly attack your results.

---Dan