Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!cmcl2!kramden.acf.nyu.edu!brnstnd
From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
Newsgroups: comp.compression
Subject: Re: Proposed standard: Replies to criticisms.
Message-ID: <12900.Jun2722.22.2191@kramden.acf.nyu.edu>
Date: 27 Jun 91 22:22:21 GMT
References: <899@spam.ua.oz>
Organization: IR
Lines: 320

I don't mean to be cruel, Ross, but I suggest that we all ignore your
``standard'' until you have squashed two or three separate compressors
(LZW, AP, and LZRW1, say) into the same mold and published the results.
This is how all standards should work: *Do something first*, and publish
it. Then tell us about the important parts of the interface, cutting out
irrelevant details that might take up a lot of code and spending time on
important restrictions that don't involve any single section of code.
Say ``the routine must do blah'' if you think blah is necessary for the
user. Say ``the user must not do foo'' if you don't think anyone will
ever want foo in routines written to the standard. Say ``the routine may
or may not do bar'' if you did it both ways. Voila, a standard.

None of this goes anywhere unless you first publish working code.

In article <899@spam.ua.oz> ross@spam.ua.oz.au (Ross Williams) writes:
  [ memory ]

The obvious solution is for the compression routine to provide its own
memory. This doesn't fit well with I/O-driven compression, though, or
with non-parametrized standards.

  [ pointers ]

Face it, Ross: There are a whole lot of compressors which cannot be
written efficiently in Pascal or even Ada. ``An interface that even
fools can use will be used only by fools''; well, by making your
standard wimpier and wimpier so that the routines can be run on wimpier
and wimpier machines in wimpier and wimpier languages, you pretty much
guarantee that it will only be used that way. If you're going to write
the standard before the implementation, you might as well take a lesson
from POSIX: Use pointers where they obviously belong, don't make design
decisions based on how difficult you think the interface will be to use
in Ada, and postpone other language bindings to the next decade.
 
> Perhaps the algorithm could advertise how much heap and how much stack
> it will use.

An excellent suggestion. (Naturally, my yabbawhap package includes a
``checkconf'' program which advertises this information, given the
compile parameters.)

> |Ofcourse there this standard code has to be ported to a number of platforms
> |but since almost every archive programm is ported anyway this is not much
> |of a problem.
> |Identification codes have to be obtained from some central point where all
> |the codes are registered.
> No need. Just get everyone to generate random n-bit IDs. Make n large
> enough so that a collision is extremely unlikely.

Great. Now IDs are almost totally useless. What are you trying to
accomplish with those numbers?

  [ parameters ]
> The reason why I am so heavy on parameters I think stems from seeing
> dozens of papers in which they are not specified properly.

Writing a crippled compression standard is a poor way to express your
dissatisfaction with how papers are written.

> |> If a user wants to create another  tuning, he can just fiddle with the
> |> source code  for the algorithm until  he is happy and  then fiddle the
> |> identity record to create a new  algorithm.
> |Uh-uh. This contradicts your stated goal of having the identification
> |numbers mean something for comparison purposes.
> Absolutely not, because if the user changed anything he would have to
> (according to standard) generate a new algorithm name and ID.

Sorry I didn't make this clear. I publish yabba with a default of
-m65533. I give it some ID. Joe Shmoe changes the default to -m300000
and gives it another ID. Sally Shmoe changes the default to -m300000 and
gives it yet another ID.

Now if Joe and Sally choose their IDs randomly, how do we tell that
they're using the same method? Or, for that matter, that they've just
changed a parameter in my method?

> |If you want to achieve that goal, just insist that everyone name all the
> |parameters (in my case, -m/-z/-Z) when quoting compression results. (It
> |gets even hairier when you're measuring speed and memory: for yabba,
> |you'd have to quote MOD and PTRS as well as the machine type.) Don't
> |force parameterized coders into a parameterless model.
> Well, maybe. But the algorithm designer has to commit himself on
> parameters to some extent. SAKDC has 25 parameters and I can dial up a
> multitude of adaptivities and memories and different order models.

I fail to see how this matters. The algorithm is parametrized. Period.

> |> (Another  argument is  that many  algorithms can  be implemented  more
> |> efficiently if their parameters are statically specified).
> |Hardly.
> Absolutely. You should read the other critics. I have you telling me
> that dynamic parameters are not much less efficient than static ones.
> Meanwhile, there's the mob who don't like passing a pointer to
> read/write memory for the algorithm to use because they want to avoid
> indirect addressing in their code and pull tricks with the address
> range.

Right. Remember, we're talking about *can be implemented*, not can be
implemented under your awful set of restrictions.

Someone might have a compression method where he compares n to a
parameter, x, in the innermost loop. Does that mean that the algorithm
can be implemented more efficiently if their parameters are statically
specified? No, because every use of n can be ``flipped'' into x - n, and
this can be compared against 0.

Sure, there will exist parameters (like the hash function) where you
just can't do something like this. You still don't have to make those
parameters fixed. You can just tell the compiler to optimize out a
separate version for those parameter values:

   int my_special_hash(h,c) int h; int c; { ... }

   static int generic_routine(input,hash)
   char *input;
   int (*hash)();
   {
    ... all the compression code, parameterized on ``hash'', goes here ...
   }

   static int optimized_routine(input)
   char *input;
   {
   #pragma begin-inline-fun generic_routine
    return generic_routine(input,my_special_hash);
   #pragma end-inline-fun generic_routine
   }

   int user_visible_routine(input,hash)
   char *input;
   int (*hash)();
   {
    if (hash == my_special_hash)
      return optimized_routine(input);
    return generic_routine(input,hash);
   }

Of course, this is oversimplified, and with current C compilers you have
to stick real_routine inside one big, ugly (but portable) macro instead
of using pragmas. But it works, it avoids code duplication, and it gives
you all the benefits of fixing the parameter to begin with. In fact, if
the compiler inlines something like

   user_visible_routine(myinput,my_special_hash)

in the user code, it will end up *exactly* as efficient as if
my_special_hash had been used everywhere in the first place. The only
disadvantage is the extra space needed for generic_routine, and that can
be avoided if you stick the routines inside a library to begin with.

I believe the MIPS compilers can do all of the above, including inlining
from ucode libraries. They don't have the #pragma but that's just
syntactic sugar. Even under good old pcc, all you lose for the
parametrization is a comparison and an extra function call on each call
to the routine.

> It is a fact that statically binding parameters allows lots of
> optimizations.

Indeed... I don't complain that something can't be done. I just do it.
Are you so sure now that taking parameters out of the interface allows
better optimization?

>    * They tend to confuse users who don't know how to set the
>      parameters.

As long as every parameter has a default, this is moot.

>    * They must be algorithm specific.

So what?

>    * They can overflow and be out of range and all the other things.

True. Another advantage of a program-based interface is that it can take
the time to check such things.

> The idea of having default settings and then allowing the user to
> specify otherwise at his own risk dooesn't sound so bad so long as:
>    * The settings do not affect memory consumption.

There you go again.

How about you try to fit compress into this model. If you manage to
write one version which doesn't take too much memory for -b12, but
allows -b16, but doesn't have a memory-affecting parameter, and doesn't
allocate memory itself anyway, I'll be very impressed.

  [ compressor driving I/O versus I/O driving compressor ]
> This may be so, but placing the compressor in charge means that you
> have to tell it how to do IO, which is so messy in most programming
> languages (requiring Ada like generic superroutines or C pointers to
> IO functions) as to as to completely destroy the standard.

See my general comments near the top. You have that choice: either use a
language which can compete with machine language, or make your standard
so wimpy that nobody will want to use it.

> I'm not saying that there won't be a little pain in converting some
> algorithms over to the standard (all good standards cause lots of pain
> :-), but I am sure that there will be very little loss of efficiency
> and great benefits.

So do it for compress. I want to see something that works just like
compress but uses your standard for an internal interface.

> |> I am wary of a stream interface because it is then impossible to place
> |> a bound on the amount of output one could get from feeding in a single
> |> byte of  input (during either  compression or decompression).
> |Again it sounds like you're focusing on modems. Why?
> I'm NOT focussing on modems - you're the one who keeps mentioning them!!!
> My logic goes something like this:
>    - We NEED to make the algorithm a procedure (not a process).
>    - We NEED to use memory blocks to feed bytes in and out.
>      (although the blocks can be part of a stream interface).
> (I hope you agree that the alternatives to these are horrific)

No, I don't agree with either one. The argument here is about the
second. You only need a bound on output because you have a block
interface, and unless you wanted to handle modems you could use a stream
interface instead. There's nothing horrific about ignoring modems and
sticking to a stream interface.

> |One advantage of this for C is that in those cases where you want to
> |hard code the parameters you can #define them in the source of the
> |compression function without changing the function's formal
> |parameters.  (It seemed that Dan doesn't believe there can be an
> |advantage in hard coding the parameters, which I find peculiar.  There
> |are many optimizations that can be done with known values over
> |variables.)
> If the user has the source code, he can do anything anyway. The real
> case for parameters lies in the case of someone who only has an
> executable.

See above. It doesn't matter whether you have the entire source or just
a library. Hard-coding parameters barely speeds anything up.

> Agreed. If the standard does allow parameters, they must go through
> the front door. What do you think of a string interface? E.g:
> compress(comp,mem,inblk,outblk,"M=4,K=91,Z=512");

Not too bad.

> I really don't think there would be any problem with implementing UNIX
> compress if the flush flag were restored. As I understand it, the LZW
> algorithm's context is very simple consisting of just a table, an
> index and possibly one or two coding scalars.

Well, don't just understand it. Do it. Then tell us what you did.

> As a separate point, I don't see how a program-based standard could be
> specified at this level without assuming that languages have processes
> or superroutines or procedures as parameters.

Hmmm? What do you mean by that?

> |> I'd also like to repeat someone else's suggestion of seperating the
> |> compression procedure and decompression procedure.
> |Ditto.
> But why?

Because if the procedures are in a library, separating them means you
can load only one into a program which needs to do only one thing. AP
coding (as in whap) makes a good example: the compressor is rather
complex and takes quite a bit of code space, but the decompressor is
trivial. A lot of programs just want to read compressed files...

> Imposed standards can and do work all the time (witness Ada, CDs).

Huh? Ada was indeed standardized before it was implemented, and it is a
royal flop. Very few people use it. The CD ``standard'' is just a
description of what people are doing; it works because it was
implemented first.

> Most standards have
> problems, but at least they give you something to kick off.

No. Most *implementations* have problems---most importantly,
incompleteness---but at least they do something positive. Standards are
negative. They restrict you, and their main failure is being overly
complete.

> I don't see how I can "IMPLEMENT" my standard. I get the impression
> that you think that in order for me to do that, I would have to
> implement a work harness, a validation harness, and create conformant
> implementations of dozens of algorithms.

I don't know about Graham. I think you should just prove that the
established compressors can be recoded into your interface without too
much work.

> (As it happens I have been using a standard similar to the one
> proposed (see lzrw_headers.h in my ftp archive on
> sirius.itd.adelaide.edu in pub/compression) for two years and have
> implemented five or six algorithms using it (LZRW1 is such an
> algorithm). These algorithms were implemented with the standard in
> mind though.)

Right. Which is why you should choose something like LZW, which does not
put a bound on output (well, 8x or so in theory, but not a fixed number
of bytes), and would have a much smaller user base if it were not
parametrized, and is naturally a stream algorithm.

  [ separating compressor and decompressor ]
> Is it one of the ten commandments or something?

Yes.

> I have been using "standard", "standard proposal", etc in an
> undisciplined manner. I do not expect it to be adopted instantly. I
> just want to develop something that everyone can agree on.

I don't care whether you call it a ``standard'', a ``proposed
standard'', a ``document'', a ``suggestion'', or an ``article''. I am
not going to write code to someone else's interface unless I can see
what its strengths and weaknesses are *in practice*. (Naturally, if you
develop a large enough body of code that takes advantage of the
interface somehow, that's a strong point.)

---Dan