Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!qt.cs.utexas.edu!yale.edu!cmcl2!kramden.acf.nyu.edu!brnstnd From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein) Newsgroups: comp.compression Subject: Re: Proposed standard: Replies to criticisms. Message-ID: <12900.Jun2722.22.2191@kramden.acf.nyu.edu> Date: 27 Jun 91 22:22:21 GMT References: <899@spam.ua.oz> Organization: IR Lines: 320 I don't mean to be cruel, Ross, but I suggest that we all ignore your ``standard'' until you have squashed two or three separate compressors (LZW, AP, and LZRW1, say) into the same mold and published the results. This is how all standards should work: *Do something first*, and publish it. Then tell us about the important parts of the interface, cutting out irrelevant details that might take up a lot of code and spending time on important restrictions that don't involve any single section of code. Say ``the routine must do blah'' if you think blah is necessary for the user. Say ``the user must not do foo'' if you don't think anyone will ever want foo in routines written to the standard. Say ``the routine may or may not do bar'' if you did it both ways. Voila, a standard. None of this goes anywhere unless you first publish working code. In article <899@spam.ua.oz> ross@spam.ua.oz.au (Ross Williams) writes: [ memory ] The obvious solution is for the compression routine to provide its own memory. This doesn't fit well with I/O-driven compression, though, or with non-parametrized standards. [ pointers ] Face it, Ross: There are a whole lot of compressors which cannot be written efficiently in Pascal or even Ada. ``An interface that even fools can use will be used only by fools''; well, by making your standard wimpier and wimpier so that the routines can be run on wimpier and wimpier machines in wimpier and wimpier languages, you pretty much guarantee that it will only be used that way. If you're going to write the standard before the implementation, you might as well take a lesson from POSIX: Use pointers where they obviously belong, don't make design decisions based on how difficult you think the interface will be to use in Ada, and postpone other language bindings to the next decade. > Perhaps the algorithm could advertise how much heap and how much stack > it will use. An excellent suggestion. (Naturally, my yabbawhap package includes a ``checkconf'' program which advertises this information, given the compile parameters.) > |Ofcourse there this standard code has to be ported to a number of platforms > |but since almost every archive programm is ported anyway this is not much > |of a problem. > |Identification codes have to be obtained from some central point where all > |the codes are registered. > No need. Just get everyone to generate random n-bit IDs. Make n large > enough so that a collision is extremely unlikely. Great. Now IDs are almost totally useless. What are you trying to accomplish with those numbers? [ parameters ] > The reason why I am so heavy on parameters I think stems from seeing > dozens of papers in which they are not specified properly. Writing a crippled compression standard is a poor way to express your dissatisfaction with how papers are written. > |> If a user wants to create another tuning, he can just fiddle with the > |> source code for the algorithm until he is happy and then fiddle the > |> identity record to create a new algorithm. > |Uh-uh. This contradicts your stated goal of having the identification > |numbers mean something for comparison purposes. > Absolutely not, because if the user changed anything he would have to > (according to standard) generate a new algorithm name and ID. Sorry I didn't make this clear. I publish yabba with a default of -m65533. I give it some ID. Joe Shmoe changes the default to -m300000 and gives it another ID. Sally Shmoe changes the default to -m300000 and gives it yet another ID. Now if Joe and Sally choose their IDs randomly, how do we tell that they're using the same method? Or, for that matter, that they've just changed a parameter in my method? > |If you want to achieve that goal, just insist that everyone name all the > |parameters (in my case, -m/-z/-Z) when quoting compression results. (It > |gets even hairier when you're measuring speed and memory: for yabba, > |you'd have to quote MOD and PTRS as well as the machine type.) Don't > |force parameterized coders into a parameterless model. > Well, maybe. But the algorithm designer has to commit himself on > parameters to some extent. SAKDC has 25 parameters and I can dial up a > multitude of adaptivities and memories and different order models. I fail to see how this matters. The algorithm is parametrized. Period. > |> (Another argument is that many algorithms can be implemented more > |> efficiently if their parameters are statically specified). > |Hardly. > Absolutely. You should read the other critics. I have you telling me > that dynamic parameters are not much less efficient than static ones. > Meanwhile, there's the mob who don't like passing a pointer to > read/write memory for the algorithm to use because they want to avoid > indirect addressing in their code and pull tricks with the address > range. Right. Remember, we're talking about *can be implemented*, not can be implemented under your awful set of restrictions. Someone might have a compression method where he compares n to a parameter, x, in the innermost loop. Does that mean that the algorithm can be implemented more efficiently if their parameters are statically specified? No, because every use of n can be ``flipped'' into x - n, and this can be compared against 0. Sure, there will exist parameters (like the hash function) where you just can't do something like this. You still don't have to make those parameters fixed. You can just tell the compiler to optimize out a separate version for those parameter values: int my_special_hash(h,c) int h; int c; { ... } static int generic_routine(input,hash) char *input; int (*hash)(); { ... all the compression code, parameterized on ``hash'', goes here ... } static int optimized_routine(input) char *input; { #pragma begin-inline-fun generic_routine return generic_routine(input,my_special_hash); #pragma end-inline-fun generic_routine } int user_visible_routine(input,hash) char *input; int (*hash)(); { if (hash == my_special_hash) return optimized_routine(input); return generic_routine(input,hash); } Of course, this is oversimplified, and with current C compilers you have to stick real_routine inside one big, ugly (but portable) macro instead of using pragmas. But it works, it avoids code duplication, and it gives you all the benefits of fixing the parameter to begin with. In fact, if the compiler inlines something like user_visible_routine(myinput,my_special_hash) in the user code, it will end up *exactly* as efficient as if my_special_hash had been used everywhere in the first place. The only disadvantage is the extra space needed for generic_routine, and that can be avoided if you stick the routines inside a library to begin with. I believe the MIPS compilers can do all of the above, including inlining from ucode libraries. They don't have the #pragma but that's just syntactic sugar. Even under good old pcc, all you lose for the parametrization is a comparison and an extra function call on each call to the routine. > It is a fact that statically binding parameters allows lots of > optimizations. Indeed... I don't complain that something can't be done. I just do it. Are you so sure now that taking parameters out of the interface allows better optimization? > * They tend to confuse users who don't know how to set the > parameters. As long as every parameter has a default, this is moot. > * They must be algorithm specific. So what? > * They can overflow and be out of range and all the other things. True. Another advantage of a program-based interface is that it can take the time to check such things. > The idea of having default settings and then allowing the user to > specify otherwise at his own risk dooesn't sound so bad so long as: > * The settings do not affect memory consumption. There you go again. How about you try to fit compress into this model. If you manage to write one version which doesn't take too much memory for -b12, but allows -b16, but doesn't have a memory-affecting parameter, and doesn't allocate memory itself anyway, I'll be very impressed. [ compressor driving I/O versus I/O driving compressor ] > This may be so, but placing the compressor in charge means that you > have to tell it how to do IO, which is so messy in most programming > languages (requiring Ada like generic superroutines or C pointers to > IO functions) as to as to completely destroy the standard. See my general comments near the top. You have that choice: either use a language which can compete with machine language, or make your standard so wimpy that nobody will want to use it. > I'm not saying that there won't be a little pain in converting some > algorithms over to the standard (all good standards cause lots of pain > :-), but I am sure that there will be very little loss of efficiency > and great benefits. So do it for compress. I want to see something that works just like compress but uses your standard for an internal interface. > |> I am wary of a stream interface because it is then impossible to place > |> a bound on the amount of output one could get from feeding in a single > |> byte of input (during either compression or decompression). > |Again it sounds like you're focusing on modems. Why? > I'm NOT focussing on modems - you're the one who keeps mentioning them!!! > My logic goes something like this: > - We NEED to make the algorithm a procedure (not a process). > - We NEED to use memory blocks to feed bytes in and out. > (although the blocks can be part of a stream interface). > (I hope you agree that the alternatives to these are horrific) No, I don't agree with either one. The argument here is about the second. You only need a bound on output because you have a block interface, and unless you wanted to handle modems you could use a stream interface instead. There's nothing horrific about ignoring modems and sticking to a stream interface. > |One advantage of this for C is that in those cases where you want to > |hard code the parameters you can #define them in the source of the > |compression function without changing the function's formal > |parameters. (It seemed that Dan doesn't believe there can be an > |advantage in hard coding the parameters, which I find peculiar. There > |are many optimizations that can be done with known values over > |variables.) > If the user has the source code, he can do anything anyway. The real > case for parameters lies in the case of someone who only has an > executable. See above. It doesn't matter whether you have the entire source or just a library. Hard-coding parameters barely speeds anything up. > Agreed. If the standard does allow parameters, they must go through > the front door. What do you think of a string interface? E.g: > compress(comp,mem,inblk,outblk,"M=4,K=91,Z=512"); Not too bad. > I really don't think there would be any problem with implementing UNIX > compress if the flush flag were restored. As I understand it, the LZW > algorithm's context is very simple consisting of just a table, an > index and possibly one or two coding scalars. Well, don't just understand it. Do it. Then tell us what you did. > As a separate point, I don't see how a program-based standard could be > specified at this level without assuming that languages have processes > or superroutines or procedures as parameters. Hmmm? What do you mean by that? > |> I'd also like to repeat someone else's suggestion of seperating the > |> compression procedure and decompression procedure. > |Ditto. > But why? Because if the procedures are in a library, separating them means you can load only one into a program which needs to do only one thing. AP coding (as in whap) makes a good example: the compressor is rather complex and takes quite a bit of code space, but the decompressor is trivial. A lot of programs just want to read compressed files... > Imposed standards can and do work all the time (witness Ada, CDs). Huh? Ada was indeed standardized before it was implemented, and it is a royal flop. Very few people use it. The CD ``standard'' is just a description of what people are doing; it works because it was implemented first. > Most standards have > problems, but at least they give you something to kick off. No. Most *implementations* have problems---most importantly, incompleteness---but at least they do something positive. Standards are negative. They restrict you, and their main failure is being overly complete. > I don't see how I can "IMPLEMENT" my standard. I get the impression > that you think that in order for me to do that, I would have to > implement a work harness, a validation harness, and create conformant > implementations of dozens of algorithms. I don't know about Graham. I think you should just prove that the established compressors can be recoded into your interface without too much work. > (As it happens I have been using a standard similar to the one > proposed (see lzrw_headers.h in my ftp archive on > sirius.itd.adelaide.edu in pub/compression) for two years and have > implemented five or six algorithms using it (LZRW1 is such an > algorithm). These algorithms were implemented with the standard in > mind though.) Right. Which is why you should choose something like LZW, which does not put a bound on output (well, 8x or so in theory, but not a fixed number of bytes), and would have a much smaller user base if it were not parametrized, and is naturally a stream algorithm. [ separating compressor and decompressor ] > Is it one of the ten commandments or something? Yes. > I have been using "standard", "standard proposal", etc in an > undisciplined manner. I do not expect it to be adopted instantly. I > just want to develop something that everyone can agree on. I don't care whether you call it a ``standard'', a ``proposed standard'', a ``document'', a ``suggestion'', or an ``article''. I am not going to write code to someone else's interface unless I can see what its strengths and weaknesses are *in practice*. (Naturally, if you develop a large enough body of code that takes advantage of the interface somehow, that's a strong point.) ---Dan