Path: utzoo!mnetor!uunet!husc6!sri-unix!quintus!ok From: ok@quintus.UUCP (Richard A. O'Keefe) Newsgroups: comp.lang.prolog Subject: Re: Strings Message-ID: <776@cresswell.quintus.UUCP> Date: 15 Mar 88 07:56:32 GMT References: <512@ecrcvax.UUCP> <768@cresswell.quintus.UUCP> <5348@utah-cs.UUCP> Organization: Quintus Computer Systems, Mountain View, CA Lines: 122 In article <5348@utah-cs.UUCP>, shebs%defun.utah.edu.uucp@utah-cs.UUCP (Stanley T. Shebs) writes: > In article <768@cresswell.quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes: > > >When I say that it is more efficient to use lists of character codes > >rather than packed byte vectors, I am reporting empirical measurements. > This is a very interesting assertion, because it has a lot of implications > for *any* vector/array-type representation. The broadest interpretation > is that any and all vector-like types are bad. I deny this. The question is >how do you USE these things<. If you are doing lots of concatenation and deconcatenation (I stole this word from PL/I), lists are a win. How often do you concatenate bignums? How often do you concatenate matrices (APL programmers need not apply)? If you need random access into some collection, a vector-like structure is going to be a really good idea. The point is that most uses of STRINGS just aren't like that. (Hint: for which C data-type is the idiom *x++ most often used...) > The reason I wonder is that Lisp has not used lists to represent bignums > since the mid-60s, and (for instance) JonL White's bignum paper in the > 86 Lisp conference assumes a vector-like representation. Xerox Lisp uses lists for bignums to this very day. There is a book: Computer Algebra: Symbolic and Algebraic Computation (2nd ed) eds Buchberger, Collins, & Loos Springer-Verlag, 1982, 1983 ISBN 3-211-81776-X (can anyone tell me what it means ISBN 0-387-81776-X for a book to have 2 ISBNs?) Price US$ 32. I bought this book mainly because of the chapter Arithmetic in Basic Algebraic Domains Collins, Mignotte, & Winkler which describes a number of algorithms Knuth didn't. This chapter claims quite strongly that list representation is better for bignums. The fact that one seldom if ever accesses the "elements" of a bignum in other than strict sequential order suggests that this may be a good idea. Consider the fact that if you have an N-"bigit" number X, X and X+k for small values of k can almost always share the most significant (N-1) "bigits". > Boxing characters is just as bad as boxing small integers, and just as > unnecessary. Avoiding the creation of a new string descriptor is harder, > and requires compiler analysis, but it is possible under many circumstances. I used "boxing" loosely. I did not, as Shebs excusably thinks, mean "allocating storage for", but "suitably locating in a register and supplying with an appropriate tag". This is cheap, but it isn't free. > I think the real underlying motivation for strings is that languages like C > can do string processing with essentially zero space and time overhead, C can do this precisely because it ****HASN'T**** got a string data type! For example, suppose I want to concatenate four chunks of text in C. I can do char *strmov(register char *dst, register char *src) { while (*dst++ = *src++) ; return dst-1; } void conc4(char buffer[], char *a, char *b, char *c, char *d) { (void) strmov( strmov( strmov( strmov( buffer, a), b), c), d); } Time overhead: each character is examined exactly once, as it is moved. no check for overflow or overlap. Space overhead: static allocation of result area. If you look closely, you will notice that there are no strings in that! There are only pointers into buffers. In Fortran 77, concatenation can be efficient because, once again, Fortran 77 hasn't got strings, only buffers. For example, CHARACTER *(80) BUFFER CHARACTER *(20) A, B, C, D BUFFER = A//B//C//D There is no storage allocation. (Unlike C, there *is* a check for overflow, but in this case it can be done at compile time.) ADA is similar to Fortran 77: declare a, b, c, d : string(1..20); buffer : string(1..a'length+b'length+c'length+d'length); begin buffer := a & b & c & d; end; The key to not allocating storage is not to have any operations which create string values, but only commands which change the contents of an existing buffer. (At the source level you may have operations which look as though they create string values, but the implementation mustn't actually do that.) One thing that helps in Fortran 77 and ADA is saying in advance how big a string each buffer can hold. C obtains a similar advantage by not bothering to check. > It's not so much an issue of doing one operation on German titles, but of > passing a million characters of text through a program without allocating > so much as one byte of storage dynamically. Well, if *that* is what you want, copy_chars :- get0(Char), copy_chars(Char). copy_chars(-1) :- !. copy_chars(Char) :- put(Char), copy_chars. will pass any number of characters through a program without allocating so much as one byte of storage dynamically. (:-) Actually, if that's what you want, AAIS Prolog may be the language for you. I forget the details, but it has an operation something like setchar(+Index, +String, +Char) which changes the Indexth character of String to Char. So in that language you can have all the fun of buffers (and all the pain of buffer overflow).