Path: utzoo!mnetor!uunet!husc6!sri-unix!quintus!ok
From: ok@quintus.UUCP (Richard A. O'Keefe)
Newsgroups: comp.lang.prolog
Subject: Re: Strings
Message-ID: <776@cresswell.quintus.UUCP>
Date: 15 Mar 88 07:56:32 GMT
References: <512@ecrcvax.UUCP> <768@cresswell.quintus.UUCP> <5348@utah-cs.UUCP>
Organization: Quintus Computer Systems, Mountain View, CA
Lines: 122

In article <5348@utah-cs.UUCP>, shebs%defun.utah.edu.uucp@utah-cs.UUCP (Stanley T. Shebs) writes:
> In article <768@cresswell.quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
> 
> >When I say that it is more efficient to use lists of character codes
> >rather than packed byte vectors, I am reporting empirical measurements.

> This is a very interesting assertion, because it has a lot of implications
> for *any* vector/array-type representation.  The broadest interpretation
> is that any and all vector-like types are bad.

I deny this.  The question is >how do you USE these things<.
If you are doing lots of concatenation and deconcatenation (I stole this
word from PL/I), lists are a win.  How often do you concatenate bignums?
How often do you concatenate matrices (APL programmers need not apply)?

If you need random access into some collection, a vector-like structure
is going to be a really good idea.  The point is that most uses of STRINGS
just aren't like that.  (Hint:  for which C data-type is the idiom *x++
most often used...)

> The reason I wonder is that Lisp has not used lists to represent bignums
> since the mid-60s, and (for instance) JonL White's bignum paper in the
> 86 Lisp conference assumes a vector-like representation.

Xerox Lisp uses lists for bignums to this very day.
There is a book:
	Computer Algebra: Symbolic and Algebraic Computation (2nd ed)
	eds Buchberger, Collins, & Loos
	Springer-Verlag, 1982, 1983
	ISBN 3-211-81776-X		(can anyone tell me what it means
	ISBN 0-387-81776-X		for a book to have 2 ISBNs?)
	Price US$ 32.
I bought this book mainly because of the chapter
	Arithmetic in Basic Algebraic Domains
	Collins, Mignotte, & Winkler
which describes a number of algorithms Knuth didn't.
This chapter claims quite strongly that list representation is better
for bignums.  The fact that one seldom if ever accesses the "elements"
of a bignum in other than strict sequential order suggests that this
may be a good idea.

Consider the fact that if you have an N-"bigit" number X,
X and X+k for small values of k can almost always share the most
significant (N-1) "bigits".

> Boxing characters is just as bad as boxing small integers, and just as
> unnecessary.  Avoiding the creation of a new string descriptor is harder,
> and requires compiler analysis, but it is possible under many circumstances.

I used "boxing" loosely.  I did not, as Shebs excusably thinks, mean
"allocating storage for", but "suitably locating in a register and
supplying with an appropriate tag".  This is cheap, but it isn't free.

> I think the real underlying motivation for strings is that languages like C
> can do string processing with essentially zero space and time overhead,

C can do this precisely because it ****HASN'T**** got a string data type!
For example, suppose I want to concatenate four chunks of text in C.
I can do
	char *strmov(register char *dst, register char *src)
	    {
		while (*dst++ = *src++) ;
		return dst-1;
	    }

	void conc4(char buffer[], char *a, char *b, char *c, char *d)
	    {
		(void) strmov( strmov( strmov( strmov( buffer,
			a), b), c), d);
	    }
Time overhead: each character is examined exactly once, as it is moved.
		no check for overflow or overlap.
Space overhead: static allocation of result area.

If you look closely, you will notice that there are no strings in that!
There are only pointers into buffers.

In Fortran 77, concatenation can be efficient because, once again,
Fortran 77 hasn't got strings, only buffers.  For example,
	CHARACTER *(80) BUFFER
	CHARACTER *(20) A, B, C, D
	BUFFER = A//B//C//D
There is no storage allocation.  (Unlike C, there *is* a check for
overflow, but in this case it can be done at compile time.)
ADA is similar to Fortran 77:
	declare
	    a, b, c, d : string(1..20);
	    buffer : string(1..a'length+b'length+c'length+d'length);
	begin
	    buffer := a & b & c & d;
	end;
The key to not allocating storage is not to have any operations which
create string values, but only commands which change the contents of
an existing buffer.  (At the source level you may have operations which
look as though they create string values, but the implementation mustn't
actually do that.)  One thing that helps in Fortran 77 and ADA is saying
in advance how big a string each buffer can hold.  C obtains a similar
advantage by not bothering to check.

> It's not so much an issue of doing one operation on German titles, but of
> passing a million characters of text through a program without allocating
> so much as one byte of storage dynamically.

Well, if *that* is what you want,

	copy_chars :-
		get0(Char),
		copy_chars(Char).

	copy_chars(-1) :- !.
	copy_chars(Char) :-
		put(Char),
		copy_chars.

will pass any number of characters through a program without allocating
so much as one byte of storage dynamically.   (:-)

Actually, if that's what you want, AAIS Prolog may be the language for you.
I forget the details, but it has an operation something like
	setchar(+Index, +String, +Char)
which changes the Indexth character of String to Char.  So in that language
you can have all the fun of buffers (and all the pain of buffer overflow).