Path: utzoo!mnetor!uunet!lll-winken!lll-tis!mordor!sri-spam!sri-unix!quintus!ok
From: ok@quintus.UUCP (Richard A. O'Keefe)
Newsgroups: comp.lang.prolog
Subject: Re: Strings
Message-ID: <768@cresswell.quintus.UUCP>
Date: 14 Mar 88 05:39:31 GMT
References: <512@ecrcvax.UUCP>
Organization: Quintus Computer Systems, Mountain View, CA
Lines: 128

In article <512@ecrcvax.UUCP>, micha@ecrcvax.UUCP (Micha Meier) writes:
> In Prolog Digest V5.100 Richard O'Keefe writes
> 
> "... if efficiency is your concern your are better off using lists
> of character codes than strings."
> 
> May I ask this to be precised a little bit more? I can't see why is it
> faster to handle structured data on the global stack rather than directly
> characters - I guess this applies only for special predicates,
> e.g. returning all substrings of a string, but even there it's in
> no way evident that using lists is faster. Several points to make:
> 
I don't understand "precised".

Most implementations of strings in Prolog ARE "structured data on the
global stack".

When I say that it is more efficient to use lists of character codes
rather than packed byte vectors, I am reporting empirical measurements.
I don't have to explain it:  I'm just saying truthfully that in the
cases I have measured lists were faster.  There may be Prolog systems
which implement lists so badly that this isn't true.  I have to admit
that I've only really tested concatenation and searching, not accessing
single characters.

> 	- when using lists of character codes, each time when
> 	  the character code or the next element is accessed,
> 	  it has to be dereferenced and its type must be tested

On the contrary.  A great advantage of using lists is the fact that
you *don't* have to do anything to a list element *unless* you want
to process that particular element.  Contrast
	Chars0 = [Char1|Chars1]
which dereferences-and-type-tests Chars0 (it's one operation) and
assigns the two fields of Chars0 to Char1 and Chars1 *without*
dereferencing or type testing Char1, with
	N1 is N0+1,
	nth_char(N1, String, Char1)
where N1 has to be dereferenced-and-type-tested and checked for being
in the right range, then the character has to be extracted and *boxed*,
and finally assigned to Char1, also we had to do arithmetic on N0/N1.

If you don't like the arithmetic (and who could blame you), consider
instead the Turbo Prolog equivalent (taken from the first edition of
the Turbo Prolog manual, so it's probably wrong).
	frontchar(String0, Char1, String1)
Let's assume a conventional Prolog, rather than Turbo Prolog, which no
doubt does something clever.  String0 has to be dereferenced and type
tested (just like a list cell!) then the first character has to be
extracted and *boxed*, and a new string descriptor has to be created
and *boxed* and assigned to String1.  Assuming that a boxed character
costs one cell and a string descriptor costs two cells (not unreasonable),
walking down a string of N characters using frontchar/3 ends up using
3N cells for the characters and string tails *in addition* to the
packed byte vector we started with, whereas the list version would have
used 2N cells total.

> 	- copying a list is certainly slower than copying a string;
> 	  for strings you might need a fixed size buffer which
> 	  can complicate the things a little bit of you want
> 	  strings of arbitrary length, but it is easier to get
> 	  another buffer than to get another global stack

Why would you copy either?  If you have a variable X which is bound to a
string or list of character codes, and you want a variable Y to have the
same value, just do Y = X.  I don't understand "get another global stack".

If you're worried about space, consider the fact that with the usual
representations of lists and byte vectors, when you append something
in front of a list you can share the old tail, but with a byte vector
you have to make a copy.  That is, with Chars1 a known list of N
characters and Chars0 unknown,
	Chars0 = [Char1|Chars1]
costs 2 cells and 1 time unit, but with String1 a known string of N
characters and String0 unknown,
	frontchar(String0, Char1, String1)
costs N+1 bytes + 2 cells for the string descriptor and N+k time units.
(The version using nth_char/2 would turn over 2N cells.)
You can use 1 cell for a string descriptor rather than 2, but then the
cost of taking substrings goes way up.

> 	- a list item needs at least 8 bytes, a character only one
> 	  (plus some fixed amount for each string, e.g. length, end or others).
> 	  If your string is longer than "aa", you need much more space
> 	  to handle it as a list, consequently more time due to swapping
 
I have answered this above.  You might think about the fact that the
list representation will handle 16-bit characters (can you say "Kanji"?
I knew you could) at no extra cost, but the alleged space advantage of
byte vectors is chopped in half.

> 	- garbage collection is easier for atoms than for strings
> 	  in the sense that there is only one reference to the name
> 	  of the atom, namely from the symbol table itself, but to
> 	  gather all accessible atoms one has to scan the stacks
> 	  and the program code anyway (unless you want to trail them
> 	  which is really not faster)

This is an implementation detail.  There is no reason why a Prolog system
might have more pointers to an atom name than one.  In order to find all
accessible strings you have to scan the stacks too.

> we still think that having strings is useful.

Yes, but WHAT FOR?  We've seen above that the time and space costs can go
either way.

Here's a specific example:
    assuming that your Prolog system supports the ISO 8859/1 character set,
    and that it does *not* reflect dpANS C's setlocale() and strcollate()
    functions as Prolog string operations,

    write a predicate which will take two text objects representing
    German-language book titles, and compare them using whatever the
    convention is in German libraries.  To make it easier, we'll
    assume that there are no punctuation marks and that all numbers
    are spelled out.  Do this entirely in Prolog, using only the
    operations you already have.  (If you can stand it, try using
    the operations in the BSI draft.)

    make two versions of the predicate: one which uses byte vectors and
    one which uses lists, measure their space and time use, and show us
    the code, the test data, and the results.

Those who do not understand Prolog are	| Richard A. O'Keefe
condemned to reinvent Lisp, poorly.	| ...!sun!quintus!ok
-- 
Those who do not understand Prolog are
condemned to reinvent Lisp, poorly.