Path: utzoo!mnetor!uunet!lll-winken!lll-tis!mordor!sri-spam!sri-unix!quintus!ok From: ok@quintus.UUCP (Richard A. O'Keefe) Newsgroups: comp.lang.prolog Subject: Re: Strings Message-ID: <768@cresswell.quintus.UUCP> Date: 14 Mar 88 05:39:31 GMT References: <512@ecrcvax.UUCP> Organization: Quintus Computer Systems, Mountain View, CA Lines: 128 In article <512@ecrcvax.UUCP>, micha@ecrcvax.UUCP (Micha Meier) writes: > In Prolog Digest V5.100 Richard O'Keefe writes > > "... if efficiency is your concern your are better off using lists > of character codes than strings." > > May I ask this to be precised a little bit more? I can't see why is it > faster to handle structured data on the global stack rather than directly > characters - I guess this applies only for special predicates, > e.g. returning all substrings of a string, but even there it's in > no way evident that using lists is faster. Several points to make: > I don't understand "precised". Most implementations of strings in Prolog ARE "structured data on the global stack". When I say that it is more efficient to use lists of character codes rather than packed byte vectors, I am reporting empirical measurements. I don't have to explain it: I'm just saying truthfully that in the cases I have measured lists were faster. There may be Prolog systems which implement lists so badly that this isn't true. I have to admit that I've only really tested concatenation and searching, not accessing single characters. > - when using lists of character codes, each time when > the character code or the next element is accessed, > it has to be dereferenced and its type must be tested On the contrary. A great advantage of using lists is the fact that you *don't* have to do anything to a list element *unless* you want to process that particular element. Contrast Chars0 = [Char1|Chars1] which dereferences-and-type-tests Chars0 (it's one operation) and assigns the two fields of Chars0 to Char1 and Chars1 *without* dereferencing or type testing Char1, with N1 is N0+1, nth_char(N1, String, Char1) where N1 has to be dereferenced-and-type-tested and checked for being in the right range, then the character has to be extracted and *boxed*, and finally assigned to Char1, also we had to do arithmetic on N0/N1. If you don't like the arithmetic (and who could blame you), consider instead the Turbo Prolog equivalent (taken from the first edition of the Turbo Prolog manual, so it's probably wrong). frontchar(String0, Char1, String1) Let's assume a conventional Prolog, rather than Turbo Prolog, which no doubt does something clever. String0 has to be dereferenced and type tested (just like a list cell!) then the first character has to be extracted and *boxed*, and a new string descriptor has to be created and *boxed* and assigned to String1. Assuming that a boxed character costs one cell and a string descriptor costs two cells (not unreasonable), walking down a string of N characters using frontchar/3 ends up using 3N cells for the characters and string tails *in addition* to the packed byte vector we started with, whereas the list version would have used 2N cells total. > - copying a list is certainly slower than copying a string; > for strings you might need a fixed size buffer which > can complicate the things a little bit of you want > strings of arbitrary length, but it is easier to get > another buffer than to get another global stack Why would you copy either? If you have a variable X which is bound to a string or list of character codes, and you want a variable Y to have the same value, just do Y = X. I don't understand "get another global stack". If you're worried about space, consider the fact that with the usual representations of lists and byte vectors, when you append something in front of a list you can share the old tail, but with a byte vector you have to make a copy. That is, with Chars1 a known list of N characters and Chars0 unknown, Chars0 = [Char1|Chars1] costs 2 cells and 1 time unit, but with String1 a known string of N characters and String0 unknown, frontchar(String0, Char1, String1) costs N+1 bytes + 2 cells for the string descriptor and N+k time units. (The version using nth_char/2 would turn over 2N cells.) You can use 1 cell for a string descriptor rather than 2, but then the cost of taking substrings goes way up. > - a list item needs at least 8 bytes, a character only one > (plus some fixed amount for each string, e.g. length, end or others). > If your string is longer than "aa", you need much more space > to handle it as a list, consequently more time due to swapping I have answered this above. You might think about the fact that the list representation will handle 16-bit characters (can you say "Kanji"? I knew you could) at no extra cost, but the alleged space advantage of byte vectors is chopped in half. > - garbage collection is easier for atoms than for strings > in the sense that there is only one reference to the name > of the atom, namely from the symbol table itself, but to > gather all accessible atoms one has to scan the stacks > and the program code anyway (unless you want to trail them > which is really not faster) This is an implementation detail. There is no reason why a Prolog system might have more pointers to an atom name than one. In order to find all accessible strings you have to scan the stacks too. > we still think that having strings is useful. Yes, but WHAT FOR? We've seen above that the time and space costs can go either way. Here's a specific example: assuming that your Prolog system supports the ISO 8859/1 character set, and that it does *not* reflect dpANS C's setlocale() and strcollate() functions as Prolog string operations, write a predicate which will take two text objects representing German-language book titles, and compare them using whatever the convention is in German libraries. To make it easier, we'll assume that there are no punctuation marks and that all numbers are spelled out. Do this entirely in Prolog, using only the operations you already have. (If you can stand it, try using the operations in the BSI draft.) make two versions of the predicate: one which uses byte vectors and one which uses lists, measure their space and time use, and show us the code, the test data, and the results. Those who do not understand Prolog are | Richard A. O'Keefe condemned to reinvent Lisp, poorly. | ...!sun!quintus!ok -- Those who do not understand Prolog are condemned to reinvent Lisp, poorly.