Path: utzoo!mnetor!uunet!husc6!ncar!ames!eos!aurora!labrea!sri-unix!quintus!ok
From: ok@quintus.UUCP (Richard A. O'Keefe)
Newsgroups: comp.lang.prolog
Subject: Re: behavior of read/get0 at end_of_file
Message-ID: <801@cresswell.quintus.UUCP>
Date: 23 Mar 88 09:54:10 GMT
References: <608> <1197@kulcs.kulcs.uucp> <783@cresswell.quintus.UUCP> <518@ecrcvax.UUCP>
Organization: Quintus Computer Systems, Mountain View, CA
Lines: 177
Keywords: get0 read end_of_file

In article <518@ecrcvax.UUCP>, micha@ecrcvax.UUCP (Micha Meier) writes:
> By the way, get0/1 does *not* exist in BSI, it uses get_char/1 instead,
> and its argument is a character, i.e. a string of length 1.
> This means that the type 'character' is inferred from
> the type 'string' (and not the other way round like in C).
> Does anybody out there know what advantages this can bring?
> It is independent on the character <-> integer encoding,
> but this only because explicit conversion predicates have
> to be called all the time.

I find it extremely odd to call a string of length one a character.
It's like calling a list of integers which contains one element an
integer. Do we call an array with one element a scalar?

I haven't commented on the BSI's get_char/1 before because for once
they have given a new operation a new name. There are two problems
with it. A minor problem is that the result being a string, they
can't represent end of file with an additional character, so the
fail-at-end approach is hard to avoid. (Not impossible.) There is an
efficiency problem: something which returns an integer or a character
constant can just return a single tagged item, but something which
returns a string either has to construct a new string every time, or
else cache the strings somehow.

For example, Interlisp has a function which returns you the next
character in the current input stream, represented as an atom with
one character in its name. (Well, almost: characters `0`..`9` are
represented by integers 0..9.)
This was quite attractive on a DEC-20, where you could just compute a
table of 128 atoms once and for all. It wasn't too bad on VAXen
either, where the table had to have 256 elements. But it became
rather more clumsy on the D machines, which have a 16-bit character
set. (Can you say "Kanji"? I knew you could.) So the alternatives I
can see at the moment are
     o  construct a new string every time.
     o  precompute 2^16 strings.
     o  cache 2^8 strings, and construct a new string every time for
        Kanji and other non-Latin alphabets.
     o  not support Kanji or other non-Latin alphabets at all.
(Can you say "Cyrillic"? How about "Devanagari"? You may need the
assistance of a good dictionary; I used to mispronounce "Devanagari",
and probably still do.)

I wrote that
> >For example, the arcs
> >        s1: a -> s2.
> >        s1: b -> s1.
> >        s1: $ -> accept.
> >would be coded like this:
> >        s1(0'a) :- get0(Next), s2(Next).
> >        s1(0'b) :- get0(Next), s1(Next).
> >        s1(- 1) :- true.

Meier says that
> In his tutorial to the SLP '87 Richard has taken another
> representation of a finite automaton which is more appropriate:
>         s1 :-
>                 get0(Char),
>                 s1(Char).
>
>         s1(0'a) :-
>                 s2.
>         s1(0'b) :-
>                 s1.
>         s1(-1) :-
>                 accept.

There wasn't time to go into this in detail in the tutorial, but it
should be obvious that the first approach is more general: in
particular it can handle transitions where (perhaps because of
context) no input is consumed, and it can handle lookahead.

> Such representation can
> be more easily converted to the BSI's variant of get:
>         s1 :-
>                 % do the corresponding action
>                 ( get0(Char) -> s1(Char)
>                 ;
>                     accept
>                 ).

This doesn't generalise as well as the end-marker version.
Here is the kind of thing one is constantly doing:

        rest_identifier(Char, [Char|Chars], After) :-
                is_csymf(Char),
                !,
                get0(Next),
                rest_identifier(Next, Chars, After).
        rest_identifier(After, [], After).

See how this code can treat the end marker just like any other
character: because it doesn't pass the is_csymf/1 test (copied from
Harbison & Steele, by the way) we'll pick the second clause, and
there is no special case needed for an identifier which happens to be
at the end of a stream. The fail-at-end approach forces us not only
to do something special with the get0/1 in rest_identifier/3, but in
everything that calls it. (In the Prolog tokeniser, there are two
such callers.) The point is that if-then-elses such as Meier suggests
start appearing all over the place like maggots in a corpse if you
adopt the fail-at-end approach, to the point of obscuring the
underlying automaton.

> I must say, none of the two seems to me satisfactory. Richard's
> version is not portable due to the -1 as eof character.

If the standard were to rule that -1 was the end of file character,
it would be precisely as portable as anything else in the standard!
In strict point of fact, the Prolog-in-Prolog tokeniser was written
in DEC-10 Prolog for DEC-10 Prolog, and used 26 as the end of file
character, and 31 as the end of line character. It took 5 minutes
with an editor to adapt it to Quintus Prolog. I wish C programs
written for UNIX took this little effort to port!

> for a Prolog system it is better to have get0/1 return
> some *portable* eof (e.g the atom end_of_file, for get0/1
> there can be no confusion with source items) instead of
> some integer.

It is important that the end-of-file marker, whatever it is, should
be the same kind of thing, in some sense, as the normal values, so
that classification tests such as is_lower/1, is_digit/1, and so on
will just fail quietly for the end-of-file marker, not report errors.
Since end of file is rare, we would like to test the other cases
first.
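
To make the "same kind of thing" point concrete, here is a minimal
sketch, not taken from the tokeniser itself. It assumes integer
character codes with -1 as the end marker. next_code/3 is a
hypothetical list-based stand-in for get0/1 so that the example runs
without an input stream, and this is_csymf/1 is a simplified guess at
such a classification test (letters and underscore only).

```prolog
% next_code(+Codes0, -Code, -Codes): hypothetical stand-in for get0/1.
% Returns the next code from a list, or -1 once the list is empty.
next_code([], -1, []).
next_code([C|Cs], C, Cs).

% is_csymf(+Code): simplified classification test (an assumption, not
% the tokeniser's definition). All comparisons are on integers, so the
% end marker -1 just fails every clause without raising an error.
is_csymf(C) :- C >= 0'a, C =< 0'z.
is_csymf(C) :- C >= 0'A, C =< 0'Z.
is_csymf(0'_).

% rest_identifier/5: the rest_identifier/3 pattern from the post, with
% the remaining input threaded through explicitly as a list.
rest_identifier(Char, [Char|Chars], After, Cs0, Cs) :-
    is_csymf(Char),
    !,
    next_code(Cs0, Next, Cs1),
    rest_identifier(Next, Chars, After, Cs1, Cs).
rest_identifier(After, [], After, Cs, Cs).
```

With these definitions, the query
rest_identifier(0'a, Id, After, [0'b,0'c], _) should bind Id to
[97,98,99] and After to -1: the identifier runs up to the end of the
input and no clause had to mention end of file. If the marker were an
atom instead, the arithmetic comparisons in is_csymf/1 would in many
systems raise an error rather than fail, so the test would need a
guard in front of it.
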
Pop-2 on the DEC-10 returned integers almost all the time, except
that at the end of a stream you got an end-of-file object which
belonged to another data type (there was only one element of that
data type, and it printed as ^Z). This was in practice a major
nuisance, because before you could do anything other than an equality
test with the result, you had to check whether it was the end of file
mark.

I have been giving out copies of the Prolog-in-Prolog tokeniser to
show how easy it is to program character input with the Edinburgh
Prolog approach. If someone would give me a tokeniser for BSI Prolog
written entirely in BSI Prolog using the fail-at-end approach, and if
that tokeniser were about as readable as the Prolog-in-Prolog one,
that would go a long way towards convincing me that fail-at-end was a
good idea.

> BSI objects that if [read/1] returns e.g. the atom end_of_file
> then any occurrence of this atom in the source file
> could not be distinguished from a real end of file.

That's not a bug, it's a feature! I'm serious about that. At
Edinburgh, I had the problem that if someone asked me for help with
Prolog, they might be using one of four different operating systems,
where the end of file key might be ^Z or ^D or ^Y or something else
which I have been glad to forget. No problem. I could always type
        end_of_file.
to a Prolog listener, and it would go away. Oh, this was so nice!
In fact, on my SUN right now I have function key F5 bound to
"end_of_file.\n" so that I can get out of Prolog without running the
risk of typing too many of them and logging out.

Another thing it is useful for is leaving test data in a source file.
One can do
        end_of_file.
and include the test cases in the program or not just by moving the
end_of_file around. Ah, you'll say, but that's what nested comments
are for! Well no, they don't work. That's right, "#| ... |#" is NOT
a reliable way of commenting code out in Common Lisp, and "/* ... */"
is NOT a reliable way of commenting code out in PopLog. But
end_of_file, in Edinburgh Prolog, IS a reliable way of commenting out
the rest of the file.

> In this case, a remedy would be the introduction of

Prolog needs a remedy for end_of_file like Elisabeth Schwarzkopf
needs a remedy for her voice. Before taking end_of_file away from me,
the BSI committee should supply me with a portable way of exiting a
break level and a reliable method of leaving test cases in a file
without having them always read.
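
The reason a literal end_of_file. term cuts off the rest of a file
can be sketched with a simple term-reading loop. This is only an
illustration: read_terms/2 and read_terms/3 are hypothetical names,
and the sketch assumes a read/2 that returns the atom end_of_file
when the stream is exhausted, as Edinburgh-style systems do.

```prolog
% read_terms(+Stream, -Terms): collect the terms read from Stream,
% stopping at the first end_of_file, whether it is the real end of
% the stream or a literal end_of_file. term in the text.
read_terms(Stream, Terms) :-
    read(Stream, Term),
    read_terms(Term, Stream, Terms).

% read_terms(+Term, +Stream, -Terms): dispatch on the term just read,
% in the same end-marker style as the character-level code.
read_terms(end_of_file, _, []) :- !.
read_terms(Term, Stream, [Term|Terms]) :-
    read(Stream, Next),
    read_terms(Next, Stream, Terms).
```

Given input text of the form "a. b. end_of_file. c.", such a loop
should return [a,b] and never look at c, which is exactly what lets
test cases sit after the marker without being read.
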