Path: utzoo!mnetor!uunet!husc6!sri-unix!quintus!ok From: ok@quintus.UUCP (Richard A. O'Keefe) Newsgroups: comp.lang.prolog Subject: Re: behavior of read/get0 at end_of_file Message-ID: <783@cresswell.quintus.UUCP> Date: 17 Mar 88 09:57:15 GMT References: 608 <1197@kulcs.kulcs.uucp> Organization: Quintus Computer Systems, Mountain View, CA Lines: 221 Keywords: get0 read end_of_file article <1197@kulcs.kulcs.uucp>, by bimandre@kulcs.uucp (Andre Marien) [signed by Andre Marien and Bart Demoen both] arrived mangled at our site. What we got started like this: > > > copy_chars :- > > get0(Char), > > copy_chars(Char). > > > > copy_chars(-1) :- !. > > copy_chars(Char) :- > > put(Char), > > copy_chars. > > At the benchmark workshop he agitated vividly against the different > behavior of BIM_Prolog in case of end of file. > BIM_Prolog fails when get0 attempts to read past end of file instead of > returning -1. The same is true for read. > If you write the same program as above with the BIM_prolog convention, > this it what it looks like : > > copy_chars :- > get0(Char), !, > put(Char), > copy_chars. > copy_chars . > > Now this looks so obviously better and more readable to me, > that it convinces me we made the better choice. It would be inaccurate to say that I have ever "agitated ... against ... BIM_Prolog". It would, however, be accurate to say that I had some rather harsh things to say about the quondam intention of the BSI Prolog group to make get0/1 behave in this way. I don't know what the BSI group currently intend. I should apologise for not having taken greater care with the presentation of copy_chars/0; but then in the context of my original message it was part of a joke. Here's how I'd write copy_chars/0 for real: copy_chars :- get0(Char), ( is_endfile(Char) -> true ; put(Char), copy_chars ). This is only superficially different from C's while ((Char = getchar()) != EOF) putchar(Char); Which version is "so obviously better and more readable"? I'm not going to try to answer that question, because it is entirely the wrong question. An example this small can be coped with even if it is badly written. The right question is: are there any objective reasons for preferring one approach to the other? Let's face it, a very important consideration for Quintus is that we take money from people in return for something we claim to be an Edinburgh- compatible Prolog. So we have three choices: 1. have get0/1 return a special code (26 as in DEC-10 Prolog, or -1 as in C-Prolog; we picked the latter) and have read/1 return a special code, because that's what real Edinburgh Prologs do, or 2. stop claiming to be Edinburgh compatible, or 3. deliberately lie to our customers. I think it's clear that number 1 is a defensible choice. So that's a reason why a Prolog vendor who claims to provide an Edinburgh-compatible Prolog should do what we do. But is there any reason why this is a *good* thing to do in itself? Yes, there is. With the possible exception of applying-a-function-to-some-arguments, an operation on its own isn't much good. You need a method for using it, and it has to fit well with the other operations available. In particular, the question of how we read a single character isn't all that interesting: what's interesting is "how do we write a program that reads a lot of characters". For example, how do you write a tokeniser for Prolog or Pascal or whatever in Prolog? Where would you look for advice about how to write programs that do input? Where but in a book about compilers. And a good book on that topic is the "Dragon" book: Compilers: Principles, Techniques, and Tools Aho, Sethi, & Ullman Addison-Wesley 1986 ISBN 0-201-10088-6 To learn how to write a tokeniser, we might look at 'lexer.c' in section 2.9 (page 74). You will note that this program relies on the fact that C streams end with an "EOF" end-marker. Perhaps more seriously, the whole "transition diagram"/"finite state automaton" approach described in chapter 3 relies on "end of file" being a source character like any other. (See fig 3.22, for example.) Indeed, end-markers are so much taken for granted in parsing that "$" crops up in chapter 4 without much introduction. We can convert a deterministic finite-state automaton to Edinburgh Prolog with very little effort. We represent a state of the automaton by a predicate with one argument: the next character. We represent an arc of the automaton by a clause. For example, the arcs s1: a -> s2. s1: b -> s1. s1: $ -> accept. would be coded like this: s1(0'a) :- get0(Next), s2(Next). s1(0'b) :- get0(Next), s1(Next). s1(- 1) :- true. The correspondence is exact. I think this is an important practical point. I offer the following anecdote as evidence: a year after learning Prolog I decided that I wanted to write a Prolog system of my own, but couldn't figure out how to write the tokeniser, so I abandoned the entire project. Two years later, I was talking to someone and suggested this approach as a method for writing programs that use get0/1, and suddenly it dawned on me that it would work. One evening of coding later, I had a Dec-10 Prolog tokeniser written in Dec-10 Prolog. If you need some lookahead, you simply add extra arguments to a few states to carry the looked-ahead characters. What would it do to us if get0/1 failed at the end of a file? Unpleasant things: the test for end of file has to be pushed up into the states *before* the state which expects end of file. Every arc of the form sN(X) :- get0(Next), s1(Next). has to be rewritten as sN(X) :- bad_get0(Next), !, s1(Next). sN(X) :- s1_eof. and s1 has to be written as s1(0'a) :- bad_get0(Next), s2(Next). s1(0'b) :- bad_get0(Next), !, s1(Next). s1(0'b) :- /* EOF */ s1_eof. s1_eof :- true. The cleanest way to avoid this mess is to use a get0/1 which returns an end-marker, and if your vendor won't provide you with one, you'll have to write one yourself. +------------------------------------------------------------------------+ | An important reason for get0/1 returning an end-marker at | | the end of a stream is that this forms part of a practice | | of writing character-reading code. | +------------------------------------------------------------------------+ Note that this practice existed prior to Prolog: we didn't have to figure out anything new. Is there any other reason for preferring the Edinburgh Prolog version of get0/1? Yes. Suppose I have a goal bad_get0(X) and it fails. Does that mean that the end of the stream has been reached? ***NO***. It means that *EITHER* the end of the stream has been reached *OR* a character has been read which didn't happen to unify with X. Is there anything we can do afterwards to find out which was the case? No. Assuming, for the sake of argument, a predicate at_eof/0 which succeeds when we are at the end of the current input stream, ... ( bad_get0(X) -> /* character read */ ; at_eof -> /* now at end of stream */ ; /* otherwise it was unification failure */ ) ... isn't quite right. If there was precisely one character left in the input stream, bad_get0(X) will consume it, and if the unification fails, at_eof will *now* see the end of the file. The problem is that bad_get0/1 has a side-effect (consuming one character from the current input stream) which it SOMETIMES does and sometimes doesn't do. It's similar to playing Russian roulette, except that the gun is pointed at the foot rather than the head. It is interesting that Marien and Demoen fell into exactly this trap. They say: > > BTW, if you ever want to convert a program with a different interpretation, > the solution is easy : > > /*QP*/read(X) :- /*bim*/read(X), ! . > /*QP*/read(whatever_is_used_to_indicate_end_of_file) . > It may be easy, but it isn't a solution. Suppose we write this: buggy_read(Term) :- bim_read(Term), !. buggy_read(end_of_file). Now, suppose the current input stream contains fred. and we call buggy_read(end_of_file). IT WILL SUCCEED! It should have failed. Now, have I shown that the BIM approach is bad? No. What I have shown is that the end-marker approach which Quintus Prolog inherited from C Prolog (which got it from DEC-10 Prolog, which I think got it from Pop-2) is accompanied by a straightforward discipline for using it to write tokenisers. (In fact, it's exactly the same approach you use in C.) I am not aware of a similar methodology for the end-failure approach, despite having asked Bart Demoen at the Prolog Benchmarking Workshop for enlightenment on this point. Again, this doesn't mean that there isn't any such methodology, only that I don't know what it might be. If there is a straightforward way of turning deterministic transition diagrams into end-failure code, I would be pleased to be instructed in it.