Path: utzoo!mnetor!uunet!husc6!sri-unix!quintus!ok
From: ok@quintus.UUCP (Richard A. O'Keefe)
Newsgroups: comp.lang.prolog
Subject: Re: behavior of read/get0 at end_of_file
Message-ID: <783@cresswell.quintus.UUCP>
Date: 17 Mar 88 09:57:15 GMT
References: 608 <1197@kulcs.kulcs.uucp>
Organization: Quintus Computer Systems, Mountain View, CA
Lines: 221
Keywords: get0 read end_of_file

article <1197@kulcs.kulcs.uucp>, by bimandre@kulcs.uucp (Andre Marien)
[signed by Andre Marien and Bart Demoen both] arrived mangled at our site.
What we got started like this:
> 
> >	copy_chars :-
> >		get0(Char),
> >		copy_chars(Char).
> >
> >	copy_chars(-1) :- !.
> >	copy_chars(Char) :-
> >		put(Char),
> >		copy_chars.
> 
> At the benchmark workshop he agitated vividly against the different
> behavior of BIM_Prolog in case of end of file.
> BIM_Prolog fails when get0 attempts to read past end of file instead of
> returning -1. The same is true for read.
> If you write the same program as above with the BIM_prolog convention,
> this it what it looks like :
> 
>         copy_chars :-
>                 get0(Char), !,
>                 put(Char),
>                 copy_chars.
>         copy_chars .
> 
> Now this looks so obviously better and more readable to me,
> that it convinces me we made the better choice.

It would be inaccurate to say that I have ever "agitated ... against ...
BIM_Prolog".  It would, however, be accurate to say that I had some
rather harsh things to say about the quondam intention of the BSI
Prolog group to make get0/1 behave in this way.  I don't know what
the BSI group currently intend.

I should apologise for not having taken greater care with the presentation
of copy_chars/0; but then in the context of my original message it was part
of a joke.  Here's how I'd write copy_chars/0 for real:

	copy_chars :-
		get0(Char),
		(   is_endfile(Char) -> true
		;   put(Char),
		    copy_chars
		).

This is only superficially different from C's
	while ((Char = getchar()) != EOF) putchar(Char);

Which version is "so obviously better and more readable"?

I'm not going to try to answer that question, because it is entirely
the wrong question.  An example this small can be coped with even if
it is badly written.

The right question is:
	are there any objective reasons for preferring one approach
	to the other?

Let's face it, a very important consideration for Quintus is that we take
money from people in return for something we claim to be an Edinburgh-
compatible Prolog.  So we have three choices:
   1.	have get0/1 return a special code (26 as in DEC-10 Prolog,
	or -1 as in C-Prolog; we picked the latter) and have read/1
	return a special code, because that's what real Edinburgh
	Prologs do, or
   2.	stop claiming to be Edinburgh compatible, or
   3.	deliberately lie to our customers.
I think it's clear that number 1 is a defensible choice.

So that's a reason why a Prolog vendor who claims to provide an
Edinburgh-compatible Prolog should do what we do.  But is there any
reason why this is a *good* thing to do in itself?

Yes, there is.

With the possible exception of applying-a-function-to-some-arguments,
an operation on its own isn't much good.  You need a method for using
it, and it has to fit well with the other operations available.  In
particular, the question of how we read a single character isn't all
that interesting:  what's interesting is "how do we write a program
that reads a lot of characters".  For example, how do you write a
tokeniser for Prolog or Pascal or whatever in Prolog?

Where would you look for advice about how to write programs that do
input?  Where but in a book about compilers.  And a good book on that
topic is the "Dragon" book:

	Compilers: Principles, Techniques, and Tools
	Aho, Sethi, & Ullman
	Addison-Wesley 1986
	ISBN 0-201-10088-6

To learn how to write a tokeniser, we might look at 'lexer.c' in 
section 2.9 (page 74).  You will note that this program relies on the
fact that C streams end with an "EOF" end-marker.  Perhaps more
seriously, the whole "transition diagram"/"finite state automaton"
approach described in chapter 3 relies on "end of file" being a
source character like any other.  (See fig 3.22, for example.)
Indeed, end-markers are so much taken for granted in parsing that
"$" crops up in chapter 4 without much introduction.

We can convert a deterministic finite-state automaton to Edinburgh
Prolog with very little effort.  We represent a state of the automaton
by a predicate with one argument: the next character.  We represent an
arc of the automaton by a clause.  For example, the arcs

	s1: a -> s2.
	s1: b -> s1.
	s1: $  -> accept.

would be coded like this:
	
	s1(0'a) :- get0(Next), s2(Next).
	s1(0'b) :- get0(Next), s1(Next).
	s1(- 1) :- true.

The correspondence is exact.  I think this is an important practical
point.  I offer the following anecdote as evidence:  a year after
learning Prolog I decided that I wanted to write a Prolog system of my
own, but couldn't figure out how to write the tokeniser, so I abandoned
the entire project.  Two years later, I was talking to someone and
suggested this approach as a method for writing programs that use get0/1,
and suddenly it dawned on me that it would work.  One evening of coding
later, I had a Dec-10 Prolog tokeniser written in Dec-10 Prolog.  If you
need some lookahead, you simply add extra arguments to a few states to
carry the looked-ahead characters.

What would it do to us if get0/1 failed at the end of a file?
Unpleasant things:  the test for end of file has to be pushed up into
the states *before* the state which expects end of file.  Every arc of
the form

	sN(X) :- get0(Next), s1(Next).

has to be rewritten as 

	sN(X) :- bad_get0(Next), !, s1(Next).
	sN(X) :- s1_eof.

and s1 has to be written as

	s1(0'a) :- bad_get0(Next), s2(Next).
	s1(0'b) :- bad_get0(Next), !, s1(Next).
	s1(0'b) :- /* EOF */      s1_eof.

	s1_eof :- true.

The cleanest way to avoid this mess is to use a get0/1 which returns an
end-marker, and if your vendor won't provide you with one, you'll have
to write one yourself.

+------------------------------------------------------------------------+
|	An important reason for get0/1 returning an end-marker at	 |
|	the end of a stream is that this forms part of a practice	 |
|	of writing character-reading code.				 |
+------------------------------------------------------------------------+

Note that this practice existed prior to Prolog:  we didn't have to figure
out anything new.  

Is there any other reason for preferring the Edinburgh Prolog version of
get0/1?

Yes.

Suppose I have a goal
	bad_get0(X)
and it fails.  Does that mean that the end of the stream has been reached?
***NO***.  It means that *EITHER* the end of the stream has been reached
*OR* a character has been read which didn't happen to unify with X.  Is
there anything we can do afterwards to find out which was the case?  No.
Assuming, for the sake of argument, a predicate at_eof/0 which succeeds
when we are at the end of the current input stream,

	... ( bad_get0(X) -> /* character read */
	    ; at_eof -> /* now at end of stream */
	    ; /* otherwise it was unification failure */
	    ) ...
isn't quite right.  If there was precisely one character left in the
input stream, bad_get0(X) will consume it, and if the unification fails,
at_eof will *now* see the end of the file.

The problem is that bad_get0/1 has a side-effect (consuming one character
from the current input stream) which it SOMETIMES does and sometimes
doesn't do.  It's similar to playing Russian roulette, except that the
gun is pointed at the foot rather than the head.

It is interesting that Marien and Demoen fell into exactly this trap.
They say:
>
> BTW, if you ever want to convert a program with a different interpretation,
> the solution is easy :
> 
> /*QP*/read(X) :- /*bim*/read(X), ! .
> /*QP*/read(whatever_is_used_to_indicate_end_of_file) .
> 
It may be easy, but it isn't a solution.  Suppose we write this:

	buggy_read(Term) :- bim_read(Term), !.
	buggy_read(end_of_file).

Now, suppose the current input stream contains
	fred.
and we call
	buggy_read(end_of_file).
IT WILL SUCCEED!  It should have failed.


Now, have I shown that the BIM approach is bad?  No.
What I have shown is that the end-marker approach which Quintus Prolog
inherited from C Prolog (which got it from DEC-10 Prolog, which I think
got it from Pop-2) is accompanied by a straightforward discipline for
using it to write tokenisers.  (In fact, it's exactly the same approach
you use in C.)  I am not aware of a similar methodology for the
end-failure approach, despite having asked Bart Demoen at the Prolog
Benchmarking Workshop for enlightenment on this point.  Again, this
doesn't mean that there isn't any such methodology, only that I don't
know what it might be.  If there is a straightforward way of turning
deterministic transition diagrams into end-failure code, I would be
pleased to be instructed in it.