Path: utzoo!mnetor!uunet!husc6!ncar!ames!eos!aurora!labrea!sri-unix!quintus!ok
From: ok@quintus.UUCP (Richard A. O'Keefe)
Newsgroups: comp.lang.prolog
Subject: Re: behavior of read/get0 at end_of_file
Message-ID: <801@cresswell.quintus.UUCP>
Date: 23 Mar 88 09:54:10 GMT
References: <608> <1197@kulcs.kulcs.uucp> <783@cresswell.quintus.UUCP> <518@ecrcvax.UUCP>
Organization: Quintus Computer Systems, Mountain View, CA
Lines: 177
Keywords: get0 read end_of_file

In article <518@ecrcvax.UUCP>, micha@ecrcvax.UUCP (Micha Meier) writes:
> 	By the way, get0/1 does *not* exist in BSI, it uses get_char/1 instead,
> 	and its argument is a character, i.e. a string of length 1.
> 	This means that the type 'character' is inferred from
> 	the type 'string' (and not the other way round like in C).
> 	Does anybody out there know what advantages this can bring?
> 	It is independent on the character <-> integer encoding,
> 	but this only because explicit conversion predicates have
> 	to be called all the time.

I find it extremely odd to call a string of length one a character.
It's like calling a list of integers which contains one element an
integer.  Do we call an array with one element a scalar?

I haven't commented on the BSI's get_char/1 before because for once they
have given a new operation a new name.  There are two problems with it.
A minor problem is that, the result being a string, they can't represent
end of file with an additional character, so the fail-at-end approach is
hard to avoid.  (Not impossible.)  The other is an efficiency problem:
something which returns an integer or a character constant can just
return a single tagged item, but something which returns a string either
has to construct a new string every time, or else cache the strings somehow.

For example, Interlisp has a function which returns you the next character
in the current input stream, represented as an atom with one character in
its name.  (Well, almost:  characters `0`..`9` are represented by integers
0..9.)  This was quite attractive on a DEC-20, where you could just compute
a table of 128 atoms once and for all.  It wasn't too bad on VAXen either,
where the table had to have 256 elements.  But it became rather more
clumsy on the D machines, which have a 16-bit character set.  (Can you say
"Kanji"?  I knew you could.)  So the alternatives I can see at the moment
are
    o	construct a new string every time.
    o	precompute 2^16 strings.
    o	cache 2^8 strings, and construct a new string every
	time for Kanji and other non-Latin alphabets.
    o	not support Kanji or other non-Latin alphabets at all.
(Can you say "Cyrillic"?  How about "Devanagari"?  You may need the
assistance of a good dictionary; I used to mispronounce "Devanagari",
and probably still do.)

I wrote that
> >For example, the arcs
> >	s1: a -> s2.
> >	s1: b -> s1.
> >	s1: $  -> accept.
> >would be coded like this:
> >	s1(0'a) :- get0(Next), s2(Next).
> >	s1(0'b) :- get0(Next), s1(Next).
> >	s1(- 1) :- true.
Meier says that
> 	In his tutorial to the SLP '87 Richard has taken another
> 	representation of a finite automaton which is more appropriate:
> 	s1 :-
> 		get0(Char),
> 		s1(Char).
> 
> 	s1(0'a) :-
> 		s2.
> 	s1(0'b) :-
> 		s1.
> 	s1(-1) :-
> 		accept.
There wasn't time to go into this in detail in the tutorial, but it
should be obvious that the first approach is more general:  in particular
it can handle transitions where (perhaps because of context) no input is
consumed, and it can handle lookahead.
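
To make the point about transitions that consume no input concrete,
here is a minimal sketch of an automaton for an optionally signed
integer followed by end of file.  The predicate names are made up for
illustration, and get0/1 is assumed to return -1 at end of file, as in
Quintus Prolog.

```prolog
% Sketch only: all predicate names here are hypothetical.
% Assumes get0/1 returns -1 at end of file.

signed_integer :-
    get0(Char),
    sign(Char).

% sign/1: consume a leading minus sign if there is one.
sign(0'-) :- !,
    get0(Next),
    first_digit(Next).
sign(Char) :-                 % no sign: change state, consume nothing
    first_digit(Char).

% first_digit/1: at least one digit is required.
first_digit(Char) :-
    is_digit_code(Char),
    get0(Next),
    more_digits(Next).

% more_digits/1: further digits, then the end marker.
more_digits(Char) :-
    is_digit_code(Char),
    !,
    get0(Next),
    more_digits(Next).
more_digits(-1).              % accept at end of file

% Fails quietly for -1, so the end marker needs no special case.
is_digit_code(Char) :- Char >= 0'0, Char =< 0'9.
```

The second clause of sign/1 is the interesting one: because each state
receives the current character as an argument, it can hand that
character on to the next state untouched.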
> 	Such representation can
> 	be more easily converted to the BSI's variant of get:
> 	s1 :-
> 		% do the corresponding action
> 		( get0(Char) -> s1(Char)
> 		;
> 		  accept
> 		).
This doesn't generalise as well as the end-marker version.
Here is the kind of thing one is constantly doing:

	rest_identifier(Char, [Char|Chars], After) :-
		is_csymf(Char),
		!,
		get0(Next),
		rest_identifier(Next, Chars, After).
	rest_identifier(After, [], After).

See how this code can treat the end marker just like any other
character:  because it doesn't pass the is_csymf/1 test (copied from
Harbison & Steele, by the way) we'll pick the second clause, and there
is no special case needed for an identifier which happens to be at the
end of a stream.

The fail-at-end approach forces us not only to do something special
with the get0/1 in rest_identifier/3, but in everything that calls it.
(In the Prolog tokeniser, there are two such callers.)

The point is that if-then-elses such as Meier suggests start
appearing all over the place like maggots in a corpse if you adopt
the fail-at-end approach, to the point of obscuring the underlying
automaton.
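
For comparison, here is a rough sketch of what rest_identifier/3 might
turn into under fail-at-end.  This is my guess at the shape, not code
from any BSI document: next_char/1 is a hypothetical stand-in for an
input primitive that fails at end of file, the atom eof is an arbitrary
choice, and is_csymf/1 is given a simple ASCII definition only so that
the sketch is self-contained.

```prolog
% Hypothetical stand-in for a primitive that fails at end of file.
% It is built on get0/1, which is assumed to return -1 there.
next_char(Char) :-
    get0(Char),
    Char =\= -1.

% Assumed ASCII definition: letters and underscore.
is_csymf(Char) :- Char >= 0'a, Char =< 0'z.
is_csymf(Char) :- Char >= 0'A, Char =< 0'Z.
is_csymf(Char) :- Char =:= 0'_.

% rest_identifier_f(+Char, -Chars, -After)
% After is now either a character code or the atom eof,
% so every caller has to test for both kinds of result.
rest_identifier_f(Char, [Char|Chars], After) :-
    is_csymf(Char),
    !,
    (   next_char(Next) ->
        rest_identifier_f(Next, Chars, After)
    ;   Chars = [],
        After = eof
    ).
rest_identifier_f(After, [], After).
```

The if-then-else has appeared inside the loop, and the caller still has
to fetch the first character with its own end-of-file test before it can
call this at all.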

> 	I must say, none of the two seems to me satisfactory. Richard's
> 	version is not portable due to the -1 as eof character.

If the standard were to rule that -1 was the end of file character,
it would be precisely as portable as anything else in the standard!
In strict point of fact, the Prolog-in-Prolog tokeniser was written
in DEC-10 Prolog for DEC-10 Prolog, and used 26 as the end of file
character, and 31 as the end of line character.  It took 5 minutes
with an editor to adapt it to Quintus Prolog.  I wish C programs
written for UNIX took this little effort to port!

> 	for a Prolog system it is better to have get0/1 return
> 	some *portable* eof (e.g the atom end_of_file, for get0/1
> 	there can be no confusion with source items) instead of
> 	some integer.

It is important that the end-of-file marker, whatever it is, should be
the same kind of thing, in some sense, as the normal values, so that
classification tests such as is_lower/1, is_digit/1, and so on will
just fail quietly for the end-of-file marker, not report errors.  Since
end of file is rare, we would like to test the other cases first.
Pop-2 on the DEC-10 returned integers almost all the time, except that
at the end of a stream you got an end-of-file object which belonged to
another data type (there was only one element of that data type, and it
printed as ^Z).  This was in practice a major nuisance, because before
you could do anything other than an equality test with the result, you
had to check whether it was the end of file mark.
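
A small sketch of what "the same kind of thing" buys you, assuming the
end marker is the integer -1.  char_class/2 and its class names are
invented for this example.  The common cases are tested first, and the
range comparisons simply fail for -1 without any guard in front of them.

```prolog
% char_class(+Code, -Class): hypothetical classifier.
% Assumes end of file is represented by the integer -1.
char_class(Code, Class) :-
    Code >= 0'a, Code =< 0'z,
    !,
    Class = lower.
char_class(Code, Class) :-
    Code >= 0'0, Code =< 0'9,
    !,
    Class = digit.
char_class(-1, Class) :-      % the rare case comes last
    !,
    Class = end.
char_class(_, other).
```

Had the marker been an atom, each arithmetic comparison above would
have needed an integer(Code) test in front of it in any system that
reports an error when asked to compare a non-number.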

I have been giving out copies of the Prolog-in-Prolog tokeniser to show
how easy it is to program character input with the Edinburgh Prolog
approach.  If someone would give me a tokeniser for BSI Prolog written
entirely in BSI Prolog using the fail-at-end approach, and if that
tokeniser were about as readable as the Prolog-in-Prolog one, that would
go a long way towards convincing me that fail-at-end was a good idea.

> 	BSI objects that if [read/1] returns e.g. the atom end_of_file
> 	then any occurrence of this atom in the source file
> 	could not be distinguished from a real end of file.

That's not a bug, it's a feature!  I'm serious about that.  At Edinburgh,
I had the problem that if someone asked me for help with Prolog, they
might be using one of four different operating systems, where the end
of file key might be
	^Z
or	^D
or	^Y
or	something else which I have been glad to forget.
No problem.  I could always type
	end_of_file.
to a Prolog listener, and it would go away.  Oh, this was so nice!
In fact, on my SUN right now I have function key F5 bound to
"end_of_file.\n" so that I can get out of Prolog without running the
risk of typing too many of them and logging out.

Another thing it is useful for is leaving test data in a source file.
One can do
	<declarations>
	<clauses>
	end_of_file.
	<test cases>
and include the test cases in the program or not just by moving the
end_of_file around.
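
The reason this works is that a loader only ever sees what read/1
returns.  A minimal sketch of such a loop, assuming read/1 yields the
atom end_of_file both at the real end of the stream and for a term
"end_of_file." in the text (read_clauses/1,2 are names made up for this
example):

```prolog
% read_clauses(-Clauses): collect terms from the current input
% up to the first end_of_file, real or typed.
read_clauses(Clauses) :-
    read(Term),
    read_clauses(Term, Clauses).

% Term == end_of_file rather than head unification, so that a
% variable read from the file is not mistaken for the end marker.
read_clauses(Term, []) :-
    Term == end_of_file,
    !.
read_clauses(Term, [Term|Clauses]) :-
    read(Next),
    read_clauses(Next, Clauses).
```

Given the text "a. b. end_of_file. c." this collects a and b and never
looks at c.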

Ah, you'll say, but that's what nested comments are for!
Well no, they don't work.  That's right, "#| ... |#" is NOT a reliable
way of commenting code out in Common Lisp, and "/* ... */" is NOT a
reliable way of commenting code out in PopLog.  But end_of_file, in
Edinburgh Prolog, IS a reliable way of commenting out the rest of the file.

> 	In this case, a remedy would be the introduction of

Prolog needs a remedy for end_of_file like Elisabeth Schwarzkopf
needs a remedy for her voice.

Before taking end_of_file away from me, the BSI committee should supply
me with a portable way of exiting a break level and a reliable method of
leaving test cases in a file without having them always read.