Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site ima.UUCP
Path: utzoo!decvax!ima!johnl
From: johnl@ima.UUCP (Compilers mailing list)
Newsgroups: mod.compilers
Subject: Re: Generator of Lexical Analyzers, mini-review
Message-ID: <159@ima.UUCP>
Date: Sat, 12-Jul-86 19:05:37 EDT
Article-I.D.: ima.159
Posted: Sat Jul 12 19:05:37 1986
Date-Received: Sat, 12-Jul-86 22:11:03 EDT
References: <138@ima.UUCP>
Reply-To: decvax!utzoo!henry
Lines: 90
Approved: <compilers@ima.UUCP>

On reflection, I think these issues deserve a bit more comment, and I
suspect that some aspects are of sufficiently general interest to make
this a followup rather than a private reply.

> ... one needs to pay close
> attention to lexing.  Have you ever seen LEX used for a production compiler?
> Cc, pc, cpp don't use LEX, nor do most other frequently used compilers.

At no time was I defending the use of LEX for production compilers.  LEX's
strong points are convenience and flexibility, which suit it well to things
like experimenting with notation.  For example, I tend to use LEX when I'm
basically inventing a new language for some specialized job, and I have no
particularly good idea of what it's going to look like.  In this situation,
efficiency is not my major concern.  I would not use LEX for a production
compiler, but that wasn't the issue.

> "... lot of machinery ...".  I like to look at time and space requirements
> of code.  The GLA generated lexer uses a lot less of these than a LEX lexer.

This is an unrealistic comparison, since we have already agreed that LEX
is unsuited to this application.  Actually, when I said "lots of machinery",
I wasn't referring so much to time and space as to the complexity of the
human interface of the scanner generator.

> I assume Henry's comment was really referring to the auxiliary GLA
> software...

Actually, the auxiliary software strikes me as the most valuable part
of GLA, since (as Bob points out) much of it has to be done anyway, and
doing it well is a hassle.  My objection to GLA is that it's not clear
to me that one really needs to top off this useful software with an
elaborate and quite inflexible scanner generator.

>  >I have written lexical analyzers, including two for C.)
> 
> Why did you write several lexers by hand?
> Was it because LEX and regular expressions just did not fit the problem
> at hand? or Time/Space?

LEX could probably have handled the syntax, but since these were meant
to be production-quality scanners, LEX's efficiency problems made it
unacceptable.  A contributing factor for the C lexers in particular
was a desire to avoid dependencies on non-trivial Unix-specific utilities.

> How long does it take to hand build a fast reliable lexer for an
> arbitrary programming language?

Not long, if you (a) bear in mind that programming languages use very
stereotyped lexical forms -- they are not "arbitrary", and (b) work from
an existing high-quality design rather than starting from scratch.  Note
that I did not advocate starting from scratch each time; I advocated
re-using existing code, such as a "boilerplate" scanner.  This is a
much-neglected approach in problems which are (1) stereotyped, (2) too
variable for a library function, and (3) too simple for a program generator.
It doesn't necessarily generate a grade-A scanner, but it yields a B+
simply and quickly.
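
To make the "boilerplate" idea concrete, here is a minimal sketch in C.
It is not taken from any of the scanners mentioned above; the names
(tok_kind, struct token, next_token) and the choice of scanning a
NUL-terminated string are assumptions made purely for illustration.
It covers only the most stereotyped forms: white space, identifiers,
decimal numbers, and single-character punctuation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds; a real language would extend this list. */
enum tok_kind { TOK_EOF, TOK_IDENT, TOK_NUMBER, TOK_PUNCT };

struct token {
    enum tok_kind kind;
    const char *start;  /* first character of the lexeme */
    size_t len;         /* lexeme length in characters */
};

/* Return the next token at *pp and advance *pp past it. */
struct token next_token(const char **pp)
{
    const char *p;
    struct token t;

    assert(pp != NULL && *pp != NULL);
    p = *pp;

    /* Skip white space between tokens. */
    while (isspace((unsigned char)*p))
        p++;

    t.start = p;
    if (*p == '\0') {
        t.kind = TOK_EOF;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        /* Identifier: letter or underscore, then letters/digits/underscores. */
        t.kind = TOK_IDENT;
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
    } else if (isdigit((unsigned char)*p)) {
        /* Decimal number: a run of digits. */
        t.kind = TOK_NUMBER;
        while (isdigit((unsigned char)*p))
            p++;
    } else {
        /* Anything else is treated as one-character punctuation. */
        t.kind = TOK_PUNCT;
        p++;
    }

    t.len = (size_t)(p - t.start);
    *pp = p;
    return t;
}
```

Adapting something like this to a new language is mostly a matter of
adding branches for that language's comments, strings, and multi-character
operators, which is the sense in which the job is too variable for a
library function but too simple for a program generator.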

> >... It's not even
> >very versatile at handling programming languages; for example, it can't
> >handle C's hexadecimal numbers or string continuations.
> 
> ...  There is no inherent limitations in GLA
> that prevent recognition of C hex numbers, or strings that span lines.

My point here was not that there is some sort of intrinsic limit, but
that a piece of software which claims to be a generic lexer generator
in fact cannot generate a lexer for a common, important, not-too-messy
language.  This actually is the heart of my objection to GLA:  it makes
me learn a fairly elaborate piece of machinery which copes with only a
narrow range of jobs.  I strongly suspect that it was built for doing
one particular language and variants thereon, and nobody took the trouble
to generalize it.
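
For a sense of scale, here is a rough sketch of how a hand-coded scanner
might recognize the two C forms in question.  This is an illustration
only, not GLA code and not taken from any real C lexer; the function
names scan_hex and scan_string are hypothetical, integer suffixes are
ignored, and escape sequences are skipped rather than validated.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length of a C hex constant ("0x" or "0X" plus hex digits) at s,
 * or 0 if s does not start with one.  Suffixes are not handled. */
size_t scan_hex(const char *s)
{
    size_t n;

    assert(s != NULL);
    if (s[0] != '0' || (s[1] != 'x' && s[1] != 'X') ||
        !isxdigit((unsigned char)s[2]))
        return 0;
    n = 2;
    while (isxdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Length of a C string literal at s, including both quotes, or 0 if
 * it is missing or unterminated.  A backslash followed by a newline
 * continues the string onto the next line; a bare newline ends it
 * in error. */
size_t scan_string(const char *s)
{
    size_t n;

    assert(s != NULL);
    if (s[0] != '"')
        return 0;
    n = 1;
    while (s[n] != '\0' && s[n] != '"') {
        if (s[n] == '\n')
            return 0;           /* newline without a preceding backslash */
        if (s[n] == '\\' && s[n + 1] != '\0')
            n++;                /* skip the escaped character, newline included */
        n++;
    }
    return s[n] == '"' ? n + 1 : 0;
}
```

For example, scan_hex applied to "0x1F;" consumes four characters, and
scan_string accepts a literal whose line break is preceded by a backslash
while rejecting one with a bare newline.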

Despite my negative comments earlier, I actually do believe that there is
a place for a good programming-language-oriented lexer generator.  However,
the current GLA is not it.  What's needed is something that is relatively
straightforward to use (GLA strikes me as marginal here), consistently
generates grade-A scanners (GLA probably does), and is flexible enough
to handle most programming languages without drastic measures like
hand-editing the resulting code (GLA falls down badly here).  Anybody
want to write one?

				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry
-- 
Send compilers mail to ima!compilers or, in a pinch, to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbncca}!ima

Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request