Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site ima.UUCP
Path: utzoo!decvax!ima!johnl
From: johnl@ima.UUCP (Compilers mailing list)
Newsgroups: mod.compilers
Subject: Re: Generator of Lexical Analyzers, mini-review
Message-ID: <138@ima.UUCP>
Date: Tue, 24-Jun-86 11:30:42 EDT
Article-I.D.: ima.138
Posted: Tue Jun 24 11:30:42 1986
Date-Received: Tue, 24-Jun-86 17:55:57 EDT
Reply-To: decvax!allegra!boulder!bob (Bob Gray)
Lines: 94
Approved: <compilers@ima.UUCP>

Thank you Henry for the mini-review on GLA.  There are a couple
of points that I would like to bring out.  I'm not sure there
was a need for you to be "impressed".  GLA merely generates a scanner
for programming languages that runs fast.  I think the main point
is 1) why did we build GLA, and 2) why use it.  You answered both points:

1> It may be fast, but it's a lot of machinery for rather modest results.

Well, it is fast.  Let me give some motivation. Last fall I was 
comparing and measuring the speed/size of GAG generated compilers with
hand written compilers.  Based on rumors, I had expected that the semantic
processing was going to be terrible for the GAG generated compilers.
Well surprise, lexing and parsing took 75% of the analysis time (no codegen).
It was the "easy" stuff that slowed down the GAG generated compiler!
We replaced the GAG provided lexer and parser with an earlier GLA, and
and a fast parser we were working on.  The resulting analyzer (for pascal)
ran faster than 4.2BSD pc, but not as fast as VMS-PASCAL.

There is plenty of additional evidence that shows that one needs to pay close
attention to lexing.  Have you ever seen LEX used for a production compiler?
Cc, pc, cpp don't use LEX, nor do most other frequently used compilers.
People write their own lexers because at this phase of the compilation
process, EVERY character has to be examined and you want this examination to
go very fast.  (Parsing looks at tokens, which occur 5-10 times less
frequently than characters of input).

"... lot of machinery ...".  I like to look at time and space requirements
of code.  The GLA generated lexer uses a lot less of these than a LEX lexer.
I assume Henry's comment was really referring to the auxiliary GLA
software for source input, error reporting in the source coordinates,
string memory (you don't want to use malloc), identifier table and
keywords, and denotation value management.  Both LEX and the hand
approach require that you provide all of this yourself AND work out
the non-trivial interfacing details to the parser.

2>The problem of scanning conventional programming languages just
 >isn't complex enough to justify a program-generator system.
 >I have written lexical analyzers, including two for C.)

Now here is the meat of the discussion.  You imply that scanning
is not very complex (true), but does this mean that lexers should be
constructed by hand?
I definitely don't believe that just because something is easy
(arithmetic for example), that we should ignore tools (calculators).
Why not use whatever tools help accomplish a quality product.
Furthermore, how many people know about or care to deal with the subtle
issues of lexing such as compact case labels?  Your lexer will run
up to 25% slower if you "switch" on the input character.  (Anyone interested
should see W.M. Waite 'Cost of Lexical Analysis', Software P&E 16 1986).

And what about reliability?  The code to properly handle tabs,
comments and floating point numbers is tricky.  How about
efficient keyword recognition?  There are so many pitfalls.
How long does it take to hand build a fast reliable lexer for an
arbitrary programming language?  I would much rather give a specification
to a tool so I will have time to work on hard, interesting problems.

>It is very much
>oriented towards fairly conventional programming languages, and is not
>a general-purpose substitute for LEX (I don't think they ever claimed
>it was, but the point should be made more explicitly).  It's not even
>very versatile at handling programming languages; for example, it can't
>handle C's hexadecimal numbers or string continuations.

Yes, GLA is not a regular expression tool, but please acknowledge that
one often has to step outside of regular expressions to handle common
lexing problems.   (Every see the contortions necessary to process
a C style comment in LEX?)  Why did you write several lexers by hand?
Was it because LEX and regular expressions just did not fit the problem
at hand? or Time/Space?  There is no inherent limitations in GLA
that prevent recognition of C hex numbers, or strings that span lines.
It is just that we did not implement that stuff in the current version.
It is not hard to take the produced lexer (in C code) and add things
we have not implemented.

>To my mind, the most useful part of the package would be several libraries
>provided with it, encapsulating things like string storage and target-
>machine integer arithmetic.  Given the limited domain of application of
>GLA, I suspect that a better approach would be to invest a bit more effort
>on the libraries and supply a "boilerplate" C scanner invoking them.

This IS the approach used, just don't trivialize the "boilerplate".
GLA dynamically has to produce the boilerplate or else the resulting
scanner will suffer in speed.  Yes, GLA is not a exciting, complex tool-
but it saves lots of tedious time when one needs a lexer.

-bob gray  bob@boulder.csnet {seismo!hao, ucbvax!nbires}!boulder!bob

-- 
Send compilers mail to ima!compilers or, in a pinch to Levine@YALE.EDU
Plausible paths are { ihnp4 | decvax | cbosgd | harvard | yale | bbncca}!ima

Please send responses to the originator of the message -- I cannot forward
mail accidentally sent back to compilers.  Meta-mail to ima!compilers-request