Path: utzoo!attcan!uunet!pmafire!uudell!sequoia!execu!cs.utexas.edu!usc!zaphod.mps.ohio-state.edu!caen!uflorida!travis!tom
From: tom@ssd.csd.harris.com (Tom Horsley)
Newsgroups: comp.std.c++
Subject: Re: design by committee (was: templates and exceptions in g++?)
Message-ID: <TOM.90Dec1214729@hcx2.ssd.csd.harris.com>
Date: 2 Dec 90 02:47:29 GMT
References: <1016@zinn.MV.COM> <1990Nov23.211727.2802@zoo.toronto.edu>
	<1990Nov25.161506.9659@tsa.co.uk>
	<TOM.90Nov26130042@hcx2.ssd.csd.harris.com> <533@taumet.com>
Sender: news@travis.csd.harris.com
Organization: Harris Computer Systems Division
Lines: 49
In-reply-to: steve@taumet.com's message of 1 Dec 90 23:36:23 GMT

>>>>> Regarding Re: design by committee (was: templates and exceptions in g++?); steve@taumet.com (Stephen Clamage) adds:

steve> Our original straightforward implementation of trigraphs
steve> caused a 15% slowdown of the compiler front end.  We spent quite a bit
steve> of time finding an efficient way to handle them, and reduced the
steve> overhead to about 5%.  Please note this affects every program ever
steve> compiled, even ones which contain no trigraphs.

I don't want to sound too insulting here, but I would say you have a seriously
flawed design. I worked on an ANSI C scanner as a sort of academic exercise
while trying to fully understand the way the macro processor works, and my
scanner has no additional overhead to speak of even if you do use trigraphs.

The key to making this work fast is recognizing that you have to examine
each character in the buffer to classify it as you go along anyway. I used a
<ctype.h>-like array that marked "interesting" characters and embedded the
check in a getc()-like macro. The macro normally returns the next character
using inline code, but if an interesting character shows up it calls a
subroutine to do additional processing.  A '\0' character is interesting
because I might have to re-fill the buffer. A '\\' character is interesting
because it might be followed by a newline and both of them will have to be
squeezed out (remember that a backslash followed by a newline has always
been a special sequence you had to check for even before question-mark
question-mark came along - the overhead for trigraphs is no worse than
this).  With trigraphs, '?' is now also an interesting character.  Sticking
an extra check for the ?? trigraph sequence in the subroutine that is only
invoked when an interesting character comes along does not cost that much
extra (unless you have a LOT of question marks in your source code). The
tricky parts are refilling the buffer whenever you are within 4 characters
of its end, so that a trigraph or backslash-newline never straddles a refill,
and handling the case of a line terminated by ??/ followed by a newline.

When I do find something like a trigraph or a \ newline, I squeeze it out
and replace it with what really belongs there. The routine knows where the
current token starts in the buffer, so it just shifts the already-scanned
part of the token right to take up the slack, then it returns the proper
character and scanning continues normally. This allows me to handle the
phases of translation that process trigraphs and backslash newlines
transparently in the GetNextCharacter macro while I am also busting up the
source into tokens.  I can also leave
the tokens in the input buffer without wasting the time copying them around
unless I have to do something like squeeze out a trigraph.
--
======================================================================
domain: tahorsley@csd.harris.com       USMail: Tom Horsley
  uucp: ...!uunet!hcx1!tahorsley               511 Kingbird Circle
                                               Delray Beach, FL  33444
+==== Censorship is the only form of Obscenity ======================+
|     (Wait, I forgot government tobacco subsidies...)               |
+====================================================================+