Path: utzoo!utgpu!news-server.csri.toronto.edu!clyde.concordia.ca!uunet!mcsun!unido!mikros!mwtech!martin
From: martin@mwtech.UUCP (Martin Weitzel)
Newsgroups: comp.lang.c
Subject: Re: LEX with all eight bits?
Keywords: LEX
Message-ID: <691@mwtech.UUCP>
Date: 22 Mar 90 20:04:35 GMT
References: <9463@discus.technion.ac.il>
Reply-To: martin@mwtech.UUCP (Martin Weitzel)
Organization: MIKROS Systemware, Darmstadt/W-Germany
Lines: 40

In article <9463@discus.technion.ac.il> joel%techunix.bitnet@jade.berkeley.edu (Yossi (Joel) Hoffman) writes:
>Hi folks! I was trying to use LEX to process a text (yes, text) file
>that happens to use all eight bits (the 8th bit signifies Hebrew text).
>I just inserted the 8-bit letters in the usual way, but LEX choked on
>it. (It didn't produce any C output at all.) This couldn't just be
>a coincidence; is there anyway I can tell LEX that I'm going to use
>all 8 bits?

>Any help will be much appreciated.

Though there are some efforts to make U*IX '8-bit clean', I have not yet
seen an implementation of 'lex' which supports 8-bit chars. The major
problem is that 'lex' uses the 8th bit for its own purposes in the
compiled representation of the regular expressions (and it seems that no
one at AT&T or at the software companies which port U*IX is willing to
dig into the sources of 'lex' ... :-()

SO BE AWARE: Even if 'lex' produces a compilable 'lex.yy.c', the
behaviour may be strange if you feed it input with the 8th bit set!

(This specific problem hit me some time ago and I spent hours tracking
down the root of the behaviour. The pity is that only a *few* characters
trigger the erratic behaviour. So if SOME test input seems to be
processed correctly under SOME circumstances, you have no guarantee that
ALL input will be processed correctly under ALL circumstances!)

Whether there are workarounds or not depends on your problem: If you
only want to process all chars with the high bit set in some more or
less uniform way, you may roll your own version of the 'input' macro and
translate the 8-bit chars to some other representation. E.g. you can
establish a buffer which parallels 'yytext' where you store the 'real'
input, but let the macro return one common representation for all
characters that you treat in the same way anyhow.

[To the poster: If you need any further hints, mail me a little more
about your problem.]

As a general rule, avoid characters outside the range 1 .. 127 in your
input as well as in the regular expression specification!

(BTW: Who knows how the PD version FLEX handles this?)
-- 
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83
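
A minimal sketch in C of the translation step behind the workaround
above: every byte with the high bit set is replaced by one common 7-bit
code, while the 'real' bytes are kept in a parallel buffer at the same
index. This is not the 'input' macro of any particular 'lex'. The names
fold_high_bit and HIGH_BIT_PLACEHOLDER and the value 0x01 are
assumptions made for illustration only; pick a placeholder that cannot
occur in your real input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for every high-bit byte; must lie in 1 .. 127. */
#define HIGH_BIT_PLACEHOLDER 0x01

/*
 * Copy n raw bytes into shadow[] unchanged and into folded[] with every
 * high-bit byte replaced by HIGH_BIT_PLACEHOLDER. Returns the number of
 * bytes that were replaced. Both output buffers must hold at least n bytes.
 */
size_t fold_high_bit(const unsigned char *raw, size_t n,
                     unsigned char *folded, unsigned char *shadow)
{
    size_t i, replaced = 0;

    assert(raw != NULL && folded != NULL && shadow != NULL);
    for (i = 0; i < n; i++) {
        shadow[i] = raw[i];                  /* the 'real' input, same index */
        if (raw[i] & 0x80) {
            folded[i] = HIGH_BIT_PLACEHOLDER; /* what the scanner would see */
            replaced++;
        } else {
            folded[i] = raw[i];              /* 7-bit bytes pass through */
        }
    }
    return replaced;
}
```

In a real scanner the same folding would have to happen inside your own
'input' replacement, one character at a time, and the parallel buffer
would have to stay in step with whatever the scanner pushes back through
'unput'. That bookkeeping depends on the 'lex' implementation and is not
shown here.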