Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!hplabs!hpda!hpcupt1!hpirs!wk
From: wk@hpirs.HP.COM (Wayne Krone)
Newsgroups: comp.unix.wizards
Subject: Re: POSIX Regular Expression Funnyness
Message-ID: <4760014@hpirs.HP.COM>
Date: 7 Feb 89 01:51:49 GMT
References: <4118f7b1.ae48@apollo.COM>
Organization: Hewlett Packard, Cupertino
Lines: 100

I was involved in the design of the internationalization extensions to
regular expressions as a member of both the X/Open and /usr/group
Internationalization working groups/committees.  I'm not the official
spokesperson for either group but I can probably answer most of your
questions.  In addition, many of the points raised are discussed by the
rationale in the P1003.2 draft (section 2.9.1, pages 58-64, of draft 8).

Why "[[:alpha:]]" instead of "[:alpha:]" ?

The committee was very concerned about the acceptability of any
extensions to the folks not actively involved in internationalization,
due to the possibility of breaking existing regular expressions.  One
of the ways we decided to minimize the risk was to require double
delimiters on all the new syntax.  For example:

        [[:lower:][:digit:]]

rather than

        [:lower::digit:]

Other reasons were to reduce ambiguity problems and to have delimiters
which visually indicated left/right closure of the new syntax.  More
details are in the draft.

Why allow "[[.ch.]]" instead of requiring the Spanish ch to be an ANSI C
multibyte character?

First, because no existing implementation does it that way (that we
know of).  Second, and more importantly, "c", "h" and "ch" as matched by
the RE "[a-z]" are collating elements, not characters.  Only "c" and "h"
are also characters and thus are represented in a single or multibyte
character code set.  As someone else noted, "Mac" and "Mc" could be
supported as collating elements but it is very unlikely a code set would
ever support them as single characters.
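
As a rough illustration of the backward-compatibility concern behind the
double delimiters: in a matcher with no class syntax, "[:alpha:]" is
already a legal bracket expression listing the literal characters ':',
'a', 'l', 'p' and 'h', so giving it a new meaning would change existing
patterns.  The sketch below uses Python's `re` module purely as a
stand-in for such a traditional matcher (an assumption made for
illustration; Python is not a POSIX RE implementation, and the helper
name `matches_whole` is made up here).

```python
import re

# Python's re has no POSIX character classes, so it reads "[:alpha:]"
# as a plain set of the literal characters ':', 'a', 'l', 'p', 'h'.
old_style = re.compile(r"[:alpha:]")


def matches_whole(pattern, text):
    """Return True if the compiled pattern matches all of text."""
    return pattern.fullmatch(text) is not None


# 'a', ':' and 'h' are members of the set; 'z' and 'B' are not, even
# though both are alphabetic.
for ch in ("a", ":", "h", "z", "B"):
    print(repr(ch), matches_whole(old_style, ch))
```

No error is raised for the pattern; it compiles and quietly matches a
different set of characters than "any letter".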

> What seems like a serious problem to me is that the required nesting
> makes the new expressions more difficult to use.  Further, misuse of
> them in this kind of obvious way leads to silent misbehavior from which
> it is difficult to surmise the bug.

That is, a user might do:

        [:alpha:]

but intended:

        [[:alpha:]]

and get no error message?  Well, the comment above is true, but if the
syntax were as suggested:

        [:alpha:]

but the user typed:

        [:alphz:]

the same silent misbehavior results.  I suppose it's a matter of
guessing which errors will be most common and optimizing the syntax for
that set.

> the thing that pisses me off is that they want to make \c where c is
> a regular (non-special) character exactly equivalent to c,
> rather than reserving it for future use. this is baffling to me;
> if we reserve \c in these cases, we have easy backward compatible ways
> of extending the syntax later on (like allowing more than 9 sub expressions).
> And i have no idea who they are protecting; people who have patterns
> like \t and expect them to match t ??

This was just a matter of documenting the behavior of the existing
implementations (as we know them :-).  Much of the regular expression
syntax/behavior is left unspecified by the traditional definition on the
ed man page and we found ourselves in the position of having to write
down something.  If your implementation differs or you just feel this is
an area worth improving, submit a proposal to P1003.2.

> There are more serious problems with the new expressions than just the
> obscure syntax.  A short while ago I had to design some verification
> tests for these new regular expressions as part of the X/Open verification
> suite (the latest X/Open standard incorporates POSIX).  I found some
> ambiguity in the area of 2 to 1 character mappings.  For example, if ch
> collates between c and d, which of the following REs should match the
> string "xchy"?
>
>       x[a-[.ch.]]y
>       x[a-[.ch.]]hy
>
> The simple answer would be to create some rule about 2 to 1 character
> mappings to eliminate the ambiguity.  However, whichever rule is

The rule which applies is the "longest leftmost match" rule which is
documented in XPG3 for the "RE*" syntax but unfortunately missing from
the square bracket rules.  So the answer for the examples above is
"both":

        x[a-[.ch.]]y    matches    x ch y
        x[a-[.ch.]]hy   matches    x c h y

> We have informed X/Open of the problem, and are waiting to see what they
> come up with.

That's interesting--I haven't seen any query posted to the
Internationalization Working Group.

Wayne Krone
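
The "both" answer for the two x[a-[.ch.]] patterns can be sketched
with a toy matcher.  This is only an illustration of the longest-match
idea with fallback, not how any real RE library is implemented.  The
collation table (`ORDER`), the pattern encoding (a list of literals
and `(low, high)` tuples) and the function names are all assumptions
invented for this example.

```python
import string

# Hypothetical collation order: the letters a-z, with the two-character
# collating element "ch" placed between "c" and "d".
ORDER = list(string.ascii_lowercase)
ORDER.insert(ORDER.index("d"), "ch")


def in_range(element, low, high):
    """True if element collates between low and high, inclusive."""
    if element not in ORDER:
        return False
    return ORDER.index(low) <= ORDER.index(element) <= ORDER.index(high)


def match(pattern, text):
    """Match pattern against all of text; return the pieces or None.

    pattern is a list whose items are either a literal string or a
    (low, high) tuple standing for the bracket expression [low-high].
    """
    if not pattern:
        return [] if text == "" else None
    head, rest = pattern[0], pattern[1:]
    if isinstance(head, tuple):
        low, high = head
        # Try the longest collating element first, then back off to a
        # single character if the rest of the pattern cannot match.
        for width in (2, 1):
            piece = text[:width]
            if len(piece) == width and in_range(piece, low, high):
                tail = match(rest, text[width:])
                if tail is not None:
                    return [piece] + tail
        return None
    if text.startswith(head):
        tail = match(rest, text[len(head):])
        if tail is not None:
            return [head] + tail
    return None


bracket = ("a", "ch")  # stands for [a-[.ch.]]
print(match(["x", bracket, "y"], "xchy"))       # ['x', 'ch', 'y']
print(match(["x", bracket, "h", "y"], "xchy"))  # ['x', 'c', 'h', 'y']
```

In the first pattern the bracket expression consumes "ch" as one
collating element.  In the second, taking "ch" leaves nothing for the
literal "h", so the matcher falls back to "c" alone.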