Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!tut.cis.ohio-state.edu!ucbvax!hplabs!hpda!hpcupt1!hpirs!wk
From: wk@hpirs.HP.COM (Wayne Krone)
Newsgroups: comp.unix.wizards
Subject: Re: POSIX Regular Expression Funnyness
Message-ID: <4760014@hpirs.HP.COM>
Date: 7 Feb 89 01:51:49 GMT
References: <4118f7b1.ae48@apollo.COM>
Organization: Hewlett Packard, Cupertino
Lines: 100

I was involved in the design of the internationalization extensions to
regular expressions as a member of both the X/Open and /usr/group
Internationalization working group/committees.  I'm not the official
spokesperson for either group, but I can probably answer most of your
questions.  In addition, many of the points raised are discussed by the
rationale in the P1003.2 draft (section 2.9.1, pages 58-64, of draft 8).

Why "[[:alpha:]]" instead of "[:alpha:]" ?

   The committee was very concerned about whether any extensions would
   be acceptable to the folks not actively involved in
   internationalization, because of the possibility of breaking existing
   regular expressions.  One of the ways we decided to minimize the risk
   was to require double delimiters on all the new syntax.  For example:

			[[:lower:][:digit:]]
   rather than
			[:lower::digit:]

   Other reasons were to reduce ambiguity problems and to have delimiters
   which visually indicated left/right closure of the new syntax.  More
   details are in the draft.

Why allow "[[.ch.]]" instead of requiring the Spanish ch to be an
ANSI C multibyte character?

   First, because no existing implementation does it that way (that we
   know of).  Second, and more importantly, "c", "h" and "ch" as matched
   by the RE "[a-z]" are collating elements, not characters.  Only "c"
   and "h" are also characters and thus are represented in a single or
   multibyte character code set.  As someone else noted, "Mac" and "Mc"
   could be supported as collating elements but it is very unlikely a
   code set would ever support them as single characters.
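
   To make the distinction concrete, here is a minimal sketch in Python
   of splitting a string into collating elements rather than characters.
   The function name and the "multi" parameter are hypothetical, and the
   list of multi-character elements is supplied by hand; a real
   implementation would take it from the locale's collation tables.

```python
def collating_elements(text, multi=("ch",)):
    """Split text into collating elements, preferring the longest candidate.

    `multi` is an assumed, hand-written list of multi-character collating
    elements; it stands in for whatever the locale would define.
    """
    candidates = sorted(multi, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for elem in candidates:
            if text.startswith(elem, i):
                out.append(elem)
                i += len(elem)
                break
        else:
            # No multi-character element starts here, so the single
            # character is its own collating element.
            out.append(text[i])
            i += 1
    return out


print(collating_elements("chico"))                       # ['ch', 'i', 'c', 'o']
print(collating_elements("McCoy", multi=("Mac", "Mc")))  # ['Mc', 'C', 'o', 'y']
```

   Note that "c" shows up both inside "ch" and on its own: the code set
   only needs to encode "c" and "h", while the collation rules decide
   when the pair acts as one element.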

> What seems like a serious problem to me is that the required nesting
> makes the new expressions more difficult to use.  Further, misuse of
> them in this kind of obvious way leads to silent misbehavior from which
> it is difficult to surmise the bug.

That is, a user might do:

			[:alpha:]
but intended:
			[[:alpha:]]

and get no error message?  Well, the comment above is true, but if the
syntax were as suggested:

			[:alpha:]
but the user typed:
			[:alphz:]

the same silent misbehavior results.  I suppose it's a matter of guessing
which errors will be most common and optimizing the syntax for that set.
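
For anyone who wants to see the silent misbehavior, here is a small
sketch.  It uses Python's "re" module purely as a stand-in for an
engine without the new syntax: "re" has no POSIX character classes, so
it reads "[:alpha:]" as an ordinary bracket expression listing the
characters ':', 'a', 'l', 'p' and 'h'.  The helper name "hits" and the
sample text are made up for illustration.

```python
import re

# Assumption: Python's re is used only as an example of an engine that does
# not know the [: :] syntax.  It is not a POSIX regular expression engine.
single = re.compile("[:alpha:]")   # set of ':', 'a', 'l', 'p', 'h'
typo = re.compile("[:alphz:]")     # set of ':', 'a', 'l', 'p', 'h', 'z'


def hits(pattern, text):
    """Return every single character of text that the pattern matches."""
    return pattern.findall(text)


text = "help: 42"
print(hits(single, text))  # ['h', 'l', 'p', ':']
print(hits(typo, text))    # ['h', 'l', 'p', ':']
```

Neither pattern raises an error, and neither matches the "e", which
any real alphabetic class would.  Both mistakes just quietly produce a
different set of characters.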

> the thing that pisses me off is that they want to make \c where c is
> a regular (non-special) character exactly equivalent to c,
> rather than reserving it for future use. this is baffling to me;
> if we reserve \c in these cases, we have easy backward compatible ways
> of extending the syntax later on (like allowing more than 9 sub expressions).
> And i have no idea who they are protecting; people who have patterns
> like \t and expect them to match t ??

This was just a matter of documenting the behavior of the existing
implementations (as we know them :-).  Much of the regular expression
syntax/behavior is left unspecified by the traditional definition on
the ed man page and we found ourselves in the position of having to
write down something.  If your implementation differs or you just feel
this is an area worth improving, submit a proposal to P1003.2.

> There are more serious problems with the new expressions than just the
> obscure syntax.  A short while ago I had to design some verification
> tests for these new regular expressions as part of the X/Open verification
> suite (the latest X/Open standard incorporates POSIX).  I found some
> ambiguity in the area of 2 to 1 character mappings.  For example, if ch
> collates between c and d, which of the following REs should match the
> string "xchy"?
> 
> 	x[a-[.ch.]]y
> 	x[a-[.ch.]]hy
>
> The simple answer would be to create some rule about 2 to 1 character
> mappings to eliminate the ambiguity.  However, whichever rule is

The rule which applies is the "longest leftmost match" rule, which
is documented in XPG3 for the "RE*" syntax but is unfortunately missing
from the square bracket rules.

So the answer for the examples above is "both":

	x[a-[.ch.]]y	matches    x ch y
	x[a-[.ch.]]hy	matches    x c h y
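
Here is a rough sketch in Python of how a matcher can arrive at both
answers.  Everything in it is an assumption made for illustration: the
collation order is written out by hand with "ch" between "c" and "d",
patterns are given as pre-parsed lists instead of RE strings, and the
names ORDER, RANK, in_range and match are hypothetical.  The bracket
expression tries the longest collating element first and falls back to
a shorter one only when the rest of the pattern cannot match.

```python
# Assumed collation order: "ch" sorts between "c" and "d".
ORDER = ["a", "b", "c", "ch"] + [chr(c) for c in range(ord("d"), ord("z") + 1)]
RANK = {elem: i for i, elem in enumerate(ORDER)}


def in_range(elem, lo, hi):
    """True if elem collates between lo and hi, inclusive."""
    return elem in RANK and RANK[lo] <= RANK[elem] <= RANK[hi]


def match(pattern, text, pos=0):
    """Match the whole of text[pos:] against pattern.

    pattern is a list whose items are either a literal string or a
    (lo, hi) tuple standing for the bracket expression [lo-hi].
    Returns the list of pieces consumed, or None if there is no match.
    """
    if not pattern:
        return [] if pos == len(text) else None
    head, rest = pattern[0], pattern[1:]
    if isinstance(head, str):
        if text.startswith(head, pos):
            tail = match(rest, text, pos + len(head))
            if tail is not None:
                return [head] + tail
        return None
    lo, hi = head
    # Longest collating element first; shorter one only as a fallback.
    for width in (2, 1):
        elem = text[pos:pos + width]
        if len(elem) == width and in_range(elem, lo, hi):
            tail = match(rest, text, pos + width)
            if tail is not None:
                return [elem] + tail
    return None


print(match(["x", ("a", "ch"), "y"], "xchy"))       # ['x', 'ch', 'y']
print(match(["x", ("a", "ch"), "h", "y"], "xchy"))  # ['x', 'c', 'h', 'y']
```

In the second pattern the bracket expression first takes "ch", the
trailing "hy" then fails, and the matcher retries with "c" alone.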

> We have informed X/Open of the problem, and are waiting to see what they
> come up with.

That's interesting--I haven't seen any query posted to the
Internationalization Working Group.

Wayne Krone