Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uflorida!haven!adm!smoke!gwyn
From: gwyn@smoke.BRL.MIL (Doug Gwyn )
Newsgroups: comp.unix.wizards
Subject: Re: POSIX Regular Expression Funnyness
Message-ID: <9552@smoke.BRL.MIL>
Date: 31 Jan 89 16:59:58 GMT
References: <4118f7b1.ae48@apollo.COM> <5980041@hpfcdc.HP.COM>
Reply-To: gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>)
Organization: Ballistic Research Lab (BRL), APG, MD.
Lines: 19

In article <5980041@hpfcdc.HP.COM> donn@hpfcdc.HP.COM (Donn Terry) writes:
>In Doug Gwyn's comments about [:ch:]  As far as character classes:
>these are specified by the natural language involved.  My Spanish is
>weak, but the *two characters* ch are treated as a single symbol with
>its own place in the collating sequence.  c and h can also appear
>independently, but when adjacent they are collated as another symbol.
>This is arguably a kluge, but it antedates the computer business by a
>few hundred years, and a few million users, so I doubt we can change it
>just for the sake of aesthetics.

My Spanish is not too weak and I'm well aware of ch, ll, nn (written
n-tilde), etc.  German also has some interesting features (e.g. ss when
capitalized).  However, we took all this stuff into account when coming
up with the multibyte character specifications in the proposed ANSI C
standard.  The "internationalization" community helped formulate that
approach, and it bothers me more than somewhat to see it being ignored
by 1003.2.  A reasonable implementation of a Spanish-language locale
requires that ch etc. be treated as multibyte sequences, not handled by
"grep" as multiple separate single-byte characters.
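
To make the distinction concrete, here is a minimal sketch in Python.
It is an illustration only, not code from the proposed ANSI C standard
or from 1003.2: the table MULTI_CHAR_ELEMENTS and the function
collating_elements are hypothetical names, and the table assumes the
traditional Spanish treatment of ch and ll as single symbols.  It splits
a word into the units a locale-aware tool would have to step over, as
opposed to the individual bytes a naive "grep" sees.

```python
# Hypothetical table of multi-character symbols for a traditional
# Spanish locale; illustrative only, not taken from any standard.
MULTI_CHAR_ELEMENTS = ("ch", "ll")


def collating_elements(word):
    """Split word into units, treating each listed digraph as one unit."""
    elements = []
    i = 0
    while i < len(word):
        for digraph in MULTI_CHAR_ELEMENTS:
            if word.startswith(digraph, i):
                # The adjacent pair forms a single symbol.
                elements.append(digraph)
                i += len(digraph)
                break
        else:
            # No digraph starts here: an ordinary single character.
            elements.append(word[i])
            i += 1
    return elements
```

Under these assumptions, collating_elements("chico") yields four units
(ch, i, c, o) where a byte-at-a-time view counts five, so a pattern
meant to match "any one character" would behave differently in the two
cases.  A c or h that is not part of an adjacent pair is still returned
on its own.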