Path: utzoo!utgpu!jarvis.csri.toronto.edu!mailrus!uflorida!haven!adm!smoke!gwyn
From: gwyn@smoke.BRL.MIL (Doug Gwyn )
Newsgroups: comp.unix.wizards
Subject: Re: POSIX Regular Expression Funnyness
Message-ID: <9552@smoke.BRL.MIL>
Date: 31 Jan 89 16:59:58 GMT
References: <4118f7b1.ae48@apollo.COM> <5980041@hpfcdc.HP.COM>
Reply-To: gwyn@brl.arpa (Doug Gwyn (VLD/VMB) )
Organization: Ballistic Research Lab (BRL), APG, MD.
Lines: 19

In article <5980041@hpfcdc.HP.COM> donn@hpfcdc.HP.COM (Donn Terry) writes:
>In Doug Gwyn's comments about [:ch:] As far as character classes:
>these are specified by the natural language involved. My Spanish is
>weak, but the *two characters* ch are treated as a single symbol with
>its own place in the collating sequence. c and h can also appear
>independently, but when adjacent they are collated as another symbol.
>This is arguably a kluge, but it antedates the computer business by a
>few hundred years, and a few million users, so I doubt we can change it
>just for the sake of aesthetics.

My Spanish is not too weak and I'm well aware of ch, ll, nn (written
n-tilde), etc. German also has some interesting features (e.g. ss
when capitalized). However, we took all this stuff into account when
coming up with the multibyte character specifications in the proposed
ANSI C standard. The "internationalization" community helped formulate
that approach, and it bothers me more than somewhat to see it being
ignored by 1003.2. A reasonable implementation of Spanish-language
locale requires that ch etc. be multibyte sequences, not handled as
multiple separate single-byte characters by "grep".
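
For readers unfamiliar with the collation rule under discussion, the
following is a minimal sketch in Python of how "ch" and "ll" can be
treated as single symbols with their own place in the collating
sequence. It assumes the traditional Spanish ordering (ch after c, ll
after l, n-tilde after n) and ignores accented vowels. The names
ALPHABET, RANK, collating_elements and sort_key are hypothetical and
come from neither 1003.2 nor the proposed ANSI C standard.

```python
# Illustrative sketch only: traditional Spanish collation in which the
# digraphs "ch" and "ll" each sort as one symbol. The alphabet below is
# an assumption made for this example; accented vowels are not handled.

ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
            "l", "ll", "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u",
            "v", "w", "x", "y", "z"]
RANK = {symbol: position for position, symbol in enumerate(ALPHABET)}


def collating_elements(word):
    """Split a word into collating elements, taking digraphs as one unit."""
    word = word.lower()
    elements = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            # Adjacent c+h or l+l collate as a single symbol.
            elements.append(pair)
            i += 2
        else:
            elements.append(word[i])
            i += 1
    return elements


def sort_key(word):
    """Map a word to a list of ranks; unknown characters sort last."""
    return [RANK.get(e, len(ALPHABET)) for e in collating_elements(word)]


if __name__ == "__main__":
    words = ["chico", "cuna", "dama", "color", "llama", "luz"]
    print(sorted(words))                # plain code-point order
    print(sorted(words, key=sort_key))  # digraph-aware order
```

With this key, "chico" sorts after "cuna" and "llama" after "luz",
which a comparison of separate single characters would not produce.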