Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!sundc!pitstop!sun!amdcad!ames!ucbcad!ucbvax!hplabs!sdcrdcf!ism780c!ism780b!greger
From: greger@ism780b (Greger Leijonhufvud)
Newsgroups: comp.std.internat
Subject: Re: A customable string-comparison package
Message-ID: <7757@ism780c.UUCP>
Date: Thu, 5-Nov-87 00:07:37 EST
Article-I.D.: ism780c.7757
Posted: Thu Nov 5 00:07:37 1987
Date-Received: Sun, 8-Nov-87 00:36:50 EST
References: <2428@enea.UUCP>
Sender: nobody@ism780c.UUCP
Reply-To: greger@ism780c.UUCP (Greger Leijonhufvud)
Followup-To: comp.std.internat
Organization: Interactive Systems Corp., Santa Monica CA
Lines: 77
Summary: Facility exists in standards and available systems

In article <2428@enea.UUCP> sommar@enea.UUCP(Erland Sommarskog) writes:
>
> comp.std.internat only>
>
>Some months ago I asked for a new character concept where
>the result of string comparisons should depend on the
>selected languages. I stated that the old ASCII concept
>one character <=> one collation value must go away.

I hope Erland and others are aware of the work done in the ANSI X3J11
and POSIX organizations. The problem with string compare was
extensively discussed during the "final" phase of X3J11, and
especially in the Internationalization "subcommittee".

The proposal (which, hopefully, will become a full standard)
identifies two specific new library functions intended to provide
support for collation which is not dependent on the physical encoding.
Both depend on some (user-selectable) external information on the
desired collation sequence.

The two functions are strcoll(3C) and strxfrm(3C). They differ in that
strcoll compares two strings (as strcmp does) according to the desired
collation order, while strxfrm transforms a string according to the
external information, such that a subsequent strcmp using the "native"
collation can be performed on the transformed strings.
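
To make the distinction concrete, here is a minimal sketch in C. It
assumes the interfaces int strcoll(const char *, const char *) and
size_t strxfrm(char *, const char *, size_t), and that the program has
already selected a collation order with setlocale(LC_COLLATE, ...).
The helper names (compare_once, make_key, sort_collated) and the
struct are made up for illustration and are not part of any proposal.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* One sort record: the original string plus its transformed key. */
struct keyed {
    const char *text;   /* original string, not owned by the record */
    char *key;          /* strxfrm() output, allocated with malloc */
};

/* Occasional compare: the library consults the collation tables
 * every time it is called. */
int compare_once(const char *a, const char *b)
{
    return strcoll(a, b);
}

/* Build the transformed key for s; returns NULL if malloc fails.
 * Calling strxfrm with a zero length only reports the size needed. */
char *make_key(const char *s)
{
    size_t n = strxfrm(NULL, s, 0) + 1;   /* +1 for the terminating NUL */
    char *key = malloc(n);

    if (key != NULL)
        strxfrm(key, s, n);
    return key;
}

/* qsort callback: plain strcmp on the pre-transformed keys. */
static int by_key(const void *pa, const void *pb)
{
    const struct keyed *a = pa;
    const struct keyed *b = pb;

    return strcmp(a->key, b->key);
}

/* Repeated compares: transform each string once, then sort the keys.
 * Reorders v in place; returns 0 on success, -1 if memory runs out
 * (in which case v is left unchanged). */
int sort_collated(const char **v, size_t n)
{
    struct keyed *recs;
    size_t i;
    int rc = 0;

    if (n == 0)
        return 0;
    recs = calloc(n, sizeof *recs);
    if (recs == NULL)
        return -1;

    for (i = 0; i < n; i++) {
        recs[i].text = v[i];
        recs[i].key = make_key(v[i]);
        if (recs[i].key == NULL)
            rc = -1;
    }
    if (rc == 0) {
        qsort(recs, n, sizeof *recs, by_key);
        for (i = 0; i < n; i++)
            v[i] = recs[i].text;
    }
    for (i = 0; i < n; i++)
        free(recs[i].key);          /* free(NULL) is harmless */
    free(recs);
    return rc;
}
```

In the default "C" locale this simply sorts by character value; the
language-dependent behaviour only shows up once a collation order has
been selected with setlocale.
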
Strcoll is useful for occasional compares, while strxfrm is intended
for repeated compares, as in a sort (the table-driven compares are
quite slow compared to the native compare, so a pre-transformation
before the actual sorting is quite advantageous).

Strcoll is also supported in the X/OPEN specifications, as nl_strcmp
and nl_strncmp.

Recently, the /usr/group Internationalization committee has made some
proposals to POSIX P1003.2 (commands & utilities) in the area of
regular expressions which draw heavily on these facilities.

In all cases, the collation order allows the "user" (actually, this is
more of an administrator type of job) to specify a collation order in
which the ordering is independent of character values. In addition,
the user can specify

1. that a string of characters sorts as one (example: Spanish ch
   and ll),
2. that one character sorts as a string (example: German double s),
3. that several characters can have the same collation order
   (example: accented e's sort with unaccented e),
4. that, if two strings containing such "equivalent" characters
   collate equal, then the order between them depends on a
   "secondary" collation value,
5. that characters can be designated as "don't care", i.e. are
   disregarded when comparing.

As can be seen, this does change the collation from character-oriented
to string-oriented.

And finally, there are several UNIX systems on the market (notably,
the X/OPEN ones, including HP's, and one from IBM) which do provide
this functionality.

If there is interest, I am more than happy to post more elaborate
descriptions of these things to the net.

------
Greger Leijonhufvud
INTERACTIVE Systems Corporation
Santa Monica, CA. 90404

"The above views does not represent anything but mine own..."
Reverse the polarity of the neutron flow!