Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.3 4.3bsd-beta 6/6/85; site l5.uucp
Path: utzoo!linus!decvax!decwrl!sun!l5!gnu
From: gnu@l5.uucp (John Gilmore)
Newsgroups: net.internat
Subject: The real work of internationalization
Message-ID: <191@l5.uucp>
Date: Sun, 13-Oct-85 17:08:42 EDT
Article-I.D.: l5.191
Posted: Sun Oct 13 17:08:42 1985
Date-Received: Tue, 15-Oct-85 07:21:39 EDT
References: <149@ecrcvax.UUCP> <518@talcott.UUCP>
Organization: Ell-Five [Consultants], San Francisco
Lines: 40

The issue of the 8th bit is not the real problem. It's clear that all the programs that hack the 8th bit will have to be rewritten. The ideal objective is for the same binaries to run anywhere in the world, in any font or language or currency or date/time format. [For now, let's not get off into currency/date/time conversions, and just talk about character set representation issues.]

What will cause a LOT of grief is fitting the large Asian character sets in. I saw a memo purported to come from somewhere in AT&T that seemed to be a mix of realism and brain damage. Some of the brain damage included:

  * a "long char" data type for C -- haven't they ever heard of "short"?

  * No "locking" select-character-set codes embedded in data streams (like what you'd send to a terminal to enter the "graphics character set"). Instead, they had two different ways to encode extended character sets (beyond 8-bit), and a bit OUTSIDE THE DATA STREAM (eg in the inode of a disk file) that said which format a file was in. The two formats were for places where 8-bit or >8-bit character sets were the norm.

I don't think either of those is a viable idea, but I'm not sure that a single representation will suffice UNLESS there are locking character set selections (so the first few bytes of your file would describe its default character sets, if strange).
Once you open that can, various other worms come out, like making sure those specs get propagated when you cut and paste in an editor, etc.

It's quite a job when you realize that unless ALL the Unix utilities process Asian characters as characters, the system will lose. Any volunteers to hack grep for 16-bit characters encoded in an 8-bit data stream with case shifts?

Of course stdio would be modified to encode and decode the extended character set, and that will do much of the work for us. Maybe that should be our first research project -- a public domain stdio that defines a standard programming interface to 16-bit characters and a standard datastream representation for them.
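
To make the decoding side of that concrete, here is a rough sketch in C of a decoder for 16-bit characters carried in an 8-bit stream with locking shifts. Everything specific in it is an assumption made for illustration, not something from the memo or from any existing stdio: the names wide_char, decoder, decoder_feed and decode_buffer are hypothetical, and so is the convention that the ASCII control codes SO (0x0E) and SI (0x0F) lock the stream into and out of two-byte mode, with those two byte values never appearing inside a two-byte character.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMED stream convention, for illustration only:
 *   SO (0x0E) locks the stream into two-byte mode,
 *   SI (0x0F) locks it back into one-byte mode.
 * The values 0x0E and 0x0F are assumed never to occur as the lead
 * or trail byte of a two-byte character. */
enum { SHIFT_OUT = 0x0E, SHIFT_IN = 0x0F };

/* Hypothetical 16-bit character type; unsigned short holds at least 16 bits. */
typedef unsigned short wide_char;

/* Decoder state that must survive between bytes (and between reads). */
typedef struct {
    int two_byte;          /* nonzero while locked in two-byte mode */
    int have_lead;         /* nonzero if a lead byte is waiting for its trail */
    unsigned char lead;    /* the pending lead byte */
} decoder;

void decoder_init(decoder *d)
{
    assert(d != NULL);
    d->two_byte = 0;       /* streams start in one-byte mode */
    d->have_lead = 0;
    d->lead = 0;
}

/* Feed one byte. Returns 1 and stores a character in *out when a
 * complete character is available, otherwise returns 0. */
int decoder_feed(decoder *d, unsigned char byte, wide_char *out)
{
    assert(d != NULL && out != NULL);

    if (byte == SHIFT_OUT || byte == SHIFT_IN) {
        d->two_byte = (byte == SHIFT_OUT);
        d->have_lead = 0;  /* a dangling lead byte is silently dropped */
        return 0;
    }
    if (!d->two_byte) {
        *out = byte;       /* one-byte mode: the byte is the character */
        return 1;
    }
    if (!d->have_lead) {
        d->lead = byte;    /* first half of a two-byte character */
        d->have_lead = 1;
        return 0;
    }
    *out = (wide_char)((d->lead << 8) | byte);
    d->have_lead = 0;
    return 1;
}

/* Decode a whole buffer; returns the number of characters written to out. */
size_t decode_buffer(const unsigned char *in, size_t n,
                     wide_char *out, size_t cap)
{
    decoder d;
    size_t i, count = 0;

    assert(in != NULL && out != NULL);
    decoder_init(&d);
    for (i = 0; i < n && count < cap; i++) {
        wide_char c;
        if (decoder_feed(&d, in[i], &c))
            out[count++] = c;
    }
    return count;
}
```

Under these assumptions, the six bytes 'a', 0x0E, 0x30, 0x42, 0x0F, 'b' decode to three characters: 'a', 0x3042, 'b'. The state structure is the point of the exercise: a utility that scans raw bytes without it cannot tell whether a given byte is a whole character, half of one, or a shift code, which is why grep and friends would have to work on decoded characters instead.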