Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!seismo!husc6!cmcl2!brl-adm!brl-smoke!gwyn
From: gwyn@brl-smoke.ARPA (Doug Gwyn )
Newsgroups: comp.lang.c
Subject: Re: What is a byte
Message-ID: <6216@brl-smoke.ARPA>
Date: Sat, 1-Aug-87 23:29:39 EDT
Article-I.D.: brl-smok.6216
Posted: Sat Aug 1 23:29:39 1987
Date-Received: Sun, 2-Aug-87 10:57:43 EDT
References: <218@astra.necisa.oz> <142700010@tiger.UUCP>
Reply-To: gwyn@brl.arpa (Doug Gwyn (VLD/VMB) )
Organization: Ballistic Research Lab (BRL), APG, MD.
Lines: 41

In article <851@haddock.ISC.COM> karl@haddock.ima.isc.com (Karl Heuer) writes:
>I haven't given this issue a whole lot of thought, but it seems to me that
>"short char" should be the smallest object which is addressible in C, and
>should define the units of sizeof; "long char" should denote whatever is
>necessary to represent the native character set.  On a bit-addressible machine
>in an Arabic- or Japanese-language environment, one might have "short char" be
>1 bit, "char" be 8, and "long char" be 16.

That is a bit more generous than my proposal, but it follows the same
line of thought.  I would prefer that a (char) be capable of holding an
entire basic textual unit, since many applications are already based on
that assumption.  A separate (long char) would necessitate a whole extra
collection of str*()-like library routines, which the portable programmer
would have to be careful to use instead of the str*() functions; might as
well simply make (char) be the right thing and not introduce a new type.

Using all three possible char lengths would not pose a serious problem
if the str*() functions were changed to require (long char) and if
implementations made (char) and (long char) the same size, at least for
now.  There aren't many bit-addressable architectures at present (more's
the pity), so most international implementations could make (short char)
8 bits and (char) or (long char) 16 bits.
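
To make the cost of a separate (long char) concrete, here is a minimal
sketch of two of the str*()-like routines it would call for.  Everything
named here is an assumption made for illustration: (long char) is not a
real C type, so a typedef called long_char stands in for it, and the
routine names lstrlen() and lstrcpy() are hypothetical, not part of any
library or proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the proposed (long char); not a real C type.
   unsigned short is at least 16 bits wide on any conforming compiler. */
typedef unsigned short long_char;

/* Hypothetical wide counterpart of strlen(): counts elements, not
   bytes, up to but not including the terminating zero element. */
size_t lstrlen(const long_char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical wide counterpart of strcpy(): copies every element of
   src, including the terminating zero, and returns dst. */
long_char *lstrcpy(long_char *dst, const long_char *src)
{
    long_char *d = dst;

    assert(dst != NULL && src != NULL);
    while ((*d++ = *src++) != 0)
        ;
    return dst;
}
```

Each routine is a line-for-line twin of its str*() original with only
the element type changed, which is the duplication the portable
programmer would have to keep track of.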

>If this is to be phased in without breaking a lot of programs, X3J11 should
>immediately bless all three names, but insist that they all be the same size.
>(Which restriction should be deprecated, to be removed in the next standard.)

I don't think it's within the realm of practical politics to say that
the problem will not be solved until the next issue of the standard.
It would be better if it can be solved now without too much breakage
of existing non-internationalized code.  (Internationalized code is
already vendor-specific, due to lack of agreement on a universal
approach.  Any good solution will require at least some vendors to
eventually change.)

On a related note, I see that the /usr/group people are trying to
change regular expressions from character-code based to language
alphabet based (as though there were always a universal collating
order for a given language!).  This is a most unfortunate direction,
since it ruins simple, well-behaved algorithms written in terms of
(sufficiently wide) (char)s.  I wish they would not plow ahead with
this until the multi-byte character issue is resolved, since that
may well affect the practical possibilities.
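
As a rough illustration of the kind of simple algorithm meant here, the
sketch below tests a bracket range such as [a-z] purely by numeric
character code.  The typedef long_char and the function names
in_code_range() and span_code_range() are hypothetical, introduced only
for this example; they come from no library and no regular expression
implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "sufficiently wide" character type, for illustration. */
typedef unsigned short long_char;

/* Return 1 if code c lies in the range lo..hi inclusive, judged only
   by numeric character code, with no reference to any collating order. */
int in_code_range(long_char c, long_char lo, long_char hi)
{
    return lo <= c && c <= hi;
}

/* Return the length of the leading run of s whose elements all fall in
   lo..hi; stops at the first element outside the range or at the
   terminating zero. */
size_t span_code_range(const long_char *s, long_char lo, long_char hi)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0 && in_code_range(s[n], lo, hi))
        n++;
    return n;
}
```

The test is two integer comparisons per character and needs no tables.
A range defined by a language's alphabet would instead have to consult
that language's collating rules for every character examined.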