Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!cmcl2!rutgers!labrea!decwrl!pyramid!oliveb!sun!gorodish!guy
From: guy%gorodish@Sun.COM (Guy Harris)
Newsgroups: comp.lang.c,comp.std.internat
Subject: Re: What is a byte
Message-ID: <25244@sun.uucp>
Date: Mon, 10-Aug-87 17:29:56 EDT
Article-I.D.: sun.25244
Posted: Mon Aug 10 17:29:56 1987
Date-Received: Wed, 12-Aug-87 02:36:18 EDT
References: <218@astra.necisa.oz> <142700010@tiger.UUCP> <2792@phri.UUCP> <34@piring.cwi.nl>
Sender: news@sun.uucp
Lines: 55
Keywords: 32 bit bytes! You ain't seen nothin', yet.
Xref: mnetor comp.lang.c:3575 comp.std.internat:87

> Are you suggesting that there are more than 2**20 = 1048576 different
> written words in Chinese?  At typically 60 entries on a page, their
> dictionaries must have then some 17500 pages or more.  I think that 16 bits
> are enough to accommodate all Chinese characters, and certainly ample for
> the about 5000 that are in actual use.

According to a document called "USMARC Character Set: Chinese Japanese
Korean", from the Library of Congress, Washington, a 24-bit character code
was developed to "represent and store in machine-readable form all the
Chinese, Japanese, and Korean characters used with the USMARC format."

It says that the character sets incorporated into this character set (the
RLIN - Research Libraries Information Network - East Asian Character Code,
or REACC) are:

	+ *Symbol and Character Tables of Chinese Character Code for
	  Information Interchange*, vol. 1 and 2 (2nd ed., Nov. 1982) and
	  *Variant Forms of Chinese Character Code for Information
	  Interchange* (2nd ed., Dec. 1982) (CCCII)  Editor: The Chinese
	  Character Analysis Group.  Total: 33,000 characters.  REACC
	  contains all of the 4,807 "most common" Chinese characters in
	  volume 1 (as listed by the Ministry of Education in Taiwan) and
	  about 5,000 of the 17,000 characters taken from a compilation of
	  data from different computer centers (mostly personal names) in
	  volume 2.  REACC also contains about 3,000 of the approximately
	  11,000 characters in the CCCII *Variant Forms*, which lists PRC
	  simplified forms and other variants, some of which are also used
	  in modern Japanese.

	+ *Code of Chinese Graphic Character Set for Information
	  Interchange Primary Set: The People's Republic of China National
	  Standard* (GB 2312-80) (1st ed., 1981).  Total: 6,763
	  characters.  All the characters in this set are in REACC.

	+ *Code of the Japanese Graphic Character Set for Information
	  Interchange: Japanese Industrial Standard* (JIS C 6226) (1983).
	  Total: 6,349 characters.  All the characters in this set are in
	  REACC.

	+ *Korean Information Processing System* (KIPS).  Total: 2,392
	  Chinese characters and 2,058 Korean Hangul.  The Chinese
	  characters in this set are in REACC; all hangul are also
	  incorporated in REACC, as well as some hangul *not* in KIPS.

One characteristic of this character set is that it tries to permit a
simple rule for deriving the codes of a character's variant forms from the
code for its traditional form.

So, while you can probably stuff the major Chinese characters into 16 bits
(the CCCII, including variant characters, contains 33,000 characters), you
may not want to.

	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com
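
As a rough check on the arithmetic in this thread, the following Python
sketch computes the minimum number of bits a dense numbering would need for
each total quoted above.  The helper name `bits_needed` is made up for
illustration, and dense numbering (codes 0, 1, 2, ... with no gaps) is an
assumption: a code laid out so that variant forms can be derived by rule
need not be packed this way, so these figures are lower bounds only.

```python
def bits_needed(count):
    """Smallest number of bits giving each of `count` items a distinct code.

    Hypothetical helper: assumes codes are assigned densely from 0 upward.
    """
    if count < 1:
        raise ValueError("count must be positive")
    # The largest code is count - 1; its bit length is the width required.
    return max(1, (count - 1).bit_length())


# Character totals as quoted in the article above.
totals = {
    "CCCII (with variants)": 33000,
    "GB 2312-80": 6763,
    "JIS C 6226": 6349,
    "KIPS (Chinese + hangul)": 2392 + 2058,
}

for name, count in totals.items():
    print(f"{name}: {count} characters, at least {bits_needed(count)} bits")
```

On count alone every one of these totals fits within 16 bits, so any case
for a wider code has to rest on how the code space is structured rather
than on the raw number of characters.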