Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!mnetor!uunet!husc6!cmcl2!rutgers!labrea!decwrl!pyramid!oliveb!sun!gorodish!guy
From: guy%gorodish@Sun.COM (Guy Harris)
Newsgroups: comp.lang.c,comp.std.internat
Subject: Re: What is a byte
Message-ID: <25244@sun.uucp>
Date: Mon, 10-Aug-87 17:29:56 EDT
Article-I.D.: sun.25244
Posted: Mon Aug 10 17:29:56 1987
Date-Received: Wed, 12-Aug-87 02:36:18 EDT
References: <218@astra.necisa.oz> <142700010@tiger.UUCP> <2792@phri.UUCP> <34@piring.cwi.nl>
Sender: news@sun.uucp
Lines: 55
Keywords: 32 bit bytes!  You ain't seen nothin', yet.
Xref: mnetor comp.lang.c:3575 comp.std.internat:87

> Are you suggesting that there are more than 2**20 = 1048576 different
> written words in Chinese?  At typically 60 entries on a page, their
> dictionaries must have then some 17500 pages or more.  I think that 16 bits
> are enough to accommodate all Chinese characters, and certainly ample for
> the about 5000 that are in actual use.

According to a document called "USMARC Character Set: Chinese Japanese Korean",
from the Library of Congress, Washington, a 24-bit character code was developed to
"represent and store in machine-readable form all the Chinese, Japanese, and
Korean characters used with the USMARC format."

It says that the character sets incorporated into this character set (the
RLIN - Research Libraries Information Network - East Asian Character Code, or
REACC) are:

	+ *Symbol and Character Tables of Chinese Character Code for
	  Information Interchange*, vol. 1 and 2 (2nd ed., Nov. 1982) and
	  *Variant Forms of Chinese Character Code for Information Interchange*
	  (2nd ed., Dec. 1982) (CCCII)  Editor:  The Chinese Character Analysis
	  Group.  Total:  33,000 characters.

	  REACC contains all of the 4,807 "most common" Chinese characters in
	  volume 1 (as listed by the Ministry of Education in Taiwan) and about
	  5,000 of the 17,000 characters taken from a compilation of data from
	  different computer centers (mostly personal names) in volume 2.
	  REACC also contains about 3,000 of the approximately 11,000
	  characters in the CCCII *Variant Forms*, which lists PRC simplified
	  forms and other variants, some of which are also used in modern
	  Japanese.

	+ *Code of Chinese Graphic Character Set for Information Interchange
	  Primary Set:  The People's Republic of China National Standard* (GB
	  2312-80) (1st ed., 1981).  Total:  6,763 characters.  All the
	  characters in this set are in REACC.

	+ *Code of the Japanese Graphic Character Set for Information
	  Interchange:  Japanese Industrial Standard* (JIS C 6226)  (1983).
	  Total:  6,349 characters.  All the characters in this set are in
	  REACC.

	+ *Korean Information Processing System* (KIPS).  Total: 2,392 Chinese
	  characters and 2,058 Korean Hangul.  Chinese characters in this set
	  are in REACC; all hangul are also incorporated in REACC, as well as
	  some hangul *not* in KIPS.

One characteristic of this character set is that it tries to permit a simple
rule to get the codes for various variant forms of characters from the code for
the traditional form of the character.
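
The document quoted above does not give the rule itself, so the following is
only a sketch of how such a rule could work, not REACC's actual encoding.  It
assumes a hypothetical layout in which the top 8 bits of the 24-bit code
select the form (0 = traditional, other values = variants) and the low 16 bits
identify the character.  The names cjk24, make_code, variant_of, and form_of
are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* HYPOTHETICAL layout, for illustration only (not taken from REACC):
 *   bits 23..16  form selector (0 = traditional, nonzero = a variant)
 *   bits 15..0   character identity, shared by all forms of a character
 */
typedef uint32_t cjk24;           /* holds a 24-bit code in the low bits */

#define FORM_SHIFT 16
#define ID_MASK    0xFFFFu

/* Build a 24-bit code from a form selector and a character identity. */
static cjk24 make_code(unsigned form, unsigned id)
{
    assert(form <= 0xFFu && id <= ID_MASK);
    return ((cjk24)form << FORM_SHIFT) | (cjk24)id;
}

/* Extract the form selector from a code. */
static unsigned form_of(cjk24 code)
{
    return (unsigned)((code >> FORM_SHIFT) & 0xFFu);
}

/* Derive the code of a variant form: keep the identity, swap the form. */
static cjk24 variant_of(cjk24 code, unsigned form)
{
    return make_code(form, (unsigned)(code & ID_MASK));
}
```

Under that assumed layout, variant_of(make_code(0, 0x4E2D), 1) yields 0x014E2D,
and applying variant_of(..., 0) to the result recovers the traditional code
0x004E2D.  A flat 16-bit numbering offers no comparably simple relation between
a character and its variants, which is one reason to prefer the wider code.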

So, while you can probably stuff the major Chinese characters into 16 bits (the
CCCII, including variant characters, contains 33,000 characters), you may not
want to.
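
The arithmetic behind "fits in 16 bits" can be checked with a few lines of C.
This is a minimal sketch; bits_needed is a name made up for this example, and
the counts passed to it are the totals quoted above.

```c
#include <assert.h>

/* Smallest number of bits b such that 2**b >= n, i.e. enough bits to give
 * each of n characters a distinct code.  Intended for small counts such as
 * character-set sizes; n must be at least 1.
 */
static unsigned bits_needed(unsigned long n)
{
    unsigned bits = 0;
    unsigned long capacity = 1;   /* number of codes available with `bits` bits */

    assert(n >= 1);
    while (capacity < n) {
        capacity <<= 1;
        bits++;
    }
    return bits;
}
```

bits_needed(33000UL) is 16, since 2**15 = 32768 falls just short of the 33,000
characters in the CCCII, while bits_needed(6763UL) for GB 2312-80 is 13.  So
16 bits suffice for the count alone, but leave no room for the kind of
structure a 24-bit code can carry.
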
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com