Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!uunet!pmafire!uudell!bigtex!texsun!newstop!exodus!cairo.Eng.Sun.COM!tut
From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
Newsgroups: comp.text
Subject: Re: International character set requirements needed
Keywords: 8-bit data, mail
Message-ID: <5118@exodus.Eng.Sun.COM>
Date: 2 Jan 91 19:56:57 GMT
References: <1990Dec20.012516.23623@ico.isc.com>
Sender: news@exodus.Eng.Sun.COM
Lines: 52

keld@login.dkuug.dk (Keld J|rn Simonsen) writes:
>
> Should one then just say "Use ISO 8859"? Well, what ISO 8859?
> There are several parts, latin 1, latin 2 (eastern Europe),
> Greek, Cyrillic, Arabic, Hebrew (among others)...
> We should do something that could cover the whole world.

This is what Unicode is for.  Unicode should be considered the most
useful and implementable subset of the draft standard ISO 10646.

Unicode is an unambiguous fixed-length 16-bit global codeset currently
under development by the Unicode Consortium.  Unicode offers a uniform
text and character standard that can encompass all living languages and
form a long-lasting basis for worldwide data exchange.

Unicode makes all 65,535 slots available, with these constraints:

    o  The first 256 slots duplicate the arrangement of ASCII and
       ISO Latin-1.

    o  Characters unique to a language are grouped together in
       standard order.

    o  Letters, punctuation, symbols, and diacritics shared by
       multiple languages are grouped together.

    o  Asian pictographs are grouped together in order of frequency
       (as specified by national standards), then sorted in
       traditional radical/stroke order.

    o  Chinese, Japanese, and Korean phonetic symbols are grouped
       together by language in standard order.

The reason 16 bits are enough is that Asian pictographs which everyone
would recognize as the same have been unified.  Thus, more than 31,000
characters have been reduced to about 20,000 slots.

		Major Han Character Standards

	Country   Standard    Year   Characters
	-------   --------    ----   ----------
	China     GB 2312     1980        6,763
	Japan     JIS X0208   1983        6,349
	Korea     KS C5601    1987        4,888
	Taiwan    CNS 11643   1986       13,051
	                                 ------
	                          total  31,051

In addition to East Asian languages, here are the writing systems
currently available in Unicode: Greek, Cyrillic, Georgian, Armenian,
Hebrew, Arabic, Ethiopian, Devanagari, Bengali, Gurmukhi, Gujarati,
Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhalese, Thai, Lao,
Burmese, Khmer, Tibetan, and Mongolian.
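
To make the fixed-length 16-bit idea and the Latin-1 constraint above
concrete, here is a rough Python sketch.  It assumes that Python's
built-in Unicode strings are an acceptable stand-in for the draft
codeset described in this post (the two are not identical), and the
helper names encode_fixed16, decode_fixed16 and
latin1_matches_first_256 are made up for illustration; they are not
part of any standard or library.

```python
import struct

# Character counts copied from the table of Han character standards above.
HAN_STANDARDS = {
    "GB 2312": 6763,
    "JIS X0208": 6349,
    "KS C5601": 4888,
    "CNS 11643": 13051,
}


def encode_fixed16(text):
    """Pack each character into exactly one big-endian 16-bit unit.

    Hypothetical helper: it rejects any character whose code value does
    not fit in 16 bits, since a fixed-length 16-bit codeset has no
    slot for it.
    """
    units = []
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError("character does not fit in 16 bits: %r" % ch)
        units.append(code)
    return struct.pack(">%dH" % len(units), *units)


def decode_fixed16(data):
    """Inverse of encode_fixed16: every two bytes are one character."""
    if len(data) % 2 != 0:
        raise ValueError("fixed 16-bit text must have an even byte count")
    count = len(data) // 2
    return "".join(chr(u) for u in struct.unpack(">%dH" % count, data))


def latin1_matches_first_256():
    """Check that each ISO Latin-1 byte value maps to the same slot number."""
    return all(bytes([b]).decode("latin-1") == chr(b) for b in range(256))


if __name__ == "__main__":
    sample = "A\u00e9\u4e2d"          # Latin letter, accented letter, Han character
    packed = encode_fixed16(sample)
    print(len(packed))                 # always 2 bytes per character
    print(decode_fixed16(packed) == sample)
    print(latin1_matches_first_256())
    print(sum(HAN_STANDARDS.values()))  # total of the four national standards
```

Because every character occupies the same two bytes, the Nth
character of a string always starts at byte offset 2*N, which is the
practical appeal of a fixed-length codeset over multi-byte schemes.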