Xref: utzoo comp.text:7818 comp.mail.misc:4650
Path: utzoo!utgpu!news-server.csri.toronto.edu!rutgers!att!tut.cis.ohio-state.edu!snorkelwacker.mit.edu!bloom-beacon!eru!hagbard!sunic!dkuug!dkuugin!keld
From: keld@login.dkuug.dk (Keld J|rn Simonsen)
Newsgroups: comp.text,comp.mail.misc
Subject: Re: International character set requirements needed
Keywords: 8-bit data, mail
Message-ID:
Date: 1 Jan 91 20:50:12 GMT
References: <1990Dec20.012516.23623@ico.isc.com> <1990Dec27.043500.27639@cbnewsk.att.com> <5044@exodus.Eng.Sun.COM> <1990Dec31.004055.10335@cbnewsk.att.com>
Sender: news@slyrf.dkuug.dk
Followup-To: comp.text
Lines: 70

hansen@pegasus.att.com (Tony L. Hansen) writes:

>< From: tut@cairo.Eng.Sun.COM (Bill "Bill" Tuthill)
>< True. But there is no technical reason (other than short-sightedness)
>< why SMTP has to strip off the 8th (high) bit. There are in fact
>< working versions of sendmail that don't disturb the 8th bit.

This introduces a problem with "embedded slashes", which Sendmail now
represents internally with the 8th bit set. Does anybody have Sendmail
patches to remedy this?

>I agree completely, there is no reason to limit SMTP to 7-bits.
>Unfortunately, the standard currently REQUIRES the stripping and doing
>anything else is non-standard. I would definitely support changing the
>standard to allow an arbitrary 8-bit byte stream. This would also require
>eliminating the limitation of 1024-byte lines and anything else in the
>standard which is not content transparent.

I am much in favour of extending the character set supported by SMTP.
But you should be careful. What is the meaning of an 8-bit character?
Well, that depends on the character set employed. Today we know that
only 7-bit ASCII is allowed. But with 8-bit mail, is the code 162
(octal 0242) coming over the line a cent sign (as in ISO 8859-1:1987),
a "small o with acute accent" (as in IBM CP 437) or a "capital A with
circumflex" (as in HP Roman8)?
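
To make the ambiguity concrete, here is a minimal sketch that decodes one
and the same 8-bit byte under several character sets. The byte value
(decimal 162, hex A2), the selection of character sets and the Python
codec names are illustrative choices and not part of the discussion
above; the function name is hypothetical.

```python
# One byte, several interpretations. The byte value and the list of
# codecs are illustrative assumptions, not taken from the post.
import unicodedata

BYTE = bytes([0xA2])  # decimal 162, octal 0242

CHARSETS = ["iso8859_1", "iso8859_2", "iso8859_5", "iso8859_7",
            "cp437", "hp_roman8"]

def interpretations(raw, charsets):
    """Map each charset name to the character it assigns to the byte."""
    result = {}
    for name in charsets:
        # errors="replace" yields U+FFFD if a charset leaves the code undefined
        result[name] = raw.decode(name, errors="replace")
    return result

if __name__ == "__main__":
    for name, ch in interpretations(BYTE, CHARSETS).items():
        print(f"{name:10s} U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

A receiver that is not told which character set the sender used has no
way of choosing between these readings from the byte alone.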
This might become a real problem given the current market shares in
the UNIX market. Just displaying the 8-bit data to a user may be very
confusing. It may even do strange things to your terminal equipment if
an IBM code page character set is employed, as some of its characters
fall in the upper control character area of ISO 8859 and other vendors'
character sets.

Should one then just say "Use ISO 8859"? Well, which ISO 8859? There
are several parts: Latin 1, Latin 2 (Eastern Europe), Greek, Cyrillic,
Arabic and Hebrew (among others). The above-mentioned code 162 has
different meanings in these different character sets. ISO 8859-1 would
be the natural choice (and is also specified in a recent RFC on the
"Encoding:" header). But is that fair? I think that is like inventing a
new ASCII, capable of serving only one region of the world sufficiently,
this time covering Western Europe (EEC) and all of North and South
America. We should do something that could cover the whole world.

It is also quite hard to persuade your manufacturers to change their
implementation character set, and even harder for equipment you have
already bought and installed. Some of this may even be running software
with no 8-bit capabilities!

I think it would be nice to be able to support all of these new and old
systems, and I have done an implementation of Sendmail capable of
supporting more than 60 character sets. It currently does not touch the
headers, only the mail body. A character that is not in the current
character set is encoded with a mnemonic code, for example a' for
"small a with acute accent". Thus even in ASCII you can get the message!

The Sendmail patches are available by anonymous ftp from
dkuug.dk:pub/ch.shar and sm5.64.8+bit.pa (sm.8+bit.pa for 5.61). It is
about 100 kb; the Sendmail patches themselves are under 100 lines, the
rest is the character set stuff. It has been running here at dkuug.dk
since Feb 90.
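
The mnemonic fallback described above can be sketched roughly as
follows. This is not the code from the Sendmail patches: the function
name, the "?" fallback and the four-entry table are hypothetical, and
the real tables cover far more characters and character sets.

```python
# Sketch of a mnemonic fallback for characters missing from the target
# character set. Table and names are illustrative assumptions only.
MNEMONICS = {
    "\u00e1": "a'",   # small a with acute
    "\u00f3": "o'",   # small o with acute
    "\u00f8": "o/",   # small o with stroke
    "\u00e4": "a:",   # small a with diaeresis
}

def to_charset(text, charset):
    """Encode text for charset, spelling out unsupported characters."""
    out = []
    for ch in text:
        try:
            ch.encode(charset)          # representable as-is?
            out.append(ch)
        except UnicodeEncodeError:
            # fall back to the mnemonic, or "?" if the table has none
            out.append(MNEMONICS.get(ch, "?"))
    return "".join(out).encode(charset)

if __name__ == "__main__":
    print(to_charset("J\u00f8rn", "ascii"))      # 7-bit receiver
    print(to_charset("J\u00f8rn", "iso8859_1"))  # 8-bit receiver
```

Note that this sketch is one-way: without some kind of escape marker a
reader cannot tell a mnemonic such as a' from a literal letter followed
by an apostrophe.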

A new ISO standard is showing up: ISO 10646, which has just been
published as a Draft International Standard (DIS). It covers all
characters in the world, with very few exceptions, and the exceptions
are planned to be included in a later issue. Actually Dan Oscarsson and
I have been planning (mostly Dan) to do an SMTP implementation for
Sendmail that negotiates 10646 for transmission, and to write an RFC
for this character set negotiation.

Keld Simonsen