Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!ames!sdcsvax!ucsdhub!hp-sdd!hplabs!sdcrdcf!trwrb!cadovax!gryphon!greg
From: greg@gryphon.UUCP
Newsgroups: comp.std.internat,sci.lang
Subject: Re: Computers and human languages (was Re: What is a byte)
Message-ID: <1413@gryphon.CTS.COM>
Date: Wed, 2-Sep-87 07:22:27 EDT
Article-I.D.: gryphon.1413
Posted: Wed Sep 2 07:22:27 1987
Date-Received: Fri, 4-Sep-87 04:30:23 EDT
References: <218@astra.necisa.oz> <142700010@tiger.UUCP>
Reply-To: greg@gryphon.CTS.COM (Greg Laskin)
Organization: Trailing Edge Technology, Redondo Beach, CA
Lines: 94
Xref: utgpu comp.std.internat:195 sci.lang:1205

In article <481@kuling.UUCP> andersa@kuling.UUCP (Anders Andersson) writes:
>If the codes for different forms are the same, then the typography will
>have to depend on context, so there has to be three font bitmaps, or types
>on a printing wheel/chain, and a neat little algorithm instead of a 1-1
>mapping for translating character codes into font table indices. I'm not
>arguing against the solution, just pointing out some extra problems.

There are generally four forms, as we shall see shortly. The algorithm
is not neat at all. If you are dealing with a CRT display, displaying
English and Arabic, and implementing direct cursor positioning, the
placement of a single character will require one of eighteen distinct
display rearrangement algorithms.

>
>This leads me to another question: Which form is used in a Persian word
>consisting of only one letter?

The following is from my experience in implementing Arabic. Persian is
similar.

There are four forms to each letter. The text is written from
right to left, so the beginning form is on the right-most end. The
forms (cases) are beginning (connected to the character on the left),
middle (connected on both sides), ending (connected to the character on
the right) and alone (not connected to anything). A word consisting of
only one letter therefore uses the alone form.

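The choice among the four forms can be sketched in Python. This is only
an illustration under stated assumptions, not the code that was actually
shipped: the function name `select_forms`, the letter names, and the
`no_forward_join` parameter are hypothetical. The sketch takes one word
in logical (reading) order with diacritics and extenders already
removed, and it does not handle ligatures. `no_forward_join` stands for
letters that never connect to the letter that follows them.

```python
def select_forms(word, no_forward_join=frozenset()):
    """Return (letter, form) pairs for one word given in logical order.

    word            -- sequence of letter names (hypothetical labels)
    no_forward_join -- letters that never join to the following letter
    """
    forms = []
    last = len(word) - 1
    for i, letter in enumerate(word):
        # Joined to the previous letter only if that letter joins forward.
        joins_prev = i > 0 and word[i - 1] not in no_forward_join
        # Joined to the next letter only if this letter joins forward.
        joins_next = i < last and letter not in no_forward_join
        if joins_prev and joins_next:
            form = "middle"
        elif joins_next:
            form = "beginning"
        elif joins_prev:
            form = "ending"
        else:
            form = "alone"
        forms.append((letter, form))
    return forms


# A one-letter word gets the alone form.
print(select_forms(["b"]))  # [('b', 'alone')]
# A non-joining letter in mid-word takes the ending form, and the
# letter after it starts over.
print(select_forms(["b", "a", "t"], no_forward_join={"a"}))
```

Note that the forms depend only on the two neighbours of each letter,
so the table lookup is 1-1 once the form has been chosen; the hard part
in practice is the display rearrangement, which this sketch ignores.
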
There are 8 characters, as I recall, that never connect to the left
even when they appear in the middle of a word. These letters need only
two cases since the alone and ending cases can double for the beginning
and middle cases.

In our code set, the lower 128 cells were standard ASCII with upper and
lower case Latin characters. The upper 128 codes were Arabic
characters. The letters occupied sticks 15 and 16. Sticks 13 and 14
held the diacritical marks and the letter extender character. In Arabic
writing, margin justification is accomplished by extending the
inter-letter connections, as opposed to adding whitespace between
letters or words, so a letter extender was required.

Graphics like ! @ # $ % ^ etc., were represented in both the upper and
lower code tables. When processing a stream of codes it was necessary
to know the language attribute of the special graphics. For example
(loosely), ">" in English meant "greater-than". In Arabic it means
"less than". Actually, it has no meaning in Arabic but I was
implementing a programming language.

There is one ligature, lom-alif (forgive spelling ... I'm not literate
in Arabic, I only implemented 4 terminals, 6 printers, an operating
system, a programming language and a word processor and I still can't
read, write or speak the language) that is used when an alif
immediately follows a lom. The lom-alif ligature uses one display cell
rather than the two cells that would be used by lom and alif displayed
separately.

We handled diacritics by assigning a separate code to each mark. The
effect was that the code stream was composed of variable length display
elements. For example the code stream might be:

	letter
	letter diacritic
	letter diacritic extender [extender ...]
	lom alif
	lom alif diacritic

and various combinations of the above. A more complete implementation
would also have allowed multiple diacritics following a letter.

A letter used 1 display cell.
A lom followed by an alif was counted as two letters but used one
display cell (the lom-alif ligature).
Diacritics were displayed in the same cell as the letter with which
they were associated and thus required no display space.
Extenders required 1 display cell but did not count as a letter.
The ending form of some letters used 2 display cells.

To sort, letters were effectively expanded to 16 bits. The letter code
was the upper 8 bits and the diacritic (0x0 if none) the lower 8 bits.
Search strings were similarly expanded. Optionally, a null diacritic in
a search string would match any diacritic in a target string.

I used 143 character-generator graphics to represent all of the Arabic
letters, numerals and graphics unique to Arabic. In addition there was
a standard English character generator.

There were a couple of interesting problems that were never quite fully
resolved. Since displayed letters were variable length (remember
extenders), the concept of column X was ambiguous; did we mean physical
column X or letter X?

There is considerable disagreement in the Arabic speaking world as to
the format of an appropriate code set. One code set has 5 or 6 cells
devoted to the lom-alif ligature with various diacritic marks. There is
disagreement whether lom-alif should be a character by itself or simply
a ligature formed by the display system. I believe this is an offshoot
of Arabic typewriters having a lom-alif key.

--
Greg Laskin
"When everybody's talking and nobody's listening, how can we decide?"
INTERNET:       greg@gryphon.CTS.COM
UUCP:           {hplabs!hp-sdd, sdcsvax, ihnp4}!crash!gryphon!greg
UUCP:           {philabs, scgvaxd}!cadovax!gryphon!greg
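
The 16-bit sort expansion and the optional null-diacritic match
described in the article can be sketched in Python. This is a rough
illustration, not the original code. The names `is_diacritic`,
`expand`, `key_matches` and `find` are hypothetical. The code range for
diacritics is an assumption: it supposes sticks are 1-based columns of
16 codes, so that sticks 13 and 14 cover 0xC0 to 0xDF. It also assumes
extenders have already been removed from the stream, and it keeps at
most one diacritic per letter.

```python
def is_diacritic(code):
    # ASSUMPTION: 1-based sticks of 16 codes, so sticks 13-14 = 0xC0-0xDF.
    # The extender also lived in these sticks; it is assumed to have been
    # stripped before this point.
    return 0xC0 <= code <= 0xDF


def expand(codes):
    """Expand 8-bit codes to 16-bit keys: letter high byte, diacritic low."""
    keys = []
    for code in codes:
        if is_diacritic(code):
            # Attach to the preceding letter if it has no diacritic yet;
            # leading or extra diacritics are dropped in this sketch.
            if keys and keys[-1] & 0xFF == 0:
                keys[-1] |= code
        else:
            keys.append(code << 8)  # diacritic byte stays 0x00 (none)
    return keys


def key_matches(pattern_key, target_key, loose):
    """Exact match, or a null diacritic in the pattern matching any mark."""
    if pattern_key == target_key:
        return True
    return (loose and pattern_key & 0xFF == 0
            and pattern_key >> 8 == target_key >> 8)


def find(pattern, target, loose=True):
    """Index of the first match of pattern keys in target keys, or -1."""
    n = len(pattern)
    for start in range(len(target) - n + 1):
        window = target[start:start + n]
        if all(key_matches(p, t, loose) for p, t in zip(pattern, window)):
            return start
    return -1


target = expand([0xE1, 0xC2, 0xE3])      # letter+diacritic, letter
pattern = expand([0xE1, 0xE3])           # same letters, no diacritics
print(find(pattern, target))             # 0
print(find(pattern, target, loose=False))  # -1
```

Because the letter sits in the high byte, comparing the keys as plain
integers orders by letter first and by diacritic second, with an
unmarked letter sorting ahead of any marked one.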