Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!clyde!rutgers!ames!sdcsvax!ucsdhub!hp-sdd!hplabs!sdcrdcf!trwrb!cadovax!gryphon!greg
From: greg@gryphon.UUCP
Newsgroups: comp.std.internat,sci.lang
Subject: Re: Computers and human languages (was Re: What is a byte)
Message-ID: <1413@gryphon.CTS.COM>
Date: Wed, 2-Sep-87 07:22:27 EDT
Article-I.D.: gryphon.1413
Posted: Wed Sep  2 07:22:27 1987
Date-Received: Fri, 4-Sep-87 04:30:23 EDT
References: <218@astra.necisa.oz> <142700010@tiger.UUCP>
Reply-To: greg@gryphon.CTS.COM (Greg Laskin)
Organization: Trailing Edge Technology, Redondo Beach, CA
Lines: 94
Xref: utgpu comp.std.internat:195 sci.lang:1205

In article <481@kuling.UUCP> andersa@kuling.UUCP (Anders Andersson) writes:
>If the codes for different forms are the same, then the typography will
>have to depend on context, so there has to be three font bitmaps, or types
>on a printing wheel/chain, and a neat little algorithm instead of a 1-1
>mapping for translating character codes into font table indices. I'm not
>arguing against the solution, just pointing out some extra problems.

There are generally four forms, as we shall see shortly.  The algorithm
is not neat at all.  If you are dealing with a crt display, displaying
English and Arabic, and implementing direct cursor positioning, the
placement of a single character will require one of eighteen distinct
display rearrangement algorithms.
>
>This leads me to another question: Which form is used in a Persian word
>consisting of only one letter?

The following is from my experience in implementing Arabic.  Persian is
similar.  

There are four forms to each letter.  The text is written from right-to-left
so the beginning form is on the right-most end.  The forms (cases) are
beginning (connected to the character on the left), middle (connected on
both sides), ending (connected to the character on the right) and alone
(not connected to anything).  There are 8 characters, as I recall, that
never connect to the left even when they appear in the middle of a word.
These letters need only two cases since the alone and ending case can
double for the beginning and middle cases.  
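
A rough sketch of the form selection just described, in Python.  The
letter names and the contents of NON_LEFT_JOINING are illustrative
assumptions (only a few well-known non-connecting letters are listed,
not the full set recalled above), and choose_forms is a made-up name,
not anything from the original implementation.

```python
# Letters that never connect to the character on their left.
# Illustrative subset only; the post recalls about 8 but does not list them.
NON_LEFT_JOINING = {"alif", "dal", "ra", "waw"}


def choose_forms(word):
    """Pick a form for each letter of a word.

    `word` is a list of letter names in reading order, so word[0] is the
    right-most letter on the display and word[i + 1] sits to the left of
    word[i].
    """
    forms = []
    for i, letter in enumerate(word):
        # Joined on the right if there is a previous letter and that
        # letter is able to connect to its left.
        joins_prev = i > 0 and word[i - 1] not in NON_LEFT_JOINING
        # Joined on the left if there is a next letter and this letter
        # is able to connect to its left.
        joins_next = i < len(word) - 1 and letter not in NON_LEFT_JOINING
        if joins_prev and joins_next:
            forms.append("middle")
        elif joins_prev:
            forms.append("ending")
        elif joins_next:
            forms.append("beginning")
        else:
            forms.append("alone")
    return forms
```

Under these rules a one-letter word always comes out in the alone form,
and a letter in NON_LEFT_JOINING can only ever be "ending" or "alone",
which is why two cases suffice for those letters.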

In our code set, the lower 128 cells were standard ASCII with upper and
lower case Latin characters.  The upper 128 codes were Arabic characters.
The letters occupied sticks 15 and 16.  Sticks 13 and 14 were diacritical
marks and the extender character.  The extender character was a letter
extender.  In Arabic writing, margin justification is accomplished by
extending the inter-letter connections as opposed to adding whitespace
between letters or words so a letter extender was required.

Graphics like ! @ # $ % ^ etc., were represented in both the upper and lower
code tables.  When processing a stream of codes it was necessary to know
the language attribute of the special graphics.  For example (loosely),
">" in English meant "greater-than".  In Arabic it meant "less-than".
Actually, it has no meaning in Arabic but I was implementing a programming
language.

There is one ligature, lom-alif (forgive spelling ... I'm not literate in
Arabic, I only implemented 4 terminals, 6 printers, an operating system,
a programming language and a word processor and I still can't read, write
or speak the language) that is used when an alif immediately follows
a lom.  The lom-alif ligature uses one display cell rather than the two
cells that would be used by lom and alif displayed separately.

We handled diacritics by assigning a separate code to each mark. 
The effect was that the code stream was composed of variable length
display elements.  For example the code stream might be:
   letter
   letter diacritic
   letter diacritic extender [extender ...]
   lom alif
   lom alif diacritic
and various combinations of the above.  A more complete implementation would
also have allowed multiple diacritics following a letter.  
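
The grouping of a code stream into display elements can be sketched
roughly as follows.  Every code value here (LAM, ALIF, EXTENDER, the
DIACRITICS range and the letter range in is_letter) is invented for
illustration, since the actual cell assignments are not given above,
and split_elements is a hypothetical name.

```python
# All code values are invented for illustration; the post does not give them.
LAM, ALIF, EXTENDER = 0xE4, 0xE0, 0xC0
DIACRITICS = set(range(0xD0, 0xD8))


def is_letter(code):
    # Assumed letter range, for this sketch only.
    return 0xE0 <= code <= 0xFF


def split_elements(codes):
    """Group a list of 8-bit codes into variable-length display elements."""
    elements = []
    i = 0
    while i < len(codes):
        if not is_letter(codes[i]):
            # ASCII or a stray mark: treat it as an element by itself.
            elements.append([codes[i]])
            i += 1
            continue
        element = [codes[i]]
        i += 1
        # An alif immediately following a lom joins it in one element.
        if element[0] == LAM and i < len(codes) and codes[i] == ALIF:
            element.append(codes[i])
            i += 1
        # At most one diacritic, matching the implementation described.
        if i < len(codes) and codes[i] in DIACRITICS:
            element.append(codes[i])
            i += 1
        # Any number of extenders.
        while i < len(codes) and codes[i] == EXTENDER:
            element.append(codes[i])
            i += 1
        elements.append(element)
    return elements
```

Allowing multiple diacritics per letter would amount to turning the
single diacritic test into a loop.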

A letter used 1 display cell.  Lom followed by alif was counted as two letters
but used one display cell (lom-alif ligature).  Diacritics were displayed
in the same cell as the letter with which they were associated and thus
required no display space.  Extenders required 1 display cell but
did not count as a letter.  The ending form of some letters used 2
display cells.
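
As a small illustration of the counting rules above, here is a Python
sketch that tallies letters and display cells separately.  The token
names and the function measure are assumptions made for the example,
and the sketch deliberately ignores the two-cell ending forms, which
would need the shaping result as well.

```python
def measure(tokens):
    """Return (letter_count, cell_count) for a list of token kinds.

    Each token is one of "letter", "lam", "alif", "diacritic" or
    "extender" (names made up for this sketch).
    """
    letters = 0
    cells = 0
    prev = None
    for tok in tokens:
        if tok == "diacritic":
            # Shares the cell of the letter it belongs to.
            pass
        elif tok == "extender":
            # Takes a display cell but is not a letter.
            cells += 1
        elif tok == "alif" and prev == "lam":
            # Second half of the ligature: a letter, but no new cell.
            letters += 1
        else:
            letters += 1
            cells += 1
        prev = tok
    return letters, cells
```

For example, measure(["lam", "alif"]) gives 2 letters in 1 cell, while
a letter followed by two extenders gives 1 letter in 3 cells.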

To sort, letters were effectively expanded to 16 bits.  The letter code
was the upper 8 bits and the diacritic (0x0 if none) the lower 8 bits.
Search strings were similarly expanded.  Optionally a null diacritic
in a search string would match any diacritic in a target string.
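
A minimal Python sketch of that expansion and of the optional
null-diacritic match.  The function names, the null_matches_any flag
and the idea of passing the diacritic codes in as a set are all
assumptions for the example; extenders are not handled here.

```python
def expand(codes, diacritics):
    """Expand 8-bit codes to 16-bit units: letter high, diacritic low."""
    units = []
    for code in codes:
        if code in diacritics and units and units[-1] & 0xFF == 0:
            # Fold the diacritic into the low byte of the preceding letter.
            units[-1] |= code
        else:
            # A letter (or anything else) starts a new unit, low byte 0x00.
            units.append(code << 8)
    return units


def unit_matches(pattern_unit, target_unit, null_matches_any=True):
    """Compare one expanded search unit against one expanded target unit."""
    if pattern_unit >> 8 != target_unit >> 8:
        return False
    pattern_mark = pattern_unit & 0xFF
    if null_matches_any and pattern_mark == 0:
        # Null diacritic in the search string matches any diacritic.
        return True
    return pattern_mark == (target_unit & 0xFF)


def find(pattern, target, null_matches_any=True):
    """Return the index of the first match of pattern in target, or -1."""
    n = len(pattern)
    for start in range(len(target) - n + 1):
        window = target[start:start + n]
        if all(unit_matches(p, t, null_matches_any)
               for p, t in zip(pattern, window)):
            return start
    return -1
```

Sorting a list of expanded strings with the ordinary list comparison
then orders by letter first and by diacritic second within each
position.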

I used 143 character-generator graphics to represent all of the
Arabic letters, numerals and graphics unique to Arabic.  In addition there
was a standard English character generator.

There were a couple of interesting problems that were never quite fully
resolved.  Since displayed letters were variable length (remember extenders),
the concept of column X was ambiguous; did we mean physical column X or
letter X?  

There is considerable disagreement in the Arabic speaking world as to
the format of an appropriate code set.  One code set has 5 or 6 cells
devoted to the lom-alif ligature with various diacritic marks.  There
is disagreement whether lom-alif should be a character by itself or
simply a ligature formed by the display system.  I believe this is
an offshoot of Arabic typewriters having a lom-alif key.
-- 
Greg Laskin   
"When everybody's talking and nobody's listening, how can we decide?"
INTERNET:     greg@gryphon.CTS.COM
UUCP:         {hplabs!hp-sdd, sdcsvax, ihnp4}!crash!gryphon!greg
UUCP:         {philabs, scgvaxd}!cadovax!gryphon!greg