Xref: utzoo comp.text.tex:3601 soc.culture.german:1896
Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!wuarchive!rice!egeria.rice.edu!dorai
From: dorai@egeria.rice.edu (Dorai Sitaram)
Newsgroups: comp.text.tex,soc.culture.german
Subject: Re: Ascii German -> Diacritics
Message-ID: <1990Oct31.220738.18939@rice.edu>
Date: 31 Oct 90 22:07:38 GMT
References: <1990Oct20.042619.10038@rice.edu> <3513@gmdzi.gmd.de>
Sender: news@rice.edu (News)
Organization: Rice University, Houston
Lines: 64

In article <3513@gmdzi.gmd.de> icking@gmdzi.gmd.de (Werner Icking) writes:
>dorai@tone.rice.edu (Dorai Sitaram) writes:
>[...]
>>I brewed a very little lex program "diac" which converts AG into DG
>>using context information to figure out which ae/oe/ue/ss get
>>converted.  The output is TeX source style.
>[...]
>>Except for Masse/Ma{\ss}e. :-] 
>
>And what about names like Mueller, which have to be written with ue if
>the "owner" insists? There are other problems with names, because e and
>sometimes i are used to lengthen the preceding vowel: Buer, Buir,
>Roisdorf, Troisdorf, Naefe - a colleague of mine. Other problems result from
>the possibility of combining two or more words freely: Portoerstattung,
>Kontoerfassung, etc.
>-- 
>Werner Icking          icking@gmdzi.gmd.de          (+49 2241) 14-2443
>Gesellschaft fuer Mathematik und Datenverarbeitung mbH (GMD)
>Schloss Birlinghoven, P.O.Box 1240, D-5205 Sankt Augustin 1, FRGermany

Das Programm ist nur so gut wie die Muster, die es besitzt.  Also, man
kann sehr leicht mit Woertern wie {Buer, Naefe, ..., Kontoerfassung}
umgehen.  Gustaf Neumann hat einen sehr umfangreichen Wortschatz fuer
`diac' entwickelt, und ich moechte nicht vieles darueber sagen, weil
Gustaf selbst bald darueber schreiben wird, wenn er mal ganz zufrieden
mit seinem Programm geworden ist.

Das Hauptproblem besteht nur darin, wenn ein Ascii-Wort zwei moegliche
`Uebersetzungen' hat, beide sinnvoll.  Z.B. Masse/Ma{\ss}e sowie
M\"uller/Mueller.  Ich gebe zu, dass ich keine richtige Loesung fuer
diese Ambiguitaet weiss.  Man kann auf jeden Fall `Mu{}eller' (sehr
haesslich :-[) schreiben, wenn die Buchstabierung mit `u-e' (nicht
`\"u') gewuenscht wird.  Man kann auch ein weiteres `Fenster' fuer die
Ascii-Zeichenketten benutzen (also, man beobachtet ganze Phrasen statt
nur Woerter).  Mit dieser Methode kann man sofort (ok, ok, es gibt
noch einige Schwierigkeiten) erkennen, dass `in hohem Masse' nur als
`in hohem Ma{\ss}e' zu uebersetzen ist.

Der Herr Mueller aber bleibt ein unsympathischer Mensch.

--d

(for comp.text.tex)

The program is only as good as the patterns it has.  Thus, one can
easily include the appropriate rules for {Buer, Naefe, ...,
Kontoerfassung}.  Gustaf Neumann has developed a very comprehensive
set of patterns, and I don't want to say too much about it, since
Gustaf will talk about it himself before long, as soon as he's
satisfied with his creation.
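
To make the pattern idea concrete, here is a minimal sketch in Python
rather than lex.  It is not the actual `diac' source: the exception
list, the names KEEP_AS_IS and diac(), and the single context rule
(leave `ue' alone after a, e or q, as in Bauer or Quelle) are all
assumptions made for illustration.  It handles only the umlauts and
leaves `ss' untouched.

```python
import re

# Hypothetical exception list: words whose ae/oe/ue must stay as written.
KEEP_AS_IS = {"Buer", "Naefe", "Kontoerfassung", "Portoerstattung"}

# Ascii digraph -> TeX source.
TEX = {
    "ae": '\\"a', "oe": '\\"o', "ue": '\\"u',
    "Ae": '\\"A', "Oe": '\\"O', "Ue": '\\"U',
}

# Assumed context rule: `ue' directly after a, e or q is not an umlaut.
DIGRAPH = re.compile(r"ae|oe|(?<![aeqAEQ])ue|Ae|Oe|Ue")
WORD = re.compile(r"[A-Za-z]+")


def convert_word(word):
    """Convert one word unless it is on the exception list."""
    if word in KEEP_AS_IS:
        return word
    return DIGRAPH.sub(lambda m: TEX[m.group(0)], word)


def diac(text):
    """Convert every alphabetic word in the text, one word at a time."""
    return WORD.sub(lambda m: convert_word(m.group(0)), text)
```

Adding a name such as Buir to the exception list is all it takes to
protect it; the quality of the output depends entirely on how complete
that list and the context rules are.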

The chief problem, though, arises when an Ascii word has two possible
transliterations, both of them meaningful.  E.g., Masse/Ma{\ss}e and
M\"uller/Mueller.  I cheerfully admit I haven't found a neat solution
for this ambiguity.  Of course, one could get around this by writing
`Mu{}eller' (eek!) when one wants the spelling with `u-e' (not `\"u').
One could also use a wider `window' when processing the Ascii text
(i.e., one observes phrases rather than just words).  With this
technique, one can immediately see (ok, ok, some problems remain) that
`in hohem Masse' can only be transliterated as `in hohem Ma{\ss}e'.
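
Both workarounds can be sketched in a few lines of Python.  This is an
illustration under stated assumptions, not how `diac' is implemented:
the rule tables PHRASES and WORDS and the function transliterate() are
hypothetical, and the default of turning a bare `Mueller' into
`M\"uller' is just one possible choice.

```python
import re

# Hypothetical phrase rules: the wider window, applied first.
PHRASES = [
    (re.compile(r"\bin hohem Masse\b"), "in hohem Ma{\\ss}e"),
]

# Hypothetical word rule: the assumed default for an ambiguous name.
WORDS = [
    (re.compile(r"\bMueller\b"), 'M\\"uller'),
]


def transliterate(text):
    """Apply phrase rules, then word rules, to Ascii German text."""
    for pattern, replacement in PHRASES + WORDS:
        # A function replacement keeps the TeX backslashes literal.
        text = pattern.sub(lambda m, r=replacement: r, text)
    return text
```

A bare `Masse' matches no phrase rule and is left alone, and
`Mu{}eller' never matches the word rule, so it passes through
unchanged and TeX still typesets it as Mueller.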

Mr Mueller, however, still remains a spoilsport.

--d