Newsgroups: comp.text.sgml
Path: utzoo!sq!dns
From: dns@sq.sq.com (David Slocombe)
Subject: Re: SGML translators
Message-ID: <1990Sep14.181719.12244@sq.sq.com>
Organization: SoftQuad Inc., Toronto
References:
Distribution: comp
Date: Fri, 14 Sep 90 18:17:19 GMT
Lines: 140

In article Bjorn.Larsen@usit.uio.no writes:
>Does anybody have a description of the SGML-based translators available?
>I'm interested in translators such as
>
>    WP5.0 <-> SGML
>    RTF   <-> SGML
>    LaTeX <-> SGML
>    troff <-> SGML

What is requested here is in general not possible without the user
supplying substantial additional information -- in fact this touches
upon the key motivation for SGML.

SGML == Standard Generalized Markup Language, IS 8879-1986.
        See my parallel posting "How to obtain info on SGML"

DTD  == Document Type Definition.  SGML tells how to create one.
        This specifies the "grammar" of a class of documents.
        A particular document is an "instance" of the language
        specified by the "grammar", just like in your
        programming-languages course.

The need for SGML is demonstrated by examining the problem of
translating a troff document (for example) into an SGML document:

If a computer program looks at a file containing a troff document,
it will see things like...

        .sp .5v
        .ti 2m
        text text text ....
        ...text text.
        .sp .5v

Now we may decide, in context, that these formatting codes are
formatting a paragraph (we *visualize* the effect of the codes!),
but they *might* be formatting a "note" or a cell of a table or
whatever.  And this must be about the simplest case.  In general it
is a kind of AI problem to deduce the logical structure of a document
from its formatting codes, and a task that requires considerable
"training" before it can be done algorithmically with any accuracy!
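
As a rough illustration of the ambiguity just described, here is a
minimal Python sketch.  It is not taken from any real converter: the
CANDIDATES table and the element names "para", "note" and
"table-cell" are assumptions invented for the example.  The only point
is that the request sequence alone does not pick out one logical
element, so a program must either refuse to guess or be told.

```python
import re

# A troff request line: a dot followed by a two-letter request name.
REQUEST = re.compile(r"^\.[a-z]{2}(\s|$)")

TROFF_FRAGMENT = """\
.sp .5v
.ti 2m
text text text ....
...text text.
.sp .5v
"""

# Hypothetical table: logical elements that some house style might
# format with exactly this request sequence.
CANDIDATES = {
    (".sp .5v", ".ti 2m", ".sp .5v"): ["para", "note", "table-cell"],
}


def formatting_signature(troff_source):
    """Return the request lines of a troff fragment, in order."""
    return tuple(
        line.strip()
        for line in troff_source.splitlines()
        if REQUEST.match(line)
    )


def guess_element(troff_source, user_rules=None):
    """Pick a logical element, refusing to guess when codes are ambiguous."""
    signature = formatting_signature(troff_source)
    if user_rules and signature in user_rules:
        # The user has supplied the information the codes do not carry.
        return user_rules[signature]
    candidates = CANDIDATES.get(signature, [])
    if len(candidates) == 1:
        return candidates[0]
    raise ValueError(f"cannot decide between {candidates} for {signature}")
```

Calling guess_element(TROFF_FRAGMENT) raises ValueError, while passing
a rule table that maps this signature to "para" returns "para".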

(Of course most troff documents use macro-calls, but this only hides
the problem a little: someone still has to map the macro-calls to the
SGML elements, and this may be one-to-many unless the designer of the
macro package was already thinking in an SGML way.  If he was, then
SGML contributes to him a rigorousness and a software support that he
has never had before.)

In fact there is a company that specializes in software to do exactly
this.  They are:

        Avalanche Development Company
        947 Walnut Street
        Boulder, Colorado 80302
        (303) 449-5032
        FAX (303) 449-3246

Their FastTAG product accepts input from WP4.2 and WP5.0, OCR formats,
DCA/RFT files, Microsoft Word, print-image files, Calera PDA files and
Shaftstall Media Conversion files.  I think they are expanding the
list all the time.

BUT... you have to do considerable work to coach FastTAG, because by
itself it cannot be expected to know just what logical elements make
up your documents (i.e. it cannot intuit the DTD), *and* it cannot
guess at the format of each element in that DTD.  So you have to tell
it these things.  This is usually practical only if you are going to
convert a body of documents from a particular formatted form to SGML.

Naturally!  That's why SGML is so important: it is a way for document
creators to supply this valuable information about their work that
hitherto has been visible to the human reader (hopefully) but not
available to computer programs.

Instead of coding up your documents with formatting codes which result
in a visible image that your brain interprets to mean a certain
logical structure, you code your documents with the logical structure,
and then map that structure to formatting instructions in a separate
operation.  The documents themselves then are much more "computable"
as data-structures, *and* you can take the same document/data-structure
and map it to different visual representations at different times for
different purposes.  Or even map it to different formatting languages (e.g.
troff at one site, TeX at another site).  Or load it into a database
(mapping the logical structure into database-update language).

But again note that going from SGML to troff, for example, requires
that you specify just what troff codes or macros you want used for
each SGML logical structure.  There is nothing in the SGML form of the
document that binds to a particular visual representation.  So
SGML->formatter-language cannot be automatic unless you supply
additional information.  At least this *can* be done with great
reliability, which is often *not* the case for
formatter-language->SGML.

The mapping from SGML to a formatter-language is usually done using an
SGML parser/translator, i.e., a program that parses the SGML documents
(using a supplied Document Type Definition) and writes to its output
suitable formatting codes (or the macro-calls that represent them) to
typeset the document in a specific format.

The user must either supply a mapping to formatting codes to produce
the particular "look" desired, or supply a mapping to macro-calls and
then write a macro-package that has the same effect.  In either case,
the SGML parser has to be told what to put out.

The parser has the advantage that a document that does not conform in
detail to the DTD simply won't be translated, just as is the case with
a C compiler.  This greatly eases the burden on the writer of the
macro package, who doesn't have to make his macros robust in the face
of incorrect input!

As to available parsers, I quote from a comp.text posting by my
colleague Yuri Rubinsky only a short time ago:

Today, the most popular parsers, which are generally conceded to also
be the most conformant [to the Standard], are those of Software
Exoterica (of Ottawa Canada), licensed by Frame, Arbortext and
Intergraph; and of Sobemap (of Brussels Belgium, marketed by Yard
Software of Chippenham Wiltshire UK), licensed by Agfa Compugraphic
CAPS, Interleaf, Context and Xyvision.
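
The SGML-to-formatter direction described earlier in this posting can
be sketched in a few lines of Python.  This is a toy and not an SGML
parser: ALLOWED_CHILDREN is a hypothetical stand-in for a DTD,
TROFF_MAP is an invented user-supplied mapping, the element names are
made up, and the tokenizer handles only plain <name> and </name> tags
(no attributes, entities or tag minimization).

```python
import re

# Hypothetical stand-in for a DTD: the children each element may contain.
ALLOWED_CHILDREN = {
    "doc": {"title", "para", "note"},
    "title": set(),
    "para": set(),
    "note": {"para"},
}

# User-supplied mapping: element name -> (codes at start, codes at end).
TROFF_MAP = {
    "doc": ("", ""),
    "title": (".sp 1v\n.ce 1\n", ""),
    "para": (".sp .5v\n.ti 2m\n", ""),
    "note": (".in +4m\n", ".in -4m\n"),
}

# Matches a start tag, an end tag, or a run of character data.
TOKEN = re.compile(r"<(/?)([a-z]+)>|([^<]+)")

SAMPLE = (
    "<doc><title>Report</title><para>First paragraph.</para>"
    "<note><para>Take care.</para></note></doc>"
)


def translate(instance, mapping=TROFF_MAP):
    """Emit formatter codes for a tagged instance, rejecting bad structure."""
    out = []
    stack = []  # currently open elements, innermost last
    for closing, name, text in TOKEN.findall(instance):
        if text:
            if text.strip():
                out.append(text.strip() + "\n")
        elif closing:
            if not stack or stack[-1] != name:
                raise ValueError(f"unexpected </{name}>")
            stack.pop()
            out.append(mapping[name][1])
        else:
            if name not in ALLOWED_CHILDREN or name not in mapping:
                raise ValueError(f"unknown element <{name}>")
            if stack and name not in ALLOWED_CHILDREN[stack[-1]]:
                raise ValueError(f"<{name}> not allowed in <{stack[-1]}>")
            stack.append(name)
            out.append(mapping[name][0])
    if stack:
        raise ValueError(f"unclosed <{stack[-1]}>")
    return "".join(out)
```

translate(SAMPLE) returns the troff text for the sample document, and
a nonconforming instance such as a <para> inside a <title> raises
ValueError instead of producing output.  Passing a different mapping
(say, one that emits macro-calls) changes the output without touching
the document.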

We have made available to our consulting clients the parser from
Author/Editor, which is optimized to work with our SoftQuad Publishing
Software sqtroff component.

Hope all this helps someone...

David.
----------------------------------------------------------------
David Slocombe                          (416) 963-8337
Vice-President, Research & Development  (800) 387-2777 (from U.S. only)
SoftQuad Inc.                           uucp: {uunet,utzoo}!sq!dns
720 Spadina Ave.                        Internet: dns@sq.com
Toronto, Ontario, Canada M5S 2T9        Fax: (416) 963-9575