Path: utzoo!attcan!utgpu!watmath!watdragon!watsol!tbray
From: tbray@watsol.waterloo.edu (Tim Bray)
Newsgroups: comp.text
Subject: What SGML is and isn't
Keywords: SGML, document processing, markup
Message-ID: <15110@watdragon.waterloo.edu>
Date: 13 Jul 89 17:03:19 GMT
Sender: daemon@watdragon.waterloo.edu
Reply-To: tbray@watsol.waterloo.edu (Tim Bray)
Organization: New Oxford English Dictionary Project, U. of Waterloo, Ontario
Lines: 52

Some SGML talk is churning around the newsgroup again. This is a good thing, as people who are in the text-by-computer trade need to be thinking about the issues that SGML raises.

Here at the New OED project, we have grappled with the structuring and management of many different kinds of text, the most interesting perhaps being the 550-Mb, highly structured dictionary itself. We have used embedded markup, and have built (& are now selling) a variety of software tools which are designed for such data resources.

We describe the markup and our software as `SGML-like', and indeed it looks like SGML, and some of the things our tools do are reminiscent of the services provided by the SGML-oids. But we are not, nor will we be, fully SGML-compliant.

The SGML concept is based on descriptive markup, a wonderful idea [see Coombs et al. in Nov. '87 CACM] and one which is necessary for any serious computer text processing. But the SGML standard itself is horribly flawed and permits some things which are unhelpful and even dangerous. The details are too lengthy and sordid to go into here, but I can talk in detail on request.

It also has one serious design flaw which I shall discuss briefly here: that for any document to be SGML-compliant, there must exist a Document Type Definition (DTD), which is a formal grammar prescribing the syntax and structure of the document. Sounds fine.
But there is a large class of existing documents (including dictionaries, other reference works, legislation, technical documentation) for which it is either impossible or prohibitively difficult to write a DTD, simply because the number of inconsistencies is so great (even if, proportionally speaking, their frequency of occurrence is low). The OED is an example of such a document. I know lots of others.

So are we to discard such documents just because we can't write DTDs for them? Don't be silly. They are structured, and in fact generally well structured. What is required is software tools that use that structure to support getting work done, without letting a requirement for a prescriptive grammar get in the way. (This is what we try to do.)

There is also the philosophical issue that arises when the editor of the OED comes to me and says: "I want to put an author's name here because this is a special case in the English language." Do I say:

1. "You can't. You have to live by the grammar we predefined" or
2. "OK, we'll fix the grammar for this one special case".

Yecch and yecch. Especially given the fact that a person such as an OED editor usually has a pretty good grasp of what he or she is doing and I, a computer weenie, feel pretty uncomfortable telling him or her how to structure a dictionary.

So what about SGML? As several others have mentioned, its most important potential role is as a truly portable interchange medium. Something that is desperately needed. And it may succeed there. But I still worry about the low quality of its design getting in the way.

Cheers, Tim Bray, New OED Project, U of Waterloo (tbray@watsol.waterloo.edu)