Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!wuarchive!uunet!dlogics!jmd
From: jmd@dlogics.COM (Jens M. Dill)
Newsgroups: comp.text.sgml
Subject: Re: Record boundaries in SGML
Summary: Record ends are DATA in MIXED content, but MARKUP in ELEMENT content
Message-ID: <693@dlogics.COM>
Date: 4 Dec 90 19:15:45 GMT
References: <1990Nov21.210152.2631@maytag.waterloo.edu>
Organization: Datalogics Inc., Chicago
Lines: 151

In article , (Erik Naggum) writes:
> In article <1990Nov21.210152.2631@maytag.waterloo.edu>, Gary Pianosi writes:
>
> > I am able to import hand-edited files into Author/Editor without error,
> > but when I try to validate the document, I get the error message:
> >
> >     "Validation Error: Text not allowed here"
> >
> > wherever there is a new line between an end tag and a start tag.
> > ...
>
> The problem can be reduced, I think, to the problem of the treatment
> of Record End in this contrived example:
>
>
>
>
>     1 `'
>     2 `'
>     3 `caninus'
>     4 `'
>     5 `'
>     6 `felinus'
>     7 `'
>     8 `'
>
> (where ` signifies Record Start, ' Record End for clarity, line
> numbers for reference, only)
>
> According to section 7.6.1, this will be interpreted at the outer
> (foo) level as:
>
>     1 `'
>     2 `...'
>     5 `...'
>     8 `'
>
> Now, the RE in line 1 is clearly the first RE in the content of foo,
> and the RE in line 5 is clearly the last RE in the content of foo.
> According to said section, these are to be ignored.
>
> The problem is the RE in line 2, and the question boils down to this:
>
>     Is this RE recognized as /content/ or as /markup/?
>
> I believe I understand this to be markup, and thus that it should be
> ignored.  It seems that Gary's problems stem from some decision
> amounting to viewing this as content, in which the RE would imply the
> start of a bar element, in which a new bar element is illegal (see
> amended note to section 11.2.4), or in which data content is not
> valid.
>
> What am I missing here?  (I'm sure it's something.)
>

What you are missing is a very obscure note added to section 11.2.4 by
Amendment 1:

    NOTE -- It is recommended that "#PCDATA" be used only when data
    characters are to be permitted anywhere in the content of the
    element; that is, in a _content model_ where it is the sole token,
    or where _or_ is the only connector used in any _model group._

    This recommendation is made because separator characters, which
    are recognized as separators in _element content_, are treated as
    data in _mixed content._  ...

The note just about says it all, but it seriously understates the
gravity of the problem, and both the example and the sample solution
provided are laughably simplistic.

The core problem is that Gary has defined an element with "mixed
content" (the content model contains both "#PCDATA" and GI's of
sub-elements), and has done so in such a way that somewhere in the
content model, you come across the situation where sub-element A ends
and the only legal things that can follow sub-element A are other
elements (#PCDATA is not legal at this point in the model).  Now, IF
the instance is set up so that

  -- Element A has an explicit end-tag
  -- There is a separator character (space, tab, record end) between
     the end-tag and the start-tag of the next sub-element

then the parser attempts to read the separator as data, discovers it
cannot match data at this point in the current element, and starts
trying to infer omitted start-tags or end-tags that would get it to a
point where #PCDATA would be acceptable.  At this point the parser gets
so confused that the eventual error message has no chance of bearing
any relation at all to anything involved in the original problem.
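
To make the mechanism concrete, here is a minimal sketch in Python.  It
is a toy, not how any real SGML parser works: the one-character
encoding of the children, the two sample models, and every function
name are assumptions made up for this illustration.  It leaves out tag
inference and the record-end rules of section 7.6.1 entirely, and shows
only the one point at issue: the same separator is dropped in element
content but has to be matched as data in mixed content.

```python
import re

# Hypothetical one-character encoding of an element's children:
#   lowercase letter = a sub-element, "#" = a run of data characters,
#   "_" = a separator (space, tab, record end) sitting between tags.
# A content model is written as a regular expression over those symbols.

ELEMENT_MODEL = "ab"    # stands for (a, b): element content
MIXED_MODEL = "ab#*"    # stands for (a, b, #PCDATA): mixed content


def is_mixed(model: str) -> bool:
    """A model counts as mixed content as soon as #PCDATA appears in it."""
    return "#" in model


def classify_separators(children: str, mixed: bool) -> str:
    """Separators become data in mixed content and vanish in element content."""
    return children.replace("_", "#" if mixed else "")


def accepts(model: str, children: str) -> bool:
    """Check the children against the model after classifying separators."""
    seen = classify_separators(children, is_mixed(model))
    return re.fullmatch(model, seen) is not None


if __name__ == "__main__":
    # "a_b" stands for: end-tag of a, a record end, start-tag of b.
    print(accepts(ELEMENT_MODEL, "a_b"))  # separator ignored as markup
    print(accepts(MIXED_MODEL, "a_b"))    # separator read as data, no match
    print(accepts(MIXED_MODEL, "ab#"))    # data only where the model allows it
```

In this toy the failure is reported right where it happens.  A real
parser would first try to infer omitted tags, which is why the message
it finally produces can point somewhere else entirely.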

The example given in the note quoted above, (x, #PCDATA), is, in my
opinion, oversimplified because if this were the whole content model
for an element, TWO SUCCESSIVE record ends would be required (before
the start-tag of "x") to trigger the problem (as Erik points out, the
first, since it follows the start-tag of the containing element, is
attributable to markup, and therefore ignored).  Some more illuminating
examples:

    (x?, y, #PCDATA)      trouble at &#RE;
    (#PCDATA | (x,y))     ditto
    (#PCDATA | x+)        trouble at &#RE;

The solution proposed is to "replace 'PCDATA' with the GI of an element
whose content is '#PCDATA' and both of whose tags can be omitted."
This is effective only in the first example above, because in the
others, the #PCDATA reference is not contextually required and
therefore the start-tag of its replacement GI cannot be omitted in
practice.

My experience with other solutions is that they are equally weak.  The
only GOOD solution I know of is to become aware of the problem and
avoid writing a DTD that could cause it.

In my opinion, this is an area where SGML is seriously flawed, for the
following reasons:

1.  The problem is subtle and hard to predict.  It relies on a
    designation of "mixed content", which in turn relies on the
    presence or absence of a single "#PCDATA" in what may be a very
    complex content model constructed with heavy reliance on parameter
    entities.  I have seen cases where the problem was missed for weeks
    because there were two very similarly defined elements, one of
    which ignored record-ends after end-tags and one of which choked on
    them.  It is also a non-trivial problem to study a content model
    and determine if #PCDATA is, in fact, permitted between any two
    sub-elements.

2.  The problem is one that does not manifest itself until exactly the
    right combination of circumstances is encountered.  This means that
    a very large collection of instances could be built against a DTD
    before one of them demonstrated the flaw.
    This means a lot of recoding of instances unless we can repair the
    DTD in such a way that existing instances will still parse.

3.  There is no good general way to fix the problem in an existing DTD
    without either requiring changes in the tag structure of existing
    documents or loosening the DTD so that it accepts #PCDATA in places
    where it formerly did not.

I know this issue has been a source of debate in the standards
committee.  I may well have missed some important points.  But, to me,
the note added by amendment 1 has the look and feel of a compromise
solution that resulted from a failure to comprehend the full impact of
the problem on a user.  I would (personally) urge the committee to take
another look at the problem.

The opinions stated herein are my own; they are not to be interpreted
as an official opinion from Datalogics, Inc.

                      *=====* TIME CANNOT BE WASTED *=====*
-- Jens M. Dill        \  But it can be used for purposes  /
   jmd@dlogics.com      \  other than what was intended.  /
                         *=============================*