Path: utzoo!utgpu!news-server.csri.toronto.edu!cs.utexas.edu!sdd.hp.com!wuarchive!uunet!dlogics!jmd
From: jmd@dlogics.COM (Jens M. Dill)
Newsgroups: comp.text.sgml
Subject: Re: Record boundaries in SGML
Summary: Record ends are DATA in MIXED content, but MARKUP in ELEMENT content
Message-ID: <693@dlogics.COM>
Date: 4 Dec 90 19:15:45 GMT
References: <1990Nov21.210152.2631@maytag.waterloo.edu> <ENAG.90Nov29012001@hild.ifi.uio.no>
Organization: Datalogics Inc., Chicago
Lines: 151

In article <ENAG.90Nov29012001@hild.ifi.uio.no>, (Erik Naggum) writes:
> In article <1990Nov21.210152.2631@maytag.waterloo.edu>, Gary Pianosi writes:
> 
> >  I am able to import hand-edited files into Author/Editor without error,
> >  but when I try to validate the document, I get the error message:
> >
> >	   "Validation Error:  Text not allowed here"
> >
> >  wherever there is a new line between an end tag and a start tag.
> >  ...
>
> The problem can be reduced, I think, to the problem of the treatment
> of Record End in this contrived example:
> 
> 	<!element foo (bar+)>
> 	<!element bar (#PCDATA)>
> 
>      1	`<foo>'
>      2	`<bar>'
>      3	`caninus'
>      4	`</bar>'
>      5	`<bar>'
>      6	`felinus'
>      7	`</bar>'
>      8	`</foo>'
> 
> (where ` signifies Record Start, ' Record End for clarity, line
> numbers for reference, only)
> 
> According to section 7.6.1, this will be interpreted at the outer
> (foo) level as:
> 
>      1	`<foo>'
>      2	`<bar>...</bar>'
>      5	`<bar>...</bar>'
>      8	`</foo>'
> 
> Now, the RE in line 1 is clearly the first RE in the content of foo,
> and the RE in line 5 is clearly the last RE in the content of foo.
> According to said section, these are to be ignored.
> 
> The problem is the RE in line 2, and the question boils down to this:
> 
> 	Is this RE recognized as /content/ or as /markup/?
> 
> I believe I understand this to be markup, and thus that it should be
> ignored.  It seems that Gary's problems stem from some decision
> amounting to viewing this as content, in which the RE would imply the
> start of a bar element, in which a new bar element is illegal (see
> amended note to section 11.2.4), or in which data content is not
> valid.
> 
> What am I missing here?  (I'm sure it's something.)
> 

What you are missing is a very obscure note added to section 11.2.4 by 
Amendment 1:

    NOTE -- It is recommended that "#PCDATA" be used only when data characters
    are to be permitted anywhere in the content of the element; that is, in a
    _content model_ where it is the sole token, or where _or_ is the only
    connector used in any _model group._

    This recommendation is made because separator characters, which are
    recognized as separators in _element content_, are treated as data in
    _mixed content._ ...

The note just about says it all, but it seriously understates the gravity
of the problem, and both the example and the sample solution provided are
laughably simplistic.

The core problem is that Gary has defined an element with "mixed content"
(the content model contains both "#PCDATA" and GI's of sub-elements), and
has done so in such a way that somewhere in the content model, you come 
across the situation where sub-element A ends and the only legal things
that can follow sub-element A are other elements (#PCDATA is not legal at
this point in the model).  Now, IF the instance is set up so that

   -- Element A has an explicit end-tag
   -- There is a separator character (space, tab, record end)
      between the end-tag and the start-tag of the next sub-element

then the parser attempts to read the separator as data, discovers it cannot
match data at this point in the current element, and starts trying to infer
omitted start-tags or end-tags that would get it to a point where #PCDATA
would be acceptable.  At this point the parser gets so confused that the
eventual error message has no chance of bearing any relation at all to anything
involved in the original problem.

The example given in the note quoted above, (x, #PCDATA), is, in my opinion,
oversimplified because if this were the whole content model for an element,
TWO SUCCESSIVE record ends would be required (before the start-tag of "x")
to trigger the problem (as Erik points out, the first, since it follows the
start-tag of the containing element, is attributable to markup, and therefore
ignored).  Some more illuminating examples:

    (x?, y, #PCDATA)    trouble at </X>&#RE;<Y>
    (#PCDATA | (x,y))   ditto
    (#PCDATA | x+)      trouble at </X>&#RE;<X>
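
To make these cases concrete, here is a rough sketch in Python that treats a
content model as a regular expression over tokens.  Everything in it is an
assumption made for illustration: element names are single lowercase letters
other than "d", the letter "d" stands for one run of data (such as a record end
read as data in mixed content), only the "," and "|" connectors are handled,
and the names model_to_regex and accepts are hypothetical.  It is not how any
real SGML parser works, and it ignores tag omission entirely.

```python
import re

def model_to_regex(model):
    """Translate a simplified content model into a regex over one-letter tokens."""
    text = "".join(model.split())            # drop all whitespace
    if "&" in text:
        raise ValueError("the 'and' connector is not handled in this sketch")
    text = text.replace("#PCDATA", "d*")     # data: zero or more data tokens
    text = text.replace(",", "")             # 'seq' connector becomes concatenation
    text = text.replace("(", "(?:")          # model groups need not capture
    return re.compile(text)

def accepts(model, tokens):
    """True if the token string is a complete match for the content model."""
    return model_to_regex(model).fullmatch(tokens) is not None

# Each case: (content model, instance without separator, instance with one)
cases = [
    ("(x?, y, #PCDATA)",  "xy", "xdy"),
    ("(#PCDATA | (x,y))", "xy", "xdy"),
    ("(#PCDATA | x+)",    "xx", "xdx"),
    ("(#PCDATA | x)*",    "xx", "xdx"),
]
for model, tight, spaced in cases:
    print(model, accepts(model, tight), accepts(model, spaced))
```

Under these assumptions the three models listed above accept the tightly
packed instance and reject the one with a data token between the sub-elements,
while a model such as (#PCDATA | x)*, where data is permitted anywhere,
accepts both.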

The solution proposed is to "replace 'PCDATA' with the GI of an element whose
content is '#PCDATA' and both of whose tags can be omitted."  This is effective
only in the first example above, because in the others, the #PCDATA reference
is not contextually required and therefore the start-tag of its replacement GI
cannot be omitted in practice.

My experience with other solutions is that they are equally weak.  The only
GOOD solution I know of is to become aware of the problem and avoid writing
a DTD that could cause it.


In my opinion, this is an area where SGML is seriously flawed, for the
following reasons:

1.  The problem is subtle and hard to predict.  It relies on a designation
    of "mixed content", which in turn relies on the presence or absence of
    a single "#PCDATA" in what may be a very complex content model constructed
    with heavy reliance on parameter entities.  I have seen cases where the
    problem was missed for weeks because there were two very similarly
    defined elements, one of which ignored record-ends after end-tags and
    one of which choked on them.  It is also a non-trivial problem to
    study a content model and determine if #PCDATA is, in fact, permitted
    between any two sub-elements.

2.  The problem does not manifest itself until exactly the right combination
    of circumstances is encountered.  A very large collection of instances
    could therefore be built against a DTD before one of them demonstrated
    the flaw, which means a lot of recoding of instances unless we can repair
    the DTD in such a way that existing instances will still parse.

3.  There is no good general way to fix the problem in an existing DTD without
    either requiring changes in the tag structure of existing documents or
    loosening the DTD so that it accepts #PCDATA in places where it formerly
    did not.
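
The check mentioned in point 1 (is #PCDATA permitted between any two
sub-elements?) can be approximated by brute force for small models.  The
following Python sketch is only an illustration under stated assumptions:
element names are single lowercase letters other than "d", "d" stands for a
run of data, only "," and "|" connectors are handled, the search is bounded by
max_len, and model_to_regex and risky_boundaries are hypothetical names.  A
real content model built from parameter entities would first have to be
expanded by other means.

```python
import re
from itertools import product

def model_to_regex(model):
    """Translate a simplified content model into a regex over one-letter tokens."""
    text = "".join(model.split())
    if "&" in text:
        raise ValueError("the 'and' connector is not handled in this sketch")
    text = text.replace("#PCDATA", "d*").replace(",", "").replace("(", "(?:")
    return re.compile(text)

def risky_boundaries(model, names, max_len=4):
    """List (left, right) element pairs that may be adjacent in a valid
    instance but cannot have data between them.  Bounded search only."""
    if "#PCDATA" not in model:
        return []        # element content: separators are markup, not data
    pattern = model_to_regex(model)
    risky = set()
    for length in range(2, max_len + 1):
        for seq in product(names, repeat=length):
            tokens = "".join(seq)
            if not pattern.fullmatch(tokens):
                continue                      # not a valid element sequence
            for i in range(1, length):
                with_data = tokens[:i] + "d" + tokens[i:]
                if not pattern.fullmatch(with_data):
                    risky.add((tokens[i - 1], tokens[i]))
    return sorted(risky)

print(risky_boundaries("(x?, y, #PCDATA)", "xy"))
print(risky_boundaries("(#PCDATA | x+)", "x"))
print(risky_boundaries("(#PCDATA | x)*", "x"))
```

An empty result only says that no trouble spot was found among element
sequences up to max_len tokens long; it is not a proof that the model is safe.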

I know this issue has been a source of debate in the standards committee.
I may well have missed some important points.  But, to me, the note added
by Amendment 1 has the look and feel of a compromise solution that resulted
from a failure to comprehend the full impact of the problem on a user.
I would (personally) urge the committee to take another look at the problem.


The opinions stated herein are my own; they are not to be interpreted as an
official opinion from Datalogics, Inc.


*=====* TIME CANNOT BE WASTED *=====*       -- Jens M. Dill
 \ But it can be used for purposes /           jmd@dlogics.com
  \ other than what was intended. /
   *=============================*