20 years of XML
Today marks the 20th anniversary of the publication of the
[Extensible Markup Language (XML) 1.0](https://www.w3.org/TR/1998/REC-xml-19980210)
recommendation by the [World Wide Web Consortium](https://www.w3.org)
(see also the original [Press Release](https://www.w3.org/Press/1998/XML10-REC)
and the most recent XML recommendation).
As the original specification text itself states:
> The Extensible Markup Language (XML) is a subset of SGML that is
> completely described in this document. Its goal is to enable generic
> SGML to be served, received, and processed on the Web in the way that
> is now possible with HTML. XML has been designed for ease of implementation
> and for interoperability with both SGML and HTML.

Moreover, the specification text states that
> This document specifies a syntax created by subsetting an existing,
> widely used international text processing standard (Standard Generalized
> Markup Language, ISO 8879:1986(E) as amended and corrected) for use on
> the World Wide Web.

While XML's original goal - to replace HTML on the Web - wasn't achieved,
it had tremendous success as the basis for countless standards
in publishing, as well as in enterprise, eGovernment, and healthcare data
exchange. Many, or even most, of the applications XML has found its way
into, however, have little or nothing to do with SGML's and XML's original
purpose of encoding semistructured *text*; rather, XML is being used as
a generic structured data format most of the time.
Data-oriented uses of XML, as opposed to text-centric applications, can
usually be spotted by the absence of *mixed content* and *non-trivial
content models*.
Mixed content is markup content containing both (significant)
element and text content at the same hierarchy level. For
example, plain flow content in HTML consists of text, optionally
intermixed with elements for marking up special semantics such
as the *a* (*anchor*) or the *em* (*emphasis*) elements, among others.
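
As a rough illustration of the definition above, the following sketch flags mixed content using Python's standard `xml.etree.ElementTree` module. The helper name `has_mixed_content` and the two sample documents are made up for this example; treating whitespace-only text as insignificant is a simplifying assumption.

```python
import xml.etree.ElementTree as ET


def has_mixed_content(element):
    """Return True if the element has both child elements and
    non-whitespace text at the same hierarchy level."""
    children = list(element)
    if not children:
        return False
    # Text before the first child is stored in element.text; text that
    # follows a child is stored in that child's .tail attribute.
    chunks = [element.text] + [child.tail for child in children]
    return any(chunk and chunk.strip() for chunk in chunks)


# Text-centric markup: text intermixed with an inline element.
paragraph = ET.fromstring("<p>See the <em>specification</em> for details.</p>")

# Data-oriented markup: only whitespace between the child elements.
record = ET.fromstring(
    "<person>\n  <name>Ada</name>\n  <born>1815</born>\n</person>"
)

print(has_mixed_content(paragraph))  # True
print(has_mixed_content(record))     # False
```
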
Non-trivial content models constrain and assign meaning to the
position where elements and text content appear relative to
other elements within the same hierarchy level. For example, the
HTML *dl* (*definition list*) element is commonly understood to have
one or more *dt* (*definition term*) elements, followed by zero or
more *dd* (*definition*) elements as child content. More
specific vocabularies than HTML, such as for academic or press
articles, impose much more elaborate content models.
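
A content model is essentially a regular expression over child element names. The sketch below checks the simplified *dl* model as described above (one or more *dt*, then zero or more *dd*, roughly `(dt+, dd*)` in DTD notation). The names `DL_MODEL` and `matches_dl_model` are illustrative, and this is not the content model of any actual HTML specification.

```python
import re
import xml.etree.ElementTree as ET

# One or more dt, followed by zero or more dd. Each child name is
# followed by a space so that names can't run into each other.
DL_MODEL = re.compile(r"(dt )+(dd )*")


def matches_dl_model(dl):
    """Check the sequence of child element names against DL_MODEL."""
    names = "".join(child.tag + " " for child in dl)
    return DL_MODEL.fullmatch(names) is not None


good = ET.fromstring("<dl><dt>XML</dt><dd>A subset of SGML</dd></dl>")
bad = ET.fromstring("<dl><dd>A definition without a term</dd></dl>")

print(matches_dl_model(good))  # True
print(matches_dl_model(bad))   # False
```
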
On the other hand, almost all XML vocabularies for data exchange
merely use XML as a vehicle for encoding structured (rather than
semistructured) data. In structured data, elements of different
types at the same hierarchy level are either not permitted at
all, or their order relative to each other doesn't matter.
Another characteristic of projects using data-oriented XML is an
ongoing discussion as to whether to use markup attributes or elements
for encoding a particular piece of information. Whereas in semistructured
text data, elements are used for displayed content, and attributes are
used for formatting or other properties of element content rather than
for displayed text as such, this distinction doesn't make sense for
structured data, which doesn't have an implicit concept of "rendering
content to a user".
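
To see why the choice is largely arbitrary for structured data, consider this small sketch, in which an attribute-based and an element-based encoding of the same (made-up) record yield identical data. The `person` vocabulary and the `to_record` helper are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

as_attributes = ET.fromstring('<person name="Ada" born="1815"/>')
as_elements = ET.fromstring(
    "<person><name>Ada</name><born>1815</born></person>"
)


def to_record(element):
    """Collect attributes and text-only child elements into one dict."""
    record = dict(element.attrib)
    for child in element:
        record[child.tag] = (child.text or "").strip()
    return record


print(to_record(as_attributes) == to_record(as_elements))  # True
```
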
In the design of XML as an SGML subset, a ruling principle was
that XML should allow *DTD-less* markup. A DTD (document type definition)
in SGML and XML contains declarations for the expected or allowed
element and attribute markup, among other things. Classic SGML
always required a DTD; the XML specification publication was
accompanied by an update to the SGML standard
to allow SGML to be DTD-less as well, and for
other XML alignments, so that XML could remain a proper SGML
subset.
The design goal of XML to be parsable without a DTD
becomes apparent in the removal of SGML *declared content*
for `EMPTY` and unparsed `CDATA` content from XML.
These features are needed to parse HTML's so-called *void elements*
such as the *img* and the *br* elements, and for dealing
with unescaped angle brackets in HTML's *script* and *style*
elements.
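
The practical consequence can be sketched with any XML parser; here Python's standard `xml.etree.ElementTree` serves as a stand-in. Without declarations telling it that *br* is empty or that *script* content is unparsed, the parser has to reject the HTML-style forms and only accepts the explicitly marked-up ones. The `parses_as_xml` helper is made up for this example.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if the markup is accepted by the XML parser."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# A void element written HTML-style leaves <br> unclosed in XML.
print(parses_as_xml("<p>line one<br>line two</p>"))    # False
# XML needs the emptiness spelled out in the document itself.
print(parses_as_xml("<p>line one<br/>line two</p>"))   # True

# An unescaped angle bracket in script content is read as markup.
print(parses_as_xml("<script>if (a < b) run();</script>"))     # False
print(parses_as_xml("<script>if (a &lt; b) run();</script>"))  # True
```
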
Along with attribute short forms, another major feature only possible
in the presence of markup declarations, and hence dropped from XML,
is SGML *tag inference* (SGML tag omission). Tag inference is what
makes the following
A valid HTML document
a valid HTML document, and treated as if
A valid HTML document
had been specified.
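
The effect of tag inference can be approximated with a toy normalizer built on Python's standard `html.parser`. This is only a sketch: real SGML derives omitted tags from the content models in the DTD, whereas the hard-coded `IMPLIED_END` table, the `TagInferrer` class and the `infer_tags` function below are hypothetical stand-ins covering just *p* and *li*, and attributes are dropped for brevity.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified stand-in for markup declarations:
# an open element (key) is implicitly ended by any start tag in its set.
IMPLIED_END = {
    "p": {"p", "ul", "li"},
    "li": {"li"},
}


class TagInferrer(HTMLParser):
    """Re-emit markup with omitted end tags made explicit."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of currently open element names
        self.output = []         # pieces of the normalized markup

    def _close_top(self):
        self.output.append(f"</{self.open_elements.pop()}>")

    def handle_starttag(self, tag, attrs):
        # Close open elements whose end this start tag implies.
        while (self.open_elements
               and tag in IMPLIED_END.get(self.open_elements[-1], ())):
            self._close_top()
        self.open_elements.append(tag)
        self.output.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag not in self.open_elements:
            return  # stray end tag: ignored in this sketch
        # Close intervening elements whose end tags were omitted.
        while self.open_elements[-1] != tag:
            self._close_top()
        self._close_top()

    def handle_data(self, data):
        self.output.append(data)

    def close(self):
        super().close()  # flush any buffered text first
        while self.open_elements:
            self._close_top()


def infer_tags(markup):
    parser = TagInferrer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.output)


print(infer_tags("<ul><li>one<li>two</ul><p>first<p>second"))
# <ul><li>one</li><li>two</li></ul><p>first</p><p>second</p>
```

A DTD-less XML parser has no such table to consult, which is why every end tag must be present in the document itself.
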
SGML, but not XML, also has *link processes*, which can be
understood as a mechanism for describing style sheets reusing SGML's
element and attribute declaration syntax. As is well known,
in HTML this role has been taken over by CSS.
The syntactic difference between HTML and CSS has given
rise to the notion of *semantic HTML*, which considers it
inappropriate to represent rendering properties - the original
use case for attributes in the first place - as HTML markup
attributes, and instead shifts this role entirely to an ad-hoc
syntax completely different from that of content markup.
The language purism of HTML semanticists is all the more surprising
since at the same time, HTML specification authors have
distanced themselves from established markup
standards and tools. Given the task of maintaining a
specification text as massive as WHATWG's HTML specification
itself, it isn't unreasonable to expect that its authors would
welcome basic means of document engineering for applying
consistency checks to the text, and for checking the HTML
grammar rules under specification. Both areas have been sources
of errors that have made their way into published specification
text, as reported in an [earlier blog post](/blog/blog1701.html).
It appears that generic HTML markup validation rules don't really
matter much, as even the specification text itself isn't valid
with respect to its own parsing rules. The HTML vocabulary
itself has such a broad scope as to be almost useless as a
tight markup vocabulary for any particular authoring purpose,
even if the specification text itself contains numerous examples
where HTML is portrayed in the role of an authoring language.
But, unconvincingly, even the specification text begins with
Beyond performing HTML validation, a more useful application
of classic markup techniques such as content model grammars is
to customize the generic HTML content models into grammars
for more specific content types, such as for blog posts. While
these techniques were envisioned for XHTML (the XML-ized variant
of HTML), 20 years after the inception of XML, SGML remains the only
standardized markup meta-language capable of doing so with HTML.
We don't need to be concerned about XML's fate, though, since
its use behind the scenes is so massive that it isn't going
anywhere soon, if ever.
**Happy birthday, XML**, and greetings to the speakers and
attendees of the [XML Prague 2018 conference](http://www.xmlprague.cz/)
who I'm sure are going to have a special event tonight.