20 years of XML

Today marks the 20th anniversary of the publication of the Extensible Markup Language (XML) 1.0 recommendation by the World Wide Web Consortium (see also the original Press Release and the most recent XML recommendation at https://www.w3.org/TR/REC-xml/).

As the original specification text itself states:

The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML.

Moreover, the specification text states that

This document specifies a syntax created by subsetting an existing, widely used international text processing standard (Standard Generalized Markup Language, ISO 8879:1986(E) as amended and corrected) for use on the World Wide Web.

While XML's original goal - to replace HTML on the Web - wasn't achieved, it has had tremendous success as the basis for countless standards in publishing, as well as in enterprise, eGovernment, and healthcare data exchange. Many, or even most, of the applications XML has found its way into, however, have little or nothing to do with SGML's and XML's original purpose of encoding semistructured text; rather, XML is being used as a generic structured data format most of the time.

Data-oriented uses of XML, as opposed to text-centric applications, can usually be spotted by the absence of mixed content and non-trivial content models.

Mixed content is markup content containing both (significant) element and text content at the same hierarchy level. For example, plain flow content in HTML consists of text, optionally intermixed with elements for marking up special semantics such as the a (anchor) or the em (emphasis) elements, among others.
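To make the distinction concrete, here is a minimal sketch in Python that tests a single element for mixed content using the standard library's xml.etree.ElementTree. The helper name has_mixed_content and the two sample documents are made up for illustration, and the check treats any non-whitespace text next to child elements as "significant", which is a simplifying assumption.

```python
import xml.etree.ElementTree as ET

def has_mixed_content(elem):
    """Return True if elem has child elements and non-whitespace text at the same level."""
    children = list(elem)
    if not children:
        return False
    # Text before the first child element is stored in elem.text.
    if elem.text and elem.text.strip():
        return True
    # Text following each child element is stored in that child's tail.
    return any(child.tail and child.tail.strip() for child in children)

# Text-centric markup: text and elements are interleaved inside p.
text_centric = ET.fromstring(
    "<p>See the <a href='#x'>anchor</a> for <em>details</em>.</p>"
)

# Data-oriented markup: only whitespace between the child elements.
data_oriented = ET.fromstring("<order>\n  <id>42</id>\n  <qty>3</qty>\n</order>")

print(has_mixed_content(text_centric))
print(has_mixed_content(data_oriented))
```

Running such a check over every element of a document gives a rough indication of whether a vocabulary is used in a text-centric or a data-oriented way.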

Non-trivial content models constrain and assign meaning to the position where elements and text content appear relative to other elements within the same hierarchy level. For example, the HTML dl (definition list) element is commonly understood to have one or more dt (definition term) elements, followed by zero or more dd (definition) elements as child content. More specific vocabularies than HTML, such as for academic or press articles, impose much more elaborate content models.
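As a rough illustration, the dl content model as described above can be checked by matching the sequence of child element names against a regular expression. This is only a sketch: the names DL_MODEL and matches_dl_model are hypothetical, text content is ignored, and a real SGML or XML parser would validate against a content model declared in a DTD instead.

```python
import re
import xml.etree.ElementTree as ET

# Assumed content model, following the description in the text:
# one or more dt elements, followed by zero or more dd elements.
DL_MODEL = re.compile(r"(dt )+(dd )*")

def matches_dl_model(dl):
    """Check the order of dl's child elements against DL_MODEL."""
    # Encode the child element names as a space-terminated sequence.
    sequence = "".join(child.tag + " " for child in dl)
    return DL_MODEL.fullmatch(sequence) is not None

ok = ET.fromstring("<dl><dt>XML</dt><dd>A subset of SGML</dd></dl>")
bad = ET.fromstring("<dl><dd>A definition without a term</dd></dl>")

print(matches_dl_model(ok))
print(matches_dl_model(bad))
```

The point is that the order of the children carries meaning here: swapping dt and dd changes whether the document conforms.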

On the other hand, almost all XML vocabularies for data exchange merely use XML as a vehicle for encoding structured (rather than semistructured) data. In structured data, elements of different types at the same hierarchy level are either not permitted at all, or their order relative to each other doesn't matter.

Another characteristic of projects using data-oriented XML is an ongoing discussion as to whether to use markup attributes or elements for encoding a particular piece of information. In semistructured text data, elements are used for displayed content, and attributes are used for formatting or other properties of element content rather than for displayed text as such. This distinction doesn't make sense for structured data, which has no implicit concept of "rendering content to a user".
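The following sketch shows why the choice is arbitrary for structured data: two hypothetical encodings of the same record, one using attributes and one using child elements, reduce to the same key/value pairs. The person vocabulary and the to_record helper are invented for this example.

```python
import xml.etree.ElementTree as ET

# The same record, encoded once with attributes and once with child elements.
as_attributes = ET.fromstring('<person name="Ada" born="1815"/>')
as_elements = ET.fromstring(
    "<person><name>Ada</name><born>1815</born></person>"
)

def to_record(elem):
    """Flatten attributes and simple child elements into one dictionary."""
    record = dict(elem.attrib)
    for child in elem:
        record[child.tag] = (child.text or "").strip()
    return record

print(to_record(as_attributes) == to_record(as_elements))
```

For a consuming application, nothing distinguishes the two forms; in a text-centric vocabulary, by contrast, moving displayed text into an attribute would change what a reader sees.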

In the design of XML as an SGML subset, a guiding principle was that XML should allow DTD-less markup. A DTD (document type definition) in SGML and XML contains declarations for the expected or allowed element and attribute markup, among other things. Classic SGML always required a DTD; the XML specification publication was accompanied by an update to the SGML standard (ISO 8879:1986/Cor.2:1999(E)) to allow SGML to be DTD-less as well, and for other XML alignments, so that XML could remain a proper SGML subset.

The design goal of XML to be parsable without a DTD is apparent from the removal of SGML's declared content for EMPTY and unparsed CDATA content from XML. These features are needed to parse HTML's so-called void elements, such as the img and the br elements, and to deal with unescaped angle brackets in HTML's script and style element content.
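The consequence can be observed with any DTD-less XML parser. The sketch below uses Python's xml.etree.ElementTree (an assumption made for illustration; any conforming XML parser should behave alike) to show that HTML-style void elements and unescaped angle brackets are rejected unless they are written in XML's explicit syntax. The helper is_well_formed is a made-up name.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if markup parses as XML without any DTD."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# XML needs the explicit empty-element syntax, since no DTD says br is EMPTY.
print(is_well_formed("<p>line one<br/>line two</p>"))

# HTML-style void element: the parser waits for a </br> that never comes.
print(is_well_formed("<p>line one<br>line two</p>"))

# Without declared CDATA content, a bare "<" in script text is a markup error.
print(is_well_formed("<script>if (a < b) run();</script>"))

# In XML the text has to be escaped or wrapped in a CDATA section.
print(is_well_formed("<script><![CDATA[if (a < b) run();]]></script>"))
```

An SGML parser equipped with the HTML DTD accepts the second and third forms, because the declarations tell it that br has no content and that script content is not to be parsed for markup.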

Along with attribute short forms, another major feature only possible in the presence of markup declarations, and hence dropped from XML, is SGML tag inference (SGML tag omission). Tag inference is what makes the following

<!DOCTYPE html>
<title>A valid HTML document</title>
<p>Body Text

a valid HTML document, and treated as if

<!DOCTYPE html>
<html>
  <head>
    <title>A valid HTML document</title>
  </head>
  <body>
    <p>Body Text</p>
  </body>
</html>

had been specified.

SGML, but not XML, also has link processes, which can be understood as a mechanism for describing style sheets reusing SGML's element and attribute declaration syntax. As is well known, in HTML this role has been taken over by CSS.

The syntactic difference between HTML and CSS has given rise to the notion of semantic HTML, which considers rendering properties - the original use case for attributes in the first place - inappropriate to represent as HTML markup attributes, and shifts this role entirely to an ad-hoc syntax completely different from that used for content markup. The language purism of HTML semanticists is all the more surprising since, at the same time, HTML specification authors have distanced themselves from established markup standards and tools. Confronted with the task of maintaining a specification text as massive as WHATWG's HTML specification, one would expect its authors to welcome basic means of document engineering, both for applying consistency checks to the text and for checking the HTML grammar rules being specified. Both are sources of errors that have made their way into published specification text, as reported in an earlier blog post.

It appears that generic HTML markup validation rules don't really matter much, as even the specification text itself isn't valid with respect to its own parsing rules. The HTML vocabulary has such a broad scope as to be almost useless as a tight markup vocabulary for any particular authoring purpose, even though the specification text contains numerous examples where HTML is portrayed in the role of an authoring language. That portrayal is unconvincing, however, since even the specification text itself begins with the words

<!DOCTYPE html>
  <!-- Note: This file is NOT HTML, it's a
       proprietary language that is then
       post-processed into HTML. -->

Beyond performing HTML validation, a more useful application of classic markup techniques such as content model grammars is to customize the generic HTML content models into grammars for more specific content types, such as for blog posts. While these techniques were envisioned for XHTML (the XML-ized variant of HTML), 20 years after the inception of XML, SGML remains the only standardized markup meta-language capable of doing so with HTML.

We don't need to be concerned about XML's fate, though, since its use behind the scenes is so massive that it isn't going anywhere soon, if ever.

Happy birthday, XML, and greetings to the speakers and attendees of the XML Prague 2018 conference who I'm sure are going to have a special event tonight.