DTD for HTML 5.1

Based on W3C's HTML5.1 specification, a new SGML DTD for HTML5.1 with broad applicability for validation and normalization has been developed. While the web community has largely treated SGML as a legacy technique for years now, this work shows that SGML is not only capable of describing and parsing HTML5 precisely and elegantly, it's also the only game in town able to tackle formal processing of web content based on an international standard.

W3C has recently published its HTML5.1 specification, based on work by WHATWG. This is a significant milestone: it's the first published HTML specification release since the initial HTML5 specification two years earlier, and hence can be expected to be informed by feedback from a large community of practitioners. It can thus be considered the first published HTML specification with broad applicability since the HTML4 specification in 1999.

W3C's HTML5.1 specification text is addressed to both browser implementers and web authors/developers. It hence contains redundancies, and possibly even contradictory rules, insofar as the HTML markup language is specified twice: once as what can be considered the nominal grammar productions for HTML ([html51], chapter 4), and a second time as a partial procedural specification for parsing HTML, aimed at implementers of HTML user agents.

In this paper the focus is on the HTML markup language as expressed by the nominal HTML grammar presented in chapter 4 of the specification. The procedural specification for parsing HTML, with its recovery rules for parsing almost anything as HTML, isn't considered in this work; its role within the specification framework is that of avoiding unspecified and potentially malicious behaviour, and hence it is considered to be addressed strictly to browser implementers rather than the larger web community.

Reformulating HTML5's parsing rules as a DTD grammar recovers fundamental rules about HTML's grammatical construction that were known to the developers of earlier HTML DTDs but lost in HTML5's grammar presentation. This instills confidence in the transcription process, and also aspires to a more conclusive, but in any case more succinct, expression of the HTML markup language.

A new formulation of HTML in a formal framework also benefits further HTML specification work. For example, the analysis of HTML5.1's parsing rules, afforded by their reformulation as an SGML DTD, has uncovered the following apparent flaws in the definition of HTML parsing rules:

  • parts of the table parsing rules are ambiguous, and not enforceable using W3C's validator software; also, the specification text contains invalid markup with respect to table data

  • the definition of the datalist element puts HTML parsing (the HTML membership decision problem) into a higher computational complexity class than it would otherwise be in, which is believed to be unintended
