DTDs for HTML Review Drafts January, 2020 and 2023

Is this HTML 6?

Last year's motion for a new HTML candidate recommendation, based on the HTML Review Draft published January, 2022, was rejected by W3C's HTML working group. No new HTML review draft has since been widely reviewed or proposed for recommendation status by W3C, even though the major reason for last year's rejection has since been prominently changed.

The issue of the Reporting API being unfinished, yet normatively referenced, had already been brought up during wide review of Review Draft January, 2021. Leaving aside the more serious privacy concerns also raised against the Reporting API, it wasn't clear where to add a notice pointing out the Reporting API's preliminary status, as there was no consensus on whether W3C could add normative text to upstream WHATWG HTML specifications under the Memorandum of Understanding Between W3C and WHATWG.

A change that was brought forward through WHATWG's editorial process was a resolution to the long-standing issue of the so-called HTML 5 outlining algorithm, where Steve Faulkner took it upon himself to edit the WHATWG specification text. The reasons for removing, rewriting, or at least warning about the outlining algorithm are well known: it isn't, wasn't, and won't be implemented in user agents, nor was it ever used in assistive technologies.

However, the concept of sectioning roots and the semantic sectioning elements such as article and aside, introduced by Ian Hickson into the HTML 5 specification as we know it today, are pretty fundamental and a key innovation of HTML 5, along with allowing hyperlink anchors around any content rather than just phrasing content.

HTML outlining is a beautiful idea with a distinct SGML flavor, using section element inference to bridge HTML "flat-earth markup" with XHTML-era ideas for a rank-less <h> element and for <section> elements. The Producing HTML tutorial explains outlining using SGML in detail. Its applicability isn't affected by HTML outlining changes, because the tutorial uses outlines only as an intermediate vehicle for generating navigational tables of contents. Unlike in SGML proper, in (legacy) HTML outlining no material DOM elements are inferred; the outlining algorithm is merely used as a device for establishing equivalence of DOMs with and without section elements, and of the desired accessibility semantics.
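The heading-driven section inference described above can be sketched in a few lines of Python. This is only a minimal sketch, assuming a flat list of (rank, title) pairs as input; it ignores sectioning roots, explicit section elements, and hgroup, so it is not the outlining algorithm of any HTML specification version. The names Section, infer_sections, and to_toc are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Section:
    """An inferred section: a heading plus the sections nested below it."""
    rank: int
    title: str
    children: List["Section"] = field(default_factory=list)


def infer_sections(headings: List[Tuple[int, str]]) -> Section:
    """Nest a flat, document-ordered list of (rank, title) headings.

    A heading closes every open section of the same or a lower rank
    (numerically greater or equal), then opens a new section.
    """
    root = Section(rank=0, title="(document)")  # rank 0 is never closed
    stack = [root]
    for rank, title in headings:
        while stack[-1].rank >= rank:
            stack.pop()
        node = Section(rank, title)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def to_toc(section: Section, depth: int = 0) -> List[str]:
    """Flatten the inferred sections into indented table-of-contents lines."""
    lines = []
    for child in section.children:
        lines.append("  " * depth + child.title)
        lines.extend(to_toc(child, depth + 1))
    return lines
```

Calling to_toc(infer_sections(...)) on headings of ranks 1, 2, 3, 2, 1 nests the second, third, and fourth headings under the first, which is the kind of navigational table of contents the tutorial derives from outlines. No elements are added to the input; the nesting exists only in the derived structure.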

Obviously, Ian Hickson knew his SGML well, as is also evident from his initial HTML 5 parsing algorithm description capturing SGML tag inference with remarkable precision. Unfortunately, the hard-coded formulation of HTML parsing he left behind has languished, because its presentation is littered with explicitly enumerated elements for where, e.g., parent content models are closed, rather than referencing the underlying generic SGML parsing rules (or at least the element categories which the specification itself goes to great lengths to define). As a consequence, subsequent additions of HTML elements weren't reflected in tag inference rules as early as the HTML version published as W3C HTML 5.1 right after the initial release, specifically with respect to the details, figcaption, figure, and menu elements. Basically, HTML's parsing rules were already ossified at that point.
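The difference between enumerating elements in a parsing rule and referencing an element category can be illustrated with a small Python sketch. Everything in it is hypothetical: the element sets are deliberately tiny and not copied from any HTML version, and the category name "flow-block" as well as the function names are invented for this illustration.

```python
# Hypothetical, deliberately tiny tables; not the lists of any HTML version.

# Style 1: a rule written against explicitly enumerated element names.
P_CLOSERS_ENUMERATED = frozenset({"div", "ul", "table", "p"})

# Style 2: a rule written against a category that the language definition
# maintains in one place ("flow-block" is an invented category name).
CATEGORIES = {"flow-block": {"div", "ul", "table", "p"}}


def closes_p_enumerated(tag: str) -> bool:
    """True if a start-tag for `tag` implies the end of an open p (style 1)."""
    return tag in P_CLOSERS_ENUMERATED


def closes_p_by_category(tag: str) -> bool:
    """True if a start-tag for `tag` implies the end of an open p (style 2)."""
    return tag in CATEGORIES["flow-block"]


def add_element(tag: str, category: str) -> None:
    """Register a newly introduced element with a category."""
    CATEGORIES[category].add(tag)
```

After add_element("figure", "flow-block"), the category-based rule picks up the new element automatically, while the enumerated rule keeps ignoring it until someone remembers to edit the list by hand.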

We also note that Review Draft January, 2020 now accepts multiple main sections, where in previous versions W3C editors made a point of enforcing a singular mapping to a main ARIA landmark role (main was even introduced for that purpose). The wider discussion generally revolves around design options available to markup languages since the beginning, namely whether to represent some piece of information as an element name, ID value, ARIA role attribute, CSS class, or other attribute value.

This is why SGML has link process definitions (LPDs), providing the necessary built-in support for exactly the kind of attribute-to-element mappings required for outlining and ARIA role mapping, and for presenting the results as logical document "views" as part of the core language. Oddly, in the HTML context, CSS isn't even considered for producing ARIA mappings, when according to its proponents it is supposed to bridge presentational and structural gaps.

Considering that the HTML outlining algorithm and the concept of sectioning roots had been in the HTML specification for nearly two decades, their unceremonious removal or deprecation is problematic. The reasoning behind the change seems sound enough, but of course doesn't change the large corpus of existing content produced under the assumption of section inference implied by heading elements. Their departure would thus seem to warrant a prominent major-version bump or renaming, as the whole point of the WHATWG HTML initiative was backwards compatibility with existing content, with HTML 5 now being in use long enough to have created its own legacy.

To complicate matters, the HTML Review Draft published January, 2020 and accepted as W3C recommendation includes the hgroup element. hgroup had been part of WHATWG HTML specifications for the longest time, but was deliberately not included in previous W3C recommendations. Its silent inclusion in W3C's 2021 recommendation as an element allowing heading elements with multiple, different ranks, the only purpose of which is to stop the (legacy) outlining algorithm from inferring sections from their presence, is inconsistent with all previous W3C recommendations. In fact, hgroup is changed again in Review Draft January, 2023, to allow only a single heading element (with alternate titles going into p elements), as part of the major change to HTML outlining already described above.
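The two hgroup content models contrasted above can be written down as a pair of Python checks. This is a simplified sketch, assuming child elements are given as a list of tag names; it ignores script-supporting elements and any other detail of the actual specification text, and the function names are invented for illustration.

```python
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def hgroup_ok_2020(children):
    """Older model: one or more heading elements, ranks may differ."""
    return len(children) >= 1 and all(c in HEADINGS for c in children)


def hgroup_ok_2023(children):
    """Newer model: exactly one heading, surrounded by any number of p."""
    only_allowed = all(c in HEADINGS or c == "p" for c in children)
    heading_count = sum(1 for c in children if c in HEADINGS)
    return only_allowed and heading_count == 1
```

Under these assumptions, ["h1", "h2"] passes only the older check and ["h1", "p"] passes only the newer one, so content using hgroup for a title plus subtitle is valid under one draft and invalid under the other, in either direction.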

At the time of this writing, the HTML working group hasn't started wide review of WHATWG's Review Draft January, 2023, on which the next HTML DTD would be based (including the upstream change to hgroup), nor of other review drafts, as the HTML working group charter would imply and as was done in previous years. The status of the HTML working group, and that of W3C, Inc. in general, considering its change in legal form, as well as its ability and commitment to bring broad consensus behind future HTML recommendations, must be questioned at this point.

For these reasons, HTML Review Draft published 2020, January 29th, endorsed as W3C recommendation, remains the preferred DTD for HTML 5; the DTD for HTML Review Draft published 2023, January 16th is provided here purely for completeness.

Apart from the hgroup-related outlining changes discussed, modifications in Review Draft January, 2023 relative to Review Draft January, 2020, are rather modest, however, and include just the following items:

  • changed tag omission rules for body, no longer allowing end-tag omission on noscript child content, which seems arbitrary

  • summary now allowing any mix of heading and phrasing child content, rather than either a single heading element or phrasing content as before, when the motivation for either choice is really unclear; considering WHATWG issues 2272 and 8864, one can expect further change to the effect that certain types of interactive content are prohibited (similar concerns are also raised with respect to hyperlink anchors within anchors)

  • menuitem being reintroduced albeit with another content model

  • the newly formulated constraint that <a> (anchor) elements disallow descendant elements having their tabindex attribute specified, which can't be represented in an SGML DTD anyway

  • contrast handling of the rt and rtc ruby-supporting elements (removed because only supported by Firefox) with dialog, which is unchanged in Review Draft 2023 but was already included in Review Draft 2020 even though only implemented by Chrome (cf. issue 4937), a fact that WHATWG editors aren't happy with either

Hence, there's almost no progress to show for three years, and the changes that made it in are controversial at best and will, without exception, require further adaptations in the future.

As demonstrated by these changes, a process where a snapshot is mechanically taken at a certain point in time, without quality assurance or any other editorial workflow whatsoever, and without W3C or anyone else reviewing, tends to result in highly volatile specification releases, with changes either incidental in nature (such as for hgroup and also the legend element) or merely an artifact of bureaucracy (such as dialog, rt, or menuitem).

The impression emerging from the presence of hgroup in HTML Review Draft 2020 (and the W3C recommendation based on it, no less) is that it has merely been smuggled in while Steve Faulkner wasn't looking, which doesn't reflect favorably on W3C's process and scrutiny.

Consensus is also increasingly difficult to reach due to the HTML specification aggregating additional functionality and sheer volume, as can be seen from last year's rejection over the Reporting API, with the awkward W3C/WHATWG Memorandum of Understanding seemingly not helping progress either.

For all these reasons, it's expected that HTML Review Draft 2020 will be the final HTML 5 version published and endorsed by W3C. Even if W3C is able to organize consensus and publish a new version based on current WHATWG work, that version, due to the changes already discussed, will have to represent a new major version of the HTML markup language, which it kind of has, by simply not being called HTML 5 anymore.

sgmljs.net release 0.2.5-beta

Pre-production release for sgmljs.net SGML.

sgmljs.net 0.2.5-beta has been released with the HTML Review Draft 200129 mini-DTD as embedded HTML DTD, resolvable via the about:legacy-compat system identifier.

See Parsing HTML for a tutorial on parsing HTML and checking out the updated embedded Mini-DTD.

See also release notes for 0.0.10-alpha.