HTML5.1 DTD Reference

W3C HTML5 (http://sgmljs.net/schemas/sgml-cms/w3c/html5.dtd)
DTD for W3C HTML 5 (deprecated); while the DTD itself is deprecated, the text describes the construction of the HTML 5 DTD in detail; see later versions for important revisions
W3C HTML5.1 (http://sgmljs.net/schemas/sgml-cms/w3c/html51.dtd) (http://sgmljs.net/schemas/sgml-cms/w3c/html51mini.dtd)
Full DTD and Minimal DTD for W3C HTML 5.1 (superseded by HTML 5.2)
W3C HTML5.2 (http://sgmljs.net/schemas/sgml-cms/w3c/html52.dtd) (http://sgmljs.net/schemas/sgml-cms/w3c/html52mini.dtd)
Full DTD and Minimal DTD for W3C HTML 5.2 (superseded by HTML RD 200129)
HTML Review Draft 200129 (http://sgmljs.net/schemas/sgml-cms/w3c/html200129.dtd) (http://sgmljs.net/schemas/sgml-cms/w3c/html200129mini.dtd)
Full DTD and Minimal DTD for HTML Review Draft 200129; note the Minimal DTD is the declaration set resolved via the 'about:legacy-compat' system identifier in sgmljs.net SGML
HTML Review Draft 230116 (http://sgmljs.net/schemas/sgml-cms/w3c/html230116.dtd) (http://sgmljs.net/schemas/sgml-cms/w3c/html230116mini.dtd)
Full DTD and Minimal DTD for HTML Review Draft 230116 (experimental)
Note: the W3C HTML 5 series DTDs are deprecated and superseded by the HTML Review Draft (200129 and newer) DTDs. The current HTML RD 200129 Minimal DTD assumes SGML IMPLYDEF ELEMENT ANYOTHER behaviour with respect to undeclared elements as defined in ISO/IEC 8879:1986/Cor.2:1999(E) in support of SVG and MathML foreign vocabularies and custom elements. While IMPLYDEF ELEMENT ANYOTHER is supported by sgmljs.net SGML, it might not be by other SGML software such as OpenSP. If a minimal DTD for use with OpenSP is desired, use the legacy Minimal HTML 5.1 DTD. Note this only affects the minimal but not full DTD variants.

Overview

This is an extended, corrected version of The HTML 5.1 DTD, presented at the XML Prague 2017 conference. Compared to the initial revision for W3C HTML5, it introduces a "minimal" DTD for practical HTML parsing and processing, and takes a lenient approach to the open problems described in the earlier analysis relating to script and style data content and the namecasing of elements. In line with historic DTDs for HTML, it also gives up modelling HTML's id and href attributes as SGML attributes with declared values ID and IDREF, respectively, and instead treats them as ordinary CDATA values. Moreover, the former nomenclature of a "Restrictive" and a "Permissive" DTD was changed in favour of a "Full" and a "Minimal" DTD, respectively (the "restrictive"/"permissive" nomenclature was unfortunate, as it clashes with the use of these terms in the historic HTML 4.01 DTD for different concepts).

The Full HTML5.1 DTD is a transcription of WHATWG's HTML specification prose into an SGML DTD. It follows WHATWG snapshots as published by W3C (WHATWG itself doesn't publish stable snapshots of its specifications). The Full DTD covers all elements of HTML, SVG, MathML, and the ARIA attributes. Its construction is described in the reference for the W3C HTML 5 DTD; this document describes only the modifications for version 5.1.

The Minimal HTML5.1 DTD is a compact DTD containing only essential parsing rules for HTML. As only HTML's special rules for void elements and enumerated attributes are included (others being admitted freely), the Minimal HTML5.1 DTD's usefulness for validation purposes is limited. Instead, its purpose is to provide a minimal bundled declaration set for content parsing and production tasks for modern and idiomatic HTML in sgmljs.net and other SGML software with support for resolving declaration sets via catalog resolution (in sgmljs.net, the Minimal HTML5.1 DTD is resolved and accessed by the about:legacy-compat system identifier).

Casual readers will most likely be interested in the Minimal DTD; its introductory text also contains an easy-to-follow introduction to tag omission and other forms of shorthand markup in HTML and SGML.

These DTDs are primarily useful for checking/validating and normalizing HTML. In SGML applications, it's common (and the point of using SGML in the first place) to define custom DTDs containing application-specific grammar and processing rules, including for generic HTML applications such as outlining, metadata extraction, search result formatting, paging, templating, etc. Creating custom DTDs based on the HTML5.1 DTDs provided here is expected (and explicitly permitted).

The Full HTML5.1 DTD

  • is a straightforward translation of the specification text for HTML's content model and tag omission rules into DTD grammar rules; the specification text is included as an SGML comment along with the translated DTD rule for reference; where HTML content model rules are represented incompletely, a note is included in the SGML comment for the declaration as well

  • doesn't include attribute default values and predefined HTML entities (character entity references), as explained in attribute defaults

  • is designed to be used with the restrictive variant of the SGML declaration for HTML5.1

  • is essentially already described in detail in the previous reference for W3C HTML5 and has only been brought up to date with W3C HTML 5.1

The Minimal HTML5.1 DTD

  • only contains declarations needed for parsing HTML void elements, omitted attribute names, commonly used unquoted attribute values, and omitted start- and end-element tags allowed by HTML

  • as opposed to the Full DTD, is designed to be used with the permissive variant of the SGML declaration for HTML5, which allows undeclared elements, and can only be used with SGML systems supporting WebSGML/ISO 8879 Annex additions; specifically, it makes use of WebSGML's IMPLYDEF ELEMENT ANYOTHER feature (to be able to infer omitted end-element tags on p, li, and other elements), and of IMPLYDEF ATTLIST YES (allowing undeclared attributes to be used)

The Minimal HTML5 DTD

The Minimal HTML5 DTD is an extract of the Full HTML5.1 DTD, edited to make use of IMPLYDEF ELEMENT ANYOTHER and other WebSGML features.

IMPLYDEF ELEMENT ANYOTHER is an SGML declaration property that, like IMPLYDEF ELEMENT YES, allows undeclared elements to occur in document instances. If an undeclared element x is encountered in a document, it is treated as if it had been declared <!ELEMENT x - O ANY>, which means that any element or character data is permitted as child content of x and, moreover, that x's end-element tag can be omitted.

In regular SGML, end-element tag omission is only considered if either

  • a parent element's end-element tag is encountered, or
  • no more element content is expected within the parent (ie. because the declared content model of the parent element is completed at the context position already and doesn't admit optional elements at its end), or
  • an element is encountered that isn't allowed to occur (ie. because it is being excluded at the context position).

Declaring an end-tag omission indicator (the letter O in the declaration) can't have consequences for the latter two cases if neither a content model nor content exclusion exceptions have been declared on the respective element. WebSGML's implied default declaration for elements, <!ELEMENT x - O ANY>, has neither; however, WebSGML's IMPLYDEF ELEMENT ANYOTHER feature, when activated, treats an undeclared element as completed and infers its end-element tag (if missing in content) when the element is immediately followed by a start-element tag for the same element.

Tag omission on paragraphs

For example, consider end-element tag omission for the p element as used in HTML:

<p>This is the first paragraph.
<p>This is the second.

SGML (when IMPLYDEF ELEMENT ANYOTHER is active and no declaration for the p element is present) will parse this as if

<p>This is the first paragraph.</p>
<p>This is the second.

had been specified; that is, SGML will infer the </p> end-element tag upon seeing the <p> start-element tag for the second paragraph.
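The same-element rule can be mimicked in a few lines of Python. This is a toy model under stated assumptions, not how sgmljs.net or any other SGML parser is implemented: the function name `infer_same_element_end_tags`, the event tuples, and the simplified tag syntax (no attributes, no declarations, every element treated as undeclared) are made up for illustration.

```python
import re

# Toy tokenizer: start tags, end tags (both without attributes), and text runs.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)")


def infer_same_element_end_tags(markup):
    """Return parse events, inferring an end tag whenever the innermost
    open element is followed by a start tag for the same element."""
    events = []
    stack = []  # names of the currently open elements, innermost last
    for close, name, text in TOKEN.findall(markup):
        if text:
            text = text.strip()
            if text:
                events.append(("text", text))
        elif close:
            # Explicit end tag: close everything up to and including it.
            if name in stack:
                while True:
                    top = stack.pop()
                    events.append(("end", top))
                    if top == name:
                        break
        else:
            # Start tag: the same-element rule closes the open element first.
            if stack and stack[-1] == name:
                events.append(("end", stack.pop()))
            stack.append(name)
            events.append(("start", name))
    return events


for event in infer_same_element_end_tags(
    "<p>This is the first paragraph.\n<p>This is the second.\n"
):
    print(event)
```

On the two-paragraph fragment, the model reports an inferred end event for the first p only; the second p stays open because nothing follows it, matching the parse shown above.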

Tag omission on document-level elements

Moreover, when put in a context where paragraph elements are usually expected in HTML, the second omitted </p> end-element tag (and additional missing tags) are inferred as well.

For example, putting the two paragraphs into a text file, and (optionally) adding a <title> element as follows

<title>Tag omission in HTML paragraphs
<p>This is the first paragraph.
<p>This is the second.

and then parsing the file using either the Full or the Minimal DTD for HTML5.1 gives the same result as if the following had been specified:

<html>
	<head>
		<title>Tag omission in HTML paragraphs</title>
	</head>
	<body>
		<p>This is the first paragraph.</p>
		<p>This is the second.</p>
	</body>
</html>

The html, head, and body tags are inferred based on the following DTD declarations for these well-known elements

<!ENTITY % metadata "base|link|meta|noscript|script|style|template|title">
<!ENTITY % scripting "script|template">
<!ELEMENT html O O (head,body) +(%scripting)>
<!ELEMENT head O O (%metadata;)*>
<!ELEMENT body O O ANY>

Note that all of these element declarations, by using O O (double capital letter O for "omission") as tag omission indicator, declare the respective element to admit both start- and end-tag omission.

SGML will

  1. create an html element if it isn't there, knowing that an html document element must be the first content element in an HTML file

  2. infer the head element if it isn't there, since the content model requires it at the start of html's content

  3. place the title element as child content of head, since it's allowed/expected to occur here according to head's model group expression (formed by substituting %metadata; with the base|link...|title string also used in the Full HTML5.1 DTD)

  4. infer the end-element tag for title and for the head element (since the p element following the title isn't allowed to occur in those)

  5. infer the body element (if it isn't there), since it's required to follow the head element

  6. finally, place the p elements or other flow content

  7. infer the end-element tags for p, body, and html at the end of the document.
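The placement logic in these steps can be sketched as a toy model. It is an illustration under simplifying assumptions rather than an SGML parser: the function `wrap_document` and its input format are hypothetical, only top-level elements are considered, and the content models are reduced to a membership test against the %metadata; name list shown above.

```python
# Names taken from the %metadata; parameter entity shown above.
METADATA = {
    "base", "link", "meta", "noscript",
    "script", "style", "template", "title",
}


def wrap_document(children):
    """children: (name, text) pairs for the top-level elements, in order.

    Returns a nested tuple with the html, head, and body elements inferred.
    """
    head, body = [], []
    for name, text in children:
        # Metadata elements belong to head, but only until body has started;
        # once head is closed it cannot be reopened.
        if name in METADATA and not body:
            head.append((name, text))
        else:
            body.append((name, text))
    return ("html", [("head", head), ("body", body)])


document = wrap_document([
    ("title", "Tag omission in HTML paragraphs"),
    ("p", "This is the first paragraph."),
    ("p", "This is the second."),
])
print(document)
```

The title lands in head and both paragraphs land in body, mirroring the fully tagged document shown earlier.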

Tag omission in lists

Tag omission in ul and ol elements is based on the following DTD declarations:

<!ELEMENT ul - - (li)* +(%scripting)>
<!ELEMENT ol - - (li)* +(%scripting)>

The ul and ol elements themselves don't admit tag omission. But the li element, not being declared at all, can use end-element tag inference based on IMPLYDEF ELEMENT ANYOTHER, analogous to the p element example above.

For example, the following HTML fragment

<ul>
<li>A list item
<li>Another list item
</ul>

is parsed as if the end-element tags of the li elements had been specified:

<ul>
<li>A list item</li>
<li>Another list item</li>
</ul>

Tag omission in definition lists

Definition lists are declared as follows (using the same declaration as the Full DTD)

<!ELEMENT dl - - (dt+,dd+)* +(%scripting)>
<!ELEMENT dt - O ANY -(dt,dd)>

A declaration for dd is absent, meaning dd is using end-tag omission afforded by IMPLYDEF ELEMENT ANYOTHER.

Basic tag omission in definition lists works as follows:

<dl>
<dt>Term 1
<dd>Definition 1
<dd>Definition 1.2
</dl>

Like other lists and block-level elements, a definition list must begin with an explicit <dl> start-element tag, as HTML itself also requires. The remaining tags in the example are inferred as follows:

  • the dt element is terminated as soon as the dd start-element tag is encountered, because dd appears as excluded element in dt's content exceptions

  • the first dd element is terminated by the subsequent dd element due to the end-tag inference afforded by IMPLYDEF ELEMENT ANYOTHER

  • the second dd element is terminated along with the </dl> end-element tag.

The following example illustrates a basic difference between the Minimal DTD and the Full DTD with respect to dd end-tag omission:

<!-- explicit dd end-element tag to stop dl nesting -->
<dl>
<dt>Term 1
<dd>Definition 1
<dd>Definition 1.2</dd>
<dt>Term 2
<dd>Definition 2
</dl>

The example starts with the same sequence of markup events as before; the second dd element must be explicitly closed when using the Minimal HTML5.1 DTD (whereas its end-element tag can be omitted when using the Full HTML5.1 DTD).

This is because, by default in HTML, definitions (dd elements) may contain nested definition lists (and thus dt elements); hence, the mere occurrence of dt anywhere in dd content can't be used as a signal to end dd elements.

The Full DTD, on the other hand, can infer dd's end-element tag because it has knowledge of all HTML elements that can appear directly as content of dd (so the SGML parser can terminate dd when it sees dt).

If, in the above fragment, the </dd> end-element tag had been omitted, then parsing using the Minimal DTD would result in a nested dt/dd sequence as the child content of the second dd element as follows:

<dl>
	<dt>Term 1</dt>
	<dd>Definition 1</dd>
	<dd>Definition 1.2
		<dt>Term 2</dt>
		<dd>Definition 2</dd>
	</dd>
</dl>

More often than not, in a context where tag omission is used in authoring, definition list nesting is probably undesired. The way to force a dt element to end a dd element (in the Minimal DTD, which doesn't "know" all HTML elements) is to disallow dt as child content of dd. While this assumption is not for the Minimal DTD to make in general, it can easily be achieved using a declaration in the document's internal subset such as

<!DOCTYPE html SYSTEM ".. URL of Minimal HTML5.1 DTD ..." [
	<!ELEMENT dd - O ANY -(dd|dt)>
]>

This declares the dd element to SGML and hence stops IMPLYDEF ELEMENT ANYOTHER inference of dd end-element tags. To compensate, dd is excluded from its own content; dt is excluded as well, with the effect that a dd element is closed automatically when a dt (or dd) start-element tag is encountered.
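How exclusion exceptions drive end-tag inference can be sketched with another toy model. The function `close_on_exclusion`, its tag-list input, and the exclusion table are hypothetical and chosen for illustration; the model ignores content models entirely and assumes that every open element allows its end tag to be omitted, which holds for dt and dd here but not for SGML elements in general.

```python
def close_on_exclusion(tags, exclusions):
    """tags: start tags as names, explicit end tags as "/name".
    exclusions: maps an element name to the set of names it excludes.

    Returns the tag sequence with inferred end tags filled in.
    """
    stack, out = [], []
    for tag in tags:
        if tag.startswith("/"):
            # Explicit end tag: close everything up to and including it.
            name = tag[1:]
            if name in stack:
                while True:
                    top = stack.pop()
                    out.append("/" + top)
                    if top == name:
                        break
            continue
        # An exclusion applies to the whole subtree of the excluding element,
        # so close innermost elements until no open element excludes the tag.
        while any(tag in exclusions.get(name, ()) for name in stack):
            out.append("/" + stack.pop())
        stack.append(tag)
        out.append(tag)
    return out


# Exclusions as declared for dt in the DTD and for dd in the internal subset.
rules = {"dt": {"dt", "dd"}, "dd": {"dd", "dt"}}
print(close_on_exclusion(["dl", "dt", "dd", "dd", "dt", "dd", "/dl"], rules))
```

With both exclusions in place, the model produces a flat dt/dd sequence inside dl without any explicit dd end tag in the input.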

While the HTML5.1 DTDs don't make use of it, SGML also supports implementing start-element tag omission on the dt element, allowing even shorter forms of writing definition lists such as

<!-- not supported with the HTML5.1 DTDs provided here -->
<dl>
Term
<dd>Definition
</dl>

Tag omission in tables

The table-related elements are declared as follows in the Minimal HTML5.1 DTD

<!ELEMENT table - - (caption?,colgroup*,thead?,(tbody*|tr+),tfoot?) +(%scripting;)>
<!ELEMENT thead - O (tr*) +(%scripting;)>
<!ELEMENT tbody O O (tr*) +(%scripting;)>
<!ELEMENT tfoot - O (tr*) +(%scripting;)>
<!ELEMENT tr - O (td|th)* +(%scripting;)>
<!ELEMENT th - O ANY -(th|td|tr)>

Similar to the limitations with respect to dd end-element tag omission explained before, this declaration restricts th but not td elements (which are allowed to contain nested tables according to the HTML specification); hence </td> end-element tags must be placed before <tr> start-element tags starting new table rows:

<table>
	<tr>
		<th>table-head 1
		<th>table-head 2
	<tr>
		<th>table-head 1
		<td>table-head 2</td>
	<tr>
		<td>table-data 1
		<td>table-data 2
</table>

Again, the Full DTD doesn't have this limitation, and it can be switched off in a document using the Minimal DTD by using a custom declaration for the td element; for example, if the internal subset contains the declaration

<!ELEMENT td - O ANY -(table|th|td|tr)>

then the otherwise required </td> end-element tag can be omitted (at the expense of disallowing nested tables; avoiding nested tables is usually recommended practice anyway).

Aggressive use of tag omission in table content is discouraged; for more info on table models, see also the section on table content representation in the Full DTD.

Other element declarations

Apart from declarations necessary to drive tag omission of the html, head, dl, ul, ol, table, and thead and some of their immediate child content elements as explained before, the Minimal HTML5.1 DTD contains

  • declarations for the div, span, and section elements to switch off end-tag omission due to IMPLYDEF ELEMENT ANYOTHER on these

  • declarations of the script and style elements with CDATA declared content (the same declarations as used in the Full HTML5.1 DTD)

  • of the remaining elements, those element declarations for elements with declared content EMPTY in the Full DTD, ie. the HTML void elements base, link, meta, hr, br, wbr, img, param, source, track, area, col, input, keygen, menuitem

Attribute declarations

Only attribute declarations with "unusual" parsing rules are included in the Minimal HTML5.1 DTD, namely those for HTML5's Boolean attributes and other enumerated attributes; all other attributes in content are permitted due to IMPLYDEF ATTLIST YES.

Specifically, the hidden and lang global attributes are declared on every element using the declaration

<!ATTLIST #ALL hidden (hidden) #IMPLIED lang NMTOKEN #IMPLIED>

along with a couple of attribute declarations for the enumerated attributes of HTML, declared on individual elements.

Note that element declarations for the elements on which enumerated attributes can occur aren't necessarily included in the Minimal DTD (ie. only insofar as necessary for other purposes).
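The effect of declaring an enumerated attribute such as hidden can be illustrated with a small sketch of how a bare token in a start tag is resolved to a full name/value pair (SGML's omitted attribute name shorthand). The function `expand_attributes`, the `ENUMERATED` table, and the pre-split token input are assumptions made for this example and do not reflect any parser's actual interface.

```python
# Hypothetical table: attribute name -> its declared enumerated tokens,
# mirroring the "hidden (hidden)" declaration shown above.
ENUMERATED = {"hidden": {"hidden"}}


def expand_attributes(tokens):
    """tokens: attribute specifications as written in a start tag,
    for example ["hidden", "lang=en"]. Returns a name -> value dict."""
    result = {}
    for token in tokens:
        if "=" in token:
            name, value = token.split("=", 1)
            # Accept both quoted and unquoted attribute values.
            result[name] = value.strip("\"'")
        else:
            # Omitted attribute name: the bare token must be a declared
            # enumerated value, which identifies the attribute it belongs to.
            owners = [n for n, values in ENUMERATED.items() if token in values]
            if len(owners) != 1:
                raise ValueError(f"cannot resolve bare token {token!r}")
            result[owners[0]] = token
    return result


print(expand_attributes(["hidden", "lang=en"]))
```

A bare token that matches no declared enumerated value raises an error in this sketch, which is why the Minimal DTD has to declare the enumerated attributes even though other attributes are admitted freely.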