This is an extended, corrected version of The HTML 5.1 DTD
presented at the XML Prague 2017 conference. Compared to the
initial revision for W3C HTML5, it introduces
a "minimal" DTD for practical HTML parsing and processing,
and takes a lenient approach to the open problems described
in the earlier analysis concerning script and style
data content and the namecasing of elements. In line
with historic DTDs for HTML, it also gives up modelling HTML's id
and href attributes as SGML attributes with declared values
ID and IDREF, respectively, and instead treats these
as ordinary CDATA values. Moreover, the former
nomenclature of a "Restrictive" and a "Permissive"
DTD was changed in favour of a "Full" and a "Minimal" DTD,
respectively (the "restrictive"/"permissive" nomenclature
was unfortunate, as it clashes with the use of these terms
in the historic HTML 4.01 DTD for different concepts).
The Full HTML5.1 DTD is a transcription of WHATWG's HTML specification prose into an SGML DTD. It follows WHATWG snapshots as published by W3C (WHATWG itself doesn't publish stable snapshots of its specifications). The Full DTD covers all elements of HTML, SVG, and MathML, as well as the ARIA attributes. Its construction is described in the reference for the W3C HTML 5 DTD; only the modifications for version 5.1 are described in this document.
The Minimal HTML5.1 DTD
is a compact DTD containing only essential parsing rules for HTML.
As only HTML's special rules for void elements and
enumerated attributes are included (others being admitted
freely), the Minimal HTML5.1 DTD's
usefulness for validation purposes is limited. Instead, the
purpose of the Minimal HTML5.1 DTD is to provide a
minimal bundled declaration set for content parsing and
production tasks for modern and idiomatic HTML in sgmljs.net
and other SGML software with support for resolving
declaration sets via catalog resolution (in sgmljs.net,
the Minimal HTML5.1 DTD is resolved and accessed by
the about:legacy-compat system identifier).
Casual readers will most likely be interested in the Minimal DTD; its introductory text also contains an easy-to-follow introduction to tag omission and other forms of shorthand markup in HTML and SGML.
These DTDs are primarily useful for checking/validating and normalizing HTML. In SGML applications, it's common (and the point of using SGML in the first place) to define custom DTDs containing application-specific grammar and processing rules, including for generic HTML applications such as outlining, metadata extraction, search result formatting, paging, templating, etc. It is expected (and explicitly permitted) to create custom DTDs based on the HTML5.1 DTDs provided here.
The Full HTML5.1 DTD
- doesn't include attribute default values and predefined HTML entities (character entity references), as explained in attribute defaults;
- is designed to be used with the restrictive variant of the SGML declaration for HTML5.1;
- is essentially already described in detail in the previous reference for W3C HTML5, and has only been brought up to date with W3C HTML 5.1.
The Minimal HTML5.1 DTD, as opposed to the Full DTD, is designed to
be used with the permissive variant of the
SGML Declaration for HTML5, which allows undeclared elements.
It can only be used
with SGML systems supporting WebSGML/ISO 8879 Annex
additions; specifically, it makes use of WebSGML's
IMPLYDEF ELEMENT ANYOTHER feature
(to be able to infer omitted end tags on p, li,
and other elements), and of IMPLYDEF ATTLIST YES
(allowing undeclared attributes to be used).
The Minimal HTML5.1 DTD is an extract of the
Full HTML5.1 DTD, edited to make
use of IMPLYDEF ELEMENT ANYOTHER
and other WebSGML features.
IMPLYDEF ELEMENT ANYOTHER is an SGML declaration property
that, like IMPLYDEF ELEMENT YES, allows undeclared
elements to occur in document instances. If an undeclared
element x is encountered in a document, it will be treated
as if it were declared <!ELEMENT x - O ANY>, which means
that any element or character data is permitted as child content
of x and, moreover, that x's end-element tag can be omitted.
In regular SGML, end-element tag omission is only considered if either
Declaring an end-tag omission indicator (the letter O in
the declaration) can't have consequences for the latter two cases
if neither a content model nor content exclusion exceptions
have been declared on the respective element. WebSGML's implied
default declaration for elements, <!ELEMENT x - O ANY>,
has neither; however, WebSGML's IMPLYDEF ELEMENT ANYOTHER
feature, when activated, will treat an undeclared element as
completed, and infer its end-element tag (if missing in content), if
the element is immediately followed by a start-element tag
for the same element.
For example, consider end-element tag omission for the p element
as used in HTML:
<p>This is the first paragraph.
<p>This is the second.
SGML (when IMPLYDEF ELEMENT ANYOTHER is active and
no declaration for the p element is present) will parse
this as if
<p>This is the first paragraph.</p>
<p>This is the second.
had been specified; i.e., SGML will infer the </p>
end-element tag upon seeing the <p> start-element
tag for the second paragraph.
Moreover, when put in a context where paragraph elements
are usually expected in HTML, the omitted </p>
end-element tag of the second paragraph (along with the tags of
additional missing elements) is inferred as well.
For example, putting the two paragraphs into a text
file, and (optionally) adding a <title> element as follows
<title>Tag omission in HTML paragraphs
<p>This is the first paragraph.
<p>This is the second.
then parsing it using either the Full or the Minimal DTD for HTML5.1 gives the same result as if the following had been specified:
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
</head>
<body>
<p>This is the first paragraph.</p>
<p>This is the second.</p>
</body>
</html>
The html, head, and body tags are inferred based
on the following DTD declarations for these well-known elements:
<!ENTITY % metadata "base|link|meta|noscript|script|style|template|title">
<!ENTITY % scripting "script|template">
<!ELEMENT html O O (head,body) +(%scripting)>
<!ELEMENT head O O (%metadata;)*>
<!ELEMENT body O O ANY>
Note that all of these element declarations, by using O O
(double capital letter O for "omission") as tag omission indicator,
declare the respective element to admit both start- and end-tag
omission.
SGML will
- create an html element if it isn't there, knowing that an html document element must be the first content element in an HTML file;
- infer the head element if it isn't there, since the content model requires it at the start of html's content;
- place the title element as child content of head, since it's allowed/expected to occur there according to head's model group expression (formed by substituting %metadata; by the base|link...|title string also used in the Full HTML5.1 DTD);
- infer the end-element tags for title and for the head element (since the p element following the title isn't allowed to occur in those);
- infer the body element (if it isn't there), since it's required to follow the head element;
- place the p elements or other flow content as child content of body;
- finally, infer the end-element tags for p, body, and html at the end of the document.
Tag omission in ul and ol elements is based on the
following DTD declarations:
<!ELEMENT ul - - (li)* +(%scripting)>
<!ELEMENT ol - - (li)* +(%scripting)>
The ul and ol elements themselves don't admit tag omission.
But the li element, not being declared at all, can use
IMPLYDEF ELEMENT ANYOTHER based end-element tag inference,
analogous to the p element example above.
For example, the following HTML fragment
<ul>
<li>A list item
<li>Another list item
</ul>
is parsed as if the end-element tags of the li elements had
been specified:
<ul>
<li>A list item</li>
<li>Another list item</li>
</ul>
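As a further sketch, and assuming the ul declaration shown above together with an undeclared li, the same inference should also cope with nested lists, because only an li start-element tag occurring directly in the content of an open li ends that element. The list content here is made up for illustration:

```sgml
<ul>
<li>Fruit
  <ul>
  <li>Apple
  <li>Pear
  </ul>
<li>Vegetables
</ul>
```

Here the Apple item is ended by the start-element tag of the Pear item, the Pear item by the </ul> end-element tag of the nested list, and the Fruit item only by the start-element tag of the Vegetables item, so the nested list stays inside the Fruit item.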
Definition lists are declared as follows (using the same declaration as the Full DTD)
<!ELEMENT dl - - (dt+,dd+)* +(%scripting)>
<!ELEMENT dt - O ANY -(dt,dd)>
A declaration for dd is absent, meaning dd
uses the end-tag omission afforded by IMPLYDEF ELEMENT ANYOTHER.
Basic tag omission in definition lists works as follows:
<dl>
<dt>Term 1
<dd>Definition 1
<dd>Definition 1.2
</dl>
Like other lists and block-level elements, definition lists
must be started with an explicit start-element tag; HTML
also requires an explicit start-element tag for dl.
In the example above,
- the dt element is terminated as soon as the dd start-element tag is encountered, because dd appears as an excluded element in dt's content exceptions;
- the first dd element is terminated by the subsequent dd element, due to the end-tag inference afforded by IMPLYDEF ELEMENT ANYOTHER;
- the second dd element is terminated along with the </dl> end-element tag.
The following example illustrates a basic difference between
the Minimal DTD and the Full DTD with respect to
dd end-tag omission:
<!-- explicit dd end-element tag to stop dl nesting -->
<dl>
<dt>Term 1
<dd>Definition 1
<dd>Definition 1.2</dd>
<dt>Term 2
<dd>Definition 2
</dl>
The example starts with the same sequence of markup
events as before; the second dd element must be
explicitly closed when using the Minimal HTML5.1
DTD (whereas its end-element tag can be omitted when using
the Full HTML5.1 DTD).
This is because, by default in HTML, definitions
(dd elements) may contain nested definition lists
(and thus dt elements); hence, the mere occurrence of dt
anywhere in dd content can't be used as a signal
to end dd elements.
The Full DTD, on the other hand, can infer
dd's end-element tag because it has knowledge of all
HTML elements that can appear directly as content of
dd (so the SGML parser can terminate dd when it
sees dt).
If, in the above fragment, the </dd> end-element tag
had been omitted, then parsing using the Minimal DTD
would result in a nested dt/dd sequence as
the child content of the second dd element, as
follows:
<dl>
<dt>Term 1</dt>
<dd>Definition 1</dd>
<dd>Definition 1.2
<dt>Term 2</dt>
<dd>Definition 2</dd>
</dd>
</dl>
More often than not, in a context where tag omission
is used in authoring, definition list nesting is probably
undesired. The way to force a dt element to end a dd element
(in the Minimal DTD, which doesn't "know" all HTML
elements) is to disallow dt as child content of dd.
While this assumption is not for the Minimal
DTD to make in general, it can be easily achieved using
a declaration in the document's internal subset such as
<!DOCTYPE html SYSTEM ".. URL of Minimal HTML5.1 DTD ..." [
<!ELEMENT dd - O ANY -(dd|dt)>
]>
This will declare the dd element to SGML, and hence stop
IMPLYDEF ELEMENT ANYOTHER inference of dd end-element tags.
To compensate for it, dd is excluded from dd's own content; moreover,
dt is excluded as well, which has the effect that dd is automatically
closed when a dt (or dd) element is encountered.
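As a sketch of how such a document might look as a whole, the following combines the internal subset declaration with the earlier definition list, this time without any dd end-element tags. It assumes the about:legacy-compat system identifier, which sgmljs.net resolves to the Minimal HTML5.1 DTD; other SGML systems would need their own system identifier or catalog entry. The title text is made up for illustration:

```sgml
<!DOCTYPE html SYSTEM "about:legacy-compat" [
  <!-- declared here because the Minimal DTD leaves dd undeclared -->
  <!ELEMENT dd - O ANY -(dd|dt)>
]>
<title>Definition list without dd end-element tags</title>
<dl>
<dt>Term 1
<dd>Definition 1
<dd>Definition 1.2
<dt>Term 2
<dd>Definition 2
</dl>
```

With this declaration in place, the Term 2 start-element tag should end the preceding dd element, so that both terms end up as direct children of the same dl element rather than nested.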
While the HTML5.1 DTDs don't make use of it, SGML
also supports implementing start-element tag omission on
the dt element, allowing even shorter forms of writing
definition lists such as
<!-- not supported with the HTML5.1 DTDs provided here -->
<dl>
Term
<dd>Definition
</dl>
The table-related elements are declared as follows in the Minimal HTML5.1 DTD
<!ELEMENT table - - (caption?,colgroup*,thead?,(tbody*|tr+),tfoot?) +(%scripting;)>
<!ELEMENT thead - O (tr*) +(%scripting;)>
<!ELEMENT tbody O O (tr*) +(%scripting;)>
<!ELEMENT tfoot - O (tr*) +(%scripting;)>
<!ELEMENT tr - O (td|th)* +(%scripting;)>
<!ELEMENT th - O ANY -(th|td|tr)>
Similar to the limitations with respect to dd end-element
tag omission explained before, the declarations above restrict the
content of th but not of td elements (which are allowed to contain
nested tables according to the HTML specification); hence </td>
end-element tags must be placed before <tr> start-element tags
starting new table rows:
<table>
<tr>
<th>table-head 1
<th>table-head 2
<tr>
<th>table-head 1
<td>table-head 2</td>
<tr>
<td>table-data 1
<td>table-data 2
</table>
Again, the Full DTD doesn't have this limitation,
and it can be switched off in a document using the Minimal
DTD by using a custom declaration for the td element;
for example, if the internal subset contains the declaration
<!ELEMENT td - O ANY -(table|th|td|tr)>
then the otherwise required </td> end-element tag
can be omitted (at the expense of disallowing nested tables;
avoiding those, however, is usually recommended practice anyway).
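A minimal sketch of a complete document using this technique could look as follows. It assumes the about:legacy-compat system identifier resolved by sgmljs.net for the Minimal HTML5.1 DTD, and the table content is made up for illustration:

```sgml
<!DOCTYPE html SYSTEM "about:legacy-compat" [
  <!-- td is undeclared in the Minimal DTD, so it can be declared here -->
  <!ELEMENT td - O ANY -(table|th|td|tr)>
]>
<title>Table without td end-element tags</title>
<table>
<tr>
<th>Name
<th>Value
<tr>
<td>width
<td>100
<tr>
<td>height
<td>50
</table>
```

Since tr is now excluded from td content, each <tr> start-element tag should end the open td element and, in turn, the open tr element, without any explicit end-element tags.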
Aggressive use of tag omission in table content is discouraged; for more info on table models, see also the section on table content representation in the Full DTD.
Apart from declarations necessary to drive tag omission
of the html, head, dl, ul, ol, table, and thead elements
and some of their immediate child content elements
as explained before, the Minimal HTML5.1 DTD contains
- declarations for the div, span, and section elements, to switch off end-tag omission due to IMPLYDEF ELEMENT ANYOTHER on these;
- declarations of the script and style elements with CDATA declared content (the same declarations as used in the Full HTML5.1 DTD);
- of the remaining elements, only the declarations for elements with declared content EMPTY in the Full DTD, i.e. the HTML void elements base, link, meta, hr, br, wbr, img, param, source, track, area, col, input, keygen, and menuitem.
Only attribute declarations with "unusual" parsing rules
are included in the Minimal HTML5.1 DTD; these
are HTML5's Boolean attributes and other enumerated attributes.
Other attributes in content are permitted due to IMPLYDEF ATTLIST YES.
Specifically, the hidden and the lang global attributes
are declared on every element using the declaration
<!ATTLIST #ALL hidden (hidden) #IMPLIED lang NMTOKEN #IMPLIED>
along with a couple of attribute declarations for the enumerated attributes of HTML, declared on individual elements.
Note that element declarations for the elements on which enumerated attributes can occur aren't necessarily included in the Minimal DTD (ie. only insofar as necessary for other purposes).
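To illustrate what "unusual" parsing means for a Boolean attribute, the following sketch relies only on the hidden declaration shown above. Because hidden is declared with the name token group (hidden), SGML can work out the attribute name from the bare value, provided the SGML declaration in use enables attribute name omission (assumed here). The about:legacy-compat system identifier is the one resolved by sgmljs.net, and the text content is made up:

```sgml
<!DOCTYPE html SYSTEM "about:legacy-compat">
<title>Boolean attribute shorthand</title>
<!-- only the value is written; the attribute name is found via the declared token group -->
<p hidden>Not rendered.
<!-- the equivalent long form -->
<p hidden="hidden">Not rendered either.
```

Both paragraphs should end up with the same hidden attribute value; without a declaration, the shorthand form in the first paragraph could not be resolved to an attribute name.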