SGML

HTML5 DTD Reference

Overview

The Restrictive HTML5.1 DTD is a transcription of WHATWG's HTML specification prose into an SGML DTD. It follows WHATWG snapshots as published by W3C (WHATWG itself doesn't publish stable snapshots of its specifications). The restrictive DTD covers all elements of HTML, SVG, MathML, and the ARIA attributes.

The Permissive HTML5.1 DTD is a compact DTD based on an analysis of the Restrictive DTD, containing the essential parsing rules for HTML. As only HTML's special elements and attributes are included (others being admitted freely), the permissive HTML5.1 DTD's usefulness for validation purposes is limited; instead, the goal of the permissive HTML5.1 DTD is to provide a well-motivated, compact, and forward-compatible SGML DTD for use and deployment in software such as SGML User Agent for in-browser SGML support.

Casual readers will most likely be interested in the permissive DTD; its introductory text also contains an easy-to-follow introduction to tag omission and other forms of shorthand markup in HTML and SGML.

These DTDs are primarily useful for checking/validating and normalizing HTML. In SGML applications, it's common (and the point of using SGML in the first place) to define custom DTDs containing application-specific grammar and processing rules, including for generic HTML applications such as outlining, metadata extraction, search result formatting, paging, templating, etc. It is expected (and explicitly permitted) to create custom DTDs based on the HTML5.1 DTDs provided here.

The Restrictive HTML5.1 DTD

  • is a straightforward translation of the specification text for HTML's content model and tag omission rules into DTD grammar rules; the specification text is included as SGML comment along with the translated DTD rule for reference; where HTML content model rules are represented incompletely, a note is included in the SGML comment for the declaration as well
  • doesn't include attribute default values and predefined HTML entities (character entity references), as explained in Attribute defaults
  • is designed to be used with the restrictive variant of the SGML declaration for HTML5.

The Permissive HTML5.1 DTD

  • only contains declarations needed for parsing HTML void elements, omitted attribute names, commonly used unquoted attribute values, and omitted start- and end-element tags allowed by HTML
  • as opposed to the restrictive DTD, is designed to be used with the permissive variant of the SGML declaration for HTML5, which allows undeclared elements, and can only be used with SGML systems supporting the WebSGML (ISO 8879 Annex K) additions; specifically, it makes use of WebSGML's IMPLYDEF ELEMENT ANYOTHER feature (to be able to infer omitted end tags on p, li, and other elements), and of IMPLYDEF ATTLIST YES (allowing undeclared attributes to be used)

The Restrictive HTML5 DTD

WHATWG/W3C's HTML5 specification document states that

XML DTDs cannot express all the conformance requirements of this specification. Therefore, a validating XML processor and a DTD cannot constitute a conformance checker. Also, since neither of the two authoring formats defined in this specification are applications of SGML, a validating SGML system cannot constitute a conformance checker either

but doesn't provide examples where SGML can't specifically be used for checking. For XML DTDs and other XML-based schema languages it's easy enough to conclude these can't describe HTML for their lack of a way to express empty elements, omitted tags, omitted attribute names, and unquoted attributes, among other things.

For SGML, on the other hand, it's less obvious, hence this text discusses parsing and validation issues of modern HTML using SGML in depth.

Flow and phrasing content

The HTML5.1 specification text introduces the elements of HTML using a taxonomic approach and presents a classification and accompanying Venn diagram depicting inclusion relationships between HTML element categories derived from definitions contained in earlier HTML DTDs.

It is felt that the fundamental grammatical construction of the HTML vocabulary as inline content wrapped in an optional layer of block-level content isn't quite apparent in this definition. Earlier HTML DTDs included the following definition for flow capturing this in a rather straightforward way:

<!ENTITY % flow "%block; | %inline;">

While HTML5 lacks it, a definition for "block content" can be recovered by removing from the "flow content" category those elements that are also in the "phrasing content" (inline) category.

Nesting of phrasing into flow elements is about the most basic property of the HTML grammar. In SGML, the flow/phrasing hierarchy is expressed by declaring the content model of flow elements as allowing %phrasing;, where %phrasing;, as in earlier HTML DTDs, expands to a string such as a|abbr|... containing all phrasing elements as a name group.

For example, the element declaration for the p element is as follows:

<!ELEMENT p - O (#PCDATA|%phrasing;)* -(%flow_only;)>

meaning that

  • the content of p elements can be any sequence of text content (#PCDATA) and phrasing elements, and

  • flow content isn't just forbidden in direct child content of p (via admitting %phrasing; elements only), but also isn't admitted anywhere in descendant content of p, as expressed by the -(%flow_only;) SGML exclusion exception, and

  • the end-element tag, but not the start-element tag, of p can be omitted, as declared in p's tag omission indicator - O (see Tag Omission)

where the flow_only parameter entity contains block-level elements as described above, ie. flow elements that aren't also phrasing elements.
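The subtraction that yields flow_only can be sketched in a few lines of Python. The element sets below are small hand-picked subsets chosen for illustration (an assumption, not the actual entity text of the DTD), and the helper names name_group and entity_declaration are hypothetical.

```python
# Illustrative subsets only (assumption); the real categories are much larger.
FLOW = {"a", "abbr", "em", "span", "p", "div", "ul", "table", "blockquote"}
PHRASING = {"a", "abbr", "em", "span"}

# "Block-level" (flow-only) elements: in flow, but not also in phrasing.
FLOW_ONLY = FLOW - PHRASING


def name_group(names):
    """Render a set of element names as the body of an SGML name group."""
    return "|".join(sorted(names))


def entity_declaration(entity, names):
    """Build a parameter entity declaration holding the name group."""
    return '<!ENTITY % {} "{}">'.format(entity, name_group(names))


print(entity_declaration("phrasing", PHRASING))
print(entity_declaration("flow_only", FLOW_ONLY))
```

With these subsets, the second line printed declares flow_only as blockquote|div|p|table|ul, which is the kind of name group used in the exclusion exception on p.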

Body content

In older HTML DTDs, formally only block-level elements could appear directly in an HTML document body; phrasing content had to be wrapped into at least a paragraph (or generic block-level div) container element.

However, browsers never inferred block-level elements when they were missing in content (or made their presence visible in the DOM). Essentially, this constraint was never enforced.

The HTML5.1 grammar follows actual browser behaviour, in that any flow content, including phrasing content, is formally accepted as direct child of the body element.

Tag omission

The HTML5 specification lists tag omission rules for each applicable element (or element combination) individually. For example, the specification text for the p element reads

A p element's end tag may be omitted if the p element is immediately followed by an address, article, aside, blockquote, details, div, dl, fieldset, figcaption, figure, footer, form, h1, h2, h3, h4, h5, h6, header, hr, main, menu, nav, ol, p, pre, section, table, or ul, element, or if there is no more content in the parent element and the parent element is an HTML element that is not an a, audio, del, ins, map, noscript, or video element.

For this HTML5 DTD, the text description is transcribed into SGML tag omission indicators in the most straightforward way, based on whether start-element tag, end-element tag, or both start- and end-element tag omission is allowed at all, avoiding verbose enumeration of specific elements.

For example, the tag omission rules for p are represented using this simple SGML element declaration:

<!ELEMENT p - O (#PCDATA|%phrasing;)*>

where the - (hyphen/minus, meaning "not omissible") and the O (letter O, meaning "omissible") prescribe p's start-element and end-element tag omission behaviour, respectively, and %phrasing; expands to the string a|abbr|... containing HTML's phrasing elements.

This element declaration will make an SGML parser end a paragraph element if encountering any element not in the phrasing category, or an end-element tag which isn't balanced within p's content, thereby capturing HTML's parsing rules.
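The effect of the - O indicator on p can be sketched as follows. This is a deliberately minimal illustration, not an SGML parser: it works on a flat list of tokens, uses a small illustrative subset of phrasing elements (an assumption), and only infers the end tag when p is the innermost open element. The function name infer_p_end_tags is hypothetical.

```python
PHRASING = {"a", "abbr", "em", "span"}  # illustrative subset (assumption)


def infer_p_end_tags(tokens):
    """Insert implied </p> tags into a flat token stream.

    Tokens are strings: "<name>" is a start tag, "</name>" an end tag,
    anything else is text. Only p has an omissible end tag in this sketch.
    """
    out = []
    stack = []  # names of currently open elements, innermost last
    for tok in tokens:
        if tok.startswith("</"):
            name = tok[2:-1]
            # An end tag not balanced within p's content ends the paragraph.
            if stack and stack[-1] == "p" and name != "p":
                out.append("</p>")
                stack.pop()
            if stack and stack[-1] == name:
                stack.pop()
            out.append(tok)
        elif tok.startswith("<"):
            name = tok[1:-1]
            # A start tag outside the phrasing category ends the paragraph.
            if stack and stack[-1] == "p" and name not in PHRASING:
                out.append("</p>")
                stack.pop()
            stack.append(name)
            out.append(tok)
        else:
            out.append(tok)
    return out


tokens = ["<div>", "<p>", "One", "<p>", "Two", "<em>", "x", "</em>", "</div>"]
print(infer_p_end_tags(tokens))
```

In this input, the first paragraph is closed by the second p start tag (p is not a phrasing element), and the second paragraph is closed by the div end tag.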

As presented in the HTML5 specification, the choice of explicitly enumerated elements that cause the paragraph element to be terminated may seem arbitrary, but is in fact (up to potential minor omissions considered errors) the set of HTML elements that are in the "flow" category without also being in the "phrasing" category. This isn't surprising, since HTML parsing rules were originally specified using SGML grammars such as the above. By recovering HTML's original parsing rules from HTML5's specification text, we conclude that HTML5's parsing rules are represented adequately, and more succinctly, because the redundant enumeration of p-terminating elements is avoided.

This interpretation is also supported by the fact that for the HTML5.1 specification update (vs. HTML5), the new details, figcaption, figure and menu elements (which are flow-only elements) but not the new picture element (which can also be used in phrasing content) have been added to the set of p-terminating elements.

End-element tag omission

End-element tag omission is commonly used in HTML in the following situations:

  • on list items when directly followed by other list items or when followed by ul or ol end-element tags

  • on definition list terms or definitions, when followed by other terms or definitions, or when followed by a dl end-element tag

  • on paragraphs, when followed by other paragraphs or by a parent element's end-element tag
  • in head content, upon encountering an element that can't be placed into head element content

  • at the end of a document, for html, body, and other elements (these elements also allow start-element tag omission, as discussed in the context of the permissive DTD in Tag omission on document-level elements; the rules provided there fully apply to the restrictive DTD as well)

All of these uses are parsed/validated by this HTML5 DTD in the expected way and in the same way as W3C's HTML5 validation software (as far as can be told).

Start-element tag omission

Start-element tag omission (other than in table content) is only allowed on the html, head, and body elements, which is trivially supported by this HTML5.1 DTD in the expected way; see Tag omission on document-level elements.

Start- and end-element tag omission in table content

Tag omission in table content deserves a closer examination.

The relevant specification text reads as follows:

The table element

Content: In this order: optionally a caption element, followed by zero or more colgroup elements, followed optionally by a thead element, followed by either zero or more tbody elements or one or more tr elements, followed optionally by a tfoot element, optionally intermixed with one or more script-supporting elements.

Tag omission: Neither tag is omissible.

The thead element

Content: Zero or more tr and script-supporting elements

Tag omission: A thead element's end tag may be omitted if the thead element is immediately followed by a tbody or tfoot element.

The tfoot element

Content: Zero or more tr and script-supporting elements

Tag omission: A or element's end tag may be omitted if the tbody element is immediately followed by a tbody element.

The tbody element

Content: Zero or more tr and script-supporting elements.

Tag omission: A tbody element's start tag may be omitted if the first thing inside the tbody element is a tr element, and if the element is not immediately preceded by a tbody, thead, or tfoot element whose end tag has been omitted. (It can't be omitted if the element is empty.) A tbody element's end tag may be omitted if the tbody element is immediately followed by a tbody or tfoot element, or if there is no more content in the parent element.

According to the specification text, the following HTML fragment (representing typical use of tag omission in table content) is allowed, and is accepted both by this HTML5 DTD and by W3C's HTML5 validation software:

<table>
	<thead>
		<tr>ONE
		<tr>TWO
		<tr>THREE
	<tbody>
		<tr>One...
		<tr>Two...
		<tr>Three...
</table>

However, the specification text for tbody also admits omission of a tbody start-element tag if the first thing inside the tbody element is a tr element, and if the element is not immediately preceded by a tbody, thead, or tfoot element whose end tag has been omitted.

The relevant part of the content model for tbody's parent element, table, admits either zero or more tbody elements, or one or more tr elements.

Hence, table content such as the following

<table>
	<thead>
		<tr>ONE
		<tr>TWO
		<tr>THREE
	</thead>
	<tr>One...
	<tr>Two...
	<tr>Three...
</table>

is valid according to at least two rules in the HTML specification: the tr elements following thead can be parsed by HTML5's parsing rules either as tr elements being placed directly into table content, or as an instance of a tbody element with omitted start- and end-element tags.

This HTML5 DTD (as do Mozilla's validator.nu software and web browsers) interprets such content as an instance of the former case, and requires that a tbody start-element tag is present in content to force the latter interpretation. The ambiguous production rule for tbody, as stated in the HTML5.1 specification, can never apply in the absence of start-element tags for tbody.

Presumably, the ambiguity of the tag omission rules for table content is inadvertent; even the specification text itself (chapter 6.7.6) seems to use tag omission in table models incorrectly. The W3C HTML5.1 specification contains this fragment (the leading paragraph is included to help locate the passage in the document):

<p>The element <var>host element</var> to create for the media is the element given in
the table below in the second cell of the row whose first cell describes the media. The
appropriate attribute to set is the one given by the third cell in that same row.</p>
<table>
	<thead>
		<tr>
			<th> Type of media
			<th> Element for the media
			<th> Appropriate attribute
		<tr>
			<td> Image
			<td> <code>img</code>
			<td> <code>src</code>
		<tr>
			<td> Video
			<td> <code>video</code>
			<td> <code>src</code>
		<tr>
			<td> Audio
			<td> <code>audio</code>
			<td> <code>src</code>
</table>

Note that this table doesn't contain a body, which however isn't the point here. Going by how the table is rendered, and by analogy with other passages using tables which do have a tbody element specified, it can be concluded that what is probably intended here is that the first tr element should be treated as the major table heading row, while subsequent rows should be treated as table body rows.

The rule for tag omission of the thead element reads "A thead element's end tag may be omitted if the thead element is immediately followed by a tbody or tfoot element" (in both the W3C HTML5.1 and the current WHATWG specification text). Hence we expect the above fragment to be rejected, since the rule does not say that a thead end-element tag can be omitted if followed by a table end-element tag (whereas other parsing rules for end-element tag omission state this kind of condition explicitly).

However, the fragment is happily accepted by W3C's validation software, and hence slipped into the published specification text; the HTML5 DTD follows validator.nu here and accepts it as well.

In an attempt to interpret HTML5's informally stated syntax description, we note that a sentence such as "A thead element's end tag may be omitted if the thead element is immediately followed by a tbody or tfoot element" is inherently self-referential: the thead element's end isn't yet established while assessing its end-element tag omission status, hence whatever follows it in content isn't either (the definition of tbody's start-element tag omission stated earlier has a similar problem).

Further analysis of HTML5's expected behaviour requires a stated formal semantics for interpreting its syntax rules (standard semantics such as co-inductive/well-founded semantics can't be applied here without further qualification). Given the lack of such semantics, and that multiple (quite obvious) flaws were found in table content models on a cursory look already, and given mildly surprising results when using the reference validation software, further discussion of HTML5's table parsing rules seems hopeless, and isn't expected to contribute to a definition of an interoperable table content model.

Hence, while the HTML5 DTD behaves the same as the reference validation software, authors are advised to not rely on tag omission in table content beyond basic idiomatic usage as described.

The datalist element

The datalist element's content definition has changed from previous releases. Its specification text now reads

The datalist element

Content: Either: phrasing content. Or: Zero or more option and script-supporting elements.

and the mapping into an element declaration is as follows:

<!ELEMENT datalist - - ((#PCDATA|%phrasing;)*|(option|script)*)>

Note that only the script element, rather than any script-supporting element, is supported. The script-supporting elements in HTML5.1 include the template element. However, the template element is also phrasing content. When using %scripting; (which includes both the script and the template element, and is used as parameter entity reference elsewhere in the HTML5.1 DTD), the grammar for datalist becomes 1-ambiguous. This means that upon encountering a template element in a datalist parent element, the parser cannot decide which of the two branches declared in the choice submodels of datalist's grammar rule is to be selected for subsequent parsing. This is not permitted in SGML, and either disallowed or undesirable in other markup languages as well.
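The ambiguity can be made concrete by comparing which tokens may start each branch of the choice. The following Python sketch is a simplified first-set comparison, not a full check of ISO 8879 content model ambiguity; the phrasing set is a small illustrative subset and assumes, as the declaration above implies, that script itself is not part of the phrasing name group. The function name ambiguous_tokens is hypothetical.

```python
PHRASING = {"a", "abbr", "em", "span", "template"}  # illustrative subset (assumption)
SCRIPTING = {"script", "template"}  # script-supporting elements


def ambiguous_tokens(*branches):
    """Return tokens that could start more than one branch of an OR group."""
    seen = set()
    clashes = set()
    for first_set in branches:
        clashes |= seen & first_set
        seen |= first_set
    return clashes


# Branch 1: (#PCDATA|%phrasing;)*    Branch 2: (option|%scripting;)*
with_template = ambiguous_tokens({"#PCDATA"} | PHRASING, {"option"} | SCRIPTING)

# Branch 2 restricted to (option|script)*, as in the declaration above.
script_only = ambiguous_tokens({"#PCDATA"} | PHRASING, {"option", "script"})

print(sorted(with_template))
print(sorted(script_only))
```

Under these assumptions, only template can start both branches, and restricting the second branch to option and script leaves no clash.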

Semantically, it doesn't make sense to use template elements in datalist child content, hence the allowance of template is considered accidental (or a consequence of HTML5's grammar presentation which doesn't facilitate basic automated grammar checks).

Boolean Attributes

HTML5 lists the following as boolean attributes: reversed, ismap, typemustmatch, default, autoplay, muted, checked, readonly, required, multiple, disabled, selected, autofocus, novalidate, formnovalidate, hidden, lang, async, defer, and the truespeed attribute on the deprecated marquee element.

Note the paused attribute isn't a boolean attribute.

HTML5's boolean attributes are modelled as SGML attribute declarations having a singleton name group as declared attribute value, ie. an enumerated value where the name group contains only a single value.

For example, the selected (and disabled) attribute on HTML5's option element, according to the HTML5 specification, must be specified as eg. <option selected>, and the HTML5 DOM API is supposed to treat the selected attribute as either true or false. If a false value is desired, the selected attribute must be omitted in an attribute specification.

In SGML, this is modelled as

<!ATTLIST option selected (selected) #IMPLIED>

meaning the attribute name can be omitted.

According to the declaration, specifying

<option selected>

is equivalent to specifying

<option selected=selected>

or

<option selected="selected">.

Note that, formally, WebSGML (ISO 8879 Annex K) allows use of the same name token as enumerated value for multiple attribute declarations. In prior versions of SGML, the following wasn't valid:

<!ATTLIST x a (true|false) #IMPLIED>
<!ATTLIST y a (true|false) #IMPLIED>

because the name tokens true and false could only be used in a single attribute declaration; one had to declare:

<!ATTLIST (x|y) a (true|false) #IMPLIED>

At the same time, pre-Annex K SGML only allowed a single attribute list declaration for a given element.

WebSGML relaxes this constraint by allowing

  • declared attributes to be assembled from multiple ATTLIST declarations for the same element(s), and

  • enumerated attributes (token name groups) to contain the same token in different attribute declarations (including on the same element); note that if a token is declared on multiple elements, it cannot be used with omitted attribute name

However, (Open)SP doesn't seem to implement Annex K in this respect, and will reject multiple ATTLIST declarations on the same element and also multiple declarations for the same name token.

While sgmljs.net SGML doesn't have this restriction, for interoperability, the HTML 5 DTD generator outputs the boolean attributes inline along with other attributes on an element.

The contenteditable and spellcheck attributes

The contenteditable and spellcheck attributes are handled specially; these can have their values omitted in HTML5 but cannot be modelled in SGML like the "boolean" attributes, because they both use true/false as enumerated values, and thus can't be handled via SGML MINIMIZE ATTRIB OMITNAME (which requires that name tokens be unique among those declared for all attributes in a DTD, not just those declared on a given element, in order to make use of OMITNAME).
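The uniqueness requirement can be illustrated with a small Python sketch that resolves a bare name token to the attribute it stands for. The table of declarations is hypothetical and covers only the four attributes discussed here; the function names are likewise assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical enumerated attribute declarations: attribute -> allowed tokens.
DECLARED = {
    "selected": {"selected"},
    "disabled": {"disabled"},
    "contenteditable": {"true", "false"},
    "spellcheck": {"true", "false"},
}


def token_index(declared):
    """Map each name token to the set of attributes that declare it."""
    index = defaultdict(set)
    for attribute, tokens in declared.items():
        for token in tokens:
            index[token].add(attribute)
    return index


def resolve_bare_token(token, declared=DECLARED):
    """Return the attribute a bare token stands for, or None if unresolvable."""
    owners = token_index(declared).get(token, set())
    return next(iter(owners)) if len(owners) == 1 else None


print(resolve_bare_token("selected"))
print(resolve_bare_token("true"))
```

A bare selected resolves to the selected attribute, whereas a bare true cannot be resolved because both contenteditable and spellcheck declare it.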

Void elements

The HTML5 specification lists area, base, br, col, embed, hr, img, input, keygen, link, meta, param, source, track, and wbr as void elements.

Note that the HTML5 specification suggests that the (legacy) elements basefont, bgsound, and frame should also be treated as void, but these don't have declared content EMPTY in this HTML 5 DTD.

HTML5's void elements happen to coincide with those labelled as having the Empty content model in the section on individual elements in the specification text.

Void elements are expected to neither have child content nor an end-element tag, and are adequately modelled as SGML elements with declared content EMPTY.

In the HTML5 specification text, void elements are described as having "No end tag" (in addition to having "Nothing" as content); however, an element declared EMPTY in SGML usually isn't qualified with an end-tag omission indicator, since having declared content EMPTY isn't considered a tag minimization feature in SGML.

Self-closing elements

"Self-closing" elements are uses of HTML void elements which have a slash before the > (U+003E GREATER-THAN SIGN) character (ie. the TAGC delimiter in SGML) in the start-element tag, such that void elements appear as XML-style empty elements.

For example, in:

<link href="..." rel="stylesheet" type="text/css" />

the / (U+002F SOLIDUS) character is bogus.

This syntax was used in the past to make HTML processable using XML parsers, and its use is generally discouraged.

While tolerated (ignored) by HTML5 on void elements, in the HTML5 SGML DTD, self-closing elements are subject to the EMPTYNRM YES and other settings in the SGML declaration for HTML5 which synthesize HTML's parsing rules in this respect.

RAWTEXT and RCDATA

The child content of the style element is modelled as SGML CDATA declared content, meaning that any markup delimiters are ignored (up to the sequence of terminating characters as discussed in Script data).

Note that legacy HTML also might assume CDATA semantics with xmp, noembed, and noframes content.

The child content of the textarea element is modelled as RCDATA declared content, which behaves the same as CDATA, except that delimiters for entity references (SGML's ERO and REFC delimiters, ie. the U+0026 AMPERSAND and the U+003B SEMICOLON characters in the SGML reference concrete syntax, respectively) are recognized, and entity references formed by those are substituted by replacement text.

Note sgmljs.net SGML substitutes only internal parsed entities in an RCDATA context.

Script data

Departing from earlier DTDs for HTML, which declared the content of the script element as CDATA, this DTD for HTML5.1 models the child content of the script element using a (#PCDATA) content model for security reasons.

HTML's script element and its historic use have been known to be a problem since at least as early as 1996 (cf. Joe English's posts) and keep being problematic today.

According to the HTML5.1 specification,

  1. in the script data less-than sign state (which is the state reached in script element child data after having encountered an unescaped < (U+003C LESS-THAN SIGN) character),

  2. a / (U+002F SOLIDUS) character transitions the parser to the script data end tag open state,

  3. from which any ASCII letter transitions it to the script data end tag name state;

  4. in the script data end tag name state, an HTML5 parser is supposed to check the longest sequence of ASCII letters for a case-insensitive match of script (and finish the script element if this is the case).

For SGML, on the other hand, expected behaviour is to end CDATA or RCDATA on a "delimiter-in-context" ie. < (U+003C LESS-THAN SIGN), followed by / (U+002F SOLIDUS) followed by a name start character (ignoring other irrelevant delimiter-in-context cases here), irrespective of whether the generic identifier started by the name start character is actually script (or, more generally, the same that started CDATA/RCDATA), whereas HTML5 is supposed to treat character data looking like an end-element tag but not actually representing a </script> tag as part of the character content of script.

For example, the following HTML fragment

<script>
	document.innerHTML =
		"<html><head><title>Oops</title><body>Pwnd</body></html>"
</script>

will be parsed by HTML as a single script element with child content, but, provided the script element has been declared

<!ELEMENT script - - CDATA>

will be parsed by SGML as

  • the <script> start-element tag,

  • the text document.innerHTML = "<html><head><title>Oops,

  • the </title> end-element tag,

  • the <body> start-element tag,

  • the text Pwnd,

  • the </body> end-element tag,

  • the </html> end-element tag, and

  • the </script> end-element tag.

While a DTD using either CDATA or (#PCDATA) for the script element will reject this particular sequence of markup events (because script can't have its end-element tag omitted), in general, this behaviour is undesired since it could be used to mount script injection attacks.
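The difference between the two termination rules can be sketched with two regular expressions applied to the content following the script start-element tag. This is a simplification under stated assumptions: the SGML pattern treats only ASCII letters as name start characters and ignores other delimiter-in-context cases, and the HTML pattern ignores the escaped script data states entirely. The names SGML_END, HTML_END, and script_content are hypothetical.

```python
import re

# SGML: CDATA ends at "</" followed by a name start character (letters here).
SGML_END = re.compile(r"</[A-Za-z]")
# HTML5 (simplified): only "</script" followed by whitespace, "/" or ">".
HTML_END = re.compile(r"</script[\t\n\f\r />]", re.IGNORECASE)


def script_content(source, end_pattern):
    """Return the character data up to the first match of end_pattern."""
    match = end_pattern.search(source)
    return source if match is None else source[: match.start()]


# Everything after the <script> start-element tag of the fragment above.
SOURCE = (
    "\n\tdocument.innerHTML =\n"
    '\t\t"<html><head><title>Oops</title><body>Pwnd</body></html>"\n'
    "</script>"
)

sgml_view = script_content(SOURCE, SGML_END)
html_view = script_content(SOURCE, HTML_END)

print(repr(sgml_view))
print(repr(html_view))
```

Under the SGML rule the character data stops right before </title>, while under the simplified HTML rule it runs up to the actual </script> tag and therefore includes the entire string literal.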

As explained in the HTML 5 specification (Restrictions for contents of script element), there's an additional, rather unnecessary, twist in HTML5's dealing with script data, in that what looks like the start of an SGML comment (ie. the character sequence <!--) within script data will make the parser enter the script data escaped dash dash state, which is only exited on a subsequent --> character sequence, potentially parsing well beyond what looks like the regular end of the script element.

That is, SGML's <!-- and --> character sequences are recognized as JavaScript comment start and end sequences, respectively; presumably, this was an (ill-conceived) attempt in early JavaScript revisions to present uniform commenting syntax across HTML and JavaScript.

It is problematic since it is completely invisible to SGML. Needless to say, this style of script comments is an avoidable XSS attack vector in web pages.

SGML has never recognized comments in CDATA or RCDATA at all, hence this cannot be handled by SGML other than by using a regular (#PCDATA) content model.

Treating script content as (#PCDATA) can be inconvenient, since it requires that verbatim occurrences of the < (U+003C LESS-THAN SIGN) character be specified using the &lt; entity, or that all or parts of the child content be put into CDATA or RCDATA marked sections.

If this turns out to be a problem, the declaration for script can easily be changed to CDATA to re-establish former behaviour.

Security Considerations

For maximum security in applications handling user-provided content (eg. assumed to potentially contain malicious script), it is recommended that, in addition,

  • all SGML/HTML comments should be suppressed (not presented to the browser) when processing HTML
  • admissibility of the script element should be controlled in a custom DTD using exclusion exceptions

  • blocking control flow transfer to JavaScript injected into event handler attributes can be implemented for content descending from a given parent element by either

    • using #FIXED event handler attribute values, or

    • registering event handlers in "capturing" mode on the parent element, preventing further capturing/bubbling (using JavaScript).

Driving the latter technique from DTDs alone has limitations since DTD declarations apply globally, but can be performed adequately when using context-dependent LINK processing and/or SGML templating.

Using these techniques allows more granular control over where script is allowed in HTML content compared to Content Security Policy (CSP), which however isn't a finalized recommendation at this point. For example, CSP's policy of ignoring/not executing script elements in content can be expressed by placing an SGML exclusion exception on body, but could also be applied at a more granular level in arbitrary body child element(s), which isn't possible with CSP.

Arguably, by disabling event handler attributes in HTML altogether, Content Security Policy questions basic assumptions about locality and modularity of event handling in the JavaScript/HTML authoring model in such a way as to make it almost pointless.

It is particularly unrealistic/ineffective for current practices with syndicated and/or ad-driven web sites, including those that offload eg. discussion forums to third-party services (which happen to be the primary input vectors for user-contributed content).

From Chrome's Content Security Policy (CSP) page:

[Blocking inline script] does, however, require you to write your code with a clean separation between content and behavior (which you should of course do anyway, right?)

To which it must be answered that "separation of concerns" is certainly not a prime characteristic of the content-driven web, compared to composability of web content.

Foreign elements

SVG and MathML DTDs are conditionally included via the svg_conditional_inclusion and mathml_conditional_inclusion parameter entities.

The DTDs included will be accessed from the canonical system identifier URLs of their most recently published DTDs.

In the absence of an INCLUDE value for these parameters, the svg and math elements will be declared ANY.

Note that for inclusion of foreign XML vocabularies, EMPTYNRM YES should be specified in the SGML declaration to cater for XML-style empty elements (which are used extensively even in basic SVG documents).

Attribute defaults

This HTML5.1 DTD doesn't declare attribute defaults. Instead, it always declares #IMPLIED as default value.

Generally speaking, making subtle distinctions with respect to whether attribute (and other) defaults are specified with their default values in content explicitly as opposed to left unspecified is considered a bad practice, since whether an attribute is specified or implied isn't adequately represented in eg. DOM and similar APIs lacking attribute defaults and other type-related metadata. About the only applications in need of access to this kind of information are HTML authoring and developer tools.

However, the HTML5.1 specification recommends (specifically, for the ARIA attributes) to not specify their default values explicitly (ie. unless their actual value differs from the default).

For a similar reason, the preferred representation for HTML5 named character references is as predefined character entities, rather than entity sets.

Special attribute semantics

This section contains clarifications to the interpretation of certain attribute-related specification text passages. As it turns out, no additional grammar constraints are derived.

CONREF attribute semantics

In SGML, #CONREF attributes are used to control that elements should be treated like EMPTY elements on a case-by-case basis if the respective #CONREF attribute is specified.

As a mechanism for conditionally void elements, the HTML5.1 specification could be interpreted to mean that eg. the span attribute of the colgroup element should behave as #CONREF attribute, given the specification's wording:

(Content model of colgroup)

If the span attribute is present: Nothing

in combination with the fact that having "Nothing" as content is de facto used synonymously with being a void element throughout the specification text.

However, consistent with the HTML4 DTD, #CONREF attributes are not used in this HTML5 DTD.

There are a number of additional cases where #CONREF could be applied, and also some irregular cases where #CONREF cannot express HTML5's desired behaviour.

For example, what the specification text says can be interpreted to mean that the src attribute on script elements should be treated as a #CONREF attribute. But script must always be terminated using </script> tags explicitly, even if a src attribute is specified; this is also enforced by this HTML5 DTD.

CURRENT attribute semantics

According to the specification text, the title attribute, if it is omitted from an element

then it implies that the title attribute of the nearest ancestor HTML element with a title attribute set is also relevant to this element

This is considered a semantic rather than a syntactic rule of HTML, and isn't represented in the HTML5 DTD, in line with browser DOM APIs not handling title differently from other regular attributes.

In SGML, "inheriting" title and other attributes could roughly be modelled using #CURRENT default semantics (even though an unspecified #CURRENT attribute takes its value from any preceding use of that attribute in document order, rather than just from ancestor elements).

Custom data attributes

HTML5 reserves attributes starting with data- as private use attributes (meaning that those won't ever be used by any HTML attribute and will be preserved in a constructed DOM in web browsers).

Their special naming requirements cannot and need not be represented in the HTML5.1 DTD as such, since custom attributes can be declared in the internal subset or another declaration set.

If full validation is desired, data- attributes must be declared.

If full attribute validation isn't desired (IMPLYDEF ATTLIST YES is specified in the SGML declaration or otherwise), such as when only the permissive HTML DTD is used, a custom attribute doesn't have to be declared.

If a custom data attribute is declared, it should either be declared as having a CDATA value and have its value always quoted, or be declared otherwise as appropriate for how it's used in content (ie. with respect to omitting quotation characters).

HTML5 makes the restriction that XML naming rules for custom data apply: the name must not contain the : (U+003A COLON) character, and must otherwise represent a valid XML name, and must not begin with xml (case-insensitively). These rules aren't enforced by SGML.

ARIA attributes

Like SVG and MathML, ARIA attributes can optionally be included via the if_aria parameter entity.

Note the HTML5 DTD doesn't include element-specific declarations for ARIA attributes (ie. restrictions of the permitted role attribute values for individual elements).

Note also, like for other attributes, no attribute defaults are implied for the ARIA attributes (which is recommended practice by the ARIA and the HTML specifications).

Transparent content

Arguably, the most characteristic element of HTML is the anchor (a) element for hyperlinking. Apart from hyperlinks, the core HTML content models are merely variants of paragraph, flow, table, and other content models that were already in wide use for marking up documents for printing in the pre-WWW era.

In HTML5, the content model of the a element (and that of map, ins, and del) is specified as transparent, which means that a "inherits" (for lack of a better word) its parent content model: permitted child content is determined by the parent element's permitted child content (which can inherit its permitted content from its parent element, in turn, and so on).

HTML5's transparent content model concept is an artifact of adding the ability to annotate any piece of flow content as hyperlink, rather than just phrasing content as in previous HTML specifications. In practice, it is commonly used when eg. an image or icon and some accompanying text (and possibly some background) is hyperlinked to a common target using a single a element, rather than having to wrap the image, the text, and boilerplate content into a elements individually. Note that an extension for using href and other attributes of anchors on any HTML element (with the expectation that this makes those elements behave like hyperlinks, thereby effectively making the anchor element redundant) was already proposed around 2008. Arguably, had this been further pursued, the HTML vocabulary could have been made much simpler and more orthogonal, but it was rejected at the time on the grounds that browser vendors had already implemented the ability to place anchor tags around most HTML elements.

As applied in the HTML specification, transparent content just means that eg. an a element accepts either just phrasing or also flow content as child content, depending on whether it appears in a phrasing or flow context, respectively (and similarly with map, ins, and del). The concept of transparent content, however, doesn't extend to arbitrary elements, because it trivially conflicts with the content model descriptions of those elements into which elements having transparent content can be placed. Specifically, for any element which accepts an element having transparent content, it needs to be stated whether, and how, elements wrapped into children that allow transparent content should contribute to the parent's content model.

Another formulation for the constraints imposed by the a element's transparent content restriction is that an HTML document can be validated by removing all a start-element and end-element tags (but keeping child content of a elements), and validating the result document against a tight HTML5 grammar lacking an a element. In sgmljs.net SGML, this notion can be directly expressed using SGML LINK, ie. by declaring an explicit link process projecting a permissive variant of HTML as source markup into a restrictive HTML variant as result markup, and by declaring a link rule that maps all source elements to the same-named target elements, respectively, except for a and other HTML elements with transparent content.
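
The tag-removal formulation can be sketched outside SGML as a small preprocessing step. The following Python snippet is only an illustration under simplifying assumptions: the function name strip_transparent_tags is hypothetical, tags are matched with a regular expression rather than a real parser, and attribute values containing > are not handled. The output would then be handed to whatever validator checks the tight grammar lacking an a element.

```python
import re

def strip_transparent_tags(markup: str, names=("a",)) -> str:
    """Remove start and end tags of the named elements, keeping their content.

    Hypothetical helper for illustration; not part of any SGML toolkit.
    """
    alternatives = "|".join(re.escape(n) for n in names)
    # Matches <a>, <a href="...">, and </a>, but not <abbr> or <area>,
    # because the name must be followed by whitespace or the closing >.
    pattern = re.compile(rf"</?(?:{alternatives})(?:\s[^>]*)?>", re.IGNORECASE)
    return pattern.sub("", markup)

print(strip_transparent_tags('<p><a href="x">link</a> and <abbr>abbr</abbr></p>'))
```

Passing names=("a", "ins", "del", "map") would extend the same treatment to the other elements with transparent content.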

From a practical point of view, though, to facilitate HTML validation using mainstream SGML parsers (which don't support SGML LINK and/or don't perform validation and tag inference on result markup events of link processes), it might be desirable to express the effective content model restrictions imposed by transparent content using DTD declarations.

Fortunately, it can be easily shown that HTML's a element (and also HTML's other elements having transparent content) behave in a tame and modular way that doesn't interact with the content model into which an a element is placed:

  • Since a is a member of the flow and phrasing element categories, and the content model declarations of HTML only ever use a elements as part of flow or phrasing content rather than as an a element in isolation, and the flow content and phrasing content productions are interpreted as "any sequence of the respective elements, or the empty sequence", a can only be used as an optional content token. Hence it can't be put into a content position in such a way that it changes the interpretation of content model tokens with respect to validation and tag inference.

  • Since the flow and phrasing categories (with the exception of the p element which is covered below) only contain elements which have either void content (ie. have declared content EMPTY in SGML parlance), or don't admit tag omission, flow and phrasing content (up to p elements) is always fully-tagged markup without omitted tags. Hence, markup wrapped into a elements can't alter the interpretation of neighbouring content (and the effect of omitting a p end-element tag within an a element, even if it were allowed, can't interact with neighbouring content either, since the a element doesn't admit tag omission).

  • The p element is the only element in flow content that admits end-tag omission, and hence could be seen to interact with the placement of an a element in a non-modular fashion. The HTML5 specification addresses this problem specifically in the content model description for the p element (by eg. disallowing p end-tag omission in child content of a).

The "transparent content" constraint is inherent in the fundamental construction of the HTML vocabulary as phrasing (inline) content wrapped in flow (block-level) content, and already sufficiently represented in this HTML5 DTD via exclusion exceptions as discussed above.

The Permissive HTML5 DTD

The permissive HTML5 DTD is an extract of the restrictive DTD, edited to make use of WebSGML's IMPLYDEF ELEMENT ANYOTHER and other features.

IMPLYDEF ELEMENT ANYOTHER is an SGML declaration property allowing (like IMPLYDEF ELEMENT YES) undeclared elements to occur in document instances. If an undeclared element x is encountered in a document, it will be treated as if it were declared <!ELEMENT x - O ANY>, which means that any element or character data is permitted as child content of x, and moreover, that x's end-element tag can be omitted.

In regular SGML, end-element tag omission is only considered if either

  • a parent element's end-element tag is encountered, or
  • no more element content is expected within the parent (ie. because the declared content model of the parent element is completed at the context position already and doesn't admit optional elements at its end), or
  • an element is encountered that isn't allowed to occur (ie. because it is being excluded at the context position).

Declaring an end-tag omission indicator (the letter O in the declaration) can't have consequences for the latter two cases if neither a content model nor content exclusion exceptions have been declared on the respective element. WebSGML's implied default declaration for elements, <!ELEMENT x - O ANY>, has neither; however, WebSGML's IMPLYDEF ELEMENT ANYOTHER feature, when activated, will treat undeclared elements as completed and infer an end-element tag (if missing in content), if an element is immediately followed by a start-element tag for the same element.

Tag omission on paragraphs

For example, consider end-element tag omission for the p element as used in HTML:

<p>This is the first paragraph.
<p>This is the second.

SGML (when IMPLYDEF ELEMENT ANYOTHER is active and no declaration for the p element is present) will parse this as if

<p>This is the first paragraph.</p>
<p>This is the second.

had been specified, ie. SGML will infer the </p> end-element tag upon seeing the <p> start-element tag for the second paragraph.
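
The same-name inference rule can be mimicked with a toy model. The Python sketch below is not an SGML parser: the function name infer_end_tags is hypothetical, tags are assumed to carry no attributes, and the only rule modelled is "an open element is closed when a start-element tag with the same name follows directly inside it".

```python
import re

# A start tag, an end tag, or a run of text (attributes are not modelled).
TOKEN = re.compile(r"<(/?)([A-Za-z][\w-]*)>|([^<]+)")

def infer_end_tags(markup: str) -> str:
    """Toy model of same-name end-tag inference for undeclared elements."""
    out, stack = [], []
    for match in TOKEN.finditer(markup):
        closing, name, text = match.groups()
        if text is not None:
            out.append(text)
        elif closing:
            # Explicit end tag: close open elements up to and including it.
            if name in stack:
                while True:
                    top = stack.pop()
                    out.append(f"</{top}>")
                    if top == name:
                        break
        else:
            # Start tag: an open element of the same name is treated as completed.
            if stack and stack[-1] == name:
                out.append(f"</{stack.pop()}>")
            stack.append(name)
            out.append(f"<{name}>")
    return "".join(out)

print(infer_end_tags("<p>This is the first paragraph.\n<p>This is the second.\n"))
```

As in the parse described above, the second p element is left open in this model, since nothing follows it that would trigger the inference.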

Tag omission on document-level elements

Moreover, when put in a context where paragraph elements are usually expected in HTML, the second omitted </p> end-element tag (and additional missing elements) is inferred as well.

For example, putting the two paragraphs into a text file, and (optionally) adding a <title> element as follows

<title>Tag omission in HTML paragraphs
<p>This is the first paragraph.
<p>This is the second.

and then parsing it using either the restrictive or the permissive DTD for HTML5.1 yields the same result as if the following had been specified:

<html>
	<head>
		<title>Tag omission in HTML paragraphs</title>
	</head>
	<body>
		<p>This is the first paragraph.</p>
		<p>This is the second.</p>
	</body>
</html>

The html, head, and body tags are inferred based on the following DTD declarations for these well-known elements

<!ENTITY % metadata "base|link|meta|noscript|script|style|template|title">
<!ENTITY % scripting "script|template">
<!ELEMENT html O O (head,body) +(%scripting)>
<!ELEMENT head O O (%metadata;)*>
<!ELEMENT body O O ANY>

Note that all of these element declarations, by using O O (double capital letter O for "omission") as tag omission indicator, declare the respective element to admit both start- and end-tag omission.

SGML will

  1. create an html element if it isn't there, knowing that an html document element must be the first content element in an HTML file

  2. infer the head element if it isn't there, since the content model requires it at the start of html's content

  3. place the title element as child content of head, since it's allowed/expected to occur here according to head's model group expression (formed by substituting %metadata; by the base|link...|title string also used in the restrictive HTML5.1 DTD)

  4. infer the end-element tag for title and for the head element (since the p element following the title isn't allowed to occur in those)

  5. infer the body element (if it isn't there), since it's required to follow the head element

  6. finally, place the p elements or other flow content

  7. infer the end-element tags for p, body, and html at the end of the document.
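
The steps above can be approximated by a very small model that only looks at the names of the top-level elements found in the file. This is a sketch under strong assumptions: the function infer_document_structure is hypothetical, it ignores text, attributes, and the +(%scripting;) inclusion, and it simply assigns leading metadata elements to head and everything else to body.

```python
# Element names from the %metadata; parameter entity shown above.
METADATA = {"base", "link", "meta", "noscript", "script", "style", "template", "title"}

def infer_document_structure(top_level):
    """Distribute top-level element names over an inferred head and body."""
    head, body = [], []
    in_head = True
    for name in top_level:
        if in_head and name in METADATA:
            head.append(name)      # allowed by head's content model
        else:
            in_head = False        # first non-metadata element ends head
            body.append(name)      # body is declared ANY
    return ("html", [("head", head), ("body", body)])

print(infer_document_structure(["title", "p", "p"]))
```

Once the first non-metadata element has been seen, later metadata elements stay in body, mirroring the fact that head cannot be reopened after it has been closed.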

Tag omission in lists

Tag omission in ul and ol elements is based on the following DTD declarations:

<!ELEMENT ul - - (li)* +(%scripting)>
<!ELEMENT ol - - (li)* +(%scripting)>

The ul and ol elements themselves don't admit tag omission. But the li element, not being declared at all, can use IMPLYDEF ELEMENT ANYOTHER based end-element tag inference, analogous to the p element example above.

For example, the following HTML fragment

<ul>
<li>A list item
<li>Another list item
</ul>

is parsed as if the end-element tags of the li elements had been specified:

<ul>
<li>A list item</li>
<li>Another list item</li>
</ul>

Tag omission in definition lists

Definition lists are declared as follows (using the same declaration as the restrictive DTD)

<!ELEMENT dl - - (dt+,dd+)* +(%scripting)>
<!ELEMENT dt - O ANY -(dt,dd)>

A declaration for dd is absent, meaning dd is using end-tag omission afforded by IMPLYDEF ELEMENT ANYOTHER.

Basic tag omission in definition lists works as follows:

<dl>
<dt>Term 1
<dd>Definition 1
<dd>Definition 1.2
</dl>

Like other lists and block-level elements, a definition list must begin with an explicit start-element tag; HTML requires an explicit start-element tag for dl as well. In the example above:

  • the dt element is terminated as soon as the dd start-element tag is encountered, because dd appears as excluded element in dt's content exceptions

  • the first dd element is terminated by the subsequent dd element due to the end-tag inference afforded by IMPLYDEF ELEMENT ANYOTHER

  • the second dd element is terminated along with the </dl> end-element tag.

The following example illustrates a basic difference between the permissive DTD and the restrictive DTD with respect to dd end-tag omission:

<!-- explicit dd end-element tag to stop dl nesting -->
<dl>
<dt>Term 1
<dd>Definition 1
<dd>Definition 1.2</dd>
<dt>Term 2
<dd>Definition 2
</dl>

The example starts with the same sequence of markup events as before; the second dd element must be explicitly closed when using the permissive HTML5.1 DTD (but its end-element tag can be omitted when using the restrictive HTML5.1 DTD).

This is because, by default in HTML, definitions (dd elements) may contain nested definition lists (and thus dt elements); hence, mere occurrence of dt anywhere in dd content can't be used as a signal to end dd elements.

The restrictive DTD, on the other hand, can infer dd's end-element tag because it has knowledge of all HTML elements that can appear directly as content of dd (so the SGML parser can terminate dd when it sees dt).

If, in the above fragment, the </dd> end-element tag had been omitted, then parsing using the permissive DTD would result in a nested dt/dd sequence as the child content of the second dd element as follows:

<dl>
	<dt>Term 1</dt>
	<dd>Definition 1</dd>
	<dd>Definition 1.2
		<dt>Term 2</dt>
		<dd>Definition 2</dd>
	</dd>
</dl>

More often than not in a context where tag omission is used in authoring, definition list nesting is probably undesired. The way to force a dt element to end a dd element (in the permissive DTD, which doesn't "know" all HTML elements) is to disallow dt as child content of dd. While this assumption is not for the permissive DTD to make in general, it can be easily achieved using a declaration in the document's internal subset such as

<!DOCTYPE html SYSTEM ".. URL of html51 permissive DTD ..." [
	<!ELEMENT dd - O ANY -(dd|dt)>
]>

This will declare the dd element to SGML, hence stop IMPLYDEF ELEMENT ANYOTHER inference of dd end-element tags. To compensate for it, dd is excluded; moreover, dt is excluded as well, which will have the effect that dd is automatically closed when a dt (or dd) element is encountered.
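
The effect of such exclusion exceptions can be illustrated with a toy model working on a list of tag events. The sketch below makes several assumptions: infer_by_exclusion is a hypothetical name, only the innermost open element's exclusions are consulted (a real SGML parser considers exclusions of all open elements), and every element is assumed to admit end-tag omission.

```python
# Exclusions as declared for dt in the DTD and for dd in the internal subset.
EXCLUSIONS = {"dt": {"dt", "dd"}, "dd": {"dd", "dt"}}

def infer_by_exclusion(events, exclusions=EXCLUSIONS):
    """events: start tags as 'name', end tags as '/name'; returns all events
    with inferred end tags added."""
    out, stack = [], []
    for event in events:
        if event.startswith("/"):
            name = event[1:]
            # Explicit end tag: close open elements up to and including it.
            if name in stack:
                while True:
                    top = stack.pop()
                    out.append("/" + top)
                    if top == name:
                        break
        else:
            # Close the innermost element for as long as it excludes the new one.
            while stack and event in exclusions.get(stack[-1], ()):
                out.append("/" + stack.pop())
            stack.append(event)
            out.append(event)
    return out

print(infer_by_exclusion(["dl", "dt", "dd", "dd", "dt", "dd", "/dl"]))
```

With the dd entry removed from the exclusions table, the second dt would instead be pushed as a child of the open dd, which corresponds to the nesting behaviour of the permissive DTD without a custom declaration.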

While the HTML5.1 DTDs don't make use of it, SGML also supports implementing start-element tag omission on the dt element, allowing even shorter forms of writing definition lists such as

<!-- not supported with the HTML5.1 DTDs provided here -->
<dl>
Term
<dd>Definition
</dl>

Tag omission in tables

The table-related elements are declared as follows in the permissive HTML5.1 DTD

<!ELEMENT table - - (caption?,colgroup*,thead?,(tbody*|tr+),tfoot?) +(%scripting;)>
<!ELEMENT thead - O (tr*) +(%scripting;)>
<!ELEMENT tbody O O (tr*) +(%scripting;)>
<!ELEMENT tfoot - O (tr*) +(%scripting;)>
<!ELEMENT tr - O (td|th)* +(%scripting;)>
<!ELEMENT th - O ANY -(th|td|tr)>

Similar to the limitations with respect to dd end-element tag omission explained before, this declaration restricts th but not td elements (which are allowed to contain nested tables according to the HTML specification); hence </td> end-element tags must be placed before <tr> elements starting new table rows:

<table>
	<tr>
		<th>table-head 1
		<th>table-head 2
	<tr>
		<th>table-head 1
		<td>table-head 2</td>
	<tr>
		<td>table-data 1
		<td>table-data 2
</table>

Again, the restrictive DTD doesn't have this limitation, and it can be switched off in a document using the permissive DTD by using a custom declaration for the td element; for example, if the internal subset contains the declaration

<!ELEMENT td - O ANY -(table|th|td|tr)>

then the otherwise required </td> end-element tag can be omitted (at the expense of disallowing nested tables, which are usually best avoided anyway).

Aggressive use of tag omission in table content is discouraged; for more info on table models, see also the section on table content representation in the restrictive DTD.

Other element declarations

Apart from declarations necessary to drive tag omission of the html, head, dl, ul, ol, table, and thead and some of their immediate child content elements as explained before, the permissive HTML5.1 DTD contains

  • declarations for the div, span, and section elements to switch off end-tag omission due to IMPLYDEF ELEMENT ANYOTHER on these

  • declarations of the script and style elements with CDATA declared content (the same declarations as used in the full HTML5.1 DTD)

  • of the remaining elements, those element declarations for elements with declared content EMPTY in the restrictive DTD, ie. the HTML void elements base, link, meta, hr, br, wbr, img, param, source, track, area, col, input, keygen, menuitem

Attribute declarations

Only attribute declarations with "unusual" parsing rules are included in the permissive HTML5.1 DTD; these are HTML5's Boolean attributes and other enumerated attributes. Other attributes in content are permitted due to IMPLYDEF ATTLIST YES.

Specifically, the hidden and the lang global attributes are declared on every element using the declaration

<!ATTLIST #ALL hidden (hidden) #IMPLIED lang NMTOKEN #IMPLIED>

along with a couple of attribute declarations for the enumerated attributes of HTML, declared on individual elements.

Note that element declarations for the elements on which enumerated attributes can occur aren't necessarily included in the permissive DTD (ie. only insofar as necessary for other purposes).

The SGML declaration for HTML5

Character Set

While HTML5.1 can be transferred in any desired transport encoding, the HTML5.1 document character set (and that of prior versions of HTML) is tied to ISO 10646 ("Unicode") in that numeric character references are always interpreted as UCS code points, irrespective of the transport encoding; in practice, HTML is most often stored and transferred using the UTF-8 encoding.

The HTML5 specification itself normatively references http://www.unicode.org/versions/ rather than a particular version; in practical use on the web, the latest Unicode version is implicitly assumed, but only characters supported by targeted browsers and fonts are actually used.

Among the characters they map to, HTML5's named character references (covered below in depth) include variation sequences and eg. the U+205F MEDIUM MATHEMATICAL SPACE and U+22DA LESS-THAN EQUAL TO OR GREATER-THAN code points.

These features/code points were introduced with Unicode 3.2, corresponding to ISO/IEC 10646-2:2001 with amendments; practically, Unicode 4.0 (corresponding to ISO/IEC 10646:2003) should be considered the minimal UCS version suitable for HTML5.

SGML uses ISO/IEC 2022 code switching and the International register of coded character sets to be used with escape sequences as the principal means to inform the parser about the character set coding of a document.

ISO/IEC 2022 identifies the UTF-8 coding system using the ESC % G (ie. ESC 2/5 4/7) designating sequence or, alternatively, ESC 2/5 2/15 4/7, ESC 2/5 2/15 4/8, or ESC 2/5 2/15 4/9 for the respective UCS implementation levels.

Therefore, the preferred SGML document character set for HTML5 is

ISO Registration Number 192//CHARSET
ISO/IEC 10646:2003 UTF-8 Level 3//ESC 2/5 2/15 4/9

(or a variant for a newer UCS version, for identifying implementation levels, or both).

However, (Open)SP SGML doesn't accept this designating sequence for UTF-8. According to OpenSP - Character sets, to use UTF-8 with (Open)SP it's necessary to tell the parser to use a bit combination transformation format by setting the SP_BCTF environment variable.

bctf stands for bit combination transformation format and is a concept introduced with HyTime 2nd. Ed. General Facilities - FSIDR, building on ISO 8879's clause E 3.1 - Code Extension Facilities.

Thus, for interoperability, the following document character set is used in the SGML declaration for HTML5.1:

ISO Registration Number 177//CHARSET
ISO/IEC 10646:2003 UCS with implementation Level 3//ESC 2/5 2/15 4/6
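
For readers unfamiliar with the column/row notation used in these designating sequences, each pair c/r denotes the byte 16*c + r, and ESC is the byte 0x1B. The following Python sketch (the helper name designator_bytes is hypothetical) converts the notation into the actual byte sequence:

```python
def designator_bytes(notation: str) -> bytes:
    """Convert e.g. 'ESC 2/5 2/15 4/6' into the bytes it denotes."""
    result = bytearray()
    for part in notation.split():
        if part == "ESC":
            result.append(0x1B)
        else:
            # Column/row pair of the code table: byte = 16 * column + row.
            column, row = part.split("/")
            result.append(int(column) * 16 + int(row))
    return bytes(result)

print(designator_bytes("ESC 2/5 2/15 4/6"))  # the sequence used in the declaration above
print(designator_bytes("ESC 2/5 4/7"))       # ESC % G
```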

Note SGML always interprets numeric character entity references as character numbers, ie. as single UCS code points independently of the document encoding (such as UTF-8 etc.) being used.
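
That numeric references denote code points regardless of encoding can be demonstrated with Python's standard html module, used here merely as a stand-in that resolves references by HTML5 rules (it is not an SGML parser, and the helper name resolve is hypothetical):

```python
import html

def resolve(raw: bytes, encoding: str) -> str:
    """Decode bytes with the given transport encoding, then resolve references."""
    return html.unescape(raw.decode(encoding))

# The same document text stored in two different encodings.
text = "\u00e9&#x205F;"          # e-acute, then a reference to U+205F
for encoding in ("utf-8", "latin-1"):
    resolved = resolve(text.encode(encoding), encoding)
    # The reference yields U+205F in both cases; only the byte
    # representation of the literal e-acute differs between encodings.
    print(encoding, [f"U+{ord(ch):04X}" for ch in resolved])
```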

For an overview of ISO 2022, please refer to ECMA-35 Character Code Structure and Extension Techniques, which is identical to ISO/IEC 2022:1994 and made available by ECMA International for public access.

Naming

Quoting from a SGML syntax reference:

The SGML declaration admits case folding/canonicalization to be switched on for these two groups of name tokens individually

  • entities (SYNTAX NAMECASE ENTITY YES/NO)

  • and for all other name token uses (SYNTAX NAMECASE GENERAL YES/NO)

but not for more granular subsets of the other name tokens.

In particular, name casing rules cannot be applied to ID values and IDREF/IDREFS value reference attributes in isolation, as would be required for HTML5 and HTML5.1, which don't apply case normalization to these values, but do apply it to other name tokens.

This is a long standing problem, because attributes modelled as ID values and value references (and hence subject to case-folding) are also prominently used in eg. href values containing fragment identifiers/URLs, where they are always interpreted case-sensitively and don't have IDREF declared value semantics. Modelling ID values and value references as ID and IDREF/IDREFS, respectively, thus requires additional attention in using ID values, value references, and fragment identifiers.

If the HTML5.1 DTD is used for post-processing content for web delivery with a generic SGML processor such as (Open)SP, the choice of whether to apply case-folding to ID values and value references also affects CSS selectors and selectors used in JavaScript code for DOM manipulation, and the issue cannot be solved by merely changing namecase settings on an individual document basis, since external links with fragment identifiers are affected as well.

For an SGML declaration to use with HTML5.1, then, a decision has to be made with respect to the possible choices for NAMECASE GENERAL, with the trade-offs as described next. The HTML5.1 DTD can be used with both variants. Of course, a third option would be to declare ID values and value references as generic CDATA attributes, which don't get case-folding applied; this workaround, however, isn't pursued further for this HTML5.1 DTD.

Note that entity names in HTML are always handled case-sensitively (NAMECASE ENTITY NO).

Using NAMECASE GENERAL YES
  • applies case-folding to elements, attributes, and all other tokens except entity names

  • is the traditional setting (and that used by older DTDs for earlier revisions of HTML, and hence the least surprising one)

  • bogusly renames (case-folds) ID values and value references, but does so in a locally consistent way, ie. any ID reference value points to the same referenced element after renaming, except when two or more ID values are used which only differ in their chosen namecasing (ie. map to the same canonicalized name after case-folding when they didn't before); in this respect, the mapping/case-folding isn't isomorphic, but this isn't usually considered a problem

  • as a workaround to the discussed fragment identifier issue, it should be ensured out-of-band that any ID value and value reference, and any fragment identifier (in href and other attributes), always uses the lowercase variant
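
An out-of-band check for the collision case mentioned above could look like the following Python sketch. It assumes the ID values have already been collected from the document by some other means; the function name find_case_collisions is hypothetical, and folding is limited to the ASCII letters.

```python
import string
from collections import defaultdict

# Fold only a-z to A-Z, leaving all other characters untouched.
_ASCII_FOLD = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)

def find_case_collisions(id_values):
    """Return folded names that are shared by differently-cased ID values."""
    groups = defaultdict(set)
    for value in id_values:
        groups[value.translate(_ASCII_FOLD)].add(value)
    return {folded: sorted(spellings)
            for folded, spellings in groups.items()
            if len(spellings) > 1}

print(find_case_collisions(["intro", "Intro", "summary"]))
```

An empty result means that case-folding maps distinct ID values to distinct names for the given document.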

Using NAMECASE GENERAL NO
  • doesn't case-fold any names/name tokens, which isn't formally correct with respect to HTML's stated case-folding semantics, but is arguably a better match to contemporary/idiomatic HTML usage, and is also a more faithful and practical expression of naming rules with respect to HTML's upcoming custom elements feature (see below)

  • could be seen as a problem for HTML's usage of foreign elements/vocabularies, in that HTML wants to apply case-folding even to foreign elements; but note that browsers apply case-folding in interpreting CSS selectors, DOM API calls and CSS-like selectors in DOM API methods such as querySelectorAll() anyway, so this doesn't add further complication to the situation (case-folding primarily poses a problem for SVG elements such as linearGradient)

  • since this leaves ID values and value references as-is, the (rather severe) ID case-folding issue doesn't apply

If NAMECASE GENERAL YES is used, HTML5.1's rules for Converting a string to uppercase (and similar definitions for case-sensitive matching, which are used in numerous further places throughout the specification), mandate that only the alphabetic characters in IRV (US-ASCII) are subject to case folding/canonicalization; hence HTML5.1's basic naming rules can be precisely expressed in SGML (in fact, HTML's restricted case folding can be seen as a consequence of SGML's historic limitations in this respect).
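
The difference between folding restricted to the IRV (US-ASCII) letters and general Unicode case conversion is easy to see in a short Python sketch (the helper name ascii_uppercase is hypothetical; Python's built-in str.upper() serves as the Unicode-aware counterpart):

```python
import string

# Translation table mapping only a-z to A-Z.
_ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)

def ascii_uppercase(name: str) -> str:
    """Uppercase the ASCII letters only, leaving all other characters as-is."""
    return name.translate(_ASCII_UPPER)

for name in ("linearGradient", "stra\u00dfe", "\u00e9l\u00e9ment"):
    # str.upper() also converts non-ASCII letters, the restricted variant does not.
    print(name, ascii_uppercase(name), name.upper())
```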

In HTML5.1, permitted characters for constructs equivalent or analogous to those mentioned above, where SGML uses name tokens, are defined on a case-by-case basis. For this analysis, HTML's element and attribute names and ID and class values are considered, and the result is applied to all other name tokens as well.

Name Characters

For element names, according to 8.1.2 Elements, the permitted characters are defined by the specification itself as the alphanumeric IRV (US-ASCII) characters:

HTML elements all have names that only use alphanumeric ASCII characters.

While decimal digits aren't actually used in any element defined in the specification (so could be left out from being allowed), this definition obviously isn't meant to constrain the elements admitted to occur in HTML in general, but is only a statement about the elements defined by the specification itself.

The definition of which characters an element name may contain must change when additional vocabularies are used with HTML, and is also challenged by WHATWG's custom element specification, which, while not enjoying universal browser support, is still expected to become part of a future revision of W3C's HTML specification.

While the custom elements specification isn't included in W3C HTML5.1, it gives rise to an informed decision for a definition of a set of admitted characters in element names, which are given in the following EBNF productions:

PotentialCustomElementName ::=
	[a-z] (PCENChar)* '-' (PCENChar)*

PCENChar ::=
	"-" | "." | [0-9] | "_" | [a-z] | #xB7 | [#xC0-#xD6] |
	[#xD8-#xF6] | [#xF8-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] |
	[#x203F-#x2040] | [#x2070-#x218F] | [#x2C00-#x2FEF] |
	[#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
	[#x10000-#xEFFFF]

These EBNF productions are similar to those used by the XML specification Version 1.0 Fifth ed., with the following modifications and additional constraints:

  • uppercase letters A-Z (but not other uppercase characters outside IRV/ASCII) are disallowed
  • as opposed to XML, also the U+00B7 MIDDLE DOT character is admitted

  • custom element names, like custom attribute names, are required to start with a lowercase letter (in IRV/US-ASCII)
  • custom element names must contain the - (U+002D HYPHEN-MINUS) character

  • custom element names must be different from those SVG and MathML element names containing - (U+002D HYPHEN-MINUS) as identified in https://html.spec.whatwg.org/#valid-custom-element-name, ie. must be different from annotation-xml, color-profile, font-face, font-face-src, font-face-uri, font-face-format, font-face-name, and missing-glyph
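
The productions and constraints listed above translate fairly directly into a regular expression. The Python sketch below is an illustration only: the function name is_valid_custom_element_name is hypothetical, and the character ranges and the list of reserved names are copied from the EBNF and the list above.

```python
import re

# Non-ASCII ranges of the PCENChar production.
_RANGES = [(0xB7, 0xB7), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x37D),
           (0x37F, 0x1FFF), (0x200C, 0x200D), (0x203F, 0x2040),
           (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
           (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF)]

_PCEN_CHAR = "[-._0-9a-z" + "".join(
    chr(lo) if lo == hi else f"{chr(lo)}-{chr(hi)}" for lo, hi in _RANGES) + "]"

# [a-z] (PCENChar)* '-' (PCENChar)*
_NAME = re.compile(f"[a-z]{_PCEN_CHAR}*-{_PCEN_CHAR}*")

# SVG and MathML element names containing a hyphen.
_RESERVED = {"annotation-xml", "color-profile", "font-face", "font-face-src",
             "font-face-uri", "font-face-format", "font-face-name",
             "missing-glyph"}

def is_valid_custom_element_name(name: str) -> bool:
    """Check a name against the productions and the reserved-name list."""
    return name not in _RESERVED and _NAME.fullmatch(name) is not None

for candidate in ("my-element", "myelement", "My-element", "font-face"):
    print(candidate, is_valid_custom_element_name(candidate))
```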

The requirement that custom elements must not contain uppercase letters, and must begin with a lowercase letter seems odd, considering that HTML wants to apply case-folding to an element anyway, even to foreign elements. Its existence points to an implementation detail in HTML parsers, eg. that scanning for end-element tags in script data and other CDATA-like contexts is performed by string searching where parts of the search string is formed from the lowercase letters of the element name to search for. This is similar to what SGML parsers do for determining the delimiter-in-context terminating the content of elements with declared content CDATA.

It can't be captured per se by an SGML declaration, which always admits both the lowercase and uppercase letters; but it's also not necessary to do so, since it's enforced by there simply not being an element declaration in the restrictive HTML5.1 DTD, when NAMECASE GENERAL NO is used.

Moreover, it's not possible to disallow the digit characters as name characters in SGML (but they're never allowed as name start characters).

The constraint with respect to conflicts with SVG and MathML elements is adequately represented in the restrictive HTML5.1 DTD and by the fact that declarations for these elements, when present, will prevent re-declaration attempts/name clashes.

An SGML declaration can't express the constraint related to presence of at least one - (U+002D HYPHEN-MINUS) character, but it's also not necessary to do so since custom elements must (or should, in the case of the permissive DTD) be declared eg. in the internal subset or other custom declaration set. What can be achieved here is to allow - (U+002D HYPHEN-MINUS) as name character but not as name start character.

All said, the relevant portion of the SGML declaration for capturing HTML's naming rules as expressed by the custom elements specification looks as follows (where all characters and character ranges are notated using decimal character numbers, and the SGML declaration makes use of the extended naming rules):

NAMING
	LCNMSTRT ""
	UCNMSTRT ""
	NAMESTRT
		46           -- FULL STOP --
		95           -- LOW LINE --
		183          -- MIDDLE DOT --
		192-214      -- #xC0-#xD6 --
		216-246      -- #xD8-#xF6 --
		248-893      -- #xF8-#x37D --
		895-8191     -- #x37F-#x1FFF --
		8204-8205    -- #x200C-#x200D --
		8255-8256    -- #x203F-#x2040 --
		8304-8591    -- #x2070-#x218F --
		11264-12271  -- #x2C00-#x2FEF --
		12289-55295  -- #x3001-#xD7FF --
		63744-64975  -- #xF900-#xFDCF --
		65008-65533  -- #xFDF0-#xFFFD --
		65536-983039 -- #x10000-#xEFFFF --
	LCNMCHAR ""    
	UCNMCHAR ""
	NAMECHAR
		45           -- HYPHEN-MINUS --

Note this definition admits the U+002D HYPHEN-MINUS character only as NAMECHAR, ie. as the second or subsequent character of name tokens, but not as their first character, as required by HTML's naming rules.

Note also this declaration can be used with either NAMECASE GENERAL YES or NAMECASE GENERAL NO.
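
The naming constraints discussed above (lowercase start character, no uppercase letters, at least one hyphen) can also be checked outside of SGML. The following is a minimal sketch in Python, assuming the name character ranges of the PCENChar production of the custom elements specification as recalled here; the function name is hypothetical, and the reserved names clashing with SVG and MathML elements aren't covered.

```python
import re

# Name characters beyond "-", ".", "_", digits and lowercase letters.
# Assumption: these ranges follow the PCENChar production of the
# custom elements specification.
_EXTRA = (
    "\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF"
    "\u200C-\u200D\u203F-\u2040\u2070-\u218F\u2C00-\u2FEF"
    "\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
_NAME_CHAR = "[-._0-9a-z" + _EXTRA + "]"

# A lowercase ASCII start letter, then name characters with at least
# one HYPHEN-MINUS somewhere after the first character.
_CUSTOM_NAME = re.compile("[a-z]" + _NAME_CHAR + "*-" + _NAME_CHAR + "*")


def is_potential_custom_element_name(name: str) -> bool:
    """Return True if name satisfies the naming rules sketched above."""
    return _CUSTOM_NAME.fullmatch(name) is not None
```

For example, is_potential_custom_element_name("my-element") holds, whereas "MyElement", "-foo" and "x" are rejected for containing uppercase letters, starting with a hyphen, and lacking a hyphen, respectively.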

Foreign names

HTML extends case-folding to foreign elements from SVG and MathML as well:

In the HTML syntax, tag names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that, when converted to all-lowercase, matches the element's tag name; tag names are case-insensitive.

As far as this HTML5.1 DTD is concerned, this doesn't pose additional problems not already addressed.

To avoid awkward problems when transferring and roundtripping foreign content (which in many cases will be created using XML-based tools), however, web authors are advised to nevertheless use foreign element names in their native, case-aware form in HTML content, as well as in CSS and JavaScript DOM API selectors.

Custom attribute names

The HTML5.1 specification makes the following restriction with respect to custom data attributes:

  • the name must follow XML naming rules (specifically, it must not contain the : (U+003A COLON) character, must otherwise represent a valid XML name, and must not begin with xml (case-insensitively))

  • the name must contain no uppercase letters
  • the name must begin with data- and have at least one character after the hyphen character.

The XML (Fifth Ed.) naming requirement is honored up to the differences already discussed above. Note that according to this definition, custom data attribute names must not contain eg. the MIDDLE DOT character, whereas custom element names are allowed to contain it; this difference isn't represented in the SGML declaration (and is considered accidental).

The other requirements can be honored when using the Restrictive DTD, eg. by declaring custom data attributes in the internal subset.

ID values

W3C's HTML5.1 specification says with respect to the ID attribute:

The value must be unique amongst all the IDs in the element's home subtree and must contain at least one character. The value must not contain any space characters.

and

There are no other restrictions on what form an ID can take; in particular, IDs can consist of just digits, start with a digit, start with an underscore, consist of just punctuation, etc.

According to this definition in isolation, the following examples are valid uses of ID values (which, however, might be constrained by other rules, eg. for parsing attributes, and by semantic rules):

<div id='>
<div id=">
<div id==>
<div id=?>
<div id=" ">
<div id=1>
<div id=#>
<div id=top>
<div id=<>
<div id=&>
<div id=/>
<div id=:(>

From these examples it's clear that HTML's choice of allowable ID values is unlikely to help with interoperability within a larger markup context.

While the following characters aren't formally invalid, since ID values are most commonly used

  • as href targets in anchors (using fragment identifiers rather than ID value references)

  • in CSS ID selectors (and selectors used in JavaScript DOM APIs)

for practical considerations (ie. avoiding having to use entity references), attribute values should generally not contain the following characters:

  • U+0022 QUOTATION MARK
  • U+0027 APOSTROPHE

Moreover, though not invalid, ID values can be expected to rarely contain the # (U+0023 NUMBER SIGN) character, since it's used to separate a fragment identifier from its preceding part in a URL.

Since it's then necessary to restrict the set of expected characters in name tokens to something more representative of actual HTML usage, the usage constraints mandated by CSS selector and URL syntax, as discussed below, are natural candidates on which to base an informed, non-arbitrary choice with respect to naming. As it turns out, however, the rules for name tokens in HTML4 already cover the set of reasonably usable characters pretty well, whereas HTML5/HTML5.1's liberal rules are easily seen as less recommendable.

HTML4 rules

As the least arbitrary/surprising thing to do, it seems reasonable to resort to HTML4's naming rules, which are also generally still recommended for new content by other resources. For example, the classic restriction on HTML ID values and class names, as eg. stated in MDN's article on the id attribute, is:

Note: Using characters except ASCII letters and digits, '_', '-', and '.' may cause compatibility problems [...]

Moreover, ID values starting with a digit should be avoided when it's desired to target these using CSS selectors.

This is because CSS selectors can target these values in ID and class selectors only with escaping, which wasn't supported in a portable way in browsers using earlier CSS revisions, or not at all.

From CSS 2.1 (section 4.1.3, Characters and case):

In CSS, identifiers (including element names, classes, and IDs in selectors) can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code [...]. For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F".

According to HTML 5.1: 2. Common infrastructure, HTML's space characters are the U+0020 SPACE, U+0009 CHARACTER TABULATION, U+000A LINE FEED, U+000C FORM FEED, and U+000D CARRIAGE RETURN characters (rather than the larger set of Unicode spaces).

By these definitions, we can leave the naming parameters precisely as above. Note it isn't surprising that we recover HTML4's rules here, but it is still mildly surprising that rather conservative rules (compared to the rules for ID values), appealing to the restrictions of SGML name tokens, were used for the definition of custom elements.

These constraints still hold in current CSS specifications as shown next.

From Selectors Level 3 (which is also identical to the relevant part of the Cascading Style Sheets Level 2 (CSS 2.2) Specification Working Draft):

ident     [-]?{nmstart}{nmchar}*
name      {nmchar}+
nmstart   [_a-z]|{nonascii}|{escape}
nonascii  [^\0-\177]
unicode   \\[0-9a-f]{1,6}(\r\n|[ \n\r\t\f])?
escape    {unicode}|\\[^\n\r\f0-9a-f]
nmchar    [_a-z0-9-]|{nonascii}|{escape}

This definition is still identical to that used in the older CSS specification (see above); eg. by this definition

  • the set of usable start characters in IRV/US-ASCII are the letters and _

  • all other characters need escaping
  • a name token starting with two - (U+002D HYPHEN-MINUS) characters isn't allowed, nor is a name token starting with - (U+002D HYPHEN-MINUS) followed by a digit.
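
To illustrate, the ident production above can be approximated by a regular expression for testing whether an ID value or class name is usable in a selector without escaping. This is a sketch only, assuming Python's re module and leaving out the escape production on purpose; the function name is hypothetical.

```python
import re

# nmstart and nmchar without the {escape} alternative; {nonascii} is
# taken as any character at or above U+0080. CSS matches the letters
# case-insensitively, hence a-zA-Z.
_NMSTART = r"[_a-zA-Z\u0080-\U0010FFFF]"
_NMCHAR = r"[_a-zA-Z0-9\u0080-\U0010FFFF-]"

# ident: an optional single hyphen, a start character, then name characters.
_IDENT = re.compile(rf"-?{_NMSTART}{_NMCHAR}*")


def needs_css_escaping(value: str) -> bool:
    """Return True if value can't be written as a CSS identifier as-is."""
    return _IDENT.fullmatch(value) is None
```

By this sketch, top, _x and -a can be used as-is, whereas 1, -1, --a, a.b and B&W? need escaping.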

Fragment identifier syntax

Allowing a larger set of characters needs to be weighed against rules for URI fragment and CSS selector parsing.

That is, since ID values are used to form fragment identifiers in URLs, surely a recommendation about which characters an ID value should or shouldn't contain must look at the characters permitted in URL fragments for an informed choice, since characters not in this set need URL escaping (eg. percent-encoding).

From RFC3986:

fragment     = *( pchar / "/" / "?" )
pchar        = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved   = ALPHA / DIGIT / "-" / "." / "_" / "~"

sub-delims   = "!" / "$" / "&" / "'" / "(" / ")"
             / "*" / "+" / "," / ";" / "="

The earlier RFC2396 specification also reduces the set further by recommending to avoid the following unwise characters anywhere in a URL:

unwise       = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Applying these recommendations, and leaving out the ' (U+0027 APOSTROPHE) character as discussed above, reduces the set of usable non-alphanumeric characters in fragment identifiers to these:

/  ?                          (fragment production)
:  @                          (pchar production)
-  .  _  ~                    (unreserved production)
!  $  &  (  )  *  +  ,  ;  =  (sub-delims production)

From these, the following are disallowed because of conflicts with markup parsing rules:

/ (SOLIDUS)
used in markup for bogus XML-style empty elements, so shouldn't be used as last character in element names
? (QUESTION MARK)
used to start a processing instruction, so that it can't be used as first character of element names, and shouldn't be used as last character either
! (EXCLAMATION MARK)
used to start a comment/markup declaration, so that it can't be used as first character of element names
& (AMPERSAND)
used in markup for entity references, hence can't be used in entity names anywhere
; (SEMICOLON)
used to terminate an entity reference, hence can't be used in entity names anywhere
= (EQUALS-SIGN)
can't be used as part of attribute names.

Moreover, the following characters should be disallowed because of conflicts with selector syntax:

# (NUMBER SIGN)
disallowed anywhere, since used as CSS ID selector, and moreover could be used as right operand of a descendant selector
. (FULL STOP)
disallowed anywhere, since used as CSS class selector, and moreover could be used as right operand of a descendant selector, hence shouldn't be allowed to occur as namechar
@ (AT-SIGN)
disallowed as name start character, since used in CSS at-rules
+ (PLUS), > (GREATER-THAN), ',' (COMMA), ~ (TILDE)
disallowed anywhere, since used as CSS combinators, and ~ is also used in the ~= attribute selector such that an attribute name needs escaping when it contains ~
* (ASTERISK)
disallowed anywhere, since used as CSS wildcard, and also used in the *= attribute selector such that an attribute name needs escaping when it contains *
: (COLON)
disallowed anywhere, since used for CSS pseudo-element and pseudo-class selectors

arriving at the following characters

- (HYPHEN-MINUS), _ (LOW LINE)
for the characters allowed anywhere in a name token

and

! (EXCLAMATION MARK), @ (AT-SIGN)
as name characters but not name start characters.

Note the . (FULL STOP) character is excluded by these rules even though allowed as namechar of element names above.

It's unwise to alter HTML's name start characters because HTML parsers could depend on it being in the fixed range of lowercase letters (see above).

This leaves ! and @ as additional characters for HTML name characters but not name start characters. This is considered too small a benefit to warrant changing HTML's naming rules at all.
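
The set of non-alphanumeric characters usable in fragment identifiers, as derived earlier in this section, can be reproduced mechanically from the quoted grammar productions. The following Python sketch does so and uses the standard library's urllib.parse.quote to percent-encode everything else in an ID value; the constant and function names are hypothetical.

```python
from urllib.parse import quote

# Punctuation admitted by the quoted productions.
FRAGMENT_EXTRA = "/?"         # fragment production
PCHAR_EXTRA = ":@"            # pchar production
UNRESERVED_PUNCT = "-._~"     # unreserved production, without ALPHA/DIGIT
SUB_DELIMS = "!$&'()*+,;="    # sub-delims production


def usable_fragment_punctuation() -> str:
    """Collect the punctuation of all productions, minus quote characters."""
    candidates = FRAGMENT_EXTRA + PCHAR_EXTRA + UNRESERVED_PUNCT + SUB_DELIMS
    return "".join(ch for ch in candidates if ch not in "\"'")


def fragment_for_id(id_value: str) -> str:
    """Percent-encode an ID value for use as a URL fragment identifier."""
    return quote(id_value, safe=usable_fragment_punctuation())
```

For example, fragment_for_id("top") and fragment_for_id(":(") return their argument unchanged, whereas a space, # or < in an ID value comes out percent-encoded.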

Named character references

The HTML5 and HTML5.1 specifications name 2231 named character references available for use in HTML5/HTML5.1, and list the following public identifiers as reference:

  • -//W3C//DTD XHTML 1.0 Transitional//EN

  • -//W3C//DTD XHTML 1.1//EN

  • -//W3C//DTD XHTML 1.0 Strict//EN

  • -//W3C//DTD XHTML 1.0 Frameset//EN

  • -//W3C//DTD XHTML Basic 1.0//EN

  • -//W3C//DTD XHTML 1.1 plus MathML 2.0//EN

  • -//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN

  • -//W3C//DTD MathML 2.0//EN

  • -//WAPFORUM//DTD XHTML Mobile 1.0//EN

Note that the link provided by the HTML5.1 specification (https://www.w3.org/TR/2016/REC-html51-20161101/entities.dtd) to point to the consolidated SGML/XML entity set covering the reference sources is broken. For the purpose of constructing the restrictive DTD, it is replaced anyway with the canonical source for these entity sets at https://www.w3.org/2003/entities/2007/htmlmathml-f.ent (referred to as htmlmathml-f.ent in subsequent discussion), as advised in https://www.w3.org/TR/xml-entity-names/.

While the vast majority of these entities are provided for MathML support, the entity set is part of HTML5/HTML5.1 and thus available anywhere in HTML content.

The Permissive HTML5.1 DTD doesn't include these declarations as an entity set; instead, named entity references are provided using WebSGML's (ISO 8879 Annex K) "predefined data character entities" feature.

Using predefined entities can capture the notion that a web browser has built-in support for displaying HTML5's named character references, and that at no point in the process are predefined data character entities actually substituted into numeric character entity references.

The latter notion is required since WebSGML predefined data character entities can only map to a single character number (UCS code point), rather than a sequence of code points; for example, sgmljs.net SGML, like (Open)SP SGML's osgmlnorm program, will reproduce predefined data character entities as-is (ie. as entity references, rather than replaced characters) in the result markup when used as a command-line application with the proper target SGML declaration options.

Even though the actual replacement character number to which a predefined entity is mapped is thus inconsequential for SGML processors, those entities mapping to multi-code point sequences aren't included in the predefined character entities because predefined entities can't be redefined; if their use is desired, they can simply be included in the internal subset of a document by using eg.

<!ENTITY % htmlmathml-f PUBLIC
	"-//W3C//ENTITIES HTML MathML Set//EN//XML"
	"http://www.w3.org/2003/entities/2007/htmlmathml-f.ent">

%htmlmathml-f;

as advised in https://www.w3.org/TR/xml-entity-names/.

The affected entities are listed below, along with a recommendation as to their replacement.

Variation Sequences

Of those listed by the HTML5 specification, the following entities (listed along with their base code points) are combined with U+FE00 VARIATION SELECTOR-1 into variation sequences:

  • caps (U+2229 INTERSECTION)

  • cups (U+222A UNION)

  • gvertneqq/gvnE (U+2269 GREATER-THAN BUT NOT EQUAL TO)

  • lates (U+2AAD LARGER THAN OR EQUAL TO)

  • lesg (U+22DA LESS-THAN EQUAL TO OR GREATER-THAN)

  • lvertneqq/lvnE (U+2268 LESS-THAN BUT NOT EQUAL TO)

  • smtes (U+2AAC SMALLER THAN OR EQUAL TO)

  • sqcaps (U+2293 SQUARE CAP)

  • sqcups (U+2294 SQUARE CUP)

  • varsubsetneq, vsubne (U+228A SUBSET OF WITH NOT EQUAL TO)

  • varsubsetneqq, vsubnE (U+2ACB SUBSET OF ABOVE NOT EQUAL TO)

  • varsupsetneq, vsupne (U+228B SUPERSET OF WITH NOT EQUAL TO)

  • varsupsetneqq, vsupnE (U+2ACC SUPERSET OF ABOVE NOT EQUAL TO)

Note all variation sequences used in htmlmathml-f.ent are standardized variants.

To be able to use references to these, simply use their respective base code point or base entity. For example, caps (U+2229 INTERSECTION, U+FE00 VARIATION SELECTOR-1), which is supposed to render INTERSECTION with serifs, should be replaced by just cap (U+2229 INTERSECTION); using the respective base character is also the recommended practice in the Unicode Variation Sequences FAQ.
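
The replacement of a variation sequence by its base character can be sketched as follows, assuming Python's standard html and unicodedata modules (the helper name is hypothetical). It expands the caps reference, strips the variation selector, and arrives at the expansion of cap.

```python
import html
import unicodedata

VARIATION_SELECTOR_1 = "\uFE00"


def strip_variation_selector(text: str) -> str:
    """Drop U+FE00 so that only the base characters remain."""
    return text.replace(VARIATION_SELECTOR_1, "")


# &caps; expands to the base character followed by the variation selector.
caps = html.unescape("&caps;")
base = strip_variation_selector(caps)

print([unicodedata.name(ch) for ch in caps])
print([unicodedata.name(ch) for ch in base])
```

The first list printed names both code points of the sequence; the second names only the base character.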

Combining Marks

Moreover, the following combining marks are used together with other characters (eg. as in race which maps to the sequence U+223D REVERSED TILDE, U+0331 COMBINING MACRON BELOW):

  • U+0333 COMBINING DOUBLE LOW LINE

  • U+20E5 COMBINING REVERSE SOLIDUS OVERLAY

  • U+0338 COMBINING LONG SOLIDUS OVERLAY

  • U+20D2 COMBINING LONG VERTICAL LINE OVERLAY

  • U+0331 COMBINING MACRON BELOW

in the following entities:

  • acE: U+223E, U+0333

  • bne: U+003D, U+20E5

  • bnequiv: U+2261, U+20E5

  • nang: U+2220, U+20D2

  • nbump: U+224E, U+0338

  • nbumpe: U+224F, U+0338

  • ncongdot: U+2A6D, U+0338

  • nedot: U+2250, U+0338

  • nesim: U+2242, U+0338

  • ngE, ngeqq: U+2267, U+0338

  • ngeqslant, nges: U+2A7E, U+0338

  • nGt: U+226B, U+20D2

  • nGtv: U+226B, U+0338

  • nlE, nleqq: U+2266, U+0338

  • nleqslant, nles: U+2A7D, U+0338

  • nLt: U+226A, U+20D2

  • nLtv: U+226A, U+0338

  • NotEqualTilde: U+2242, U+0338

  • NotGreaterFullEqual: U+2267, U+0338

  • NotGreaterGreater: U+226B, U+0338

  • NotGreaterSlantEqual: U+2A7E, U+0338

  • NotHumpDownHump: U+224E, U+0338

  • NotHumpEqual: U+224F, U+0338

  • notindot: U+22F5, U+0338

  • notinE: U+22F9, U+0338

  • NotLeftTriangleBar: U+29CF, U+0338

  • NotLessLess: U+226A, U+0338

  • NotLessSlantEqual: U+2A7D, U+0338

  • NotNestedGreaterGreater: U+2AA2, U+0338

  • NotNestedLessLess: U+2AA1, U+0338

  • NotPrecedesEqual, npre, npreceq: U+2AAF, U+0338

  • NotRightTriangleBar: U+29D0, U+0338

  • NotSquareSubset: U+228F, U+0338

  • NotSquareSuperset: U+2290, U+0338

  • NotSubset, nsubset: U+2282, U+20D2

  • NotSucceedsEqual, nsucceq: U+2AB0, U+0338

  • NotSucceedsTilde: U+227F, U+0338

  • NotSuperset: U+2283, U+20D2

  • nparsl: U+2AFD, U+20E5

  • npart: U+2202, U+0338

  • nrarrc: U+2933, U+0338

  • nrarrw: U+219D, U+0338

  • nsce: U+2AB0, U+0338

  • nsubE, nsubseteqq: U+2AC5, U+0338

  • nsupE, nsupseteqq: U+2AC6, U+0338

  • nsupset: U+2283, U+20D2

  • nvap: U+224D, U+20D2

  • nvge: U+2265, U+20D2

  • nvgt: U+003E, U+20D2

  • nvle: U+2264, U+20D2

  • nvlt: U+003C, U+20D2

  • nvltrie: U+22B4, U+20D2

  • nvrtrie: U+22B5, U+20D2

  • nvsim: U+223C, U+20D2

  • race: U+223D, U+0331

  • vnsub: U+2282, U+20D2

  • vnsup: U+2283, U+20D2

For the two most commonly used combinations, it is recommended to use the following replacements instead:

  • bne (U+003D EQUALS SIGN, U+20E5 COMBINING REVERSE SOLIDUS OVERLAY): use the ne or NotEqual entity instead

  • bnequiv (U+2261 IDENTICAL TO, U+20E5 COMBINING REVERSE SOLIDUS OVERLAY): use the nequiv or NotCongruent entity instead

Other entities making use of combining marks are not represented in the set of predefined entities in the HTML5.1 SGML declaration for the Permissive DTD.

Special entities

Apart from those, the following HTML5 named character references are mapped to text elements with more than a single code point:

fjlig (U+0066 LATIN SMALL LETTER F, U+006A LATIN SMALL LETTER J)

as an f-j ligature is missing from Unicode, fjlig is supposedly meant as a placeholder for authors until one is added; but this doesn't appear to be happening, which is why it's not represented in the set of predefined entities (note that rendering a ligature is mostly performed by the font in use anyway whenever the code point sequence appears in text data)

ThickSpace (U+205F MEDIUM MATHEMATICAL SPACE, U+200A HAIR SPACE)

use emsp13 instead (emsp13 is mapped to U+2004 THREE-PER-EM SPACE as replacement, and, according to Unicode spaces, medium mathematical space is 4/18 em = 1/4.5 em, hence medium mathematical space + hair space = 4/18 em + 2/18 em = 6/18 em = 1/3 em)

Removing the multi-code point sequences from the set of predefined entities allows us to specify UCS implementation level 1, provided no other multi-code point sequences are used in user data.

Note XML Entity Definitions for Characters (3rd Edition) contains a similar, but not identical, analysis.
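
The grouping used in this section can be cross-checked against the HTML5 entity table shipped with Python's standard library (html.entities.html5). The following is a sketch under the assumption that this table mirrors the specification's list; the function name and the group labels are hypothetical.

```python
import unicodedata
from html.entities import html5

VARIATION_SELECTOR_1 = "\uFE00"


def classify_multi_code_point_entities() -> dict[str, list[str]]:
    """Group named character references that expand to more than one code point."""
    groups: dict[str, list[str]] = {"variation": [], "combining": [], "other": []}
    for name, value in html5.items():
        # The table also lists legacy names without ";" (eg. "amp"); skip those.
        if not name.endswith(";") or len(value) < 2:
            continue
        entity = name[:-1]
        if VARIATION_SELECTOR_1 in value:
            groups["variation"].append(entity)
        elif any(unicodedata.category(ch) == "Mn" for ch in value):
            # Nonspacing marks such as U+0338 or U+20D2.
            groups["combining"].append(entity)
        else:
            groups["other"].append(entity)
    return {label: sorted(names) for label, names in groups.items()}


groups = classify_multi_code_point_entities()
```

Here, groups["variation"] contains eg. caps, groups["combining"] contains eg. race and bne, and groups["other"] contains fjlig and ThickSpace.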

XML-style empty-elements

For the SGML declaration for HTML5.1, WebSGML's FEATURES MINIMIZE EMPTYNRM YES setting is used.

This allows

  • bogus XML-style empty elements for eg. HTML's meta elements and other HTML "void" elements, and thus matches HTML's prescribed parsing exactly

  • XML-style empty elements in foreign content on elements having declared content EMPTY (in addition to all other elements).

Unquoted attribute values

HTML5/HTML5.1 allows omitting quote characters (called the LIT or LITA delimiters in SGML parlance, which are the double or single quote characters in the reference concrete syntax, resp.) on any attribute where the attribute value happens to not contain space characters, not just the boolean attributes.

HTML's rules match, and are inherited from, SGML's, where, in addition, the behaviour is further subject to the FEATURES MINIMIZE ATTRIB OMITNAME setting of the SGML declaration, which, only when YES, allows this on undeclared attributes.

For the Restrictive DTD, this setting is NO; for the Permissive DTD, this setting is YES.

Omitting quotes on attributes other than the Boolean attributes and other enumerated attributes wasn't very common in HTML until relatively recently and was discouraged in earlier specifications; the latest WHATWG specification text, however, makes aggressive use of it.
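
As an illustration of when quotes can be omitted when writing HTML, consider the following Python sketch. It assumes the rule of the HTML syntax for unquoted attribute values, which, beyond space characters, also excludes the ", ', =, <, > and ` characters and the empty string; the function names are hypothetical, and SGML processors may apply stricter rules.

```python
# HTML's space characters plus the characters assumed to be excluded
# from unquoted attribute values.
HTML_SPACE = " \t\n\f\r"
FORBIDDEN_UNQUOTED = HTML_SPACE + "\"'=<>`"


def can_omit_quotes(value: str) -> bool:
    """Return True if value may be written without LIT/LITA delimiters."""
    return value != "" and not any(ch in FORBIDDEN_UNQUOTED for ch in value)


def write_attribute(name: str, value: str) -> str:
    """Serialize an attribute, quoting the value only where necessary."""
    if can_omit_quotes(value):
        return f"{name}={value}"
    escaped = value.replace("&", "&amp;").replace('"', "&quot;")
    return f'{name}="{escaped}"'
```

For example, write_attribute("id", "top") yields id=top, whereas write_attribute("title", "a b") yields title="a b".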

Custom elements

The Custom elements specification, while not part of W3C HTML5.1, was already considered when defining the set of allowable characters in element names and other name tokens; the information provided here gives a general statement on their intended use with the Restrictive DTD, but isn't meant as a comprehensive treatment of the subject.

For use with the Restrictive DTD for HTML5.1, the recommended way to handle custom elements is to declare these in the internal subset (a declaration for these must be present when using the Restrictive DTD but doesn't need to be when using the Permissive DTD).

To be able to actually use custom elements in browsers, these must be registered using JavaScript, but this is considered out of scope here.

Note that apart from custom elements, there are also customized built-in elements. For these, on the HTML/markup side of things, not much has to be done, as behaviours with respect to tag omission and omission of attribute names are already declared. New attributes can be declared as needed using sgmljs.net SGML; (Open)SP SGML, on the other hand, doesn't allow multiple ATTLIST declarations for the same element, so for (Open)SP SGML, custom attributes must be manually added to the respective ATTLIST declaration in the Restrictive HTML5.1 DTD.

On the other hand, custom elements that aren't replacements for built-in elements are required to support global attributes, including their special parsing rules, just like regular HTML elements (see eg. https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes#id), and are thus expected to receive attribute declarations for global attributes via WebSGML ATTLIST #ALL declarations. While these are already used for the Permissive DTD, they aren't used for the Restrictive DTD yet; it is expected that a future revision of this DTD (for eg. HTML5.2, which has been announced for 4Q2017) will be published without support for (Open)SP SGML or other processors lacking ATTLIST #ALL support (note the Permissive DTD already requires ATTLIST #ALL in this revision).

UTF-8 encoding declaration

The W3C document at https://www.w3.org/International/questions/qa-html-encoding-declarations recommends always declaring the encoding of a document using a meta element with either the charset or the http-equiv (for HTML4) attribute when using UTF-8.

However, this is not considered a DTD issue and isn't enforced by this DTD. It can't, because SGML won't infer whole elements (it will only infer start- and end-element tags), and it shouldn't, because there are of course legitimate reasons to use a different encoding, even though UTF-8 is the preferred one.

Uniform header values and other global site or route defaults can be implemented using SGML templates, however.
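
Since the encoding recommendation isn't enforced by the DTD, a check for it has to happen outside of SGML validation. The following is a minimal sketch of such a check using the HTMLParser class from Python's standard library; the class and function names are hypothetical, and only the two meta forms mentioned above are recognized.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remember the first encoding declared via a meta element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # HTML5 form: <meta charset=utf-8>
            self.charset = attributes["charset"].strip().lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # HTML4 form: <meta http-equiv=Content-Type content="...; charset=...">
            content = (attributes.get("content") or "").lower()
            _, found, rest = content.partition("charset=")
            if found:
                self.charset = rest.split(";")[0].strip()


def declared_charset(markup: str):
    """Return the declared encoding in lowercase, or None if there is none."""
    finder = MetaCharsetFinder()
    finder.feed(markup)
    finder.close()
    return finder.charset
```

A site-wide check could then report every document for which declared_charset doesn't return "utf-8".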