This document describes a transcription of WHATWG's HTML specification prose into an SGML DTD. It follows WHATWG snapshots as published by W3C (WHATWG itself doesn't publish stable snapshots of its specifications). The resulting DTD contains declarations for all elements and attributes of HTML, SVG, MathML, and the ARIA attributes.
The resulting DTD is primarily useful for checking/validating and normalizing HTML. In SGML applications, it's common (and the point of using SGML in the first place) to define custom DTDs containing application-specific grammar and processing rules, including for generic HTML applications such as outlining, metadata extraction, search result formatting, paging, templating, etc. Users are expected to create custom DTDs based on the HTML DTD provided here.
This HTML 5 DTD is a straightforward translation of the specification text for HTML's content model and tag omission rules into DTD grammar rules. The specification text is included as an SGML comment along with the translated DTD rule for reference; where HTML content model rules are represented incompletely, a note is included in the SGML comment for the declaration as well. Attribute default values aren't included. Predefined HTML entities (character entity references), as explained in attribute defaults, aren't included either. The DTD is designed to be used with the SGML declaration for HTML5.
WHATWG/W3C's HTML5 specification document states that
XML DTDs cannot express all the conformance requirements of this specification. Therefore, a validating XML processor and a DTD cannot constitute a conformance checker. Also, since neither of the two authoring formats defined in this specification are applications of SGML, a validating SGML system cannot constitute a conformance checker either
but doesn't provide examples where SGML can't specifically be used for checking. For XML DTDs and other XML-based schema languages it's easy enough to conclude these can't describe HTML for their lack of a way to express empty elements, omitted tags, omitted attribute names, and unquoted attributes, among other things.
For SGML, on the other hand, it's less obvious, hence this text discusses parsing and validation issues of modern HTML using SGML in depth.
The HTML5.1 specification text introduces the elements of HTML using a taxonomic approach and presents a classification and accompanying Venn diagram depicting inclusion relationships between HTML element categories derived from definitions contained in earlier HTML DTDs.
It is felt that the fundamental grammatical construction of the HTML vocabulary as inline content wrapped in an optional layer of block-level content isn't quite apparent in this definition. Earlier HTML DTDs included the following definition for flow, capturing this in a rather straightforward way:
<!ENTITY % flow "%block; | %inline;">
While HTML5 lacks it, a definition for "block content" can be recovered by taking those elements of the "flow content" category that aren't also in the "phrasing content" (inline) category.
Nesting of phrasing into flow elements is about the most basic property of the HTML grammar. In SGML, the flow/phrasing hierarchy is expressed by declaring the content model of flow elements as allowing %phrasing;, where %phrasing;, like in earlier HTML DTDs, is substituted into a string such as a|abbr|... containing all phrasing elements as a name group. For example, the element declaration for the p element is as follows:
<!ELEMENT p - O (#PCDATA|%phrasing;)* -(%flow_only;)>
meaning that
the content of p elements can be any sequence of text content (#PCDATA) and phrasing elements,
flow content isn't just forbidden in direct child content of p (via admitting %phrasing; elements only), but also isn't admitted anywhere in descendant content of p, as expressed by the -(%flow_only;) SGML exclusion exception, and
the end-element tag, but not the start-element tag, of p can be omitted, as declared in p's tag omission indicator - O (see Tag Omission),
where the flow_only parameter entity contains block-level elements as described above, ie. flow elements that aren't also phrasing elements.
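The derivation of flow_only amounts to a set difference, which can be sketched in a few lines of Python. This is only an illustration: the element sets are small, deliberately incomplete stand-ins for HTML5's flow and phrasing categories, and the helper names flow_only and parameter_entity are made up for this example rather than taken from the DTD generator.

```python
# Illustrative, deliberately incomplete subsets of HTML5's element categories.
FLOW = {"a", "abbr", "address", "blockquote", "div", "em", "p", "span", "table", "ul"}
PHRASING = {"a", "abbr", "em", "span"}


def flow_only(flow, phrasing):
    """Return the flow elements that are not also phrasing elements, sorted."""
    return sorted(set(flow) - set(phrasing))


def parameter_entity(name, elements):
    """Render a list of element names as an SGML parameter entity declaration."""
    return '<!ENTITY % {} "{}">'.format(name, "|".join(elements))


print(parameter_entity("flow_only", flow_only(FLOW, PHRASING)))
# prints: <!ENTITY % flow_only "address|blockquote|div|p|table|ul">
```

The resulting name group is what an exclusion exception such as -(%flow_only;) would expand to, under the assumption that the real entity is built the same way from the full category lists.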
In older HTML DTDs, formally only block-level elements can appear directly in an HTML document body; phrasing content had to be wrapped into at least a paragraph (or generic block-level div) container element.
However, browsers never inferred block-level elements when they were missing in content (or made their presence visible in the DOM). Essentially, this constraint was never enforced.
The HTML5.1 grammar follows actual browser behaviour, in that any flow content, including phrasing content, is formally accepted as direct child of the body element.
The HTML5 specification lists tag omission rules for each applicable element (or element combination) individually. For example, the specification text for the p element reads
A p element's end tag may be omitted if the p element is immediately followed by an address, article, aside, blockquote, details, div, dl, fieldset, figcaption, figure, footer, form, h1, h2, h3, h4, h5, h6, header, hr, main, menu, nav, ol, p, pre, section, table, or ul, element, or if there is no more content in the parent element and the parent element is an HTML element that is not an a, audio, del, ins, map, noscript, or video element.
For this HTML5 DTD, the text description is transcribed into SGML tag omission indicators in the most straightforward way, based on whether start-element tag, end-element tag, or both start- and end-element tag omission is allowed at all, avoiding verbose enumeration of specific elements.
For example, the tag omission rules for p are represented using this simple SGML element declaration:
<!ELEMENT p - O (#PCDATA|%phrasing;)*>
where the - (hyphen/minus, meaning "not omissible") and the O (letter O, meaning "omissible") prescribe p's start-element and end-element tag omission behaviour, respectively, and %phrasing; expands to the string a|abbr|... containing HTML's phrasing elements.
This element declaration will make an SGML parser end a paragraph element if encountering any element not in the phrasing category, or an end-element tag which isn't balanced within p's content, thereby capturing HTML's parsing rules.
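The effect of the "- O" indicator on p can be illustrated with a toy end-tag inference routine in Python. This is a sketch under strong assumptions, not an SGML parser: PHRASING is an abbreviated stand-in for %phrasing;, tokens are pre-split strings, the function name close_paragraphs is hypothetical, and omitted end tags of phrasing elements left open inside p are not handled.

```python
PHRASING = {"a", "abbr", "em", "span", "strong"}  # abbreviated stand-in for %phrasing;


def close_paragraphs(tokens):
    """Insert implied </p> tags into a list of '<name>', '</name>' and text tokens."""
    out = []
    inside = None  # None: no p open; otherwise a stack of phrasing elements open in p
    for tok in tokens:
        if tok.startswith("</"):
            name = tok[2:-1]
            if inside is not None:
                if name == "p":
                    inside = None          # explicit end tag for the paragraph
                    out.append(tok)
                    continue
                if inside and inside[-1] == name:
                    inside.pop()           # end tag balanced within p's content
                    out.append(tok)
                    continue
                out.append("</p>")         # unbalanced end tag implies </p>
                inside = None
            out.append(tok)
        elif tok.startswith("<"):
            name = tok[1:-1]
            if inside is not None:
                if name in PHRASING:
                    inside.append(name)    # phrasing content stays inside p
                    out.append(tok)
                    continue
                out.append("</p>")         # non-phrasing element implies </p>
                inside = None
            if name == "p":
                inside = []
            out.append(tok)
        else:
            out.append(tok)                # character data
    if inside is not None:
        out.append("</p>")                 # end of input closes an open paragraph
    return out


print(close_paragraphs(["<div>", "<p>", "one", "<p>", "two", "<em>", "x", "</em>", "</div>"]))
```

In this example the second p start tag ends the first paragraph, and the div end tag, which is not balanced within the second paragraph's content, ends that one.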
As presented in the HTML5 specification, the choice of explicitly enumerated elements that cause the paragraph element to be terminated may seem arbitrary, but is in fact (up to potential minor omissions considered errors) the set of HTML elements that are in the "flow" category without also being in the "phrasing" category. This isn't surprising, since HTML parsing rules were originally specified using SGML grammars such as the above. By recovering HTML's original parsing rules from HTML5's specification text, we conclude that HTML5's parsing rules are represented adequately, and more succinctly, since the declaration avoids redundantly enumerating p-terminating elements.
This interpretation is also supported by the fact that for the HTML5.1 specification update (vs. HTML5), the new details, figcaption, figure and menu elements (which are flow-only elements), but not the new picture element (which can also be used in phrasing content), have been added to the set of p-terminating elements.
End-element tag omission is commonly used in HTML in the following situations:
on list items when directly followed by other list items or when followed by ul or ol end-element tags
on definition list terms or definitions, when followed by other terms or definitions, or when followed by a dl end-element tag
in head content, upon encountering an element that can't be placed into head element content
at the end of a document, for html, body, and other elements (these elements also allow start-tag omission as discussed in the context of the permissive DTD in Tag omission on document-level elements, and the rules provided there fully apply to the restrictive DTD as well)
All of these uses are parsed/validated by this HTML5 DTD in the expected way and in the same way as W3C's HTML5 validation software (as far as can be told).
Start-element tag omission (other than in table content) is only allowed on the html, head, and body elements, which is trivially supported by this HTML5.1 DTD in the expected way; see Tag omission on document-level elements.
Tag omission in table content deserves a closer examination.
The relevant specification text reads as follows:
The table element
Content: In this order: optionally a caption element, followed by zero or more colgroup elements, followed optionally by a thead element, followed by either zero or more tbody elements or one or more tr elements, followed optionally by a tfoot element, optionally intermixed with one or more script-supporting elements.
Tag omission: Neither tag is omissible.
The thead element
Content: Zero or more tr and script-supporting elements
Tag omission: A thead element's end tag may be omitted if the thead element is immediately followed by a tbody or tfoot element.
The tfoot element
Content: Zero or more tr and script-supporting elements
Tag omission: A tfoot element's end tag may be omitted if the tfoot element is immediately followed by a tbody element.
The tbody element
Content: Zero or more tr and script-supporting elements.
Tag omission: A tbody element's start tag may be omitted if the first thing inside the tbody element is a tr element, and if the element is not immediately preceded by a tbody, thead, or tfoot element whose end tag has been omitted. (It can't be omitted if the element is empty.). A tbody element's end tag may be omitted if the tbody element is immediately followed by a tbody or tfoot element, or if there is no more content in the parent element.
By the specification text, the following HTML fragment (representing typical use of tag omission in table content) is allowed, and is accepted both by this HTML5 DTD and by W3C's HTML5 validation software:
<table>
<thead>
<tr>ONE
<tr>TWO
<tr>THREE
<tbody>
<tr>One...
<tr>Two...
<tr>Three...
</table>
However, the specification text for tbody also admits omission of a tbody start-element tag if the first thing inside the tbody element is a tr element, and if the element is not immediately preceded by a tbody, thead, or tfoot element whose end tag has been omitted.
The relevant part of the content model for tbody's parent element, table, admits either zero or more tbody elements, or one or more tr elements.
Hence, table content such as the following
<table>
<thead>
<tr>ONE
<tr>TWO
<tr>THREE
</thead>
<tr>One...
<tr>Two...
<tr>Three...
</table>
is valid according to at least two rules in the HTML specification: the tr elements following thead can be parsed by HTML5's parsing rules either as tr elements being placed directly into table content, or as an instance of a tbody element with omitted start- and end-element tags.
This HTML5 DTD (as do Mozilla's validator.nu software and web browsers) interprets such content as an instance of the former case, and requires that a tbody start-element tag is present in content to force the latter interpretation. The ambiguous production rule for tbody, as stated in the HTML5.1 specification, can never apply in the absence of start-element tags for tbody.
Presumably, the ambiguity of the tag omission rules for table content is inadvertent; even the specification text (chapter 6.7.6) itself seems to use tag omission in table content incorrectly. The W3C HTML5.1 specification contains this fragment (the leading paragraph is included to help locate the passage in the document):
<p>The element <var>host element</var> to create for the media is the element given in
the table below in the second cell of the row whose first cell describes the media. The
appropriate attribute to set is the one given by the third cell in that same row.</p>
<table>
<thead>
<tr>
<th> Type of media
<th> Element for the media
<th> Appropriate attribute
<tr>
<td> Image
<td> <code>img</code>
<td> <code>src</code>
<tr>
<td> Video
<td> <code>video</code>
<td> <code>src</code>
<tr>
<td> Audio
<td> <code>audio</code>
<td> <code>src</code>
</table>
Note that this table doesn't contain a body, which however isn't the point here. Going by how the table is rendered, and by analogy with other passages containing tables which do have a tbody element specified, it can be concluded that what is probably intended here is that the first tr element should be treated as the major table heading row, while subsequent rows should be treated as table body rows.
The rule for tag omission of the thead element reads "A thead element's end tag may be omitted if the thead element is immediately followed by a tbody or tfoot element" (in both the W3C HTML5.1 and the current WHATWG specification text), hence we expect the above fragment to be rejected, since the rule does not say that a thead end-element tag can be omitted if followed by a table end-element tag (whereas other parsing rules for end-element tag omission state this kind of condition explicitly).
However, the fragment is happily accepted by W3C's validation software, and hence slipped into the published specification text; the HTML5 DTD follows validator.nu here and accepts it as well.
In an attempt to interpret HTML5's informally stated syntax description, we note that a sentence such as "A thead element's end tag may be omitted if the thead element is immediately followed by a tbody or tfoot element" is inherently self-referential, since the thead element's end isn't yet established while assessing its end-element tag omission status, hence whatever follows it in content isn't either (the definition of tbody's start-element tag omission stated earlier has a similar problem).
Further analysis of HTML5's expected behaviour requires a stated formal semantics for interpreting its syntax rules (standard semantics such as co-inductive/well-founded semantics can't be applied here without further qualification). Given the lack of such semantics, and that multiple (quite obvious) flaws were found in table content models on a cursory look already, and given mildly surprising results when using the reference validation software, further discussion of HTML5's table parsing rules seems hopeless, and isn't expected to contribute to a definition of an interoperable table content model.
Hence, while the HTML5 DTD behaves the same as the reference validation software, authors are advised to not rely on tag omission in table content beyond basic idiomatic usage as described.
datalist element
The datalist element's content definition has changed from previous releases. Its specification text now reads
The datalist element
Content: Either: phrasing content. Or: Zero or more option and script-supporting elements.
and the mapping into an element declaration is as follows:
<!ELEMENT datalist - - ((#PCDATA|%phrasing;)*|(option|script)*)>
Note that only the script element, rather than any script-supporting element, is supported. The script-supporting elements in HTML5.1 include the template element. However, the template element is also phrasing content. When using %scripting; (which includes both the script and the template element, and is used as parameter entity reference elsewhere in the HTML5.1 DTD), the grammar for datalist will become 1-ambiguous.
This means that upon encountering a template element in a datalist parent element, the parser cannot decide which of the two branches declared in the choice submodels of datalist's grammar rule is to be selected for subsequent parsing. This is not permitted in SGML, and either disallowed or undesirable in other markup languages as well.
Semantically, it doesn't make sense to use template elements in datalist child content, hence the allowance of template is considered accidental (or a consequence of HTML5's grammar presentation which doesn't facilitate basic automated grammar checks).
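The kind of basic automated grammar check alluded to here can be sketched in Python for the simple case of a choice between two repeated name groups: the choice is 1-ambiguous as soon as some token can start both branches. The sets below are assumptions made for illustration only. In particular, PHRASING is an abbreviated list that, following the discussion above, is assumed to contain template but not script, and first_symbol_conflicts is a hypothetical helper name.

```python
# Abbreviated stand-in for %phrasing; (assumed to contain template but not script).
PHRASING = {"a", "abbr", "em", "span", "template"}
# Stand-in for %scripting; as described in the text.
SCRIPTING = {"script", "template"}


def first_symbol_conflicts(branch_a, branch_b):
    """Return the tokens that can start both branches of a choice group, sorted."""
    return sorted(set(branch_a) & set(branch_b))


phrasing_branch = {"#PCDATA"} | PHRASING

# ((#PCDATA|%phrasing;)*|(option|%scripting;)*): template can start either branch.
print(first_symbol_conflicts(phrasing_branch, {"option"} | SCRIPTING))   # ['template']

# ((#PCDATA|%phrasing;)*|(option|script)*): no shared first token.
print(first_symbol_conflicts(phrasing_branch, {"option", "script"}))     # []
```

For branches of the form (x|y|...)*, the set of possible first tokens is simply the set of names in the group, which is why a plain set intersection suffices here; general content models would need a proper first-set computation.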
HTML5 lists the following as boolean attributes: reversed, ismap, typemustmatch, default, autoplay, muted, checked, readonly, required, multiple, disabled, selected, autofocus, novalidate, formnovalidate, hidden, lang, async, defer, and the truespeed attribute on the deprecated marquee element.
Note the paused attribute isn't a boolean attribute.
HTML5's boolean attributes are modelled as SGML attribute declarations having a singleton name group as declared attribute value, ie. an enumerated value where the name group contains only a single value.
For example, the selected (and disabled) attribute on HTML5's option element, according to the HTML5 specification, must be specified as eg. <option selected>, and the HTML5 DOM API is supposed to treat the selected attribute as either true or false. If a false value is desired, the selected attribute must be omitted in an attribute specification.
In SGML, this is modelled as
<!ATTLIST option selected (selected) #IMPLIED>
meaning the attribute name can be omitted.
According to the declaration, specifying
<option selected>
is equivalent to specifying
<option selected=selected>
or
<option selected="selected">.
Note that, formally, WebSGML (ISO 8879 Annex K) allows use of the same name token as enumerated value for multiple attribute declarations. In prior versions of SGML, the following wasn't valid:
<!ATTLIST x a (true|false) #IMPLIED>
<!ATTLIST y a (true|false) #IMPLIED>
because the name tokens true
and false
could only be
used in a single attribute declaration; one had to
declare:
<!ATTLIST (x|y) a (true|false) #IMPLIED>
At the same time, pre-Annex K SGML only allowed a single attribute list declaration for a given element.
WebSGML relaxes this constraint by allowing declared attributes to be assembled from multiple ATTLIST declarations for the same element(s), and enumerated attributes (token name groups) to contain the same token in different attribute declarations (including on the same element); note that if a token is declared on multiple elements, it cannot be used with omitted attribute name.
However, (Open)SP doesn't seem to implement Annex K in this respect, and will reject multiple ATTLIST declarations on the same element and also multiple declarations for the same name token.
While sgmljs.net SGML doesn't have this restriction, for interoperability, the HTML 5 DTD generator outputs the boolean attributes inline along with other attributes on an element.
contenteditable and spellcheck attributes
The contenteditable and spellcheck attributes are handled specially; these can have their values omitted in HTML5 but cannot be modelled in SGML like the "boolean" attributes, because they both use true/false as enumerated values, and thus can't be handled via SGML MINIMIZE ATTRIB OMITNAME (which requires that name tokens be unique among those declared for all attributes in a DTD, not just those declared on a given element, in order to make use of OMITNAME).
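Why uniqueness of name tokens matters for omitted attribute names can be illustrated with a small Python sketch. It is not how any particular SGML parser is implemented: resolve_omitted_name is a hypothetical helper, and the dictionary passed to it stands for whatever set of enumerated attribute declarations the parser considers when resolving a bare token.

```python
def resolve_omitted_name(token, enumerated):
    """Map a bare name token to (attribute name, value).

    `enumerated` maps attribute names to their declared name tokens.
    Raises ValueError if the token is undeclared or declared more than once.
    """
    owners = sorted(attr for attr, tokens in enumerated.items() if token in tokens)
    if len(owners) != 1:
        raise ValueError("cannot resolve token {!r}: candidates {}".format(token, owners))
    return owners[0], token


# Singleton name groups as used for boolean attributes: <option selected>
# resolves to selected="selected".
booleans = {"selected": {"selected"}, "disabled": {"disabled"}}
print(resolve_omitted_name("selected", booleans))  # ('selected', 'selected')

# true/false shared by two attributes: a bare token is ambiguous.
shared = {"contenteditable": {"true", "false"}, "spellcheck": {"true", "false"}}
try:
    resolve_omitted_name("true", shared)
except ValueError as err:
    print(err)
```

The singleton name group trick works precisely because each boolean attribute contributes a token that no other attribute uses.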
The HTML5 specification lists area, base, br, col, embed, hr, img, input, keygen, link, meta, param, source, track, and wbr as void elements.
Note that the HTML5 specification suggests that the (legacy) elements basefont, bgsound, and frame should also be treated as void, but these don't have declared content EMPTY in this HTML 5 DTD.
HTML5's void elements happen to coincide with those labelled as having the Empty content model in the section on individual elements in the specification text.
Void elements are expected to neither have child content nor an end-element tag, and are adequately modelled as SGML elements with declared content EMPTY.
In the HTML5 specification text, void elements are described as having "No end tag" (in addition to having "Nothing" as content); however, an element declared EMPTY in SGML usually isn't qualified with an end-tag omission indicator, since having declared content EMPTY isn't considered a tag minimization feature in SGML.
"Self-closing" elements are uses of HTML void elements
which have a slash before the > (U+003E GREATER THAN SIGN) character (ie. the STAGC delimiter in SGML) in the start-element tag, such that void elements appear as XML-style empty elements.
For example, in:
<link href="..." rel="stylesheet" type="text/css" />
the / (U+002F SOLIDUS) character is bogus.
This syntax was used in the past to make HTML processable using XML parsers, and its use is generally discouraged.
While tolerated (ignored) by HTML5 on void elements, in the HTML5 SGML DTD, self-closing elements are subject to the EMPTYNRM YES and other settings in the SGML declaration for HTML5 which synthesize HTML's parsing rules in this respect.
The child content of the style element is modelled as SGML CDATA declared content, meaning that any markup delimiters are ignored (up to the sequence of terminating characters as discussed in Script data).
Note that legacy HTML also might assume CDATA semantics with xmp, noembed, and noframes content.
The child content of the textarea element is modelled as RCDATA declared content, which behaves the same as CDATA, except that delimiters for entity references (SGML's ERO and REFC delimiters, ie. the U+0026 AMPERSAND and the U+003B SEMICOLON characters in the SGML reference concrete syntax, respectively) are recognized, and entity references formed by those are substituted by replacement text.
Note sgmljs.net SGML substitutes only internal parsed entities in an RCDATA context.
Departing from earlier DTDs for HTML, which declared the content of the script element as CDATA, this DTD for HTML5.1 models child content of the script element with a (#PCDATA) content model for security reasons.
HTML's script element and its historic use have been known to be a problem since at least as early as 1996 (cf. Joe English's posts), and keep being problematic today.
According to the HTML5.1 specification, after transitioning from the script data less-than sign state (which is the state reached in script element child data after having encountered an unescaped < (U+003C LESS-THAN SIGN) character), / (U+002F SOLIDUS) transitions the parsing state to the script data end tag open state, which in turn transitions to the script data end tag name state on any ASCII letter.
In the script data end tag name state, an HTML5 parser is supposed to check the longest sequence of ASCII letters for a case-insensitive match of script (and finish the script element if this is the case).
For SGML, on the other hand, expected behaviour is to end CDATA or RCDATA on a "delimiter-in-context", ie. < (U+003C LESS-THAN SIGN), followed by / (U+002F SOLIDUS), followed by a name start character (ignoring other irrelevant delimiter-in-context cases here), irrespective of whether the generic identifier started by the name start character is actually script (or, more generally, the same that started CDATA/RCDATA), whereas HTML5 is supposed to treat character data looking like an end-element tag but not actually representing a </script> tag as part of the character content of script.
For example, the following HTML fragment
<script>
document.innerHTML =
"<html><head><title>Oops</title><body>Pwnd</body></html>"
</script>
will be parsed by HTML as a single script element with child content, but, provided the script element has been declared
<!ELEMENT script CDATA>
will be parsed by SGML as
the <script> start-element tag,
the text document.innerHTML = "<html><head><title>Oops,
the </title> end-element tag,
the <body> start-element tag,
the text Pwnd,
the </body> end-element tag,
the </html> end-element tag, and
the </script> end-element tag.
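The difference in where script content ends can be reproduced with two regular expressions in Python. This is a simplified sketch rather than either parser's actual algorithm: HTML5_END approximates HTML5's "appropriate end tag" check (it ignores the escaped script data states), SGML_END approximates the ETAGO delimiter-in-context rule using ASCII letters as name start characters, and the names used here are made up for the example.

```python
import re

# HTML5 (approximation): "</script" case-insensitively, followed by whitespace, "/" or ">".
HTML5_END = re.compile(r"</script[\t\n\f\r />]", re.IGNORECASE)
# SGML CDATA (approximation): "</" followed by a name start character.
SGML_END = re.compile(r"</[A-Za-z]")

# Character data following the <script> start tag in the fragment above.
CONTENT = (
    'document.innerHTML =\n'
    '"<html><head><title>Oops</title><body>Pwnd</body></html>"\n'
    '</script>'
)


def content_end(text, pattern):
    """Return the offset at which element content ends, or None if it never does."""
    match = pattern.search(text)
    return match.start() if match else None


html5_end = content_end(CONTENT, HTML5_END)
sgml_end = content_end(CONTENT, SGML_END)

print(repr(CONTENT[:html5_end]))  # everything up to the real </script> tag
print(repr(CONTENT[:sgml_end]))   # stops right after "Oops"
```

Under these assumptions, the SGML rule ends the character data in front of </title>, which matches the event sequence listed above.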
While a DTD using either CDATA or (#PCDATA) for the script element will reject this particular sequence of markup events (because script can't have its end-tag omitted), in general, this behaviour is undesired since it could be used to mount script injection attacks.
As explained in the HTML 5 specification (Restrictions for contents of script element), there's an additional, rather unnecessary, twist in HTML5's dealing with script data, in that what looks like starting an SGML comment (ie. the character sequence <!--) within script data will make the parser enter the script data escaped dash dash state, which is only exited on a subsequent --> character sequence, potentially parsing well beyond what looks like the regular end of the script element.
That is, SGML's <!-- and --> character sequences are recognized as JavaScript comment start and end sequences, respectively; presumably, this was an (ill-conceived) attempt in early JavaScript revisions to present uniform commenting syntax across HTML and JavaScript.
It is problematic since it is completely invisible to SGML. Needless to say, this style of script comments is an avoidable XSS attack vector in web pages.
SGML has never recognized comments in CDATA or RCDATA at all, hence this cannot be handled by SGML other than by using a regular (#PCDATA) content model.
Treating script content as (#PCDATA) can be inconvenient, since verbatim occurrences of the < (U+003C LESS-THAN SIGN) character might have to be specified using the &lt; entity, or all or parts of the child content might have to be put into CDATA or RCDATA marked sections.
If this turns out to be a problem, the declaration for script can easily be changed to CDATA to re-establish former behaviour.
For maximum security in applications handling user-provided content (eg. assumed to potentially contain malicious script), it is recommended that, in addition,
admissibility of the script element should be controlled in a custom DTD using exclusion exceptions
blocking control flow transfer to JavaScript injected into event handler attributes can be implemented for content descending from a given parent element by either
using #FIXED event handler attribute values, or
Driving the latter technique from DTDs alone has limitations since DTD declarations apply globally, but can be performed adequately when using context-dependent LINK processing and/or SGML templating.
Using these techniques allows more granular control over where script is allowed in HTML content compared to Content Security Policy (CSP) (which however isn't a finalized recommendation at this point). For example, CSP's policy of ignoring/not executing script elements in content can be expressed by placing an SGML exclusion exception on body, but could also be applied at a more granular level in arbitrary body child element(s), which isn't possible with CSP.
Arguably, by disabling event handler attributes in HTML altogether, Content Security Policy questions basic assumptions about locality and modularity of event handling in the JavaScript/HTML authoring model in such a way as to make it almost pointless.
It is particularly unrealistic/ineffective for current practices with syndicated and/or ad-driven web sites, including those that offload eg. discussion forums to third-party services (which happen to be the primary input vectors for user-contributed content).
From Chrome's Content Security Policy (CSP) page:
[Blocking inline script] does, however, require you to write your code with a clean separation between content and behavior (which you should of course do anyway, right?)
To which it must be answered that "separation of concerns" is certainly not a prime characteristic of the content-driven web next to composability of web content.
SVG and MathML DTDs are conditionally included via the svg_conditional_inclusion and mathml_conditional_inclusion parameter entities.
The DTDs included will be accessed from the canonical system identifier URLs of their most recently published DTDs.
In the absence of an INCLUDE value for these parameter entities, the svg and math elements will be declared ANY.
Note that for inclusion of foreign XML vocabularies, EMPTYNRM YES should be specified in the SGML declaration to cater for XML-style empty elements (which are used extensively even in basic SVG documents).
This HTML5.1 DTD doesn't declare attribute defaults. Instead, it always declares #IMPLIED as default value.
Generally speaking, making subtle distinctions with respect to whether attribute (and other) defaults are specified with their default values in content explicitly as opposed to left unspecified is considered a bad practice, since whether an attribute is specified or implied isn't adequately represented in eg. DOM and similar APIs lacking attribute defaults and other type-related metadata. About the only applications in need of access to this kind of information are HTML authoring and developer tools.
However, the HTML5.1 specification recommends (specifically, for the ARIA attributes) not specifying their default values explicitly (ie. unless their actual value differs from the default).
For a similar reason, the preferred representation for HTML5 named character references is as predefined character entities, rather than entity sets.
This section contains clarifications to the interpretation of certain attribute-related specification text passages. As it turns out, no additional grammar constraints are derived.
CONREF attribute semantics
In SGML, #CONREF attributes are used to control that elements should be treated like EMPTY elements on a case-by-case basis if the respective #CONREF attribute is specified.
As a mechanism for conditionally void elements, the HTML5.1 specification could be interpreted to mean that eg. the span attribute of the colgroup element should behave as #CONREF attribute, given the specification's wording:
(Content model of colgroup) If the span attribute is present: Nothing
in combination with the fact that having "Nothing" as content is de facto used synonymously with being a void element throughout the specification text.
However, consistent with the HTML4 DTD, #CONREF attributes are not used in this HTML5 DTD.
There are a number of additional cases where #CONREF could be applied, and also some irregular cases where #CONREF cannot express HTML5's desired behaviour. For example, what the specification text says can be interpreted to mean that the src attribute on script elements should be treated as a #CONREF attribute. But script must always be terminated using </script> tags explicitly, even if a src attribute is specified; this is also enforced by this HTML 5 DTD.
CURRENT attribute semantics
According to the specification text, the title attribute, if it is omitted from an element
then it implies that the title attribute of the nearest ancestor HTML element with a title attribute set is also relevant to this element
This is considered an HTML semantic rule, not a syntactic one, and isn't represented in the HTML5 DTD, in line with browser DOM APIs not handling title differently from other regular attributes.
In SGML, "inheriting" title and other attributes could roughly be modelled using #CURRENT default semantics (even though an unspecified #CURRENT attribute takes its value from any preceding use of that attribute in document order, rather than just from ancestor elements).
HTML5 reserves attributes starting with data- as private use attributes (meaning that those won't ever be used by any HTML attribute and will be preserved in a constructed DOM in web browsers).
Their special naming requirements cannot and need not be represented in the HTML5.1 DTD as such, since custom attributes can be declared in the internal subset or another declaration set.
If full validation is desired, data- attributes must be declared.
If full attribute validation isn't desired (IMPLYDEF ATTLIST YES is specified in the SGML declaration or otherwise), such as when only the permissive HTML DTD is used, a custom attribute doesn't have to be declared.
If a custom data attribute is declared, it should either be declared as having a CDATA value and have its value always quoted, or should otherwise be declared appropriately with respect to how it's used in content (ie. with respect to omitting quotation characters).
HTML5 makes the restriction that XML naming rules for custom data attributes apply: the name must not contain the : (U+003A COLON) character, and must otherwise represent a valid XML name, and must not begin with xml (case-insensitively). These rules aren't enforced by SGML.
Like SVG and MathML, ARIA attributes can optionally be
included via the if_aria
parameter entity.
Note the HTML5 DTD doesn't include element-specific
declarations for ARIA attributes (ie. restrictions of
the permitted role
attribute values for individual
elements).
Note also, like for other attributes, no attribute defaults are implied for the ARIA attributes (which is recommended practice by the ARIA and the HTML specifications).
Arguably, the most characteristic element of HTML is the anchor (a
)
element for hyperlinking. Apart from hyperlinks, the core HTML
content models are merely variants of paragraph, flow, table, and
other content models that were already in wide use
for marking up documents for printing in the pre-WWW era.
In HTML5, the content model of the a
element (and that
of map
, ins
, and del
) is specified as transparent, which
means that a
"inherits" (for lack of a better word) its parent
content model: permitted child content is determined by the parent element's
permitted child content (which can inherit its permitted content
from its parent element, in turn, and so on).
HTML5's transparent content model concept is an artifact
of adding the ability to annotate any piece of flow content
as hyperlink, rather than just phrasing content as in previous
HTML specifications. In practice, it is commonly used when eg. an
image or icon and some accompanying text (and possibly some
background) is hyperlinked to a common target
using a single a
element, rather than having to wrap
the image, the text, and boilerplate content into a
elements
individually. Note an extension for using href
and other
attributes of anchors on any HTML element (with the expectation
that this makes those elements behave like hyperlinks,
thereby effectively making the anchor element redundant), was
proposed
already around 2008. Arguably, had this been further pursued,
the HTML vocabulary could have been made much simpler and more
orthogonal, but it was rejected at the time on the grounds that
browser vendors already had implemented the ability to place
anchor tags around most HTML elements.
As applied in the HTML specification, transparent content just
means that eg. an a
element accepts either just phrasing or
also flow content as child content, depending on whether it
appears in a phrasing or flow context, respectively
(and similarly with map
, ins
, and del
). The concept
of transparent content, however, doesn't extend to
arbitrary elements, because it trivially conflicts with
the content model descriptions of those elements into
which elements having transparent content can be placed.
Specifically, for any element which accepts an element
having transparent content, it needs to be stated whether,
and how, elements wrapped into transparent content-allowing
children should contribute to the parent's content model.
Another formulation for the constraints imposed by the a
element's
transparent content restriction is that an HTML document can be validated
by removing all a
start-element and end-element tags (but keeping
child content of a
elements), and validating the result document
against a tight HTML5 grammar lacking an a
element. In sgmljs.net
SGML, this notion can be directly expressed using SGML LINK
, ie. by
declaring an explicit link process projecting a permissive variant of HTML
as source markup into a restrictive HTML variant as result markup,
and by declaring a link rule that maps all source elements to the
same-named target elements, respectively, except for a
and other
HTML elements with transparent content.
From a practical point of view, though, to facilitate HTML validation
using mainstream SGML parsers (which don't support SGML LINK
and/or
don't perform validation and tag inference on result markup events
of link processes), it might be desirable to express the effective
content model restrictions imposed by transparent content
using DTD declarations.
Fortunately, it can be easily shown that HTML's a
element
(and also HTML's other elements having transparent content)
behave in a tame and modular way that doesn't interact with
the content model into which an a
element is placed:
Since a
is a member of the flow and phrasing element categories,
and the content model declarations of HTML only ever use a
elements
as part of flow or phrasing content, rather than as a
element in
isolation, and the flow content and phrasing content productions
are interpreted as "any sequence of the respective elements, or the
empty sequence", a
can only be used as an optional content token,
hence can't be put into a content position in such a way that it
changes the interpretation of content model tokens with respect to
validation and tag inference.
Since the flow and phrasing categories (with the exception of the
p
element which is covered below) only contain elements which have either
void content (ie. have declared content EMPTY
in SGML parlance),
or don't admit tag omission, flow and phrasing content (up to
p
elements) is always fully-tagged markup without omitted tags.
Hence, markup wrapped into a
elements can't alter the interpretation
of neighbouring content (and the effect of omitting a p
end-element
tag within an a
element, even if it were allowed, can't interact
with neighbouring content either, since the a
element doesn't
admit tag omission).
The p
element is the only element in flow
content that
admits end-tag omission, and hence could be seen to interact with
the placement of an a
element in a non-modular fashion. The HTML5
specification addresses this problem specifically in the content
model description for the p
element (by eg. disallowing p
end-tag omission in child content of a
).
The "transparent content" constraint is inherent in the fundamental construction of the HTML vocabulary as phrasing (inline) content wrapped in flow (block-level) content, and already sufficiently represented in this HTML5 DTD via exclusion exceptions as discussed above.
While HTML5.1 can be transferred in any desired transport encoding, the HTML5.1 document character set (and that of prior versions of HTML) is tied to ISO 10646 ("Unicode") in that numeric character entity references are always interpreted as UCS code points, irrespective of the transport encoding; in practice, HTML is most often stored and transferred using the UTF-8 encoding.
The HTML5 specification itself normatively references
http://www.unicode.org/versions/
rather than a particular
version; in practical use on the web, the latest
Unicode version is implicitly assumed, but only
characters supported by targeted browsers and fonts are
actually used.
Among the mapped-to character entity references, HTML5's
Named character references (covered below in depth)
include variation sequences and eg. the U+205F MEDIUM MATHEMATICAL SPACE
and U+22DA LESS-THAN EQUAL TO OR GREATER-THAN
code points.
These features/code points were introduced with Unicode 3.2, corresponding to ISO/IEC 10646-2:2001 with amendments; practically, Unicode 4.0 (corresponding to ISO/IEC 10646:2003) should be considered the minimal UCS version suitable for HTML5.
SGML uses ISO/IEC 2022 code switching and the International register of coded character sets to be used with escape sequences as the principal means to inform the parser about the character set coding of a document.
ISO/IEC 2022 identifies the UTF-8 coding system
using the ESC % G
(or ESC 2/5 4/7
) designating sequence
(or alternatively, ESC 2/5 4/7
, ESC 2/5 4/8
, or
ESC 2/5 4/9
for the respective
UCS implementation levels).
Therefore, the preferred SGML document character set for HTML5 is
ISO Registration Number 177//CHARSET
ISO/IEC 10646:2003 UTF-8 Level 3//ESC 2/5 4/9
(or a variant for a newer UCS version, for identifying implementation levels, or both).
However, (Open)SP SGML doesn't accept this designating sequence
for UTF-8. According to
OpenSP - Character sets
to use UTF-8 with (Open)SP, it's necessary to tell the parser to use
a bit combination transformation format by setting the SP_BCTF
environment variable.
bctf
stands for bit combination transformation format
and is a concept introduced with HyTime 2nd. Ed. General Facilities - FSIDR,
building on ISO 8879's clause E 3.1 - Code Extension Facilities.
Thus, for interoperability, the following document character set is used in the SGML declaration for HTML5.1:
ISO Registration Number 177//CHARSET
ISO/IEC 10646:2003 UCS with implementation Level 3//ESC 2/5 2/15 4/6
Note SGML always interprets numeric character entity references as character numbers, ie. as single UCS code points independently of the document encoding (such as UTF-8 etc.) being used.
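The distinction between character numbers and their encoded form can be illustrated with a minimal Python sketch. It uses the standard library's `html.unescape` as a stand-in for an HTML/SGML parser resolving a numeric character reference; the chosen code point U+1F600 is only an example and doesn't come from the DTD.

```python
import html

# "&#x1F600;" names exactly one character number (UCS code point U+1F600),
# no matter which transport encoding the document is stored in.
text = html.unescape("&#x1F600;")

code_points = [ord(ch) for ch in text]            # one code point
utf8_bytes = text.encode("utf-8")                 # four bytes in UTF-8
utf16_units = len(text.encode("utf-16-le")) // 2  # two code units in UTF-16

print([f"U+{cp:04X}" for cp in code_points], len(utf8_bytes), utf16_units)
```

The reference resolves to a single code point, while the number of bytes or code units used to store it depends only on the encoding.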
For an overview of ISO 2022, please refer to ECMA-35 Character Code Structure and Extension Techniques, which is identical to ISO/IEC 2022:1994 and made available by ECMA International for public access.
Quoting from an SGML syntax reference:
The SGML declaration admits case folding/canonicalization to be switched on individually for two groups of name tokens:
for entities (
SYNTAX NAMECASE ENTITY YES/NO
) and for all other name token uses (
SYNTAX NAMECASE GENERAL YES/NO
), but not for more granular subsets of the other name tokens.
In particular, name casing rules cannot be applied to
ID
values and IDREF
/IDREFS
value reference
attributes in isolation, as would be required for HTML5 and
HTML5.1, which don't apply case normalization to these values,
but to other name tokens.
This is a
long standing
problem, because attributes modelled as ID
values and
value references (and hence subject to case-folding)
are also prominently used in eg. href
values containing
fragment identifiers/URLs, where they are always interpreted
case-sensitively and don't have IDREF
declared value semantics.
Modelling ID
values and value references as ID
and IDREF
/IDREFS
,
respectively, thus requires additional attention in using
ID
values, value references, and fragment identifiers.
If the HTML5.1 DTD is used for post-processing content for web delivery with a generic SGML processor such as (Open)SP, the choice of whether to apply case-folding to ID values and value references also affects CSS selectors and selectors used in JavaScript code for DOM manipulation, and the issue cannot be solved by merely changing namecase settings on an individual document basis, since external links with fragment identifiers are affected as well.
For an SGML declaration to use with HTML5.1, then, a decision
has to be made with respect to the possible choices for
NAMECASE GENERAL
, with the trade-offs as described next.
The HTML5.1 DTD can be used with both variants. Of course,
a third option would be to declare ID values and value
references as generic CDATA
attributes, which don't get
case-folding applied; this workaround, however, isn't pursued
further for this HTML5.1 DTD.
Note that entity names in HTML are always handled case-sensitively
(NAMECASE ENTITY NO
).
NAMECASE GENERAL YES
bogusly renames (case-folds) ID values and value references,
but does so in a locally consistent way, ie. any ID
reference value
points to the same referenced element after renaming, except in the case that
two or more ID
values are used which differ only in their chosen
namecasing (ie. map to the same canonicalized name after case-folding
when they didn't before); in this respect, the mapping/case-folding isn't
injective, but this isn't usually considered a problem
as a workaround to the discussed fragment identifier issue, it should
be ensured out-of-band that any ID value and value reference,
and any fragment identifier (in href
and other attributes) always
uses the lowercase variant
NAMECASE GENERAL NO
could be seen as a problem for HTML's usage of foreign
elements/vocabularies, in that HTML wants to apply
case-folding even to foreign elements; but note that
browsers apply case-folding in interpreting CSS selectors,
DOM API calls and CSS-like selectors in DOM API methods
such as querySelectorAll()
anyway, so this doesn't
add more complication to this situation
(case-folding primarily poses a problem for SVG elements
such as linearGradient
)
since this leaves ID values and value references as-is,
the (rather severe) ID
case-folding issue doesn't apply
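The collision case described for NAMECASE GENERAL YES can be checked out-of-band before processing. The following is a minimal sketch, assuming ID values have already been collected from a document by some other means; the function names are hypothetical, and folding to lowercase is chosen here only to match the recommended lowercase convention (the direction of folding doesn't affect which values collide).

```python
from collections import defaultdict

def ascii_fold(name: str) -> str:
    """Fold only the IRV (US-ASCII) letters A-Z to a-z; leave everything else as-is."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in name)

def folding_collisions(ids):
    """Return groups of distinct ID values that become identical after folding."""
    groups = defaultdict(set)
    for value in ids:
        groups[ascii_fold(value)].add(value)
    return {folded: sorted(vals) for folded, vals in groups.items() if len(vals) > 1}

# Hypothetical ID values as they might appear in a document
ids = ["Intro", "intro", "TOC", "Éclair", "éclair"]
collisions = folding_collisions(ids)
print(collisions)
```

Only `Intro`/`intro` collide in this example, since non-ASCII letters such as `É` aren't subject to folding.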
If NAMECASE GENERAL YES
is used, HTML5.1's rules for
Converting a string to uppercase
(and similar definitions for case-sensitive matching, which are
used in numerous further places throughout the specification),
mandate that only the alphabetic characters in IRV (US-ASCII)
are subject to case folding/canonicalization; hence HTML5.1's
basic naming rules can be precisely expressed in SGML (in fact,
HTML's restricted case folding can be seen as a consequence of
SGML's historic limitations in this respect).
In HTML5.1, permitted characters for constructs equivalent
or analogous to those mentioned above where SGML is using
name tokens are defined on a case-by-case basis. For this
analysis, HTML's element and attribute names and ID
and class
values are considered, and the result is applied to all
other name tokens as well.
For element names, according to 8.1.2 Elements, the permitted characters are defined by the specification itself as the alphanumeric IRV (US-ASCII) characters:
HTML elements all have names that only use alphanumeric ASCII characters.
While decimal digits aren't actually used in any element defined in the specification (so could be left out from being allowed), this definition obviously isn't meant to constrain the elements admitted to occur in HTML in general, but is only a statement about the elements defined by the specification itself.
The definition of which characters an element name may contain must change when additional vocabularies are used with HTML, and is also challenged by WHATWG's custom elements specification, which, while not enjoying universal browser support, is still expected to become part of a future revision of W3C's HTML specification.
While the custom elements specification isn't included in W3C HTML5.1, it gives rise to an informed decision for a definition of a set of admitted characters in element names, which are given in the following EBNF productions:
PotentialCustomElementName ::=
[a-z] (PCENChar)* '-' (PCENChar)*
PCENChar ::=
"-" | "." | [0-9] | "_" | [a-z] | #xB7 | [#xC0-#xD6] |
[#xD8-#xF6] | [#xF8-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] |
[#x203F-#x2040] | [#x2070-#x218F] | [#x2C00-#x2FEF] |
[#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
[#x10000-#xEFFFF]
These EBNF productions are similar to those used by the XML specification Version 1.0 Fifth ed., with the following modifications and additional constraints:
as opposed to XML, also the U+00B7 MIDDLE DOT
character is admitted
custom element names must contain the -
(U+002D HYPHEN-MINUS
) character
custom element names must be different from those SVG and MathML
element names containing -
(U+002D HYPHEN-MINUS
) as identified
in https://html.spec.whatwg.org/#valid-custom-element-name
,
ie. must be different from annotation-xml
, color-profile
, font-face
,
font-face-src
, font-face-uri
, font-face-format
, font-face-name
,
and missing-glyph
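The productions and the reserved-name constraint above can be sketched as a small Python check. This is only an illustration under the assumption that the EBNF and the list of reserved names are as quoted here; the function and variable names are hypothetical.

```python
# PCENChar ranges from the EBNF productions (inclusive code point ranges)
PCEN_RANGES = [
    (0x2D, 0x2D), (0x2E, 0x2E), (0x30, 0x39), (0x5F, 0x5F), (0x61, 0x7A),
    (0xB7, 0xB7), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x37D), (0x37F, 0x1FFF),
    (0x200C, 0x200D), (0x203F, 0x2040), (0x2070, 0x218F), (0x2C00, 0x2FEF),
    (0x3001, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]

# SVG and MathML element names containing a hyphen, which are excluded
RESERVED_NAMES = {
    "annotation-xml", "color-profile", "font-face", "font-face-src",
    "font-face-uri", "font-face-format", "font-face-name", "missing-glyph",
}

def is_pcen_char(ch: str) -> bool:
    """True if ch falls into one of the PCENChar ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PCEN_RANGES)

def is_valid_custom_element_name(name: str) -> bool:
    """Check [a-z] (PCENChar)* '-' (PCENChar)* plus the reserved-name exclusion."""
    if not name or not ("a" <= name[0] <= "z"):
        return False
    rest = name[1:]
    return ("-" in rest
            and all(is_pcen_char(ch) for ch in rest)
            and name not in RESERVED_NAMES)

for candidate in ["my-element", "my-élément", "font-face", "My-Element", "element"]:
    print(candidate, is_valid_custom_element_name(candidate))
```

Since the first character must be a lowercase letter and every following character must be a PCENChar, requiring a hyphen anywhere after the first character is equivalent to the production as written.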
The requirement that custom elements must not contain uppercase
letters, and must begin with a lowercase letter seems odd,
considering that HTML wants to apply case-folding to an element
anyway, even to foreign elements. Its existence points to
an implementation detail in HTML parsers, eg. that scanning
for end-element tags in script data and other CDATA
-like
contexts is performed by string searching where parts of the
search string are formed from the lowercase letters of the element
name to search for. This is similar to what SGML parsers do
for determining the delimiter-in-context terminating the
content of elements with declared content CDATA
.
It can't be captured per se by an SGML declaration, which
always admits both the lowercase and uppercase letters; but it's
also not necessary to do so, since it's enforced by there simply not
being an element declaration in the restrictive HTML5.1 DTD,
when NAMECASE GENERAL NO
is used.
Moreover, it's not possible to disallow the digit characters as name characters in SGML (but they're never allowed as name start characters).
The constraint with respect to conflicts with SVG and MathML elements is adequately represented in the restrictive HTML5.1 DTD and by the fact that declarations for these elements, when present, will prevent re-declaration attempts/name clashes.
An SGML declaration can't express the constraint related to presence
of at least one -
(U+002D HYPHEN-MINUS
) character but it's
also not necessary to do so since custom elements must (or should,
in the case of the permissive DTD) be declared eg. in the internal
subset or other custom declaration set. What can be achieved here is
to allow -
(U+002D HYPHEN-MINUS
) as name character but not
as name start character.
All said, the relevant portion of the SGML declaration for capturing HTML's naming rules as expressed by the custom elements specification looks as follows (where all characters and character ranges are notated using decimal character numbers, and the SGML declaration makes use of the extended naming rules):
NAMING
LCNMSTRT ""
UCNMSTRT ""
NAMESTRT
46 -- FULL STOP --
95 -- LOW LINE --
183 -- MIDDLE DOT --
192-214 -- #xC0-#xD6 --
216-246 -- #xD8-#xF6 --
248-893 -- #xF8-#x37D --
895-8191 -- #x37F-#x1FFF --
8204-8205 -- #x200C-#x200D --
8255-8256 -- #x203F-#x2040 --
8304-8591 -- #x2070-#x218F --
11264-12271 -- #x2C00-#x2FEF --
12289-55295 -- #x3001-#xD7FF --
63744-64975 -- #xF900-#xFDCF --
65008-65533 -- #xFDF0-#xFFFD --
65536-983039 -- #x10000-#xEFFFF --
LCNMCHAR ""
UCNMCHAR ""
NAMECHAR
45 -- HYPHEN-MINUS --
Note this definition admits the U+002D HYPHEN-MINUS
character
only as NAMECHAR
, ie. as the second or subsequent character of
name tokens, but not as their first character, as required by HTML's
naming rules.
Note also this declaration can be used with either
NAMECASE GENERAL YES
or NAMECASE GENERAL NO
.
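Since the EBNF productions use hexadecimal code points while the SGML declaration uses decimal character numbers, the conversion is easy to get wrong by hand. The following minimal sketch derives the decimal ranges mechanically; the list of ranges is copied from the PCENChar production (non-ASCII part only), and the helper names are hypothetical.

```python
# Non-ASCII PCENChar ranges from the EBNF production, as hex strings
RANGES_HEX = [
    "B7", "C0-D6", "D8-F6", "F8-37D", "37F-1FFF", "200C-200D", "203F-2040",
    "2070-218F", "2C00-2FEF", "3001-D7FF", "F900-FDCF", "FDF0-FFFD",
    "10000-EFFFF",
]

def to_decimal_range(spec: str) -> str:
    """Convert 'LO-HI' (or a single value) from hex to decimal notation."""
    return "-".join(str(int(part, 16)) for part in spec.split("-"))

def naming_lines(ranges):
    """Format each range as a decimal range followed by an SGML comment."""
    lines = []
    for spec in ranges:
        comment = "-".join("#x" + part for part in spec.split("-"))
        lines.append(f"{to_decimal_range(spec)} -- {comment} --")
    return lines

print("\n".join(naming_lines(RANGES_HEX)))
```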
HTML extends case-folding to foreign elements from SVG and MathML as well:
In the HTML syntax, tag names, even those for foreign elements, may be written with any mix of lower- and uppercase letters that, when converted to all-lowercase, matches the element's tag name; tag names are case-insensitive.
As far as this HTML5.1 DTD is concerned, this doesn't pose additional problems not already addressed.
To avoid awkward problems when transferring and roundtripping foreign content (which in many cases will be created using XML-based tools), however, web authors are advised to nevertheless use foreign element names in their native, case-aware form in HTML content, as well as in CSS and JavaScript DOM API selectors.
The HTML5.1 specification makes the following restrictions with respect to custom data attributes:
the attribute name must follow XML naming rules (specifically,
the name must not contain the :
(U+003A COLON) character,
must otherwise represent a valid XML name, and must
not begin with xml
(case-insensitively))
the attribute name must begin with data-
and have at least one character
after the hyphen character.
The XML (Fifth Ed.) naming requirement is honored up to the differences already discussed above. Note that according to this definition, custom data attribute names must not contain eg. the MIDDLE DOT character, whereas custom element names are allowed to contain it, but this isn't represented in the SGML declaration (and is considered accidental).
The other requirements can be honored when using the Restrictive DTD eg. by declaring custom data attributes in the internal subset.
W3C's HTML5.1 specification says with respect to the ID attribute:
The value must be unique amongst all the IDs in the element's home subtree and must contain at least one character. The value must not contain any space characters.
and
There are no other restrictions on what form an ID can take; in particular, IDs can consist of just digits, start with a digit, start with an underscore, consist of just punctuation, etc.
According to this definition in isolation, the following examples are valid uses of ID values (however, might be constrained by other rules for eg. parsing attributes and by semantic rules):
<div id='>
<div id=">
<div id==>
<div id=?>
<div id=" ">
<div id=1>
<div id=#>
<div id=top>
<div id=<>
<div id=&>
<div id=/>
<div id=:(>
From these examples it's clear that HTML's choice of allowable ID values is unlikely to help with interoperability within a larger markup context.
While the following characters aren't formally invalid, since ID values are most commonly used
as href
targets in anchors (using fragment identifiers
rather than ID value references)
for practical considerations (ie. avoiding having to use entity references), attribute values should generally not contain the following characters:
Moreover, though not invalid, ID values can be expected
to rarely contain the #
(U+0023 NUMBER SIGN) character,
since it's used to separate a fragment identifier from its
preceding part in a URL.
Since it's then necessary to restrict the set of expected characters in name tokens into something more representative of actual HTML usage, in order to avoid arbitrary choices, usage constraints mandated by CSS selector and URL syntax as discussed below are natural candidates to base an informed choice with respect to naming on. As it turns out, however, the rules for name tokens in HTML4 already cover the set of reasonably usable characters pretty well, whereas HTML5/HTML5.1's liberal rules are easily seen as less recommendable.
As the least arbitrary/surprising thing to do, it seems reasonable to resort to HTML4's naming rules which are also generally still recommended for new content by other resources. For example, the classic restriction on HTML ID values and class names, as eg. stated on MDN's article on the id attribute is:
Note: Using characters except ASCII letters and digits, '_', '-', and '.' may cause compatibility problems [...]
Moreover, ID values starting with a digit should be avoided when it's desired to target these using CSS selectors.
This is because CSS selectors can target these values in ID and class selectors only with escaping, which wasn't supported in a portable way in browsers using earlier CSS revisions, or not at all.
From Selectors Level 2.1:
In CSS, identifiers (including element names, classes, and IDs in selectors) can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (
-
) and the underscore (_
); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code [...]. For instance, the identifier "B&W?"
may be written as "B\&W\?"
or "B\26 W\3F"
.
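The escaping rules quoted above can be sketched in Python as follows. This is a simplified illustration rather than an implementation of any CSS specification algorithm: the function name is hypothetical, and identifiers starting with two hyphens are passed through without special treatment.

```python
def css_escape_ident(value: str) -> str:
    """Escape an ID or class value for use in a CSS ID/class selector (sketch)."""
    out = []
    for i, ch in enumerate(value):
        cp = ord(ch)
        is_digit = "0" <= ch <= "9"
        # A digit may not start an identifier, nor directly follow a leading hyphen
        leading_digit = is_digit and (i == 0 or (i == 1 and value[0] == "-"))
        if leading_digit or cp < 0x21 or 0x7F <= cp < 0xA0:
            # Numeric escape; the trailing space terminates the hex digits
            out.append(f"\\{cp:x} ")
        elif ch.isascii() and (ch.isalnum() or ch in "_-"):
            out.append(ch)          # allowed as-is
        elif cp >= 0xA0:
            out.append(ch)          # U+00A0 and higher are allowed as-is
        else:
            out.append("\\" + ch)   # remaining ASCII punctuation: backslash escape
    return "".join(out)

print(css_escape_ident("B&W?"))   # B\&W\?
print(css_escape_ident("1st"))    # \31 st
```

For example, an element with `id=1st` would have to be targeted as `#\31 st` in a selector, which illustrates why ID values starting with a digit are best avoided.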
According to HTML 5.1: 2. Common infrastructure, HTML's space characters are the U+0020 SPACE, U+0009 CHARACTER TABULATION, U+000A LINE FEED, U+000C FORM FEED, and U+000D CARRIAGE RETURN characters (rather than the larger set of Unicode spaces).
By these definitions, we can leave the naming parameters precisely as above. Note it isn't surprising that we recover HTML4's rules here, but it is mildly surprising still that for the definition of custom elements rather conservative (compared to the rules for ID values) rules appealing to restrictions of SGML name tokens were used.
These constraints still hold in current CSS specifications as shown next.
From Selectors Level 3 (which is also identical to the relevant part of Cascading Style Sheets Level 2 (CSS 2.2) Specification Working Draft):
ident [-]?{nmstart}{nmchar}*
name {nmchar}+
nmstart [_a-z]|{nonascii}|{escape}
nonascii [^\0-\177]
unicode \\[0-9a-f]{1,6}(\r\n|[ \n\r\t\f])?
escape {unicode}|\\[^\n\r\f0-9a-f]
nmchar [_a-z0-9-]|{nonascii}|{escape}
This definition is still identical to that used in the older CSS specification (see above), eg. by this definition
the set of usable start characters in IRV/US-ASCII are
the letters and _
a name token starting with two -
(U+002D HYPHEN-MINUS)
characters isn't allowed, nor is a name token starting with
-
(U+002D HYPHEN-MINUS) followed by a digit.
Allowing a larger set of characters needs to be weighed against rules for URI fragment and CSS selector parsing.
That is, since ID values are used to form fragment identifiers in URLs, surely a recommendation about which characters an ID value should or shouldn't contain must look at the characters permitted in URL fragments for an informed choice, since characters not in this set need URL escaping (eg. percent-encoding).
From RFC 3986 (on which RFC 3987 builds):
fragment = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
The earlier RFC2396 specification also reduces the set further by recommending to avoid the following unwise characters anywhere in a URL:
unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
Applying these recommendations reduces the set of usable characters in fragment identifiers to these:
/ ? (fragment production)
: @ (pchar production)
- . _ ~ (unreserved production)
! $ & ( ) * + , ; = (subdelims production)
From these, the following are disallowed because of conflicts with markup parsing rules:
/
(SOLIDUS)
?
(QUESTION MARK)
!
(EXCLAMATION MARK)
&
(AMPERSAND)
;
(SEMICOLON)
Moreover, the following characters should be disallowed because of conflicts with selector syntax:
#
(NUMBER SIGN)
.
(FULL STOP)
@
(AT-SIGN)
+
(PLUS), >
(GREATER-THAN), ',' (COMMA), ~
(TILDE)
disallowed anywhere, since used as CSS combinators, and ~
is also used in the ~=
attribute selector such that
an attribute name needs escaping when it contains ~
*
(ASTERISK)
disallowed anywhere, since used as CSS wildcard, and
also used in the *=
attribute selector such that an
attribute name needs escaping when it contains *
:
(COLON)
arriving at the following characters
-
(HYPHEN-MINUS), _
(LOW LINE)
and
!
(EXCLAMATION MARK), @
(AT-SIGN)
Note the .
(FULL STOP) character is excluded by
these rules even though allowed as namechar of element
names above.
It's unwise to alter HTML's name start characters because HTML parsers could depend on it being in the fixed range of lowercase letters (see above).
This leaves !
and @
as additional characters for
HTML name characters but not name start characters.
This is considered too small a benefit to warrant
changing HTML's naming rules at all.
The HTML5 and HTML5.1 specifications name 2231 Named character references available for use in HTML5/HTML5.1, and list the following public identifiers as reference:
-//W3C//DTD XHTML 1.0 Transitional//EN
-//W3C//DTD XHTML 1.1//EN
-//W3C//DTD XHTML 1.0 Strict//EN
-//W3C//DTD XHTML 1.0 Frameset//EN
-//W3C//DTD XHTML Basic 1.0//EN
-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN
-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN
-//W3C//DTD MathML 2.0//EN
-//WAPFORUM//DTD XHTML Mobile 1.0//EN
Note that the link provided by the HTML5.1 specification
https://www.w3.org/TR/2016/REC-html51-20161101/entities.dtd
to point to the consolidated SGML/XML entity set covering
the reference sources is broken, but for the purpose of
constructing the restrictive DTD is replaced with the
canonical source for these entity sets at
https://www.w3.org/2003/entities/2007/htmlmathml-f.ent
anyway
(referred to as htmlmathml-f.ent
in subsequent discussion)
as advised in https://www.w3.org/TR/xml-entity-names/
.
While the vast majority of these entities is provided for MathML support, the entity set is part of HTML5/HTML5.1 and thus available anywhere in HTML content.
The Permissive HTML5.1 DTD doesn't include these declarations as entity set; instead, named entity references are provided using WebSGML's (ISO 8879 Annex K) "predefined data character entities" feature.
Using predefined entities can capture the notion that a web browser has built-in support for displaying HTML5's named character references, and that at no point in the process are predefined data character entities actually substituted into numeric character entity references.
The latter notion is required since WebSGML predefined data
character entities can only map to a single character number
(UCS code point), rather than a sequence of code points; for
example, sgmljs.net SGML, like (Open)SP SGML's osgmlnorm
program, will reproduce predefined data character entities as-is
(ie. as entity reference, rather than replaced character(s))
to result markup when used as a command-line application with
the proper target SGML declaration options.
Even though the actual replacement character number to which a predefined entity is mapped is thus inconsequential for SGML processors, those entities mapping to multi-code point sequences aren't included in the predefined character entities because predefined entity references can't be redefined; if their use is desired, these can simply be included in the internal subset of a document by using eg.
<!ENTITY % htmlmathml-f PUBLIC
"-//W3C//ENTITIES HTML MathML Set//EN//XML"
"http://www.w3.org/2003/entities/2007/htmlmathml-f.ent">
%htmlmathml-f;
as advised in https://www.w3.org/TR/xml-entity-names/
.
The affected entities are listed below, along with a recommendation as to their replacement
Of those listed by the HTML5 specification, the following entities
(listed along with their base code points) are combined with
U+FE00 VARIATION SELECTOR-1
into variation sequences
caps
(U+2229 INTERSECTION
)
cups
(U+222A UNION
)
gvertneqq
/gvnE
(U+2269 GREATER-THAN BUT NOT EQUAL TO
)
lates
(U+2AAD LARGER THAN OR EQUAL TO
)
lesg
(U+22DA LESS-THAN EQUAL TO OR GREATER-THAN
)
lvertneqq
/lvnE
(U+2268 LESS-THAN BUT NOT EQUAL TO
)
smtes
(U+2AAC SMALLER THAN OR EQUAL TO
)
sqcaps
(U+2293 SQUARE CAP
)
sqcups
(U+2294 SQUARE CUP
)
varsubsetneq
, vsubne
(U+228A SUBSET OF WITH NOT EQUAL TO
)
varsubsetneqq
, vsubnE
(U+2ACB SUBSET OF ABOVE NOT EQUAL TO
)
varsupsetneq
, vsupne
(U+228B SUPERSET OF WITH NOT EQUAL TO
)
varsupsetneqq
, vsupnE
(U+2ACC SUPERSET OF ABOVE NOT EQUAL TO
)
Note all used variation sequences in htmlmathml-f.ent
are
standardized variants.
To be able to use references to these, simply use their
respective base code point or base entity, respectively.
For example, caps
(U+2229 INTERSECTION
, U+FE00 VARIATION SELECTOR-1
)
supposed to render INTERSECTION with serifs
should be replaced by
just cap
(U+2229 INTERSECTION
); using the respective base
character is also the recommended practice in the
Unicode Variation Sequences FAQ.
Moreover, the following combining marks are used together with
other characters (eg. as in race
which maps to the sequence
U+223D REVERSED TILDE
, U+0331 COMBINING MACRON BELOW
):
U+0333 COMBINING DOUBLE LOW LINE
U+20E5 COMBINING REVERSE SOLIDUS OVERLAY
U+0338 COMBINING LONG SOLIDUS OVERLAY
U+20D2 COMBINING LONG VERTICAL LINE OVERLAY
U+0331 COMBINING MACRON BELOW
in the following entities:
acE
: U+223E
, U+0333
bne
: U+003D
, U+20E5
bnequiv
: U+2261
, U+20E5
nang
: U+2220
, U+20D2
nbump
: U+224E
, U+0338
nbumpe
: U+224F
, U+0338
ncongdot
: U+2A6D
, U+0338
nedot
: U+2250
, U+0338
nesim
: U+2242
, U+0338
ngE
, ngeqq
: U+2267
, U+0338
ngeqslant
, nges
: U+2A7E
, U+0338
nGt
: U+226B
, U+20D2
nGtv
: U+226B
, U+0338
nlE
, nleqq
: U+2266
, U+0338
nleqslant
, nles
: U+2A7D
, U+0338
nLt
: U+226A
, U+20D2
nLtv
: U+226A
, U+0338
NotEqualTilde
: U+2242
, U+0338
NotGreaterFullEqual
: U+2267
, U+0338
NotGreaterGreater
: U+226B
, U+0338
NotGreaterSlantEqual
: U+2A7E
, U+0338
NotHumpDownHump
: U+224E
, U+0338
NotHumpEqual
: U+224F
, U+0338
notindot
: U+22F5
, U+0338
notinE
: U+22F9
, U+0388
NotLeftTriangleBar
: U+29CF
, U+0388
NotLessLess
: U+226A
, U+0388
NotLessSlantEqual
: U+2A7D
, U+0388
NotNestedGreaterGreater
: U+2AA2
, U+0388
NotNestedLessLess
: U+2AA1
, U+0388
NotPrecedesEqual
, npreceq
: U+2AAF
, U+0388
NotRightTriangleBar
: U+29D0
, U+0388
NotSquareSubset
: U+228F
, U+0388
NotSquareSuperset
: U+2290
, U+0388
NotSubset
, nsubset
: U+2282
, U+20D2
NotSucceedsEqual
, nsucceq
: U+2AB0
, U+0338
NotSucceedsTilde
: U+227F
, U+0338
NotSuperset
: U+2283
, U+20D2
nparsl
: U+2AFD
, U+20E5
npart
: U+2202
, U+0338
npre
: U+2AAF
, U+0338
nrarrc
: U+2933
, U+0338
nrarrw
: U+219D
, U+0338
nsce
: U+2AB0
, U+0338
nsubE
, nsubseteqq
: U+2AC5
, U+0338
nsupE
, nsupseteqq
: U+2AC6
, U+0338
nsupset
: U+2283
, U+20D2
nvap
: U+224D
, U+20D2
nvge
: U+2265
, U+20D2
nvgt
: U+003E
, U+20D2
nvle
: U+2264
, U+20D2
nvlt
: U+003C
, U+20D2
nvltrie
: U+22B4
, U+20D2
nvrtrie
: U+22B5
, U+20D2
nvsim
: U+223C
, U+20D2
race
: U+223D
, U+0331
vnsub
: U+2282
, U+20D2
vnsup
: U+2283
, U+20D2
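A table like the one above is easy to mistype, so a mechanical check can help. The following is a sketch only, using Python's standard unicodedata module: it verifies that every code point after the first in an expansion is a combining character. The sample dictionary holds just a few entries copied from the list; the function and variable names are hypothetical.

```python
import unicodedata

# A few expansions copied from the list above: entity name -> code points.
SAMPLE_EXPANSIONS = {
    "acE": (0x223E, 0x0333),
    "bne": (0x003D, 0x20E5),
    "nang": (0x2220, 0x20D2),
    "nbump": (0x224E, 0x0338),
    "race": (0x223D, 0x0331),
}


def non_combining_marks(expansions):
    """Map entity names to trailing code points that are NOT combining characters."""
    suspicious = {}
    for name, (_base, *marks) in expansions.items():
        # unicodedata.combining() returns 0 for non-combining characters.
        wrong = [f"U+{cp:04X}" for cp in marks if unicodedata.combining(chr(cp)) == 0]
        if wrong:
            suspicious[name] = wrong
    return suspicious


# An empty result means every trailing code point is a combining character.
problems = non_combining_marks(SAMPLE_EXPANSIONS)
```

An entry whose second code point is a digit or a letter rather than a combining mark would show up in the result under its entity name.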
For the two most commonly used combinations, it is recommended to use the following replacements instead:
bne
(U+003D EQUAL TO
, U+20E5 COMBINING REVERSE SOLIDUS OVERLAY
):
use the ne
or NotEqual
entity instead
bnequiv
(U+2261 IDENTICAL TO
, U+20E5 COMBINING REVERSE SOLIDUS OVERLAY
):
use the nequiv
or NotCongruent
entity instead
Other entities making use of combining marks are not represented in the set of predefined entities in the HTML5.1 SGML declaration for the Permissive DTD.
Apart from those, the following HTML5 named character references are mapped to text elements with more than a single code point:
fjlig
(U+0066 LATIN SMALL LETTER F
, U+006A LATIN SMALL LETTER J
)
as an f-j ligature is missing from Unicode,
fjlig
is supposedly a placeholder for authors until one is added;
but this doesn't appear to be happening, which is why it's not
represented in the set of predefined entities (note that rendering a
ligature is mostly performed by the font in use anyway whenever the
code point sequence appears in text data)
thickspace
(U+205F MEDIUM MATHEMATICAL SPACE
, U+200A HAIR SPACE
)
use emsp13
instead (emsp13
is mapped to U+2004 THREE-PER-EM SPACE
as replacement, and, according to
Unicode spaces,
medium mathematical space is 4/18 em = 1/4.5 em, hence
medium mathematical space + hair space = 4/18 em + 2/18 em = 6/18 em = 1/3 em)
Removing these multi-code-point sequences from the set of predefined entities allows us to specify UCS implementation level 1, provided no other multi-code-point sequences are used in user data.
Note XML Entity Definitions for Characters (3rd Edition) contains a similar, but not identical, analysis.
For the SGML declaration for HTML5.1, WebSGML's
FEATURES MINIMIZE EMPTYNRM YES
setting is used.
This allows:
bogus XML-style empty elements for eg. HTML's meta
elements and other HTML "void" elements, and thus matches
HTML's prescribed parsing exactly;
XML-style empty elements in foreign content on
elements having declared content EMPTY
(in addition
to all other elements).
HTML5/HTML5.1 allows omitting quote characters
(called the LIT
or LITA
delimiters in SGML parlance,
which are the double or single quote characters in the
reference concrete syntax, resp.) on any attribute where
the attribute value happens to not contain space characters,
not just the boolean attributes.
HTML's rules match, and are inherited from, SGML's,
where the behaviour is additionally subject
to the FEATURES MINIMIZE ATTRIB OMITNAME
setting of
the SGML declaration, which, only when YES
, allows
this on undeclared attributes.
For the Restrictive DTD, this setting is NO
;
for the Permissive DTD, this setting is YES
.
Omitting quotes on attributes other than the Boolean attributes and other enumerated attributes wasn't very common in HTML until relatively recently and was discouraged in earlier specifications; the latest WHATWG specification text, however, makes aggressive use of it.
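To make the condition for omitting quotes concrete, here is a small Python sketch. It assumes the rule of the HTML syntax for unquoted attribute values (non-empty, with no ASCII whitespace and none of the characters ", ', =, <, > or `), which is somewhat more detailed than the "no space characters" summary given above. The function name is hypothetical.

```python
# Characters that force quoting, per the assumed HTML unquoted-value rule:
# ASCII whitespace plus " ' = < > and `.
FORBIDDEN_UNQUOTED = set(" \t\n\f\r\"'=<>`")


def can_omit_quotes(value):
    """Return True if value may be written as an unquoted attribute value."""
    return value != "" and not any(ch in FORBIDDEN_UNQUOTED for ch in value)


# Examples: "utf-8" needs no quotes, "a b" does because of the space.
utf8_unquoted = can_omit_quotes("utf-8")
spaced_unquoted = can_omit_quotes("a b")
```

A serializer could use such a check to decide per attribute whether the LIT or LITA delimiters have to be emitted.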
The Custom elements specification, while not part of W3C HTML5.1, was already considered when defining the set of allowable characters in element names and other name tokens; the information provided here gives a general statement on their intended use with the Restrictive DTD, but isn't meant as a comprehensive treatment of the subject.
For use with the Restrictive DTD for HTML5.1, the recommended way to handle custom elements is to declare these in the internal subset (a declaration for these must be present when using the Restrictive DTD but doesn't need to be when using the Permissive DTD).
To be able to actually use custom elements in browsers, these must be registered using JavaScript, but this is considered out of scope here.
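As a rough sketch of what declaring custom elements in the internal subset could look like, the following Python helper assembles a document type declaration. Everything specific here is an assumption for illustration: the system identifier "html51.dtd" is a placeholder rather than the actual file name of the Restrictive DTD, and the declared content ANY with required start- and end-tags ("- -") is merely a stand-in for whatever content model a given custom element really needs.

```python
def doctype_with_custom_elements(names, system_id="html51.dtd"):
    """Build a DOCTYPE whose internal subset declares the given custom elements.

    system_id is a hypothetical placeholder; ANY and "- -" are illustrative
    defaults, not recommendations from the DTD itself.
    """
    declarations = "\n".join(f"  <!ELEMENT {name} - - ANY>" for name in names)
    return f'<!DOCTYPE html SYSTEM "{system_id}" [\n{declarations}\n]>'


# One declaration per custom element used in the document.
prolog = doctype_with_custom_elements(["my-widget", "my-panel"])
```

The resulting prolog would precede the document instance; with the Permissive DTD, such declarations are not needed.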
Note that apart from custom elements, there are also
customized built-in elements. For these, on the HTML/markup
side of things, not much has to be done as behaviours
with respect to tag omission and omission of attribute
names are already declared. New attributes can be declared
as needed using sgmljs.net SGML; (Open)SP
SGML, on the other hand, doesn't allow multiple ATTLIST
declarations for the same element, so for (Open)SP SGML,
custom attributes must be manually added to the respective
ATTLIST declaration in the Restrictive HTML5.1 DTD.
Custom elements that aren't
replacements for built-in elements, on the other hand, are required to
support global attributes, including their special parsing
rules, just like regular HTML elements (see eg.
https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes#id
),
and are therefore expected to receive attribute declarations for
global attributes via WebSGML ATTLIST #ALL
declarations.
These are already used for the Permissive DTD, but not yet for the
Restrictive DTD; it is thus expected that a future
revision of this DTD (eg. for HTML5.2, which has been
announced for 4Q2017) will be published without support
for (Open)SP SGML or other processors lacking ATTLIST #ALL
support (note the Permissive DTD already requires
ATTLIST #ALL
in this revision).
The W3C document at
https://www.w3.org/International/questions/qa-html-encoding-declarations
recommends always declaring the encoding of a document
using a meta
element with either the charset
or
the http-equiv
(for HTML4) attribute when using UTF-8.
However, this is not considered a DTD issue and isn't enforced by this DTD. It can't be, because SGML won't infer whole elements (it will only infer start- and end-element tags), and it shouldn't be, because there are of course legitimate reasons to use a different encoding, even though UTF-8 is the preferred one.
Uniform header values and other global site or route defaults can be implemented using SGML templates, however.
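Since the encoding declaration isn't enforced by the DTD, a site that wants it checked needs a separate step. The following is one possible sketch using Python's standard html.parser module; the class and function names are hypothetical, and the parsing of the http-equiv form is deliberately simplistic.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the first encoding declared via a meta element, if any."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # <meta charset=...> form
            self.charset = attributes["charset"].strip().lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # <meta http-equiv="Content-Type" content="...; charset=..."> form
            content = (attributes.get("content") or "").lower()
            if "charset=" in content:
                self.charset = content.split("charset=", 1)[1].split(";")[0].strip()


def declared_charset(html_text):
    """Return the lowercased declared encoding, or None if none is declared."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    finder.close()
    return finder.charset
```

A build step could, for instance, report documents for which declared_charset returns None, without rejecting documents that legitimately declare an encoding other than UTF-8.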