Talk overview

SGML refresher

The HTML 5.2 DTD

Applications

SGML features not in XML

Markup minimization

Rich entity and notation content transclusion

Custom Wiki syntaxes

Alternate concrete markup syntaxes/SGML declaration

Metadata and processing facilities (SGML LINK)

Concurrent Markup (SGML CONCUR)

HTML features not in XML

Markup minimization

Tag omission/inference

Empty elements

Short syntax for enumerated attributes

Unquoted attributes

Content exceptions

www.w3.org/TR/NOTE-sgml-xml-971215.html

Tag omission

A valid HTML document

<title>Tag omission in HTML paragraphs</title>
<p>This is the first paragraph.
<p>This is the second.

How is it parsed by SGML?

A minimal DTD for HTML

Tag omission indicators

<!ENTITY % metadata "title|script">
<!ENTITY % scripting "script|template">
<!ELEMENT html O O (head,body) +(%scripting;)>
<!ELEMENT head O O (%metadata;)*>
<!ELEMENT body O O ANY>
<!ELEMENT p - O ANY -(p)>

O O (double capital letter O for "omission"): the html, head, and body elements admit start- and end-tag omission

A minimal DTD for HTML

Tag omission indicators

<!ENTITY % metadata "title|script">
<!ENTITY % scripting "script|template">
<!ELEMENT html O O (head,body) +(%scripting;)>
<!ELEMENT head O O (%metadata;)*>
<!ELEMENT body O O ANY>
<!ELEMENT p - O ANY -(p)>

- O: the p element admits end-tag omission only (its start tag is required)

A minimal DTD for HTML

Content exceptions

<!ENTITY % metadata "title|script">
<!ENTITY % scripting "script|template">
<!ELEMENT html O O (head,body) +(%scripting;)>
<!ELEMENT head O O (%metadata;)*>
<!ELEMENT body O O ANY>
<!ELEMENT p - O ANY -(p)>

ANY -(p): p admits any element as content except p itself, at any nesting depth
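As a rough illustration of how the parts of an element declaration fit together (name, the two tag omission indicators, content model, optional exception), the following Python sketch splits the declarations shown above into fields. The regular expression and the name parse_element_decl are illustrative assumptions; they cover only the simple declarations on these slides, not SGML element declarations in general.

```python
import re

# Illustrative pattern (an assumption, not a full SGML grammar):
# name, start/end omission indicators, content model, optional exception.
DECL = re.compile(
    r"<!ELEMENT\s+(\S+)\s+([-O])\s+([-O])\s+(.+?)(?:\s+([-+]\([^)]*\)))?\s*>"
)

def parse_element_decl(decl):
    """Split a simple element declaration into its parts."""
    m = DECL.fullmatch(decl.strip())
    if m is None:
        raise ValueError("unsupported declaration: " + decl)
    name, start, end, model, exception = m.groups()
    return {
        "name": name,
        "start_tag_omissible": start == "O",   # "O" = omissible, "-" = required
        "end_tag_omissible": end == "O",
        "content_model": model,
        "exception": exception,                # "+(...)" inclusion, "-(...)" exclusion, or None
    }

if __name__ == "__main__":
    for decl in (
        "<!ELEMENT html O O (head,body) +(%scripting;)>",
        "<!ELEMENT head O O (%metadata;)*>",
        "<!ELEMENT p - O ANY -(p)>",
    ):
        print(parse_element_decl(decl))
```

For the p declaration this reports a required start tag, an omissible end tag, content model ANY, and the exclusion -(p).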

Parsing HTML

<!DOCTYPE html SYSTEM "html52mini.dtd">
<title>Tag omission in HTML paragraphs</title>
<p>This is the first paragraph.
<p>This is the second.

Parsing HTML

<!DOCTYPE html SYSTEM "html52mini.dtd">
<html>
<title>Tag omission in HTML paragraphs</title>
<p>This is the first paragraph.
<p>This is the second.

SGML creates an html element if it isn't there, knowing that an html document element must be the first content element in an HTML file

Parsing HTML

<!DOCTYPE html SYSTEM "html52mini.dtd">
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
<p>This is the first paragraph.
<p>This is the second.

SGML infers the head element if it isn't there, since the content model requires it at the start of html's content

Parsing HTML

<!DOCTYPE html SYSTEM "html52mini.dtd">
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
<p>This is the first paragraph.
<p>This is the second.

SGML accepts the title element as child content of head, as allowed by head's model group expression

Parsing HTML

<!DOCTYPE html SYSTEM "html52mini.dtd">
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
</head>
<p>This is the first paragraph.
<p>This is the second.

SGML infers the end tag for the head element, since the p element following title isn't allowed to occur in head

Parsing HTML

<!DOCTYPE html SYSTEM "html52mini.dtd">
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
</head>
<body>
<p>This is the first paragraph.
<p>This is the second.

SGML infers the body element if it isn't there, since it's required to follow the head element

Parsing HTML

<!DOCTYPE html SYSTEM "html52mini.dtd">
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
</head>
<body>
<p>This is the first paragraph.
<p>This is the second.

SGML accepts the first p element as content of body

Parsing HTML

<!DOCTYPE html SYSTEM "html52mini.dtd">
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
</head>
<body>
<p>This is the first paragraph.</p>
<p>This is the second.

SGML infers the end tag for p, since p isn't accepted as content of p

Parsing HTML

<html>
<head>
<title>Tag omission in HTML paragraphs</title>
</head>
<body>
<p>This is the first paragraph.</p>
<p>This is the second.</p>
</body>
</html>

SGML infers the end tags for p, body, and html at the end of the document
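The inference steps walked through above can be mimicked in a few lines of Python. This is a toy sketch under stated assumptions, not how sgmlproc or any SGML parser is implemented: the content models of the mini DTD are hard-coded, the inclusion +(%scripting;) is not modeled, ANY is simplified, and the names OMIT, allowed, and normalize are made up for illustration.

```python
import re

# Tag omission indicators of the mini DTD: name -> (start omissible, end omissible).
OMIT = {"html": (True, True), "head": (True, True),
        "body": (True, True), "p": (False, True)}
HEAD_CONTENT = {"title", "script"}                 # %metadata;
TOKEN = re.compile(r"<(/?)(\w+)[^>]*>|([^<]+)")

def allowed(stack, seen, tok):
    """True if `tok` may occur directly in the currently open element."""
    top = stack[-1]
    if top == "html":                              # (head,body): strictly in this order
        return len(seen) < 2 and tok == ("head", "body")[len(seen)]
    if top == "head":
        return tok in HEAD_CONTENT
    if top == "title":
        return tok == "#text"
    if tok == "p" and "p" in stack:                # exclusion -(p)
        return False
    return tok not in ("html", "head", "body")     # simplified ANY

def normalize(src):
    """Return `src` with all omitted tags of the toy DTD made explicit."""
    src = re.sub(r"<![^>]*>", "", src)             # drop DOCTYPE and other declarations
    out, stack, seen = [], [], []                  # seen: children of html opened so far

    def open_(name):
        if stack and stack[-1] == "html":
            seen.append(name)
        stack.append(name)
        out.append("<" + name + ">")

    def close():
        out.append("</" + stack.pop() + ">")

    def make_room(tok):
        # Infer start tags or end tags until `tok` fits the current element.
        while not (stack and allowed(stack, seen, tok)):
            if not stack:
                if out:
                    raise ValueError("content after the document element")
                open_("html")                      # infer the document element
            elif stack[-1] == "html" and len(seen) < 2:
                open_(("head", "body")[len(seen)]) # infer the next required child
            elif OMIT.get(stack[-1], (False, False))[1]:
                close()                            # infer an omissible end tag
            else:
                raise ValueError(tok + " not allowed in " + stack[-1])

    for slash, name, text in (m.groups() for m in TOKEN.finditer(src)):
        if text is not None:
            if text.strip():                       # ignore whitespace-only text
                make_room("#text")
                out.append(text.strip())
        elif slash:
            if name not in stack:
                raise ValueError("stray end tag " + name)
            while stack[-1] != name:
                close()
            close()
        elif name == "html" and not stack and not out:
            open_("html")
        else:
            make_room(name)
            open_(name)
    while stack:                                   # end of document: close what is still open
        close()
    return "".join(out)

if __name__ == "__main__":
    print(normalize('<!DOCTYPE html SYSTEM "html52mini.dtd">\n'
                    "<title>Tag omission in HTML paragraphs</title>\n"
                    "<p>This is the first paragraph.\n"
                    "<p>This is the second.\n"))
```

Run on the example document, the sketch emits the same fully tagged structure as the last slide of the walk-through, on a single line.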

Attribute minimization

Given the following attribute declaration

<!ATTLIST option selected (selected) #IMPLIED>

these element/attribute specifications are equivalent:

<option selected>
<option selected=selected>
<option selected="selected">

These rules happen to coincide (mostly) with HTML's attribute minimization features
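In SGML terms, the minimized form <option selected> specifies an attribute value with the attribute name omitted; the parser finds the attribute whose token group contains that value. The Python sketch below mimics this lookup for the declaration above. The ENUMERATED table, the ATTR pattern, and the function name expand_attributes are illustrative assumptions, and the pattern handles only the three spellings shown here.

```python
import re

# Token groups of enumerated attributes, mirroring
# <!ATTLIST option selected (selected) #IMPLIED>.
ENUMERATED = {"selected": {"selected"}}

# name, optionally followed by = and a double-quoted or unquoted value
ATTR = re.compile(r'([\w-]+)(?:\s*=\s*(?:"([^"]*)"|([^\s">]+)))?')

def expand_attributes(spec):
    """Expand an attribute specification list into {name: value}."""
    result = {}
    for m in ATTR.finditer(spec):
        first, quoted, bare = m.groups()
        value = quoted if quoted is not None else bare
        if value is None:
            # A lone token is a value: look up the attribute that declares it.
            token = first.lower()
            owners = [a for a, tokens in ENUMERATED.items() if token in tokens]
            if len(owners) != 1:
                raise ValueError("token not in a unique token group: " + first)
            result[owners[0]] = token
        else:
            result[first.lower()] = value
    return result

if __name__ == "__main__":
    for spec in ("selected", "selected=selected", 'selected="selected"'):
        print(spec, "->", expand_attributes(spec))
```

All three spellings yield the same name/value pair, while an undeclared lone token is rejected.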

The HTML 5.2 DTD

A restrictive DTD for parsing HTML containing declarations for all elements of HTML 5.2

designed to be used along with

SVG
MathML, and
the ARIA attributes

using their official DTDs

HTML 5 as SGML profile

[...] Since neither of the two authoring formats defined in this specification are applications of SGML, a validating SGML system cannot constitute a conformance checker [...].

www.w3.org/TR/html5/single-page.html

HTML 5 language

The HTML 5.x markup language is presented twice:

declarative specification: as grammar rules (chapters 3 and 4)

procedural specification: as parsing algorithm (chapter 8.2)

HTML 5 vocabulary

www.w3.org/TR/html/dom.html#kinds-of-content

HTML 5 vocabulary

<!ENTITY % heading "h1|h2|h3|h4|h5|h6">

HTML 5 vocabulary

<!ENTITY % sectioning
      "article|aside|nav|section">

HTML element categories

Transcription into parameter entity declarations

<!-- Heading content (section 3.2.4.2.4). -->
<!ENTITY % heading "h1|h2|h3|h4|h5|h6">

<!-- Sectioning content (section 3.2.4.2.3). -->
<!ENTITY % sectioning "article|aside|nav|section">

<!-- Metadata content (section 3.2.4.2.1). -->
<!ENTITY % metadata
"base|link|meta|noscript|script|
 style|template|title">

Flow and phrasing content

The HTML 4.01 DTD contains this declaration:

 <!ENTITY % flow "%block; | %inline;">

But HTML 5 only has definitions for flow and phrasing content.

Flow and phrasing content

In HTML 5, a definition for block content is obtained by subtracting phrasing content from flow content

<!-- Flow elements except phrasing elements. -->
<!ENTITY % flow_only
"address|article|aside|blockquote|details|
 div|dl|fieldset|figure|footer|form|
 h1|h2|h3|h4|h5|h6|header|hr|main|menu|nav|
 ol|p|pre|section|table|ul">
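The subtraction is plain set difference. The Python sketch below shows it on small, hand-picked subsets of the flow and phrasing categories; the subsets and the variable names are assumptions chosen for brevity, not the complete HTML 5.2 category lists.

```python
# Illustrative subsets only (assumption): the real categories are much larger.
flow = {"a", "address", "blockquote", "div", "em", "p", "span", "ul"}
phrasing = {"a", "em", "span"}

# Flow elements that are not also phrasing elements.
flow_only = sorted(flow - phrasing)

# Render the result as a parameter entity declaration.
entity = '<!ENTITY % flow_only "{}">'.format("|".join(flow_only))

if __name__ == "__main__":
    print(entity)
```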

The `P` element

Content: Phrasing content.

A p element's end tag may be omitted if the p element is immediately followed by an address, article, aside, blockquote, details, div, dl, fieldset, figcaption, figure, footer, form, h1, h2, h3, h4, h5, h6, header, hr, main, menu, nav, ol, p, pre, section, table, or ul element, or if there is no more content in the parent element and the parent element is an HTML element that is not an a, audio, del, ins, map, noscript, or video element.

Content model for `P`

Content: Phrasing content.

is translated to this element declaration

    <!ELEMENT p (#PCDATA|%phrasing;)*>

Note: Text (character data content) is also phrasing content.

Tag omission rules for `P`

A p element's end tag may be omitted if the p element is immediately followed by an address, article, aside, blockquote, details, div, dl, fieldset, figcaption, figure, footer, form, h1, h2, h3, h4, h5, h6, header, hr, main, menu, nav, ol, p, pre, section, table, or ul element, or if there is no more content in the parent element and the parent element is an HTML element that is not an a, audio, del, ins, map, noscript, or video element.

Tag omission rules for `P`

A p element's end tag may be omitted if the p element is immediately followed by an address, article, aside, blockquote, details, div, dl, fieldset, figcaption, figure, footer, form, h1, h2, h3, h4, h5, h6, header, hr, main, menu, nav, ol, p, pre, section, table, or ul element, or if there is no more content in the parent element and the parent element is an HTML element that is not an a, audio, del, ins, map, noscript, or video element.

Complete declaration for `P`

Complete DTD declarations for the p element

<!ENTITY % phrasing "a|abbr|area|...">

<!ENTITY % flow_only "address|article|...">

<!ELEMENT p - O
(#PCDATA|%phrasing;)* -(%flow_only;|figcaption)>
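The exclusion -(%flow_only;|figcaption) can be checked against the spec prose: the elements listed there as closing a p element should be exactly the flow_only elements plus figcaption. The Python sketch below does this comparison using the two lists as they appear on these slides; the variable names are made up for illustration.

```python
# The flow_only entity text as shown earlier in this talk.
FLOW_ONLY = ("address|article|aside|blockquote|details|"
             "div|dl|fieldset|figure|footer|form|"
             "h1|h2|h3|h4|h5|h6|header|hr|main|menu|nav|"
             "ol|p|pre|section|table|ul")

# Elements named in the spec's tag omission rule for p.
SPEC_PROSE = ("address, article, aside, blockquote, details, div, dl, "
              "fieldset, figcaption, figure, footer, form, h1, h2, h3, h4, "
              "h5, h6, header, hr, main, menu, nav, ol, p, pre, section, "
              "table, ul")

flow_only = set(FLOW_ONLY.split("|"))
closers = set(SPEC_PROSE.split(", "))

extra = closers - flow_only      # named in the prose but not in flow_only
missing = flow_only - closers    # in flow_only but not named in the prose

if __name__ == "__main__":
    print("extra:", sorted(extra))
    print("missing:", sorted(missing))
```

The only element in the prose that is not in flow_only is figcaption, which is why it is added explicitly to the exclusion.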

Complete DTD

<!-- The html element (section 4.1).
 Content: A head element followed by a body element.
 Tag omission: An html element's start tag can be omitted if the first thing
 inside the html element is not a comment. An html element's end tag can be
 omitted if the html element is not immediately followed by a comment. -->
<!ELEMENT html O O (head,body) +(script)>
<!ATTLIST html
    %extensionattrs;
    accesskey NMTOKENS #IMPLIED
    class NMTOKENS #IMPLIED
    contenteditable (true|false) #IMPLIED
    contextmenu IDREF #IMPLIED

sgmljs.net/docs/html52.html

The HTML 5.2 Mini-DTD

Derived from the (full) HTML 5.2 DTD by including only element and attribute declarations that make parsing HTML special

Assumes semantics of WebSGML SGML declaration setting IMPLYDEF ELEMENT ANYOTHER

Bundled with sgmljs.net SGML and resolved via about:legacy-compat

Full paper

Permissive DTD	Character Set, Names
ARIA + RDF/A (tbd)	Transparent content
Tag omission	Character references
Boolean Attributes	XML empty elements
Void elements	Unquoted attributes
Self-closing elements	RAWTEXT and RCDATA
Script data	Foreign elements
Custom elements	Custom attributes

Additional topics

Lexical types (WebSGML attribute data specifications) for URI and datetime attributes

Variant types (where an attribute determines the content model and/or type of other attributes)

HTML download (and possibly alt) attribute(s) parsed both as enumerated and CDATA/URI attribute (according to WHATWG HTML)

The html5lib-tests suite

Normatively referenced

Basis for https://w3c-test.org/html/, the test suite normatively referenced from W3C's HTML 5.2 spec; the two differ only in that the normative suite is prepared for running in web browsers

Doesn't contain tests specifically updated for HTML 5.2; also contains some legacy (HTML 4 and even 3) tests

w3c-test.org/html/

The html5lib-tests suite

Component test suite for html5lib

contains tests targeting the procedural rather than the declarative formulation of HTML parsing

decisions have to be made with respect to what constitutes rejected versus accepted tests (e.g. tests always succeed)

The html5lib-tests suite

Results

sgmlproc, with current html52mini.dtd, restricted to relevant test cases, succeeds in parsing

942 of 966, or 97.31% of the html5lib-tests suite

(amounting to 2.69% parsing failures)

The html5lib-tests suite

Results were obtained by

running tests in tests*.dat files (separated into individual files)

as of a 2017 snapshot of html5lib-tests,

with adding a DOCTYPE where not present, and

by ignoring tests lacking a content body, having trivially invalid head elements (missing title), and legacy frameset elements
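The preparation steps can be sketched in Python as follows. This is a hedged illustration, not the script used to obtain the results: it assumes the tree-construction .dat layout in which each test starts with a #data line and its input ends at the following #errors line, and it assumes the html52mini.dtd DOCTYPE from the earlier slides as the one to prepend. The function names and the sample text are made up.

```python
def split_tests(dat_text):
    """Return the #data section of every test in a .dat file's text."""
    tests, current, in_data = [], [], False
    for line in dat_text.splitlines():
        if line == "#data":
            current, in_data = [], True
        elif in_data and line == "#errors":
            tests.append("\n".join(current))
            in_data = False
        elif in_data:
            current.append(line)
    return tests

def with_doctype(html, doctype='<!DOCTYPE html SYSTEM "html52mini.dtd">'):
    """Prepend a DOCTYPE (assumed default) unless the test already has one."""
    if html.lstrip().lower().startswith("<!doctype"):
        return html
    return doctype + "\n" + html

# Hypothetical sample in the assumed .dat layout.
SAMPLE = """#data
<p>One<p>Two
#errors
#document
| <html>

#data
<!DOCTYPE html><title>x</title>
#errors
#document
| <html>
"""

if __name__ == "__main__":
    for number, test in enumerate(split_tests(SAMPLE), start=1):
        print("--- test", number)
        print(with_doctype(test))
```

Filtering out tests without a content body, with trivially invalid head elements, or with frameset elements would be a further predicate applied to each extracted test.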

Applications

Parsing HTML

Checking HTML

Scraping HTML

Web vocabulary evolution

Producing HTML

Checking HTML

Using sgmljs.net SGML

$ cat test.html
<!DOCTYPE html SYSTEM "html51.dtd">
<title>Test</title>
<body>
<ol hidden reverse>
	<li>One</li>
	<li>Two</li> </ol>
$ sgmlproc test.html
"test.html": line 4: fatal: 'reverse':
token not in token group for any attributes

Checking HTML

Using sgmljs.net SGML

$ sed s/reverse/reversed/ test.html > test2.html
$ sgmlproc test2.html
<html>
<head><title>Test</title></head>
<body>
<ol hidden="HIDDEN" reversed="REVERSED">
	<li>One</li>
	<li>Two</li> </ol>
</body></html>
$

Checking HTML

Using OpenSP SGML

$ osgmlnorm test2.html
<HTML>
<HEAD>
<TITLE>Test</TITLE>
<LINK HREF="style.css" REL="STYLESHEET">
</HEAD>
<BODY><OL HIDDEN REVERSED>
<LI>One</LI>
<LI>Two</LI>
</OL></BODY></HTML>

Web vocabulary evolution

Flaw in HTML specification text

github.com/whatwg/html/commit/6e305c457e42276bf275b8432302a32c929b0eb8

Web vocabulary evolution

HTML 5.1's `datalist` issue

$ head -1 test3.html
<!DOCTYPE html SYSTEM "html51e.dtd">
$ grep 'ELEMENT datalist' html51e.dtd
<!ELEMENT datalist - -
((#PCDATA|%phrasing;)*|(option|%scripting;)*)
-(%flow_only;)>
$ osgmlnorm test3.html
content model is ambiguous:
when no tokens have been matched, both the 1st
and 2nd occurrences of "TEMPLATE" are possible

sgmljs.net/docs/html5.html#the-datalist-element
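The ambiguity arises because both alternatives of the datalist content model are starred OR groups, so any element named in both can start either branch. The Python sketch below checks this by intersecting the two token sets. The scripting set is taken from the mini DTD shown earlier; the phrasing set is a small hypothetical subset, with template included because the error message reports it in both branches.

```python
# %scripting; as declared in the mini DTD.
SCRIPTING = {"script", "template"}

# Hypothetical subset of %phrasing; (assumption); "template" is included
# because the osgmlnorm message reports it in both branches.
PHRASING_SUBSET = {"a", "em", "span", "template"}

# ((#PCDATA|%phrasing;)* | (option|%scripting;)*)
first_alternative = {"#PCDATA"} | PHRASING_SUBSET
second_alternative = {"option"} | SCRIPTING

# Tokens that could begin either alternative make the model ambiguous.
ambiguous = sorted(first_alternative & second_alternative)

if __name__ == "__main__":
    print("ambiguous tokens:", ambiguous)
```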

Web vocabulary evolution

HTML 5.2 spec inconsistency

The HTML 5.2 spec text for element P hasn't changed from HTML 5.1 when it should have evolved along with element categories

For example, the set of elements on which a P element is closed should include the new dialog element, which is listed as a member of the flow category

Web vocabulary evolution

Missing legacy elements

the keygen element (a void element, i.e. one with declared content EMPTY) of HTML 5.1 isn't included in HTML 5.2, which would make parsers fail hard

legacy HTML 3 and 4 elements bgsound, font, basefont, etc. are still covered in the html5lib-tests

Web vocabulary evolution

Unsound attribute minimization

<img src="..." alt>
<a href="..." download>

W3C's web-platform-tests suite makes "creative" use of the alt attribute; similarly, W3C's HTML 5.2 spec (as opposed to WHATWG's) doesn't specify rules for download name token/attribute parsing