Markup minimization
Rich entity and notation content transclusion
Custom Wiki syntaxes
Alternate concrete markup syntaxes/SGML declaration
Metadata and processing facilities (SGML LINK)
Concurrent markup (SGML CONCUR)
Tag omission/inference
Empty elements
Short syntax for enumerated attributes
Unquoted attributes
Content exceptions
www.w3.org/TR/NOTE-sgml-xml-971215.html
A valid HTML document
<title>Tag omission in HTML paragraphs</title>
<p>This is the first paragraph.
<p>This is the second.
How is it parsed by SGML?
<!ENTITY % metadata "title|script">
<!ENTITY % scripting "script|template">
<!ELEMENT html O O (head,body) +(%scripting;)>
<!ELEMENT head O O (%metadata;)*>
<!ELEMENT body O O ANY>
<!ELEMENT p - O ANY -(p)>
O O (two capital letters O, for "omission"): the html, head, and body elements admit both start- and end-tag omission
<!ENTITY % metadata "title|script">
<!ENTITY % scripting "script|template">
<!ELEMENT html O O (head,body) +(%scripting;)>
<!ELEMENT head O O (%metadata;)*>
<!ELEMENT body O O ANY>
<!ELEMENT p - O ANY -(p)>
- O: the p element admits end-tag omission only (the start-tag is required)
<!ENTITY % metadata "title|script">
<!ENTITY % scripting "script|template">
<!ELEMENT html O O (head,body) +(%scripting;)>
<!ELEMENT head O O (%metadata;)*>
<!ELEMENT body O O ANY>
<!ELEMENT p - O ANY -(p)>
ANY -(p): p admits any element except p anywhere in its content
<!DOCTYPE html SYSTEM "html52mini.dtd">
<title>Tag omission in HTML paragraphs</title>
<p>This is the first paragraph.
<p>This is the second.
<!DOCTYPE html SYSTEM "html52mini.dtd">
<html>
<title>Tag omission in HTML paragraphs</title>
<p>This is the first paragraph.
<p>This is the second.
SGML creates an html element if it isn't there, knowing that an html document element must be the first content element in an HTML file
<!DOCTYPE html SYSTEM "html52mini.dtd">
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
<p>This is the first paragraph.
<p>This is the second.
SGML infers the head element if it isn't there, since the content model requires it at the start of html's content
<!DOCTYPE html SYSTEM "html52mini.dtd">
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
<p>This is the first paragraph.
<p>This is the second.
SGML accepts the title element as child content of head, as allowed by head's model group expression
<!DOCTYPE html SYSTEM "html52mini.dtd">
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
</head>
<p>This is the first paragraph.
<p>This is the second.
SGML infers the end-tag for the head element, since the p element following title isn't allowed to occur in head
<!DOCTYPE html SYSTEM "html52mini.dtd">
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
</head>
<body>
<p>This is the first paragraph.
<p>This is the second.
SGML infers the body element if it isn't there, since it's required to follow the head element
<!DOCTYPE html SYSTEM "html52mini.dtd">
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
</head>
<body>
<p>This is the first paragraph.
<p>This is the second.
SGML accepts the first p element as content of body
<!DOCTYPE html SYSTEM "html52mini.dtd">
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
</head>
<body>
<p>This is the first paragraph.</p>
<p>This is the second.
SGML infers the end-tag for p, since p isn't accepted as content of p
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
</head>
<body>
<p>This is the first paragraph.</p>
<p>This is the second.</p>
</body>
</html>
SGML infers the end-tags for p, body, and html at the end of the document
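The inference steps above can be replayed with a small simulation. The following Python sketch is not how an SGML parser is implemented; it hard-codes the content rules of the mini DTD by hand, and the names normalize, place, and allowed are hypothetical. It also assumes that title accepts text only, which the mini DTD does not declare, and it expects input without a DOCTYPE line.

```python
import re

METADATA = {"title", "script"}    # %metadata; in the mini DTD
TEXT = "#text"                    # stand-in for character data


def normalize(markup):
    """Return `markup` with every inferred start- and end-tag written out."""
    out, stack = [], []
    pending = ["head", "body"]    # html's model group (head,body), still unmatched

    def open_tag(name):
        stack.append(name)
        out.append(f"<{name}>")

    def close_tag():
        out.append(f"</{stack.pop()}>")

    def allowed(child):
        top = stack[-1]
        if top == "html":
            return bool(pending) and child == pending[0]
        if top == "head":
            return child in METADATA
        if top == "body":
            return True               # ANY
        if top == "p":
            return child != "p"       # ANY -(p)
        return child == TEXT          # title: text only (assumed, not in the DTD)

    def place(child):
        """Infer tags until `child` is allowed at the current position."""
        while True:
            if not stack:
                open_tag("html")                  # html: O O
            elif allowed(child):
                if stack[-1] == "html":
                    pending.pop(0)                # explicit head or body tag
                return
            elif stack[-1] == "html" and pending:
                open_tag(pending.pop(0))          # head, body: start-tag omissible
            elif stack[-1] in ("head", "body", "p"):
                close_tag()                       # end-tag omissible
            else:
                raise ValueError(f"{child} not allowed in {stack[-1]}")

    for slash, name, text in re.findall(r"<(/?)(\w+)>|([^<]+)", markup):
        if name and slash:                        # explicit end-tag
            if name not in stack:
                raise ValueError(f"unexpected </{name}>")
            while stack[-1] != name:
                close_tag()
            close_tag()
        elif name:                                # explicit start-tag
            if stack or name != "html":
                place(name)
            open_tag(name)
        elif text.strip():                        # ignore whitespace-only text
            place(TEXT)
            out.append(text.strip())
    while stack:                                  # end of document
        close_tag()
    return "".join(out)


doc = ("<title>Tag omission in HTML paragraphs</title>\n"
       "<p>This is the first paragraph.\n"
       "<p>This is the second.\n")
print(normalize(doc))
```

Each branch of place corresponds to one kind of step in the walkthrough: creating the document element, inferring a required child, or inferring an end-tag for an element that cannot contain the next token.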
Given the following attribute declaration
<!ATTLIST option selected (selected) #IMPLIED>
these element/attribute specifications are equivalent:
<option selected>
<option selected=selected>
<option selected="selected">
These rules happen to coincide (mostly) with HTML's attribute minimization features
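As a rough illustration of why the three forms are equivalent, the Python sketch below expands a minimized attribute specification by looking up a bare token in the declared token groups. The ENUMERATED table and the function name expand_attributes are assumptions made for this example only; a real SGML parser also normalizes case and handles many more declared value types.

```python
import re

# Enumerated attribute declarations: element -> attribute -> token group.
# Mirrors <!ATTLIST option selected (selected) #IMPLIED>.
ENUMERATED = {"option": {"selected": {"selected"}}}


def expand_attributes(element, spec):
    """Expand an attribute specification list into a name -> value dict."""
    declared = ENUMERATED.get(element, {})
    attrs = {}
    for m in re.finditer(r'([\w-]+)(?:=(?:"([^"]*)"|([\w-]+)))?', spec):
        token, quoted, bare = m.groups()
        if quoted is not None or bare is not None:
            # name=value or name="value": nothing to infer
            attrs[token] = quoted if quoted is not None else bare
            continue
        # Bare token: find the attribute whose token group contains it.
        owners = [name for name, group in declared.items() if token in group]
        if not owners:
            raise ValueError(f"{token!r}: token not in token group for any attribute")
        attrs[owners[0]] = token
    return attrs


print(expand_attributes("option", "selected"))
```

A bare token that belongs to no declared group is rejected, which is the kind of error a validating parser reports for a misspelled boolean attribute.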
A restrictive DTD for parsing HTML containing declarations for all elements of HTML 5.2
designed to be used along with
using their official DTDs
www.w3.org/TR/html5/single-page.html
[...] Since neither of the two authoring formats defined in this specification are applications of SGML, a validating SGML system cannot constitute a conformance checker [...].
The HTML 5.x markup language is presented twice:
as grammar rules (chapter 3 and 4)
as parsing algorithm (chapter 8.2)
<!ENTITY % heading "h1|h2|h3|h4|h5|h6">
<!ENTITY % sectioning
"article|aside|nav|section">
Transcription into parameter entity declarations
<!-- Heading content (section 3.2.4.2.4). -->
<!ENTITY % heading "h1|h2|h3|h4|h5|h6">
<!-- Sectioning content (section 3.2.4.2.3). -->
<!ENTITY % sectioning "article|aside|nav|section">
<!-- Metadata content (section 3.2.4.2.1). -->
<!ENTITY % metadata
"base|link|meta|noscript|script|
style|template|title">
The HTML 4.01 DTD contains this declaration:
<!ENTITY % flow "%block; | %inline;">
But HTML 5 only has definitions for flow and phrasing content.
In HTML 5, a definition for block content is obtained by subtracting phrasing content from flow content
<!-- Flow elements except phrasing elements. -->
<!ENTITY % flow_only
"address|article|aside|blockquote|details|
div|dl|fieldset|figure|footer|form|
h1|h2|h3|h4|h5|h6|header|hr|main|menu|nav|
ol|p|pre|section|table|ul">
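The subtraction itself is a plain set difference over the two name groups. The Python sketch below shows the idea on deliberately abbreviated, illustrative lists (not the full HTML 5.2 categories); the function name subtract_entity is hypothetical.

```python
def subtract_entity(flow, phrasing):
    """Return the replacement text of a 'flow minus phrasing' parameter entity."""
    remaining = set(flow.split("|")) - set(phrasing.split("|"))
    return "|".join(sorted(remaining))


# Abbreviated example groups, for illustration only.
flow = "a|div|em|p|span|ul"
phrasing = "a|em|span"
print('<!ENTITY % flow_only "' + subtract_entity(flow, phrasing) + '">')
```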
P element
Content: Phrasing content.
A p element's end tag may be omitted if the p element is immediately followed by an address, article, aside, blockquote, details, div, dl, fieldset, figcaption, figure, footer, form, h1, h2, h3, h4, h5, h6, header, hr, main, menu, nav, ol, p, pre, section, table, or ul element, or if there is no more content in the parent element and the parent element is an HTML element that is not an a, audio, del, ins, map, noscript, or video element.
P
Content: Phrasing content.
is translated to this element declaration
<!ELEMENT p (#PCDATA|%phrasing;)*>
Note Text (character data content) is also phrasing content.
P
Complete DTD declarations for the p element
<!ENTITY % phrasing "a|abbr|area|...">
<!ENTITY % flow_only "address|article|...">
<!ELEMENT p - O
(#PCDATA|%phrasing;)* -(%flow_only;|figcaption)>
<!-- The html element (section 4.1).
Content: A head element followed by a body element.
Tag omission: An html element's start tag can be omitted if the first thing
inside the html element is not a comment. An html element's end tag can be
omitted if the html element is not immediately followed by a comment. -->
<!ELEMENT html O O (head,body) +(script)>
<!ATTLIST html
%extensionattrs;
accesskey NMTOKENS #IMPLIED
class NMTOKENS #IMPLIED
contenteditable (true|false) #IMPLIED
contextmenu IDREF #IMPLIED
sgmljs.net/docs/html52.html
Derived from the (full) HTML 5.2 DTD by including only element and attribute declarations that make parsing HTML special
Assumes the semantics of the WebSGML SGML declaration setting IMPLYDEF ELEMENT ANYOTHER
Bundled with sgmljs.net SGML and resolved via about:legacy-compat
Permissive DTD | Character Set, Names |
ARIA + RDF/A (tbd) | Transparent content |
Tag omission | Character references |
Boolean Attributes | XML empty elements |
Void elements | Unquoted attributes |
Self-closing elements | RAWTEXT and RCDATA |
Script data | Foreign elements |
Custom elements | Custom attributes |
Lexical types (WebSGML attribute data specifications) for URI and datetime attributes
Variant types (where an attribute determines the content model and/or type of other attributes)
HTML download (and possibly alt) attribute(s) parsed both as enumerated and as CDATA/URI attributes (according to WHATWG HTML)
Basis for https://w3c-test.org/html/ (the test suite normatively referenced from W3C's HTML 5.2 spec), differing from it only in that the normative test suite is prepared for running on web browsers
Doesn't contain tests specifically updated for HTML 5.2; also contains some legacy (HTML 4 and even 3) tests
w3c-test.org/html/ contains tests targeting the procedural rather than the declarative formulation of HTML parsing
Decisions have to be made with respect to what constitutes rejected versus accepted tests (e.g. tests always succeed)
sgmlproc, with the current html52mini.dtd and restricted to relevant test cases, parses all but 2.69% of them (2.69% parsing failures)
Results were obtained by running the tests in tests*.dat files (separated into individual files) as of a 2017 snapshot of html5lib-tests, adding a DOCTYPE where not present, and ignoring tests that lack a content body, have trivially invalid head elements (missing title), or use legacy frameset elements
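One way such a preparation step might look is sketched below in Python. It is not the script used for the results above. It assumes the html5lib-tests tree-construction layout, in which each test input sits between a "#data" line and the following "#errors" line, and the function name extract_inputs and the default DOCTYPE are choices made for this example. Filtering on missing title or frameset elements is left out.

```python
DOCTYPE = '<!DOCTYPE html SYSTEM "html52mini.dtd">'


def extract_inputs(dat_text, doctype=DOCTYPE):
    """Return the test inputs of one .dat file, each with a DOCTYPE."""
    inputs, current = [], None
    for line in dat_text.splitlines():
        if line == "#data":
            current = []                      # start collecting a test input
        elif line == "#errors" and current is not None:
            body = "\n".join(current)
            if body.strip():                  # ignore tests lacking a content body
                if not body.lstrip().lower().startswith("<!doctype"):
                    body = doctype + "\n" + body
                inputs.append(body)
            current = None                    # skip #errors, #document, ...
        elif current is not None:
            current.append(line)
    return inputs


sample = (
    "#data\n<p>One<p>Two\n#errors\n(1,3): expected-doctype-but-got-start-tag\n"
    "#document\n| <html>\n\n"
    "#data\n<!DOCTYPE html><b>x\n#errors\n#document\n| <html>\n"
)
for number, text in enumerate(extract_inputs(sample), start=1):
    print(f"--- test{number}.html ---")
    print(text)
```

Each returned string could then be written to its own file and passed to the parser under test.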
Using sgmljs.net SGML
$ cat test.html
<!DOCTYPE html SYSTEM "html51.dtd">
<title>Test</title>
<body>
<ol hidden reverse>
<li>One</li>
<li>Two</li> </ol>
$ sgmlproc test.html
"testfile.html": line 4: fatal: 'reverse':
token not in token group for any attributes
Using sgmljs.net SGML
$ sed s/reverse/reversed/ test.html > test2.html
$ sgmlproc test2.html
<html>
<head><title>Test</title></head>
<body>
<ol hidden="HIDDEN" reversed="REVERSED">
<li>One</li>
<li>Two</li> </ol>
</body></html>
$
Using OpenSP SGML
$ osgmlnorm test2.html
<HTML>
<HEAD>
<TITLE>Test</TITLE>
<LINK HREF="style.css" REL="STYLESHEET">
</HEAD>
<BODY><OL HIDDEN REVERSED>
<LI>One</LI>
<LI>Two</LI>
</OL></BODY></HTML>
datalist issue
$ head -1 test3.html
<!DOCTYPE html SYSTEM "html51e.dtd">
$ grep 'ELEMENT datalist' html51e.dtd
<!ELEMENT datalist - -
((#PCDATA|%phrasing;)*|(option|%scripting;)*)
-(%flow_only;)>
$ osgmlnorm test3.html
content model is ambiguous:
when no tokens have been matched, both the 1st
and 2nd occurrences of "TEMPLATE" are possible
sgmljs.net/docs/html5.html#the-datalist-element
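The ambiguity arises because the same element name can start either alternative of the OR group, so the parser cannot decide which branch it is in without lookahead. The Python sketch below checks for that overlap between repeatable name groups only; it is a simplification, not OpenSP's algorithm. The phrasing list is abbreviated for illustration, and the function name ambiguous_tokens is hypothetical.

```python
def ambiguous_tokens(*alternatives):
    """Return tokens that can start more than one alternative of an OR group.

    Each alternative is the text of a repeatable name group such as 'a|b|c',
    so every listed token can be the first token matched.
    """
    seen, clashes = set(), set()
    for alternative in alternatives:
        tokens = set(alternative.split("|"))
        clashes |= seen & tokens
        seen |= tokens
    return clashes


phrasing = "a|abbr|script|template"       # abbreviated, for illustration
scripting = "script|template"             # as declared in the mini DTD
print(sorted(ambiguous_tokens("#PCDATA|" + phrasing, "option|" + scripting)))
```

In this abbreviated setup both script and template clash; OpenSP reports TEMPLATE in the message quoted above.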
The HTML 5.2 spec text for element P hasn't changed from HTML 5.1 when it should have evolved along with element categories
E.g. the set of elements on which a P element is closed should include the new dialog element, listed as a member of the flow category
The keygen element (an element having void content, i.e. EMPTY content) in HTML 5.1 isn't included in HTML 5.2, which would make parsers fail hard
Legacy HTML 3 and 4 elements bgsound, font, basefont, etc. are still covered in the html5lib-tests
<img src="..." alt>
<a href="..." download>
W3C's web-platform-tests suite makes "creative" use of the alt attribute; similarly, W3C's HTML 5.2 spec (as opposed to WHATWG's) doesn't specify rules for download name token/attribute parsing