1986 | ISO/IEC 8879 (*SGML*) publication |
1998 | XML 1.0 specification as simplified profile of SGML, along with ISO/IEC 8879 Annex K (WebSGML, DTD-less SGML) |
1999 | W3C HTML 4.01 recommendation |
2016 | W3C HTML 5.1 recommendation |
XML is by definition a profile of WebSGML
Features not in the XML profile
Markup minimization
Rich entity and notation content transclusion
Custom Wiki syntaxes
Alternate concrete markup syntaxes/SGML declaration
Metadata and processing facilities (SGML LINK)
Concurrent Markup (SGML CONCUR)
HTML is by definition not a profile of SGML
www.w3.org/TR/html5/single-page.html
[...] Since neither of the two authoring formats defined in this specification are applications of SGML, a validating SGML system cannot constitute a conformance checker [...].
HTML features not present in SGML
modelled via SGML content exclusions
considered an unenforced specification
even the specification text markup itself is invalid
ID/HREF values, inline JavaScript
This talk: HTML 5 is considered a profile of SGML
This talk: HTML features not used in XML
Tag omission/inference
Empty elements
Short syntax for enumerated attributes
Unquoted attributes
Content exceptions
www.w3.org/TR/NOTE-sgml-xml-971215.html
A valid HTML document
<title>Tag omission in HTML paragraphs</title>
<p>This is the first paragraph.
<p>This is the second.
How is it parsed by SGML?
<!ENTITY % metadata "title|script">
<!ENTITY % scripting "script|template">
<!ELEMENT html O O (head,body) +(%scripting;)>
<!ELEMENT head O O (%metadata;)*>
<!ELEMENT body O O ANY>
<!ELEMENT p - O ANY -(p)>
O O (double capital letter O for "omission"): the html, head, and body elements admit start- and end-tag omission
<!ENTITY % metadata "title|script">
<!ENTITY % scripting "script|template">
<!ELEMENT html O O (head,body) +(%scripting;)>
<!ELEMENT head O O (%metadata;)*>
<!ELEMENT body O O ANY>
<!ELEMENT p - O ANY -(p)>
- O: the p element admits end-tag omission only
<!ENTITY % metadata "title|script">
<!ENTITY % scripting "script|template">
<!ELEMENT html O O (head,body) +(%scripting;)>
<!ELEMENT head O O (%metadata;)*>
<!ELEMENT body O O ANY>
<!ELEMENT p - O ANY -(p)>
ANY -(p): p admits any element except p anywhere as content
<!DOCTYPE html SYSTEM "html-minimal.dtd">
<title>Tag omission in HTML paragraphs</title>
<p>This is the first paragraph.
<p>This is the second.
<!DOCTYPE html SYSTEM "html-minimal.dtd">
<html>
<title>Tag omission in HTML paragraphs</title>
<p>This is the first paragraph.
<p>This is the second.
SGML creates an html element if it isn't there, knowing that an html document element must be the first content element in an HTML file
<!DOCTYPE html SYSTEM "html-minimal.dtd">
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
<p>This is the first paragraph.
<p>This is the second.
SGML infers the head element if it isn't there, since the content model requires it at the start of html's content
<!DOCTYPE html SYSTEM "html-minimal.dtd">
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
<p>This is the first paragraph.
<p>This is the second.
SGML accepts the title element as child content of head, as allowed by head's model group expression
<!DOCTYPE html SYSTEM "html-minimal.dtd">
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
</head>
<p>This is the first paragraph.
<p>This is the second.
SGML infers the end-element tag for the head element since the p element following title isn't allowed to occur in head
<!DOCTYPE html SYSTEM "html-minimal.dtd">
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
</head>
<body>
<p>This is the first paragraph.
<p>This is the second.
SGML infers the body element if it isn't there, since it's required to follow the head element
<!DOCTYPE html SYSTEM "html-minimal.dtd">
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
</head>
<body>
<p>This is the first paragraph.
<p>This is the second.
SGML accepts the first p element as content of body
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
</head>
<body>
<p>This is the first paragraph.</p>
<p>This is the second.
</body>
</html>
SGML infers the end-element tag for p, since p isn't accepted as content of p
<html>
<head>
<title>Tag omission in HTML paragraphs</title>
</head>
<body>
<p>This is the first paragraph.</p>
<p>This is the second.</p>
</body>
</html>
SGML infers the end-element tags for p, body, and html at the end of the document
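The inference steps walked through above can be imitated in a few lines of Python. This is only a toy sketch of the html-minimal.dtd declarations, not how an SGML parser is implemented: the table names (`SEQ`, `CHOICE`, `ANY_EXCEPT`) and the `normalize` function are made up for illustration, and `title` is assumed to hold character data only.

```python
# Toy model of html-minimal.dtd; illustrative only, not a general SGML parser.
SEQ = {"html": ["head", "body"]}               # (head,body): ordered, required
CHOICE = {"head": {"title", "script"}}         # (%metadata;)*
ANY_EXCEPT = {"body": set(), "p": {"p"}}       # ANY, and ANY -(p)
START_OMISSIBLE = {"html", "head", "body"}     # "O" in first position
END_OMISSIBLE = {"html", "head", "body", "p"}  # "O" in second position


def normalize(tokens):
    """Return the fully tagged token list for a minimized document."""
    out, stack = [], []  # stack entries: [element name, position in SEQ]

    def open_element(name):
        if stack and stack[-1][0] in SEQ:
            stack[-1][1] += 1  # consume one step of the parent's sequence
        stack.append([name, 0])
        out.append(f"<{name}>")

    def close_element():
        out.append(f"</{stack.pop()[0]}>")

    def accepts(child):
        name, pos = stack[-1]
        if name in SEQ:
            return pos < len(SEQ[name]) and SEQ[name][pos] == child
        if name in CHOICE:
            return child in CHOICE[name]
        if name in ANY_EXCEPT:
            return child not in ANY_EXCEPT[name]
        return child == "#text"  # assumption: e.g. title holds text only

    def required_child():
        name, pos = stack[-1]
        if name in SEQ and pos < len(SEQ[name]):
            return SEQ[name][pos]
        return None

    def make_room(child):
        # Infer start tags and end tags until `child` fits.
        while not (stack and accepts(child)):
            if not stack:
                if child == "html":
                    return
                open_element("html")
            elif required_child() in START_OMISSIBLE:
                open_element(required_child())
            elif stack[-1][0] in END_OMISSIBLE:
                close_element()
            else:
                raise ValueError(f"{child} not allowed in {stack[-1][0]}")

    for kind, value in tokens:
        if kind == "start":
            make_room(value)
            open_element(value)
        elif kind == "text":
            make_room("#text")
            out.append(value)
        else:  # explicit end tag: close inferred ends first
            while stack[-1][0] != value:
                close_element()
            close_element()
    while stack:  # end of document
        if required_child() in START_OMISSIBLE:
            open_element(required_child())
        else:
            close_element()
    return out


doc = [
    ("start", "title"), ("text", "Tag omission in HTML paragraphs"),
    ("end", "title"),
    ("start", "p"), ("text", "This is the first paragraph."),
    ("start", "p"), ("text", "This is the second."),
]
print("".join(normalize(doc)))
```

Feeding it the three-line example document yields the same fully tagged result as the last step of the walkthrough.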
Given the following attribute declaration
<!ATTLIST option selected (selected) #IMPLIED>
these element/attribute specifications are equivalent:
<option selected>
<option selected=selected>
<option selected="selected">
These rules happen to coincide (mostly) with HTML's attribute minimization features
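As a rough illustration of why the three specifications are equivalent, here is a minimal Python sketch that looks a bare token up in the token groups of an element's enumerated attributes. The `ENUMERATED` table and the `expand` function are hypothetical names for this example; real SGML parsers apply further rules (for instance, name case folding as set in the SGML declaration).

```python
# Token groups of enumerated attributes, keyed by element name.
# Mirrors <!ATTLIST option selected (selected) #IMPLIED>.
ENUMERATED = {"option": {"selected": {"selected"}}}


def expand(element, spec):
    """Return (name, value) for one attribute specification."""
    if "=" in spec:
        # Full form: name=value, with or without quotes.
        name, value = spec.split("=", 1)
        return name.lower(), value.strip("\"'")
    # Minimized form: the token alone identifies the attribute.
    token = spec.lower()
    for name, group in ENUMERATED.get(element, {}).items():
        if token in group:
            return name, token
    raise ValueError(f"'{spec}': token not in token group for any attributes")


for spec in ("selected", "selected=selected", 'selected="selected"'):
    print(expand("option", spec))
```

A token that is not in any token group is rejected, which is the situation a misspelled boolean attribute runs into.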
A restrictive DTD for parsing HTML containing declarations for all elements of HTML 5.1
designed to be used along with
using their official DTDs
<!ENTITY % heading "h1|h2|h3|h4|h5|h6">
<!ENTITY % sectioning
"article|aside|nav|section">
Transcription into parameter entity declarations
<!-- Heading content (section 3.2.4.2.4). -->
<!ENTITY % heading "h1|h2|h3|h4|h5|h6">
<!-- Sectioning content (section 3.2.4.2.3). -->
<!ENTITY % sectioning "article|aside|nav|section">
<!-- Metadata content (section 3.2.4.2.1). -->
<!ENTITY % metadata
"base|link|meta|noscript|script|
style|template|title">
The HTML 4.01 DTD contains this declaration:
<!ENTITY % flow "%block; | %inline;">
But HTML 5 only has definitions for flow and phrasing content.
In HTML 5.1, a definition for block content is obtained by subtracting phrasing content from flow content
<!-- Flow elements except phrasing elements. -->
<!ENTITY % block
"address|article|aside|blockquote|details|
div|dl|fieldset|figure|footer|form|
h1|h2|h3|h4|h5|h6|header|hr|main|menu|nav|
ol|p|pre|section|table|ul">
p element
Content: Phrasing content.
A p element's end tag may be omitted if the p element is immediately followed by an address, article, aside, blockquote, details, div, dl, fieldset, figcaption, figure, footer, form, h1, h2, h3, h4, h5, h6, header, hr, main, menu, nav, ol, p, pre, section, table, or ul, element, or if there is no more content in the parent element and the parent element is an HTML element that is not an a, audio, del, ins, map, noscript, or video element.
P
Content: Phrasing content.
is translated to this element declaration
<!ELEMENT p (#PCDATA|%phrasing;)*>
Note Text (character data content) is also phrasing content.
Complete DTD declarations for the p element
<!ENTITY % phrasing "a|abbr|area|...">
<!ENTITY % block "address|article|...">
<!ELEMENT p - O
(#PCDATA|%phrasing;)* -(%block;)>
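The tag omission rule quoted from the specification can also be read as a plain predicate. The following Python sketch encodes the quoted sentence literally; the function and set names are invented for illustration, and `next_element=None` is an assumed convention meaning "no more content in the parent element".

```python
# Elements that end an open p when they immediately follow it
# (list copied from the quoted specification text).
CLOSES_P = {
    "address", "article", "aside", "blockquote", "details", "div", "dl",
    "fieldset", "figcaption", "figure", "footer", "form", "h1", "h2", "h3",
    "h4", "h5", "h6", "header", "hr", "main", "menu", "nav", "ol", "p",
    "pre", "section", "table", "ul",
}
# Parents in which a trailing p must keep its end tag.
NON_CLOSING_PARENTS = {"a", "audio", "del", "ins", "map", "noscript", "video"}


def p_end_tag_omissible(next_element, parent):
    """True if </p> may be omitted according to the quoted rule."""
    if next_element is not None:
        return next_element in CLOSES_P
    return parent not in NON_CLOSING_PARENTS


print(p_end_tag_omissible("p", "body"))     # a following p ends the paragraph
print(p_end_tag_omissible("span", "body"))  # span is phrasing content
print(p_end_tag_omissible(None, "a"))       # last child of an a element
```

The DTD expresses the first half of this rule declaratively: the `- O` omission indicator together with the `-(%block;)` exclusion lets the parser end the p element as soon as an excluded element starts.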
<!-- The html element (section 4.1).
Content: A head element followed by a body element.
Tag omission: An html element's start tag can be omitted if the first thing
inside the html element is not a comment. An html element's end tag can be
omitted if the html element is not immediately followed by a comment. -->
<!ELEMENT html O O (head,body) +(script)>
<!ATTLIST html
%extensionattrs;
accesskey NMTOKENS #IMPLIED
class NMTOKENS #IMPLIED
contenteditable (true|false) #IMPLIED
contextmenu IDREF #IMPLIED
sgmljs.net/docs/html5.html
Permissive DTD | Character Set, Names |
ARIA + RDF/A (tbd) | Transparent content |
Tag omission | Character references |
Boolean Attributes | XML empty elements |
Void elements | Unquoted attributes |
Self-closing elements | RAWTEXT and RCDATA |
Script data | Foreign elements |
Custom elements | Custom attributes |
Using sgmljs.net SGML
$ cat test.html
<!DOCTYPE html SYSTEM "html51.dtd">
<title>Test</title>
<body>
<ol hidden reverse>
<li>One</li>
<li>Two</li> </ol>
$ sgmlproc test.html
"testfile.html": line 4: fatal: 'reverse':
token not in token group for any attributes
Using sgmljs.net SGML
$ sed s/reverse/reversed/ test.html > test2.html
$ sgmlproc test2.html
<html>
<head><title>Test</title></head>
<body>
<ol hidden="HIDDEN" reversed="REVERSED">
<li>One</li>
<li>Two</li> </ol>
</body></link></head></html>
$
Using OpenSP SGML
$ osgmlnorm test2.html
<HTML>
<HEAD>
<TITLE>Test</TITLE>
<LINK HREF="style.css" REL="STYLESHEET">
</HEAD>
<BODY><OL HIDDEN REVERSED>
<LI>One</LI>
<LI>Two</LI>
</OL></BODY></HTML>
Fixed flaw in the HTML specification text
Detecting HTML 5.1's datalist issue
$ head -1 test3.html
<!DOCTYPE html SYSTEM "html51e.dtd">
$ grep 'ELEMENT datalist' html51e.dtd
<!ELEMENT datalist - -
((#PCDATA|%phrasing;)*|(option|%scripting;)*)
-(%flow_only;)>
$ osgmlnorm test3.html
content model is ambiguous:
when no tokens have been matched, both the 1st
and 2nd occurrences of "TEMPLATE" are possible
sgmljs.net/docs/html5.html#the-datalist-element
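The reason for the ambiguity report can be sketched in Python: in an or-group of repeatable alternatives such as `(A*|B*)`, any token that occurs in more than one alternative can start either branch, and an SGML parser may not look ahead to decide. The entity values below are abbreviated, hypothetical stand-ins for the real `%phrasing;` and `%scripting;` lists, and `ambiguous_first_tokens` is a made-up helper name.

```python
from itertools import combinations


def ambiguous_first_tokens(alternatives):
    """Return tokens that can start more than one alternative."""
    clashes = set()
    for left, right in combinations(alternatives, 2):
        clashes |= set(left) & set(right)
    return clashes


# Abbreviated for illustration; the real entities list many more elements.
phrasing = {"#PCDATA", "a", "abbr", "area", "template"}
scripting = {"script", "template"}

# Model of ((#PCDATA|%phrasing;)*|(option|%scripting;)*)
datalist_alternatives = [phrasing, {"option"} | scripting]
print(ambiguous_first_tokens(datalist_alternatives))
```

With these stand-in lists, `template` is reported, matching the element named in the parser message.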
Unbundling lexical value domains of name tokens
Multi-codepoint predefined named entities
Empty literals in attribute values
Adopting HTML's declared content ending rules
Formally embracing DSDL-9 into SGML
Public identifiers for DTD notations
static site generation
SGML/HTML content is organized in straightforward concepts such as files and folders
<!-- RSS template -->
<!DOCTYPE rss [
<!ELEMENT rss O O (channel*)>
<!ATTLIST rss version CDATA #FIXED "2.0">
<!ELEMENT channel - O (title,description,link,lastBuildDate?,pubDate?,ttl?,item*)>
<!ELEMENT item - - (title,description*,link*,guid*,pubDate*)>
<!ELEMENT title - - (#PCDATA)>
<!ELEMENT description - - (#PCDATA)>
<!ELEMENT link - - (#PCDATA)>
<!ELEMENT lastBuildDate - - (#PCDATA)>
<!ELEMENT pubDate - - (#PCDATA)>
<!ELEMENT ttl - - (#PCDATA)>
<!ELEMENT guid - - (#PCDATA)>
]>
<!LINKTYPE html2rss html rss [
<!ENTITY items SYSTEM>
<!-- ... -->
<!ATTLIST (div|h2) dummy CDATA #IMPLIED>
<!LINK #INITIAL channel channel rss rss title title description description link link lastBuildDate lastBuildDate pubDate pubDate ttl ttl
div #USELINK in-item-title [ class="blog-post" ] item
div #USELINK #EMPTY [ dummy="" ] #IMPLIED>
<!LINK in-item-title
h2 [ class="blog-post-title" ] title
h2 #USELINK #EMPTY [ dummy="" ] #IMPLIED>
]>
<rss version="2.0">
<channel>
<title>sgmljs.net RSS feed</title>
<description>News about sgmljs.net</description>
<link>http://sgmljs.net/blog.html</link>
&items;
HTTP/SGML entity metadata
HTML-aware, injection-free transclusion
]>
<rss version="2.0">
<channel>
<title>sgmljs.net RSS feed</title>
<description>News about sgmljs.net</description>
<link>http://sgmljs.net/blog.html</link>
&items;
</channel>
Filtering Templating
First page load against SGML web server is rendered server-side
Subsequent page requests are rendered in-browser, saving server-side content aggregation overhead
HTTP/2 cache, push metadata
sgmljs.net command line
sgmljs.net for Apache
sgmljs.net for Node.js
SGML master file for formatting content into slide shows using reveal.js
HTML or markdown-formatted client file containing actual slide content
# Heading for first slide #
(HTML or markdown paragraph text, image, SVG content,
etc. goes here)
---
# Heading for next slide #
(Slide 2 body text)
reveal.js and other web slide packages for markdown expect input like this, where --- (or the HTML hr element) is used to separate slides
<!DOCTYPE html [
<!ENTITY slides SYSTEM "demo-slides.md">
]>
<html>
<body>
&slides
</body>
</html>
This is what an initial SGML master for slide presentations looks like
SGML will pull content from demo-slides.md into the replacement text for the slides entity
<!SGML MARKDOWN PUBLIC "+//IDN sgmljs.net//SD Markdown//EN">
<!DOCTYPE html [
<!ENTITY slides SYSTEM "demo-slides.md">
<!ENTITY % md_shortref_maps
PUBLIC "+//IDN sgmljs.net//SHORTREF Markdown//EN">
%md_shortref_maps;
<!ELEMENT hr - O EMPTY>
]>
<html>
<body>
&slides
</body>
</html>
For markdown, we need these declarations
<!SGML MARKDOWN ... PUBLIC "+//IDN sgmljs.net//SD Markdown//EN">
configures markdown delimiters (sequences of characters SGML parses as a single token)
<!ENTITY % md_shortref_maps PUBLIC "+//IDN sgmljs.net//SHORTREF Markdown//EN">
loads the (virtual) markdown SHORTREF declarations
%md_shortref_maps;
declares markdown SHORTREF rules (context-dependent token-to-element mappings)
<!ELEMENT hr - O EMPTY>
hr elements don't have end-element tags
reveal.js expects slides in section elements:
<html>
<body>
<section>
<h1 id="first-slide">Heading for first slide</h1>
</section>
<section>
<h1 id="second-slide">Heading for second slide</h1>
</section>
...
</body>
</html>
So we need to wrap portions between hr elements into section wrappers
<!ELEMENT body - -
(section,(hr,section)*)>
SGML infers a section element before any child element of body
<!ELEMENT section O O ANY -(hr)>
SGML closes section elements and opens new ones on hr elements, since hr elements aren't allowed in section content by the -(hr) exclusion exception
# Heading for first slide #
---
# Heading for second slide #
Parsing this with the two declarations yields
<body>
<section>
<h1>Heading for first slide</h1>
</section>
<hr>
<section>
<h1>Heading for second slide</h1>
</section>
</body>
<h1>Heading for first slide</h1>
<hr>
<h1>Heading for second slide</h1>
Equivalently, parsing this HTML fragment yields
<body>
<section>
<h1>Heading for first slide</h1>
</section>
<hr>
<section>
<h1>Heading for second slide</h1>
</section>
</body>
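The effect of the two declarations on body content can be mimicked procedurally. This Python sketch is only an analogy for what the SGML parser infers from the content model and the `-(hr)` exclusion; `wrap_in_sections` is a hypothetical helper, and child elements are represented as plain strings.

```python
def wrap_in_sections(children):
    """Wrap runs of body children between <hr> elements in <section> tags."""
    out = []
    in_section = False
    for child in children:
        if child == "<hr>":
            # hr is excluded from section content, so it ends the section.
            if in_section:
                out.append("</section>")
                in_section = False
            out.append(child)
        else:
            # Any other child needs an enclosing section; open one if needed.
            if not in_section:
                out.append("<section>")
                in_section = True
            out.append(child)
    if in_section:
        out.append("</section>")
    return out


body = [
    "<h1>Heading for first slide</h1>",
    "<hr>",
    "<h1>Heading for second slide</h1>",
]
print("\n".join(wrap_in_sections(body)))
```

The difference is that in SGML nothing like this loop has to be written: the wrapping falls out of the element declarations alone.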
So this will also produce hr elements, which isn't desired; we're going to remove hr elements now
<!DOCTYPE html [
...
]>
<!LINKTYPE tohtml html html [
<!LINK #INITIAL
html html
body body
section section>
]>
In SGML, we can do so by declaring a link process filtering/transforming source markup into result markup
<!DOCTYPE html [
...
]>
<!LINKTYPE tohtml html html [
<!LINK #INITIAL
html html
body body
section section>
]>
The tohtml link process is declared to transform html to html markup, in turn
<!DOCTYPE html [
...
]>
<!LINKTYPE tohtml html html [
<!LINK #INITIAL
html html
body body
section section>
]>
The #INITIAL (and only) link set copies the specified elements to its result; hr elements, however, won't get copied
<!DOCTYPE html [
<!ENTITY slides SYSTEM "demo-slides.md">
<!ENTITY % md_shortref_maps ...> ...
<!ELEMENT html - - (head?,body)>
<!ELEMENT body - - (section,(hr,section)*)>
<!ELEMENT section O O ANY -(hr)>
<!ELEMENT hr - O EMPTY>
]>
<!LINKTYPE tohtml html html [
<!LINK #INITIAL html html body body section section>
]>
<html><body>
&slides
</body></html>
Now that hr won't get copied, body's content model is violated (which requires section elements to be separated by hr elements):
(section,(hr,section)*)
We therefore declare an additional DOCTYPE; we'll keep html as name of the result DOCTYPE that doesn't require hr elements and use slides as source DOCTYPE name for parsing
<!DOCTYPE slides [
<!ELEMENT slides - - (head?,body)>
<!ELEMENT body - - (section,(hr,section)*)>
<!ELEMENT section O O ANY -(hr)>
<!ELEMENT hr - O EMPTY>
<!ENTITY slides SYSTEM ...> ...
]>
<!DOCTYPE html [
<!ELEMENT html - - (head?,body)>
<!ELEMENT body - - (section)*>
<!ELEMENT section - - ANY>
]>
<!LINKTYPE tohtml slides html [
<!LINK #INITIAL slides html body body section section>
]>
<slides><body>
&slides
</body></slides>
The slides DOCTYPE contains all the declarations of html; the document element is renamed to slides, however
The html DOCTYPE contains all declarations as before except for the changed body content model (and for the unnecessary markdown SHORTREF declaration)
<!LINK #INITIAL html html body body section section>
this link set will only copy the explicitly handled/mapped source element and text content; it won't copy other source markup elements
So our result markup will just contain the plain text content of our slides which isn't what we want
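To make the copying behaviour described above concrete, here is a small Python sketch that imitates it on a toy tree. It is not how SGML LINK is implemented; `apply_link_set` and the tuple-based tree are assumptions for illustration, and the rule table mirrors the `slides html body body section section` link set.

```python
def apply_link_set(node, link_rules):
    """Copy mapped elements and text; drop the tags of unmapped elements."""
    if isinstance(node, str):
        return node  # text content is always copied
    name, children = node
    inner = "".join(apply_link_set(child, link_rules) for child in children)
    if name in link_rules:
        result = link_rules[name]
        return f"<{result}>{inner}</{result}>"
    return inner  # unmapped element: only its text content survives


rules = {"slides": "html", "body": "body", "section": "section"}
source = ("slides", [("body", [
    ("section", [("h1", ["Heading for first slide"])]),
    ("hr", []),
    ("section", [("h1", ["Heading for second slide"])]),
])])
print(apply_link_set(source, rules))
```

The hr elements disappear as intended, but so do the h1 tags, leaving only the heading text inside each section.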
We could change the link rule to map each individual HTML element:
<!LINK #INITIAL
...
a a
b b
div div
...
-- etc. -- >
But this would require us to declare and copy all elements over to result markup individually.
<!DOCTYPE #IMPLIED SYSTEM [
<!ENTITY content SYSTEM "<osfd>0">
]>
<section>
&content;
</section>
The "copy" template, stored in a separate copy.sgml file, looks like this
<osfd>0 is a formal system identifier representing the Unix standard input file number 0 (stdin)
sgmlproc copy.sgml < input.txt
if used on the command line directly like this, SGML receives the content of the input.txt file as standard input when invoked using sgmlproc
For this to work, a section.dtd file must be present in the current working directory; SGML will attempt to resolve a DTD file inferred from the name of the first encountered content element (due to <!DOCTYPE #IMPLIED SYSTEM>)
On the other hand, if used from a master, a template's standard input resolves to the child content of the element on which it is applied in the master
<section>
<!-- a template will receive this as
standard input when applied on the
section element: -->
<h2>Section heading</h2>
<p>Section body content</p>
</section>
Moreover, when used from a master, SGML resolves DOCTYPE #IMPLIED SYSTEM as if a section.dtd file were present in the current directory, containing all the target markup declarations from the calling processing context:
<!ELEMENT html - - (head?,body)>
<!ELEMENT body - - (section)*>
<!ELEMENT section - - ANY>
To make SGML apply the copy template on the section element, we use these link set declarations:
<!LINK #INITIAL
section [ template=copy ] section
...>
We re-declare the former section section link rule such that it applies the copy template to section elements
<!NOTATION sgml
PUBLIC
"ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)">
<!NOTATION copy SYSTEM>
<!ATTLIST #NOTATION copy superdcn name #FIXED sgml>
<!ATTLIST section template NOTATION (copy) #IMPLIED>
These declarations tell SGML that the value of the template attribute names an SGML file that must be processed specially
<!NOTATION sgml ...
declares sgml as the name of a notation identified by the "ISO ..." public identifier (sgmljs.net SGML recognizes the public identifier, but sgml, on the other hand, could be any other identifier here as long as it is used consistently with other declarations)
<!NOTATION copy SYSTEM>
declares the copy file (without .sgml suffix) as a notation
<!ATTLIST section template NOTATION (copy) #IMPLIED>
declares the template data attribute (attribute of a notation) as having NOTATION attribute type, with copy the only allowed value (and, by using #IMPLIED, as an optional link attribute of the section element)
<!ATTLIST #NOTATION copy superdcn name #FIXED sgml>
declares the distinguished superdcn data attribute of copy as having the value sgml, which will make sgmljs.net SGML treat the notation as an SGML template
<!NOTATION sgml
PUBLIC
"ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)">
<!NOTATION copy SYSTEM>
<!ATTLIST #NOTATION copy superdcn name #FIXED sgml>
<!ATTLIST section template NOTATION (copy) #IMPLIED>
<!LINK #INITIAL
slides html
body body
section [ template=copy ] section>
sgmlproc slide.sgm > slides.html
we can now invoke sgmlproc to preview the generated HTML on the command line like this (provided we have the slides and copy files in place in the same directory)
sgmljs.net SGML will now replace every section element by the expanded content of the copy template, with the child content of the respective section elements as value of the content entity
<!-- template -->
<!DOCTYPE #IMPLIED SYSTEM [
<!ENTITY param SYSTEM>
]>
<div>&param;</div>
<!-- applying the template -->
<!LINK div [ param="value" ] div>
Templates also allow named parameters
sgmljs.net/docs/templating
<!LINK #INITIAL body #USELINK in-body body>
<!LINK in-body div
#USELINK in-body-div div
#POSTLINK after-body-div>
<!LINK in-body-div ...>
<!LINK after-body-div ...>
SGML LINK supports context-dependent application of link rules based on a state-transition/automaton notion
sgmljs.net/docs/templating
reveal.js div wrappers/CSS
embedding a slide show into a website
reveal.js theming (Bootstrap/SCSS tbd)
options for reveal.js controls and speaker notes
SVG positioning quirks for IE 11/Safari 7 and lower
zoom/scroll quirks on touch devices
CDN setup
sgmljs.net/docs/blog-tutorial.html
Web Authoring and Delivery based on ISO standards