SGML

Parsing HTML

parsing-html-tutorial.tgz
Download all
blogpage.html
Our running sample HTML page (from http://blogs.harvard.edu/doc/)
blogpage-with-html52mini-doctype.html
Sample HTML page with added HTML 5.2 Mini-DTD DOCTYPE
html52mini.dtd
W3C HTML 5.2 mini-DTD
blogpage-with-html5-doctype-and-modified-html51-dcl.sgm
Sample HTML for use with OpenSP
html5.dtd
(legacy) full HTML 5 DTD for use with OpenSP
blogpage-with-html52mini-doctype-and-xhtml-ns-binding.html
Sample HTML page with DTD-based adding of an xmlns attribute
xhtml.lpd
Declarations for a link process applying mytemplate.sgm on the document element (the html element)
mytemplate.sgm
SGML template recreating an input HTML document with added xml:lang attribute
page-with-download-link-demo.html
Synthetic HTML file demonstrating custom parsing and normalization of download attributes in anchor elements
a-isdownload.sgm
Template for rewriting/normalizing anchor elements (variant with download attribute)
a-nodownload.sgm
Template for rewriting/normalizing anchor elements (variant without download attribute)
page-with-download-link.html
Refined HTML file demonstrating elaborated custom parsing and normalization of download attributes in anchor elements
htmlfix.lpd
Declarations for a link set applying normalization templates on anchors
a-isdownload-lax.sgm
Template for rewriting/normalizing anchor elements (refined variant with download attribute)
a-nodownload-lax.sgm
Template for rewriting/normalizing anchor elements (variant without download attribute)
Note: to execute tests in this directory, download the sgmlproc command-line app for Linux or Mac OS, or get sgmlproc by installing the SGML package for Node.js (in the latter case, unless the sgml package is installed globally, sgmlproc is invoked by using node_modules/.bin/sgmlproc on the command line)

Introduction

In this tutorial, we're going to learn how to parse real-world HTML as used on actual websites. As our example, we pick a page from the first site listed as being in danger of shutting down soon on https://indieweb.org/site-deaths: Doc Searls' blog (whose most recent entries were posted in 2019).

Parsing HTML using sgmljs.net

Either open that page in a web browser and save it as a file in a temporary folder, or use the wget or curl command-line programs to do the same:

curl http://blogs.harvard.edu/doc/ > blogpage.html

If you're using a browser, this will create a file named "Doc Searls Weblog · Holding forth on stuff since 1998.html" (and potentially a folder named similarly containing images and other resources linked from the page, which we however don't bother with for now). For simplicity, it's recommended (and also assumed in the following text) to rename that long file to just blogpage.html.

Now we can perform our first attempt at parsing the file using sgmlproc:

./sgmlproc blogpage.html

The output (standard and error output) will mix content and error messages such as the following (where the first type of error messages will appear multiple times, and the error messages generally will contain large JavaScript code text portions which are omitted for brevity here):

"blogpage.html": line 10: warning: '<lots of Javascript code text>' : unquoted '&' character
"blogpage.html": line 10: 'j.getContext': unresolved entity reference
"blogpage.html": line 12: fatal: '<i.lengt...': unterminated element or invalid < character in element or attributes

What's going on is that SGML doesn't know that the & ampersand and < less-than characters shouldn't be treated as markup delimiters at the places it complains about, the latter error being a fatal error and making sgmlproc stop further processing.
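To make the cause of these messages concrete, here is a minimal Python sketch (not part of sgmlproc; the function name find_markup_lookalikes and the sample script line are made up for illustration). It scans text for the two patterns that SGML recognizes as markup in ordinary content: & followed by a name start character (an entity reference) and < followed by a name start character (a start tag). The set of name characters used here is an assumption modelled on HTML naming rules.

```python
import re

# "&" followed by a name start character opens an entity reference,
# "<" followed by a name start character opens a start tag.
# The name character class below is an assumption for illustration.
ENTITY_LOOKALIKE = re.compile(r"&([A-Za-z][A-Za-z0-9._:-]*)")
TAG_LOOKALIKE = re.compile(r"<([A-Za-z][A-Za-z0-9._:-]*)")


def find_markup_lookalikes(script_text):
    """Return (kind, name) pairs for text a DTD-less parser would take as markup."""
    found = [("entity", m.group(1)) for m in ENTITY_LOOKALIKE.finditer(script_text)]
    found += [("tag", m.group(1)) for m in TAG_LOOKALIKE.finditer(script_text)]
    return found


# Hypothetical JavaScript fragment resembling the code in the error messages.
sample = "for(r=0;r<i.length;r++){ok=a&&j.getContext}"
print(find_markup_lookalikes(sample))
```

Once script content is declared as CDATA, neither pattern is recognized as markup inside script elements, which is what the DOCTYPE change described next achieves.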

To tell SGML that it is parsing HTML, we need to download and place an HTML 5 DTD into the directory, edit the blogpage.html file and declare a document type definition, indicating to SGML that script element content should be treated as CDATA (unparsed character data). To do so, we open blogpage.html in a text editor and change

<!DOCTYPE html>

into

<!DOCTYPE html SYSTEM "html52mini.dtd">

where html52mini.dtd refers to a local download of the HTML 5.2 mini-DTD described in HTML5.2 DTD Reference.

With the changed file stored as blogpage-with-html52mini-doctype.html, executing

./sgmlproc blogpage-with-html52mini-doctype.html

again will make sgmlproc process the file successfully and output the expected, canonical HTML.

We may want to compare the input file to the output of sgmlproc to check if sgmlproc has actually done anything at all. So we're running

./sgmlproc blogpage-with-html52mini-doctype.html > out

again, this time storing the produced HTML in out. We then run basic file comparison on the input and output file:

diff blogpage-with-html52mini-doctype.html out

Among lots of output from diff listing differences between input and output HTML, we see the following lines at the end:

< <script type='text/javascript' src='https://stats.wp.com/e-201932.js' async defer></script>
< <script type='text/javascript'>
---
> <script async="async" defer="defer" src="https://stats.wp.com/e-201932.js" type="text/javascript"></script>
> <script type="text/javascript">

telling us that sgmlproc has indeed changed the enumerated attributes async and defer into the canonical notations async="async" and defer="defer", respectively. This also shows us that the input file indeed uses HTML 5 features (async and defer were introduced in HTML version 5).
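The attribute normalization visible in this diff can be imitated in a few lines of Python using the standard library's html.parser module. This is only a sketch of the effect, not of how sgmlproc works internally: the class and function names are hypothetical, and it assumes that a minimized attribute always expands to a value equal to its own name (true for HTML boolean attributes such as async and defer).

```python
from html import escape
from html.parser import HTMLParser


class StartTagCanonicalizer(HTMLParser):
    """Collects start tags with every attribute written as name="value"."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        # Sorted by attribute name only to mirror the ordering in the diff above.
        for name, value in sorted(attrs, key=lambda pair: pair[0]):
            if value is None:
                # Minimized form such as "async": assume the omitted value
                # equals the attribute name, as for HTML boolean attributes.
                value = name
            parts.append(f'{name}="{escape(value, quote=True)}"')
        self.tags.append("<" + " ".join(parts) + ">")


def canonical_start_tags(markup):
    """Return the canonicalized start tags found in markup, in document order."""
    parser = StartTagCanonicalizer()
    parser.feed(markup)
    parser.close()
    return parser.tags


line = ("<script type='text/javascript' "
        "src='https://stats.wp.com/e-201932.js' async defer></script>")
print(canonical_start_tags(line)[0])
```

Unlike this sketch, an SGML parser derives the expansion from the DTD: async is recognized as a value token of an enumerated attribute whose name may be omitted.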

Resolving the HTML DTD by about:legacy-compat

Note that with sgmljs.net SGML, we could equivalently use the following:

<!DOCTYPE html SYSTEM "about:legacy-compat">

This needs some explanation: both <!DOCTYPE html> and <!DOCTYPE html SYSTEM "about:legacy-compat"> (but no other strings) are valid, interchangeable DOCTYPE strings as far as HTML is concerned - HTML simply ignores these DOCTYPE declarations at the beginning of a file. But to SGML, <!DOCTYPE html> means that the external subset (where markup declarations for elements, attributes, etc. are expected) is empty, whereas <!DOCTYPE html SYSTEM "about:legacy-compat"> tells SGML that the content of the external subset is to be found using a system identifier (eg. a file name or similar) named about:legacy-compat. Now sgmljs.net SGML has built-in support for resolving about:legacy-compat to the W3C HTML 5.2 mini-DTD (it has the HTML declaration set bundled in the executable sgmlproc file).

Parsing HTML using OpenSP

The venerable OpenSP software (a fork of James Clark's original SP SGML processing package) is widely regarded as the SGML reference implementation. To make it work with modern HTML, we need to consider the following points.

Using html5.dtd

We must use the full HTML5 DTD which is the last backward-compatible HTML 5.x DTD that can be used with OpenSP (due to WebSGML features not implemented in OpenSP, such as declaring attributes on #ALL elements)

SGML declaration for HTML

We must include an SGML declaration (we're using the updated SGML declaration for HTML 4), and the SGML declaration used must contain MINIMIZE SHORTTAG YES rather than the WebSGML syntax for granular/unbundled SHORTTAG features

HTML predefined entities

Also because of OpenSP's lack of support for WebSGML predefined entities, the HTML 5 SP-compatibility DTD pulls-in HTML predefined entities as regular entity declarations from their canonical location, rather than expecting those to be supplied as WebSGML predefined entities

UTF-8 encoding

We must supply SP_BCTF=utf-8 as environment variable to osgmlnorm so OpenSP interprets bytes/characters properly and doesn't balk about non-SGML characters

With these changes in place, our running example HTML begins as follows now (blogpage-with-html5-doctype-and-modified-html51-dcl.sgm):

<!SGML  "ISO 8879:1986 (WWW)"
    --
         Based on the SGML Declaration for HTML 4 (html.dcl), with
	 the following modifications:
         - adapted GRPGTCNT, GRPCNT, ATTCNT
	 - adapted to WebSGML: minimum data, using extended markup minimization
           feature syntax introduced with ISO 8879 Annex K

         See also html5-with-svg-and-mathml.dcl
    --
 
    CHARSET
         BASESET   "ISO Registration Number 177//CHARSET
                    ISO/IEC 10646-1:1993 UCS-4 with
                    implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET 0       9       UNUSED
                 9       2       9
                 11      2       UNUSED
                 13      1       13
                 14      18      UNUSED
                 32      95      32
                 127     1       UNUSED
                 128     32      UNUSED
                 160     55136   160
                 55296   2048    UNUSED  -- SURROGATES --
                 57344   1056768 57344

CAPACITY        SGMLREF

SCOPE    DOCUMENT

SYNTAX
         SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
           17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
         BASESET  "ISO 646IRV:1991//CHARSET
                   International Reference Version
                   (IRV)//ESC 2/8 4/2"
         DESCSET  0 128 0

         FUNCTION
                  RE            13
                  RS            10
                  SPACE         32
                  TAB SEPCHAR    9

         NAMING   LCNMSTRT ""
                  UCNMSTRT ""
                  LCNMCHAR ".-_:"    
                  UCNMCHAR ".-_:"
                  NAMECASE GENERAL YES
                           ENTITY  NO
         DELIM    GENERAL  SGMLREF
                  HCRO     "&#38;#x" -- 38 is the number for ampersand --
                  SHORTREF SGMLREF
         NAMES    SGMLREF
         QUANTITY SGMLREF
                  ATTCNT   384     -- increased for HTML51 + SVG   --
                  ATTSPLEN 65536   -- These are the largest values --
                  LITLEN   65536   -- permitted in the declaration --
                  NAMELEN  65536   -- Avoid fixed limits in actual --
                  PILEN    65536   -- implementations of HTML UA's --
                  TAGLVL   100
                  TAGLEN   65536
                  GRPGTCNT 1024    -- increased for MathML --
                  GRPCNT   256     -- increased for HTML 5, MathML --

FEATURES
        MINIMIZE DATATAG  NO
                 OMITTAG  YES
                 RANK     NO
                 SHORTTAG YES
                    -- WebSGML: --
                    --    STARTTAG EMPTY    NO
                                   UNCLOSED NO
                                   NETENABL NO
                          ENDTAG   EMPTY    NO
                                   UNCLOSED NO
                          ATTRIB   DEFAULT  YES
                                   OMITNAME YES
                                   VALUE    YES --
                 EMPTYNRM NO
                 IMPLYDEF ATTLIST  YES
                          DOCTYPE  NO
                          ELEMENT  NO
                          ENTITY   NO
                          NOTATION NO
         LINK
                 SIMPLE   NO
                 IMPLICIT NO
                 EXPLICIT NO
         OTHER
                 CONCUR   NO
                 SUBDOC   NO
                 FORMAL   NO
APPINFO NONE
>
<!DOCTYPE html SYSTEM "html5.dtd">
<html lang="en-US"><head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>Doc Searls Weblog &middot; Holding forth on stuff since 1998</title>
...

We can now invoke

SP_BCTF=utf-8 osgmlnorm blogpage-with-html5-doctype-and-modified-html51-dcl.sgm

(or another program from the OpenSP suite) for parsing our running HTML page with only 23 (recoverable) errors. Note these errors are HTML validation errors we haven't seen with sgmljs.net SGML since we've used the mini-DTD which doesn't have all element and attribute declarations.

URI parsing troubles

Note that, while not a problem for our HTML at hand, in general OpenSP will have trouble parsing URI values (in eg. HTML href and src attributes) when these contain & ampersand characters. Depending on the subsequent character, OpenSP will complain about entity references formed by ampersand characters not being resolvable, or worse, bogusly expand what it sees as entity references when a general entity happens to be declared for a token appearing after ampersand characters in HTML URI attributes. In sgmljs.net SGML, this problem is solved using data specification attributes which don't get entity-expanded and can describe lexical types such as used in HTML form input validation.
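The following Python sketch illustrates which parts of a URI are at risk. It is an illustration under assumptions, not OpenSP's actual algorithm: the function name entity_lookalikes is hypothetical, the name character class is simplified, and Python's html.entities.name2codepoint table stands in for the set of general entities declared by the DTD.

```python
import re
from html.entities import name2codepoint

# "&" followed by a name start character is read as an entity reference.
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9._-]*)")


def entity_lookalikes(uri, declared=name2codepoint):
    """Return (name, is_declared) for each "&name" sequence in a URI value."""
    return [(m.group(1), m.group(1) in declared) for m in REFERENCE.finditer(uri)]


# Hypothetical URI with two query parameters after the first one.
print(entity_lookalikes("/search?q=sgml&copy=1&page=2"))
```

For the sample URI, copy is flagged as declared (the case where a parser would silently expand it, here to the copyright sign), while page is not declared (the case reported as an unresolvable entity reference).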

Converting HTML to XML

Continuing with our running example, we now want to produce XML from the input, to then feed it into some of the many available XML processing tools for further extraction or other processing.

With sgmljs.net, this can be done by using

./sgmlproc \
  -v output_format=xml \
  -v dtd_handling=omit \
  blogpage-with-html52mini-doctype.html

and with OpenSP, by running

SP_BCTF=utf-8 osx -E 1000 \
  blogpage-with-html5-doctype-and-modified-html51-dcl.sgm

where osx is the OpenSP program specifically designed for XML conversion, and where we must increase OpenSP's threshold for errors to 1000 so that it doesn't prematurely abort processing due to too many character encoding errors.

With XML output serialization, elements with declared content EMPTY, such as the img, meta, and hr elements, will be output with end-element tags or XML-style empty-element tags.

Using dtd_handling=omit makes sgmlproc skip outputting a DOCTYPE declaration (which we must do because HTML DTDs are SGML DTDs, and can't be used with XML-only parsers).

Producing XML

So this is what the output of the sgmlproc command stated above looks like (with osx's output being similar):

<HTML LANG="EN-US"><HEAD>
    <META CONTENT="text/html; charset=UTF-8" HTTP-EQUIV="Content-Type">
    </META><TITLE>Doc Searls Weblog &#183; Holding forth on stuff since 1998</TITLE>
    <LINK HREF="//s0.wp.com" REL="dns-prefetch"></LINK>
<LINK HREF="//s.gravatar.com" REL="dns-prefetch"></LINK>
<LINK HREF="//s.w.org" REL="dns-prefetch"></LINK>
<LINK HREF="http://blogs.harvard.edu/doc/feed/" REL="alternate" TITLE="Doc Searls Weblog &#187; Feed" TYPE="application/rss+xml"></LINK>
<LINK HREF="http://blogs.harvard.edu/doc/comments/feed/" REL="alternate" TITLE="Doc Searls Weblog &#187; Comments Feed" TYPE="application/rss+xml"></LINK>
<SCRIPT TYPE="text/javascript"><![CDATA[
	window._wpemojiSettings = {"baseUrl":"https:\/\/s.w.org\/images\/core\/emoji\/2.3\/72x72\/","ext":".png","svgUrl":"https:\/\/s.w.org\/images\/core\/emoji\/2.3\/svg\/","svgExt":".svg","source":{"concatemoji":"http:\/\/blogs.harvard.edu\/doc\/wp-includes\/js\/wp-emoji-release.min.js?ver=4.8.1"}};
	...
]]></SCRIPT>
<STYLE TYPE="text/css">
	img.wp-smiley,
	img.emoji {
		display: inline !important;
		border: none !important;
		box-shadow: none !important;
		height: 1em !important;
		width: 1em !important;
		margin: 0 .07em !important;
		vertical-align: -0.1em !important;
		background: none !important;
		padding: 0 !important;
	}
</STYLE>
...
<META CONTENT="Holding forth on stuff since 1998" NAME="description">
</META><META CONTENT="all" NAME="robots">
</META><LINK HREF="http://gmpg.org/xfn/11" REL="profile">
</LINK><LINK HREF="http://blogs.harvard.edu/doc/wp-content/themes/tarski/style.css" MEDIA="all" REL="stylesheet" TYPE="text/css">
</LINK><LINK HREF="http://blogs.harvard.edu/doc/wp-content/themes/tarski/library/css/print.css" MEDIA="print" REL="stylesheet" TYPE="text/css">
...
</HEAD>

<BODY CLASS="home blog centre janus" ID="home">

<DIV CLASS="tarski" ID="wrapper">
    <DIV ID="header">
        <DIV ID="header-image"><A HREF="http://blogs.harvard.edu/doc/" REL="home" TITLE="Return to main page"><IMG ALT="Header image" SRC="http://blogs.harvard.edu/doc/files/2014/01/gregory_blogheader3.jpg"></IMG></A></DIV>

<DIV ID="title">
	<H1 ID="blog-title"><A HREF="http://blogs.harvard.edu/doc/" REL="home" TITLE="Return to main page">Doc Searls Weblog</A></H1>
	</DIV>
<DIV CLASS="clearfix" ID="navigation"><UL CLASS="primary xoxo" ID="menu-menu-1"><LI CLASS="menu-item menu-item-type-custom menu-item-object-custom menu-item-8399" ID="menu-item-8399"><A HREF="http://blogs.law.harvard.edu/doc/">Home</A></LI>
...

We note sgmlproc has properly generated end-element tags for elements declared EMPTY such as img, meta, and others, and has also put CDATA marked section markers around content containing < and & characters from elements having declared content CDATA such as script and style.

We can also use the xmllint command-line program (installed as part of libxml2 on Unix-like systems) to verify that the output of sgmlproc is indeed well-formed XML.
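Where xmllint isn't at hand, a comparable check can be sketched with Python's standard library. The helper name is_well_formed and the two sample strings are made up for illustration; the check only covers well-formedness and performs no DTD validation.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Shaped like the sgmlproc output above: explicit end tags and a CDATA section.
xml_like = ('<HTML><HEAD><META NAME="robots" CONTENT="all"></META>'
            '<SCRIPT><![CDATA[ if (a < b && c) {} ]]></SCRIPT></HEAD></HTML>')

# Plain HTML serialization: the meta element has no end tag.
html_like = '<html><head><meta name="robots" content="all"></head></html>'

print(is_well_formed(xml_like), is_well_formed(html_like))
```

The second sample fails because an XML parser has no notion of elements with declared content EMPTY and therefore expects an end tag for meta.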

The result, while XML, leaves something to be desired, however, since tag and attribute names are produced in uppercase letters. This is because the file is being processed with the HTML 5 SGML declaration (either implicitly by sgmlproc when processing .html files, or explicitly by specifying an SGML declaration at the beginning of the file to parse), and the HTML 5 SGML declaration asserts SYNTAX NAMECASE GENERAL YES, which performs case-folding on element names, attribute names, and other name tokens.

To change this, we're going to parse our input document twice with sgmlproc, feeding the first parse's output into the second run: the first time with the (implicit) SGML declaration for HTML 5 as before and with output_format=html (which guarantees lowercase element and attribute names and other name tokens), and the second time with SYNTAX NAMECASE GENERAL NO and output_format=xml (note we leave out -v dtd_handling=omit from the first invocation, as opposed to our previous run):

./sgmlproc \
  -v output_format=html \
  blogpage-with-html52mini-doctype.html \
  > out.html

./sgmlproc \
  -v output_format=xml \
  -v dtd_handling=omit \
  -v sgmldecl_syntax_namecase_general=NO \
  out.html > out.xml

We can use dtd_handling=omit on the second invocation to get rid of DTD declarations, which we aren't going to need anymore for parsing, since the first parse has taken care of normalizing enumerated attributes into canonical (XML-like) syntax, and the second parse has taken care of producing end-element tags for img and other elements with declared content EMPTY, and also of putting CDATA section markers around script and style content where necessary.

Note: we're leaving OpenSP behind here as it doesn't have these and other options we're using for conversion.

Producing XHTML

For XHTML proper, W3C's HTML 5.2 specification imposes a number of additional constraints on top of requiring generic XML syntax. We're going to focus on the first two items, and leave the other ones as an exercise:

  • the XHTML namespace must be asserted as default xmlns namespace binding for the document (or must be otherwise represented in a way compatible with XML namespaces)

  • HTML's lang attribute should be represented as xml:lang

  • the special handling of HTML's noscript element by browsers may make it desirable to remove noscript elements and their child content from XHTML-serialized HTML documents

  • HTML's href attribute on a base element, if any, should be propagated to an xml:base attribute on the html document element in order to make interpretation of relative URI values conformant with XML assumptions

  • if desired, HTML's id attribute(s) could be represented as xml:id to have XML validate uniqueness of identifiers without additional DTD declarations

  • XLink attributes on foreign elements in HTML (actuate, arcrole, href, role, show, title, type as xlink:actuate and so on) must be preserved

  • also, xmlns and xmlns:xlink attributes must be preserved (and, in general, HTML with embedded SVG and MathML must be handled, which we're not going to do here for space reasons)

Basic creation of XHTML

The first requirement - that of adding an XHTML namespace binding attribute to the html document - is easy enough and can be achieved by merely customizing the internal subset for the HTML 5.2 mini-DTD by using the following declaration in place of the one we've used before:

<!DOCTYPE html SYSTEM "about:legacy-compat" [
	<!ATTLIST html xmlns CDATA #FIXED "http://www.w3.org/1999/xhtml">
]>

This will just always add (or enforce if present) an additional xmlns attribute on the document element. To make use of it, we edit blogpage-with-html52mini-doctype.html as described, save the result as blogpage-with-html52mini-doctype-and-xhtml-ns-binding.html, and invoke

./sgmlproc \
  -v output_format=html \
  -v dtd_handling=omit \
  blogpage-with-html52mini-doctype-and-xhtml-ns-binding.html \
  > out.html

in place of our earlier invocation.

While this might suffice for basic documents, for additional XHTML rules we're going to have to use additional SGML concepts for describing markup transformations (we could of course also use XML-centric tools such as XSLT to do the same at this point).

SGML link process declarations (LPDs) are an additional type of declaration set (in addition to document type declarations) supported by SGML.

LPDs in the context of a larger SGML prolog can look as follows:

<!DOCTYPE d [
	<!ELEMENT d ... >
	<!ELEMENT e ... >
]>
<!LINKTYPE l d #IMPLIED [
	<!ATTLIST e l ...>
	<!LINK #INITIAL e [ l="..." ]>
]>

where the LPD l is declared as an implicit link process associating link attributes to elements declared in the d document type. If it were declared as an explicit link process instead, the link process would take source markup according to the d DTD, and produce target markup according to another DTD as specified in place of #IMPLIED in the link declaration.

The <!LINK ... link set declaration in the example establishes a value for the l link attribute on e elements. Note that link attributes (those declared in a link process declaration) are not exposed as content attributes by SGML. Rather, they are meant to be merely supplied to an SGML application (such as a document renderer) as auxiliary information in an unspecified way. Link attributes are similar to CSS properties in that they influence presentation, but are not as such part of the regular document markup stream. While the example link declaration assigns a value unconditionally, link attributes can also be conditionally associated with content elements based on context and content attribute values, much like CSS selectors.

When using OpenSP, the associated values of link attributes can be made visible using the onsgmls program, which produces ESIS output from markup (ESIS is a line-oriented markup representation for easy processing with classic Unix shell tools and Perl, and is also used in SGML test suites as reference output).

To produce the xmlns attribute for XHTML using SGML LINK with sgmlproc, we can make use of sgmlproc's forward_link_attributes parameter to output link attributes as regular content attributes.

We're going to declare our link process for producing XHTML in an extra file (as external entity) rather than inline in the HTML input document because we want to apply a link process at the second stage (on the output of the initial html parse). We're going to use another special command-line flag available to sgmlproc specifically designed to behave as if a particular link process we're giving as parameter is declared in our document when it's not actually contained in the SGML prolog.

We're not going to use the xmlns custom attribute technique explained in the previous section since we want to use LPDs for this purpose; hence we're working on our basic blogpage-with-html52mini-doctype.html again.

Our initial link process, test.lpd, looks as follows:

<!ATTLIST html xmlns CDATA #IMPLIED>
<!LINK #INITIAL html [ xmlns="http://www.w3.org/1999/xhtml" ]>

We invoke the first sgmlproc run as already explained above, and make use of our special command-line flags in the second sgmlproc invocation (note there's also a test.html file to check the second command on a much smaller input):

./sgmlproc \
  -v output_format=html \
  blogpage-with-html52mini-doctype.html \
  > out.html

./sgmlproc \
  -v output_format=xml \
  -v dtd_handling=omit \
  -v sgmldecl_syntax_namecase_general=NO \
  -v active_lpd_names=test \
  -v system_specific_implied_lpd_names=test \
  -v forward_link_attributes=YES \
  out.html > out.xhtml

sgmlproc is instructed to activate our test link process by use of the active_lpd_names parameter, and will receive link attribute and link set declarations from the test.lpd file (sgmlproc looks for a file named after the implied LPD and automatically adds the .lpd file suffix).

The result in out.xhtml is the same as what we produced by adding the xmlns attribute via DTD attribute declarations, and will have a new xmlns attribute set to the XHTML namespace URI on html.

But with this setup, as opposed to our simpler initial model, we can now add additional conversion rules. For a start, we're going to transform the lang attribute into an xml:lang attribute, using whatever value was actually specified in the lang source attribute rather than hard-coding the value in a DTD or LPD attribute declaration. To do this, we're once again rewriting our setup to make use of SGML templating instead of relying on forward_link_attributes.

Producing XHTML using templates

Templating is a versatile SGML technique introduced with sgmljs.net SGML for replacing content of source files with that of "template" SGML files at places specified in link rules or #CONREF attributes in content, with type safety and support for parameters. A template file to be inserted into result markup is a regular, standalone SGML file expected to parse as the element type which it replaces in source markup.

For example, to add an xml:lang attribute, the html element in our source document is targeted, and recreated using a template file such as the following:

<!DOCTYPE #IMPLIED SYSTEM [
	<!ENTITY lang SYSTEM>
	...
]>
<html xml:lang="&lang">
...

This template receives the lang parameter as an SGML system-specific entity (declared without a system identifier) and references the entity to obtain its value in attribute content.

DOCTYPE #IMPLIED SYSTEM is a WebSGML feature for determining the document element name from the first content element in a markup stream, and for obtaining DTD declarations via a system-specific entity, resolved to the file XXX.dtd by sgmljs.net SGML where XXX is the name of the document element.

Actually, we could leave out the DTD here altogether, since DOCTYPE #IMPLIED SYSTEM is assumed by default and the IMPLYDEF ENTITY YES semantics expressed in the default SGML declaration allow references to undeclared (general) entities with their declaration implied to be system-specific (as per the WebSGML specification).

Let's take a look at the place where this template is applied in the source document:

<!DOCTYPE html ...>
<!LINKTYPE xhtml html #IMPLIED [
  <!NOTATION tmpl
    PUBLIC
      "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
    "mytemplate.sgm">
  <!ATTLIST #NOTATION tmpl
    lang CDATA #IMPLIED>
  <!ATTLIST html
    template NOTATION (tmpl) #IMPLIED>
  <!LINK #INITIAL html [ template=tmpl ]>
]>
<html lang="en-US">...
...

These declarations

  • establish the tmpl notation as an SGML file stored in mytemplate.sgm with the lang data attribute

  • declare a link attribute on the html element with declared value a notation (the tmpl notation)

  • set up a link rule to assign the tmpl notation to the notation link attribute on html

Using the SGML public identifier in this way makes sgmljs.net SGML apply the template on the html element.

At runtime, the value for the lang data attribute is populated from the content attribute of the same name on the html element, as per ISO 10744's DAFE specification.

But we're not ready yet: since we're actually targeting the document element in this particular case, we basically also re-create the whole source document, including child content of the source html element in the template file:

<!DOCTYPE #IMPLIED SYSTEM [
  <!ENTITY lang SYSTEM>
  <!ENTITY content SYSTEM "<osfd>0">
]>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="&lang">
&content
</html>

We do so by declaring an additional entity resolving to the "standard input" (Unix file descriptor 0). <osfd>0 is a Formal System Identifier, a notation introduced by ISO 10744's FSIDR specification, and refers to the standard input character stream. sgmljs.net SGML, when executing a template processing sub-context on html, supplies the entire child content of the element on which the template is invoked via <osfd>0.

In effect, our template, up to adding xml:lang, is acting as an identity transform. We don't have anything special to do on the invoking site (in source markup), because sgmljs.net SGML populates <osfd>0 by default.
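As a rough mental model of what the template does, the Python sketch below treats the template as text in which &lang and &content are replaced by supplied values. This is an analogy only, not sgmljs.net's implementation: the function apply_template is hypothetical, and real template processing parses the template as SGML with type checking rather than substituting strings.

```python
import re

# Entity reference: "&", a name, and an optional ";" terminator.
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9._-]*);?")


def apply_template(template_text, params):
    """Replace references to known parameters; leave other references untouched."""
    def substitute(match):
        return params.get(match.group(1), match.group(0))
    return REFERENCE.sub(substitute, template_text)


# Shortened stand-in for mytemplate.sgm (prolog omitted).
template = '<html xml:lang="&lang">\n&content\n</html>'

# "lang" comes from the source element's attribute, "content" from its children.
result = apply_template(template, {"lang": "en-US", "content": "<head>...</head>"})
print(result)
```

With lang taken from the source html element and content from its child markup, the result differs from the input only by the attributes the template adds, which is what makes it an identity transform otherwise.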

So let's put our template to work by storing the above snippet in mytemplate.sgm. Moreover, we create xhtml.lpd (which we're going to use to inject our LPD into out.html as described before using test.lpd) as follows:

<!NOTATION tmpl
  PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
  "mytemplate.sgm">
<!ATTLIST #NOTATION tmpl
  lang CDATA #IMPLIED>
<!ATTLIST html
  template NOTATION (tmpl) #IMPLIED
  lang CDATA #IMPLIED>
<!LINK #INITIAL html [ template=tmpl ]>

Now by invoking our sgmlproc commands again (the first one is unchanged from before, and the second one activates the xhtml link process rather than test):

./sgmlproc \
  -v output_format=html \
  blogpage-with-html52mini-doctype.html \
  > out.html

./sgmlproc \
  -v output_format=xml \
  -v dtd_handling=omit \
  -v sgmldecl_syntax_namecase_general=NO \
  -v active_lpd_names=xhtml \
  -v system_specific_implied_lpd_names=xhtml \
  -v forward_link_attributes=YES \
  out.html > out.xhtml

we now obtain a decent out.xhtml file we can feed into (hypothetical) tools for further processing of XHTML.

Advanced parsing and DTD customization

Parsing download attributes

download, according to WHATWG specs since at least 2015 (but not in W3C HTML as of 5.2), can be used either with or without an attribute value. In SGML terms, download can be used as an attribute with CDATA declared value, or can be used as a name token on <a> anchor elements:

<a href="..." download="...">
<a href="..." download>

In our canonical output markup, we either want to have download used as a regular attribute with a value (equal to href if missing in source markup, or otherwise with the specified value), or don't want to have download specified as attribute or name token at all.
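The intended result can be stated compactly as a function over an anchor's attributes. The Python sketch below is only a specification of the desired outcome, not of the SGML mechanism developed in the rest of this section; the function name normalize_download and the convention of representing a bare download token as None are assumptions made for illustration.

```python
def normalize_download(attrs):
    """Return anchor attributes with download normalized.

    attrs maps attribute names to values; a bare `download` token
    (no value given in source markup) is represented as None.
    """
    result = dict(attrs)
    if "download" not in result:
        # No download at all: the output has no download attribute either.
        return result
    if result["download"] is None:
        # Bare token: use the href value as the download value.
        result["download"] = result.get("href", "")
    return result


print(normalize_download({"href": "/someurl", "download": None}))
print(normalize_download({"href": "/otherurl"}))
print(normalize_download({"href": "/file", "download": "report.pdf"}))
```

The three calls correspond to the three cases: a bare download token, no download at all, and download with an explicit value.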

To make HTML parsing behave as expected, the following things have to be done:

  • an attribute (isdownload, say) must be declared in addition to those already declared in the html52.dtd or html52mini.dtd; the isdownload attribute should be able to take on the download name token, such that using the attribute-minimized form <a download> will be treated the same as <a isdownload=download>

  • a link process needs to be applied, dispatching on various alternatives for specifying a download link (eg. with an explicit value for download as opposed to the mere presence of the download name token)

Templating href attributes

As part of templating <a> anchor elements, generally speaking, we'll be rewriting href values captured from source markup into entity references in attribute values specified in a template containing markup such as

<a href="&href" ...>

The href attribute is declared, via the URI parameter entity in the HTML 5.2 DTD and Mini-DTD, as a data specification attribute (URI lexical type) and thus treated as-is, rather than entity-expanded. We therefore have to redeclare the href attribute such that the value above is recognized as an entity reference.

The html52.dtd and html52mini.dtd support, via preempting/redefining the if_uri_data_spec_attr parameter entity to an "IGNORE" value, the option of alternatively declaring URI values with CDATA declared value, as shown in the following example.

A comprehensive example

The following synthetic SGML document implements this customization in its internal subset, and also includes an additional declaration for the download name token. Moreover, it includes a link process declaration for selecting an appropriate template based on whether

  • download is present as name token,

  • download isn't present at all (neither as name token nor as attribute), or

  • download is present as attribute,

respectively (page-with-download-link-demo.html):

<!DOCTYPE html SYSTEM "html52mini.dtd" [
  <!ENTITY % if_uri_data_spec_attr "IGNORE">
  <!ATTLIST a
    isdownload (download) #IMPLIED
    download CDATA #IMPLIED>
]>
<!LINKTYPE htmlfix html #IMPLIED [
  <!NOTATION a-isdownload
    PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
    "a-isdownload.sgm">
  <!ATTLIST #NOTATION a-isdownload
    isdownload (download) #IMPLIED
    download CDATA #IMPLIED
    href CDATA #IMPLIED>
  <!NOTATION a-nodownload
    PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
    "a-nodownload.sgm">
  <!ATTLIST #NOTATION a-nodownload
    isdownload (download) #IMPLIED
    download CDATA #IMPLIED
    href CDATA #IMPLIED>
  <!ATTLIST a
    template NOTATION (a-isdownload|a-nodownload) #IMPLIED
    isdownload (download) #IMPLIED
    download CDATA #IMPLIED
    href CDATA #IMPLIED>
  <!LINK #INITIAL
    a [ isdownload=download template=a-isdownload ]
    a [ download="" template=a-nodownload ]
    a [ template=a-isdownload ]>
]>
<html>
  <head>
    <title>Page containing minimized download attribute</title>
  </head>
  <body>
    <a href="/someurl" download>Download Link</a>
    <a href="/otherurl">Regular Link</a>
  </body>
</html>

The contents of the a-isdownload.sgm and a-nodownload.sgm templates, respectively, are:

<!DOCTYPE #IMPLIED SYSTEM [
  <!ENTITY content SYSTEM "<osfd>0">
  <!ENTITY href SYSTEM>
]>
<a href="&href" download="&href">&content</a>

<!DOCTYPE #IMPLIED SYSTEM [
  <!ENTITY content SYSTEM "<osfd>0">
  <!ENTITY href SYSTEM>
]>
<a href="&href">&content</a>

To produce normalized <a> anchor elements from this document, invoke:

./sgmlproc \
  -v output_format=html \
  -v sgmldecl_syntax_namecase_general=NO \
  -v active_lpd_names=htmlfix \
  page-with-download-link-demo.html

The resulting markup with normalized <a> anchor elements looks like this:

<html>
  <head>
    <title>Page containing minimized download attribute</title>
  </head>
  <body>
    <a download="/someurl" href="/someurl">Download Link</a>
    <a href="/otherurl">Regular Link</a>
  </body>
</html>
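
For comparison, the same normalization can be approximated without SGML tooling. The sketch below uses Python's standard html.parser module and is only a cross-check of the expected result, not a replacement for the link process above. The names AnchorNormalizer and normalize are made up for this example; it handles only tags and text (comments and declarations are dropped), and it treats an empty download value like a missing one by assumption.

```python
from html import escape
from html.parser import HTMLParser


class AnchorNormalizer(HTMLParser):
    """Re-emit tags and text, normalizing download on <a> elements."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # valueless attributes have value None
        if tag == "a" and "download" in attrs and not attrs["download"]:
            # Bare (or empty) download: fall back to the href value.
            attrs["download"] = attrs.get("href", "")
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs.items()
        )
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape text.
        self.out.append(escape(data, quote=False))


def normalize(html_text):
    parser = AnchorNormalizer()
    parser.feed(html_text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


sample = (
    '<a href="/someurl" download>Download Link</a>\n'
    '<a href="/otherurl">Regular Link</a>'
)
print(normalize(sample))
```

Unlike the sgmlproc output shown above, this sketch keeps attributes in source order, so href is printed before download.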

Refining download attribute handling

With this solution, we have changed the href attribute declaration globally, and for all URI-typed attributes. This customized declaration for URI attributes (like those for all other elements and attributes) is propagated into the processing context for the template application on <a>.

We may want to restrict this interpretation to only href attributes in <a> elements, and to the template processing subcontext, rather than the primary parsing context.

For the first issue, we can alternatively preempt HTML.a.attlist to "IGNORE", and supply our own attribute declarations for <a>. For the second issue, we can restrict our modification to only the internal subset used in the template processing context for the template being applied on <a> elements.

To be able to influence the processing context used in a template application, we need to allow lax templating. Normally (in strict templating), sgmlproc checks that an SGML file used as a template has <!DOCTYPE ... SYSTEM> as its document prolog, where the declaration set resolved by SYSTEM is propagated by the calling main parsing context. Lax templating is enabled by setting enable_lax_templates=YES on the command line:

./sgmlproc \
  -v output_format=html \
  -v sgmldecl_syntax_namecase_general=NO \
  -v active_lpd_names=htmlfix \
  -v system_specific_implied_lpd_names=htmlfix \
  -v enable_lax_templates=YES \
  page-with-download-link.html

See page-with-download-link.html, htmlfix.lpd, a-isdownload-lax.sgm, and a-nodownload-lax.sgm, which implement this variant. Note that we again make use of the system_specific_implied_lpd_names feature to inject a link process declaration (with declarations from htmlfix.lpd) into the parsed HTML file.

Further customization

(left as an exercise)

  • consider validating variant types/content models of eg. input elements (where an attribute determines the content model and/or type of other attributes)

  • consider checking accessibility such as checking for presence of alt attributes and proper use of ARIA