SGML

Parsing HTML

File Index

parsing-html-tutorial.tgz: Download all files

sgmlproc: sgmlproc command-line app/script for Linux and Mac OS; note this file must be made executable after download using "chmod +x sgmlproc"

blogpage.html: Running sample HTML page (retrieved from http://blogs.harvard.edu/doc/ on Sep. 18th, 2019)

blogpage-with-html52mini-doctype.html: Sample HTML page with added HTML 5.2 Mini-DTD DOCTYPE

html52mini.dtd: W3C HTML 5.2 mini-DTD

blogpage-with-html52mini-doctype-and-custom-url.html: Sample HTML page with HTML 5.2 Mini-DTD DOCTYPE and URL attribute customization

blogpage-with-html5-doctype-and-modified-html51-dcl.sgm: Sample HTML for use with OpenSP

html5.dtd: (Legacy) full HTML 5 DTD for use with OpenSP

blogpage-with-html52mini-doctype-and-xhtml-ns-binding.html: Sample HTML page with DTD-based adding of an xmlns attribute

xhtml.lpd: Declarations for a link process applying mytemplate.sgm on the document element (the html element)

mytemplate.sgm: SGML template recreating an input HTML document with added xml:lang attribute

page-with-download-link-demo.html: Synthetic HTML file demonstrating custom parsing and normalization of download attributes in anchor elements

a-isdownload.sgm: Template for rewriting/normalizing anchor elements (variant with download attribute)

a-nodownload.sgm: Template for rewriting/normalizing anchor elements (variant without download attribute)

page-with-download-link.html: Refined HTML file demonstrating elaborated custom parsing and normalization of download attributes in anchor elements

htmlfix.lpd: Declarations for a link set applying normalization templates on anchors

a-isdownload-lax.sgm: Template for rewriting/normalizing anchor elements (refined variant with download attribute)

a-nodownload-lax.sgm: Template for rewriting/normalizing anchor elements (refined variant without download attribute)

Note: to run exercises in this tutorial, you need the sgmlproc command-line app for processing SGML. You can download sgmlproc for Linux or Mac OS as part of the download archive for this tutorial (see the Download all files or individual download item in the file index menu above). Alternatively, or if you're on Windows, you can download an equivalent sgmlproc command-line app by installing the SGML package for Node.js. When using Node.js, unless the SGML package is installed globally, sgmlproc is invoked by using node_modules/.bin/sgmlproc on the command line instead of ./sgmlproc.

Introduction

In this tutorial, we're going to learn how to parse real-world HTML as used on actual websites. As our example, we simply pick a page from the first site listed as being in danger of shutting down soon on https://indieweb.org/site-deaths: Doc Searls' blog (entries posted in 2019).

Parsing HTML using sgmljs.net SGML

Either open that page in a web browser and save it as a file in a temporary folder, or use the wget or curl command-line programs to do the same:

curl http://blogs.harvard.edu/doc/ > blogpage.html

If you're using a browser, this will create a file named "Doc Searls Weblog - Holding forth on stuff since 1998.html" (and potentially a similarly named folder containing images and other resources linked from the page, which we can ignore for now). For simplicity, it's recommended (and assumed in the following text) that you rename that file to just blogpage.html.

Now we can perform our first attempt at parsing the file using sgmlproc:

./sgmlproc blogpage.html

The output (standard and error output) will mix content and error messages such as the following (where the first type of error messages will appear multiple times, and the error messages generally will contain large JavaScript code text portions which are omitted for brevity here):

"blogpage.html": line 10: warning: '<lots of Javascript code text>' : unquoted '&' character
"blogpage.html": line 10: 'j.getContext': unresolved entity reference
"blogpage.html": line 12: fatal: '<i.lengt...': unterminated element or invalid < character in element or attributes

Using the HTML5.2 mini-DTD

What's going on is that SGML doesn't know that the & ampersand and < less-than characters shouldn't be treated as markup delimiters at the places it complains about, the latter error being a fatal error and making sgmlproc stop further processing.

To tell SGML that it is parsing HTML, we need to download an HTML 5 DTD, place it into the directory, and edit the blogpage.html file to declare a document type definition, which tells SGML that script element content should be treated as CDATA (unparsed character data). To do so, we open blogpage.html in a text editor and change

<!DOCTYPE html>

into

<!DOCTYPE html SYSTEM "html52mini.dtd">

where html52mini.dtd refers to a local download of the HTML 5.2 mini-DTD described in HTML5.2 DTD Reference.
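The DOCTYPE edit described above can also be scripted. The following is a minimal sketch, assuming the input spells its DOCTYPE exactly as <!DOCTYPE html>; the function name add_mini_dtd_doctype is made up for illustration and isn't part of sgmlproc:

```shell
# Print FILE with the bare HTML5 DOCTYPE replaced by one that points
# at the local mini-DTD. The input file itself is left untouched.
# Assumption: the DOCTYPE is spelled exactly "<!DOCTYPE html>".
add_mini_dtd_doctype() {
  sed 's|<!DOCTYPE html>|<!DOCTYPE html SYSTEM "html52mini.dtd">|' "$1"
}

# Usage, with the file names used in this tutorial:
#   add_mini_dtd_doctype blogpage.html > blogpage-with-html52mini-doctype.html
```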

With the changed file stored as blogpage-with-html52mini-doctype.html, executing

./sgmlproc blogpage-with-html52mini-doctype.html

again will make sgmlproc at least parse the file completely and output canonically-formatted HTML markup.

Customize URL parsing

However, sgmlproc still reports error messages, which are buried in its large terminal output. We therefore instruct sgmlproc to write its parsed and re-serialized markup into a file rather than onto the terminal:

./sgmlproc -- -o out.html blogpage-with-html52mini-doctype.html

This will make sgmlproc store HTML output into out.html and report only error messages onto the terminal:

"blogpage-with-html52mini-doctype.html": line 33: element LINK: attribute HREF: 'https://fonts.googleapis.com/css?family=Open+Sans%3A300italic%2C400italic%2C600italic%2C300%2C400%2C600&#038;subset=latin%2Clatin-ext&#038;ver=4.8.1': invalid value for declared data notation
"blogpage-with-html52mini-doctype.html": line 154: element A: attribute HREF: 'http://@robwilliamsNY': invalid value for declared data notation
"blogpage-with-html52mini-doctype.html": line 291: element A: attribute HREF: 'http://Marvel-Like Universe in Which All of Us are Enhanced': invalid value for declared data notation

The reason for sgmlproc warning about URLs not having proper syntax is that W3C's HTML5.2 specification (kind of) recommends use of RFC 3986 URLs as opposed to the more permissive variant specified in WHATWG's URL Standard.

The HTML5.2 DTD (and HTML5.2 mini-DTD) represents this lexical constraint by declaring URL-typed attributes with WebSGML data specification attributes for a custom URL lexical type derived from the description of HTML <form> input values. For example, the href attribute on HTML <a> anchor elements is declared as follows in html52mini.dtd:

<!ENTITY % if_uri_data_spec_attr "INCLUDE">
<![ %if_uri_data_spec_attr [
  <!NOTATION uri
    PUBLIC "+//IDN www.w3c.org/TR/html5//NOTATION HTML Form Input Types//EN">
  <!ATTLIST #NOTATION uri type (url) #FIXED url>
  <!ENTITY % URI "DATA uri">
]]>
<!ENTITY % no_uri_data_spec_attr "INCLUDE">
<![ %no_uri_data_spec_attr [
  <!ENTITY % URI "CDATA">
]]>
...
<!ATTLIST a href %URI #IMPLIED>

This declaration for the %URI parameter entity is put into a marked section so that it can be conditionally included or excluded based on the value of the if_uri_data_spec_attr parameter entity. To switch off checking for RFC 3986 conformance on URI-typed attributes, we can change the internal subset in the SGML prolog of our running blogpage example document to preempt if_uri_data_spec_attr with a value of IGNORE, thereby overriding the DTD's default of INCLUDE, and store the edited file as e.g. blogpage-with-html52mini-doctype-and-custom-url.html:

<!DOCTYPE html SYSTEM "html52mini.dtd" [
  <!ENTITY % if_uri_data_spec_attr "IGNORE">
]>

This will declare the href and other URI-typed attributes as plain CDATA attributes, and make our remaining error messages go away when re-invoking sgmlproc on it:

./sgmlproc -- -o out.html blogpage-with-html52mini-doctype-and-custom-url.html

Examining canonical HTML output

We may want to compare the input file to the output of sgmlproc to check if sgmlproc has actually done anything at all. So we're going to run basic file comparison on the input and output file:

diff blogpage-with-html52mini-doctype-and-custom-url.html out.html

Among lots of output from diff listing differences between input and output HTML, at the end of the file we see the following lines:

< <script type='text/javascript' src='https://stats.wp.com/e-201932.js' async defer></script>
< <script type='text/javascript'>
---
> <script async="async" defer="defer" src="https://stats.wp.com/e-201932.js" type="text/javascript"></script>
> <script type="text/javascript">

telling us that sgmlproc has indeed changed the enumerated attributes async and defer into the canonical notations async="async" and defer="defer", respectively. This also shows us that the input file indeed uses HTML 5 features (async and defer were introduced in HTML version 5).
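Since the full diff is long, it can help to narrow it down to lines mentioning a particular attribute. A small sketch using only diff and grep; diff_attr is a hypothetical helper name, not a command from the tutorial archive:

```shell
# diff_attr PATTERN OLD NEW
# Show only those diff lines between OLD and NEW that contain PATTERN.
# The exit status is grep's: 0 if at least one line matched.
diff_attr() {
  diff "$2" "$3" | grep -- "$1"
}

# Usage, with the file names used in this tutorial:
#   diff_attr defer blogpage-with-html52mini-doctype-and-custom-url.html out.html
```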

Resolving the HTML DTD by about:legacy-compat

Note that with sgmljs.net SGML, we could equivalently use the following:

<!DOCTYPE html SYSTEM "about:legacy-compat">

or, with our URL parsing customization included:

<!DOCTYPE html SYSTEM "about:legacy-compat" [
  <!ENTITY % if_uri_data_spec_attr "IGNORE">
]>

This needs some explanation: both <!DOCTYPE html> and <!DOCTYPE html SYSTEM "about:legacy-compat"> (but no other strings) are valid, interchangeable DOCTYPE strings as far as HTML is concerned - HTML simply ignores these DOCTYPE declarations at the beginning of a file. But to SGML, <!DOCTYPE html> means that the external subset (where markup declarations for elements, attributes, etc. are expected) is empty, whereas <!DOCTYPE html SYSTEM "about:legacy-compat"> tells SGML that the content of the external subset is to be found using a system identifier (e.g. a file name or similar) named about:legacy-compat. sgmljs.net SGML has built-in support for resolving about:legacy-compat to the W3C HTML 5.2 mini-DTD (the HTML declaration set is bundled in the sgmlproc executable).

Parsing HTML using OpenSP

The venerable OpenSP software (a fork of James Clark's original SP SGML processing package) is widely regarded as the SGML reference implementation. To make it work with modern HTML, we need to consider the following points.

Using html5.dtd: We must use the full HTML5 DTD which is the last backward-compatible HTML 5.x DTD that can be used with OpenSP (due to WebSGML features not implemented in OpenSP, such as declaring attributes on #ALL elements)

SGML declaration for HTML: We must include an SGML declaration (we're using the updated SGML declaration for HTML 4), and the SGML declaration used must contain MINIMIZE SHORTTAG YES rather than the WebSGML syntax for granular/unbundled SHORTTAG features

HTML predefined entities: Also because of OpenSP's lack of support for WebSGML predefined entities, the HTML 5 SP-compatibility DTD pulls in HTML predefined entities as regular entity declarations from their canonical location, rather than expecting those to be supplied as WebSGML predefined entities

UTF-8 encoding: We must supply SP_BCTF=utf-8 as environment variable to osgmlnorm so OpenSP interprets bytes/characters properly and doesn't balk at non-SGML characters

With these changes in place, our running example HTML begins as follows now (blogpage-with-html5-doctype-and-modified-html51-dcl.sgm):

<!SGML  "ISO 8879:1986 (WWW)"
    --
         Based on the SGML Declaration for HTML 4 (html.dcl), with
	 the following modifications:
         - adapted GRPGTCNT, GRPCNT, ATTCNT
	 - adapted to WebSGML: minimum data, using extended markup minimization
           feature syntax introduced with ISO 8879 Annex K

         See also html5-with-svg-and-mathml.dcl
    --
 
    CHARSET
         BASESET   "ISO Registration Number 177//CHARSET
                    ISO/IEC 10646-1:1993 UCS-4 with
                    implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET 0       9       UNUSED
                 9       2       9
                 11      2       UNUSED
                 13      1       13
                 14      18      UNUSED
                 32      95      32
                 127     1       UNUSED
                 128     32      UNUSED
                 160     55136   160
                 55296   2048    UNUSED  -- SURROGATES --
                 57344   1056768 57344

CAPACITY        SGMLREF

SCOPE    DOCUMENT

SYNTAX
         SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
           17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
         BASESET  "ISO 646IRV:1991//CHARSET
                   International Reference Version
                   (IRV)//ESC 2/8 4/2"
         DESCSET  0 128 0

         FUNCTION
                  RE            13
                  RS            10
                  SPACE         32
                  TAB SEPCHAR    9

         NAMING   LCNMSTRT ""
                  UCNMSTRT ""
                  LCNMCHAR ".-_:"    
                  UCNMCHAR ".-_:"
                  NAMECASE GENERAL YES
                           ENTITY  NO
         DELIM    GENERAL  SGMLREF
                  HCRO     "&#38;#x" -- 38 is the number for ampersand --
                  SHORTREF SGMLREF
         NAMES    SGMLREF
         QUANTITY SGMLREF
                  ATTCNT   384     -- increased for HTML51 + SVG   --
                  ATTSPLEN 65536   -- These are the largest values --
                  LITLEN   65536   -- permitted in the declaration --
                  NAMELEN  65536   -- Avoid fixed limits in actual --
                  PILEN    65536   -- implementations of HTML UA's --
                  TAGLVL   100
                  TAGLEN   65536
                  GRPGTCNT 1024    -- increased for MathML --
                  GRPCNT   256     -- increased for HTML 5, MathML --

FEATURES
        MINIMIZE DATATAG  NO
                 OMITTAG  YES
                 RANK     NO
                 SHORTTAG YES
                    -- WebSGML: --
                    --    STARTTAG EMPTY    NO
                                   UNCLOSED NO
                                   NETENABL NO
                          ENDTAG   EMPTY    NO
                                   UNCLOSED NO
                          ATTRIB   DEFAULT  YES
                                   OMITNAME YES
                                   VALUE    YES --
                 EMPTYNRM NO
                 IMPLYDEF ATTLIST  YES
                          DOCTYPE  NO
                          ELEMENT  NO
                          ENTITY   NO
                          NOTATION NO
         LINK
                 SIMPLE   NO
                 IMPLICIT NO
                 EXPLICIT NO
         OTHER
                 CONCUR   NO
                 SUBDOC   NO
                 FORMAL   NO
APPINFO NONE
>
<!DOCTYPE html SYSTEM "html5.dtd">
<html lang="en-US"><head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>Doc Searls Weblog &middot; Holding forth on stuff since 1998</title>
...

We can now invoke

SP_BCTF=utf-8 osgmlnorm blogpage-with-html5-doctype-and-modified-html51-dcl.sgm

(or another program from the OpenSP suite) for parsing our running HTML page with only 23 (recoverable) errors. Note these errors are HTML validation errors we haven't seen with sgmljs.net SGML since we've used the mini-DTD which doesn't have all element and attribute declarations.

While not a problem for our HTML at hand, in general OpenSP and other third-party SGML software without support for full WebSGML will have trouble parsing URI values (e.g. in HTML href and src attributes) when these contain & ampersand characters. Depending on the subsequent character, OpenSP will complain about unresolvable entity references formed by ampersand characters, or worse, wrongly expand what it sees as entity references when a general entity happens to be declared for a token appearing after an ampersand character in an HTML URI attribute.

In sgmljs.net SGML, this is solved using WebSGML data specification attributes (as explained above), which don't get entity-expanded. With OpenSP, on the other hand, this problem can only be solved by accepting (a potentially large number of) recoverable errors and letting OpenSP output URLs as-is from input.

Converting HTML to XML

Continuing with our running example, we now want to produce XML from the input, to then feed it into some of the many available XML processing tools for further extraction or other processing.

With sgmljs.net SGML, this can be done by using

./sgmlproc \
  -v output_format=xml \
  -v dtd_handling=omit \
  blogpage-with-html52mini-doctype-and-custom-url.html

and with OpenSP, by running

SP_BCTF=utf-8 osx -E 1000 \
  blogpage-with-html5-doctype-and-modified-html51-dcl.sgm

where osx is the program of the OpenSP suite specifically designed for XML conversion, and where we must increase OpenSP's threshold for errors to 1000 so that it doesn't prematurely abort processing due to too many character encoding errors.

With XML output serialization, elements with declared content EMPTY, such as the img, meta, and hr elements, will be output with end-element tags or XML-style empty-element tags.

Using dtd_handling=omit makes sgmlproc skip outputting a DOCTYPE declaration (which we must do because HTML DTDs are SGML DTDs, and can't be used with XML-only parsers).

Producing XML

So this is what the output of the sgmlproc command stated above looks like (with osx's output being similar):

<HTML LANG="EN-US"><HEAD>
    <META CONTENT="text/html; charset=UTF-8" HTTP-EQUIV="Content-Type">
    </META><TITLE>Doc Searls Weblog &#183; Holding forth on stuff since 1998</TITLE>
    <LINK HREF="//s0.wp.com" REL="dns-prefetch"></LINK>
<LINK HREF="//s.gravatar.com" REL="dns-prefetch"></LINK>
<LINK HREF="//s.w.org" REL="dns-prefetch"></LINK>
<LINK HREF="http://blogs.harvard.edu/doc/feed/" REL="alternate" TITLE="Doc Searls Weblog &#187; Feed" TYPE="application/rss+xml"></LINK>
<LINK HREF="http://blogs.harvard.edu/doc/comments/feed/" REL="alternate" TITLE="Doc Searls Weblog &#187; Comments Feed" TYPE="application/rss+xml"></LINK>
<SCRIPT TYPE="text/javascript"><![CDATA[
	window._wpemojiSettings = {"baseUrl":"https:\/\/s.w.org\/images\/core\/emoji\/2.3\/72x72\/","ext":".png","svgUrl":"https:\/\/s.w.org\/images\/core\/emoji\/2.3\/svg\/","svgExt":".svg","source":{"concatemoji":"http:\/\/blogs.harvard.edu\/doc\/wp-includes\/js\/wp-emoji-release.min.js?ver=4.8.1"}};
	...
]]></SCRIPT>
<STYLE TYPE="text/css">
	img.wp-smiley,
	img.emoji {
		display: inline !important;
		border: none !important;
		box-shadow: none !important;
		height: 1em !important;
		width: 1em !important;
		margin: 0 .07em !important;
		vertical-align: -0.1em !important;
		background: none !important;
		padding: 0 !important;
	}
</STYLE>
...
<META CONTENT="Holding forth on stuff since 1998" NAME="description">
</META><META CONTENT="all" NAME="robots">
</META><LINK HREF="http://gmpg.org/xfn/11" REL="profile">
</LINK><LINK HREF="http://blogs.harvard.edu/doc/wp-content/themes/tarski/style.css" MEDIA="all" REL="stylesheet" TYPE="text/css">
</LINK><LINK HREF="http://blogs.harvard.edu/doc/wp-content/themes/tarski/library/css/print.css" MEDIA="print" REL="stylesheet" TYPE="text/css">
...
</HEAD>

<BODY CLASS="home blog centre janus" ID="home">

<DIV CLASS="tarski" ID="wrapper">
    <DIV ID="header">
        <DIV ID="header-image"><A HREF="http://blogs.harvard.edu/doc/" REL="home" TITLE="Return to main page"><IMG ALT="Header image" SRC="http://blogs.harvard.edu/doc/files/2014/01/gregory_blogheader3.jpg"></IMG></A></DIV>

<DIV ID="title">
	<H1 ID="blog-title"><A HREF="http://blogs.harvard.edu/doc/" REL="home" TITLE="Return to main page">Doc Searls Weblog</A></H1>
	</DIV>
<DIV CLASS="clearfix" ID="navigation"><UL CLASS="primary xoxo" ID="menu-menu-1"><LI CLASS="menu-item menu-item-type-custom menu-item-object-custom menu-item-8399" ID="menu-item-8399"><A HREF="http://blogs.law.harvard.edu/doc/">Home</A></LI>
...

We note sgmlproc has properly generated end-element tags for elements declared EMPTY such as img, meta, and others, and has also put CDATA marked section markers around content containing < and & characters from elements having declared content CDATA such as script and style.

We can also use the xmllint command-line program (installed as part of libxml2 on Unix-like systems) to verify that the output of sgmlproc is indeed well-formed XML.
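As a sketch of such a check: xmllint's --noout option suppresses the re-serialized output so that only parse errors are reported, and the exit status tells whether the file parsed. Without a DTD, this checks well-formedness only. The helper name check_wellformed and the file name out.xml are assumptions for illustration:

```shell
# check_wellformed FILE
# Succeeds (exit status 0) if FILE parses as well-formed XML;
# otherwise xmllint prints the parse errors and the status is non-zero.
check_wellformed() {
  xmllint --noout "$1"
}

# Usage, assuming sgmlproc's XML output was written to out.xml
# (by adding "-- -o out.xml" to the sgmlproc invocation):
#   check_wellformed out.xml && echo "out.xml is well-formed"
```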

The result, while XML, leaves something to be desired, however, since tag and attribute names are produced in uppercase letters. This is because the file is being processed with the HTML 5 SGML declaration (either implicitly by sgmlproc when processing .html files, or explicitly by specifying an SGML declaration at the beginning of the file to parse), and the HTML 5 SGML declaration asserts SYNTAX NAMECASE GENERAL YES, which performs case-folding on element names, attribute names, and other name tokens.

To change this, we're going to parse our input document twice with sgmlproc, feeding the first parse's output into the second run - the first time with the (implicit) SGML declaration for HTML 5 as before and with output_format=html (which guarantees lowercase element and attribute names and other name tokens), and the second time with SYNTAX NAMECASE GENERAL NO and output_format=xml (note we leave out -v dtd_handling=omit from the first invocation, as opposed to our previous run):

./sgmlproc \
  -v output_format=html \
  -- -o out.html \
  blogpage-with-html52mini-doctype-and-custom-url.html

./sgmlproc \
  -v output_format=xml \
  -v dtd_handling=omit \
  -v sgmldecl_syntax_namecase_general=NO \
  -- -o out.xml \
  out.html

We can use dtd_handling=omit on the second invocation to get rid of DTD declarations, which we aren't going to need anymore for parsing, since the first parse has taken care of normalizing enumerated attributes into canonical (XML-like) syntax, and the second parse has taken care of producing end-element tags for img and other elements with declared content EMPTY, and also of putting CDATA section markers around script and style content where necessary.
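The two invocations can be wrapped into one helper so they are always run together. This is only a sketch: html_to_xml and the SGMLPROC variable are names invented here, while the sgmlproc options are exactly the ones shown above. The intermediate file keeps its .html suffix because, as noted earlier, sgmlproc applies the HTML 5 SGML declaration implicitly for .html files.

```shell
# Path to sgmlproc; override SGMLPROC if it lives elsewhere
# (for example node_modules/.bin/sgmlproc).
SGMLPROC="${SGMLPROC:-./sgmlproc}"

# html_to_xml INPUT.html OUTPUT.xml [INTERMEDIATE.html]
# Runs the two passes in sequence; INTERMEDIATE defaults to out.html.
html_to_xml() {
  input=$1
  output=$2
  intermediate=${3:-out.html}

  # Pass 1: re-serialize as HTML with lowercase names.
  "$SGMLPROC" \
    -v output_format=html \
    -- -o "$intermediate" \
    "$input" || return 1

  # Pass 2: re-parse without case folding and serialize as XML.
  "$SGMLPROC" \
    -v output_format=xml \
    -v dtd_handling=omit \
    -v sgmldecl_syntax_namecase_general=NO \
    -- -o "$output" \
    "$intermediate"
}

# Usage:
#   html_to_xml blogpage-with-html52mini-doctype-and-custom-url.html out.xml
```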

Note: we're leaving OpenSP behind here as it doesn't have these and other options we're using for conversion.

Producing XHTML

For XHTML proper, W3C's HTML 5.2 specification imposes a number of additional constraints on top of requiring generic XML syntax. We're going to focus on the first two items, and leave the other ones as an exercise:

the XHTML namespace must be asserted as default xmlns namespace binding for the document (or must be otherwise represented in a way compatible with XML namespaces)
HTML's lang attribute should be represented as xml:lang
the special handling of HTML's noscript element by browsers may make it desirable to remove noscript elements and their child content from XHTML-serialized HTML documents
HTML's href attribute on a base element, if any, should be propagated to an xml:base attribute on the html document element in order to make interpretation of relative URI values conformant with XML assumptions
if desired, HTML's id attribute(s) could be represented as xml:id to have XML validate uniqueness of identifiers without additional DTD declarations
XLink attributes on foreign elements in HTML (actuate, arcrole, href, role, show, title, type as xlink:actuate and so on) must be preserved
also, xmlns and xmlns:xlink attributes must be preserved (and, in general, HTML with embedded SVG and MathML must be handled, which we're not going to do here for space reasons)

Basic creation of XHTML

The first requirement - that of adding an XHTML namespace binding attribute to the html document element - is easy enough and can be achieved by merely customizing the internal subset for the HTML 5.2 mini-DTD by using the following declaration in place of the one we've used before:

<!DOCTYPE html SYSTEM "html52mini.dtd" [
  <!ENTITY % if_uri_data_spec_attr "IGNORE">
  <!ATTLIST html xmlns CDATA #FIXED "http://www.w3.org/1999/xhtml">
]>

(which adds an attribute list declaration to our earlier addition for preempting the if_uri_data_spec_attr parameter entity for URL customization)

This will always add (or enforce, if present) an additional xmlns attribute on the document element. To make use of it, we edit blogpage-with-html52mini-doctype.html as described, save the result as blogpage-with-html52mini-doctype-and-xhtml-ns-binding.html, and invoke

./sgmlproc \
  -v output_format=html \
  -v dtd_handling=omit \
  -- -o out.html \
  blogpage-with-html52mini-doctype-and-xhtml-ns-binding.html

in place of our earlier invocation.

Advanced XHTML conversion using SGML LINK

While this might suffice for basic documents, for additional XHTML rules we're going to have to use additional SGML concepts for describing markup transformations (we could of course also use XML-centric tools such as XSLT to do the same at this point).

SGML link process declarations (LPDs) are an additional type of declaration set (in addition to document type declarations) supported by SGML.

LPDs in the context of a larger SGML prolog can look as follows:

<!DOCTYPE doc [
	<!ELEMENT doc ... >
	<!ELEMENT el ... >
]>
<!LINKTYPE lnk doc #IMPLIED [
	<!ATTLIST el lnkatt ...>
	<!LINK #INITIAL el [ lnkatt="some value" ]>
]>
<doc>
<!-- ... further content goes here ... -->
</doc>

where the LPD lnk is declared as an implicit link process associating link attributes to elements declared in the doc document type. If it were declared as an explicit link process instead, the link process could e.g. take source markup according to the doc DTD, and produce target markup according to another DTD as specified in place of #IMPLIED in the link declaration. Multiple explicit link processes can be executed to form a pipeline where the result markup of one stage is fed as source markup stream into the next process, and pipelines can be configured automatically by SGML based on a desired target document type "view" requested by the user.

The <!LINK ... link set declaration in the example establishes some value as the value for the lnkatt link attribute on el elements. Similar to CSS properties in HTML, link attributes (those declared in a link process declaration) are not exposed as content attributes by SGML, even though they're declared using regular attribute list declaration syntax. Also much like CSS selectors, while the example link declaration assigns a value unconditionally, link attributes can also be conditionally assigned to content elements based on content attribute values and element state context.

When using OpenSP, the associated values of link attributes can be made visible using the onsgmls program, which produces ESIS output from markup (ESIS is a line-oriented markup representation for easy processing with classic Unix shell tools and Perl, and is also used in SGML test suites as reference output).

To produce the xmlns attribute using SGML LINK with sgmlproc, we can make use of sgmlproc's forward_link_attributes option to output link attributes as regular content attributes.

The xhtml link process

We're going to declare our link process for producing XHTML in an extra file (as external entity) rather than inline in the HTML input document because we want to apply a link process at the second stage (on the output of the initial html parse). We're going to use another special command-line flag available to sgmlproc specifically designed to behave as if a particular link process we're giving as command-line parameter is declared in our document when it's not actually contained in the SGML prolog.

We're not going to use the xmlns custom attribute technique explained in the previous section since we want to use LPDs for this purpose; hence we're working on our basic blogpage-with-html52mini-doctype-and-custom-url.html file again.

Our initial link process, test.lpd, looks as follows:

<!ATTLIST html xmlns CDATA #IMPLIED>
<!LINK #INITIAL html [ xmlns="http://www.w3.org/1999/xhtml" ]>

We invoke the first sgmlproc run as already explained above, and make use of our special command-line flags in the second sgmlproc invocation (note there's also a test.html file to check the second command on a much smaller input):

./sgmlproc \
  -v output_format=html \
  -- -o out.html \
  blogpage-with-html52mini-doctype-and-custom-url.html

./sgmlproc \
  -v output_format=xml \
  -v dtd_handling=omit \
  -v sgmldecl_syntax_namecase_general=NO \
  -v active_lpd_names=test \
  -v system_specific_implied_lpd_names=test \
  -v forward_link_attributes=YES \
  -- -o out.xhtml \
  out.html

sgmlproc is instructed to activate our test link process by use of the active_lpd_names parameter, and will receive link attribute and link set declarations from the test.lpd file (sgmlproc looks for a file named after the implied LPD and automatically adds the .lpd file suffix).
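Because the same set of options is needed for every link process applied at the second stage, the invocation can be parameterized by LPD name. A sketch under stated assumptions: apply_lpd and SGMLPROC are names made up here, and the check for NAME.lpd assumes sgmlproc looks for that file in the current directory, which the text above doesn't spell out.

```shell
# Path to sgmlproc; override SGMLPROC if it lives elsewhere.
SGMLPROC="${SGMLPROC:-./sgmlproc}"

# apply_lpd NAME INPUT.html OUTPUT
# Runs the second-stage conversion with link process NAME activated.
apply_lpd() {
  name=$1
  # Assumption: NAME.lpd is expected in the current directory.
  if [ ! -f "$name.lpd" ]; then
    echo "apply_lpd: $name.lpd not found" >&2
    return 1
  fi
  "$SGMLPROC" \
    -v output_format=xml \
    -v dtd_handling=omit \
    -v sgmldecl_syntax_namecase_general=NO \
    -v active_lpd_names="$name" \
    -v system_specific_implied_lpd_names="$name" \
    -v forward_link_attributes=YES \
    -- -o "$3" \
    "$2"
}

# Usage:
#   apply_lpd test out.html out.xhtml
```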

The result in out.xhtml is the same as what we produced by adding the xmlns attribute via DTD attribute declarations, and will have a new xmlns attribute set to the XHTML namespace URI on html.

But with this setup, as opposed to our simpler initial model, we can now add additional conversion rules. For a start, we're going to transform the lang attribute into an xml:lang attribute, using whatever value was actually specified in the lang source attribute rather than hard-coding the value in a DTD or LPD attribute declaration. To do this, we're once again rewriting our setup to make use of SGML templating instead of relying on forward_link_attributes.

Producing XHTML using templates

Templating is a versatile SGML technique introduced with sgmljs.net SGML to replace content of source files with that of "template" SGML files at places specified in link rules or #CONREF attributes in content, with type-safety and support for parameters. A template file to be inserted into result markup is a regular, standalone SGML file expected to parse as the element type which it replaces in source markup.

For example, to add an xml:lang attribute, the html element in our source document is targeted, and recreated using a template file such as the following:

<!DOCTYPE #IMPLIED SYSTEM [
  <!ENTITY lang SYSTEM>
  ...
]>
<html xml:lang="&lang">
...

This template receives the lang parameter as SGML system-specific entity (declared without a system identifier) and references the entity to obtain its value in attribute content.

DOCTYPE #IMPLIED SYSTEM is a WebSGML feature for determining the document element name from the first content element in a markup stream, and for obtaining DTD declarations via a system-specific entity, resolved to the file XXX.dtd by sgmljs.net SGML where XXX is the name of the document element.

Actually, we could leave out the DTD here altogether, since DOCTYPE #IMPLIED SYSTEM is assumed by default and the IMPLYDEF ENTITY YES semantics expressed in the default SGML declaration allow references to undeclared (general) entities with their declaration implied to be system-specific (as per the WebSGML specification).

Let's take a look at the place where this template is applied in the source document:

<!DOCTYPE html ...>
<!LINKTYPE xhtml html #IMPLIED [
  <!NOTATION tmpl
    PUBLIC
      "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
      "mytemplate.sgm">
  <!ATTLIST #NOTATION tmpl
    lang CDATA #IMPLIED>
  <!ATTLIST html
    template NOTATION (tmpl) #IMPLIED>
  <!LINK #INITIAL html [ template=tmpl ]>
]>
<html lang="en-US">...
...

These declarations

establish the tmpl notation as an SGML file stored in mytemplate.sgm with the lang data attribute
declare a link attribute on the html element with declared value a notation (the tmpl notation)
set up a link rule to assign the tmpl notation to the template link attribute on html

Using the SGML public identifier in this way makes sgmljs.net SGML apply the template on the html element.

At runtime, the value for the LANG data attribute is populated from the content attribute on the html element having the same name, as per the ISO 10744's DAFE specification.

But we're not done yet: since we're targeting the document element in this particular case, we also have to re-create the whole source document in the template file, including the child content of the source html element:

<!DOCTYPE #IMPLIED SYSTEM [
  <!ENTITY LANG SYSTEM>
  <!ENTITY content SYSTEM "<osfd>0">
]>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="&LANG">
&content
</html>

We do so by declaring an additional entity resolving to the "standard input" (Unix file descriptor 0). <osfd>0 is a Formal System Identifier, a notation introduced by ISO 10744's FSIDR specification, and refers to the standard input character stream. When executing a template processing sub-context on html, sgmljs.net SGML supplies the entire child content of the element on which the template is invoked via <osfd>0.

In effect, our template acts as an identity transform apart from the attributes it adds. We don't have to do anything special at the invoking site (in source markup), because sgmljs.net SGML populates <osfd>0 by default.
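As a rough mental model only (this is not how sgmljs.net SGML is implemented), the effect of applying the template can be sketched in Python: entity references in the template text are replaced by data attribute values, and the content entity is replaced by the child content that the real processor supplies via <osfd>0. The names TEMPLATE and expand_template are hypothetical, and the template text is abbreviated.

```python
import re

# Hypothetical, abbreviated stand-in for mytemplate.sgm; &LANG and &content
# mirror the entity references used in the template above.
TEMPLATE = '<html xml:lang="&LANG">\n&content\n</html>'


def expand_template(template, data_attributes, child_content):
    """Replace entity references by data attribute values or child content."""
    entities = dict(data_attributes)
    # The child content plays the role of the <osfd>0 input stream.
    entities["content"] = child_content

    def resolve(match):
        name = match.group(1)
        # References to unknown entities are left untouched.
        return entities.get(name, match.group(0))

    # An entity reference is "&name", optionally closed by ";".
    return re.sub(r"&([A-Za-z][A-Za-z0-9.-]*);?", resolve, template)


print(expand_template(TEMPLATE, {"LANG": "en-US"}, "<head></head>"))
```

The sketch ignores everything a real SGML parser does (declarations, validation, case folding); it only illustrates why the template reproduces its input with the added attribute.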

So let's put our template to work by storing the above snippet in mytemplate.sgm. Moreover, we create xhtml.lpd (which we're going to use to inject our LPD into out.html, as described before with test.lpd) as follows:

<!NOTATION tmpl
  PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
  "mytemplate.sgm">
<!ATTLIST #NOTATION tmpl
  lang CDATA #IMPLIED>
<!ATTLIST html
  template NOTATION (tmpl) #IMPLIED
  lang CDATA #IMPLIED>
<!LINK #INITIAL html [ template=tmpl ]>

Now by invoking our sgmlproc commands again (the first one is unchanged from before, and the second one activates the xhtml link process rather than test):

./sgmlproc \
  -v output_format=html \
  -- -o out.html \
  blogpage-with-html52mini-doctype-and-custom-url.html

./sgmlproc \
  -v output_format=xml \
  -v dtd_handling=omit \
  -v sgmldecl_syntax_namecase_general=NO \
  -v active_lpd_names=xhtml \
  -v system_specific_implied_lpd_names=xhtml \
  -v forward_link_attributes=YES \
  out.html

we obtain decent XML output that we can feed into (hypothetical) tools for further processing of XHTML.

Advanced parsing and DTD customization

Parsing `download` attributes

According to WHATWG specs since at least 2015 (but not in W3C HTML as of 5.2), download can be used either with or without an attribute value. In SGML terms, download can be used on <a> anchor elements either as an attribute with CDATA declared value or as a name token:

<a href="..." download="...">
<a href="..." download>

In our canonical output markup, we want download either to appear as a regular attribute with a value (the specified value, or the href value if the source markup gives none), or not to appear at all, neither as attribute nor as name token.
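The intended canonical form can be stated compactly as a small Python function. This is only a sketch of the target behavior described above, not of how sgmlproc computes it; the function name normalize_anchor and the convention of representing a bare download name token as None are assumptions of this sketch.

```python
def normalize_anchor(attrs):
    """Return canonical attributes for an <a> element.

    attrs maps attribute names to values. A bare `download` name token is
    represented as the key "download" mapped to None (an assumption of this
    sketch, not sgmlproc's internal representation).
    """
    result = {name: value for name, value in attrs.items() if name != "download"}
    if "download" in attrs:
        value = attrs["download"]
        # Bare token: fall back to the href value; otherwise keep the given value.
        result["download"] = attrs.get("href", "") if value is None else value
    return result


print(normalize_anchor({"href": "/someurl", "download": None}))
print(normalize_anchor({"href": "/otherurl"}))
```

An anchor without any download stays unchanged, an anchor with an explicit value keeps that value, and only the bare token is filled in from href.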

To make HTML parsing behave as expected, the following things have to be done:

an attribute (isdownload, say) must be declared in addition to those already declared in html52.dtd or html52mini.dtd; the isdownload attribute should be able to take on the download name token, such that the attribute-minimized form <a download> is treated the same as <a isdownload=download>
a link process needs to be applied, dispatching on the various alternatives for specifying a download link (e.g. an explicit value for download as opposed to the mere presence of the download name token)

Templating href attributes

While we have already customized the declaration of URI-typed attributes in our initial parsing examples above, there are additional considerations for templating URL attributes:

As part of templating <a> anchor elements, generally speaking, we'll be rewriting href values captured from source markup into entity references in attribute values specified in a template containing markup such as

<a href="&href ...>

As already explained, the href attribute is declared, via the URI parameter entity, as a data specification attribute and is thus treated as-is rather than being entity-expanded. If we hadn't already changed the declaration of %URI to CDATA above, we would have had to redeclare the href attribute at this point anyway, so that &href is recognized as an entity reference.

Comprehensive example

The following synthetic SGML document implements these customizations in its internal subset, and also includes an additional declaration for the download name token. Moreover, it includes a link process declaration for selecting an appropriate template based on whether

download is present as name token,
download isn't present at all (neither as name token nor as attribute), or
download is present as attribute,

respectively (page-with-download-link-demo.html):

<!DOCTYPE html SYSTEM "html52mini.dtd" [
  <!ENTITY % if_uri_data_spec_attr "IGNORE">
  <!ATTLIST a
    isdownload (download) #IMPLIED
    download CDATA #IMPLIED>
]>
<!LINKTYPE htmlfix html #IMPLIED [
  <!NOTATION a-isdownload
    PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
    "a-isdownload.sgm">
  <!ATTLIST #NOTATION a-isdownload
    isdownload (download) #IMPLIED
    download CDATA #IMPLIED
    href CDATA #IMPLIED>
  <!NOTATION a-nodownload
    PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
    "a-nodownload.sgm">
  <!ATTLIST #NOTATION a-nodownload
    isdownload (download) #IMPLIED
    download CDATA #IMPLIED
    href CDATA #IMPLIED>
  <!ATTLIST a
    template NOTATION (a-isdownload|a-nodownload) #IMPLIED
    isdownload (download) #IMPLIED
    download CDATA #IMPLIED
    href CDATA #IMPLIED>
  <!LINK #INITIAL
    a [ isdownload=download template=a-isdownload ]
    a [ download="" template=a-nodownload ]
    a [ template=a-isdownload ]>
]>
<html>
  <head>
    <title>Page containing minimized download attribute</title>
  </head>
  <body>
    <a href="/someurl" download>Download Link</a>
    <a href="/otherurl">Regular Link</a>
  </body>
</html>

The content of the a-isdownload.sgm and a-nodownload.sgm templates, respectively, is:

<!DOCTYPE #IMPLIED SYSTEM [
  <!ENTITY content SYSTEM "<osfd>0">
  <!ENTITY HREF SYSTEM>
]>
<a href="&HREF" download="&HREF">&content</a>

<!DOCTYPE #IMPLIED SYSTEM [
  <!ENTITY content SYSTEM "<osfd>0">
  <!ENTITY HREF SYSTEM>
]>
<a href="&HREF">&content</a>

To produce normalized <a> anchor elements from this document, run:

./sgmlproc \
  -v output_format=html \
  -v active_lpd_names=HTMLFIX \
  page-with-download-link-demo.html

The resulting markup with normalized <a> anchor elements looks like this:

<html>
  <head>
    <title>Page containing minimized download attribute</title>
  </head>
  <body>
    <a download="/someurl" href="/someurl">Download Link</a>
    <a href="/otherurl">Regular Link</a>
  </body>
</html>
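To sanity-check such output outside of SGML tooling, one option is a short script based on Python's standard html.parser module, which reports an attribute given without a value as None. The sketch below is an independent check and not part of the sgmlproc workflow; the names AnchorCollector and find_unnormalized are made up for illustration.

```python
from html.parser import HTMLParser

# The result markup shown above.
RESULT = """<html>
  <head>
    <title>Page containing minimized download attribute</title>
  </head>
  <body>
    <a download="/someurl" href="/someurl">Download Link</a>
    <a href="/otherurl">Regular Link</a>
  </body>
</html>"""


class AnchorCollector(HTMLParser):
    """Collect the attributes of every <a> start tag as a dict."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchors.append(dict(attrs))


def find_unnormalized(markup):
    """Return anchors that still carry download as a bare name token."""
    collector = AnchorCollector()
    collector.feed(markup)
    collector.close()
    # html.parser reports an attribute without a value as None.
    return [a for a in collector.anchors if "download" in a and a["download"] is None]


print(find_unnormalized(RESULT))
```

Running the same function on the unnormalized source anchors would report the minimized <a href="/someurl" download> form.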

Refining `download` attribute handling

With this solution, we have changed the href attribute declaration globally, and for all URI-typed attributes. This customized declaration for URI attributes (like those for all other elements and attributes) is propagated into the processing context for the template application on <a>.

We may want to restrict this interpretation to only href attributes in <a> elements, and to the template processing subcontext, rather than the primary parsing context.

For the first issue, we can alternatively preempt HTML.a.attlist to "IGNORE", and supply our own attribute declarations for <a>. For the second issue, we can restrict our modification to only the internal subset used in the template processing context for the template being applied on <a> elements.

To be able to influence the processing context used in a template application, we need to allow lax templating. Normally (in strict templating), sgmlproc checks that an SGML file used as a template has <!DOCTYPE ... SYSTEM> as its document prolog, where the declaration set resolved by SYSTEM is propagated by the calling main parsing context.

./sgmlproc \
  -v output_format=html \
  -v active_lpd_names=HTMLFIX \
  -v system_specific_implied_lpd_names=htmlfix \
  -v enable_lax_templates=YES \
  page-with-download-link.html

See page-with-download-link.html, htmlfix.lpd, a-isdownload-lax.sgm, and a-nodownload-lax.sgm, which implement this variant. Note that we again make use of the system_specific_implied_lpd_names feature to inject a link process declaration (with declarations from htmlfix.lpd) into the parsed HTML file.

Further customization

(left as an exercise)

consider validating variant types/content models of e.g. input elements (where an attribute determines the content model and/or type of other attributes)
consider checking accessibility such as checking for presence of alt attributes and proper use of ARIA