SGML

Converting HTML to X(HT)ML

blogpage-with-html52mini-doctype.html
Sample HTML page (random blog page) from previous tutorial
blogpage-with-html5-doctype-and-modified-html51-dcl.sgm
Sample HTML for use with OpenSP
blogpage-with-html52mini-doctype-and-xhtml-ns-binding.html
Sample HTML page with DTD-based adding of an xmlns attribute
xhtml.lpd
Declarations for a link process applying mytemplate.sgm on the document element (the html element)
mytemplate.sgm
SGML template recreating an input HTML document with added xml:lang attribute
Note: to execute the tests in this directory, download the sgmlproc command-line app for Linux or Mac OS, or get sgmlproc by installing the sgml package for Node.js. In the latter case, unless the sgml package is installed globally, invoke sgmlproc as node_modules/.bin/sgmlproc on the command line.

Converting HTML to XML

Continuing with our running example for HTML parsing, we now want to produce XML from the input, to then feed it into some of the many available XML processing tools for further extraction or other processing.

With sgmljs.net, this can be done by using

./sgmlproc -v output_format=xml -v dtd_handling=omit blogpage-with-html52mini-doctype.html

and with OpenSP, by running

SGML_BCTF=utf-8 osx -E 1000 blogpage-with-html5-doctype-and-modified-html51-dcl.sgm

where osx is the OpenSP program specifically designed for XML conversion, and where we must increase OpenSP's threshold for errors to 1000 so that it doesn't prematurely abort processing due to too many character encoding errors.

With XML output serialization, elements with declared content EMPTY, such as the img, meta, and hr elements, will be output with end-element tags or XML-style empty-element tags.

Using dtd_handling=omit makes sgmlproc skip outputting a DOCTYPE declaration (which we must do because HTML DTDs are SGML DTDs, and can't be used with XML-only parsers).

Producing XML

So this is what the output of the sgmlproc command stated above looks like (with osx's output being similar):

<HTML LANG="EN-US"><HEAD>
    <META CONTENT="text/html; charset=UTF-8" HTTP-EQUIV="Content-Type">
    </META><TITLE>Doc Searls Weblog &#183; Holding forth on stuff since 1998</TITLE>
    <LINK HREF="//s0.wp.com" REL="dns-prefetch"></LINK>
<LINK HREF="//s.gravatar.com" REL="dns-prefetch"></LINK>
<LINK HREF="//s.w.org" REL="dns-prefetch"></LINK>
<LINK HREF="http://blogs.harvard.edu/doc/feed/" REL="alternate" TITLE="Doc Searls Weblog &#187; Feed" TYPE="application/rss+xml"></LINK>
<LINK HREF="http://blogs.harvard.edu/doc/comments/feed/" REL="alternate" TITLE="Doc Searls Weblog &#187; Comments Feed" TYPE="application/rss+xml"></LINK>
<SCRIPT TYPE="text/javascript"><![CDATA[
	window._wpemojiSettings = {"baseUrl":"https:\/\/s.w.org\/images\/core\/emoji\/2.3\/72x72\/","ext":".png","svgUrl":"https:\/\/s.w.org\/images\/core\/emoji\/2.3\/svg\/","svgExt":".svg","source":{"concatemoji":"http:\/\/blogs.harvard.edu\/doc\/wp-includes\/js\/wp-emoji-release.min.js?ver=4.8.1"}};
	...
]]></SCRIPT>
<STYLE TYPE="text/css">
	img.wp-smiley,
	img.emoji {
		display: inline !important;
		border: none !important;
		box-shadow: none !important;
		height: 1em !important;
		width: 1em !important;
		margin: 0 .07em !important;
		vertical-align: -0.1em !important;
		background: none !important;
		padding: 0 !important;
	}
</STYLE>
...
<META CONTENT="Holding forth on stuff since 1998" NAME="description">
</META><META CONTENT="all" NAME="robots">
</META><LINK HREF="http://gmpg.org/xfn/11" REL="profile">
</LINK><LINK HREF="http://blogs.harvard.edu/doc/wp-content/themes/tarski/style.css" MEDIA="all" REL="stylesheet" TYPE="text/css">
</LINK><LINK HREF="http://blogs.harvard.edu/doc/wp-content/themes/tarski/library/css/print.css" MEDIA="print" REL="stylesheet" TYPE="text/css">
...
</HEAD>

<BODY CLASS="home blog centre janus" ID="home">

<DIV CLASS="tarski" ID="wrapper">
    <DIV ID="header">
        <DIV ID="header-image"><A HREF="http://blogs.harvard.edu/doc/" REL="home" TITLE="Return to main page"><IMG ALT="Header image" SRC="http://blogs.harvard.edu/doc/files/2014/01/gregory_blogheader3.jpg"></IMG></A></DIV>

<DIV ID="title">
	<H1 ID="blog-title"><A HREF="http://blogs.harvard.edu/doc/" REL="home" TITLE="Return to main page">Doc Searls Weblog</A></H1>
	</DIV>
<DIV CLASS="clearfix" ID="navigation"><UL CLASS="primary xoxo" ID="menu-menu-1"><LI CLASS="menu-item menu-item-type-custom menu-item-object-custom menu-item-8399" ID="menu-item-8399"><A HREF="http://blogs.law.harvard.edu/doc/">Home</A></LI>
...

We note sgmlproc has properly generated end-element tags for elements declared EMPTY such as img, meta, and others, and has also put CDATA marked section markers around content containing < and & characters from elements having declared content CDATA such as script and style.

We can also use the xmllint command-line program (installed as part of libxml2 on Unix-like systems) to verify that the output of sgmlproc is indeed well-formed XML.
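If xmllint is not at hand, the same well-formedness check can be approximated with Python's standard library. The following is a minimal sketch and not part of the sgmlproc tooling: check_well_formed is a hypothetical helper, and the two sample strings are shortened stand-ins modelled on the output shown above rather than actual sgmlproc output.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses as XML, else (False, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


# Stand-in for XML output: META carries an explicit end-element tag.
good = ('<HTML LANG="EN-US"><HEAD>'
        '<META CONTENT="all" NAME="robots"></META>'
        '<TITLE>t</TITLE></HEAD></HTML>')

# The same fragment as HTML would write it: META is left unclosed.
bad = ('<HTML LANG="EN-US"><HEAD>'
       '<META CONTENT="all" NAME="robots">'
       '<TITLE>t</TITLE></HEAD></HTML>')

for label, text in (("good", good), ("bad", bad)):
    ok, message = check_well_formed(text)
    print(label, ok, message)
```

The first sample parses cleanly, while the second is rejected with a parse error because the META element is never closed, which is exactly the difference the end-element tags generated for EMPTY elements make.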

The result, while XML, leaves something to be desired, however, since tag and attribute names are produced in uppercase letters. This is because the file is being processed with the HTML 5 SGML declaration (either implicitly by sgmlproc when processing .html files, or explicitly by specifying an SGML declaration at the beginning of the file to parse), and the HTML 5 SGML declaration asserts SYNTAX NAMECASE GENERAL YES, which performs case-folding on element names, attribute names, and other name tokens.

To change this, we're going to parse our input document twice with sgmlproc, feeding the first parse's output into the second run: the first time with the (implicit) SGML declaration for HTML 5 as before and with output_format=html (which guarantees lowercase element and attribute names and other name tokens), and the second time with SYNTAX NAMECASE GENERAL NO and output_format=xml. Note that, as opposed to our previous run, we leave out -v dtd_handling=omit from the first invocation:

./sgmlproc -v output_format=html blogpage-with-html52mini-doctype.html > out.html

./sgmlproc -v output_format=xml -v dtd_handling=omit -v sgmldecl_syntax_namecase_general=NO out.html > out.xml

We can use dtd_handling=omit on the second invocation to get rid of DTD declarations, which we aren't going to need anymore for parsing, since the first parse has taken care of normalizing enumerated attributes into canonical (XML-like) syntax, and the second parse has taken care of producing end-element tags for img and other elements with declared content EMPTY, and also of putting CDATA section markers around script and style content where necessary.
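For scripted use, the two invocations can be wrapped in a small driver. The sketch below is in Python and relies only on what the text above states: the ./sgmlproc path, the -v parameters, and the out.html and out.xml file names. The function names two_pass_commands and run_two_pass are made up for illustration.

```python
import subprocess


def two_pass_commands(source_html, sgmlproc="./sgmlproc"):
    """Return the argument lists for the two sgmlproc runs described above."""
    # First pass: HTML in, normalized lowercase HTML out (DTD kept).
    first = [sgmlproc, "-v", "output_format=html", source_html]
    # Second pass: re-parse out.html without case folding and emit XML.
    second = [
        sgmlproc,
        "-v", "output_format=xml",
        "-v", "dtd_handling=omit",
        "-v", "sgmldecl_syntax_namecase_general=NO",
        "out.html",
    ]
    return first, second


def run_two_pass(source_html, sgmlproc="./sgmlproc"):
    """Run both passes, writing out.html and then out.xml."""
    first, second = two_pass_commands(source_html, sgmlproc)
    with open("out.html", "wb") as out:
        subprocess.run(first, stdout=out, check=True)
    with open("out.xml", "wb") as out:
        subprocess.run(second, stdout=out, check=True)
```

Calling run_two_pass("blogpage-with-html52mini-doctype.html") would then reproduce the two shell commands; check=True makes the driver stop if the first pass fails, so the second pass never runs on a stale out.html.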

Note: we're leaving OpenSP behind here as it doesn't have these and other options we're using for conversion.

Producing XHTML

For XHTML proper, W3C's HTML 5.2 specification imposes a number of additional constraints on top of requiring generic XML syntax. We're going to focus on the first two items, and leave the other ones as an exercise:

  • the XHTML namespace must be asserted as default xmlns namespace binding for the document (or must be otherwise represented in a way compatible with XML namespaces)

  • HTML's lang attribute should be represented as xml:lang

  • the special handling of HTML's noscript element by browsers may make it desirable to remove noscript elements and their child content from XHTML-serialized HTML documents

  • HTML's href attribute on a base element, if any, should be propagated to an xml:base attribute on the html document element in order to make interpretation of relative URI values conformant with XML assumptions

  • if desired, HTML's id attribute(s) could be represented as xml:id to have XML validate uniqueness of identifiers without additional DTD declarations

  • XLink attributes on foreign elements in HTML (actuate, arcrole, href, role, show, title, type as xlink:actuate and so on) must be preserved

  • also, xmlns and xmlns:xlink attributes must be preserved (and, in general, HTML with embedded SVG and MathML must be handled, which we're not going to do here for space reasons)

Basic creation of XHTML

The first requirement - that of adding an XHTML namespace binding attribute to the html document element - is easy enough and can be achieved by merely customizing the internal subset for the HTML 5.2 mini-DTD by using the following declaration in place of the one we've used before:

<!DOCTYPE html SYSTEM "about:legacy-compat" [
	<!ATTLIST html xmlns CDATA #FIXED "http://www.w3.org/1999/xhtml">
]>

This will just always add (or enforce if present) an additional xmlns attribute on the document element. To make use of it, we edit blogpage-with-html52mini-doctype.html as described, save the result as blogpage-with-html52mini-doctype-and-xhtml-ns-binding.html, and invoke

./sgmlproc -v output_format=html -v dtd_handling=omit blogpage-with-html52mini-doctype-and-xhtml-ns-binding.html > out.html

in place of our earlier invocation.

While this might be enough for basic documents, for other XHTML rules we're going to have to use additional SGML concepts for describing markup transformations (we could of course also use XML-centric tools such as XSLT to do the same at this point).
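To make the XML-centric alternative concrete, here is a minimal sketch using Python's xml.etree.ElementTree rather than XSLT or the SGML techniques developed in the remainder of this tutorial. It covers only the first two requirements (default namespace binding and xml:lang), assumes lowercase, namespace-free XML input like that produced by the two-pass conversion, and to_xhtml is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize the XHTML namespace as the default (unprefixed) namespace.
# Note that this registration is global to the ElementTree module.
ET.register_namespace("", XHTML_NS)


def to_xhtml(xml_text):
    """Bind all elements to the XHTML namespace and turn lang into xml:lang."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Leave elements alone that already carry a namespace.
        if isinstance(elem.tag, str) and not elem.tag.startswith("{"):
            elem.tag = "{%s}%s" % (XHTML_NS, elem.tag)
    # Carry over whatever value the source document specified for lang.
    lang = root.attrib.pop("lang", None)
    if lang is not None:
        root.set("{%s}lang" % XML_NS, lang)
    return ET.tostring(root, encoding="unicode")


sample = ('<html lang="en-US"><head><title>t</title></head>'
          '<body><p>x</p></body></html>')
print(to_xhtml(sample))
```

Unlike the DTD-based approach, this operates on the already converted XML and therefore cannot rely on any SGML declarations; it is shown only for comparison.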

SGML link process declarations (LPDs) are an additional type of declaration set (in addition to document type declarations) supported by SGML.

LPDs in the context of a larger SGML prolog can look as follows:

<!DOCTYPE d [
	<!ELEMENT d ... >
	<!ELEMENT e ... >
]>
<!LINKTYPE l d #IMPLIED [
	<!ATTLIST e l ...>
	<!LINK #INITIAL e [ l="..." ]>
]>

where the LPD l is declared as an implicit link process associating link attributes to elements declared in the d document type. If it were declared as an explicit link process instead, the link process would take source markup according to the d DTD, and produce target markup according to another DTD as specified in place of #IMPLIED in the link declaration.

The <!LINK ... link set declaration in the example establishes a value for the l link attribute on e elements unconditionally. Note that link attributes (those declared in a link process declaration) are not exposed as content attributes by SGML. Rather, they are meant to be supplied to "the SGML application" (meaning a document renderer) in an unspecified way. Link attributes are similar to CSS properties in that they influence presentation, but are not as such part of the regular document markup stream. Link attributes can also be conditionally associated with content elements based on context, much like CSS selectors.

When using OpenSP, the associated values of link attributes can be made visible using the onsgmls program, which produces ESIS output from markup (ESIS is a line-oriented markup representation for easy processing with classic Unix shell tools and Perl, and is also used in SGML test suites as reference output).

To produce the xmlns attribute for XHTML using SGML LINK with sgmlproc, we can make use of sgmlproc's forward_link_attributes parameter to output link attributes as regular content attributes.

We're going to declare our link process for producing XHTML in an extra file (as external entity) rather than inline in the HTML input document because we want to apply a link process at the second stage (on the output of the initial html parse). We're going to use another special command-line flag available to sgmlproc specifically designed to behave as if a particular link process we're giving as parameter is declared in our document when it's not actually contained in the SGML prolog.

We're not going to use the xmlns custom attribute technique explained in the previous section since we want to use LPDs for this purpose; hence we're working on our basic blogpage-with-html52mini-doctype.html again.

Our initial link process, test.lpd, looks as follows:

<!ATTLIST html xmlns CDATA #IMPLIED>
<!LINK #INITIAL html [ xmlns="http://www.w3.org/1999/xhtml" ]>

We invoke the first sgmlproc run as already explained above, and make use of our special command-line flags in the second sgmlproc invocation (note there's also a test.html file to check the second command on a much smaller input):

./sgmlproc -v output_format=html blogpage-with-html52mini-doctype.html > out.html

./sgmlproc -v output_format=xml \
	-v dtd_handling=omit \
	-v sgmldecl_syntax_namecase_general=NO \
	-v active_lpd_names=test \
	-v system_specific_implied_lpd_names=test \
	-v forward_link_attributes=YES \
	out.html > out.xhtml

sgmlproc is instructed to activate our test link process by use of the active_lpd_names parameter, and will receive link attribute and link set declarations from the test.lpd file (sgmlproc looks for a file named after the implied LPD and automatically adds the .lpd file suffix).

The result in out.xhtml is the same as what we produced by adding the xmlns attribute via DTD attribute declarations, and will have a new xmlns attribute set to the XHTML namespace URI on html.

But with this setup, as opposed to our simpler initial model, we can now add additional conversion rules. For a start, we're going to transform the lang attribute into an xml:lang attribute, using whatever value was actually specified in the lang source attribute rather than hard-coding the value in a DTD or LPD attribute declaration. To do this, we're once again rewriting our setup to make use of SGML templating instead of relying on forward_link_attributes.

Producing XHTML using templates

Templating is a versatile SGML technique introduced with sgmljs.net SGML. It replaces content of source files with that of "template" SGML files at places specified in link rules or #CONREF attributes in content, with type-safety and support for parameters. A template file to be inserted into result markup is a regular, standalone SGML file expected to parse as the element type which it replaces in source markup.

For example, to add an xml:lang attribute, the html element in our source document is targeted, and recreated using a template file such as the following:

<!DOCTYPE #IMPLIED SYSTEM [
	<!ENTITY lang SYSTEM>
	...
]>
<html xml:lang="&lang">
...

This template receives the lang parameter as an SGML system-specific entity (declared without a system identifier) and references the entity to obtain its value in attribute content.

DOCTYPE #IMPLIED SYSTEM is a WebSGML feature for determining the document element name from the first content element in a markup stream, and for obtaining DTD declarations via a system-specific entity, resolved to the file XXX.dtd by sgmljs.net SGML where XXX is the name of the document element.

Actually, we could leave out the DTD here altogether, since DOCTYPE #IMPLIED SYSTEM is assumed by default, and the IMPLYDEF ENTITY YES semantics expressed in the default SGML declaration allow references to undeclared (general) entities with their declaration implied to be system-specific (as per the WebSGML specification).

Let's take a look at the place where this template is applied in the source document:

<!DOCTYPE html ...>
<!LINKTYPE xhtml html #IMPLIED [
	<!NOTATION tmpl
	  PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
	  "mytemplate.sgm">
	<!ATTLIST #NOTATION tmpl
		lang CDATA #IMPLIED>
	<!ATTLIST html
		template NOTATION (tmpl) #IMPLIED>
	<!LINK #INITIAL html [ template=tmpl ]>
]>
<html lang="en-US">...
...

These declarations

  • establish the tmpl notation as an SGML file stored in mytemplate.sgm with the lang data attribute

  • declare a link attribute on the html element with declared value a notation (the tmpl notation)

  • set up a link rule to assign the tmpl notation to the template link attribute on html

Using the SGML public identifier in this way makes sgmljs.net SGML apply the template on the html element.

At runtime, the value for the lang data attribute is populated from the content attribute of the same name on the html element, as per ISO 10744's DAFE specification.

But we're not done yet: since we're actually targeting the document element in this particular case, we basically also have to re-create the whole source document, including child content of the source html element, in the template file:

<!DOCTYPE #IMPLIED SYSTEM [
	<!ENTITY lang SYSTEM>
	<!ENTITY content SYSTEM "<osfd>0">
]>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="&lang">
&content
</html>

We do so by declaring an additional entity resolving to the "standard input" (Unix file descriptor 0). <osfd>0 is a Formal System Identifier and a notation to refer to the standard input character stream introduced by ISO 10744's FSIDR specification. sgmljs.net SGML, when executing a template processing sub-context on html, supplies the entire child content of the element on which the template is invoked via <osfd>0.

In effect, our template, up to adding xml:lang, is acting as an identity transform. We don't have anything special to do on the invoking site (in source markup), because sgmljs.net SGML populates <osfd>0 by default.

So let's put our template to work by storing the above snippet in mytemplate.sgm. Moreover, we create xhtml.lpd (which we're going to use to inject our LPD into out.html, as described before for test.lpd) as follows:

<!NOTATION tmpl
  PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
  "mytemplate.sgm">
<!ATTLIST #NOTATION tmpl
  lang CDATA #IMPLIED>
<!ATTLIST html
  template NOTATION (tmpl) #IMPLIED
  lang CDATA #IMPLIED>
<!LINK #INITIAL html [ template=tmpl ]>

Now by invoking our sgmlproc commands again (the first one is unchanged from before, and the second one activates the xhtml link process rather than test):

./sgmlproc -v output_format=html blogpage-with-html52mini-doctype.html > out.html

./sgmlproc -v output_format=xml \
	-v dtd_handling=omit \
	-v sgmldecl_syntax_namecase_general=NO \
	-v active_lpd_names=xhtml \
	-v system_specific_implied_lpd_names=xhtml \
	-v forward_link_attributes=YES \
	out.html > out.xhtml

we now obtain a decent out.xhtml file we can feed into (hypothetical) tools for further processing of XHTML.