SGML

Producing HTML

producing-html-tutorial.tgz
Download all files
sgmlproc
sgmlproc command-line app/script for Linux and Mac OS; note this file must be made executable after download using "chmod +x sgmlproc"
content-using-entity-references.sgm
Content file using (parsed) entity expansion for boilerplate inclusion
footer.ent
Footer boilerplate as text data
header.ent
Header boilerplate as text data
content-using-conref-templating.sgm
Content file using templating with #CONREF entities for boilerplate inclusion
footer.sgm
Footer boilerplate as SGML document (with DOCTYPE)
header.sgm
Header boilerplate as SGML document (with DOCTYPE)
content-using-conref-templating-emptynrm.sgm
Content file using templating with #CONREF entities with omitted end-element tags on header and footer elements
markdown-emph.sgm
Basic use of short references to produce span-level markup
markdown-headings.sgm
Simplistic use of short references to produce headings and paragraphs
markdown-headings-builtin.sgm
Producing headings with built-in markdown parsing embedded in SGML
markdown-headings-builtin.md
Producing headings from regular markdown (.md) file
outlining1.sgm
Producing section elements from heading elements
outlining2.sgm
Producing skeletal navigation links from heading elements
tocdoc1.sgm
Example for a basic template adding a table of contents
tocdoc2.sgm
As before, but using an entity to pull-in content instead
tocdoc-content.sgm
Separated-out content file used by tocdoc2.sgm
doc1.sgm
Preliminary page template for hierarchical text rendering
doc.sgm
Final page template for hierarchical text rendering
anchor.sgm
Link decoration template applied on anchor elements
doc/content1
Content file using regular SGML syntax
doc/markdowncontent1
Content file using markdown syntax
Note: this tutorial requires the Node.js SGML package, which provides the sgmlproc command-line app and serves HTML and SGML on the web. To make use of it, first install Node.js, making sure that the node and npm command-line apps that come with Node.js can be run from a terminal (eg. by typing npm on the command line). Then create a fresh directory, change into it, and install SGML by invoking npm install -g sgml on the command line.

Introduction

This tutorial gives an introduction to building basic websites from HTML and markdown text fragments with the help of light content extraction and other SGML transformation techniques for generating page navigation.

Composing HTML documents

As a simplistic example of organizing web content around shared common content, we're going to add header and footer boilerplate to an SGML file that we intend to publish as HTML on the web. The expectation here is that we're going to have multiple pages, each sharing common head metadata, header content (with eg. a menu), footer content (with eg. legal notices), and similar shared content.

So this is what our produced content file(s) should roughly look like:

<html>
  <head>
    <title> ... </title>
  </head>
  <body>
    <header> ... </header>
    <main> ... </main>
    <footer> ... </footer>
  </body>
</html>

where we want to have boilerplate for head, header, and footer populated by using SGML, and keep actual content files free from redundant head, header, and footer elements. Instead, we want our content files to look as follows:

<title>The title</title>
<p>Body text</p>

A simple way to do this (with just header and footer content for now) is storing header and footer content in separate files and using SGML general entities to pull content in from those files into our main content file(s) (content-using-entity-references.sgm):

<!DOCTYPE html [
  <!ENTITY header SYSTEM "header.ent">
  <!ENTITY footer SYSTEM "footer.ent">
]>
<html>
  <head>
    <title>The title</title>
  </head>
  <body>
    &header
    <p>Body text</p>
    &footer
  </body>
</html>

where header.ent contains HTML text such as

<header>
  <h1>My Site</h1>
</header>

and footer.ent contains eg.

<footer>
  <p>Copyright by me;</p>
  <p>Contact: abuse@mysite.com</p>
</footer>

To preview the result HTML, use

sgmlproc content-using-entity-references.sgm

to produce the expected, completely assembled HTML file in a terminal on the command-line.
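
To make the mechanism concrete, here is a minimal JavaScript sketch of what entity expansion amounts to: each reference is replaced by the entity's replacement text. It is an illustration only and not how sgmlproc is implemented; the expandEntities function and the in-memory entities object (standing in for header.ent and footer.ent) are hypothetical names introduced for this sketch, and only the &name and &name; reference forms are handled.

```javascript
// Illustrative only: mimics replacing general entity references with their
// replacement text. This is not sgmlproc's implementation.

// Hypothetical in-memory stand-ins for the content of header.ent and footer.ent.
const entities = {
  header: '<header>\n  <h1>My Site</h1>\n</header>',
  footer: '<footer>\n  <p>Copyright by me;</p>\n</footer>'
}

// Replace each &name or &name; reference with the declared entity text;
// references to undeclared names are left untouched.
function expandEntities(text, declared) {
  return text.replace(/&([A-Za-z][A-Za-z0-9.-]*);?/g, function (ref, name) {
    return Object.prototype.hasOwnProperty.call(declared, name)
      ? declared[name]
      : ref
  })
}

const page = '<body>\n&header\n<p>Body text</p>\n&footer\n</body>'
console.log(expandEntities(page, entities))
```

A real SGML parser does considerably more (for example, it parses the replacement text in the context of the reference), so this only conveys the basic idea of boilerplate inclusion.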

Serving SGML as HTML on the web

To serve SGML files in our directory on the web, start sgmljs.net SGML's default web app:

sgmlweb-app

You'll then be able to open a web browser and point it at http://localhost:8080/content-using-entity-references to see our humble first web page, as assembled dynamically at request processing time.

Taking a closer look at node_modules/sgml/sgmlweb-app.js shows it just acts as an express.js "middleware":

/**
 * SGML web server app using expressjs.
 */
var express = require('express')
var sgml = require('sgml')

var app = express()

// content rendering
app.use(function(req, res, next) {
	sgml.middleware()(null, req, res, next)
})

// error page rendering
app.use(function(err, req, res, next) {

	// only render error page on error status codes;
	// (eg. neither on 304 Not Modified, redirects, or when
	// primary middleware rendering has encountered an error during
	// content parsing/processing and has already set
	// 200 status/sent headers; in the latter case,
	// we want expressjs' finalhandler to just close the socket
	// instead)
	if (res.statusCode < 400 || res.statusCode > 599) {
		next(err)
		return
	}
	req.method = 'GET'
	req.pathTranslated = ''
	req.pathInfo = ''
	req.scriptName = process.cwd() + '/error.sgm'
	req.url = ''
	req.queryString = ''
	req.queryStringDecoded = ''
	sgml.middleware()(err, req, res)
})

app.listen(8080)

Actually, it mounts sgml.middleware twice into the rendering pipeline, the first call sgml.middleware()(null, req, res, next) being executed in response to a regular content request, and the second one only executed to render an error page if regular rendering failed. Note we don't have error.sgm in place in our directory, so that second attempt will always fail, and just send an empty page to the browser.

We can test this by requesting a non-existent page. For example, let's open http://localhost:8080/nonexistant, which will make sgmlweb-app respond with an empty page. Now if we create an error page, error.sgm, in the tutorial directory from which we're running sgmlweb-app, with the following content:

<!DOCTYPE html SYSTEM "about:legacy-compat" [
  <!ENTITY STATUS SYSTEM>
]>
<html>
<head>
  <title>Error &STATUS</title>
</head>
<body>
Error serving requested page
</body>
</html>

and attempt to reload http://localhost:8080/nonexistant, we'll receive a proper error page having "Error 404" in the title. We could elaborate our basic error page to include whatever content we wish; right now, it only receives STATUS, containing the numerical HTTP status 404 (for "NOT FOUND"), as a system-specific entity.

As you can see, sgmlweb-app is just a demo app using Node.js and express.js in the most straightforward way; for running production websites on sgmljs.net SGML and Node.js, you may want to refer to express.js' documentation for configuring security settings such as SSL/https keys, etc.
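
The status check in the error handler of sgmlweb-app.js can be isolated as a small predicate, which makes the rule easy to try out. The function name shouldRenderErrorPage is a hypothetical helper introduced for this sketch; it merely restates the condition from the listing above.

```javascript
// Restates the status check from the sgmlweb-app.js listing: an error page
// is rendered only when the response status is in the 4xx/5xx range.
// shouldRenderErrorPage is a hypothetical helper name, not part of the sgml package.
function shouldRenderErrorPage(statusCode) {
  return statusCode >= 400 && statusCode <= 599
}

for (const status of [200, 304, 404, 500]) {
  console.log(status, shouldRenderErrorPage(status))
}
```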

Parsing markdown

Parsing markdown using short references

Custom Wiki syntaxes such as markdown are as old as digital text processing itself. SGML lets you define element context-specific token replacement rules for this purpose. For example, to make SGML format a simplistic markdown fragment into HTML, you could use an SGML prolog like this (markdown-emph.sgm):

<!DOCTYPE p [
  <!ELEMENT p - - ANY>
  <!ELEMENT em - - (#PCDATA)>
  <!ENTITY start-em '<em>'>
  <!ENTITY end-em '</em>'>
  <!SHORTREF in-p '*' start-em>
  <!SHORTREF in-em '*' end-em>
  <!USEMAP in-p p>
  <!USEMAP in-em em>
]>
<p>The following text:
   *this*
   will be put into EM
   element tags</p>

If processed with sgmlproc eg.

sgmlproc markdown-emph.sgm

SGML will produce canonical syntax as follows:

<p>The following text:
   <em>this</em>
   will be put into EM
   element tags</p>

This works by declaring SHORTREF short reference maps (in-p and in-em) that associate tokens (the * asterisk token in both maps) with replacement entities, and then making those maps active in a given element context via USEMAP short reference use declarations.

If the context (top-most) element is em, the in-em shortref map is current (as per the second USEMAP declaration), which defines the replacement text for * to be </em>, ending the emphasized text span. Whereas within p, it's <em>, starting an emphasized text span, and making em the context element.

As a slight variation, h2 heading elements can be produced from text enclosed in double-hashmark (##) characters, as used in markdown syntax, with p paragraph elements being inferred for the body text via omitted tags (markdown-headings.sgm):

<!DOCTYPE body [
  <!ELEMENT body O O ((h2,p)+)>
  <!ELEMENT p O O (#PCDATA)>
  <!ELEMENT h2 - - (#PCDATA)>
  <!ENTITY start-h2 '<h2>'>
  <!ENTITY end-h2 '</h2>'>
  <!SHORTREF in-body '##' start-h2>
  <!SHORTREF in-h2 '##' end-h2>
  <!USEMAP in-body body>
  <!USEMAP in-h2 h2>
]>
<body>

## Heading 1 ##

Body text of first section.

</body>
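
The context-dependent replacement behind both examples can be modelled in a few lines of JavaScript. This is a toy model under simplifying assumptions, not sgmljs.net SGML's parser: it only handles single-character tokens, and the names shortrefMaps and applyShortrefs are hypothetical.

```javascript
// Illustrative only: a tiny model of context-dependent short references.
// Each map associates the '*' token with replacement text; which map is
// current depends on the innermost open element.
const shortrefMaps = {
  p: { '*': '<em>' },   // like SHORTREF in-p '*' start-em with USEMAP in-p p
  em: { '*': '</em>' }  // like SHORTREF in-em '*' end-em with USEMAP in-em em
}

function applyShortrefs(text, contextElement) {
  const open = [contextElement] // stack of open elements, innermost last
  let result = ''
  for (const ch of text) {
    const map = shortrefMaps[open[open.length - 1]] || {}
    const replacement = map[ch]
    if (replacement === undefined) {
      result += ch // not a short reference in the current context
    } else if (replacement.startsWith('</')) {
      open.pop() // an end tag closes the current element
      result += replacement
    } else {
      open.push(replacement.slice(1, -1)) // a start tag opens a new element
      result += replacement
    }
  }
  return result
}

console.log(applyShortrefs('The following text: *this* will be put into EM', 'p'))
```

The same * token yields a start tag within p and an end tag within em, which is the essence of switching maps via USEMAP.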

Parsing full markdown syntax using sgmljs.net SGML

For formatting full markdown with all bells and whistles as known from popular sites such as github.com, sgmljs.net SGML has built-in short reference rules that, when referenced (included) in the base document type declaration via a parameter entity, will make sgmljs.net SGML recognize and format unrestricted markdown into HTML as expected:

<!ENTITY % md_shortref_maps
  PUBLIC "+//IDN sgmljs.net//SHORTREF Markdown//EN">
%md_shortref_maps;

This declares the md_shortref_maps entity to contain short references rules for full markdown via the public identifier (symbolic name) +//IDN sgmljs.net//SHORTREF Markdown//EN, and then references the entity such that it becomes part of the markup declarations in which it is referenced, acting as if it were declared in place of the reference much like general entities for content we've used above.

The former example, rewritten to make use of built-in shortref rules for markdown, looks as follows (markdown-headings-builtin.sgm):

<!SGML MARKDOWN PUBLIC "+//IDN sgmljs.net//SD Markdown//EN">
<!DOCTYPE body [
  <!ELEMENT body O O ((h2,p)+)>
  <!ELEMENT p O O (#PCDATA)>
  <!ELEMENT h2 - - (#PCDATA)>
  <!ENTITY % md_shortref_maps
    PUBLIC "+//IDN sgmljs.net//SHORTREF Markdown//EN">
  %md_shortref_maps;
]>
<body>

## Heading 1 ##

Body text of first section.

</body>

The first line is an SGML declaration reference we need to include such that sgmlproc assumes availability of short reference delimiters needed for markdown and HTML naming rules in a way that is compatible with third-party SGML software.

Pulling in md_shortref_maps enables comprehensive markdown formatting. Note there's no actual short reference declaration set being resolved by the +//IDN sgmljs.net//SHORTREF Markdown//EN public identifier; these declarations are recognized specially by sgmljs.net SGML and are implemented using an internal markdown-to-HTML converter. Markdown formatting is presented as short reference application for uniformity and compatibility with third-party SGML software.

Note sgmlproc includes these definitions by default when processing files having an .md file name suffix. We can omit including an SGML declaration if we rename our file to process such that it has an .md file suffix, in which case the necessary SGML declaration settings will be automatically assumed by sgmlproc.

For example, markdown-headings-builtin.md looks like this:

## Heading 1 ##

Body text of first section.

and when processed via

sgmlproc markdown-headings-builtin.md

will be formatted into:

<h2 id="heading-1">Heading 1
</h2><p>Body text of first section.
</p>

HTML Outlining

This example demonstrates how to automatically create an outline from basic HTML sectioning and/or heading elements using link processing and templating.

An outline is useful for generating a table of contents, for assistive technologies, and for generation of page navigation elements. Specifically, given a source HTML document similar to the following HTML markup not making use of HTML5's sectioning elements

<h2 id="heading-a">A Level Two Heading</h2>
<p>Level Two Content</p>
<p>Other Level Two Content</p>
<h2 id="heading-b">Another Level Two Heading</h2>
<p>Yet other Level Two Content</p>

we want to create a <nav> element as follows

<nav>
  <ul>
    <li><a href="#heading-a">A Level Two Heading</a></li>
    <li><a href="#heading-b">Another Level Two Heading</a></li>
  </ul>
</nav>

Later on, we also want to compose the result <nav> element with the source content into a compound HTML document such that source content appears as main content, and generated <nav> content as side-navigation (or top-navigation) content.

Producing sectioning roots from headings

HTML 5 has introduced sectioning elements as a means to hierarchically structure documents, where earlier HTML versions had only ranked heading elements for representing hierarchy ("flat-earth markup").

When sectioning elements are used, the markup for a heading element and the belonging body text, as well as potential subsections, have a common ancestor element, the sectioning root (a section, main, article or other element acting as sectioning root).

<section>
  <h2>Section heading</h2>
  <p>Section content text</p>
  <!-- potential subsections here ... -->
</section>
<section>
  <h2>Next section heading</h2>
  <p>Other content</p>
  <!-- potential subsections here ... -->
</section>

Traditional "flat-earth HTML markup" doesn't require a common (sectioning or other) element structurally enclosing the heading and its belonging section content:

<h2>Section heading</h2>
<p>Section content text</p>
<!-- ... -->
<h2>Next section heading</h2>
<p>Other content</p>
<!-- ... -->

sgmljs.net SGML is designed to be used with markdown text. Markdown doesn't have Wiki markup for sectioning as such, but, like earlier versions of HTML, for heading elements only. To impose sectioning structure onto markdown text explicitly, section (or other sectioning root) elements would have to be specified as HTML blocks within markdown text such as in the following example:

<section>

# Heading #

Markdown text with enclosing sectioning root
as markup block

</section>

This is however redundant and rarely seen in practice.

Therefore, for producing outlines from markdown or other HTML source without sectioning structure, we're using SGML to infer (ranked) section tags by parsing HTML with a custom DTD as straightforward as (outlining1.sgm):

<!DOCTYPE html [
  <!ELEMENT html O O (section2+)>
  <!ELEMENT section 2 O O (h2,p*,section3*)>
  <!ELEMENT section 3 O O (h3,p*)>
  <!ELEMENT h 2 - - (#PCDATA)>
  <!ELEMENT h 3 - - (#PCDATA)>
  <!ELEMENT p - - (#PCDATA)>
  <!ELEMENT a - - (#PCDATA)>
  <!ELEMENT li - - (ul*)>
  <!ELEMENT ul - - (#PCDATA)>
]>
<html>
  <h2>Section One Heading</h2>
  <p>Section One Body Text</p>
  <h2>Section Two Heading</h2>
  <p>Section Two Body Text</p>
  <h3>Subsection Two Dot Two Heading</h3>
  <p>Subsection Two Dot Two Body Text</p>
</html>

The parsing result contains inferred section2 and section3 elements as follows:

<html>
  <section2>
    <h2>Section One Heading</h2>
    <p>Section One Body Text</p>
  </section2>
  <section2>
    <h2>Section Two Heading</h2>
    <p>Section Two Body Text</p>
    <section3>
      <h3>Subsection Two Dot Two Heading</h3>
      <p>Subsection Two Dot Two Body Text</p>
    </section3>
  </section2>
</html>

This works because sgmljs.net SGML infers start-element tags for section2 and section3 section elements when seeing h2 and h3 elements, respectively, as directed by the content models of html and section2.
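
For readers who prefer to see the effect procedurally, the following JavaScript sketch groups a flat list of ranked headings and paragraphs into nested sections, similar in effect to the tag inference above. It is not how sgmljs.net SGML works internally (which is driven by content models); the inferSections function and the item objects with type, rank and text properties are hypothetical constructs for this illustration.

```javascript
// Illustrative only: groups a flat list of ranked headings and paragraphs
// into nested sections, similar in effect to the tag inference shown above.
function inferSections(items) {
  const root = { rank: 0, children: [] }
  const open = [root] // stack of open sections, innermost last
  for (const item of items) {
    if (item.type === 'heading') {
      // close sections of the same or deeper rank before opening a new one
      while (open.length > 1 && open[open.length - 1].rank >= item.rank) {
        open.pop()
      }
      const section = { rank: item.rank, heading: item.text, children: [] }
      open[open.length - 1].children.push(section)
      open.push(section)
    } else {
      // paragraphs belong to the innermost open section
      open[open.length - 1].children.push(item.text)
    }
  }
  return root.children
}

const flat = [
  { type: 'heading', rank: 2, text: 'Section One Heading' },
  { type: 'p', text: 'Section One Body Text' },
  { type: 'heading', rank: 2, text: 'Section Two Heading' },
  { type: 'p', text: 'Section Two Body Text' },
  { type: 'heading', rank: 3, text: 'Subsection Two Dot Two Heading' },
  { type: 'p', text: 'Subsection Two Dot Two Body Text' }
]
console.log(JSON.stringify(inferSections(flat), null, 2))
```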

Note in order to obtain proper HTML, the rank suffixes for section2 and section3 would have to be removed (using straightforward renaming into plain section elements in a link process). This isn't shown here in detail however, since for our use case we don't want to produce sectioning elements as such, but want to use sectioning elements only as intermediate markup for producing navigation link markup from it, as shown next.

Generation of nav-links into an ul container element involves inferring sectioning from heading elements as shown above in a first step, followed by transforming sectioning structure into nested li and ul elements (outlining2.sgm):

<!DOCTYPE html [
  <!ELEMENT html O O (section2+)>
  <!ELEMENT section 2 O O (h2,p*,section3*)>
  <!ELEMENT section 3 O O (h3,p*)>
  <!ELEMENT h 2 - - (#PCDATA)>
  <!ELEMENT h 3 - - (#PCDATA)>
  <!ELEMENT p - - (#PCDATA)>
  <!ELEMENT a - - (#PCDATA)>
  <!ELEMENT li - - (ul*)>
  <!ELEMENT ul - - (#PCDATA)>
]>
<!DOCTYPE ul [
  <!ELEMENT nav O O (ul)>
  <!ELEMENT ul (li+)>
  <!ELEMENT li (a,ul*)>
  <!ELEMENT a (#PCDATA)>
]>
<!LINKTYPE toc html ul [
  <!LINK #INITIAL
    html ul
    section2 #USELINK in-section2 li>
  <!LINK in-section2
    h2 a
    section3 #USELINK before-section3 ul>
  <!LINK before-section3 #IMPLIED #USELINK in-section3 li>
  <!LINK in-section3 h3 a>
]>
<html>
  <h2>Section One Heading</h2>
  <p>Section One Body Text</p>
  <h2>Section Two Heading</h2>
  <p>Section Two Body Text</p>
  <h3>Subsection Two Dot Two Heading</h3>
  <p>Subsection Two Dot Two Body Text</p>
</html>

The toc link process transforms html into ul, and (inferred) section2 elements into li elements. On section2, the in-section2 link set is made active which will generate <a> anchors from h2 headings, and produce content text for the anchor from heading text. Furthermore, on section3 subsection elements, a nested ul list is opened, and then the before-section3 link set immediately generates a li element on a virtual #IMPLIED element (according to sgmljs.net SGML's handling of link rules with #IMPLIED source elements) before proceeding to transform headings into <a> anchors.

We can test our example document on the command line by invoking

sgmlproc -v active_lpd_names=TOC outlining2.sgm

and will see an HTML list containing the heading texts as list items, preserving the hierarchical nesting structure:

<ul>
  <li><a>Section One Heading</a></li>
  <li><a>Section Two Heading</a>
    <ul>
      <li><a>Subsection Two Dot Two Heading</a></li>
    </ul>
  </li>
</ul>

Note while we have created <a> anchor elements, we haven't yet created href attributes for those anchor elements to link to the respective section in body text. We'll come back to this later to keep example code text small for now.
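
As a point of comparison, the same nested list can be produced procedurally. The JavaScript sketch below renders a flat list of ranked headings as nested ul, li and a markup without href attributes, matching the output above apart from whitespace. It is an illustration rather than a replacement for the link process; renderOutline and its topRank parameter are hypothetical, and the sketch assumes that ranks never skip a level and that heading text contains no markup characters.

```javascript
// Illustrative only: renders a flat list of ranked headings as nested
// <ul>/<li>/<a> markup. Assumes ranks never skip a level and that heading
// text contains no markup characters.
function renderOutline(headings, topRank) {
  let html = ''
  let level = 0 // number of currently open <ul> elements
  for (const heading of headings) {
    const target = heading.rank - topRank + 1
    if (target > level) {
      // deeper than before: open a nested list inside the current <li>
      while (level < target) {
        html += '<ul>'
        level++
      }
    } else {
      // same or shallower: close the previous item and any nested lists
      html += '</li>'
      while (level > target) {
        html += '</ul></li>'
        level--
      }
    }
    html += '<li><a>' + heading.text + '</a>'
  }
  // close whatever is still open
  while (level > 0) {
    html += '</li></ul>'
    level--
  }
  return html
}

console.log(renderOutline([
  { rank: 2, text: 'Section One Heading' },
  { rank: 2, text: 'Section Two Heading' },
  { rank: 3, text: 'Subsection Two Dot Two Heading' }
], 2))
```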

Creating page templates

Now it's nice that SGML can produce an HTML nav-list from a document's outline, but we want to have the produced nav-list and the document body from which it was produced in the same document. To do so, our document must essentially contain source markup twice:

  • the first time for applying filtering on it to produce nav-links as described, and
  • the second time to actually contain the document full text

(assuming we want our rendered HTML to have an in-page document outline before actual main content). We're going to literally include content twice in the following example, but will soon turn to use entity references to avoid this redundancy.

So that we can still apply rank-based tag inference, we're using different markup declaration sets for source and result markup: the declarations of htmlsource integrate the content model rules we used before for tag inference below nav elements; the declarations of the result document type html, on the other hand, admit HTML elements being used freely (tocdoc1.sgm):

<!DOCTYPE htmlsource [
  <!ELEMENT htmlsource O O ANY>
  <!ELEMENT nav - - (section2+)>
  <!ELEMENT section 2 O O (h2,p*,section3*)>
  <!ELEMENT section 3 O O (h3,p*)>
  <!ELEMENT h 2 - - (#PCDATA)>
  <!ELEMENT h 3 - - (#PCDATA)>
  <!ELEMENT p - - (#PCDATA)>
  <!ELEMENT a - - (#PCDATA)>
]>
<!DOCTYPE html [
  <!ELEMENT html - - ANY>
  <!ELEMENT nav - - ANY -(p)>
  <!ELEMENT h2 - - (#PCDATA)>
  <!ELEMENT h3 - - (#PCDATA)>
  <!ELEMENT p - - (#PCDATA)>
  <!ELEMENT a - - (#PCDATA)>
]>
<!LINKTYPE toc htmlsource html [
  <!LINK #INITIAL
      htmlsource html
      nav #USELINK in-nav nav
      h2 h2
      h3 h3
      p p>
  <!LINK in-nav
      #IMPLIED #USELINK in-nav2 ul>
  <!LINK in-nav2
      section2 #USELINK in-section2 li>
  <!LINK in-section2
      h2 a
      p #USELINK #EMPTY #IMPLIED
      section3 #USELINK before-section3 ul>
  <!LINK before-section3
      #IMPLIED #USELINK in-section3 li>
  <!LINK in-section3
      h3 a
      p #USELINK #EMPTY #IMPLIED>
]>
<htmlsource>
  <nav>
    <h2>Section One Heading</h2>
    <p>Section One Body Text</p>
    <h2>Section Two Heading</h2>
    <p>Section Two Body Text</p>
    <h3>Subsection Two Dot Two Heading</h3>
  </nav>
  <h2>Section One Heading</h2>
  <p>Section One Body Text</p>
  <h2>Section Two Heading</h2>
  <p>Section Two Body Text</p>
  <h3>Subsection Two Dot Two Heading</h3>
  <p>Subsection Two Dot Two Body Text</p>
</htmlsource>

The toc link process adapts our link rules explained above within top-level nav content in the in-nav link rules and rules reached from it.

There are two additional link rules of the form p #USELINK #EMPTY #IMPLIED, on the in-section2 and in-section3 link sets, respectively, which are necessary here to filter out paragraph elements from result content. Moreover, we add an exclusion exception -(p) to the nav element declaration in the result document type. Together, these changes make the link process skip p paragraph elements within nav content: because the result element of the rule is #IMPLIED, the element is only copied over to result markup if allowed at the context position, which paragraph elements are not, since they're excluded via the -(p) content exception for nav.

To avoid having to specify our <h2>Section One Heading</h2><p>... content text twice in the document, we replace each occurrence with an entity reference &content, store the content in the file tocdoc-content.ent, and declare the content entity accordingly (tocdoc2.sgm):

<!DOCTYPE htmlsource [
  <!ELEMENT htmlsource O O ANY>
  <!ELEMENT nav - - (section2+)>
  <!ELEMENT section 2 O O (h2,p*,section3*)>
  <!ELEMENT section 3 O O (h3,p*)>
  <!ELEMENT h 2 - - (#PCDATA)>
  <!ELEMENT h 3 - - (#PCDATA)>
  <!ELEMENT p - - (#PCDATA)>
  <!ELEMENT a - - (#PCDATA)>
  <!ENTITY content SYSTEM "tocdoc-content.ent">
]>
<!DOCTYPE html [
  <!ELEMENT html - - ANY>
  <!ELEMENT nav - - ANY -(p)>
  <!ELEMENT h2 - - (#PCDATA)>
  <!ELEMENT h3 - - (#PCDATA)>
  <!ELEMENT p - - (#PCDATA)>
  <!ELEMENT a - - (#PCDATA)>
]>
<!LINKTYPE toc htmlsource html [
  <!LINK #INITIAL
      htmlsource html
      nav #USELINK in-nav nav
      h2 h2
      h3 h3
      p p>
  <!LINK in-nav
      #IMPLIED #USELINK in-nav2 ul>
  <!LINK in-nav2
      section2 #USELINK in-section2 li>
  <!LINK in-section2
      h2 a
      p #USELINK #EMPTY #IMPLIED
      section3 #USELINK before-section3 ul>
  <!LINK before-section3
      #IMPLIED #USELINK in-section3 li>
  <!LINK in-section3
      p #USELINK #EMPTY #IMPLIED
      h3 a>
]>
<htmlsource>
  <nav>
    &content
  </nav>
  &content
</htmlsource>

We may want to confirm on the command line that processing tocdoc1.sgm and tocdoc2.sgm produces the same result by invoking

sgmlproc -v active_lpd_names=TOC tocdoc1.sgm

and

sgmlproc -v active_lpd_names=TOC tocdoc2.sgm

respectively.

Note we now basically have a page template which will work with multiple individual content documents to be used in place of content. We only have to change the declaration of the content entity into this:

<!ENTITY content SYSTEM>

(i.e. without a filename) to have SGML treat it as a system-specific entity, resolved to a file named content by default, or to whatever other value we supply for it as a customized system-specific entity.

sgmlweb has built-in support for resolving files from HTTP request URLs as follows: if

http://[template]/[content]

is requested, where [template] and [content] refer to existing files [template].sgm and a directory/file in [template]/[content].sgm, then it will process the request by producing HTML from the [template].sgm file, with [template]/[content].sgm (as filename) being supplied as value of the PATH_TRANSLATED system-specific entity, and the content of [template]/[content].sgm supplied as PATH_TRANSLATED_CONTENT.

Note PATH_TRANSLATED is the name of a meta-variable supplied by web servers to CGI web modules according to the CGI specification, and is also used by JSGI/connect/express.js (as pathTranslated) for portable JavaScript web middleware modules.
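
The mapping rule just described can be restated as a small JavaScript function. This is a sketch of the rule as worded in this tutorial, not sgmlweb's own code: it does not check that the files exist, the name mapRequestPath and the returned property names are hypothetical, and treating everything after the first path segment as [content] is an assumption.

```javascript
// Illustrative only: restates the URL mapping rule described above.
// It does not check that files exist and is not sgmlweb's own code.
function mapRequestPath(urlPath) {
  const segments = urlPath.split('/').filter(Boolean)
  if (segments.length < 2) return null // no [template]/[content] pair
  const template = segments[0]
  // assumption: everything after the first segment names the content
  const content = segments.slice(1).join('/')
  return {
    templateFile: template + '.sgm',
    pathTranslated: template + '/' + content + '.sgm'
  }
}

console.log(mapRequestPath('/doc/content1'))
```

Under these assumptions, a request for /doc/content1 would be rendered from the template doc.sgm, with doc/content1.sgm supplied as PATH_TRANSLATED.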

So to make our page template fit for direct use in web templating, we rename our content entity to PATH_TRANSLATED_CONTENT, i.e. we're changing

<!ENTITY content SYSTEM>
...
&content
...

into

<!ENTITY PATH_TRANSLATED_CONTENT SYSTEM>
...
&PATH_TRANSLATED_CONTENT
...

Moreover, we add

<!ENTITY PATH_TRANSLATED SYSTEM>

to have access to the requested file name (the portion following http://localhost:8080/doc/ in our request URL).

The Download all files link (see above) links to a file archive where all files are put into the proper places according to sgmlweb's URL mapping rules to make our template run directly as page template.

To be able to use markdown syntax instead of our explicit <h2>Section One Heading</h2><p>... text, as already explained above, we just have to enable markdown processing in the template file by adding the markdown SGML declaration reference:

<!SGML MARKDOWN PUBLIC "+//IDN sgmljs.net//SD Markdown//EN">

and by referencing markdown shortref declarations in the base document type:

<!ENTITY % md_shortref_maps PUBLIC "+//IDN sgmljs.net//SHORTREF Markdown//EN">
%md_shortref_maps;

We must also respect markdown delimiter recognition by adding blank lines before and after references to &PATH_TRANSLATED_CONTENT, by placing &PATH_TRANSLATED_CONTENT at the beginning of a line, and by also placing the </nav> end-element tag at the beginning of a line.

With these changes, this is what our page template looks like at this point (doc1.sgm):

<!SGML MARKDOWN PUBLIC "+//IDN sgmljs.net//SD Markdown//EN">
<!DOCTYPE htmlsource [
  <!ELEMENT htmlsource O O ANY>
  <!ELEMENT nav - - (section2+)>
  <!ELEMENT section 2 O O (h2,p*,section3*)>
  <!ELEMENT section 3 O O (h3,p*)>
  <!ELEMENT h 2 - - (#PCDATA)>
  <!ELEMENT h 3 - - (#PCDATA)>
  <!ELEMENT p - - (#PCDATA)>
  <!ELEMENT a - - (#PCDATA)>
  <!ENTITY PATH_TRANSLATED SYSTEM>
  <!ENTITY PATH_TRANSLATED_CONTENT SYSTEM>
  <!ENTITY % md_shortref_maps PUBLIC "+//IDN sgmljs.net//SHORTREF Markdown//EN">
  %md_shortref_maps;
]>
<!DOCTYPE html [
  <!ELEMENT html - - ANY>
  <!ELEMENT nav - - ANY -(p)>
  <!ELEMENT h2 - - (#PCDATA)>
  <!ELEMENT h3 - - (#PCDATA)>
  <!ELEMENT p - - (#PCDATA)>
  <!ELEMENT a - - (#PCDATA)>
]>
<!LINKTYPE toc htmlsource html [
  <!LINK #INITIAL
      htmlsource html
      nav #USELINK in-nav nav
      h2 h2
      h3 h3
      p p>
  <!LINK in-nav
      #IMPLIED #USELINK in-nav2 ul>
  <!LINK in-nav2
      section2 #USELINK in-section2 li>
  <!LINK in-section2
      h2 a
      p #USELINK #EMPTY #IMPLIED
      section3 #USELINK before-section3 ul>
  <!LINK before-section3
      #IMPLIED #USELINK in-section3 li>
  <!LINK in-section3
      h3 a
      p #USELINK #EMPTY #IMPLIED>
]>
<htmlsource>
<nav>

&PATH_TRANSLATED_CONTENT

</nav>

&PATH_TRANSLATED_CONTENT

</htmlsource>

We may, again, verify our template works as expected on the command line by invoking sgmlproc on doc1.sgm while also supplying a value for PATH_TRANSLATED_CONTENT:

sgmlproc -v active_lpd_names=TOC -- -e 'PATH_TRANSLATED_CONTENT=<osfile>tocdoc-content.ent' doc1.sgm

Note the <osfile>... syntax we're using for specifying the content of tocdoc-content.ent as value for resolving PATH_TRANSLATED_CONTENT as system-specific entity is a Formal System Identifier (specified by the HyTime extensions to SGML). Note also we don't need to specify a value for PATH_TRANSLATED here since we're not actually referencing it in content; merely declaring it as general entity won't in itself make sgmljs.net SGML de-reference and open it.

Generating link decorations using templating

As promised further above, we now also want to implement our outline to actually contain nav-links to the sections of our body text, since our <a> anchor elements don't contain any href attributes at all yet. To do this, we first must use HTML id attributes on our headings as target links, and then somehow grab those attributes and place them into our <a> anchor links.

For the first problem, note that sgmljs.net SGML markdown, like many other markdown implementations such as pandoc, generates id attributes from heading text. For example, markdown text such as

# My Heading #

markdown body text

gets converted into the following HTML fragment:

<h1 id="my-heading">My Heading</h1>
<p>markdown body text</p>
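
The exact rule by which ids are derived from heading text isn't spelled out here, so the following JavaScript sketch is only a guess that happens to reproduce the two examples shown in this tutorial ("My Heading" and "Heading 1"). The headingId function is hypothetical and should not be taken as sgmljs.net SGML's or pandoc's actual algorithm.

```javascript
// A guess at a heading-to-id rule, not sgmljs.net SGML's actual algorithm:
// lower-case the text, turn runs of other characters into single hyphens,
// and trim hyphens at both ends.
function headingId(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '')
}

console.log(headingId('My Heading'))
console.log(headingId('Heading 1'))
```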

Composing HTML using templating

For the second problem of forwarding id attribute values into href attributes, we're going to use templating with sgmljs.net SGML as a more sophisticated technique for pulling-in content from external files.

Recall that in our initial, basic HTML composition example, we've used SGML general entities to supply replacement text for header and footer content.

Rewriting our basic composition example to make use of templating looks as follows (content-using-conref-templating.sgm):

<!DOCTYPE html [
  <!ATTLIST header ref ENTITY #CONREF>
  <!ATTLIST footer ref ENTITY #CONREF>
  <!ENTITY header SYSTEM "header.sgm">
  <!ENTITY footer SYSTEM "footer.sgm">
]>
<!LINKTYPE web html #IMPLIED [
  <!NOTATION sgml
    PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN">
  <!ENTITY header SYSTEM "header.sgm" NDATA sgml>
  <!ENTITY footer SYSTEM "footer.sgm" NDATA sgml>
  <!LINK #INITIAL [ ]>
]>
<html>
  <head>
    <title>The title</title>
  </head>
  <body>
    <header ref=header></header>
    <p>Body text</p>
    <footer ref=footer></footer>
  </body>
</html>

This variant

  • specifies replacement text files via ENTITY attributes,

  • declares the ref attribute to have #CONREF semantics, such that SGML expects the element to have no syntactical content,

  • has entities for header and footer declared in the base document type declaration, as well as preempting (overriding) entity declarations for these in the web link process declarations as SGML entities (i.e. declared as data entities with the SGML public identifier),

  • has system identifiers (filenames) of entities end in .sgm (so different files are being addressed from those used in the first variant), with header.sgm and footer.sgm meant to include a document type declaration.

For example, header.sgm can look as follows:

<!DOCTYPE #IMPLIED SYSTEM>
<header>
  <h1>My Site</h1>
</header>

(and similarly, footer.sgm has <!DOCTYPE #IMPLIED SYSTEM> as well, extending our former header.ent and footer.ent files into stand-alone SGML files).

<!DOCTYPE #IMPLIED ...> means that the document element is the first element actually encountered in the file.

Moreover, SYSTEM in this context means that the content of the external declaration set is expected in a file named HEADER.dtd (on the header/HEADER element), which is created by sgmlproc when a template is applied on the header element just before processing of the template.

To produce output HTML equivalent to what we've produced in the first variant (i.e. with header and footer replaced by the respective content), invoke:

sgmlproc \
  -v active_lpd_names=WEB \
  content-using-conref-templating.sgm

where we activate the WEB link process to make sgmlproc apply template expansion.

About CONREF

SGML's #CONREF attribute semantics by itself means just that SGML parses an element on which a #CONREF attribute is specified in content as if it were declared EMPTY. In classical SGML, this would mean that end-element tags for the respective element must not be specified. However, sgmljs.net SGML infers FEATURES MINIMIZE EMPTYNRM YES as default SGML declaration setting, which means that end-element tags are tolerated, and can be omitted according to the respective tag omission indicator for end-element tags.

With sgmlproc, we could alternatively use/enforce classic expectations by SGML using the following main content file instead (content-using-conref-templating-emptynrm.sgm):

<!DOCTYPE html [
  <!ELEMENT header - O ANY>
  <!ELEMENT footer - O ANY>
  <!ELEMENT p - - (#PCDATA)>
  <!ATTLIST header ref NAME #CONREF>
  <!ATTLIST footer ref NAME #CONREF>
  <!ENTITY header SYSTEM "header.sgm">
  <!ENTITY footer SYSTEM "footer.sgm">
]>
<!LINKTYPE web html #IMPLIED [
  <!NOTATION sgml
    PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN">
  <!ENTITY header SYSTEM "header.sgm" NDATA sgml>
  <!ENTITY footer SYSTEM "footer.sgm" NDATA sgml>
  <!LINK #INITIAL [ ]>
]>
<html>
  <head>
    <title>The title</title>
  </head>
  <body>
    <header ref=header>
    <p>Body text</p>
    <footer ref=footer>
  </body>
</html>

where the end-element tags for header and footer are omitted, as per classical SGML defaults.

Note the custom declarations for the header and footer elements here have - O as tag omission indicators, meaning these elements can have their end-element tags omitted, when they normally (in HTML 5 and in the HTML 5.2 DTD) must have end-element tags specified explicitly.

To process this file with EMPTYNRM switched off, invoke:

sgmlproc \
  -v sgmldecl_features_minimize_emptynrm="NO" \
  -v active_lpd_names=WEB \
  content-using-conref-templating-emptynrm.sgm

Populating href attributes

sgmljs.net SGML can also apply templating on elements without using #CONREF entities, by specifying a template as a notation attribute in a link process on an element. More interestingly, this variant allows us to grab attributes from source markup and supply them as system-specific entities to the template, which is what we want to do in our doc page template to supply id values from sections to href values in our nav-links.

We're returning here to our running example for outlining/nav-link generation from two sections before (doc1.sgm), and just add templating on <a> anchor elements within nav elements (doc.sgm):

<!SGML MARKDOWN PUBLIC "+//IDN sgmljs.net//SD Markdown//EN">
<!DOCTYPE htmlsource [
  <!ELEMENT htmlsource O O ANY>
  <!ELEMENT nav - - (section2+)>
  <!ELEMENT section 2 O O (h2,p*,section3*)>
  <!ELEMENT section 3 O O (h3,p*)>
  <!ELEMENT h 2 - - (#PCDATA)>
  <!ELEMENT h 3 - - (#PCDATA)>
  <!ELEMENT p - - (#PCDATA)>
  <!ELEMENT a - - (#PCDATA)>
  <!ENTITY PATH_TRANSLATED SYSTEM>
  <!ENTITY PATH_TRANSLATED_CONTENT SYSTEM>
  <!ENTITY % md_shortref_maps PUBLIC "+//IDN sgmljs.net//SHORTREF Markdown//EN">
  %md_shortref_maps;
]>
<!DOCTYPE html [
  <!ELEMENT html - - ANY>
  <!ELEMENT nav - - ANY>
  <!ELEMENT h2 - - (#PCDATA)>
  <!ELEMENT h3 - - (#PCDATA)>
  <!ELEMENT p - - (#PCDATA)>
  <!ELEMENT a - - (#PCDATA)>
]>
<!LINKTYPE toc htmlsource html [
  <!NOTATION anchor
    PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
    "anchor.sgm">
  <!ATTLIST #NOTATION anchor
    id CDATA #IMPLIED>
  <!ATTLIST (h2|h3)
    id CDATA #IMPLIED
    template NOTATION (anchor) #IMPLIED>
  <!LINK #INITIAL
    htmlsource html
    nav #USELINK in-nav nav
    h2 h2
    h3 h3
    p p>
  <!LINK in-nav
    #IMPLIED #USELINK in-nav2 ul>
  <!LINK in-nav2
    section2 #USELINK in-section2 li>
  <!LINK in-section2
    h2 [ template=anchor ] a
    section3 #USELINK before-section3 ul>
  <!LINK before-section3 #IMPLIED #USELINK in-section3 li>
  <!LINK in-section3 h3 [ template=anchor ] a>
]>
<htmlsource>
<nav>

&PATH_TRANSLATED_CONTENT

</nav>

&PATH_TRANSLATED_CONTENT

</htmlsource>

We've added an SGML notation declaration for anchor here, and declared that the notation's content is found in anchor.sgm, which has the following content:

<!DOCTYPE #IMPLIED SYSTEM [
  <!ENTITY id SYSTEM>
  <!ENTITY content SYSTEM "<osfd>0">
]>
<a href="#&id">&content</a>

We also edited our link rules to apply on h2 and h3 elements appearing as descendants of nav elements such that the template link attribute is set to anchor. When sgmljs.net SGML sees an SGML notation associated with an element being targeted in a link rule, it doesn't just rename the source element into the result element specified in the link rule, but creates the result element content by applying the anchor template.

Since we've declared the id attribute on h2 and h3 elements in the link process declaration, and also declared the id attribute as a data attribute (attribute of the anchor notation), sgmljs.net SGML will now make available the value of the id attribute as a system-specific entity of the same name to the template, in addition to providing the child content on which the template is applied via <osfd>0 as explained in Parsing HTML.
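
To make the substitution concrete, the following shell sketch mimics what the template expansion amounts to for a single heading. It does not invoke sgmlproc: sed merely stands in for entity expansion, and the id value and heading text are made-up examples rather than values from the tutorial files.

```shell
# Illustration only: imitate the expansion of anchor.sgm for one heading.
# sed is a stand-in for SGML entity expansion, not how sgmlproc works internally.
template='<a href="#&id">&content</a>'   # body of anchor.sgm
id='intro'                                # hypothetical id attribute value of an h2 element
content='Introduction'                    # hypothetical text content of that h2 element

# Replace the &id and &content references with the supplied values
result=$(printf '%s\n' "$template" | sed -e "s/&id/$id/" -e "s/&content/$content/")
printf '%s\n' "$result"
# prints: <a href="#intro">Introduction</a>
```

In the real link process, the resulting a element additionally ends up inside the li element produced by the enclosing link rule.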

Putting it all together

If we now arrange for a subdirectory named after the template file (eg. doc), and have a content file (markdowncontent1.sgm, say) in that directory, we can open http://localhost:8080/doc/markdowncontent1 in a web browser to make sgmlweb apply our template to add a basic outline using the process described above. This works of course for any document of the form used above stored in the doc subdirectory.
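
The directory layout described above could be set up as in the following sketch. The markdown content is a placeholder written for illustration; the file in the tutorial archive may differ, and the sketch leaves an existing file untouched.

```shell
# Create the subdirectory named after the template file doc.sgm
mkdir -p doc

# Placeholder content (assumption); skipped if the file already exists,
# so that a copy unpacked from the tutorial archive is not overwritten.
if [ ! -e doc/markdowncontent1.sgm ]; then
  cat > doc/markdowncontent1.sgm <<'EOF'
## First section

Body text of the first section.

### A subsection

Body text of the subsection.
EOF
fi

# doc.sgm and anchor.sgm are expected next to the doc/ directory
ls doc
```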

To run the example on the command line:

sgmlproc -v active_lpd_names=TOC -- -e 'PATH_TRANSLATED_CONTENT=<osfile>doc/markdowncontent1.sgm' doc.sgm

and we'll see exactly the same markup the browser is receiving as rendered result: a markdown file rendered as HTML with an automatically added navigation list displayed on top of it.

Using SGML user agent for in-browser rendering

We'll add another, larger content file now (doc/markdowncontent2.sgm) containing lorem ipsum blind text instead of one-line body content, but otherwise structurally equivalent to doc/markdowncontent1.sgm.

We'll also add a header element now containing a primitive site menu with links to both http://localhost:8080/doc/markdowncontent1 and http://localhost:8080/doc/markdowncontent2 (for simplicity, we're not using header templating discussed above). Note the changes to doc.sgm in this section are not part of the download archive for the tutorial but must be copy/pasted manually into doc.sgm.

<header>
  <h1>My Site</h1>
  <ul>
    <li><a href="/doc/markdowncontent1">markdowncontent1</a></li>
    <li><a href="/doc/markdowncontent2">markdowncontent2</a></li>
  </ul>
</header>

So that the header, ul, and li elements make it to the result markup, we need to edit the #INITIAL link set such that it contains mappings for these elements:

<!LINK #INITIAL
  ...
  header header
  ul ul
  li li
  ...

For <a> anchor elements, once again, we're using the following mapping rule

<!LINK #INITIAL
  ...
  a #IMPLIED
  ...

with the intent that the href attribute is carried over to the result <a> anchor element (which it wouldn't be if we chose a simple a a mapping rule instead).

Moreover, we instruct the browser to load sgml-ua.min.js (the JavaScript code for SGML user agent) by adding

<script src="/scripts/sgml-ua.min.js" async="async"></script>

along with a link rule script #IMPLIED to doc.sgm such that it reads as follows with our changes:

<!SGML MARKDOWN PUBLIC "+//IDN sgmljs.net//SD Markdown//EN">
<!DOCTYPE htmlsource [
  <!ELEMENT htmlsource O O ANY>
  <!ELEMENT nav - - (section2+)>
  <!ELEMENT section 2 O O (h2,p*,section3*)>
  <!ELEMENT section 3 O O (h3,p*)>
  <!ELEMENT h 2 - - (#PCDATA)>
  <!ELEMENT h 3 - - (#PCDATA)>
  <!ELEMENT p - - (#PCDATA)>
  <!ELEMENT a - - (#PCDATA)>
  <!ENTITY PATH_TRANSLATED SYSTEM>
  <!ENTITY PATH_TRANSLATED_CONTENT SYSTEM>
  <!ENTITY % md_shortref_maps PUBLIC "+//IDN sgmljs.net//SHORTREF Markdown//EN">
  %md_shortref_maps;
]>
<!DOCTYPE html [
  <!ELEMENT html - - ANY>
  <!ELEMENT nav - - ANY>
  <!ELEMENT h2 - - (#PCDATA)>
  <!ELEMENT h3 - - (#PCDATA)>
  <!ELEMENT p - - (#PCDATA)>
  <!ELEMENT a - - (#PCDATA)>
]>
<!LINKTYPE toc htmlsource html [
  <!NOTATION anchor
    PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
    "anchor.sgm">
  <!ATTLIST #NOTATION anchor
    id CDATA #IMPLIED>
  <!ATTLIST (h2|h3)
    id CDATA #IMPLIED
    template NOTATION (anchor) #IMPLIED>
  <!LINK #INITIAL
      htmlsource html
      header header
      ul ul
      li li
      a #IMPLIED
      nav #USELINK in-nav nav
      h2 h2
      h3 h3
      p p
      script #IMPLIED>
  <!LINK in-nav
      #IMPLIED #USELINK in-nav2 ul>
  <!LINK in-nav2
      section2 #USELINK in-section2 li>
  <!LINK in-section2
      h2 [ template=anchor ] a
      section3 #USELINK before-section3 ul>
  <!LINK before-section3
      #IMPLIED #USELINK in-section3 li>
  <!LINK in-section3
      h3 [ template=anchor ] a>
]>
<htmlsource>
<header>
  <ul>
    <li><a href="/doc/markdowncontent1">markdowncontent1</a></li>
    <li><a href="/doc/markdowncontent2">markdowncontent2</a></li>
  </ul>
</header>
<nav>

&PATH_TRANSLATED_CONTENT

</nav>

&PATH_TRANSLATED_CONTENT

<script src="/scripts/sgml-ua.min.js"></script>
</htmlsource>

If we refresh our browser for http://localhost:8080/doc/markdowncontent1 and click on the link for http://localhost:8080/doc/markdowncontent2, we'll see that markdowncontent2 was produced in-browser, without sgmlweb processing on the web backend/Node.js. To see this, we need to open Web Developer Tools in our browser (Firefox or Chrome) and open the Network tab before refreshing. It will tell us that the browser has fetched /doc.sgm, and /doc/markdowncontent2 as individual static files, and composed the HTML document/DOM for the markdowncontent2 HTML document in the browser.

As you can see, the SGML user agent is designed as a drop-in replacement for server-side HTML composition, offloading SGML processing from the server to the browser. Upon load, it changes the click behaviour of links to URLs on the origin host to perform browser-side SGML processing, executing the exact same JavaScript code as used on the server. It is envisioned that server-side SGML composition is only performed on the initial page load for a given web site, with subsequent page loads being processed entirely in the browser and the server only sending static files used for composition.

Now there's still something wrong with our browser-rendered document, which is immediately visible when we navigate from our landing page to the in-browser composed page: the page margins are missing on the in-browser composed page. The reason is simply that we're lacking a proper HTML body element, and also a head element with at least a page title, as is required for valid HTML. While the browser adds these automatically, SGML only adds head, body, etc. elements if it is instructed to do so by specifying an HTML DTD with tag inference and other content rules for HTML.

Using an HTML DTD is discussed at great length in the Parsing HTML Tutorial; here we just want to conclude our tutorial by specifying the required head, body, and title elements manually. For the title element, we populate its text content from PATH_TRANSLATED which expands to the client file name, and which we're already declaring in our SGML prolog anyway:

<!SGML MARKDOWN PUBLIC "+//IDN sgmljs.net//SD Markdown//EN">
<!DOCTYPE htmlsource [
  <!ELEMENT htmlsource O O ANY>
  <!ELEMENT nav - - (section2+)>
  <!ELEMENT section 2 O O (h2,p*,section3*)>
  <!ELEMENT section 3 O O (h3,p*)>
  <!ELEMENT h 2 - - (#PCDATA)>
  <!ELEMENT h 3 - - (#PCDATA)>
  <!ELEMENT p - - (#PCDATA)>
  <!ELEMENT a - - (#PCDATA)>
  <!ENTITY PATH_TRANSLATED SYSTEM>
  <!ENTITY PATH_TRANSLATED_CONTENT SYSTEM>
  <!ENTITY % md_shortref_maps PUBLIC "+//IDN sgmljs.net//SHORTREF Markdown//EN">
  %md_shortref_maps;
]>
<!DOCTYPE html [
  <!ELEMENT html - - ANY>
  <!ELEMENT nav - - ANY>
  <!ELEMENT h2 - - (#PCDATA)>
  <!ELEMENT h3 - - (#PCDATA)>
  <!ELEMENT p - - (#PCDATA)>
  <!ELEMENT a - - (#PCDATA)>
]>
<!LINKTYPE toc htmlsource html [
  <!NOTATION anchor
    PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
    "anchor.sgm">
  <!ATTLIST #NOTATION anchor
    id CDATA #IMPLIED>
  <!ATTLIST (h2|h3)
    id CDATA #IMPLIED
    template NOTATION (anchor) #IMPLIED>
  <!LINK #INITIAL
      htmlsource html
      header header
      ul ul
      li li
      a #IMPLIED
      nav #USELINK in-nav nav
      h2 h2
      h3 h3
      p p
      head head
      title title
      body body
      script #IMPLIED>
  <!LINK in-nav
      #IMPLIED #USELINK in-nav2 ul>
  <!LINK in-nav2
      section2 #USELINK in-section2 li>
  <!LINK in-section2
      h2 [ template=anchor ] a
      section3 #USELINK before-section3 ul>
  <!LINK before-section3
      #IMPLIED #USELINK in-section3 li>
  <!LINK in-section3
      h3 [ template=anchor ] a>
]>
<htmlsource>
<head>
  <title>&PATH_TRANSLATED</title>
</head>
<body>
<header>
  <ul>
    <li><a href="/doc/markdowncontent1">markdowncontent1</a></li>
    <li><a href="/doc/markdowncontent2">markdowncontent2</a></li>
  </ul>
</header>
<nav>

&PATH_TRANSLATED_CONTENT

</nav>

&PATH_TRANSLATED_CONTENT

<script src="/scripts/sgml-ua.min.js"></script>
</body>
</htmlsource>

If we want to test our modified template on the command line, we now have to actually supply PATH_TRANSLATED as a system-specific entity:

sgmlproc -v active_lpd_names=TOC -- -e 'PATH_TRANSLATED=doc/markdowncontent1.sgm' -e 'PATH_TRANSLATED_CONTENT=<osfile>doc/markdowncontent2.sgm' doc.sgm
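
If the title and the content are meant to come from the same file, a small shell function can keep the two entity values in sync. This is only a sketch: render_doc and the SGMLPROC override are hypothetical helpers introduced here, not part of the SGML package, and the sgmlproc arguments are exactly those of the command above.

```shell
# Render one content file through doc.sgm, passing the same file name
# for both PATH_TRANSLATED and PATH_TRANSLATED_CONTENT.
# render_doc and SGMLPROC are hypothetical names (assumptions);
# setting SGMLPROC=echo previews the arguments without running sgmlproc.
render_doc() {
  content=$1
  "${SGMLPROC:-sgmlproc}" -v active_lpd_names=TOC -- \
    -e "PATH_TRANSLATED=$content" \
    -e "PATH_TRANSLATED_CONTENT=<osfile>$content" \
    doc.sgm
}
```

With sgmlproc installed, render_doc doc/markdowncontent2.sgm would then process the second content file using a matching title.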