SGML

Producing HTML

producing-html-tutorial.tgz
Download all
content-using-entity-references.sgm
Content file using (parsed) entity expansion for boilerplate inclusion
footer.ent
Footer boilerplate as text data
header.ent
Header boilerplate as text data
content-using-conref-templating.sgm
Content file using templating with #CONREF entities for boilerplate inclusion
footer.sgm
Footer boilerplate as SGML document (with DOCTYPE)
header.sgm
Header boilerplate as SGML document (with DOCTYPE)
content-using-conref-templating-emptynrm.sgm
Content file using templating with #CONREF entities with omitted end-element tags on header and footer elements
markdown-emph.sgm
Basic use of short references to produce span-level markup
markdown-headings.sgm
Simplistic use of short references to produce headings and paragraphs
markdown-headings-builtin.sgm
Producing headings with built-in markdown parsing embedded in SGML
markdown-headings-builtin.md
Producing headings from regular markdown (.md) file
Note: to execute tests in this directory, download the sgmlproc command-line app for Linux or Mac OS, or get sgmlproc by installing the sgml package for Node.js. In the latter case, unless the sgml package is installed globally, invoke sgmlproc as node_modules/.bin/sgmlproc on the command line.

Introduction

This tutorial gives an introduction to building basic websites from HTML and markdown text fragments with the help of light content extraction and transformation techniques for generating page navigation.

Composing HTML documents

As a simplistic example of organizing web content around shared common content, we're going to add header and footer boilerplate to an SGML file that we intend to publish as a static HTML site. The expectation here is that we're going to have multiple pages, each sharing common head metadata, header content (with eg. a menu), footer content (with eg. legal notices), and similar shared content.

So this is what our produced content file(s) should look like:

<html>
  <head>
    <title> ... </title>
  </head>
  <body>
    <header> ... </header>
    <main> ... </main>
    <footer> ... </footer>
  </body>
</html>

where we want to have boilerplate for head, header, and footer populated by using SGML, and keep actual content files free from redundant head, header, and footer elements. Instead, we want our content files to look as follows:

<title>The title</title>
<p>Body text</p>

A simple way to do this (with just header and footer content for now) is storing header and footer in separate files and using general entities to pull content in from those files into our main content file(s) (content-using-entity-references.sgm):

<!DOCTYPE html [
  <!ENTITY header SYSTEM "header.ent">
  <!ENTITY footer SYSTEM "footer.ent">
]>
<html>
  <head>
    <title>The title</title>
  </head>
  <body>
    &header
    <p>Body text</p>
    &footer
  </body>
</html>

where header.ent contains HTML text such as

<header>
  <h1>My Site</h1>
</header>

and footer.ent contains eg.

<footer>
  <p>Copyright by me;</p>
  <p>Contact: abuse@mysite.com</p>
</footer>

Use

./sgmlproc content-using-entity-references.sgm

to produce the expected, completely assembled HTML file.
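To make the mechanism concrete, here is a minimal Python sketch of what parsed entity expansion amounts to: each entity reference is replaced by the entity's replacement text. This is only an illustration, not how sgmlproc is implemented. The expand_entities helper and the in-memory entities dictionary (standing in for the header.ent and footer.ent files) are assumptions made for this example.

```python
import re

def expand_entities(text, entities):
    """Replace each entity reference by the entity's replacement text."""
    def substitute(match):
        name = match.group(1)
        # Undeclared references are left untouched in this sketch.
        return entities.get(name, match.group(0))
    # A reference is '&', a name, and an optional ';' reference close.
    return re.sub(r"&([A-Za-z][A-Za-z0-9.-]*);?", substitute, text)

# Stand-ins for the contents of header.ent and footer.ent.
entities = {
    "header": "<header>\n  <h1>My Site</h1>\n</header>",
    "footer": "<footer>\n  <p>Contact: abuse@mysite.com</p>\n</footer>",
}
body = "<body>\n&header\n<p>Body text</p>\n&footer;\n</body>"
expanded = expand_entities(body, entities)
print(expanded)
```

Both the reference closed by a line end (&header) and the one closed by a semicolon (&footer;) are replaced, mirroring the two ways a reference can be terminated.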

With sgmljs.net SGML, there's another, more sophisticated way to do this, and one that helps in reducing further redundancies (content-using-conref-templating.sgm):

<!DOCTYPE html [
  <!ATTLIST header ref ENTITY #CONREF>
  <!ATTLIST footer ref ENTITY #CONREF>
  <!ENTITY header SYSTEM "header.sgm">
  <!ENTITY footer SYSTEM "footer.sgm">
]>
<!LINKTYPE web html #IMPLIED [
  <!NOTATION sgml
    PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN">
  <!ENTITY header SYSTEM "header.sgm" NDATA sgml>
  <!ENTITY footer SYSTEM "footer.sgm" NDATA sgml>
  <!LINK #INITIAL [ ]>
]>
<html>
  <head>
    <title>The title</title>
  </head>
  <body>
    <header ref=header></header>
    <p>Body text</p>
    <footer ref=footer></footer>
  </body>
</html>

This variant

  • specifies replacement text files via ENTITY attributes,

  • declares the ref attribute to have #CONREF semantics, such that SGML expects the element to have no syntactical content,

  • declares entities for header and footer in the base document type declaration, and again as overriding (preempting) declarations in the web link process declaration, where they are declared as SGML (NDATA sgml) entities,

  • has system identifiers (filenames) of entities ending in .sgm (so different files are addressed from those used in the first variant), with header.sgm and footer.sgm meant to include a document type declaration.

For example, header.sgm looks as follows:

<!DOCTYPE #IMPLIED SYSTEM>
<header>
  <h1>My Site</h1>
</header>

(and similarly, footer.sgm has <!DOCTYPE #IMPLIED SYSTEM> as well, extending our former header.ent and footer.ent files into stand-alone SGML files).

<!DOCTYPE #IMPLIED ...> means that the document element is the first element actually encountered in the file.

Moreover, SYSTEM in this context means that the content of the external declaration set is expected in a file named HEADER.dtd (on the header/HEADER element), which is created by sgmlproc when a template is applied on the header element just before processing of the template.

To produce output HTML equivalent to what the first variant produces (i.e. with header and footer replaced by the respective content), invoke:

./sgmlproc \
  -v active_lpd_names=WEB \
  content-using-conref-templating.sgm

where we activate the WEB link process to make sgmlproc apply template expansion.
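The visible effect of template expansion on the assembled page can be approximated by the following Python sketch. It is a rough model only: sgmlproc parses each template as an SGML document in its own right, whereas this sketch merely substitutes text. The apply_templates function, the two regular expressions, and the templates dictionary (standing in for header.sgm and footer.sgm) are assumptions for illustration.

```python
import re

# Document type declaration at the start of a template file.
DOCTYPE = re.compile(r"^\s*<!DOCTYPE[^>]*>\s*")
# Start tag carrying a ref attribute, optionally followed by its end tag.
CONREF = re.compile(r"<(\w+)\s+ref=(\w+)\s*>(?:\s*</\1>)?")

def apply_templates(text, templates):
    """Replace every element with a ref attribute by the referenced template."""
    def expand(match):
        template = templates[match.group(2)]
        # Drop the template's own document type declaration.
        return DOCTYPE.sub("", template).rstrip()
    return CONREF.sub(expand, text)

# Stand-ins for the contents of header.sgm and footer.sgm.
templates = {
    "header": "<!DOCTYPE #IMPLIED SYSTEM>\n<header>\n  <h1>My Site</h1>\n</header>\n",
    "footer": "<!DOCTYPE #IMPLIED SYSTEM>\n<footer>\n  <p>Contact: abuse@mysite.com</p>\n</footer>\n",
}
page = "<body>\n<header ref=header></header>\n<p>Body text</p>\n<footer ref=footer>\n</body>"
assembled = apply_templates(page, templates)
print(assembled)
```

The sketch accepts the element with or without an end-element tag, since both spellings appear in this tutorial.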

About CONREF

SGML's #CONREF attribute semantics by itself just means that SGML parses an element on which a #CONREF attribute is specified as if it were declared EMPTY. In classical SGML, this would mean that end-element tags for the respective element must not be specified. However, sgmljs.net SGML infers FEATURES MINIMIZE EMPTYNRM YES as its default SGML declaration setting, which means that end-element tags on such elements are tolerated, and can be omitted according to the element's tag omission indicator for end-element tags.

With sgmlproc, we could alternatively use/enforce classic expectations by SGML using the following main content file instead (content-using-conref-templating-emptynrm.sgm):

<!DOCTYPE html [
  <!ELEMENT header - O ANY>
  <!ELEMENT footer - O ANY>
  <!ELEMENT p - - (#PCDATA)>
  <!ATTLIST header ref NAME #CONREF>
  <!ATTLIST footer ref NAME #CONREF>
  <!ENTITY header SYSTEM "header.sgm">
  <!ENTITY footer SYSTEM "footer.sgm">
]>
<!LINKTYPE web html #IMPLIED [
  <!NOTATION sgml
    PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN">
  <!ENTITY header SYSTEM "header.sgm" NDATA sgml>
  <!ENTITY footer SYSTEM "footer.sgm" NDATA sgml>
  <!LINK #INITIAL [ ]>
]>
<html>
  <head>
    <title>The title</title>
  </head>
  <body>
    <header ref=header>
    <p>Body text</p>
    <footer ref=footer>
  </body>
</html>

where the end-element tags for header and footer are omitted, as per classical SGML defaults.

Note the custom declarations for the header and footer elements here have - O as tag omission indicators, meaning these elements can have their end-element tags omitted, when they normally (in HTML 5 and in the HTML 5.2 DTD) must have end-element tags specified explicitly.

./sgmlproc \
  -v sgmldecl_features_minimize_emptynrm="NO" \
  -v active_lpd_names=WEB \
  content-using-conref-templating-emptynrm.sgm

Parsing markdown

Parsing markdown using short references

Custom Wiki syntaxes such as markdown are as old as digital text processing itself. SGML lets you define element context-specific token replacement rules for this purpose. For example, to make SGML format a simplistic markdown fragment into HTML, you could use an SGML prolog like this (markdown-emph.sgm):

<!DOCTYPE p [
  <!ELEMENT p - - ANY>
  <!ELEMENT em - - (#PCDATA)>
  <!ENTITY start-em '<em>'>
  <!ENTITY end-em '</em>'>
  <!SHORTREF in-p '*' start-em>
  <!SHORTREF in-em '*' end-em>
  <!USEMAP in-p p>
  <!USEMAP in-em em>
]>
<p>The following text:
   *this*
   will be put into EM
   element tags</p>

If processed with sgmlproc eg.

./sgmlproc markdown-emph.sgm

SGML will produce canonical syntax as follows:

<p>The following text:
   <em>this</em>
   will be put into EM
   element tags</p>

This works by declaring short reference maps (in-p and in-em) via SHORTREF declarations, which associate tokens (the * asterisk in both maps) with replacement entities, and then activating those maps in a given element context via USEMAP short reference use declarations.

If the context element (the top-most, i.e. innermost open, element) is em, the in-em shortref map is current (as per the second USEMAP declaration), which defines the replacement text for * to be </em>, ending the emphasized text span. Within p, by contrast, the replacement text is <em>, which starts an emphasized text span and makes em the context element.
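The context-dependent replacement described above can be mimicked in a few lines of Python. This is a sketch of the idea only, not of an SGML parser: the apply_shortrefs function and its maps table are hypothetical names, and the element stack is tracked by hand rather than derived from a DTD.

```python
def apply_shortrefs(text):
    """Expand '*' to <em> or </em> depending on the current context element."""
    # One short reference map per context element: token -> replacement text.
    maps = {
        "p": {"*": "<em>"},
        "em": {"*": "</em>"},
    }
    context = ["p"]  # stack of open elements; the last entry is the context
    out = []
    for ch in text:
        replacement = maps[context[-1]].get(ch)
        if replacement is None:
            out.append(ch)  # ordinary character, copied through
        elif replacement.startswith("</"):
            context.pop()  # the end tag closes the context element
            out.append(replacement)
        else:
            context.append(replacement.strip("<>"))  # the start tag opens one
            out.append(replacement)
    return "".join(out)

result = apply_shortrefs("The following text: *this* will be put into EM element tags")
print(result)
```

The same token yields different markup purely because a different map is current when it is encountered.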

As a slight variation, h2 heading elements can be produced from text enclosed in double-hashmark (##) characters, as used in markdown syntax, with p paragraph elements being added by markdown formatting:

<!DOCTYPE body [
  <!ELEMENT body O O ((h2,p)+)>
  <!ELEMENT p O O (#PCDATA)>
  <!ELEMENT h2 - - (#PCDATA)>
  <!ENTITY start-h2 '<h2>'>
  <!ENTITY end-h2 '</h2>'>
  <!SHORTREF in-body '##' start-h2>
  <!SHORTREF in-h2 '##' end-h2>
  <!USEMAP in-body body>
  <!USEMAP in-h2 h2>
]>
<body>

## Heading 1 ##

Body text of first section.

</body>

Parsing full markdown syntax using sgmljs.net SGML

Note full markdown syntax formatting is impossible to implement using just short references. For example, markdown's reference links feature allows link details to be populated from data placed elsewhere in the document such that eg. link URLs and titles can be forward- or backward-referenced within a document. This can't be handled by short references which only act locally in a given element context.
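The non-local nature of reference links is easy to see in a small Python sketch: the link definitions have to be collected from the whole document before any reference can be resolved. The resolve_reference_links function, the simplified regular expressions, and the example.org URL are assumptions for illustration and cover only a small part of markdown's actual link syntax.

```python
import re

# A definition line such as: [label]: https://example.org/page
DEFINITION = re.compile(r"^\[([^\]]+)\]:\s*(\S+)\s*$", re.MULTILINE)
# A reference such as: [link text][label]
REFERENCE = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")

def resolve_reference_links(markdown):
    # Pass 1: collect all link definitions, wherever they appear.
    targets = {label.lower(): url for label, url in DEFINITION.findall(markdown)}
    # Definition lines themselves don't appear in the output.
    text = DEFINITION.sub("", markdown)
    # Pass 2: resolve references against the collected definitions.
    def link(match):
        label = match.group(2).lower()
        if label not in targets:
            return match.group(0)  # unresolved references stay as they are
        return '<a href="%s">%s</a>' % (targets[label], match.group(1))
    return REFERENCE.sub(link, text).strip()

source = "See the [tutorial][tut] for details.\n\n[tut]: https://example.org/tutorial\n"
resolved = resolve_reference_links(source)
print(resolved)
```

Here the definition follows the reference, so a single left-to-right pass with only local context could not have produced the link.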

For full markdown formatting, sgmljs.net SGML has built-in "virtual" short reference rules that, when referenced (included) in the base document type declaration, will make sgmljs.net SGML recognize and format markdown into HTML as expected:

<!ENTITY % md_shortref_maps
  PUBLIC "+//IDN sgmljs.net//SHORTREF Markdown//EN">
    %md_shortref_maps;

The former example, rewritten to make use of built-in shortref rules for markdown, looks as follows (markdown-headings-builtin.sgm):

<!SGML MARKDOWN PUBLIC "+//IDN sgmljs.net//SD Markdown//EN">
<!DOCTYPE body [
  <!ELEMENT body O O ((h2,p)+)>
  <!ELEMENT p O O (#PCDATA)>
  <!ELEMENT h2 - - (#PCDATA)>
  <!ENTITY % md_shortref_maps
    PUBLIC "+//IDN sgmljs.net//SHORTREF Markdown//EN">
  %md_shortref_maps;
]>
<body>

## Heading 1 ##

Body text of first section.

</body>

The first line is an SGML declaration reference we need to include such that sgmlproc assumes availability of short reference delimiters needed for markdown and HTML naming rules in a way that is compatible with third-party SGML software.

Pulling in md_shortref_maps enables comprehensive markdown formatting. Note there's no actual short reference declaration set being resolved by the +//IDN sgmljs.net//SHORTREF Markdown//EN public identifier; these declarations are recognized specially by sgmljs.net SGML and are implemented using an internal markdown-to-HTML converter. Markdown formatting is presented as short reference application for uniformity and compatibility with third-party SGML software.

Note sgmlproc includes these definitions by default when processing files having an .md file name suffix. If we rename the file to be processed such that it has an .md suffix, we can omit the SGML declaration, since sgmlproc then automatically assumes the necessary SGML declaration settings.

For example, markdown-headings-builtin.md looks like this:

## Heading 1 ##

Body text of first section.

and when processed via

sgmlproc markdown-headings-builtin.md

will be formatted into:

<h2 id="heading-1">Heading 1
</h2><p>Body text of first section.
</p>

HTML Outlining

This example demonstrates how to automatically create a table of contents from basic HTML sectioning and/or heading elements using link processing and templating.

An outline is useful for generating a table of contents, for assistive technologies, and for generating page navigation elements.

Specifically, given a source HTML document similar to the following markup, which doesn't use HTML5's sectioning elements

<h2 id="heading-a">A Level Two Heading</h2>
<p>Level Two Content</p>
<p>Other Level Two Content</p>
<h2 id="heading-b">Another Level Two Heading</h2>
<p>Yet other Level Two Content</p>

we want to create a <nav> element as follows

<nav>
  <ul>
    <li><a href="#heading-a">A Level Two Heading</a></li>
    <li><a href="#heading-b">Another Level Two Heading</a></li>
  </ul>
</nav>

Moreover, we want to compose the result <nav> element with the source content into a compound HTML document such that source content appears as main content, and generated <nav> content as side-navigation (or top-navigation) content.
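For comparison, the target <nav> markup can also be produced procedurally. The Python sketch below uses only the standard library's html.parser module; it is not the SGML link-process technique this tutorial develops, and the HeadingCollector class and build_nav function are hypothetical names introduced for this example.

```python
from html import escape
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect (id, text) pairs for every h2 element carrying an id."""
    def __init__(self):
        super().__init__()
        self.headings = []
        self._current_id = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._current_id = dict(attrs).get("id")
            self._text = []

    def handle_data(self, data):
        if self._current_id is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._current_id is not None:
            self.headings.append((self._current_id, "".join(self._text).strip()))
            self._current_id = None

def build_nav(html_text):
    """Return a <nav> element with one link per collected heading."""
    collector = HeadingCollector()
    collector.feed(html_text)
    items = "\n".join(
        '    <li><a href="#%s">%s</a></li>' % (escape(hid), escape(text))
        for hid, text in collector.headings
    )
    return "<nav>\n  <ul>\n%s\n  </ul>\n</nav>" % items

source = (
    '<h2 id="heading-a">A Level Two Heading</h2>\n'
    '<p>Level Two Content</p>\n'
    '<h2 id="heading-b">Another Level Two Heading</h2>\n'
)
nav = build_nav(source)
print(nav)
```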

Producing sectioning roots from headings

HTML 5 has introduced sectioning elements as a means to hierarchically structure documents, where earlier HTML versions had only ranked heading elements for representing hierarchy ("flat-earth markup").

When sectioning elements are used, a heading element, its associated body text, and any subsections share a common ancestor element, the sectioning root (a section, main, article, or other element acting as sectioning root).

<section>
  <h2>Section heading</h2>
  <p>Section content text</p>
  <!-- potential subsections here ... -->
</section>
<section>
  <h2>Next section heading</h2>
  <p>Other content</p>
  <!-- potential subsections here ... -->
</section>

Traditional "flat-earth HTML markup" doesn't require a common (sectioning or other) element structurally enclosing the heading and its associated section content:

<h2>Section heading</h2>
<p>Section content text</p>
<!-- ... -->
<h2>Next section heading</h2>
<p>Other content</p>
<!-- ... -->

sgmljs.net SGML is designed to be used with Markdown text. Markdown doesn't have Wiki markup for sectioning as such, but, like earlier versions of HTML, for heading elements only. To impose sectioning structure onto markdown text explicitly, section (or other sectioning root) elements would have to be specified as HTML block elements within markdown text such as in the following example:

<section>

# Heading #

Markdown text with enclosing sectioning root
as markup block

</section>

This is however redundant and rarely seen in practice.

SGML can infer (ranked) section tags from "flat-earth markup" by parsing HTML with a custom DTD as straightforward as

<!DOCTYPE html [
  <!ELEMENT html O O (section2+)>
  <!ELEMENT section 2 O O (h2,p*,section3*)>
  <!ELEMENT section 3 O O (h3,p*)>
  <!ELEMENT h 2 - - (#PCDATA)>
  <!ELEMENT h 3 - - (#PCDATA)>
  <!ELEMENT p - - (#PCDATA)>
  <!ELEMENT a - - (#PCDATA)>
  <!ELEMENT li - - (ul*)>
  <!ELEMENT ul - - (#PCDATA)>
]>
<html>
  <h2>Section One Heading</h2>
  <p>Section One Body Text</p>
  <h2>Section Two Heading</h2>
  <p>Section Two Body Text</p>
  <h3>Subsection Two Dot Two Heading</h3>
  <p>Subsection Two Dot Two Body Text</p>
</html>

The parsing result contains inferred section2 and section3 elements as follows:

<html>
  <section2>
    <h2>Section One Heading</h2>
    <p>Section One Body Text</p>
  </section2>
  <section2>
    <h2>Section Two Heading</h2>
    <p>Section Two Body Text</p>
    <section3>
      <h3>Subsection Two Dot Two Heading</h3>
      <p>Subsection Two Dot Two Body Text</p>
    </section3>
  </section2>
</html>
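SGML arrives at this structure through content models and tag omission. To show just the resulting nesting logic, here is a Python sketch that imitates the outcome with an explicit stack of open section ranks. The infer_sections function and the (tag, text) input format are assumptions made for this illustration.

```python
def infer_sections(elements):
    """Wrap a flat list of (tag, text) pairs into nested sectionN elements."""
    out = []
    open_ranks = []  # ranks of the currently open sections, outermost first

    def close_down_to(rank):
        # Close every open section whose rank is at or below this level.
        while open_ranks and open_ranks[-1] >= rank:
            out.append("</section%d>" % open_ranks.pop())

    for tag, text in elements:
        if tag in ("h2", "h3"):
            rank = int(tag[1])
            close_down_to(rank)  # a heading ends sections of the same or deeper rank
            out.append("<section%d>" % rank)
            open_ranks.append(rank)
        out.append("<%s>%s</%s>" % (tag, text, tag))
    close_down_to(0)  # close whatever is still open at the end
    return out

flat = [
    ("h2", "Section One Heading"),
    ("p", "Section One Body Text"),
    ("h2", "Section Two Heading"),
    ("p", "Section Two Body Text"),
    ("h3", "Subsection Two Dot Two Heading"),
    ("p", "Subsection Two Dot Two Body Text"),
]
sectioned = infer_sections(flat)
print("\n".join(sectioned))
```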

Note in order to obtain HTML, the rank suffixes for section2 and section3 would have to be removed (using straightforward renaming into plain section elements in a link process). This isn't shown here in detail however, since for the hierarchy text template we don't want to produce sectioning elements as such, but want to use sectioning elements for producing navigation link markup, as shown next.

Generation of nav links into an ul container element involves inferring sectioning from heading elements as shown above in a first step, followed by transforming sectioning structure into nav links:

<!DOCTYPE html [
  <!ELEMENT html O O (section2+)>
  <!ELEMENT section 2 O O (h2,p*,section3*)>
  <!ELEMENT section 3 O O (h3,p*)>
  <!ELEMENT h 2 - - (#PCDATA)>
  <!ELEMENT h 3 - - (#PCDATA)>
  <!ELEMENT p - - (#PCDATA)>
  <!ELEMENT a - - (#PCDATA)>
  <!ELEMENT li - - (ul*)>
  <!ELEMENT ul - - (#PCDATA)>
]>
<!DOCTYPE ul [
  <!ELEMENT nav O O (ul)>
  <!ELEMENT ul (li+)>
  <!ELEMENT li (a,ul*)>
  <!ELEMENT a (#PCDATA)>
]>
<!LINKTYPE toc html ul [
  <!LINK #INITIAL
    html ul
    section2 #USELINK in-section2 li>
  <!LINK in-section2
    h2 a
    section3 #USELINK before-section3 ul>
  <!LINK before-section3 #IMPLIED #USELINK in-section3 li>
  <!LINK in-section3 h3 a>
]>
<html>
  <h2>Section One Heading</h2>
  <p>Section One Body Text</p>
  <h2>Section Two Heading</h2>
  <p>Section Two Body Text</p>
  <h3>Subsection Two Dot Two Heading</h3>
  <p>Subsection Two Dot Two Body Text</p>
</html>

The toc link process transforms html into ul, and (inferred) section2 elements into li elements. On section2, the in-section2 link set is made active, which generates <a> anchors from h2 headings and puts the heading text in as hyperlink text for the anchor. Furthermore, on section3 subsection elements, a nested ul list is opened, and then the before-section3 link set immediately generates a li element on a virtual #IMPLIED element (according to sgmljs.net SGML's handling of link rules with #IMPLIED source elements) before proceeding to transform h3 headings into <a> anchors.
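The element mapping performed by the toc link process can be paraphrased in Python with the standard library's xml.etree.ElementTree module. This is only an analogy for the mapping (html to ul, section2 and section3 to li, headings to a), not a model of SGML link processing; build_toc and section_to_li are hypothetical helper names, and the input is the inferred section structure written out as well-formed XML.

```python
import xml.etree.ElementTree as ET

def section_to_li(section, heading_tag):
    """Create <li><a>heading text</a></li> for one section element."""
    li = ET.Element("li")
    a = ET.SubElement(li, "a")
    a.text = section.findtext(heading_tag)
    return li

def build_toc(html_root):
    """Map html to ul, section2/section3 to li, and headings to a."""
    ul = ET.Element("ul")
    for section2 in html_root.findall("section2"):
        li = section_to_li(section2, "h2")
        ul.append(li)
        # Each subsection opens its own nested ul holding a single li.
        for section3 in section2.findall("section3"):
            nested = ET.SubElement(li, "ul")
            nested.append(section_to_li(section3, "h3"))
    return ul

parsed = ET.fromstring(
    "<html>"
    "<section2><h2>Section One Heading</h2><p>Section One Body Text</p></section2>"
    "<section2><h2>Section Two Heading</h2><p>Section Two Body Text</p>"
    "<section3><h3>Subsection Two Dot Two Heading</h3>"
    "<p>Subsection Two Dot Two Body Text</p></section3></section2>"
    "</html>"
)
toc = ET.tostring(build_toc(parsed), encoding="unicode")
print(toc)
```

Paragraph content is dropped because no rule maps it into the result, and the anchors carry no href attributes since the headings in this example have no id attributes.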