
30 years of SGML

October 23, 2016 by Marcus

Today marks the 30th anniversary of the publication of the SGML standard.

As anyone who has worked through ISO 8879 and other ISO specification texts will confirm, it's not an easy read, nor are the other specifications that build on SGML. Steven R. Newcomb, one of the editors of the HyTime standard, later even apologized for it:

"In short, despite having great things to say, even the deathless prose of the HyTime standard tends to be unreadable and, quite frankly, to suck as informative literature (I'm a co-editor of it; may God have mercy on us.)"
(from material about HyTime architectural forms that seems to have vanished from the web, though Robin Cover's authoritative collection of markup-related specifications, also started in 1986 by the way, probably has it covered somewhere)

It seems that every markup specification kind of sucks. The XML specification, and especially the XML Schema specification, were widely despised.

And while the effort that has been put into the HTML5 specification has brought major improvements to the web, as a markup language (rather than as a JavaScript browser API) HTML5 hasn't brought anything new to the table. And what it has brought wasn't universally appreciated.

Yet for HTML5 it was found necessary to specify its parsing rules and content grammar in informal prose and in a procedural fashion. While perhaps appealing to casual readers, this was a step back in terms of quality expectations towards specification work, and may be the reason why few HTML5 parsers, if any, claim HTML5 conformance. To be fair, though, HTML5 was designed to deal with existing tag soup content on the web, so it had to be extremely permissive, basically accepting anything as HTML.

W3C HTML 5.1 due later this year

In its upcoming 5.1 release, W3C HTML is scheduled to include WHATWG's custom elements feature (usable via JavaScript only, however, which should be quite worrying in itself with respect to long-term content preservation). Nevertheless, the custom elements feature puts HTML into the class of markup meta languages like SGML and XML.

Given HTML's trajectory towards becoming a universal markup meta language and its scorched-earth attitude towards existing markup languages, as a proposal for the HTML 6 and HTML 7 roadmaps, here's a (non-exhaustive) list of features SGML has had for thirty years now (and in fact, much longer):

Injection-free text substitution

In modern web development, ad-hoc syntaxes for template "engines" and configuration files have proliferated. At the time of this writing, curly braces seem to be in fashion for expressing variable substitution in text documents. Even in XML applications it is common to use curly brace syntax for this purpose, rather than XML's (and SGML's) built-in entity mechanism. Almost all template engines are prone to injection attacks because, unlike entity references, they can't take into account the syntactic context, and hence the escaping rules, of the place text is substituted into.
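To make the injection problem concrete, here is a minimal Python sketch of a curly-brace template renderer. The function names, the placeholder pattern, and the sample template are all made up for illustration and don't correspond to any particular template engine. The first renderer substitutes text without regard to where it lands; the second escapes values for one specific context (HTML element content and quoted attribute values) using the standard library's html.escape.

```python
import html
import re

# Hypothetical placeholder syntax for this sketch: {name}
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def naive_render(template: str, values: dict) -> str:
    """Context-blind substitution: the raw value is pasted into the text."""
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template)


def escaping_render(template: str, values: dict) -> str:
    """Same substitution, but each value is escaped for HTML element
    content and quoted attribute values before it is inserted."""
    return PLACEHOLDER.sub(
        lambda m: html.escape(values[m.group(1)], quote=True), template
    )


TEMPLATE = "<p>Hello, {user}!</p>"
PAYLOAD = "<script>alert(1)</script>"

naive = naive_render(TEMPLATE, {"user": PAYLOAD})
safe = escaping_render(TEMPLATE, {"user": PAYLOAD})

print(naive)  # the payload arrives as live markup
print(safe)   # the payload arrives as inert character data
```

Note that the second renderer is only correct because the placeholder happens to sit in element content. Inside a script element, a URL, or a stylesheet, different escaping rules apply, and a renderer that doesn't know the syntactic context has no way to pick the right ones.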

Custom Wiki syntaxes

SGML can parse user-defined Wiki syntaxes and other shorthand notations using the SHORTREF feature, which allows context-dependent replacement of text tokens by markup tags or other text. It can even parse JSON.
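The idea of context-dependent token replacement can be imitated in a few lines of Python. This is only a rough sketch of the concept and not an SGML parser: the map table, the function name, and the one-character "*" token are assumptions made for illustration, and real SHORTREF maps are declared in the DTD rather than in code. The point is that the same token is replaced by different markup depending on which element is currently open.

```python
# Hypothetical short-reference maps for this sketch: for each currently
# open element, which text token is replaced by which markup action.
SHORTREF_MAPS = {
    "p": {"*": ("open", "em")},    # inside p, "*" starts emphasis
    "em": {"*": ("close", "em")},  # inside em, "*" ends emphasis
}


def expand_shortrefs(text: str, root: str = "p") -> str:
    """Replace tokens by tags according to the map of the open element."""
    stack = [root]
    out = [f"<{root}>"]
    for ch in text:
        action = SHORTREF_MAPS.get(stack[-1], {}).get(ch)
        if action is None:
            out.append(ch)  # ordinary character data
        elif action[0] == "open":
            stack.append(action[1])
            out.append(f"<{action[1]}>")
        else:
            stack.pop()
            out.append(f"</{action[1]}>")
    # Close whatever is still open at the end of the input.
    while stack:
        out.append(f"</{stack.pop()}>")
    return "".join(out)


result = expand_shortrefs("SGML is *not* new")
print(result)
```

With the maps above, the Wiki-style input "SGML is *not* new" comes out as a p element containing an em element, and an unterminated "*" still yields balanced tags because open elements are closed at the end of input.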

Semantic markup

For many years now, web developers have striven to implement the idea of semantic markup, never mind the fact that in e.g. classic computer science, the term semantic is used in opposition to syntactic, as a strict dichotomy. Now a markup language is very much a syntactic construct, hence by that definition of the word, it can't be semantic.

As a rational basis for semantic markup, perhaps the term content reusability expresses more readily what web developers aim for here. Looking at the web today, it cannot be said that this has been achieved (just look at this website's pitiful page source, which is littered with divs and other presentation artifacts required for Bootstrap's CSS).

The "semantic markup" discussion basically is a consequence of the fact that web developers are confronted with HTML and CSS as two separate languages, and of a developer's mindset attempting to rationalize this situation.

In SGML, on the other hand, attributes were originally introduced to capture rendering properties, much like CSS properties today, whereas content was put into element child text. In SGML, the notion that markup shouldn't contain presentation details is solved by using "link types", which, like CSS, can define an automaton for selecting applicable rendering properties in a context-dependent way, without having to specify those in main text copy. Unlike with CSS, this task is solved within the SGML language itself, and isn't shifted into separate ad-hoc syntax.

Is directly editing HTML, a content delivery language, really the right approach for authoring larger amounts of content? Is using CSS and JavaScript a sustainable strategy for preservation of content for generations to come?

Today, after many years of achievements in the vibrant markup community, it seems that standards fatigue has set in, and that a whole generation of domain experts who believed in markup language technology have lost their voice or have retired silently.

Overusing XML for problems unrelated to semistructured data in the 2010s, the failure of XHTML, and other failures of the W3C haven't helped either. Standards development for markup languages has largely stalled or been abandoned.

Fortunately, SGML still stands as a practical bona fide language for tackling large-scale content authoring and preservation tasks, which are the kinds of problems SGML was designed to solve almost 50 years ago. Its specification process took almost a decade, and it then took another decade to become an ISO standard. As it stands, we're not going to see an international standard of SGML's calibre anytime soon.

SGML can parse and process most text content formats in use today, including HTML, XML, Wiki syntaxes, and JSON; it can formalize HTML's notion of omitting elements and tag inference, and it has practical integrated content organization, templating, and other authoring mechanisms.

sgmljs.net's SGML reference and Templating reference contain a modern account of these topics.
