The World Wide Web Consortium (W3C) has recently cancelled its editorial process for the HTML specification text and announced that it intends to publish snapshot versions of the upstream specification source text instead. For sustained access to web content, and in the interest of more readily usable HTML markup specifications for the document engineering community, we look at an encoding of W3C’s last HTML version as a DTD grammar. The encoding uses SGML, the only markup technology capable of formally processing HTML on the basis of an international standard.
The tutorial will re-introduce SGML concepts not part of XML such as tag inference/omission and other markup minimization techniques.
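To give a rough impression of what tag inference means in practice, the following Python sketch fills in omitted end tags for a handful of elements. It is only an illustration under stated assumptions: the `CLOSES` and `VOID` tables and the `infer_end_tags` function are hypothetical names invented here, the rules cover a tiny hand-picked subset of HTML, and attributes are dropped. In SGML, the parser derives the same information from the content models and omission indicators declared in the DTD rather than from a hard-coded table.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical, hand-picked rules for this sketch (not taken from the DTD):
# a start tag used as a key implicitly ends an open element in its set.
CLOSES = {
    "p": {"p"},
    "li": {"li", "p"},
    "ul": {"p"},
}
# Elements with empty content never receive an end tag.
VOID = {"br", "hr", "img"}


class TagInferrer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # names of currently open elements
        self.out = []    # normalized output pieces

    def handle_starttag(self, tag, attrs):
        # Close open elements whose end is implied by this start tag.
        while self.stack and self.stack[-1] in CLOSES.get(tag, ()):
            self.out.append(f"</{self.stack.pop()}>")
        self.out.append(f"<{tag}>")  # attributes are dropped in this sketch
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: ignored
        # Close everything up to and including the named element.
        while True:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def infer_end_tags(markup):
    """Return markup with omitted end tags made explicit."""
    parser = TagInferrer()
    parser.feed(markup)
    parser.close()  # flush buffered text before closing open elements
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)
```

Calling `infer_end_tags("<ul><li>one<li>two</ul><p>a<p>b")` yields fully tagged markup in which every `li` and `p` element has an explicit end tag.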
As an update to the author’s previous work, an SGML Document Type Definition (DTD) for the version of HTML described in W3C’s HTML 5.2 specification text is introduced, its construction process is explained, and its applications and limitations are discussed. Some initial validation results based on W3C’s/WHATWG’s webtest suite are reported.
After discussing theoretical foundations, the tutorial will provide practical, hands-on exercises and examples for common HTML parsing tasks as they arise in digital heritage preservation, archival of legal, personal, or business documents stored as HTML, building corpora of web documents, acquiring and integrating HTML fragments into publishing workflows, and other fields of application.
Moreover, SGML techniques for authoring and producing HTML will be studied in hands-on exercises. These include SGML’s integrated facilities for parsing custom Wiki syntaxes (such as Markdown) or casual math formatting, type-safe templating, and techniques for generic markup processing, such as using SGML to transform sectioning structure from flat-earth markup into a hierarchically nested form (a task also known as the HTML 5 outlining algorithm).
There is ample room for other topics of interest. Participants are encouraged to bring their own HTML material, possibly malformed or otherwise unusual and challenging, to put the HTML 5.2 DTD to the test and to gain insights into its customizability and limitations.
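The idea behind turning flat heading sequences into nested sections can be sketched in a few lines of Python. This is a simplified illustration, not the SGML technique covered in the exercises and not the complete HTML 5 outlining algorithm: it assumes the headings have already been extracted as `(level, title)` pairs, and the function name `nest_headings` is hypothetical.

```python
def nest_headings(headings):
    """Turn a flat list of (level, title) pairs into nested sections.

    Each section is a dict with "level", "title", and "children" keys.
    A heading becomes a child of the nearest preceding heading that has
    a strictly smaller level number.
    """
    root = {"level": 0, "title": None, "children": []}
    stack = [root]  # path of currently open sections, outermost first
    for level, title in headings:
        # Close sections at the same or a deeper level than the new heading.
        while stack[-1]["level"] >= level:
            stack.pop()
        node = {"level": level, "title": title, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


# Flat input as it might be read off h1/h2/h3 elements in document order.
flat = [(1, "Intro"), (2, "Scope"), (2, "Terms"), (1, "Method"), (3, "Detail")]
outline = nest_headings(flat)
```

Here "Scope" and "Terms" end up as children of "Intro", while "Detail" attaches directly to "Method" although a level is skipped. The sketch assumes heading levels of 1 or greater, so the artificial root at level 0 is never closed.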
Time is set aside for a concluding discussion on core web standardization in light of recent events. The author would also like to reach out to the document engineering community for feedback and collaboration on formal specifications of web standards.
Tutorial attendees will use SGML software provided for download on the tutorial web site. The required SGML processing utility, sgmlproc, is a Unix command-line application that can be installed during the tutorial by downloading a single file. As it is written for Unix (Linux, Mac OS), Windows users are advised either to first install WSL (the Windows Subsystem for Linux) for running Ubuntu Linux on Windows, or to install an equivalent sgmlproc program as a Node.js package.
As part of the hands-on exercises, the venerable (Open)SP package, widely regarded as reference SGML software, can also be used to process test HTML input data. Instructions are given during the exercises for the necessary adaptations to naming rules, predefined entities, and unsupported WebSGML features, and for overcoming other obstacles to working effectively with HTML material in (Open)SP or other third-party SGML software.
No prior experience in SGML is assumed, but some working knowledge of XML and HTML is expected. The tutorial can also be followed by merely studying supplied tutorial materials without actually performing hands-on exercises.
Marcus Reichardt has over 25 years of experience in commercial document engineering, computational linguistics, logic, and other areas of software engineering, and is the founding developer of the sgmljs.net SGML software. He started his initiative for a new SGML formulation of HTML 5 in 2015 based on WHATWG’s and W3C’s initial recommendation text, and has since tracked subsequent versions. He provides the HTML DTD discussed here free of charge and under a liberal license on the project’s home page.