node_modules/.bin/sgmlproc on the command line instead of ./sgmlproc.
In this tutorial, we're going to learn how to parse real-world HTML as used on actual websites. For the tutorial, we just pick a random page of the first site listed as being in danger of shutting down soon on https://indieweb.org/site-deaths, Doc Searls' blog (entries posted in 2019).
Either open that page in a web browser and save it as a file in a temporary folder, or use the wget or curl command-line programs to do the same:
curl http://blogs.harvard.edu/doc/ > blogpage.html
If you're using a browser, this will create a file named "Doc Searls Weblog - Holding forth on stuff since 1998.html" (and potentially a similarly named folder containing images and other resources linked from the page, which we don't bother with for now). For simplicity, it's recommended (and assumed in the following text) to rename that long file to just blogpage.html.
Now we can perform our first attempt at parsing the file using sgmlproc:
./sgmlproc blogpage.html
The output (standard and error output) will mix content and error messages such as the following (the first type of message will appear multiple times, and the messages generally contain large portions of JavaScript code text, which are omitted here for brevity):
"blogpage.html": line 10: warning: '<lots of Javascript code text>' : unquoted '&' character
"blogpage.html": line 10: 'j.getContext': unresolved entity reference
"blogpage.html": line 12: fatal: '<i.lengt...': unterminated element or invalid < character in element or attributes
What's going on is that SGML doesn't know that the & (ampersand) and < (less-than) characters shouldn't be treated as markup delimiters at the places it complains about. The latter error is fatal and makes sgmlproc stop further processing.
To tell SGML that it is parsing HTML, we need to download an HTML 5 DTD and place it into the directory, then edit the blogpage.html file to declare a document type definition, indicating to SGML that script element content should be treated as CDATA (unparsed character data). To do so, we open blogpage.html in a text editor and change
<!DOCTYPE html>
into
<!DOCTYPE html SYSTEM "html52mini.dtd">
where html52mini.dtd refers to a local download of the HTML 5.2 mini-DTD described in HTML5.2 DTD Reference.
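The same edit can be scripted instead of done in a text editor. The following is a minimal sketch, assuming the saved page spells its doctype exactly as <!DOCTYPE html>; use_local_dtd is a hypothetical helper name, not part of sgmlproc.

```shell
# use_local_dtd INPUT DTD
# Prints INPUT with "<!DOCTYPE html>" rewritten so that DTD is referenced
# as the external subset. Assumes the doctype is spelled exactly
# "<!DOCTYPE html>" and that DTD contains no "|", "&" or "\" characters.
use_local_dtd() {
  sed 's|<!DOCTYPE html>|<!DOCTYPE html SYSTEM "'"$2"'">|' "$1"
}

# Example:
#   use_local_dtd blogpage.html html52mini.dtd > blogpage-with-html52mini-doctype.html
```

The function writes to standard output rather than editing in place, so the original blogpage.html stays available for comparison.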
With the changed file stored as blogpage-with-html52mini-doctype.html, executing
./sgmlproc blogpage-with-html52mini-doctype.html
again will make sgmlproc at least parse the file completely and output canonically-formatted HTML markup.
However, sgmlproc still reports error messages, buried in its large terminal output. We shall instruct sgmlproc to write its parsed and re-serialized markup into a file rather than onto the terminal:
./sgmlproc -- -o out.html blogpage-with-html52mini-doctype.html
This will make sgmlproc store HTML output into out.html and report only error messages on the terminal:
"blogpage-with-html52mini-doctype.html": line 33: element LINK: attribute HREF: 'https://fonts.googleapis.com/css?family=Open+Sans%3A300italic%2C400italic%2C600italic%2C300%2C400%2C600&subset=latin%2Clatin-ext&ver=4.8.1': invalid value for declared data notation
"blogpage-with-html52mini-doctype.html": line 154: element A: attribute HREF: 'http://@robwilliamsNY': invalid value for declared data notation
"blogpage-with-html52mini-doctype.html": line 291: element A: attribute HREF: 'http://Marvel-Like Universe in Which All of Us are Enhanced': invalid value for declared data notation
The reason sgmlproc warns about URLs not having proper syntax is that W3C's HTML5.2 specification (kind of) recommends use of RFC 3986 URLs as opposed to the more permissive variant specified in WHATWG's URL Standard.
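When many such messages are reported, it can help to group them by kind before looking at individual URLs. The following is a rough sketch, assuming the messages have been saved to a log file (errors.log is a hypothetical name) and that each message ends in a quoted value followed by a colon and the message text, as in the samples above; summarize_messages is an illustrative helper, not an sgmlproc feature.

```shell
# summarize_messages LOGFILE
# Counts messages by kind, most frequent first. Only lines shaped like
# "...'<value>': <message text>" are counted; other lines are skipped.
summarize_messages() {
  sed -n "s/.*' *: *//p" "$1" | sort | uniq -c | sort -rn
}

# Example (errors.log is a hypothetical capture of the terminal messages):
#   summarize_messages errors.log
```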
The HTML5.2 DTD (and HTML5.2 mini-DTD) represents this lexical constraint by declaring URL-typed attributes with WebSGML data specification attributes for a custom URL lexical type derived from the description of HTML <form> input values. For example, the href attribute on HTML <a> anchor elements is declared as follows in html52mini.dtd:
<!ENTITY % if_uri_data_spec_attr "INCLUDE">
<![ %if_uri_data_spec_attr [
<!NOTATION uri
PUBLIC "+//IDN www.w3c.org/TR/html5//NOTATION HTML Form Input Types//EN">
<!ATTLIST #NOTATION uri type (url) #FIXED url>
<!ENTITY % URI "DATA uri">
]]>
<!ENTITY % no_uri_data_spec_attr "INCLUDE">
<![ %no_uri_data_spec_attr [
<!ENTITY % URI "CDATA">
]]>
...
<!ATTLIST a href %URI #IMPLIED>
This declaration for the %URI parameter entity is put into a marked section such that it can be conditionally included or excluded based on the value of the if_uri_data_spec_attr parameter entity. To switch off checking for RFC 3986 conformance on URI-typed attributes, we can change the internal subset in the SGML prolog of our running blogpage example document to preempt if_uri_data_spec_attr with a value of IGNORE, thereby overriding the DTD's default of INCLUDE, and store the edited file as e.g. blogpage-with-html52mini-doctype-and-custom-url.html:
<!DOCTYPE html SYSTEM "html52mini.dtd" [
<!ENTITY % if_uri_data_spec_attr "IGNORE">
]>
This will declare the href and other URI-typed attributes as plain CDATA attributes, and make our remaining error messages go away when re-invoking sgmlproc on it:
./sgmlproc -- -o out.html blogpage-with-html52mini-doctype-and-custom-url.html
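The prolog with the internal subset can also be generated from the originally saved page. This is a sketch under the assumption that the page contains the exact string <!DOCTYPE html>; disable_uri_check is a hypothetical helper, and the DTD file name is the one used throughout this tutorial.

```shell
# disable_uri_check INPUT
# Prints INPUT with the first "<!DOCTYPE html>" replaced by a doctype that
# references html52mini.dtd and sets if_uri_data_spec_attr to IGNORE in
# the internal subset. All other lines are passed through unchanged.
disable_uri_check() {
  awk '
    BEGIN { old = "<!DOCTYPE html>" }
    !replaced && (pos = index($0, old)) > 0 {
      print substr($0, 1, pos - 1) "<!DOCTYPE html SYSTEM \"html52mini.dtd\" ["
      print "<!ENTITY % if_uri_data_spec_attr \"IGNORE\">"
      print "]>" substr($0, pos + length(old))
      replaced = 1
      next
    }
    { print }
  ' "$1"
}

# Example:
#   disable_uri_check blogpage.html > blogpage-with-html52mini-doctype-and-custom-url.html
```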
We may want to compare the input file to the output of sgmlproc to check if sgmlproc has actually done anything at all. So we're going to run a basic file comparison on the input and output files:
diff blogpage-with-html52mini-doctype-and-custom-url.html out.html
Among lots of output from diff listing differences between input and output HTML, at the end we see the following lines:
< <script type='text/javascript' src='https://stats.wp.com/e-201932.js' async defer></script>
< <script type='text/javascript'>
---
> <script async="async" defer="defer" src="https://stats.wp.com/e-201932.js" type="text/javascript"></script>
> <script type="text/javascript">
telling us that sgmlproc has indeed changed the enumerated attributes async and defer into the canonical notations async="async" and defer="defer", respectively. This also shows us that the input file indeed uses HTML 5 features (async was introduced in HTML 5, whereas defer already existed in HTML 4).
Note with sgmljs.net SGML, we could equivalently use the following:
<!DOCTYPE html SYSTEM "about:legacy-compat">
or, with our URL parsing customization included:
<!DOCTYPE html SYSTEM "about:legacy-compat" [
<!ENTITY % if_uri_data_spec_attr "IGNORE">
]>
This needs some explanation: both <!DOCTYPE html> and <!DOCTYPE html SYSTEM "about:legacy-compat"> (but no other strings) are valid, interchangeable DOCTYPE strings as far as HTML is concerned - HTML simply ignores these DOCTYPE declarations at the beginning of a file. But to SGML, <!DOCTYPE html> means that the external subset (where markup declarations for elements, attributes, etc. are expected) is empty, whereas <!DOCTYPE html SYSTEM "about:legacy-compat"> tells SGML that the content of the external subset is to be found using a system identifier (e.g. a file name or similar) named about:legacy-compat. Now sgmljs.net SGML has built-in support for resolving about:legacy-compat to the W3C HTML 5.2 mini-DTD (it has the HTML declaration set bundled in the sgmlproc executable file).
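To see at a glance which of the two doctype forms a given file uses, and hence whether SGML will look for an external subset at all, a small check like the following can be used. It is only a sketch: external_subset_kind is a hypothetical helper that inspects the text of the file and does not parse SGML.

```shell
# external_subset_kind FILE
# Prints "system" if FILE contains <!DOCTYPE html SYSTEM ...>, "empty" if
# it contains the bare <!DOCTYPE html>, and "other" otherwise (for
# instance when there is no doctype or a PUBLIC identifier is used).
external_subset_kind() {
  if grep -q -i '<!DOCTYPE html[[:space:]][[:space:]]*SYSTEM' "$1"; then
    echo system
  elif grep -q -i '<!DOCTYPE html[[:space:]]*>' "$1"; then
    echo empty
  else
    echo other
  fi
}

# Example:
#   external_subset_kind blogpage.html
```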
The venerable OpenSP software (a fork of James Clark's original SP SGML processing package) is widely regarded as the SGML reference implementation. To make it work with modern HTML, we need to consider the following points.
We must use the full HTML5 DTD, which is the last backward-compatible HTML 5.x DTD that can be used with OpenSP (due to WebSGML features not implemented in OpenSP, such as declaring attributes on #ALL elements).
We must include an SGML declaration (we're using the updated SGML declaration for HTML 4), and the SGML declaration used must contain MINIMIZE SHORTTAG YES rather than the WebSGML syntax for granular/unbundled SHORTTAG features.
We must supply SP_BCTF=utf-8 as an environment variable to osgmlnorm so OpenSP interprets bytes/characters properly and doesn't balk at non-SGML characters.
With these changes in place, our running example HTML now begins as follows (blogpage-with-html5-doctype-and-modified-html51-dcl.sgm):
<!SGML "ISO 8879:1986 (WWW)"
--
Based on the SGML Declaration for HTML 4 (html.dcl), with
the following modifications:
- adapted GRPGTCNT, GRPCNT, ATTCNT
- adapted to WebSGML: minimum data, using extended markup minimization
feature syntax introduced with ISO 8879 Annex K
See also html5-with-svg-and-mathml.dcl
--
CHARSET
BASESET "ISO Registration Number 177//CHARSET
ISO/IEC 10646-1:1993 UCS-4 with
implementation level 3//ESC 2/5 2/15 4/6"
DESCSET 0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
128 32 UNUSED
160 55136 160
55296 2048 UNUSED -- SURROGATES --
57344 1056768 57344
CAPACITY SGMLREF
SCOPE DOCUMENT
SYNTAX
SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
BASESET "ISO 646IRV:1991//CHARSET
International Reference Version
(IRV)//ESC 2/8 4/2"
DESCSET 0 128 0
FUNCTION
RE 13
RS 10
SPACE 32
TAB SEPCHAR 9
NAMING LCNMSTRT ""
UCNMSTRT ""
LCNMCHAR ".-_:"
UCNMCHAR ".-_:"
NAMECASE GENERAL YES
ENTITY NO
DELIM GENERAL SGMLREF
HCRO "&#x" -- 38 is the number for ampersand --
SHORTREF SGMLREF
NAMES SGMLREF
QUANTITY SGMLREF
ATTCNT 384 -- increased for HTML51 + SVG --
ATTSPLEN 65536 -- These are the largest values --
LITLEN 65536 -- permitted in the declaration --
NAMELEN 65536 -- Avoid fixed limits in actual --
PILEN 65536 -- implementations of HTML UA's --
TAGLVL 100
TAGLEN 65536
GRPGTCNT 1024 -- increased for MathML --
GRPCNT 256 -- increased for HTML 5, MathML --
FEATURES
MINIMIZE DATATAG NO
OMITTAG YES
RANK NO
SHORTTAG YES
-- WebSGML: --
-- STARTTAG EMPTY NO
UNCLOSED NO
NETENABL NO
ENDTAG EMPTY NO
UNCLOSED NO
ATTRIB DEFAULT YES
OMITNAME YES
VALUE YES --
EMPTYNRM NO
IMPLYDEF ATTLIST YES
DOCTYPE NO
ELEMENT NO
ENTITY NO
NOTATION NO
LINK
SIMPLE NO
IMPLICIT NO
EXPLICIT NO
OTHER
CONCUR NO
SUBDOC NO
FORMAL NO
APPINFO NONE
>
<!DOCTYPE html SYSTEM "html5.dtd">
<html lang="en-US"><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Doc Searls Weblog · Holding forth on stuff since 1998</title>
...
We can now invoke
SP_BCTF=utf-8 osgmlnorm blogpage-with-html5-doctype-and-modified-html51-dcl.sgm
(or another program from the OpenSP suite) for parsing our running HTML page with only 23 (recoverable) errors. Note these errors are HTML validation errors we haven't seen with sgmljs.net SGML since we've used the mini-DTD which doesn't have all element and attribute declarations.
While not a problem for our HTML at hand, in general OpenSP and other third-party SGML software without support for full WebSGML will have trouble parsing URI values (e.g. in HTML href and src attributes) when these contain & (ampersand) characters. Depending on the subsequent character, OpenSP will complain about entity references formed by ampersand characters not being resolvable, or worse, bogusly expand what it sees as entity references when a general entity happens to be declared for a token appearing after ampersand characters in HTML URI attributes.
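Before handing a page to OpenSP, the affected attribute values can be located with a plain text search. The following is a heuristic sketch, not an SGML-aware check: find_ampersand_uris is a hypothetical helper that lists lines where a quoted href or src value contains an ampersand followed by a letter, which is what a parser may read as the start of an entity reference (correctly escaped references such as &amp; are listed too).

```shell
# find_ampersand_uris FILE
# Prints "LINE:TEXT" for each line holding a quoted href or src value in
# which "&" is directly followed by a letter. Exit status is 1 when no
# such line exists (standard grep behavior).
find_ampersand_uris() {
  grep -n -i -E "(href|src)=(\"[^\"]*&[[:alpha:]][^\"]*\"|'[^']*&[[:alpha:]][^']*')" "$1"
}

# Example:
#   find_ampersand_uris blogpage.html
```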
In sgmljs.net SGML, this is solved using WebSGML data specification attributes (as explained above), which don't get entity-expanded. On OpenSP, on the other hand, this problem can only be solved by accepting a potentially large number of recoverable errors and letting OpenSP output URLs as-is from input.
Continuing with our running example, we now want to produce XML from the input, to then feed it into some of the many available XML processing tools for further extraction or other processing.
With sgmljs.net SGML, this can be done by using
./sgmlproc \
-v output_format=xml \
-v dtd_handling=omit \
blogpage-with-html52mini-doctype-and-custom-url.html
and with OpenSP, by running
SP_BCTF=utf-8 osx -E 1000 \
blogpage-with-html5-doctype-and-modified-html51-dcl.sgm
where osx is the program of the OpenSP suite specifically designed for XML conversion, and where we must increase OpenSP's threshold for errors to 1000 so that it doesn't prematurely abort processing due to too many character encoding errors.
With XML output serialization, elements with declared content EMPTY, such as the img, meta, and hr elements, will be output with end-element tags or XML-style empty-element tags.
Using dtd_handling=omit makes sgmlproc skip outputting a DOCTYPE declaration (which we must do because HTML DTDs are SGML DTDs, and can't be used with XML-only parsers).
So this is what the output of the sgmlproc command stated above looks like (with osx's output being similar):
<HTML LANG="EN-US"><HEAD>
<META CONTENT="text/html; charset=UTF-8" HTTP-EQUIV="Content-Type">
</META><TITLE>Doc Searls Weblog · Holding forth on stuff since 1998</TITLE>
<LINK HREF="//s0.wp.com" REL="dns-prefetch"></LINK>
<LINK HREF="//s.gravatar.com" REL="dns-prefetch"></LINK>
<LINK HREF="//s.w.org" REL="dns-prefetch"></LINK>
<LINK HREF="http://blogs.harvard.edu/doc/feed/" REL="alternate" TITLE="Doc Searls Weblog » Feed" TYPE="application/rss+xml"></LINK>
<LINK HREF="http://blogs.harvard.edu/doc/comments/feed/" REL="alternate" TITLE="Doc Searls Weblog » Comments Feed" TYPE="application/rss+xml"></LINK>
<SCRIPT TYPE="text/javascript"><![CDATA[
window._wpemojiSettings = {"baseUrl":"https:\/\/s.w.org\/images\/core\/emoji\/2.3\/72x72\/","ext":".png","svgUrl":"https:\/\/s.w.org\/images\/core\/emoji\/2.3\/svg\/","svgExt":".svg","source":{"concatemoji":"http:\/\/blogs.harvard.edu\/doc\/wp-includes\/js\/wp-emoji-release.min.js?ver=4.8.1"}};
...
]]></SCRIPT>
<STYLE TYPE="text/css">
img.wp-smiley,
img.emoji {
display: inline !important;
border: none !important;
box-shadow: none !important;
height: 1em !important;
width: 1em !important;
margin: 0 .07em !important;
vertical-align: -0.1em !important;
background: none !important;
padding: 0 !important;
}
</STYLE>
...
<META CONTENT="Holding forth on stuff since 1998" NAME="description">
</META><META CONTENT="all" NAME="robots">
</META><LINK HREF="http://gmpg.org/xfn/11" REL="profile">
</LINK><LINK HREF="http://blogs.harvard.edu/doc/wp-content/themes/tarski/style.css" MEDIA="all" REL="stylesheet" TYPE="text/css">
</LINK><LINK HREF="http://blogs.harvard.edu/doc/wp-content/themes/tarski/library/css/print.css" MEDIA="print" REL="stylesheet" TYPE="text/css">
...
</HEAD>
<BODY CLASS="home blog centre janus" ID="home">
<DIV CLASS="tarski" ID="wrapper">
<DIV ID="header">
<DIV ID="header-image"><A HREF="http://blogs.harvard.edu/doc/" REL="home" TITLE="Return to main page"><IMG ALT="Header image" SRC="http://blogs.harvard.edu/doc/files/2014/01/gregory_blogheader3.jpg"></IMG></A></DIV>
<DIV ID="title">
<H1 ID="blog-title"><A HREF="http://blogs.harvard.edu/doc/" REL="home" TITLE="Return to main page">Doc Searls Weblog</A></H1>
</DIV>
<DIV CLASS="clearfix" ID="navigation"><UL CLASS="primary xoxo" ID="menu-menu-1"><LI CLASS="menu-item menu-item-type-custom menu-item-object-custom menu-item-8399" ID="menu-item-8399"><A HREF="http://blogs.law.harvard.edu/doc/">Home</A></LI>
...
We note sgmlproc has properly generated end-element tags for elements declared EMPTY such as img, meta, and others, and has also put CDATA marked section markers around content containing < and & characters from elements having declared content CDATA such as script and style.
We can also use the xmllint command-line program (installed as part of libxml2 on Unix-like systems) to verify that the output of sgmlproc is indeed well-formed XML.
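A minimal way to run that check is sketched below. It assumes the XML output has been saved to a file (out.xml is used here as an example name); check_well_formed is a hypothetical wrapper. Since the output carries no DOCTYPE, xmllint can only check well-formedness, not validity against a DTD.

```shell
# check_well_formed FILE
# Runs xmllint in well-formedness mode. --noout suppresses the serialized
# document, so only error messages (if any) are printed.
# Returns 2 if xmllint is not installed, else xmllint's own exit status.
check_well_formed() {
  if ! command -v xmllint >/dev/null 2>&1; then
    echo "xmllint not found; it ships with the libxml2 utilities" >&2
    return 2
  fi
  xmllint --noout "$1"
}

# Example:
#   check_well_formed out.xml && echo "well-formed"
```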
The result, while XML, leaves something to be desired, however, since tag and attribute names are produced in uppercase letters. This is because the file is being processed with the HTML 5 SGML declaration (either implicitly by sgmlproc when processing .html files, or explicitly by specifying an SGML declaration at the beginning of the file to parse), and the HTML 5 SGML declaration asserts SYNTAX NAMECASE GENERAL YES, which performs case-folding on element and attribute names and other name tokens.
To change this, we're going to parse our input document twice with sgmlproc, feeding the first parse's output into the second run: the first time with the (implicit) SGML declaration for HTML 5 as before and output_format=html (which guarantees lowercase element and attribute names and other name tokens), and the second time with SYNTAX NAMECASE GENERAL NO and output_format=xml (note we leave out -v dtd_handling=omit from the first invocation, as opposed to our previous run):
./sgmlproc \
-v output_format=html \
-- -o out.html \
blogpage-with-html52mini-doctype-and-custom-url.html
./sgmlproc \
-v output_format=xml \
-v dtd_handling=omit \
-v sgmldecl_syntax_namecase_general=NO \
-- -o out.xml \
out.html
We can use dtd_handling=omit on the second invocation to get rid of DTD declarations, which we aren't going to need anymore for parsing, since the first parse has taken care of normalizing enumerated attributes into canonical (XML-like) syntax, and the second parse has taken care of producing end-element tags for img and other elements with declared content EMPTY, and also of putting CDATA section markers around script and style content where necessary.
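A quick plausibility check that the two-pass run really produced lowercase names is to count lines that still contain tags starting with an uppercase letter. This is a text-level sketch only; count_uppercase_tag_lines is a hypothetical helper, and it may over-count if script content happens to contain text that looks like an uppercase tag.

```shell
# count_uppercase_tag_lines FILE
# Prints the number of lines containing a start or end tag whose name
# begins with an uppercase letter (for example <HTML> or </DIV>).
# Markup declarations and CDATA markers ("<!...") are not counted.
count_uppercase_tag_lines() {
  grep -c -E '</?[[:upper:]][[:alnum:]]*[ />]' "$1" || true
}

# Example: compare the single-pass and the two-pass XML output
#   count_uppercase_tag_lines out.xml
```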
Note: we're leaving OpenSP behind here as it doesn't have these and other options we're using for conversion.
For XHTML proper, W3C's HTML 5.2 specification imposes a number of additional constraints on top of requiring generic XML syntax. We're going to focus on the first two items, and leave the other ones as an exercise:
the XHTML namespace must be asserted as default xmlns namespace binding for the document (or must be otherwise represented in a way compatible with XML namespaces)
HTML's lang attribute should be represented as xml:lang
the special handling of HTML's noscript element by browsers may make it desirable to remove noscript elements and their child content from XHTML-serialized HTML documents
HTML's href attribute on a base element, if any, should be propagated to an xml:base attribute on the html document element in order to make interpretation of relative URI values conformant with XML assumptions
if desired, HTML's id attribute(s) could be represented as xml:id to have XML validate uniqueness of identifiers without additional DTD declarations
XLink attributes on foreign elements in HTML (actuate, arcrole, href, role, show, title, type as xlink:actuate and so on) must be preserved
also, xmlns and xmlns:xlink attributes must be preserved (and, in general, HTML with embedded SVG and MathML must be handled, which we're not going to do here for space reasons)
The first requirement - that of adding an XHTML namespace binding attribute to the html document element - is easy enough and can be achieved by merely customizing the internal subset for the HTML 5.2 mini-DTD, using the following declaration in place of the one we've used before:
<!DOCTYPE html SYSTEM "html52mini.dtd" [
<!ENTITY % if_uri_data_spec_attr "IGNORE">
<!ATTLIST html xmlns CDATA #FIXED "http://www.w3.org/1999/xhtml">
]>
(which adds an attribute list declaration to our earlier addition for preempting the if_uri_data_spec_attr parameter entity for URL customization).
This will always add (or enforce if present) an additional xmlns attribute on the document element. To make use of it, we edit blogpage-with-html52mini-doctype.html as described, save the result as blogpage-with-html52mini-doctype-and-xhtml-ns-binding.html, and invoke
./sgmlproc \
-v output_format=html \
-v dtd_handling=omit \
-- -o out.html \
blogpage-with-html52mini-doctype-and-xhtml-ns-binding.html
in place of our earlier invocation.
While this might suffice for basic documents, for additional XHTML rules we're going to have to use additional SGML concepts for describing markup transformations (we could of course also use XML-centric tools such as XSLT to do the same at this point).
SGML link process declarations (LPDs) are an additional type of declaration set (in addition to document type declarations) supported by SGML.
LPDs in the context of a larger SGML prolog can look as follows:
<!DOCTYPE doc [
<!ELEMENT doc ... >
<!ELEMENT el ... >
]>
<!LINKTYPE lnk doc #IMPLIED [
<!ATTLIST el lnkatt ...>
<!LINK #INITIAL el [ lnkatt="some value" ]>
]>
<doc>
<!-- ... further content goes here ... -->
</doc>
where the LPD lnk is declared as an implicit link process associating link attributes to elements declared in the doc document type. If it were declared as an explicit link process instead, the link process could e.g. take source markup according to the doc DTD, and produce target markup according to another DTD as specified in place of #IMPLIED in the link declaration. Multiple explicit link processes can be executed to form a pipeline where the result markup of one stage is fed as source markup stream into the next process, and pipelines can be configured automatically by SGML based on a desired target document type "view" requested by the user.
The <!LINK ... link set declaration in the example establishes "some value" as the value for the lnkatt link attribute on el elements. Similar to CSS properties in HTML, link attributes (those declared in a link process declaration) are not exposed as content attributes by SGML, even though they're declared using regular attribute list declaration syntax. Also much like CSS selectors, while the example link declaration assigns a value unconditionally, link attributes can also be conditionally assigned to content elements based on content attribute values and element state context.
When using OpenSP, the associated values of link attributes can be made visible using the onsgmls program, which produces ESIS output from markup (ESIS is a line-oriented markup representation for easy processing with classic Unix shell tools and Perl, and is also used in SGML test suites as reference output).
To produce the xmlns attribute using SGML LINK with sgmlproc, we can make use of sgmlproc's forward_link_attributes option to output link attributes as regular content attributes.
We're going to declare our link process for producing XHTML in an extra file (as external entity) rather than inline in the HTML input document because we want to apply a link process at the second stage (on the output of the initial html parse). We're going to use another special command-line flag available to sgmlproc, specifically designed to behave as if a particular link process given as command-line parameter were declared in our document when it's not actually contained in the SGML prolog.
We're not going to use the xmlns custom attribute technique explained in the previous section since we want to use LPDs for this purpose; hence we're working on our basic blogpage-with-html52mini-doctype-and-custom-url.html file again.
Our initial link process, test.lpd, looks as follows:
<!ATTLIST html xmlns CDATA #IMPLIED>
<!LINK #INITIAL html [ xmlns="http://www.w3.org/1999/xhtml" ]>
We invoke the first sgmlproc run as already explained above, and make use of our special command-line flags in the second sgmlproc invocation (note there's also a test.html file to check the second command on a much smaller input):
./sgmlproc \
-v output_format=html \
-- -o out.html \
blogpage-with-html52mini-doctype-and-custom-url.html
./sgmlproc \
-v output_format=xml \
-v dtd_handling=omit \
-v sgmldecl_syntax_namecase_general=NO \
-v active_lpd_names=test \
-v system_specific_implied_lpd_names=test \
-v forward_link_attributes=YES \
-- -o out.xhtml \
out.html
sgmlproc is instructed to activate our test link process by use of the active_lpd_names parameter, and will receive link attribute and link set declarations from the test.lpd file (sgmlproc looks for a file named after the implied LPD and automatically adds the .lpd file suffix).
The result in out.xhtml is the same as what we produced by adding the xmlns attribute via DTD attribute declarations, and will have a new xmlns attribute set to the XHTML namespace URI on html.
But with this setup, as opposed to our simpler initial model, we can now add additional conversion rules. For a start, we're going to transform the lang attribute into an xml:lang attribute, using whatever value was actually specified in the lang source attribute rather than hard-coding the value in a DTD or LPD attribute declaration. To do this, we're once again rewriting our setup to make use of SGML templating instead of relying on forward_link_attributes.
Templating is a versatile SGML technique introduced with sgmljs.net SGML to replace content of source files with that of "template" SGML files at places specified in link rules or #CONREF attributes in content, with type-safety and support for parameters. A template file to be inserted into result markup is a regular, standalone SGML file expected to parse as the element type which it replaces in source markup.
For example, to add an xml:lang attribute, the html element in our source document is targeted, and recreated using a template file such as the following:
<!DOCTYPE #IMPLIED SYSTEM [
<!ENTITY lang SYSTEM>
...
]>
<html xml:lang="&lang">
...
This template receives the lang parameter as SGML system-specific entity (declared without a system identifier) and references the entity to obtain its value in attribute content.
DOCTYPE #IMPLIED SYSTEM is a WebSGML feature for determining the document element name from the first content element in a markup stream, and for obtaining DTD declarations via a system-specific entity, resolved to the file XXX.dtd by sgmljs.net SGML, where XXX is the name of the document element. Actually, we could leave out the DTD here altogether, since DOCTYPE #IMPLIED SYSTEM is assumed by default, and the IMPLYDEF ENTITY YES semantics expressed in the default SGML declaration allow references to undeclared (general) entities with their declaration implied to be system-specific (as per the WebSGML specification).
Let's take a look at the place where this template is applied in the source document:
<!DOCTYPE html ...>
<!LINKTYPE xhtml html #IMPLIED [
<!NOTATION tmpl
PUBLIC
"ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
"mytemplate.sgm">
<!ATTLIST #NOTATION tmpl
lang CDATA #IMPLIED>
<!ATTLIST html
template NOTATION (tmpl) #IMPLIED>
<!LINK #INITIAL html [ template=tmpl ]>
]>
<html lang="en-US">...
...
These declarations
establish the tmpl notation as an SGML file stored in mytemplate.sgm with the lang data attribute
declare a link attribute on the html element with a notation as declared value (the tmpl notation)
set up a link rule to assign the tmpl notation to the notation link attribute on html
Using the SGML public identifier in this way makes sgmljs.net SGML apply the template on the html element. At runtime, the value for the LANG data attribute is populated from the content attribute on the html element having the same name, as per ISO 10744's DAFE specification.
But we're not done yet: since we're actually targeting the document element in this particular case, we basically also re-create the whole source document, including child content of the source html element, in the template file:
<!DOCTYPE #IMPLIED SYSTEM [
<!ENTITY LANG SYSTEM>
<!ENTITY content SYSTEM "<osfd>0">
]>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="&LANG">
&content
</html>
We do so by declaring an additional entity resolving to the "standard input" (Unix file descriptor 0). <osfd>0 is a Formal System Identifier and a notation to refer to the standard input character stream introduced by ISO 10744's FSIDR specification. sgmljs.net SGML, when executing a template processing sub-context on html, supplies the entire child content of the element on which the template is invoked via <osfd>0.
In effect, our template, up to adding xml:lang, is acting as an identity transform. We don't have anything special to do on the invoking site (in source markup), because sgmljs.net SGML populates <osfd>0 by default.
So let's put our template to work by storing the above snippet in mytemplate.sgm. Moreover, we create xhtml.lpd (which we're going to use to inject our LPD into out.html as described before with test.lpd) as follows:
<!NOTATION tmpl
PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
"mytemplate.sgm">
<!ATTLIST #NOTATION tmpl
lang CDATA #IMPLIED>
<!ATTLIST html
template NOTATION (tmpl) #IMPLIED
lang CDATA #IMPLIED>
<!LINK #INITIAL html [ template=tmpl ]>
Now by invoking our sgmlproc commands again (the first one is unchanged from before, and the second one activates the xhtml link process rather than test):
./sgmlproc \
-v output_format=html \
-- -o out.html \
blogpage-with-html52mini-doctype-and-custom-url.html
./sgmlproc \
-v output_format=xml \
-v dtd_handling=omit \
-v sgmldecl_syntax_namecase_general=NO \
-v active_lpd_names=xhtml \
-v system_specific_implied_lpd_names=xhtml \
-v forward_link_attributes=YES \
out.html
we now obtain decent XML output we can feed into (hypothetical) tools for further processing of XHTML.
download attributes
download, according to WHATWG specs since at least 2015 (but not in W3C HTML as of 5.2), can be used both with and without an attribute value. In SGML terms, download can be used as an attribute with CDATA declared value, or can be used as a name token on <a> anchor elements:
<a href="..." download="...">
<a href="..." download>
In our canonical output markup, we either want to have download used as a regular attribute with a value (equal to href if missing in source markup, or otherwise with the specified value), or don't want to have download specified as attribute or name token at all.
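The intended canonical form amounts to a small decision table. The sketch below only illustrates that rule in shell; canonical_anchor_attrs and its mode names (token, value, absent) are hypothetical and not part of sgmlproc, which realizes the rule through link processes and templates instead.

```shell
# canonical_anchor_attrs HREF MODE [VALUE]
# MODE describes how download appeared in the source markup:
#   token  - present without a value (<a download>): download gets HREF
#   value  - present with VALUE: the specified value is kept
#   absent - not present: no download attribute is produced
canonical_anchor_attrs() {
  href=$1 mode=$2 value=${3-}
  case $mode in
    token)  printf 'href="%s" download="%s"\n' "$href" "$href" ;;
    value)  printf 'href="%s" download="%s"\n' "$href" "$value" ;;
    absent) printf 'href="%s"\n' "$href" ;;
    *)      echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

# Example:
#   canonical_anchor_attrs /someurl token
```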
To make HTML parsing behave as expected, the following things have to be done:
an attribute (isdownload, say) must be declared in addition to those already declared in the html52.dtd or html52mini.dtd;
the isdownload attribute should be able to take on the download name token, such that using the attribute-minimized form <a download> will be treated the same as <a isdownload=download>;
a link process needs to be applied, dispatching on various alternatives for specifying a download link (e.g. with an explicit value for download as opposed to mere presence of the download name token).
While we have already customized the declaration of URI-typed attributes in our initial parsing examples above, there are additional considerations for templating URL attributes:
As part of templating <a> anchor elements, generally speaking, we'll be rewriting href values captured from source markup into entity references in attribute values specified in a template containing markup such as
<a href="&href" ...>
As already explained, the href attribute is declared, via the URI parameter entity, as a data specification attribute and thus treated as-is, rather than being entity-expanded. If we hadn't already changed the declaration of %URI to CDATA above, we would therefore have to redeclare the href attribute at this point anyway, such that &href will be recognized as an entity reference.
The following synthetic SGML document implements these customizations in its internal subset, and also includes an additional declaration for the download name token. Moreover, it includes a link process declaration for selecting an appropriate template based on whether download is present as name token, download isn't present at all (neither as name token nor as attribute), or download is present as attribute, respectively (page-with-download-link-demo.sgm):
<!DOCTYPE html SYSTEM "html52mini.dtd" [
<!ENTITY % if_uri_data_spec_attr "IGNORE">
<!ATTLIST a
isdownload (download) #IMPLIED
download CDATA #IMPLIED>
]>
<!LINKTYPE htmlfix html #IMPLIED [
<!NOTATION a-isdownload
PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
"a-isdownload.sgm">
<!ATTLIST #NOTATION a-isdownload
isdownload (download) #IMPLIED
download CDATA #IMPLIED
href CDATA #IMPLIED>
<!NOTATION a-nodownload
PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
"a-nodownload.sgm">
<!ATTLIST #NOTATION a-nodownload
isdownload (download) #IMPLIED
download CDATA #IMPLIED
href CDATA #IMPLIED>
<!ATTLIST a
template NOTATION (a-isdownload|a-nodownload) #IMPLIED
isdownload (download) #IMPLIED
download CDATA #IMPLIED
href CDATA #IMPLIED>
<!LINK #INITIAL
a [ isdownload=download template=a-isdownload ]
a [ download="" template=a-nodownload ]
a [ template=a-isdownload ]>
]>
<html>
<head>
<title>Page containing minimized download attribute</title>
</head>
<body>
<a href="/someurl" download>Download Link</a>
<a href="/otherurl">Regular Link</a>
</body>
</html>
The content of the a-isdownload.sgm and a-nodownload.sgm templates, respectively, is:
<!DOCTYPE #IMPLIED SYSTEM [
<!ENTITY content SYSTEM "<osfd>0">
<!ENTITY HREF SYSTEM>
]>
<a href="&HREF" download="&HREF">&content</a>
<!DOCTYPE #IMPLIED SYSTEM [
<!ENTITY content SYSTEM "<osfd>0">
<!ENTITY HREF SYSTEM>
]>
<a href="&HREF">&content</a>
To produce normalized <a> anchor elements from this document, invoke:
./sgmlproc \
-v output_format=html \
-v active_lpd_names=HTMLFIX \
page-with-download-link-demo.html
The resulting markup with normalized <a> anchor elements looks like this:
<html>
<head>
<title>Page containing minimized download attribute</title>
</head>
<body>
<a download="/someurl" href="/someurl">Download Link</a>
<a href="/otherurl">Regular Link</a>
</body>
</html>
download attribute handling
With this solution, we have changed the href attribute declaration globally, and for all URI-typed attributes. This customized declaration for URI attributes (like those for all other elements and attributes) is propagated into the processing context for the template application on <a>. We may want to restrict this interpretation to only href attributes in <a> elements, and to the template processing subcontext, rather than the primary parsing context.
For the first issue, we can alternatively preempt HTML.a.attlist to "IGNORE", and supply our own attribute declarations for <a>. For the second issue, we can restrict our modification to only the internal subset used in the template processing context for the template being applied on <a> elements.
To be able to influence the processing context used in a template application, we need to allow lax templating. Normally (in strict templating), sgmlproc checks that an SGML file used as template has <!DOCTYPE ... SYSTEM> as document prolog, where the declaration set resolved by SYSTEM is propagated by the calling main parsing context.
./sgmlproc \
-v output_format=html \
-v active_lpd_names=HTMLFIX \
-v system_specific_implied_lpd_names=htmlfix \
-v enable_lax_templates=YES \
page-with-download-link.html
See page-with-download-link.html, htmlfix.lpd, a-isdownload-lax.sgm, and a-nodownload-lax.sgm implementing this variant. Note we, again, make use of the system_specific_implied_lpd_names feature to inject a link process declaration (with declarations from htmlfix.lpd) into the parsed HTML file.
(left as an exercise)
consider validating variant types/content models of e.g. input elements (where an attribute determines the content model and/or type of other attributes)
consider checking accessibility, such as checking for presence of alt attributes and proper use of ARIA