SGML

Web Reference

Introduction

As a straightforward application of basic SGML entity substitution and templating on the web, this file, page.sgm, shows a very simple example of a SGML document. When accessed as a web page at http://.../page?name=Tom, references to the name entity are replaced by Tom, and the result is returned as the response to the browser (or can be rendered entirely within a web browser running the SGML User Agent):

<!doctype html [
    	<!element html - - any>
    	<!element p - - (#pcdata)>
	<!entity param system>
]>
<html>
	<head>
		<title>SGML page</title>
	</head>
	<body>
		<p conref=e>
	</body>
</html>

SGML Web Server

File name mapping

Provisioning SGML on the web is based on interpreting HTTP request URLs as file names, following the long-established terms and concepts of the Common Gateway Interface (CGI, https://tools.ietf.org/html/rfc3875).

For the purpose of resolving a URL to a file or other resource, a web host passes the request URL and other request parameters as values of the PATH_INFO, PATH_TRANSLATED, and other system-specific entities to SGML processing. The SGML processor then either prepares HTML from SGML or just serves static files, depending on what media type the user agent has indicated to accept in the request, and on what files are found to exist on the server's file system at the resolved location.

Interpretation and modification of PATH_TRANSLATED is analogous to what a classic CGI script receives via the SCRIPT_NAME and PATH_INFO/PATH_TRANSLATED variables: the classic scenario assumes there's a common document root (or web root) directory wherein a CGI program is looked up in a designated script directory. The program, if found, is then executed with the trailing part of the request URL (everything following the portion used to locate the program/script itself) as PATH_INFO parameter. PATH_TRANSLATED is derived from PATH_INFO by resolution against the document root, resulting in an absolute path name.

  • PATH_TRANSLATED can alternatively be computed without knowledge of a document root directory (provided SCRIPT_NAME is absolute against the web server's file system root, which it typically is not, however) by starting with the directory where SCRIPT_NAME resides, going back as many parent directories as there are path components in SCRIPT_NAME, and then appending PATH_INFO.
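The classic derivation of PATH_TRANSLATED from PATH_INFO and the document root can be sketched as follows. This is only an illustration, not sgmlweb's actual code: the function name translatePath is hypothetical, POSIX-style separators are assumed, and the refusal of ".." steps that would climb above the document root is an added assumption that the text above does not state.

```javascript
// Hypothetical helper: resolve PATH_INFO against a document root.
// Pure string handling with "/" separators; the file system is not touched.
function translatePath(documentRoot, pathInfo) {
  const resolved = [];
  for (const step of pathInfo.split("/")) {
    if (step === "" || step === ".") continue; // skip empty and no-op steps
    if (step === "..") {
      // Assumption: never resolve to a location above the document root
      if (resolved.length === 0) {
        throw new Error("PATH_INFO escapes the document root");
      }
      resolved.pop();
    } else {
      resolved.push(step);
    }
  }
  const root = documentRoot.replace(/\/+$/, ""); // drop trailing slashes
  return root + "/" + resolved.join("/");
}
```

With a document root of /var/www, translatePath("/var/www", "/docs/intro") yields the absolute path /var/www/docs/intro.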

Request routing rules

If a static file is requested, as determined by requesting a resource name (in PATH_TRANSLATED) having a dot in its last path component

  • it is served "as-is" (as a static file), with a media type (HTTP Content-Type) derived from the file extension

    • note the HTTP Accept-header isn't checked in this case

    • commonly requested static file types include prerendered .html files as well as .css, .js, and image files

  • otherwise, if a static resource by the name of PATH_TRANSLATED doesn't exist, a 404 NOT FOUND HTTP response is generated

Otherwise (PATH_TRANSLATED doesn't have a dot), if PATH_TRANSLATED can be resolved as a SGML file by appending the .sgm file extension

  • if text/sgml is accepted by the request

    • the SGML gateway determines scriptName as master SGML file/template and sends the static master file

  • otherwise, and by default (and if either no Accept header is present in the request or its value is text/html or a wildcard)

    • PATH_TRANSLATED is processed for producing HTML, and the output is served as a text/html response

Otherwise, (PATH_TRANSLATED cannot be resolved to a SGML file)

  • if the first path component of PATH_TRANSLATED (or the longest sequence of consecutive path steps contained twice in PATH_TRANSLATED) can be resolved as an SGML file by appending the .sgm file extension,

    • if text/sgml is accepted for the response, the resolved file is served statically as text/sgml

    • otherwise, the resolved file is processed for producing HTML, and the output is served as text/html response; the remaining part of PATH_TRANSLATED (not including the initial part up to and including the resolved SGML file) is resolved against the web root directory to an absolute path, and supplied as the PATH_TRANSLATED system-specific entity to SGML processing

Otherwise (when the request's PATH_TRANSLATED value couldn't be interpreted in any of the ways explained), a 404 NOT FOUND HTTP response is generated.
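The routing rules above can be condensed into a small decision function. The sketch below is a simplified reading, not sgmlweb's implementation: the names route and acceptsSgml and the shape of the returned objects are hypothetical, paths are written relative to the web root for brevity, file existence is supplied by the caller as a predicate, and the "longest sequence of consecutive path steps contained twice" variant is left out.

```javascript
// True if the Accept header value explicitly lists text/sgml
// (wildcards are treated as a request for HTML, the default).
function acceptsSgml(accept) {
  if (typeof accept !== "string") return false;
  return accept.split(",").some(function (part) {
    return part.split(";")[0].trim().toLowerCase() === "text/sgml";
  });
}

// path:   request path relative to the web root, e.g. "/blog/entry1"
// accept: value of the Accept request header, or undefined
// exists: caller-supplied predicate telling whether a file exists
function route(path, accept, exists) {
  const steps = path.split("/").filter(Boolean);
  const last = steps.length > 0 ? steps[steps.length - 1] : "";

  // Rule 1: a dot in the last path component means a static file
  if (last.indexOf(".") !== -1) {
    return exists(path)
      ? { action: "static", file: path }
      : { action: "notfound" };
  }

  const mode = acceptsSgml(accept) ? "sgml" : "html";

  // Rule 2: the whole path names a SGML file
  if (exists(path + ".sgm")) {
    return { action: mode, master: path + ".sgm" };
  }

  // Rule 3: the first path component names the master SGML file,
  // and the remainder identifies the client document
  if (steps.length > 1) {
    const master = "/" + steps[0] + ".sgm";
    if (exists(master)) {
      return { action: mode, master: master, rest: "/" + steps.slice(1).join("/") };
    }
  }

  // Rule 4: nothing matched
  return { action: "notfound" };
}
```

Passing the existence check in as a predicate keeps the decision logic independent of the file system, so it can be exercised with a plain list of file names.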

Whenever a SGML file is selected for processing, the file's modification date is checked against the value of the If-Modified-Since HTTP header, if present in the request. If the file hasn't been modified since that date, it's not processed, and a 304 NOT MODIFIED HTTP response is returned to the client instead. In case the processed SGML file is determined from the initial portion of PATH_TRANSLATED, the modification date of the file denoted by the remaining part is checked as well.
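A minimal sketch of such a freshness check follows, assuming the client's validator arrives in an If-Modified-Since request header and that the modification times of the master file (and of the client file, where applicable) have already been obtained as milliseconds since the epoch. The function name isNotModified is hypothetical.

```javascript
// ifModifiedSince: header value such as "Wed, 21 Oct 2015 07:28:00 GMT", or undefined
// mtimes: modification times (ms since epoch) of all files taking part in the response
function isNotModified(ifModifiedSince, mtimes) {
  if (!ifModifiedSince || mtimes.length === 0) return false;
  const since = Date.parse(ifModifiedSince);
  if (Number.isNaN(since)) return false; // unparsable date: process normally
  // HTTP dates have one-second resolution, so truncate file times to whole seconds
  return mtimes.every(function (mtime) {
    return Math.floor(mtime / 1000) * 1000 <= since;
  });
}
```

When the function returns true, a 304 NOT MODIFIED response can be sent without invoking SGML processing at all.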

Producing HTML from SGML

sgmlweb produces HTTP GET responses for text/html content by

  • locating an SGML file with a .sgm file suffix in the web document root directory matching the request URI in the most obvious way

  • invoking SGML processing on the located SGML with activating

    • the HTML target document type name

    • the WEB link process

    • the HTTP link process

  • returning the produced SGML processing output to the requestor, with HTTP response headers populated as discussed below.

This means that for SGML files in the web root directory that don't have HTML as base document type name, a link pipeline is inferred from the link process declarations present in the prolog of the SGML file such that the ultimate result document type name produced by the link pipelining is HTML. It's an error if neither the base document type is HTML, nor a target document type name of HTML can be produced by forming a valid sequence of link processes from those declared in the prolog.

Moreover, sgmlweb activates the WEB and HTTP link processes. If WEB and/or HTTP are declared in the document prolog, any inferred link process pipeline will always contain the WEB and HTTP link process, respectively (but either WEB or HTTP can be omitted as described below).

CGI meta-variables

The following system-specific entities are exposed to SGML and can be declared.

DOCUMENT_ROOT

Web document root directory

SCRIPT_NAME

Absolute path to master SGML file (primary SGML being accessed)

PATH_INFO

URL portion following the part identifying SCRIPT_NAME, if any

includes a leading slash character

PATH_TRANSLATED

Absolute path to the file corresponding to PATH_INFO (if PATH_INFO is set, ie. if there's a URL portion following the part identifying the master SGML file)

PATH_TRANSLATED_CONTENT

Content of the file corresponding to PATH_TRANSLATED, if any

REQUEST_METHOD

HTTP method used for the request (eg. GET or POST)

Additional CGI meta-variables

The following system-specific parameter entities/CGI meta-variables are additionally made available (see https://tools.ietf.org/html/rfc3875 for an explanation) when either declared manually or conditionally declared via referencing a parameter entity for the +//IDN sgmljs.net//ENTITIES CGI 1.1//EN public identifier:

Note PATH_INFO, PATH_TRANSLATED, and PATH_TRANSLATED_CONTENT are not necessarily (or even typically) passed internally from a web server to SGML, but are what SGML passes to an SGML document processing context.

Public text for PATH_INFO/PATH_TRANSLATED defaults/preemption

If a request URL consists of just a path name identifying an SGML resource, no PATH_INFO (and hence no PATH_TRANSLATED etc.) system-specific entities are exposed, and accessing (or even declaring) those is treated as an error.

To be able to process requests both with and without PATH_INFO/PATH_TRANSLATED using the same master document, CGI meta-variables can be declared using the +//IDN sgmljs.net//ENTITIES CGI 1.1//EN public text like this:

<!DOCTYPE html ... [
	<!ENTITY % cgivars PUBLIC "+//IDN sgmljs.net//ENTITIES CGI 1.1//EN">
	%cgivars;
	...
]>

Referencing the +//IDN sgmljs.net//ENTITIES CGI 1.1//EN public text in a declaration set as shown is equivalent to declaring the CGI meta-variables as both system-specific general and parameter entities manually. However, the PATH_INFO and PATH_TRANSLATED entities (and the PATH_TRANSLATED_CONTENT as a general entity) are only declared if actually supplied in the processing context.

In particular, fallback values for those can be specified in the document itself. For example, the following declaration set

<!DOCTYPE html ... [
	<!ENTITY % cgivars PUBLIC "+//IDN sgmljs.net//ENTITIES CGI 1.1//EN">
	%cgivars;
	<!ENTITY PATH_TRANSLATED "somefile">
]>

assumes values for PATH_TRANSLATED as obtained from the trailing part of a request URL. However, if the request URI doesn't contain a trailing part after the part that identifies the master document itself, %cgivars will leave PATH_TRANSLATED undeclared; hence the subsequent entity declaration for PATH_TRANSLATED will supply the effective value for it.

In this way, a master document can assign a fallback value for an absent client document file name (eg. because the HTTP request URI lacks a secondary path step), such as the file name of the latest or otherwise most relevant client document of a document collection sharing a common path prefix.
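The fallback mechanism relies on the SGML rule that, when an entity is declared more than once, the first declaration is the one that takes effect. The following sketch models just that rule. The names createEntityTable and resolvePathTranslated are hypothetical and only stand in for entity management, which in reality is done by the SGML parser.

```javascript
// Minimal model of an entity table in which the first declaration wins.
function createEntityTable() {
  const entities = new Map();
  return {
    declare: function (name, value) {
      // Later declarations of an already declared entity are ignored
      if (!entities.has(name)) entities.set(name, value);
    },
    get: function (name) {
      return entities.get(name);
    }
  };
}

// suppliedByRequest: value derived from the request URL, or undefined if absent
// fallback: value declared in the document itself after %cgivars;
function resolvePathTranslated(suppliedByRequest, fallback) {
  const table = createEntityTable();
  // %cgivars; declares PATH_TRANSLATED only if the processing context supplies it
  if (suppliedByRequest !== undefined) {
    table.declare("PATH_TRANSLATED", suppliedByRequest);
  }
  // The document's own declaration follows and is only effective
  // when nothing was declared before
  table.declare("PATH_TRANSLATED", fallback);
  return table.get("PATH_TRANSLATED");
}
```

Under these assumptions, resolvePathTranslated(undefined, "somefile") returns "somefile", whereas a value supplied with the request takes precedence over the fallback.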

HTTP response LPD

While CGI meta-variables represent data handed by the web server to SGML, HTTP response meta-variables (such as the HTTP response status) are data returned from SGML processing to the web server along with result markup as response body.

Conceptually, HTTP response meta-variables are represented as link attributes of a simple link process. A simple link process declares link attributes on the document element of the response body carrying HTTP response meta-variables. HTTP response link attributes are declared in a distinguished link process declaration identified by the +//IDN sgmljs.net//LPD HTTP 1.1//EN and +//IDN sgmljs.net//LPD HTTP 2.0//EN public text identifiers.

These LPDs behave as if declared as follows:

<!ENTITY % HTTP_RESPONSE_STATUS "200">
<!ENTITY % HTTP_RESPONSE_CONTENT_TYPE "text/html">
<!ENTITY % HTTP_RESPONSE_LOCATION "">
<!ENTITY % HTTP_CACHE_VALIDATION_ENTITIES "">
<!ATTLIST html
	status NUMBER #FIXED %HTTP_RESPONSE_STATUS
	location CDATA #FIXED "%HTTP_RESPONSE_LOCATION"
	content-type CDATA "%HTTP_RESPONSE_CONTENT_TYPE"
	cache-validation-entities ENTITIES "%HTTP_CACHE_VALIDATION_ENTITIES"
	...>

To make sgmlweb return values for response meta-variables to user agents other than the defaults, a master document declares an LPD with one of the distinguished LPDs as external subset, and then preempts one or more of the parameter entities used as default values for link attributes. For example, the following master document makes sgmlweb send a 404 HTTP status to a web browser:

<!DOCTYPE html ... [
]>
<!LINKTYPE http PUBLIC "+//IDN sgmljs.net//LPD HTTP 1.1//EN" [
	<!ENTITY % HTTP_RESPONSE_STATUS "404">
]>
...

Note that the name of the link process must be http; other LPDs referencing the public identifier for HTTP response meta-variables won't get activated by sgmlweb (and are hence ignored).

Note that HTTP_RESPONSE_STATUS and other parameter entities can be preempted from any link process, not just from the http link process, subject to declaration set preemption.

Note that since the LPD determines the names of the parameter entities it accepts as #FIXED values, there's no need to have link processing determine link attributes. All that has to happen is that a respective LPD ("deriving" from a distinguished response LPD) is declared and activated; the effective values of the respective parameter entities can then be queried from entity management, just like request parameters.

In effect, specifying values for response meta-variables is syntactically very similar to declaring request parameters.

Note the LPD is tied to the html response document element, though.

Response meta-variables

The following parameter entities, when declared/preempted as described, have the following respective meanings:

HTTP_RESPONSE_STATUS

numeric HTTP status to respond

default: 200

the valid HTTP response status codes are 100, 101, 200, 201, 202, 203, 204, 205, 206, 300, 301, 302, 303, 304, 305, 307, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 426, 451, 500, 501, 502, 503, 504, and 505

of those, all requests with non-2xx or with 204 NO CONTENT status get terminated after prolog processing, without producing a response body, and a generic response body for 4xx and 5xx responses, if applicable, is populated by the web server instead

the reason phrase (such as NOT MODIFIED for 304 responses) is also generated by the web server

on a 301 MOVED PERMANENTLY, 302 FOUND, 303 SEE OTHER (on POST), and 307 TEMPORARY REDIRECT response status, a redirect URL is configured in the HTTP_RESPONSE_LOCATION parameter entity; it is an error if the HTTP_RESPONSE_LOCATION isn't declared/preempted (and will lead to a 500 INTERNAL SERVER ERROR response)

HTTP_RESPONSE_LOCATION

Redirect-URL for 301 MOVED PERMANENTLY, 302 FOUND, 303 SEE OTHER and 307 TEMPORARY REDIRECT; see above

HTTP_RESPONSE_CONTENT_TYPE

response media type

default: text/html

preempted values must match the syntax of an RFC 1521 media type (such as application/xhtml+xml with optional parameters) and have text or application as (main) type; other values will lead to a 500 INTERNAL SERVER ERROR generic error response
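How the effective parameter entity values might be turned into a response head is sketched below. The function name responseHead, the returned object shape, and the media type pattern are hypothetical simplifications. Mapping a status outside the listed codes to 500 is likewise an assumption, since the text only specifies 500 for a missing redirect location and for an unacceptable content type.

```javascript
const VALID_STATUS = [100, 101, 200, 201, 202, 203, 204, 205, 206,
  300, 301, 302, 303, 304, 305, 307,
  400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413,
  414, 415, 416, 417, 426, 451, 500, 501, 502, 503, 504, 505];
const REDIRECT_STATUS = [301, 302, 303, 307];
// Simplified media type check: text/* or application/*, optional parameters
const MEDIA_TYPE = /^(text|application)\/[A-Za-z0-9][A-Za-z0-9!#$&^_.+-]*(\s*;.*)?$/;

// params: effective values of the response parameter entities, keyed by name
function responseHead(params) {
  const status = Number(params.HTTP_RESPONSE_STATUS || "200");
  const location = params.HTTP_RESPONSE_LOCATION || "";
  const contentType = params.HTTP_RESPONSE_CONTENT_TYPE || "text/html";
  const isRedirect = REDIRECT_STATUS.indexOf(status) !== -1;

  const invalid =
    VALID_STATUS.indexOf(status) === -1 ||  // assumption: unknown status is an error
    (isRedirect && location === "") ||      // redirect without a location
    !MEDIA_TYPE.test(contentType);          // unacceptable media type
  if (invalid) return { status: 500, headers: {}, hasBody: false };

  const headers = { "Content-Type": contentType };
  if (isRedirect) headers["Location"] = location;
  // Only 2xx responses other than 204 carry a body produced from SGML content
  const hasBody = status >= 200 && status < 300 && status !== 204;
  return { status: status, headers: headers, hasBody: hasBody };
}
```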

Response caching meta-variables (pro)

HTTP_CACHE_VALIDATION_ENTITIES

space-separated list of names of entities, declared in the document prolog, that refer to files; the most recent modification date among these files is used to populate the value of the Last-Modified HTTP header

URI query parameters

The QUERY_STRING system-specific entity contains the query part of a URL if the request URL contains a query, such as in http://example.com/page?param=value...

Moreover, the value of QUERY_STRING is parsed according to the rules for HTML form-encoding, and any individual query parameters (such as param in the above example) are made available to SGML processing and can be accessed by declaring a system-specific (general or parameter) entity with the respective name.

Note case-folding of system-specific entities is applied according to the SGML declaration or out-of-band settings, while the names of URI query parameters are not made uppercase when supplied as system-specific entities. This means that when NAMECASE ENTITY YES is in effect for SGML processing, query parameters must be supplied in uppercase letters in the request URI; otherwise, they must be supplied in the request URI in the exact sequence of upper- and lowercase letters used in the system-specific entity declaration.

Note mapping query parameters to system-specific entities is only useful if the query parameters in a URI are unique; when the URI contains multiple key=value pairs for the same key (such as produced by browsers when submitting an HTML form with multiple same-named fields), this model of accessing and processing query parameters isn't useful, and the technique of parsing the raw QUERY_STRING value via SGML short references as explained below is used instead.
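The mapping from a query string to individual entity values, including detection of repeated keys, can be illustrated as follows. This is a sketch using the standard URLSearchParams class for form-decoding. The function name queryEntities and the decision to keep the first value of a repeated key are assumptions, not documented sgmlweb behavior.

```javascript
// Split a form-encoded query string into candidate entity values.
// Returns the decoded values by name, plus the names that occur more than once.
function queryEntities(queryString) {
  const entities = {};
  const repeated = [];
  new URLSearchParams(queryString).forEach(function (value, key) {
    if (Object.prototype.hasOwnProperty.call(entities, key)) {
      // Repeated key: per-parameter entities are not useful for this name
      if (repeated.indexOf(key) === -1) repeated.push(key);
    } else {
      entities[key] = value;
    }
  });
  return { entities: entities, repeated: repeated };
}
```

For the query string name=Tom&city=New+York, the result holds Tom for name and the decoded value "New York" for city, with no repeated names.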

URI parameter lexical type checks and injection prevention

System-specific entities received from HTML form-like query parameters in HTTP GET or POST requests can be escaped by declaring them as external data entities, eg.:

    <!NOTATION some-notation SYSTEM "...">
    <!ENTITY uri_parameter1 SYSTEM CDATA some-notation>

Characters used for markup delimiters such as < and & in replacement text for data entity references get replaced by numeric character entity references when expanded in content and are not interpreted as markup delimiters by SGML.

Note: System-specific entities not declared as data entities don't receive HTML escaping, and can thus potentially contain malicious markup such as script elements that, when expanded into a context without further constraints (such as element exclusion exceptions), generally represents a security threat. Therefore, HTML form-like parameters in URLs, or parameters transferred otherwise, should generally be declared as data entities, unless the transferred entities are explicitly meant to represent markup as explained below.
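The effect of this escaping on a received value can be pictured with the following sketch, which replaces the two delimiter characters named above by numeric character references. The function name escapeDataEntity is hypothetical, and the actual set of characters handled by sgmlweb may be larger.

```javascript
// Replace the markup delimiter characters < and & by numeric character references.
function escapeDataEntity(text) {
  return text.replace(/[<&]/g, function (ch) {
    return "&#" + ch.charCodeAt(0) + ";";
  });
}
```

A value such as <script> thus arrives in the output as &#60;script>, which a browser displays as text rather than interpreting it as a start tag.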

HTML form input value checking

While declaring an entity as data entity as shown already ensures escaping of HTML markup delimiters, sgmlweb provides a distinguished notation with the public identifier

+//IDN www.w3c.org/TR/html5//NOTATION HTML 5 Form Input Types//EN

representing lexical value spaces of HTML form input values.

The notation representing form input value types is declared as follows:

<!NOTATION html5-form-input
	PUBLIC
	"+//IDN www.w3c.org/TR/html5//NOTATION HTML 5 Form Input Types//EN">

<!ATTLIST #NOTATION html5-form-input
	TYPE (text|email|url|number|date|time|datetime) text
	PATTERN CDATA #IMPLIED>

A system-specific entity can make use of this notation to obtain extended lexical value checks. For example, the declaration

<!ENTITY formparam SYSTEM CDATA html5-form-input [ type="email" ]>

declares the form-like input parameter formparam as having type email; sgmlweb will check the received value against the lexical rules for email addresses described in the HTML 5 specification (eg. allow me@home but not me).

Likewise, entities declared as form input value data entities with text, url, number, date, time, or datetime-local type are checked against the respective validation rules of the HTML 5 specification.

Any HTML form input value type will have the effect that value normalization is performed on the respective entity replacement value. The most general text input value accepts any input text, and performs value normalization by removing all newline characters from the entity value. Other types perform value checks and normalizations in addition to the value normalization performed for text type values. Of the supported input types for validation, only the number input types performs additional value normalization (namely, any + characters are removed, an uppercase letter E exponent separator is changed to lowercase, and a leading zero is added where a number value begins with a decimal point/dot character).
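The two normalizations described above are simple enough to sketch directly. The function names are hypothetical, and the sketch covers normalization only, not the validity checks for the number type.

```javascript
// text: remove all newline characters
function normalizeText(value) {
  return value.replace(/[\r\n]/g, "");
}

// number: text normalization, then remove "+" characters, lowercase the
// exponent separator, and add a leading zero before an initial decimal point
function normalizeNumber(value) {
  let result = normalizeText(value).replace(/\+/g, "").replace(/E/g, "e");
  if (result.charAt(0) === ".") result = "0" + result;
  return result;
}
```

For example, normalizeNumber("+1.5E+3") gives 1.5e3, and normalizeNumber(".5") gives 0.5.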

The optional pattern data attribute can contain a regular expression in JavaScript syntax that the replacement value is checked against, in addition to the checks and normalizations implied by the type attribute.

Note that not all features of JavaScript regular expressions are supported. In particular, lookahead operators, Unicode code points, and other PCRE-specific constructs other than the \d, \w, and \s special symbols are not available.
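A check combining the email type with an optional pattern might look like the sketch below. The email expression is the one given for valid e-mail addresses in the HTML specification. The function name checkFormValue is hypothetical, and anchoring the pattern to the whole value (as HTML does for the pattern attribute of input elements) is an assumption about sgmlweb's behavior.

```javascript
// Valid e-mail address syntax as given in the HTML specification
const EMAIL = /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

// value:   received (and already normalized) entity replacement value
// type:    value of the TYPE data attribute, e.g. "text" or "email"
// pattern: value of the PATTERN data attribute, or undefined
function checkFormValue(value, type, pattern) {
  if (type === "email" && !EMAIL.test(value)) return false;
  if (pattern !== undefined) {
    // Assumption: the pattern must match the entire value
    const anchored = new RegExp("^(?:" + pattern + ")$");
    if (!anchored.test(value)) return false;
  }
  return true;
}
```

In line with the example above, checkFormValue("me@home", "email") is accepted, whereas checkFormValue("me", "email") is rejected.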

HTML form input lexical types are also available for WebSGML attribute data specifications. A declaration such as

<!ATTLIST elmt attr DATA html5-form-input [ type="email" ]>

will make sgmljs.net SGML check and value-normalize the attr value against the rules for email addresses.

Likewise, data attributes can also be declared for notations, such as those used as template notations. For example, the following declaration

<!NOTATION sgml ...>
<!ATTLIST #NOTATION sgml attr DATA html5-form-input [ type="email" ]>

declares the attr data attribute of the sgml notation as having type email, and enforces lexical checks for any value passed in to the sgml notation template.

Error Handling

A 400 BAD REQUEST HTTP response status is emitted by sgmlweb when form input validation fails on one or more system-specific entities supplied via HTML form-like GET variables (or, equivalently, variables POSTed in application/x-www-form-urlencoded request bodies). A 4xx status is only generated when the validation is performed on a data entity declared as having a HTML form input lexical value, rather than on a data attribute declared as having a HTML form input lexical value (a DATA attribute), for which an unspecific 500 HTTP status is reported instead.

The latter restriction is because form input validation on DATA attributes

  • might not be tightly traceable to an input value (eg. because the input value is composed of expanded general entity replacement text)

  • is performed lazily as part of content parsing rather than prolog parsing (hence can't be reported as HTTP status in a HTTP header which must be determined before content parsing).

A 400 BAD REQUEST signifies to the client/web browser that a request is malformed (whereas a 5xx status can be interpreted as advice to retry a request at a later time), so is generally the more appropriate status to return on form input value errors.

HTTP POST requests

sgmlweb's built-in support for HTML form-like GET requests with query parameters contained in the request URI is also applied automatically to HTML form POST requests having the application/x-www-form-urlencoded media type (HTML form-like GET requests include static requests with a URI query part, the parameters of which are indistinguishable from form-like URI query parameters).

Values transferred in HTML form POST request bodies with application/x-www-form-urlencoded media type are exposed exactly the same as request URI parameters. As far as query parameters are concerned, POST request bodies differ from form GET requests only in that they transfer query parameters in the request body rather than as part of the request URI. Therefore, the request URI for form POST requests with application/x-www-form-urlencoded request bodies must not contain a query part, since the query part is assumed to be contained in the request body.

Raw query string access

For HTTP POST queries (TODO: what about GET?) with application/x-www-form-urlencoded media type, the CGI meta-variable QUERY_STRING is exposed as a system-specific entity to the SGML processing context. In addition, QUERY_STRING_DECODED is provided, containing a variant encoding of QUERY_STRING where the & (ampersand) characters are replaced by ; (semicolon) characters so as to be more useful for interpretation as SGML.

Specifically, raw query string (QUERY_STRING_DECODED) processing via SGML short references is necessary when the URI query string contains multiple key=value pairs for the same key, such as is commonly emitted from HTML forms containing tabular data or multiple repeated field groups.
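The relationship between QUERY_STRING and QUERY_STRING_DECODED, and the kind of repeated-key data the raw form preserves, can be sketched as follows. Both function names are hypothetical. In sgmlweb the grouping itself is done in SGML via short references, so the second function merely illustrates the data involved.

```javascript
// Derive the QUERY_STRING_DECODED variant: "&" separators become ";"
function decodedQueryString(queryString) {
  return queryString.replace(/&/g, ";");
}

// Collect all values per key, keeping the order of occurrence
function groupValues(decoded) {
  const groups = Object.create(null); // no inherited keys such as "constructor"
  decoded.split(";").forEach(function (pair) {
    if (pair === "") return;
    const eq = pair.indexOf("=");
    const key = eq === -1 ? pair : pair.slice(0, eq);
    const value = eq === -1 ? "" : pair.slice(eq + 1);
    if (!groups[key]) groups[key] = [];
    groups[key].push(value);
  });
  return groups;
}
```

For row=1&row=2&name=x, the decoded variant is row=1;row=2;name=x, and grouping yields the two values 1 and 2 for row.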

Raw request body access

Request bodies other than those with application/x-www-form-urlencoded content type (which are read and processed as described above) can be accessed via the <osfd>0 FSI, eg.:

    <!NOTATION some-notation SYSTEM "...">
    <!ENTITY raw_reqest_uri_query_part SYSTEM "<osfd>0" CDATA some-notation>

Error response bodies

If processing results in a 4xx or 5xx effective HTTP status, either by regular sgmlweb processing or via setting a custom HTTP response status as described above, sgmlweb attempts to render an error response body. Rendering error responses is no different from rendering regular SGML pages, but any request parameters are cleared and not available. This is because the request parameters might be erroneous or malicious, and hence might make rendering the error response body fail for the same reason that regular sgmlweb rendering has already failed for the requested URL and request parameters.

When rendering an error response, the system-specific entity STATUS is available (as a read-only entity) in the processing context.

A simple error page might look like the following example:

<!DOCTYPE html SYSTEM "about:legacy-compat" [
    	<!ENTITY STATUS SYSTEM>
]>
<html>
	<head>
		<title>Error &STATUS</title>
	</head>
	<body>
		<p>Error serving requested page</p>
	</body>
</html>

Error page rendering is implemented as a separate HTTP request cycle in Node.js implementations, and is visible and customizable via JavaScript code, whereas it appears hardcoded as part of sgmlweb request processing in other environments.

Likewise, the name of the error page (/error.sgm) is hard-coded in some sgmlweb builds, but can be chosen freely as part of the configuration of regular request processing handler chains in other sgmlweb builds (such as for Node.js).

File descriptor setup

Files resolved by the SGML Web Server Gateway itself are passed as open file descriptors to SGML processing, such that those can be accessed using <osfd> FSIs. The processing environment can access up to five file descriptors:

  • 0 (stdin): contains POSTed body content, when used and supported in the request

  • 1 (stdout): output of SGML processing; can be a file or a buffer

  • 2 (stderr): error output and log destination

  • 3 (main input): contains character data from the master file (resolved using either the complete path as of the initial value of PATH_TRANSLATED, or just the first path component of PATH_TRANSLATED otherwise)

  • 4: contains character data resolved using the remainder portion of PATH_TRANSLATED, if file descriptor 3 was resolved using only the first path component

File preopening (pro)

Before passing control to core SGML processing, the SGML gateway (on select execution environments) pre-opens the scriptName filename as /dev/fd/N, and PATH_TRANSLATED, if relevant, as /dev/fd/N+1.

Accessing open file descriptors rather than opening files by path name from main SGML processing as needed avoids race conditions and has generally desirable properties wrt. exploiting POSIX file system guarantees for atomic/continued/high-available content delivery in the presence of concurrent content change and maintenance.

Specifically, this is done to be able to guarantee that, after SGML prolog parsing, no request processing will fail due to missing template and/or client document files, and that the content of determined files remains accessible to SGML processing even if it is subject to concurrent change or deletion during processing.

In particular

  • the template and client files (if any) are held open by the SGML Web Server Gateway process, so those files can be atomically changed while request processing on previous content is underway (due to Unix file system guarantees)

  • note that these guarantees do not hold for external entities, other than the content of PATH_TRANSLATED, that are referenced from the main template file (or from the client file, if it is atypically transcluded as a template and can have entity declarations)

  • a proper 404 NOT FOUND HTTP status can be sent, rather than beginning the response with a 200 OK status and then detecting non-existence during content processing and having e.g. to include error message character data along with user content

  • to this end, SGML processing takes advantage of being designed such that no output character data is written before the first content arrives in output buffer handling, ie. prolog data is buffered until actual content begins; hence HTTP 404 NOT FOUND or another non-default status can be set at the end of SGML prolog processing
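The POSIX guarantee relied upon here can be demonstrated with a few lines of Node.js. The sketch below is not part of sgmlweb. It creates a throwaway file in the system's temporary directory, opens it, deletes it by name, and then still reads the content through the open file descriptor. The function name readAfterUnlink is hypothetical, and the behavior shown assumes a POSIX file system.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Show that content stays readable through a pre-opened descriptor
// after the file name has been removed (POSIX semantics assumed).
function readAfterUnlink(content) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "sgmlweb-"));
  const file = path.join(dir, "page.sgm");
  fs.writeFileSync(file, content);

  const fd = fs.openSync(file, "r");        // pre-open, as the gateway would
  fs.unlinkSync(file);                      // concurrent deletion by maintenance
  const existsByName = fs.existsSync(file); // the name is gone ...
  const text = fs.readFileSync(fd, "utf8"); // ... but the content is still there

  fs.closeSync(fd);
  fs.rmdirSync(dir);
  return { existsByName: existsByName, text: text };
}
```

Opening by name a second time would fail at this point, which is exactly the race that passing open descriptors avoids.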

Response status

Request processing is performed under the assumption of sending a default HTTP result of 200 OK unless set explicitly to another status, and assuming that, in general, the HTTP result status can't be changed once any output has been emitted.

Request processing is performed such that the complete SGML prolog of the document instance to process is validated before emitting any output. On any prolog parsing errors (including when system-specific parameter entities couldn't be resolved), processing is aborted, and a proper 404 NOT FOUND or 500 INTERNAL SERVER ERROR HTTP status is generated (depending on whether eg. an operating system error of ENOENT or non-ENOENT, respectively, was encountered).

Parsing, resolution, or other errors during content parsing, on the other hand, can't typically be reported via HTTP error status codes because response headers will have been sent to the client already by the time a content error is encountered.

Conditional requests (pro)

As already explained for the individual routing branches, once the file name has been resolved from PATH_TRANSLATED as sketched above, and before content is actually accessed and sent, the client document is checked for changes; if it hasn't changed since the date and time indicated in the request, a 304 NOT MODIFIED HTTP response is generated. Note that access policies etc. don't come into play here, as the content body isn't transferred with 304 responses.

SGML User Agent

The SGML User Agent (sgml-ua.js) is a JavaScript (ES5) program for web browsers designed to produce HTML from SGML in the same way that HTML is produced from SGML on a SGML Web Server, thereby transparently offloading SGML processing to the browser, and at the same time saving network bandwidth by avoiding redundant network transfer of repeated partial page content.

While the SGML User Agent is designed to run against a SGML Web Server, it can also run against any other (e.g. simple static) web server lacking SGML support, in browser-only mode, with reduced user agent functionality. More generally, a SGML web setup can involve:

  • both server and browser processing: SGML is rendered transparently on either, or both, the server (for the initial page load) and in the browser, depending on whether JavaScript is enabled/allowed in the browser

  • browser-only processing: SGML files are accessed as static files from the web server and then rendered into a displayed HTML DOM on the browser

  • server-only processing: SGML pages are sent as server-rendered HTML pages to browsers; JavaScript support on the browser isn't required

The SGML User Agent, when started, determines if the web page is running from a server with server-side SGML support by inspecting page metadata in the HTML head element. If the head element does not contain

<link rel="alternate" type="text/sgml" ...>

then the SGML User Agent assumes it is running off a web server without server-side SGML rendering support.
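In terms of the DOM, this check amounts to a single selector query. The sketch below is an illustration rather than the code of sgml-ua.js. The function name hasServerSideSgml is hypothetical, and the argument only needs to provide a querySelector method, as a browser's document object does.

```javascript
// doc: a Document (or any object offering querySelector)
// Returns true if the page advertises a text/sgml alternate in its head.
function hasServerSideSgml(doc) {
  const selector = 'head link[rel="alternate"][type="text/sgml"]';
  return doc.querySelector(selector) !== null;
}
```

In a browser, the function would be called as hasServerSideSgml(document) once the page has loaded.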

Server-side SGML support is required for proper session history, whereas when server-side SGML support isn't advertised via the link element as shown, browser-refresh and back-navigation from an external site linked to from the SGML webpage will take the user to the initial landing page (the static or prerendered HTML page carrying the SGML User Agent script). Moreover, bookmarking works only with server-side SGML support.

The basic functionality of the SGML User Agent is to, on window.onload, attach click handlers to the current document's local (same-domain) links performing SGML page rendering (transforming SGML content to HTML/DOM).

Specifically, this is enabled on anchors that have the same effective protocol/host/port as the invoking page, and that either have no type attribute specified, or have text/sgml specified as its value.

Once a page is rendered using SGML, its anchors get captured by SGML event handling for further navigation within the domain name, in turn.
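The eligibility test for anchors can be expressed with the standard URL class, as in the following sketch. The function name isSgmlLink and its parameters are hypothetical. In a browser, href and type would come from the anchor's attributes and pageUrl from window.location.href.

```javascript
// href:    the anchor's href attribute (possibly relative)
// type:    the anchor's type attribute, or null/undefined if not specified
// pageUrl: absolute URL of the invoking page
function isSgmlLink(href, type, pageUrl) {
  let page, target;
  try {
    page = new URL(pageUrl);
    target = new URL(href, page); // resolve relative links against the page
  } catch (e) {
    return false; // unparsable URL: leave the link to the browser
  }
  // "host" includes the port, so this compares protocol, host, and port
  const sameOrigin = target.protocol === page.protocol && target.host === page.host;
  const typeOk = type == null || type === "text/sgml";
  return sameOrigin && typeOk;
}
```

Links for which the function returns true would get the SGML rendering click handler, while all others keep the browser's default navigation.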

History maintenance

History pushState()/popstate handling works in a natural way for basic forward and backward navigation: if we're about to navigate to another page (on the same domain, so rendered via SGML), we're just storing the previous page via pushState(), with the URL used to fetch SGML (or HTML on the initial page). When we return to this state via backward navigation, the popstate event handler will pop the state and start re-rendering the HTML from the pushed href URL in the same way the SGML page was rendered when first visited.

On a page (browser) refresh, the browser reloads window.location using the regular browser page loading algorithm. When the server can render SGML to HTML (as triggered by an HTTP Accept header favouring HTML over SGML) there's nothing special to do here, since the re-visited page gets rendered server-side (and carries the sgml-ua.js script to attach to link handlers for further browser-local SGML processing, just like on an initial page load).

Static serving support

On the other hand, when working against a static web server, window.location will fetch SGML text, and browsers will render the SGML code text as either plain text (Chrome) or possibly broken HTML (FF, IE). Therefore, support for static servers involves further browser history manipulation.

Blocking browser refresh isn't possible in general. There exist various attempts/scripts to accomplish this by either

  • intercepting key events (but these techniques won't handle clicking on a browser refresh icon button); what can be achieved here, by either returning a non-void value from beforeunload event handling, setting the event's returnValue, or calling preventDefault, is to bring up a "do you really want to leave?" warning, but the reload action as such can't be prevented.

  • establishing a new browser context in an iframe or an HTML4 frame (see e.g. Disabling the Back Button), but these techniques are generally considered user-hostile

  • not creating history entries in the first place; ie. according to Back Button Behavior on a Page With an iframe (although being about iframes mostly), replaceState() can be used to suppress creating history entries; namely, if, on a click event, the navigated-to href is replaceState()d into the same as the top-most one, then no history entry is created; this could be used to block any backward (and forward) navigation, but (if it actually works) is overreaching since it will disable plain backward navigation between SGML-rendered pages, which isn't a problem even against servers without server-side SGML rendering.

So what SGML User Agent does is to ensure that, while a page is up, window.location points to the landing page of the current page, ie. the page through which the site was entered, which in many cases will be the site's home page, but could be any page carrying the sgml-ua.js script. As the final part of SGML rendering (after rendered link target URLs in the generated page have been changed into absolute/resolved form), window.location is set to the landing page URL. When the page is left, the history entry for the page view is then restored to the original SGML resource URL, rather than the landing page URL, so that on plain back navigation, regular popstate handler execution will render the SGML resource.

The original resource URL is stored in the history entry's data field (for as long as it is shadowed by the landing page URL).

Executing history restoration globally on the unload/beforeunload (or even pagehide/pageshow) events isn't possible, since Ajax page loads don't trigger those events. Therefore, history restoration is executed on individual outgoing link click events, along with SGML processing for the new page.
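The shadowing and restoration of history entries described above can be sketched as follows. This is a minimal illustration, not the actual sgml-ua.js implementation: it assumes replaceState() is the mechanism used to change the displayed URL, the state key sgmlResource and all function names are hypothetical, and the history object is passed in as a parameter so that window.history or a stand-in can be used.

```javascript
// Key under which the original SGML resource URL is kept in the history
// entry's data (state) object; the name is hypothetical.
const RESOURCE_KEY = "sgmlResource";

// Final step of rendering: make the current entry (and thus a browser
// refresh) point at the landing page, remembering the real resource URL.
function shadowWithLandingPage(history, resourceUrl, landingUrl) {
  history.replaceState({ [RESOURCE_KEY]: resourceUrl }, "", landingUrl);
}

// Before leaving via a same-site link: restore the current entry to the
// original resource URL so that back navigation re-renders that resource.
function restoreResourceUrl(history) {
  const state = history.state;
  if (state && state[RESOURCE_KEY]) {
    history.replaceState(state, "", state[RESOURCE_KEY]);
  }
}

// Click handling for a same-site link: restore the entry being left, push
// an entry for the new page, render it, then shadow the new entry.
function navigateTo(history, href, landingUrl, render) {
  restoreResourceUrl(history);
  history.pushState({ [RESOURCE_KEY]: href }, "", href);
  render(href);
  shadowWithLandingPage(history, href, landingUrl);
}
```

In a browser, navigateTo(window.history, a.href, landingUrl, renderSgml) would be called from a link's click handler, where renderSgml stands for whatever function fetches and renders the SGML resource.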

Note that no history restoration takes place (the landing page history entry is kept) on outgoing external links, since those don't get a click handler for SGML processing attached and hence exhibit standard browser behavior on link activation. For external links we're going to lose the JavaScript execution context and can't register handlers; when navigating back from an external page we must therefore enter an HTML (not SGML) page carrying the sgml-ua.js script.

  • SGML User Agent starts out on an initial landing page carrying this script

  • following a link on the page in the same domain will result in rendering the link target using SGML

  • backward-navigation (to either an earlier rendered SGML page or the landing page) is also performed using SGML

  • navigation/following links to external sites will end SGML UA execution and continue with standard browser HTML loading/rendering; on return to a SGML-rendered site, if the site is running on static web server without server-side SGML rendering, the landing page, not the page through which the site was left, is reloaded; only if using server-side SGML can the proper page of departure be loaded

  • page refreshes take the user to the landing page when not served from a web server with support for server-side SGML; only if using server-side SGML will the current page be reloaded (it's not possible to intercept browser behavior on refresh)

  • when running off a static web server without server-side SGML rendering, context menus (as activated by right-click or long-click/hold) are blocked to prevent Open Link in New Tab and bookmarking from being offered (neither of which works against a static server)

Application development utilities pro

SQL data rendering and complex form handling

This section describes facilities for publishing tab-separated value data streams produced from SQL queries or other sources into HTML markup via bundled functions for dynamically generating required markup declarations.

Moreover, a technique for implementing an endpoint for HTML forms submission with support for tabular data insertion (where data is presented to SGML as query string with possibly multiple repeating field groups) is explained.

Producing HTML content from SQL data

As explained in the context of short reference parsing, tab-separated values pulled-in from an external source such as a file or SQL query can be made available as markup elements using short reference use and map declarations specific to a particular tab-separated data source stream.
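The effect of such short reference maps on a TSV stream can be imitated at the string level. The sketch below is an illustration only and does not use SGML parsing: the function tsvToMarkup and its parameters are hypothetical, and it simply turns tabs into field boundaries and line ends into record boundaries, which is the element structure the short reference declarations are meant to produce.

```javascript
// Turn tab-separated text into element markup, one record element per line
// and one field element per tab-separated value.
// (tsvToMarkup is a hypothetical helper; no escaping of markup characters
// is performed, just as TSV data parsed via short references is still
// subject to SGML markup recognition.)
function tsvToMarkup(tsv, fields, container = "tsv", record = "record") {
  const lines = tsv.split(/\r?\n/).filter((line) => line.length > 0);
  const records = lines.map((line) => {
    const values = line.split("\t");
    const cells = fields.map(
      (name, i) => `<${name}>${values[i] ?? ""}</${name}>`
    );
    return `<${record}>${cells.join("")}</${record}>`;
  });
  return `<${container}>${records.join("")}</${container}>`;
}
```

For instance, tsvToMarkup("first value\tsecond value\tthird value\n", ["field1", "field2", "field3"]) yields a tsv element containing a single record element with field1, field2, and field3 children.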

sgmljs.net SGML provides, via custom storage manager notations, bundled functions for automatically generating the required short reference declarations for TSV parsing, given the names of attributes provided as data attributes. Moreover, further markup declaration generators are provided to extend basic TSV parsing into a generic mechanism for formatting tab-separated values, by feeding the tabular data rows obtained from TSV parsing into SGML templating.

For supplying parsed TSV data values to templating, input data must be provided as markup attributes rather than elements. To collect a sequence of (text contents of) elements produced from TSV parsing into attributes, sgmljs.net SGML uses the techniques of

  • re-mapping of element content into attributes provided via the NotNames attribute (part of ISO 10744 DAFE support) when a template notation is applied on an element in a link process

  • propagating attribute values to preceding sibling elements via #CURRENT link attribute default semantics.

To demonstrate these techniques, consider the following example input document representing possible output of TSV parsing set up as discussed in record boundary insertion:

<!DOCTYPE tsv [
	<!ELEMENT tsv - - (record+)>
	<!ELEMENT record - - (field1,field2,field3)>
	<!ELEMENT field1 - - (#PCDATA)>
	<!ELEMENT field2 - - (#PCDATA)>
	<!ELEMENT field3 - - (#PCDATA)>
]>
<!DOCTYPE table [
	<!ELEMENT table - - (tr+)>
	<!ELEMENT tr - - (td+)>
	<!ELEMENT td - - (#PCDATA)>
]>
<!LINKTYPE lpd tsv table [
	<!NOTATION field1-template
		PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
		"template.sgm">
	<!NOTATION field2-template
		PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
		"template.sgm">
	<!NOTATION field3-template
		PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
		"template.sgm">
	<!ATTLIST #NOTATION (field1-template|field2-template|field3-template)
		field1 CDATA #CURRENT>
	<!ATTLIST #NOTATION (field2-template|field3-template)
		field2 CDATA #CURRENT>
	<!ATTLIST #NOTATION field3-template
		field3 CDATA #IMPLIED>
	<!ATTLIST (field1|field2|field3)
		field1 CDATA #IMPLIED
		field2 CDATA #IMPLIED
		field3 CDATA #IMPLIED
		template
			NOTATION (field1-template|field2-template|field3-template)
			#IMPLIED
		NotNames CDATA #IMPLIED>
	<!LINK #INITIAL
		tsv table
		record tr
		field1 [ template=field1-template NotNames="field1 #CONTENT" ] #IMPLIED
		field2 [ template=field2-template NotNames="field2 #CONTENT" ] #IMPLIED
		field3 [ template=field3-template NotNames="field3 #CONTENT" ] td>
]>
<tsv>
	<record>
		<field1>first value</field1>
		<field2>second value</field2>
		<field3>third value</field3>
	</record>
	<!-- further data following here:
	<record>
		<field1>...</field1>
		...
	</record>
	-->
</tsv>

For this example, template.sgm is expected to contain:

<!DOCTYPE #IMPLIED SYSTEM [
	<!ENTITY field1 SYSTEM>
	<!ENTITY field2 SYSTEM>
	<!ENTITY field3 SYSTEM>
]>
<tr>
	<td>&field1</td>
	<td>&field2</td>
	<td>&field3</td>
</tr>

The result markup of processing the example document with the lpd link process activated is as follows (omitting commented text):

<table>
	<tr>
		<td>first value</td>
		<td>second value</td>
		<td>third value</td>
	</tr>
</table>

The link process contains template notation declarations for the individual fieldN elements, and link rules applying the template notation to the respective fieldN elements. Crucially, the result element of every link rule except the one for the field3 element (the last element of the record content model) is #IMPLIED, meaning that the template is applied only if the source element (eg. field1 or field2) can be placed into the result context. Since neither field1 nor field2 can appear anywhere in the result HTML-like table content, no template is applied to the field1 and field2 elements. The link rules placed on these fieldN elements exist solely

  • for applying NotNames rules, which collect the text content of the element on which the template is placed into the field1 and field2 values, respectively (and likewise for field3)

  • for updating the current values for field1 and field2, respectively, in the link attribute processing context

Since the field3 element shares the declarations for field1 and field2 as #CURRENT link attributes, the values for field1 and field2 are transported to the context for the field3 element, where the field3-template is applied as a regular template, since the result element tr is expected/admitted at the result context position.

In this way, the content of the field1, field2, and field3 elements in the input source is propagated to the field1, field2, and field3 data (DAFE) attributes, and is hence available as entities in the sub-processing context for applying template.sgm on field3.
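The data flow just described (collecting element content into attributes, carrying values forward, and applying the template only at the last field) can be imitated outside of SGML. The following sketch is an analogy rather than a description of how link processing is implemented: the functions applyTemplate and formatRecords are hypothetical, and a plain object plays the role of the #CURRENT link attribute values.

```javascript
// Replace &name (optionally followed by ";") in a template with the value
// of the same-named attribute; unknown references are left untouched.
function applyTemplate(template, attrs) {
  return template.replace(/&([A-Za-z][A-Za-z0-9_.-]*);?/g, (ref, name) =>
    Object.prototype.hasOwnProperty.call(attrs, name) ? attrs[name] : ref
  );
}

// records: array of records, each an array of [fieldName, textContent]
// pairs in document order. (Hypothetical helper for illustration.)
function formatRecords(records, template) {
  const current = {}; // plays the role of #CURRENT link attribute values
  const out = [];
  for (const fields of records) {
    fields.forEach(([name, content], i) => {
      // analogous to NotNames="<name> #CONTENT": content becomes an attribute
      current[name] = content;
      // only the last field of a record has a result element, so only
      // there is the template actually applied
      if (i === fields.length - 1) out.push(applyTemplate(template, current));
    });
  }
  return out.join("");
}
```

Called with the single record of the example and the template "<tr><td>&field1</td><td>&field2</td><td>&field3</td></tr>", this produces the tr row shown in the result markup above.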

sgmljs.net SGML provides the built-in storage manager notations tsv_element_decl, tsv_entity_decl, tsv_shortref_decl, tsv_usemap_decl, tsv_notation_decl, tsv_linkattr_decl, and tsv_linkrule_decl to generate markup declarations from fields and params data attributes as required for the above declaration fragments.

The above example, rewritten to use markup declaration generators for fetching TSV records and applying a formatting template, looks as follows:

<!DOCTYPE tsv [
	<!NOTATION sql SYSTEM>
	<?IS10744 FSIDR sql tsv_element_decl tsv_entity_decl tsv_shortref_decl tsv_usemap_decl tsv_notation_decl
	  FSIDefDoc="+//IDN sgmljs.net//DTD FSISM TSV parsing declaration utilities//EN">
	<!ELEMENT tsv - - (record+)>
	<!ELEMENT record - - (name,gender_cd)>
	<!ENTITY % element-decls
		SYSTEM '<tsv_element_decl container="tsv" record="record" fields="name gender_cd">'>
	<!ENTITY % entity-decls SYSTEM '<tsv_entity_decl container="tsv" record="record" fields="name gender_cd">'>
	<!ENTITY % shortref-decls SYSTEM '<tsv_shortref_decl container="tsv" record="record" fields="name gender_cd">'>
	<!ENTITY % usemap-decls SYSTEM '<tsv_usemap_decl container="tsv" record="record" fields="name gender_cd">'>
	%element-decls;
	%entity-decls;
	%shortref-decls;
	%usemap-decls;
	<!ENTITY % query-results SYSTEM
		"<sql>connect 'Driver=SQLite;Database=test.db'
		      set headings off
		      set colsep '    '
		      select name, gender_cd
		      from names_tbl
		      where gender_cd = 0 order by name;">
        <!ENTITY query-results "%query-results">
]>
<!DOCTYPE table SYSTEM [
	<!-- <!ELEMENT table - - (tr+)>
	<!ELEMENT tr O O (td+)>
	<!ELEMENT td - - (#PCDATA)> -->
]>
<!LINKTYPE lnk tsv table [
	<?IS10744 FSIDR tsv_notation_decl tsv_linkattr_decl tsv_linkrule_decl
	  FSIDefDoc="+//IDN sgmljs.net//DTD FSISM TSV parsing declaration utilities//EN">
	<!ENTITY % notation-decls SYSTEM '<tsv_notation_decl container="tsv" record="record" fields="name gender_cd" template_sysid="sql-names-gendercd-query-with-aggregation-into-last-field2-referenced-template.sgm">'>
	%notation-decls
	<!ENTITY % linkattr-decls SYSTEM '<tsv_linkattr_decl container="tsv" record="record" fields="name gender_cd">'>
	%linkattr-decls
	<!ENTITY % linkrule-decls SYSTEM '<tsv_linkrule_decl container="table" record="tr" fields="name gender_cd">'>
	<!LINK #INITIAL
		tsv table
		%linkrule-decls>
]>
<tsv>
&query-results</tsv>

With the given values for the container, record, and fields data attributes as supplied in the example prolog, the respective storage manager notation FSIs generate markup declaration text corresponding to fragments of the initial example.
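To give an impression of the kind of declaration text such generators produce, the sketch below builds element declarations and link rules modeled on the hand-written declarations of the initial example. It is an approximation under stated assumptions: elementDecls and linkRules are hypothetical JavaScript functions, not the bundled tsv_element_decl and tsv_linkrule_decl functions, and the actual generated text may differ.

```javascript
// One element declaration per field, as in the initial example's
// <!ELEMENT field1 - - (#PCDATA)> declarations. (Hypothetical helper.)
function elementDecls(fields) {
  return fields.map((f) => `<!ELEMENT ${f} - - (#PCDATA)>`).join("\n");
}

// One link rule per field: every rule but the last has #IMPLIED as result
// element, so only the last field triggers template application.
// (Hypothetical helper; the <field>-template naming follows the initial example.)
function linkRules(fields, resultElement) {
  return fields
    .map((f, i) => {
      const result = i === fields.length - 1 ? resultElement : "#IMPLIED";
      return `${f} [ template=${f}-template NotNames="${f} #CONTENT" ] ${result}`;
    })
    .join("\n");
}
```

For fields name and gender_cd, elementDecls yields two element declarations, and linkRules(["name", "gender_cd"], "tr") yields a rule for name with #IMPLIED and a rule for gender_cd with tr as the result element.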

See the Bundled Modules API documentation for the detailed description of the tsv_element_decl, tsv_entity_decl, tsv_shortref_decl, tsv_usemap_decl, tsv_notation_decl, tsv_linkattr_decl, and tsv_linkrule_decl functions of the bundled tsvparsing module.

Shorthand notations for rendering SQL data into HTML

As a convenience, an SGML document for generic SQL selection as just described can be constructed by just using the +//IDN sgmljs.net//NOTATION SQL query formatting template for HTML table element//EN public identifier (or one of its variants, such as for producing a tbody element instead). The following example shows a complete (simplified) HTML-like document where SQL data is rendered into an HTML table element with tr elements as record (row) container, and td as data cell element:

<!DOCTYPE HTML [
	<!ELEMENT HTML O O (TABLE|P)+>
	<!ELEMENT TABLE - - (TR+)>
	<!ELEMENT TR - - (TD+)>
	<!ELEMENT TD - - (#PCDATA)>
	<!ELEMENT P O O (#PCDATA|A)+>
	<!ELEMENT SPAN - - (#PCDATA)>
	<!ATTLIST SPAN PROPERTY CDATA #IMPLIED>
	<!ELEMENT A - - (#PCDATA)>
	<!ATTLIST A HREF CDATA #IMPLIED TITLE CDATA #IMPLIED>
	<!ATTLIST TABLE REF ENTITY #CONREF PROPERTY CDATA #IMPLIED>
]>
<!LINKTYPE LISTBOOKS #SIMPLE #IMPLIED [
	<!NOTATION SGML PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN">
	<!NOTATION SQLQUERY PUBLIC "+//IDN sgmljs.net//NOTATION SQL query formatting template for HTML table element//EN">
	<!ATTLIST #NOTATION SQLQUERY
		SUPERDCN NAME #FIXED SGML
		FIELDS NAMES #FIXED "NAME"
		PARAMS NAMES #FIXED "GENDER_CD"
		GENDER_CD CDATA #REQUIRED
		TEMPLATE_SYSID CDATA #FIXED '<literal><tr><td>&name</td></tr>'>
	<!ENTITY FEMALENAMES SYSTEM "<literal>
		set colsep '	'
		set underline off
		connect 'Driver=SQLite;Database=/tmp/test.db'
		select name from names_tbl where gender_cd = cast('&gender_cd' as decimal);"
		NDATA SQLQUERY [ GENDER_CD="0"]>
	]>
	<html>
		<table ref=femalenames>
	</html>

SQL data insertion in response to POSTed data content

For executing SQL INSERT or other statements from POSTed URL-encoded data, as would be produced from an HTML form having an enctype of application/x-www-form-urlencoded with potentially multiple repeating groups of query keys/fields, SGML similar to the following boilerplate can be used (where, instead of actual SQL invocation using a sql storage manager notation, a literal template for rendering the supplied values as tr elements is used):

<!DOCTYPE doc [
	<!ELEMENT doc - - (sub+)>
	<!ELEMENT sub O O (key,value,key,value)>
	<!ELEMENT key - - (#PCDATA)>
	<!ELEMENT value - O (#PCDATA)>
	<!ENTITY start-key "<key>">
	<!ENTITY end-key-start-value "</key><value>">
	<!ENTITY end-value-start-key "</value><key>">
	<!SHORTREF in-doc ";" start-key>
	<!SHORTREF in-key "=" end-key-start-value>
	<!SHORTREF in-value ";" end-value-start-key>
	<!USEMAP in-doc doc>
	<!USEMAP in-key key>
	<!USEMAP in-value value>
]>
<!DOCTYPE html [
	<!ELEMENT html - - (table)>
	<!ELEMENT table O O (tr+)>
	<!ELEMENT tr - - (#PCDATA)>
	<!ATTLIST tr f_attr CDATA #IMPLIED g_attr CDATA #IMPLIED>
]>
<!LINKTYPE lnk doc html [
	<!-- entity supplied by sgmlweb containing a semicolon-separated
	     (rather than ampersand-separated) URI query string -->
	<!ENTITY QUERY_STRING_DECODED SYSTEM>
	<!-- dummy notation(s) just for enjoying #CURRENT attribute propagation;
	     not actually executed -->
	<!NOTATION aggregate-current-values
		PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
		"non-existant.sgm">
	<!NOTATION check-current-key-is-f
		PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
		"non-existant.sgm">
	<!NOTATION check-current-key-is-g
		PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
		"non-existant.sgm">
        <!NOTATION formatting
		PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN"
		"<literal><tr f_attr='&f_attr' g_attr='&g_attr'></tr>">
	<!ATTLIST #NOTATION (formatting|aggregate-current-values)
		f_attr NUMBER #CURRENT>
	<!ATTLIST #NOTATION check-current-key-is-f
		key CDATA #FIXED "f">
	<!ATTLIST #NOTATION check-current-key-is-g
		key CDATA #FIXED "g">
	<!ATTLIST #NOTATION formatting
		g_attr CDATA #CURRENT>
	<!ATTLIST (key|value|sub)
		f_attr CDATA #IMPLIED
		g_attr CDATA #IMPLIED
		NotNames CDATA #IMPLIED
		key CDATA #IMPLIED
		template NOTATION (check-current-key-is-f|check-current-key-is-g|aggregate-current-values|formatting) #IMPLIED>
        <!LINK #INITIAL
		doc html
		key #POSTLINK after-key-f [ template=check-current-key-is-f NotNames="key #CONTENT" ] #IMPLIED>
	<!LINK after-key-f
		value [ template=aggregate-current-values NotNames="f_attr #CONTENT" ] #IMPLIED
    		key #POSTLINK after-key-g [ template=check-current-key-is-g NotNames="key #CONTENT" ] #IMPLIED>
	<!LINK after-key-g
		value [ template=formatting NotNames="g_attr #CONTENT" ] tr>
]>
<doc>
<key>&QUERY_STRING_DECODED</sub>

</doc>

If the value for QUERY_STRING_DECODED were supplied as f=1;g=value1g;f=2;g=value2g, such as when the SGML document is accessed by POSTing to http://../..?f=1&g=value1g&f=2&g=value2g, this document, with the lnk link process activated, would invoke the formatting template on each logical data row. Each row is represented by a sub element, with the f_attr and g_attr attributes corresponding to the respective database columns.

To collect URL query parameters in repeating groups into elements, the document

  • makes use of short references to rewrite equals (=) and semicolon characters into <key>...</key><value>...</value><key>...</key>... element sequences

  • then inserts, via SGML tag inference and constraining the respective content model, an enclosing sub element, acting as record container element

  • then uses link processing with NotNames to collect element content of value elements into f_attr and g_attr attributes, respectively.

Moreover, the expected sequence of values for the key element content (eg. f on every odd, and g on every even element) is enforced.
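The grouping and key sequence check performed by the document can be paraphrased in a few lines of script. This is a sketch of the same logic outside of SGML, not of how sgmlweb produces QUERY_STRING_DECODED: the function names are hypothetical, percent-decoding is left out, and the output mirrors the literal tr template of the example.

```javascript
// Convert an ampersand-separated query string into the semicolon-separated
// form assumed for QUERY_STRING_DECODED (percent-decoding is not shown).
function toSemicolonForm(query) {
  return query.split("&").join(";");
}

// Split "f=1;g=value1g;f=2;g=value2g" into groups of key/value pairs,
// enforcing that keys occur in the expected repeating order.
// (parseGroups and renderRows are hypothetical helpers.)
function parseGroups(decoded, expectedKeys) {
  const pairs = decoded
    .split(";")
    .filter((p) => p.length > 0)
    .map((p) => {
      const eq = p.indexOf("=");
      if (eq < 0) throw new Error(`missing "=" in "${p}"`);
      return [p.slice(0, eq), p.slice(eq + 1)];
    });
  if (pairs.length % expectedKeys.length !== 0) {
    throw new Error("incomplete group of fields");
  }
  const groups = [];
  for (let i = 0; i < pairs.length; i += expectedKeys.length) {
    const group = {};
    expectedKeys.forEach((key, j) => {
      const [k, v] = pairs[i + j];
      if (k !== key) throw new Error(`expected key "${key}" but found "${k}"`);
      group[key] = v;
    });
    groups.push(group); // corresponds to one inferred sub element
  }
  return groups;
}

// Render each group as in the example's literal formatting template.
function renderRows(groups) {
  return groups
    .map((g) => `<tr f_attr='${g.f}' g_attr='${g.g}'></tr>`)
    .join("");
}
```

For example, renderRows(parseGroups(toSemicolonForm("f=1&g=value1g&f=2&g=value2g"), ["f", "g"])) yields one tr element per f/g pair, while a query with keys out of order is rejected with an error.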