SGML

Web Reference

SGML Web Server

As a straightforward application of basic SGML entity substitution and templating on the web, the file page.sgm below is a very simple SGML document. When accessed as a web page at http://.../page?name=Tom, the reference to the name entity is replaced with Tom, and the result is returned as the response to the browser (or the document can be rendered entirely within a web browser running the SGML User Agent):

<!doctype html [
	<!element html - - any>
	<!element p - - (#pcdata)>
	<!-- system-specific entity; receives the value of the "name" query parameter -->
	<!entity name system>
]>
<html>
	<head>
		<title>SGML page</title>
	</head>
	<body>
		<p conref=name>
	</body>
</html>

File name mapping

Provisioning SGML on the Web is based on interpreting HTTP request URLs as file names, following the long-established CGI terms and concepts (https://tools.ietf.org/html/rfc3875).

For the purpose of resolving a URL to a file or other resource, a web host passes the request URL and other request parameters to SGML processing as the values of the PATH_INFO, PATH_TRANSLATED and other system-specific entities. The SGML processor then either prepares HTML from SGML, or just serves static files, depending on what media type the user agent has indicated to accept in the request, and on what files are found to exist on the server's file system at the resolved location.

Interpretation and modification of PATH_TRANSLATED is analogous to what a classic CGI script receives via the SCRIPT_NAME and PATH_INFO/PATH_TRANSLATED variables: the classic scenario assumes there's a common document root (or web root) directory wherein a CGI program is looked up in a designated script directory. The program, if found, is then executed with the trailing part of the request URL (everything following the portion used to locate the program/script itself) as PATH_INFO parameter. PATH_TRANSLATED is derived from PATH_INFO by resolution against the document root, resulting in an absolute path name.

  • PATH_TRANSLATED can alternatively be computed without knowledge of a document root directory (provided SCRIPT_NAME is absolute against the web server's file system root, which it typically is not, however) by starting with the directory where SCRIPT_NAME resides, going back as many parent directories as there are path components in SCRIPT_NAME, and then appending PATH_INFO
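
The classic derivation of PATH_TRANSLATED from PATH_INFO can be sketched in a few lines of JavaScript. This is only an illustration, not the gateway's actual code: the function name pathTranslated and the example document root are hypothetical, and the sketch performs no checks against ".." path steps escaping the document root.

```javascript
var path = require("path");

// Classic CGI derivation: resolve PATH_INFO against the document root,
// yielding an absolute path name.
function pathTranslated(documentRoot, pathInfo) {
  // Strip leading slashes so PATH_INFO is treated as relative to the root.
  return path.posix.join(documentRoot, pathInfo.replace(/^\/+/, ""));
}

console.log(pathTranslated("/var/www", "/docs/page"));
```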

Request routing rules

If a static file is requested, as determined by requesting a resource name (in PATH_TRANSLATED) having a dot in its last path component

  • it is served "as-is" (as a static file), with a media type (HTTP Content-Type) derived from the file extension

    • note the HTTP Accept-header isn't checked in this case

    • commonly requested static file types include prerendered .html files as well as .css, .js, and image files

  • otherwise, if a static resource by the name of PATH_TRANSLATED doesn't exist, a 404 NOT FOUND HTTP response is generated

Otherwise (PATH_TRANSLATED doesn't have a dot), if PATH_TRANSLATED can be resolved as a SGML file by appending the .sgm file extension

  • if text/sgml is accepted by the request

    • the SGML gateway determines scriptName as master SGML file/template and sends the static master file

    • for HTTP/2, optionally, the file resolved as PATH_TRANSLATED and other statically included entities, as determined by SGML document prolog parsing (LPD pre-scanning), are sent along with the master file via HTTP/2 Push

  • otherwise, and by default (that is, if no Accept header is present in the request, or its value is text/html or a wildcard)

    • PATH_TRANSLATED is processed as SGML file, with the optional WEB link process activated, and the output is served as text/html response

Otherwise (PATH_TRANSLATED cannot be resolved to a SGML file)

  • if the first path component of PATH_TRANSLATED (or the longest sequence of consecutive path steps contained twice in PATH_TRANSLATED) can be resolved as an SGML file by appending the .sgm file extension,

    • if text/sgml is accepted for the response, the resolved file is served statically as text/sgml

    • otherwise, the resolved file is processed as SGML, with the optional WEB link process activated, and the output is served as text/html response; the remaining part of PATH_TRANSLATED (not including the initial part up to and including the resolved SGML file) is resolved against the web root directory to an absolute path, and supplied as the PATH_TRANSLATED system-specific entity to SGML processing

Otherwise (when the request's PATH_TRANSLATED value couldn't be interpreted in any of the ways explained), a 404 NOT FOUND HTTP response is generated.
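
The routing rules above can be summarized as a small decision function. The following is a simplified sketch, not the gateway's implementation: the names route and acceptsSgml, the relPath parameter (the request path relative to the web root, without leading slash), and the injected exists predicate are all assumptions made for illustration. The "longest sequence of consecutive path steps contained twice" variant and HTTP/2 Push are left out.

```javascript
var path = require("path");

// True if the Accept header explicitly lists text/sgml.
function acceptsSgml(accept) {
  return (accept || "").split(",").some(function (item) {
    return item.split(";")[0].trim().toLowerCase() === "text/sgml";
  });
}

// webRoot: document root; relPath: request path relative to it (assumption);
// exists: predicate telling whether a file exists (injected for testability).
function route(webRoot, relPath, accept, exists) {
  var target = path.posix.join(webRoot, relPath);
  var steps = relPath.split("/");

  // 1. A dot in the last path component means a static file request.
  if (steps[steps.length - 1].indexOf(".") !== -1) {
    return exists(target) ? { action: "static", file: target } : { status: 404 };
  }

  // 2. The complete path resolves to an SGML file.
  var master = target + ".sgm";
  var client = null;
  if (!exists(master)) {
    // 3. Only the first path component resolves to an SGML file;
    //    the remainder becomes the new PATH_TRANSLATED.
    master = path.posix.join(webRoot, steps[0] + ".sgm");
    client = path.posix.join(webRoot, steps.slice(1).join("/"));
    if (steps.length < 2 || !exists(master)) return { status: 404 };
  }

  return acceptsSgml(accept)
    ? { action: "static", file: master, type: "text/sgml" }
    : { action: "render", file: master, pathTranslated: client, type: "text/html" };
}
```

Passing the existence check as a function keeps the decision logic independent of the file system; a real gateway would presumably open the files at this point rather than merely test for them.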

Whenever a SGML file is selected for processing, the file's modification date is checked against the value of the If-Modified-Since HTTP header, if present in the request. If the file hasn't been modified since that time, it's not processed, and a 304 NOT MODIFIED HTTP response is returned to the client instead. In case the processed SGML file is determined from the initial portion of PATH_TRANSLATED, the modification date of the file denoted by the remaining part is checked as well.

HTTP/2 Push todo

Optionally, when serving static SGML via HTTP/2 through one of the two routes explained above, the SGML document prolog is pre-scanned to determine the effective entities, which are then transferred as push resources (what to push, priorities etc. are yet to be determined).

Pushing entities requires invocation of SGML processing. As a fallback, in a scenario where the host web server can be configured to serve static resources outside of the SGML Web Server Gateway, before entering it (such as by using Apache's mod_rewrite wizardry), the required resources can be requested by the HTTP client in a second HTTP request/response cycle. On the Apache web server, in particular, it might also be possible to have a static re-generation process (generating HTML from source SGML when the SGML source has changed or the HTML is stale) trigger creation of header files understood by mod_headers for HTTP/2 Push, without having to enter into SGML processing on each request.

File descriptor setup

Files resolved by the SGML Web Server Gateway itself are passed as open file descriptors to SGML processing, such that those can be accessed using <osfd> FSIs. The processing environment can access up to five file descriptors:

  • 0 (stdin): contains POSTed body content, when used and supported in the request

  • 1 (stdout): output of SGML processing; can be a file or a buffer

  • 2 (stderr): error output and log destination

  • 3 (main input): contains character data from the master file (resolved using either the complete path given as the initial value of PATH_TRANSLATED, or otherwise just the first path component of PATH_TRANSLATED)

  • 4: contains character data resolved using the remainder portion of PATH_TRANSLATED, if file descriptor #3 was resolved using only the first path component

File preopening todo

Before passing control to core SGML processing, the SGML gateway (on select execution environments) pre-opens the scriptName filename as /dev/fd/N, and PATH_TRANSLATED, if relevant, as /dev/fd/N+1.

Accessing open file descriptors, rather than opening files by path name from main SGML processing as needed, avoids race conditions and has generally desirable properties with respect to exploiting POSIX file system guarantees for atomic, continued, and highly available content delivery in the presence of concurrent content change and maintenance.

Specifically, this is done to be able to guarantee that, after SGML prolog parsing, no request processing will fail due to missing template and/or client document files, and that the content of determined files remains accessible to SGML processing even if it is subject to concurrent change or deletion during processing.

In particular

  • the template and client files (if any) are held open by the SGML Web Server Gateway process, so those files can be atomically changed while request processing on previous content is underway (due to Unix file system guarantees)

  • note that these guarantees do not hold for further external entities referenced from the template file (or from the client file, if it is atypically transcluded as a template and can have entity declarations), that is, for anything other than the main template file itself and the content of PATH_TRANSLATED

  • a proper 404 NOT FOUND HTTP status can be sent, rather than beginning the response with a 200 OK status and then detecting non-existence during content processing and having e.g. to include error message character data along with user content

  • to this aim, SGML processing takes advantage of being designed such that no output character data is written before the first content arrives in output buffer handling, i.e. prolog data is buffered until actual content begins, hence HTTP 404 NOT FOUND or another non-default status can be set at the end of SGML prolog processing
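
The effect of holding a file open while it is replaced on disk can be demonstrated with a short Node.js experiment. This is a sketch under the assumption of POSIX file system semantics (it may behave differently on other platforms); the function name readAfterReplace and the temporary file names are made up for the demonstration and have nothing to do with the gateway's own code.

```javascript
var fs = require("fs");
var os = require("os");
var path = require("path");

// Returns what a reader holding a pre-opened descriptor sees after the
// file has been replaced on disk (POSIX semantics assumed).
function readAfterReplace() {
  var dir = fs.mkdtempSync(path.join(os.tmpdir(), "sgml-demo-"));
  var file = path.join(dir, "page.sgm");
  fs.writeFileSync(file, "old content");

  var fd = fs.openSync(file, "r"); // pre-open before processing starts

  // A concurrent deployment replaces the file atomically via rename.
  fs.writeFileSync(file + ".tmp", "new content");
  fs.renameSync(file + ".tmp", file);

  var seen = fs.readFileSync(fd, "utf8"); // still reads the originally opened file
  fs.closeSync(fd);
  fs.unlinkSync(file);
  fs.rmdirSync(dir);
  return seen;
}

console.log(readAfterReplace());
```

On a POSIX system the descriptor keeps referring to the originally opened file, so request processing that started on the old content can finish on it, while new requests opening the path afterwards see the new content.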

Response status

Request processing is performed under the assumption of sending a default HTTP result of 200 OK unless set explicitly to another status, and assuming that, in general, the HTTP result status can't be changed once any output has been emitted.

Request processing is performed such that the complete SGML prolog of the document instance to process is validated before emitting any output. On any prolog parsing errors (including when system-specific parameter entities couldn't be resolved), processing is aborted, and a proper 404 NOT FOUND or 500 INTERNAL SERVER ERROR HTTP status is generated (depending on whether an operating system error of ENOENT or non-ENOENT, respectively, was encountered).

Parsing, resolution, or other errors during content parsing, on the other hand, typically can't be reported via HTTP error status codes, because the response status and headers will already have been sent to the client at the point where a content error is encountered.
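
The interplay between prolog errors and the response status can be modeled roughly as follows. This is a minimal sketch, not the actual gateway code: statusForPrologError and createResponse are hypothetical names, and the error object is assumed to carry a Node.js-style code property such as "ENOENT".

```javascript
// Maps an error raised during prolog parsing to an HTTP status:
// ENOENT becomes 404 NOT FOUND, anything else 500 INTERNAL SERVER ERROR.
function statusForPrologError(err) {
  return err && err.code === "ENOENT" ? 404 : 500;
}

// Minimal response model: the status defaults to 200 and can only be
// changed until the first content has been written.
function createResponse() {
  return {
    status: 200,
    headersSent: false,
    body: "",
    setStatus: function (code) {
      if (this.headersSent) return false; // too late, headers already went out
      this.status = code;
      return true;
    },
    write: function (chunk) {
      this.headersSent = true;
      this.body += chunk;
    }
  };
}
```

An error detected after the first write() can no longer alter the status, which corresponds to the content-parsing case described above.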

Conditional requests

As already explained for the individual routing branches, once PATH_TRANSLATED has been resolved to a file name as sketched above, and before content is actually accessed and sent, the client document is checked for being up to date; if it hasn't changed since the date and time supplied by the client in the request, a 304 NOT MODIFIED HTTP response is generated. Note that access policies etc. don't come into play here, as no content body is transferred with 304 responses.
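
A freshness check of this kind might look like the sketch below. It assumes that the client's timestamp arrives in the standard If-Modified-Since request header, and that every file involved (master and, where applicable, client document) must be unmodified for a 304 response; the function name conditionalStatus is hypothetical.

```javascript
// mtimesMs: modification times (in milliseconds) of all files involved;
// ifModifiedSince: value of the If-Modified-Since header, or undefined.
// Returns 304 if no file changed after the given time, 200 otherwise.
function conditionalStatus(mtimesMs, ifModifiedSince) {
  if (!ifModifiedSince) return 200;
  var since = Date.parse(ifModifiedSince);
  if (isNaN(since)) return 200; // ignore unparseable dates
  var unchanged = mtimesMs.every(function (mtime) {
    // HTTP dates have one-second resolution, so truncate the file time.
    return Math.floor(mtime / 1000) * 1000 <= since;
  });
  return unchanged ? 304 : 200;
}
```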

Query parameters todo

If the request URL contains query parameters (such as in http://example.com/page?parameter=value...), these are provided as system-specific entities to SGML processing.

The QUERY_STRING request parameter is parsed according to the rules for HTML form-encoding (application/x-www-form-urlencoded).
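
As an illustration of form-encoded parsing, the following sketch turns a query string into a name/value map using the standard URLSearchParams class. How the gateway actually exposes the values as entities isn't shown here; the function name queryEntities and the choice to keep only the first occurrence of a repeated parameter are assumptions.

```javascript
// Parses a QUERY_STRING value into a map of parameter names to values.
function queryEntities(queryString) {
  var entities = Object.create(null); // no inherited keys such as "toString"
  new URLSearchParams(queryString).forEach(function (value, name) {
    // Keep the first occurrence of a repeated parameter (assumption).
    if (!(name in entities)) entities[name] = value;
  });
  return entities;
}

var e = queryEntities("name=Tom&city=New+York%21");
console.log(e.name, "/", e.city);
```

Form-encoding decodes "+" as a space and resolves percent-escapes, so the city parameter in the example comes out as the text New York followed by an exclamation mark.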

LPD activation

When instructed to produce HTML from SGML, the SGML gateway will implicitly configure the processor to activate the "WEB" link process if it's declared in the target instance.

Apart from the WEB linktype, the processor also recognizes linktypes declared to have certain distinguished public identifiers in the sgmljs.net domain name namespace (commonly used with HTTP, and HTTP2 linktype names, respectively) for provision of HTTP protocol-level parameters such as values of HTTP headers, and HTTP cookies.

SGML User Agent

The SGML User Agent (sgml-ua.js) is a JavaScript (ES5) program for web browsers designed to produce HTML from SGML in the same way that HTML is produced from SGML on a SGML Web Server, thereby transparently offloading SGML processing to the browser, and at the same time saving network bandwidth by avoiding redundant network transfer of repeated partial page content.

While the SGML User Agent is designed to run against a SGML Web Server, it can also run against any other (e.g. simple static) web server lacking SGML support, in browser-only mode, with reduced user agent functionality. More generally, a SGML web setup can involve:

  • both server and browser processing: SGML is rendered transparently on either, or both, the server (for the initial page load) and in the browser, depending on whether JavaScript is enabled/allowed in the browser
  • browser-only processing: SGML files are accessed as static files from the web server and then rendered into a displayed HTML DOM on the browser
  • server-only processing: SGML pages are sent as server-rendered HTML pages to browsers; JavaScript support on the browser isn't required

The SGML User Agent, when started, determines if the web page is running from a server with server-side SGML support by inspecting page metadata in the HTML head element. If the head element does not contain

<link rel="alternate" type="text/sgml" ...>

then the SGML User Agent assumes it is running off a web server without server-side SGML rendering support.
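
A check along these lines could be written as below. This is merely a sketch of the idea, not code taken from sgml-ua.js: the function name hasServerSideSgml is hypothetical, and the document is passed in as a parameter so the function can be exercised outside a browser.

```javascript
// Returns true if the page advertises server-side SGML support through
// a <link rel="alternate" type="text/sgml"> element in its head.
function hasServerSideSgml(doc) {
  var selector = 'head link[rel="alternate"][type="text/sgml"]';
  return doc.querySelector(selector) !== null;
}
```

In a browser, the call would be hasServerSideSgml(document).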

Server-side SGML support is required for proper session history, whereas when server-side SGML support isn't advertised via the link element as shown, browser-refresh and back-navigation from an external site linked to from the SGML webpage will take the user to the initial landing page (the static or prerendered HTML page carrying the SGML User Agent script). Moreover, bookmarking works only with server-side SGML support.

The basic functionality of the SGML User Agent is to, on window.onload, attach click handlers to the current document's local (same-domain) links performing SGML page rendering (transforming SGML content to HTML/DOM).

Specifically, this is enabled on anchors that have the same effective protocol/host/port as the invoking page, and that either have no type attribute specified, or have text/sgml specified as its value.
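
The eligibility test for anchors can be sketched as follows. The function name isSgmlNavigable is hypothetical, and the sketch relies on the standard URL class (which goes beyond ES5) to compare protocol, host, and port via the origin property.

```javascript
// href, type: attribute values of the anchor (type may be null/undefined);
// pageUrl: absolute URL of the invoking page.
function isSgmlNavigable(href, type, pageUrl) {
  var target;
  try {
    target = new URL(href, pageUrl); // resolves relative links
  } catch (e) {
    return false; // unparseable link target
  }
  // origin combines protocol, host, and (non-default) port.
  var sameOrigin = target.origin === new URL(pageUrl).origin;
  var typeOk = !type || type.trim().toLowerCase() === "text/sgml";
  return sameOrigin && typeOk;
}
```

For instance, with https://example.com/index.html as the invoking page, a link to /docs/intro without a type attribute qualifies, whereas a link to https://other.example/ or a same-site link carrying type="application/pdf" doesn't.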

Once a page is rendered using SGML, its anchors get captured by SGML event handling for further navigation within the domain name, in turn.

Alternate activation on direct page load todo

The regular activation method explained above still requires an initial static/prerendered HTML page. This is undesired for simple site setups entirely run off static hosting.

To enable dynamic rendering of page content on a direct page load, rather than just on the outgoing links of an initial page (or a generated page reached by navigation from the initial page), the host web server must be configured to respond to requests for missing resources with a redirect to a static redirect/error page carrying sgml-ua.js, rather than with a plain 404 NOT FOUND response.

The script should then immediately (on load) process the originally requested page, which can be accessed as document.referrer at this point, provided it was reached through a server-side redirect. Specifically, as the script can grab the original URL, this works without SGML, SSI or other server-side templating.

Static site hosting services differ in their rules for redirect-on-404.

Note that redirect-based approaches might work against SEO and page ranking.

According to http://stackoverflow.com/questions/5657558/is-the-referer-set-if-you-redirect-to-a-new-web-page-using-location-href, redirecting via script doesn't preserve the original document.referrer value; so when using this method, we'd need to pass the desired page as a URL parameter. Note that "spoofing" the PATH_TRANSLATED entity by explicitly requesting a URL such as .../resource?PATH_TRANSLATED=x is not accepted by the SGML User Agent for security reasons.

Other alternate for first-page rendering todo

Another possibility might be to supply one or more SGML files as master template(s) that can also be interpreted as HTML (using a DOCTYPE referencing an external DTD), containing a script to re-process the current page content as SGML.

The page source shouldn't contain entity references but it should be possible to place e.g. conref entities (though we need to use end-element tags as the browser doesn't know about end-element omission rules and EMPTYNRM).

This approach might work well in combination with a server setting where a static file is resolved even with excess PATH_INFO after the targeted file (PATH_INFO is then used by SGML processing to obtain the client document); however, this probably won't work well when the full path up to the last step is resolvable as directory.

History maintenance

History pushState()/popstate handling works in a natural way for basic forward and backward navigation: if we're about to navigate to another page (on the same domain, so rendered via SGML), we're just storing the previous page via pushState(), with the URL used to fetch SGML (or HTML on the initial page). When we return to this state via backward navigation, the popstate event handler will pop the state and start re-rendering the HTML from the pushed href URL in the same way the SGML page was rendered when first visited.

On a page (browser) refresh, the browser reloads window.location using the regular browser page loading algorithm. When the server can render SGML to HTML (as triggered by an HTTP Accept header favouring HTML over SGML) there's nothing special to do here, since the re-visited page gets rendered server-side (and carries the sgml-ua.js script to attach to link handlers for further browser-local SGML processing, just like on an initial page load).

Static serving support

On the other hand, when working against a static web server, window.location will fetch SGML text, and browsers will render the SGML code text as either plain text (Chrome) or possibly broken HTML (FF, IE). Therefore, support for static servers involves further browser history manipulation.

Blocking browser refresh isn't possible in general. There exist various attempts/scripts to accomplish this by either

  • intercepting key events (but these techniques won't handle clicking on a browser refresh icon button); what can be achieved here (by either returning a non-void value from beforeunload event handling, by setting the event's returnValue, or by calling preventDefault), is to bring up a "do you really want to leave?" warning, but the reload action as such can't be prevented.

  • establishing a new browser context in an iframe or a HTML4 frame (see e.g. Disabling the Back Button), but these techniques are generally considered user-hostile

  • not creating history entries in the first place; i.e. according to Back Button Behavior on a Page With an iframe (although being about iframes mostly), replaceState() can be used to suppress creating history entries; namely, if, on a click event, the navigated-to href is replaceState()d into the same as the top-most one, then no history entry is created; this could be used to block any backward (and forward) navigation, but (if it actually works) is overreaching, since it will disable plain backward navigation between SGML rendered pages, which isn't a problem even against servers without server-side SGML rendering.

So what the SGML User Agent does is to ensure that, while a page is up, window.location points to the landing page of the current page, i.e. the page through which the site was entered (which in many cases will be the site's home page, but could be any page carrying the sgml-ua.js script). As the final part of SGML rendering (after rendered link target URLs in the generated page have been changed into absolute/resolved form), window.location is set to the landing page URL. When the page is left, the history entry for the page view is then restored to the original SGML resource URL, rather than the landing page URL, so that on plain back navigation, regular popstate handler execution will render the SGML resource.

The original resource URL is stored in the history entry's data field (for as long as it is shadowed by the landing page URL).

Executing history restoration globally on the unload/beforeunload (or even pagehide/pageshow) events isn't possible, since Ajax page loads don't trigger those events. Therefore, history restoration is executed on individual outgoing link click events, along with SGML processing for the new page.

Note that no history restoration takes place (the landing page history entry is kept) on outgoing external links, since those don't get a click handler for SGML processing attached and hence exhibit standard browser behavior on link activation. For external links we're going to lose the JavaScript execution context and can't register handlers; when navigating back from an external page we must therefore enter a HTML (not SGML) page carrying the sgml-ua.js script.

  • SGML User Agent starts out on an initial landing page carrying this script
  • following a link on the page in the same domain will result in rendering the link target using SGML
  • backward-navigation (to either an earlier rendered SGML page or the landing page) is also performed using SGML
  • navigation/following links to external sites will end SGML UA execution and continue with standard browser HTML loading/rendering; on return to a SGML-rendered site, if the site is running on static web server without server-side SGML rendering, the landing page, not the page through which the site was left, is reloaded; only if using server-side SGML can the proper page of departure be loaded
  • page refreshes take the user to the landing page when not served from a web server with support for server-side SGML; only if using server-side SGML will the current page be reloaded (it's not possible to intercept browser behavior on refresh)
  • when running off a static web server without server-side SGML rendering, context menus (as activated by right-click or long-click/hold) are blocked to prevent Open Link in New Tab from being offered and to prevent bookmarking (both of which won't work against a static server)

Third party scripts todo

Scroll position restoration todo

Page Transitions todo