As a straightforward application of SGML
basic entity substitution
and templating on the web,
page.sgm, shows a very simple
example of an SGML document that, when accessed
as web page at
name entity reference substituted into
Tom, and returned as response to a browser
(or can be rendered entirely within a web browser
running SGML User Agent):
<!doctype html [
  <!element html - - any>
  <!element p - - (#pcdata)>
  <!entity param system>
]>
<html>
  <head>
    <title>SGML page</title>
  </head>
  <body>
    <p conref=e>
  </body>
</html>
Provisioning SGML on the Web is based on interpreting HTTP request
URLs as file names according to terms and concepts in use for the
longest time on the Internet (
For the purpose of resolving a URL to a file or other resource, a
web host passes a request URL and other request parameters as value
PATH_TRANSLATED and other system-specific entities
to SGML processing. The SGML processor then either prepares HTML from SGML,
or just serves static files, depending on what media type the user agent
has indicated to accept in the request, and on what files are found
to exist on the server's file system at the resolved location.
Interpretation and modification of
PATH_TRANSLATED is analogous to
what a classic CGI script receives via the
the classic scenario assumes there's a common document root
(or web root) directory wherein a CGI program is looked up in
a designated script directory. The program, if found,
is then executed with the trailing part of the request URL
(everything following the portion used to locate the program/script
PATH_TRANSLATED is derived from
PATH_INFO by resolution against the document root,
resulting in an absolute path name.
PATH_TRANSLATED can alternatively be computed without
knowledge of a document root directory (provided
SCRIPT_NAME is absolute against the web server's file
system root, which it typically is not) by starting
with the directory where
SCRIPT_NAME resides and going
back as many parent directories as there are path components in
SCRIPT_NAME, then appending
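The two ways of deriving PATH_TRANSLATED described above can be sketched as follows. This is only an illustration in JavaScript: the function names and example paths are hypothetical, and the second function assumes that "path components" means the directory steps of SCRIPT_NAME (excluding the script's own file name), which is one possible reading of the description.

```javascript
// Hypothetical helpers; plain string handling, POSIX-style paths only.
function joinPath(base, rel) {
  return base.replace(/\/+$/, "") + "/" + rel.replace(/^\/+/, "");
}

function parentDir(p) {
  const i = p.lastIndexOf("/");
  return i <= 0 ? "/" : p.slice(0, i);
}

// Variant 1: resolve PATH_INFO against a known document root.
function translateViaDocRoot(docRoot, pathInfo) {
  return joinPath(docRoot, pathInfo);
}

// Variant 2: no document root known. Start at the directory holding the
// script and walk up once per directory step in SCRIPT_NAME.
// scriptFile is the script's absolute file system path (assumed available).
function translateViaScriptName(scriptFile, scriptName, pathInfo) {
  const steps = scriptName.split("/").filter(Boolean);
  let dir = parentDir(scriptFile);
  for (let i = 0; i < steps.length - 1; i++) {
    dir = parentDir(dir);
  }
  return joinPath(dir, pathInfo);
}

console.log(translateViaDocRoot("/var/www", "/docs/page"));
console.log(
  translateViaScriptName("/var/www/cgi-bin/gateway", "/cgi-bin/gateway", "/docs/page")
);
```

For this example layout, both calls print /var/www/docs/page.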
If a static file is requested, as determined by requesting
a resource name (in
PATH_TRANSLATED) having a dot in its last
it is served "as-is" (as a static file), with a media type
Content-Type) derived from the file extension
note the HTTP
Accept header isn't checked in this case
commonly requested static file types include prerendered
.html files as well as
.js, and image files
otherwise, if a static resource by the name of
name doesn't exist, a
404 NOT FOUND HTTP response is generated
PATH_TRANSLATED doesn't have a dot), if
can be resolved as an SGML file by appending the
text/sgml is accepted by the request
the SGML gateway determines
scriptName as master SGML file/template
and sends the static master file
for HTTP/2, optionally, the file resolved as
and other statically included entities determined
by performing SGML document prolog parsing (LPD pre-scanning),
are sent along with the master file via HTTP/2 Push
otherwise, and by default (and if either no
Accept header is
present in the request or its value is
text/html or a wildcard)
PATH_TRANSLATED is processed as an SGML file, with the optional
WEB link process activated, and the output is served as
PATH_TRANSLATED cannot be resolved to an SGML file)
if the first path component of
PATH_TRANSLATED (or the longest
sequence of consecutive path steps contained twice in
PATH_TRANSLATED) can be resolved as an SGML file by appending
sgm file extension,
text/sgml is accepted for the response, the resolved file
is served statically from the resolved file as
otherwise, the resolved file is processed as SGML,
with the optional
WEB link process activated, and the
output is served as
the remaining part of
PATH_TRANSLATED (not including the
initial part up to and including the resolved SGML file)
is resolved against the web root directory to an absolute path,
and supplied as the
PATH_TRANSLATED system-specific entity to
Otherwise (when the request's
PATH_TRANSLATED value couldn't be
interpreted in any of the ways explained) a
404 NOT FOUND HTTP
response is generated.
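The routing rules above can be condensed into a small decision function. The following is a rough sketch under several assumptions rather than the gateway's actual code: paths are taken relative to the web root, exists is a hypothetical predicate supplied by the caller, the "longest repeated sequence of path steps" variant is left out, and Accept handling ignores quality values.

```javascript
// True if the Accept header explicitly lists text/sgml (simplified parsing).
function acceptsSgml(acceptHeader) {
  if (!acceptHeader) return false;
  return acceptHeader
    .split(",")
    .map((part) => part.split(";")[0].trim().toLowerCase())
    .includes("text/sgml");
}

// Decide how to answer a request path. Returns a plain description object.
function route(requestPath, acceptHeader, exists) {
  const steps = requestPath.split("/").filter(Boolean);
  const last = steps.length > 0 ? steps[steps.length - 1] : "";

  // A dot in the last path step marks a static file request.
  if (last.includes(".")) {
    return exists(requestPath)
      ? { action: "static", file: requestPath }
      : { action: "not-found", status: 404 };
  }

  // Serve SGML as-is if the client accepts it, otherwise render HTML.
  const action = acceptsSgml(acceptHeader) ? "static-sgml" : "render-html";

  // Whole path resolvable as an SGML file.
  const full = "/" + steps.join("/") + ".sgm";
  if (steps.length > 0 && exists(full)) {
    return { action, file: full, rest: null };
  }

  // First path step names the master file; the rest names the client document.
  if (steps.length > 1) {
    const master = "/" + steps[0] + ".sgm";
    if (exists(master)) {
      return { action, file: master, rest: "/" + steps.slice(1).join("/") };
    }
  }

  return { action: "not-found", status: 404 };
}
```

The rest value corresponds to the remaining part of the path that is passed on to SGML processing as PATH_TRANSLATED after resolution against the web root.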
Whenever an SGML file is selected for processing, the file's
modification date is checked against the value of the
HTTP header, if present in the request. If the file is older than
last-modified value, it's not processed, and a
304 NOT MODIFIED
HTTP response is returned to the client instead. In case the processed
SGML is determined from the initial portion of
modification date of the file name denoted by the remaining part,
is checked as well.
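A minimal sketch of this freshness check is shown below. The name of the request header is not given above; the sketch assumes the standard If-Modified-Since header, and the function names are hypothetical. Modification times are truncated to whole seconds because HTTP dates carry no sub-second precision.

```javascript
// True if a file with the given modification time (milliseconds since epoch)
// has not changed after the date sent by the client.
function isNotModified(mtimeMs, ifModifiedSince) {
  if (!ifModifiedSince) return false;
  const since = Date.parse(ifModifiedSince);
  if (Number.isNaN(since)) return false;
  return Math.floor(mtimeMs / 1000) * 1000 <= since;
}

// All files involved (master file and, where applicable, the client
// document) must be unchanged for a 304 response.
function conditionalStatus(mtimesMs, ifModifiedSince) {
  const allFresh =
    mtimesMs.length > 0 &&
    mtimesMs.every((mtime) => isNotModified(mtime, ifModifiedSince));
  return allFresh ? 304 : 200;
}
```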
Optionally, when static SGML is served via HTTP/2 on either of the two routes described above, the SGML document prolog is pre-scanned to determine the effective entities, which are then transferred as push resources. Exactly what to push, and with which priorities, is yet to be determined.
Pushing entities requires invocation of SGML processing.
As a fallback, in a scenario where a host web server can be
configured to provide static resources (such as by using
mod_rewrite wizardry) outside of the SGML Web
Server Gateway, and performed before entering into it,
the required resources can be requested by the HTTP client
in a second HTTP request/response cycle. On the Apache web server,
in particular, it also might be possible to have a static
re-generation process (generating HTML from source
SGML when the SGML source has changed/is stale) trigger creation
of header files understood by
mod_header for HTTP/2 Push,
without having to enter into SGML processing on each request.
Files resolved by the SGML Web Server Gateway itself are passed as
open file descriptors to SGML processing, such
that those can be accessed using
<osfd> FSIs. The processing
environment can access up to five file descriptors:
0 (stdin): POSTed body content, when used and supported in the request
1 (stdout): output of SGML processing; can be a file or a buffer
2 (stderr): error output and log destination
3 main input; contains character data from master file
(resolved using either the complete path as of the initial value of
PATH_TRANSLATED, or just the first path component of
4 file descriptor containing character data resolved using the
remainder portion of
PATH_TRANSLATED if file descriptor #3 was
resolved using only the first path component
Before passing control to core SGML processing, the SGML gateway
(on select execution environments) pre-opens the
PATH_TRANSLATED, if relevant, as
Accessing open file descriptors, rather than opening files by path name from main SGML processing as needed, avoids race conditions and exploits POSIX file system guarantees for atomic, continuous, and highly available content delivery in the presence of concurrent content changes and maintenance.
Specifically, this is done to be able to guarantee that, after SGML prolog parsing, no request processing will fail due to missing template and/or client document files, and that the content of determined files remains accessible to SGML processing even if it is subject to concurrent change or deletion during processing.
note that these guarantees do not hold for further external entities
other than the content of
PATH_TRANSLATED from the main template file itself
that might be referenced from the template file (or the client file if it
is atypically transcluded as template and can have entity declarations)
a 404 NOT FOUND HTTP status can be sent, rather than beginning
the response with a
200 OK status and then detecting non-existence
during content processing and having e.g. to include error message
character data along with user content
to this aim, SGML processing takes advantage of being designed such
that no output character data is written before the first content arrives
in output buffer handling, i.e. prolog data is buffered until actual
content begins, hence HTTP
404 NOT FOUND or other non-default status
can be set at the end of SGML prolog processing
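The effect of pre-opening files can be illustrated with a short sketch. It assumes a Node.js runtime on a POSIX system purely for illustration, and the function names are hypothetical. Once a descriptor is open, its content stays readable even if the file is deleted, while a missing file is detected before any output has been produced.

```javascript
const fs = require("fs");

// Open every file a request needs before processing starts. If any file is
// missing, nothing has been written yet and the caller can still answer 404.
function preOpen(paths) {
  const fds = [];
  try {
    for (const p of paths) {
      fds.push(fs.openSync(p, "r"));
    }
    return fds;
  } catch (err) {
    // Release what was opened so far, then report the failure.
    for (const fd of fds) fs.closeSync(fd);
    throw err; // err.code is "ENOENT" for a missing file
  }
}

// Read the whole content behind a descriptor, then close it.
function readAndClose(fd) {
  try {
    return fs.readFileSync(fd, "utf8");
  } finally {
    fs.closeSync(fd);
  }
}
```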
Request processing is performed under the assumption of sending
a default HTTP result of
200 OK unless set explicitly to
another status, and assuming that, in general, the HTTP result
status can't be changed once any output has been emitted.
Request processing is performed such that the complete SGML
prolog of the document instance to process is validated before
emitting any output. On any prolog parsing errors (including when
system-specific parameter entities couldn't be resolved),
processing is aborted, and a proper
404 NOT FOUND or
500 INTERNAL SERVER ERROR HTTP status is generated, depending on whether
the operating system error encountered was
ENOENT or not.
Parsing, resolution, or other errors during content parsing, on the other hand, typically can't be reported via HTTP error status codes, because the response status and headers will already have been sent to the client by the time a content error is encountered.
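The status handling described in this section might be modelled as follows. This is a simplified sketch with hypothetical names (DeferredResponse, handleRequest); the callbacks parseProlog and renderContent stand in for the actual SGML processing stages.

```javascript
// Map a prolog-stage failure to an HTTP status: a missing file (ENOENT)
// becomes 404, anything else 500.
function statusForPrologError(err) {
  return err && err.code === "ENOENT" ? 404 : 500;
}

// Response whose status defaults to 200 and can only change until the
// first piece of content has been written.
class DeferredResponse {
  constructor() {
    this.status = 200;
    this.started = false;
    this.body = [];
  }
  setStatus(code) {
    if (this.started) {
      throw new Error("output already emitted; status is fixed");
    }
    this.status = code;
  }
  write(text) {
    this.started = true;
    this.body.push(text);
  }
}

// Validate the complete prolog first; only then start emitting content.
function handleRequest(parseProlog, renderContent) {
  const res = new DeferredResponse();
  try {
    parseProlog();
  } catch (err) {
    res.setStatus(statusForPrologError(err));
    return res;
  }
  renderContent(res);
  return res;
}
```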
As already explained for the individual routing branches,
based on the above sketched file name resolution for
before actually accessing and sending content, up-to-datedness of the
client document is checked; if it hasn't changed since the date and time
of the last modification, a
304 NOT MODIFIED HTTP response is generated.
Note that access policies etc. don't come into play here, as the content body
isn't transferred with 304 responses.
If the request URL contains query parameters (such as in
http://example.com/page?parameter=value...), these are
provided as system-specific entities to SGML processing.
QUERY_STRING request parameter is parsed according
to the rules for HTML form-encoding.
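As an illustration of form-encoded parsing, the sketch below turns a query string into a name/value map using the standard URLSearchParams class. The function name is hypothetical, and keeping only the first value of a repeated parameter is an assumption; how the gateway treats repeated parameters is not stated here.

```javascript
// Parse a QUERY_STRING value into entity name/value pairs.
function queryEntities(queryString) {
  const entities = new Map();
  for (const [name, value] of new URLSearchParams(queryString)) {
    // Assumption: the first occurrence of a repeated parameter wins.
    if (!entities.has(name)) {
      entities.set(name, value);
    }
  }
  return entities;
}

const entities = queryEntities("parameter=value&greeting=hello+world%21");
console.log(entities.get("parameter")); // value
console.log(entities.get("greeting")); // hello world!
```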
When instructed to produce HTML from SGML, the SGML gateway will implicitly configure the processor to activate the "WEB" link process if it's declared in the target instance.
Apart from the
WEB linktype, the processor also recognizes linktypes
declared to have certain distinguished public identifiers in the
sgmljs.net domain name namespace (commonly used with
HTTP2 linktype names, respectively) for provision of HTTP
protocol-level parameters such as values of HTTP headers,
and HTTP cookies.
The SGML User Agent (
program for web browsers designed to produce HTML from SGML
in the same way that HTML is produced from SGML on an
SGML Web Server, thereby transparently
offloading SGML processing to the browser, and at the same
time saving network bandwidth by avoiding redundant network
transfer of repeated partial page content.
While the SGML User Agent is designed to run against an SGML Web Server, it can also run against any other (e.g. simple static) web server lacking SGML support, in browser-only mode, with reduced user agent functionality. More generally, an SGML web setup can involve:
The SGML User Agent, when started, determines if the web page
is running from a server with server-side SGML support by inspecting
page metadata in the HTML
head element. If the
does not contain
<link rel="alternate" type="text/sgml" ...>
then the SGML User Agent assumes it is running off a web server without server-side SGML rendering support.
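In a browser, a check along these lines could look like the sketch below. The function name is hypothetical, and the document is passed in as a parameter so the function can also be exercised outside a browser.

```javascript
// True if the page advertises an SGML alternate via a link element in head.
function hasServerSideSgml(doc) {
  const link = doc.head.querySelector(
    'link[rel="alternate"][type="text/sgml"]'
  );
  return link !== null;
}

// In a browser this would be called as: hasServerSideSgml(document)
```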
Server-side SGML support is required for proper session history. When server-side SGML support isn't advertised via the link element as shown, browser refresh and back navigation from an external site linked to from the SGML web page will take the user to the initial landing page (the static or prerendered HTML page carrying the SGML User Agent script). Moreover, bookmarking works only with server-side SGML support.
The basic functionality of the SGML User Agent is to, on
click handlers to the current
document's local (same-domain) links performing SGML page
rendering (transforming SGML content to HTML/DOM).
Specifically, this is enabled on anchors that have the same
effective protocol/host/port as the invoking page, and that
either have no
type attribute specified, or have
specified as its value.
Once a page is rendered using SGML, its anchors get captured by SGML event handling for further navigation within the domain name, in turn.
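The link selection rule can be sketched as a small predicate plus a function attaching handlers. All names are hypothetical, renderSgml stands in for the actual rendering routine, and the accepted type value is assumed to be text/sgml since the value is not spelled out above.

```javascript
// True if a link should be handled by SGML rendering: same protocol, host
// and port as the current page, and no type or (assumed) type text/sgml.
function qualifiesForSgml(href, type, pageUrl) {
  let target;
  try {
    target = new URL(href, pageUrl);
  } catch {
    return false; // not a resolvable URL
  }
  const page = new URL(pageUrl);
  const sameSite =
    target.protocol === page.protocol && target.host === page.host;
  return sameSite && (!type || type === "text/sgml");
}

// Attach click handlers to all qualifying anchors of a document.
function attachHandlers(doc, pageUrl, renderSgml) {
  for (const a of doc.querySelectorAll("a[href]")) {
    const href = a.getAttribute("href");
    if (qualifiesForSgml(href, a.getAttribute("type"), pageUrl)) {
      a.addEventListener("click", (event) => {
        event.preventDefault(); // suppress the regular page load
        renderSgml(new URL(href, pageUrl).href);
      });
    }
  }
}
```

Because rendering replaces the page's anchors, attachHandlers would be called again after each rendering pass so that the new links are captured in turn.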
The regular activation method explained above still requires an initial static/prerendered HTML page. This is undesirable for simple site setups run entirely off static hosting.
To enable dynamic rendering of page content on a direct page
load, rather than just on the outgoing links of an initial page
(or generated page reached by navigation from the initial page),
the host web server must be configured to redirect, rather than
404 NOT FOUND, to a static redirect/error page carrying
The script should then immediately (on load) process the originally
requested page, which can be accessed as
document.referrer at this point,
provided it was reached through a server-side redirect.
Specifically, as the script can grab the original URL, this works
without SGML, SSI or other server-side templating.
Static site services use different rules for redirect-on-404:
some require explicit redirect rules for each resource (such as
which is undesired
Apache-based static serving (honoring
htaccess files) might be
able to use
as a fallback, auto-redirect-on-404 can be set up for many static
hosting services via a custom 404 page (e.g.
Note that redirect-based approaches might work against SEO and page ranking.
redirecting using script doesn't preserve the original
value; so when using this method, we'd need to pass the desired
page as a URL parameter. Note that "spoofing" the
entity by explicitly requesting a URL such as
is not accepted by the SGML User Agent for security reasons.
Another possibility might be to supply one or more SGML files
as master template(s) that can also be interpreted as HTML
DOCTYPE referencing an external DTD), containing a
script to re-process the current page content as SGML.
The page source shouldn't contain entity references but it
should be possible to place e.g. conref entities (though we
need to use end-element tags as the browser doesn't know about
end-element omission rules and
This approach might work well in combination with a server
setting where a static file is resolved even with excess
after the targeted file (
PATH_INFO is then used by SGML
processing to obtain the client document); however, this probably
won't work well when the full path up to the last step is resolvable
popState() works in a natural way for basic
forward and backward navigation): if we're about
to navigate to another page (on the same domain so rendered
via SGML), we're just storing the previous page via pushState(),
with the URL used to fetch SGML (or HTML on the initial page).
When we return to this state via backward navigation, the
popstate event handler will pop the state and start re-rendering the
HTML from the pushed
href URL in the same way the SGML page was rendered
when first visited.
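The interplay of pushState() and the popstate event can be sketched as follows. This shows the conventional History API pattern with hypothetical names (createNavigator, render); it may differ in detail from what the SGML User Agent does. The history object is passed in so the logic does not depend on a browser.

```javascript
// Build navigation helpers around a History-like object and a render
// callback that fetches and renders the resource at a given URL.
function createNavigator(history, render) {
  return {
    // Called from a link click handler for same-domain links.
    navigate(href) {
      history.pushState({ href }, "", href);
      render(href);
    },
    // Called on popstate: re-render from the URL stored with the entry.
    onPopState(event) {
      if (event.state && event.state.href) {
        render(event.state.href);
      }
    }
  };
}

// In a browser:
//   const nav = createNavigator(window.history, renderSgml);
//   window.addEventListener("popstate", (event) => nav.onPopState(event));
```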
On a page (browser) refresh, the browser reloads
window.location using the regular browser page loading algorithm.
When the server can render SGML to HTML (as triggered by an
HTTP Accept header favouring HTML over SGML) there's nothing special
to do here, since the re-visited page gets rendered server-side
(and carries the
sgml-ua.js script to attach to link handlers for
further browser-local SGML processing, just like on an initial
On the other hand, when working against a static web server,
window.location will fetch SGML text, and browsers will render
the SGML code text as either plain text (Chrome) or possibly
broken HTML (Firefox, Internet Explorer). Therefore, support for static servers
involves further browser history manipulation.
Blocking browser refresh isn't possible in general. There exist various attempts/scripts to accomplish this by either
intercepting key events (but these techniques won't
handle clicking on a browser refresh icon button); what
can be achieved here (by either returning a non-void value
beforeunload event handling, by setting the
returnValue, or by calling
is to bring up a "do you really want to leave?"
warning, but the reload action as such can't be prevented.
establishing a new browser context in an
or a HTML4
frame (see e.g.
Disabling the Back Button),
but these techniques are generally considered user-hostile
not creating history entries in the first place; i.e.
Back Button Behavior on a Page With an iframe
(although being about iframes mostly)
can be used to suppress creating history entries; namely,
if, on a
click event, the navigated-to
replaceState()d into the same as the top-most one
then no history entry is created; this could be used
to block any backward (and forward) navigation, but
(if it actually works) is overreaching since it will
disable plain backward navigation between SGML rendered
pages, which isn't a problem even against servers
without server-side SGML rendering.
So what SGML User Agent does is to ensure that, while a page is up,
window.location points to the landing page of the current
page, i.e. the page through which the site was entered,
which in many cases will be the site's home page, but could
be any page carrying the
As final part of SGML rendering (after rendered link
target URLs in the generated page have been changed into
window.location is set to the
landing page URL. When the page is left, the history entry
for the page view is then restored to the original SGML
resource URL, rather than the landing page URL, so that
on plain back navigation, regular
handler execution will render the SGML resource.
The original resource URL is stored in the history entry's data field (for as long as it is shadowed by the landing page URL).
Executing history restoration globally on the
unload/beforeunload (or even
pagehide/pageshow) events isn't
possible, since Ajax page loads don't trigger those
events. Therefore, history restoration is executed
on individual outgoing link click events, along with
SGML processing for the new page.
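One way to picture the shadowing and restoration steps is the sketch below. It assumes the URL is changed with history.replaceState() rather than by assigning window.location, which would trigger a page load; the function and property names (shadowWithLandingPage, restoreBeforeLeaving, sgmlHref) are hypothetical.

```javascript
// After rendering: show the landing page URL for the current entry and
// remember the real SGML resource URL in the entry's state data.
function shadowWithLandingPage(history, landingUrl, sgmlHref) {
  history.replaceState({ sgmlHref }, "", landingUrl);
}

// In an outgoing link click handler, before navigating on: put the real
// resource URL back so that plain back navigation re-renders it.
function restoreBeforeLeaving(history) {
  const state = history.state;
  if (state && state.sgmlHref) {
    history.replaceState(state, "", state.sgmlHref);
  }
}
```

While the page is displayed, a browser refresh therefore reloads the landing page, whereas after the page has been left the entry again points at the SGML resource.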
Note that no history restoration takes place (the
landing page history entry is kept) on outgoing external
links since those don't get a click handler for SGML
processing attached and hence exhibit standard browser
behavior on link activation. For external links we
can't register handlers; when navigating back
from an external page we must therefore enter
a HTML (not SGML) page carrying the
when running off a static web server without server-side SGML rendering, context menus (as activated by right-click or long-click/hold) are blocked to prevent "Open link in new tab" and bookmarking from being offered, neither of which works against a static server