SGML

Syntax Reference

Introduction

SGML (like HTML, which is based on SGML) is a text format built on the idea of organizing information by tagging, or marking up, text. SGML is a meta-language for describing markup vocabularies such as HTML, along with their parsing rules.

Consider the following basic HTML document:

<html>
<head>
	<title>Page Title</title>
</head>
<body>
	<h1>Section Title</h1>
	<p>Body Text with <a href="otherdoc.html">link to another document</a>.</p>
	<footer>Page Footer</footer>
</body>
</html>

The element grammar for this document can be described as an SGML Document Type Definition (DTD) as follows:

<!ELEMENT html - - (head?,body)>
<!ELEMENT head - - (title?)>
<!ELEMENT title - - (#PCDATA)>
<!ELEMENT body - - (h1,p+)>
<!ELEMENT h1 - - (#PCDATA)>
<!ELEMENT p - - (#PCDATA|a)*>
<!ELEMENT a - - (#PCDATA)>

In this grammar, the regular expression head?,body means that the content of the html element is expected to consist of an (optional) head element, followed by a body element. Both the head and the body element have grammar rules for their content, in turn. #PCDATA means that text is expected at the respective position.
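As a rough illustration of reading a content model as a regular expression, the following Python sketch hand-translates head?,body into a regular expression over child element names. The function name matches_model and the space-terminated token encoding are assumptions made for this example only; they say nothing about how sgmljs.net works internally.

```python
import re

def matches_model(model_regex, children):
    """Check a sequence of child element names against a token-level regex."""
    # Each child name becomes one token terminated by a space, so that
    # "head" can never match as a prefix of a longer name such as "header".
    text = "".join(name + " " for name in children)
    return re.fullmatch(model_regex, text) is not None

# Hand-translated from the content model (head?,body):
# "head?" becomes an optional group, "," becomes plain concatenation.
HTML_MODEL = r"(head )?(body )"

print(matches_model(HTML_MODEL, ["head", "body"]))  # True
print(matches_model(HTML_MODEL, ["body"]))          # True
print(matches_model(HTML_MODEL, ["body", "head"]))  # False
```

The sketch only checks the order of direct children; it doesn't model #PCDATA, tag omission, or exceptions.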

Given such a markup grammar and other declarations, SGML can

  • check the markup of a given document or a larger collection of documents, and enforce presence or absence of tags or attributes

  • infer tags and attributes not present in a document but desired for content delivery (as used for automatically adding boilerplate and structuring content in web applications, and to simplify content creation)

  • attach processing to elements or more complex contexts for generating dynamic web content or other template processing applications for content production

SGML can be used for

  • content authoring and workflow organization using straightforward concepts such as files and folders, as well as more sophisticated declarative techniques for web applications

  • content delivery over the web, with rich facilities for fetching and preparing content from databases or web services, and for integration into mainstream web application stacks

  • sanitizing potentially malicious user content in dynamic web applications or content production processes (injection prevention).

  • searching, transforming, analyzing and otherwise processing web content and other markup documents.

The following sections describe the markup declarations that can be used in a DTD, and their effect on the respective markup constructs in document content.

Elements

The general form of an element declaration is

<!ELEMENT element-name [rank] [tag-omission-rules] content [exceptions]>

or

<!ELEMENT name-group [rank] [tag-omission-rules] content [exceptions]>

where

element-name

is a single element name to declare

name-group

is a list of element names to declare

a name group has the form (element1|element2|...|elementN).

rank (optional)

is a non-negative decimal number which is treated as a rank suffix that the declared element must have when used in content

the element name is treated as a rank stem, rather than a complete element name, if a rank suffix is specified

an element declared with rank having a rank suffix specified in content (ie. ending in a number in a start-element tag), sets the implied rank suffix for any element tag in subsequent content

for an element declared with rank having its rank suffix omitted in content, the effective rank suffix is that of the most recent element declared in the same declaration that has a rank suffix specified; the most recent element doesn't necessarily have to be a parent element, but can be any preceding element

it's an error if the first occurrence of an element declared with rank in a document instance has its rank suffix omitted

with respect to rank minimization, sgmljs.net treats all elements declared with the same rank suffix in a DTD as if those were declared in the same declaration; ie. a rank suffix is not only inferred from prior elements declared in the same declaration, but from any prior element having a rank declared and specified in content

an element declared with rank is referenced by its rank stem and rank suffix as concatenated name from other declarations; e.g. an element declared with rank stem abc and rank suffix 3 is referenced as abc3 in content model expressions of other element declarations where the element may occur as content model token

note that using element ranks does not in itself enable uses such as e.g. automatically assigning/incrementing header levels based on tag nesting levels; instead, rank omission always infers from the most recently specified rank: see rank-examples

tag-omission-rules (optional)

- - means both start- and end-tag must be specified

- O means the end-tag can be left out

O - means the start-element tag can be left out

O O means both start- and end-element tag may be left out

in the above syntax rules O refers to the letter O, and - to the minus character

there must be whitespace between the specifier for start- and end-tag omission rules

the tag-omission-rules specification may be left out altogether in which case it defaults to - -

see Tag Inference for applying tag omission rules

content

either a Content Model, with surrounding parentheses

or ANY, allowing any content

or EMPTY, which forbids the element from having content

or CDATA, which will make element content parse as character data

or RDATA, which will make element content parse as character data, with general entity references being expanded into the respective entity replacement text

see below for detailed explanation

exceptions

an expression of the form -(exclusions) +(inclusions) where either the exclusions- or the inclusions-part, or both, can be omitted

if both the exclusion- and the inclusion-part are specified, then the inclusion-part must follow the exclusion-part

inclusions is a single element or a name group (a list of elements) allowed to occur anywhere and arbitrarily often in descendant content in addition to elements specified in the content model

exclusions is a single element or a name group of elements not allowed to occur in descendant content, even though allowed by the content model or included by an element declaration for a parent element

if an element is excluded, it can't be included by an element declaration for a descendant element (the inclusion is ignored)

an element that is required in a content model can't be excluded

it's an error for an element to be both excluded and included in the same declaration

if an element occurs at a position where it matches a model group token, and is also in the set of included elements, then it is accepted as content model token (inclusion of the element is ignored)

exceptions can only be specified for elements having a content model or for elements with declared content ANY

See also exception examples.
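The rank suffix inference described above for elements declared with rank can be sketched in Python as follows. This is a simplified model under stated assumptions: tags are plain strings made of letters followed by an optional number, the set of rank stems is passed in by hand, and, following the sgmljs.net behaviour described above, an omitted suffix is taken from the most recently specified suffix of any ranked element. The function name resolve_ranks is hypothetical.

```python
import re

def resolve_ranks(tags, rank_stems):
    """Fill in omitted rank suffixes from the most recently specified one."""
    current = None  # most recently specified rank suffix, if any
    resolved = []
    for tag in tags:
        match = re.fullmatch(r"([A-Za-z]+)(\d*)", tag)
        if match is None or match.group(1) not in rank_stems:
            resolved.append(tag)  # not an element declared with rank
            continue
        stem, suffix = match.group(1), match.group(2)
        if suffix:
            current = suffix  # an explicit suffix sets the implied rank
        elif current is None:
            # first occurrence of a ranked element must specify its suffix
            raise ValueError("rank suffix omitted on first ranked element: " + tag)
        resolved.append(stem + current)
    return resolved

print(resolve_ranks(["h1", "p", "h2", "p", "p"], {"h", "p"}))
# ['h1', 'p1', 'h2', 'p2', 'p2']
```

Note how the result depends only on the most recently specified suffix, not on nesting depth, which matches the remark above that ranks don't by themselves derive header levels from tag nesting.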

Declared content

An element declared ANY, EMPTY, CDATA, or RDATA is said to have declared content.

ANY content

When an element is declared to have ANY content, any content (character data or any nested elements, subject to the effective value of IMPLYDEF ELEMENT in the SGML declaration) may occur between the element start- and end-tag.

EMPTY content

When an element is declared to have EMPTY content, it must be specified

  • either just in start-element tags (ie. end-element tags can't be used for that element at all), or,

  • if EMPTYNRM YES is specified in the SGML declaration, with an optional end-element tag immediately following the start-element tag.

Note that if, in addition to FEATURES MINIMIZE EMPTYNRM YES, also FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET is specified in the SGML declaration, and / and > are declared to have the NESTC and NET delimiter roles, respectively, then any element having no content (regardless of whether the element is declared EMPTY), can be specified as an XML-style empty element, ie. can be abbreviated by <element/>, instead of having to specify <element></element> (see SGML declaration for details).

Note that sgmljs.net supports only the characters stated above for the NESTC and NET delimiter roles (or no assignment to these delimiters at all). Moreover, sgmljs.net restricts the supported combinations of the FEATURES MINIMIZE EMPTYNRM and FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET SGML declaration properties: either both have the values stated above, or both have the value NO. The first combination, introduced with WebSGML (the Annex K revision of SGML), corresponds to modern polyglot markup writing (and is used by default in sgmljs.net), while the latter corresponds to the traditional SGML authoring style.

Note that, when processing XML, empty elements are required to either have end-element tags, or to be specified as XML-style empty elements (ie. as <element/>).

Note that apart from declaring an element to have EMPTY content, an element must also have empty content when a #CONREF attribute is specified on it; see #CONREF in attribute default values.

CDATA content

Elements declared CDATA contain unparsed character data as child content.

The & (ampersand) character has no special meaning in content of elements declared CDATA: character sequences looking like named entity references aren't expanded to replacement text, and are, like character entity references, reproduced as-is to result markup.

A </ sequence (a less-than character followed by a slash) followed by a valid name start character terminates the content of elements declared CDATA, just like for regular elements declared with content models.

Note that the CDATA reserved word is also used as a declared value in attribute declarations and entity declarations, and as a keyword in marked sections.

See declared content examples.

Content models

A content model specifies the sequence of sub-elements and/or character data content that an element's child content is expected to have.

It is specified by a content model expression. For example, the content model expression a, b?, c* describes a sequence consisting of a single a element, followed by an optional b element, followed optionally by a sequence of any number of c elements.

A content model expression is an expression constructed from content model tokens and compositors, with optional grouping and nesting of subexpressions in parentheses.

Content model tokens

Content model tokens are either

  • element names declared in the same or another element declaration within the same declaration set, or

  • the #PCDATA token representing parsed character data being allowed at the position in the content model expression where it is specified

Compositors

A compositor is one of the following characters, listed along with the compositor's application to operand elements and/or compound subexpressions, and its semantics:

operand? (zero-or-one compositor)

means "zero or one" of the element or content model subexpression to which it applies

the operand element or content model subexpression to which the compositor applies is written to the left of the compositor

operand* (Kleene star compositor)

means "zero or more" of the element or content model subexpression to which it applies

the operand element or content model subexpression to which the compositor applies is written to the left of the compositor

operand+ (plus compositor)

means "one or more" of the operand element or content model expression to which it applies

the operand element or content model subexpression to which the compositor applies is written to the left of the compositor

an expression such as a+ where a is an element or subexpression, is equivalent to a,a*

left-operand, right-operand (comma compositor)

means "a sequence of the left, followed by the right operand" element name or content model subexpression

(operand) (grouping)

expressions can be grouped in parentheses such that they can be used as operands to higher level compositors; when parentheses are omitted, content model expressions are parsed left-to-right, ie. a compositor to the left of an operand takes precedence over a compositor to the right

left-operand & right-operand (allgroups-compositor)

means any sequence of the operand elements or subexpressions, provided that any one element or subexpression occurs at most once in total

when applied to an operand subexpression that has the "zero-or-one" compositor as top-most compositor, that operand subexpression isn't required to occur, but if it occurs, it must occur at most once anywhere in the content of the element being declared

the content model expression a & b & c is equivalent to the content model expression (a,((b,c)|(c,b)))|(b,((a,c)|(c,a)))|(c,((a,b)|(b,a)))

Note:

In sgmljs.net SGML, operands of the allgroup compositor must be either

  • a single element name, or

  • a subexpression having the zero-or-one compositor, the sole operand of which is a single element name.

More complex operands for the allgroup compositor aren't supported.
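To make the semantics of the allgroup compositor concrete, the following Python sketch enumerates every child sequence accepted by an & group whose operands are restricted as in the note above (single element names, optionally marked zero-or-one). The function name allgroup_sequences and its two-list interface are assumptions for illustration.

```python
from itertools import permutations

def allgroup_sequences(required, optional=()):
    """Enumerate every child sequence accepted by an & group."""
    accepted = set()
    # Operands written x? in the model may be present or absent;
    # iterate over every subset of the optional operands.
    for mask in range(2 ** len(optional)):
        chosen = [name for i, name in enumerate(optional) if (mask >> i) & 1]
        # Whatever is present may occur in any order, each at most once.
        for order in permutations(list(required) + chosen):
            accepted.add(order)
    return accepted

print(len(allgroup_sequences(["a", "b", "c"])))  # 6
```

For a & b & c this yields the six orderings spelled out by the equivalent expanded content model; for a & b? it yields (a), (a, b), and (b, a).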

If #PCDATA is specified as content token, it is implicitly treated as if (#PCDATA)* were specified, ie. parsed character data is always optional in content models.

Content models must be unambiguous, ie. any content token must be uniquely matched without looking ahead at subsequent content tokens for disambiguation. For example, the content model

(a,b)|(a,c)

is ambiguous, since element a can be matched as the beginning of either (a,b) or (a,c). On the other hand, the equivalent content model expression

a,(b|c)

is unambiguous.
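The kind of ambiguity shown in this example can be detected with a very small check. The Python sketch below covers only the narrow case of an or-group whose alternatives are non-empty sequences of required element names; it is not a general ambiguity test for content models, and the function name ambiguous_alternatives is an assumption.

```python
def ambiguous_alternatives(alternatives):
    """Return the element names that start more than one alternative."""
    seen, clashes = set(), set()
    for sequence in alternatives:
        first = sequence[0]
        if first in seen:
            clashes.add(first)  # two alternatives begin with the same element
        seen.add(first)
    return clashes

# (a,b)|(a,c): both alternatives start with a
print(ambiguous_alternatives([["a", "b"], ["a", "c"]]))  # {'a'}
# the or-group (b|c) inside a,(b|c): no shared first element
print(ambiguous_alternatives([["b"], ["c"]]))            # set()
```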

Tag Inference

For automatic generation of required elements not present in content, FEATURES MINIMIZE OMITTAG YES must be enabled in the SGML declaration (which it is by default, except when processing XML).

In the following description of SGML tag inference, trivial actions on special conditions aren't described, such as on

  • ANY content models, or, equivalently, implied-ANY elements; implied-ANY elements are elements having child elements with implied element declarations ie. undeclared elements (when allowed to occur via IMPLYDEF ELEMENT YES)

  • EMPTY elements, or, equivalently, implied-EMPTY elements (elements governed by content references)

  • inference of document elements (which is just a special case of general start-element tag inference).

See also tag inference examples.

Actions performed on a start-element tag, or on parsed character data

  1. Close definitely completed elements

    Definitely completed elements are those whose required elements have all been parsed by previous actions, in the sequence declared in their content model, such that only an end-element tag for the enclosing (definitely completed) element is accepted at the context position.

    A model group ending in an optional content token or in a content token with one-or-more compositor can't be definitely completed, and isn't considered for automatic closing.

    It's an error if a definitely completed element's end-element tag isn't omissible at this point, because a start-element action cannot be accommodated at the context position.

  2. Check if the start-element tag or parsed character data is accepted at the context position; that is, check it's accepted at the current position in the model group and isn't excluded via exclusion exceptions

  3. Open contextually required elements

    Elements are contextually required if the content model of the enclosing element accepts a single element at the context position as a required element.

    The element to accommodate is not influential in opening elements here, only the state of model group(s) already opened is considered.

  4. If a contextually required element is opened, and matches the content token to accommodate, tag inference is completed for this action

Additional rules

The following actions are performed by sgmljs.net SGML in addition (these and similar recovery actions are also performed by third party SGML parsers such as SP, but are reported as recoverable errors by those parsers, whereas sgmljs.net SGML performs these actions silently):

At step 3, if it isn't possible to open a contextually required element, and

  • the context is immediately below the document element (such that inferring an end-element tag at the context position will close the document element, and logically end the document), and

  • the element to accommodate is not declared to have rank or ends with a numeric token (see next section), and

  • there's a single transition over an element from the context state, and

  • the start-element tag of the element to transition over is omissible

then that single transitioned-over element is opened as if it were contextually required.

At step 3, if it isn't possible to open a contextually required element, and

  • the element to accommodate is declared as having rank, and

  • the element's rank suffix to accommodate is higher (numerically larger) than that of the parent element (or the parent has no rank, in which case it is treated as having rank 0), and

  • there's a single transition over a ranked element from the context state, and

  • the rank of that single transitioned-over element is the same as that of the element to accommodate, and

  • the start-element tag of the ranked element to transition over is omissible

then that single transitioned-over element is opened as if it were contextually required.

Moreover, if it isn't possible to open either a contextually required element or a rank-implied element as described, the parent element is closed, if it is potentially completed (see definition below).

At step 2, if the element to accommodate isn't accepted at the context position due to exclusion exceptions, close as many potentially completed parent elements as necessary until it is (ie. until no more exclusion exceptions apply to the element to accommodate, if possible).

Actions performed on an end-element tag

  1. Close potentially completed elements

    Potentially completed elements are those whose required elements have all been parsed by previous actions, in the sequence declared in their content model; as opposed to definitely completed elements, the model group may allow further optional elements, or end in a content token (or in a nested model group) having the one-or-more compositor.

  2. If the end-element to accommodate matches the most recently closed element, tag inference is completed for this action

Empty element minimization

As an additional minimization feature, SGML supports empty start- and end-element tags, ie. tags in which the element name is omitted. This feature doesn't require any special markup declaration and can be applied to any element (except to the start-element tag of the document element), subject to the FEATURES MINIMIZE SHORTTAG STARTTAG EMPTY and FEATURES MINIMIZE SHORTTAG ENDTAG EMPTY SGML declaration settings, respectively.

An empty start-element tag is treated as if it were a start-element tag for the most recently closed element. For example, the empty start-element tag <> in the following markup text

<foo>
  <bar>...</bar>
  <>...</bar>
</foo>

is interpreted as <bar> start-element tag.

An empty end-element tag is treated as if it were an end-element tag for the context element (eg. name of the nearest unclosed element). For example, </> is equivalent to </bar> in the following markup text:

<foo>
  <bar>...</>
</foo>

Note: the SGML terms empty start-element tag and empty end-element tag are used for the <> and </> tokens. An XML-style empty element token, on the other hand, represents a different concept.
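The two rules for <> and </> can be sketched as a small resolver over an already tokenized, well-nested tag stream. This is a minimal Python illustration of the rules as stated above, assuming tags without attributes and no other tag omission; the function name resolve_empty_tags is hypothetical.

```python
def resolve_empty_tags(tokens):
    """Replace <> and </> by the full tags they stand for."""
    open_stack = []     # names of currently open elements
    last_closed = None  # name of the most recently closed element
    out = []
    for tok in tokens:
        if tok == "<>":
            if last_closed is None:
                raise ValueError("empty start-element tag without a closed element")
            open_stack.append(last_closed)
            out.append("<" + last_closed + ">")
        elif tok == "</>":
            if not open_stack:
                raise ValueError("empty end-element tag without an open element")
            last_closed = open_stack.pop()
            out.append("</" + last_closed + ">")
        elif tok.startswith("</"):
            if open_stack:
                open_stack.pop()  # input is assumed to be well nested
            last_closed = tok[2:-1]
            out.append(tok)
        elif tok.startswith("<"):
            open_stack.append(tok[1:-1])
            out.append(tok)
        else:
            out.append(tok)  # character data is passed through
    return out

print(resolve_empty_tags(["<foo>", "<bar>", "...", "</>", "</foo>"]))
# ['<foo>', '<bar>', '...', '</bar>', '</foo>']
```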

Attributes

Declarations for attribute lists take the form

<!ATTLIST element-name attribute-name declared-value [default-value]
                      [attribute-name declared-value [default-value]] ...>

or

<!ATTLIST name-group attribute-name declared-value [default-value]
                    [attribute-name declared-value [default-value]] ...>

or

<!ATTLIST #ALL attribute-name declared-value [default-value]
                    [attribute-name declared-value [default-value]] ...>

where

element-name

is a single element name to declare attributes for

name-group

is a list of element names to declare attributes for

a name group has the form (element1|element2|...|elementN).

#ALL

declares the attribute on all (declared or undeclared) elements when used in place of element-name or name-group

attribute-name

is the name of the attribute to declare

declared-value

is one of the following possible lexical value types

  • an enumerated value type

  • CDATA, allowing any quoted string to be used as attribute value

  • ENTITY, allowing a name token declared as entity name in the same declaration set; the token doesn't need quoting

  • ENTITIES, allowing, in addition to ENTITY, a space-separated list of name tokens declared as entity names; when actually specifying more than a single entity name in content, the attribute value must be quoted

  • ID, allowing a name token, which must be unique among all name tokens used as ID in a document, and which establishes an ID value for reference by IDREF or IDREFS attribute

  • IDREF, allowing a name token used as ID in the same document; the token doesn't need quoting

  • IDREFS, allowing, in addition to IDREF, a space-separated list of name tokens declared as ID attribute value; when actually specifying more than a single ID value in content, the attribute value must be quoted

  • NAME, allowing a name token; the token doesn't need quoting

  • NAMES, allowing, in addition to NAME, a space-separated list of name tokens

  • NMTOKEN, allowing, in addition to NAME, a token beginning with . (dot), - (minus), or _ (underscore), whereas NAME allows these characters to occur only at the second or subsequent position in the attribute value

  • NMTOKENS, allowing, in addition to NAMES, a list of tokens, each of which beginning with . (dot), - (minus), or _ (underscore)

  • NOTATION, allowing the attribute value to have a notation name specified in the enumerated list of permitted notation names

  • NUMBER, allowing a sequence of digits as attribute value

  • NUMBERS, allowing, in addition to NUMBER, a list of numerical values to occur

  • NUTOKEN, allowing a sequence of digits, followed by a sequence of letters (such as 64px)

  • NUTOKENS, allowing, in addition to NUTOKEN, a list of NUTOKEN tokens

  • [data attribute specification], allowing for custom data attribute checks and value normalization (see Data attribute specification)

A single attribute list declaration can declare one or more attributes for one or more elements (when using the name group declaration variant).

Conversely, attributes of the same element can also be declared in multiple attribute list declarations (from potentially multiple declaration sets). However, the same attribute can be effectively declared at most once across all attribute list declarations applicable to a given element: multiple declarations for the same attribute on a given element aren't rejected, but only the first declaration in document order (and, by extension, in the order in which declaration sets are processed) becomes effective, while later declarations are ignored.
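The "first declaration wins" rule can be pictured as follows. This Python sketch assumes attribute declarations have already been flattened into (element, attribute, declared value) triples in document order; the function name effective_attributes and the triple layout are assumptions for illustration only.

```python
def effective_attributes(declarations):
    """Keep only the first declaration of each attribute per element."""
    effective = {}
    for element, attribute, declared_value in declarations:
        # setdefault stores the value only if the key is not yet present,
        # so later declarations of the same attribute are silently ignored.
        effective.setdefault((element, attribute), declared_value)
    return effective

declarations = [
    ("p", "align", "(left|right)"),
    ("p", "lang", "NAME"),
    ("p", "align", "CDATA"),  # ignored: align is already declared on p
]
print(effective_attributes(declarations)[("p", "align")])  # (left|right)
```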

See attribute declaration and use examples.

Default value

The default value is either

  • (for enumerated values) one of the enumerated values

  • (for NOTATION attributes) one of the enumerated notation names

  • (for other attributes) an attribute value literal; only needs quotes if the default value isn't a name token

  • the token #REQUIRED, which means the attribute must be specified, and must have a value

  • the token #IMPLIED, which means the attribute doesn't have to be specified (is optional)

  • the token #CONREF, which means that, if the attribute is specified, then the element on which it is specified is treated as if it were declared EMPTY

The token #FIXED may be specified before default values of the first, second, or third form above. When specified, the attribute either must have the default value, or mustn't be used at all on the respective element.

Note that assigning template entities to attributes declared #CONREF can have additional semantics to the effect that the element on which the #CONREF attribute is specified gets replaced by external content.

Enumerated values

An attribute declaration such as

    <!ATTLIST elmt attr (val1|val2|val3) val1>

declares the attribute attr on element elmt.

The attribute can have either of the values val1, val2, or val3, and its default value (its value when not specified on the element explicitly) is val1.

Element wildcards WebSGML

Using the #ALL keyword, it's possible to declare one or more attributes on all elements; depending on whether undeclared elements are allowed (eg. by using IMPLYDEF ELEMENT YES or IMPLYDEF ELEMENT ANYOTHER as explained below), attributes declared in an attribute list declaration with #ALL can also be used on undeclared elements.

An attribute can be declared both in an #ALL attribute list as well as in a regular attribute list for a single element or an element namegroup at the same time. If an attribute is declared both on an individual element and on #ALL elements, its usage must satisfy both declarations.

For example, an attribute can be declared to have an enumerated value in an #ALL attribute list, and can be declared to have a #FIXED value in an attribute list declaration for an individual element. In this way, it's possible to model a common design pattern in DTDs, wherein an attribute can be declared on an individual element in a more specific way than the generic declaration for the attribute in an #ALL attribute declaration, while the generic #ALL declaration still expresses a baseline declaration and common requirement for the attribute's use across all elements used in a document.

It's a design error (and reported by sgmljs.net SGML as attribute validation error on actual attribute use), if an attribute is declared both as an #ALL attribute and as an attribute on an individual element, when the two declarations are not satisfiable simultaneously. For example, a #FIXED value for an attribute declared in an #ALL attribute declaration can't be refined by declaring a different #FIXED value on an individual element for the same attribute.

The order of an #ALL declaration relative to an attribute declaration of an individual element for the same attribute isn't significant and doesn't change the interpretation of attribute declarations. Moreover, #ALL attribute declarations always apply to all elements of the document type and DTD containing the declaration, irrespective of whether element declarations are placed before or after the respective #ALL attribute declaration in document order (or are present at all).

Note sgmljs.net doesn't support WebSGML's other keywords (such as #IMPLICIT) on attribute declarations in place of #ALL. Moreover, #ALL isn't supported for data attributes (ie. attributes of notations; see below).

Data attribute specifications WebSGML

In addition to the built-in parsing types for attributes as described above, attributes can be declared to have custom data types (this form of declaration makes use of notations explained in the next section).

For example, the following declarations

<!NOTATION html5-form-input
	PUBLIC "+//IDN www.w3c.org/TR/html5//NOTATION HTML 5 Form Input Types//EN">
<!ATTLIST elmt attr DATA html5-form-input>

declare the attr attribute to have a lexical type identified with a notation having the public identifier +//IDN www.w3c.org/TR/html5//NOTATION HTML 5 Form Input Types//EN.

This public identifier represents the collection of lexical datatypes specified by HTML 5 form input validation, and imposes validation and value normalization on attribute values (and on the plain text content of CDATA and SDATA data entities declared to be in that notation).

WebSGML allows specifying attributes for the data library notation such as in

<!ATTLIST elmt attr DATA html5-form-input [ type="email" ]>

or

<!ATTLIST elmt attr DATA html5-form-input [ pattern="XYZ\d+" ]>

In the absence of the type or pattern attribute, sgmljs.net will behave as if text (the most basic HTML 5 input form validation type) had been specified for the type attribute. text accepts any text value as content, and the value normalization applied is restricted to removing newlines, if present.

See form input value checking for more details on lexical value checking.

Notations

A notation, in general SGML terms, is a representation format for data such as the image formats PNG, GIF, or JPEG, or a text format such as TeX for typesetting mathematics.

Notation markup can be used to specify content in a different data representation format than SGML, either embedded in an SGML document, or as a reference to an external resource.

In sgmljs.net, the notation construct is also used to provide custom processing on markup for a broad class of applications such as content formatting and filtering; see templating.

A notation is declared as follows:

<!NOTATION notation-name identifier>

where

notation-name

is the name of the notation to declare

identifier

is the public and/or system identifier for the notation, used to identify the notation as either a built-in notation (SGML, SQL, SPARQL, etc.) or an external custom notation; see identifiers

Notation attributes can be used to markup a piece of inline text as "in a notation": in the following example, the characters \sqrt{2} are marked up as TeX-formatted math:

<!doctype example [
	<!element example (math)+>
	<!element math CDATA>
	<!attlist math format notation (tex) #implied>
	<!notation tex public "TeX">
]>
<example>
	<math format=tex>\sqrt{2}</math>
</example>

Note this is only an example of how to specify inline notation data; the use of the ad-hoc public identifier TeX here won't cause sgmljs.net SGML to execute TeX instructions.

Note that when using notation attributes, the content restrictions and entity expansion behaviour declared in the element declaration for the element on which it is declared and specified apply unchanged.

The syntax for declaring (and specifying values for) NOTATION declared attributes is very similar to that of enumerated values; see attribute examples.

For using notations with external entities, see entities.

Data attributes

Like elements, notations can have attributes. Data attributes are used to configure properties of external data entities, or of inline notational content; see templating for details.

Data attributes are declared as follows:

<!ATTLIST #NOTATION notation-name attribute-name declared-value default-value
                                 [attribute-name declared-value default-value] ...>

For data attributes, the same rules as for element attributes apply, with the following exceptions:

  • data attributes can't have a declared value of ID, IDREF, IDREFS, NOTATION, ENTITY, or ENTITIES (however, special rules apply for templating)

  • unlike element attributes, data attributes must be declared and aren't subject to MINIMIZE IMPLYDEF ATTLIST YES when declared in the SGML declaration

Entities

An entity, in SGML, is a stream of character data.

An entity declaration introduces a name for an entity for subsequent use in the SGML prolog or in content. Parsed entities (see general entities) are used for entity references, which are replaced by the entity's character data on processing. Unparsed entities (see data entities) are used as values of ENTITY (or ENTITIES) attributes for templating or are processed in other entity type-specific ways.

General entities

The purpose of general entities is to support reuse of text at multiple places in a document, by placing entity references to a shared, declared general entity as follows:

<!DOCTYPE doc [
	<!ENTITY text "some <i>reusable</i> text">
	<!ELEMENT doc - - (p+)>
	<!ELEMENT i - - (#PCDATA)>
	<!ELEMENT p - - (#PCDATA|i)*>
]>
<doc>
	<p>First use of the "text" entity follows: &text</p>
	<p>Second use of the "text" entity follows: &text</p>
</doc>

In the example, &text is a reference to the previously declared text (general) entity, and will expand to the string some <i>reusable</i> text in place.

Any markup contained in the entity replacement text will be interpreted as if it had been part of the text in which the entity reference is placed. This means that replacement text can contain tags (or any other SGML content construct such as marked sections, processing instructions, etc.). It may also contain further entity references in turn, which will be expanded in place recursively.

However, valid replacement text for an entity must not contain references to the entity being replaced itself (or, transitively, contain an entity reference expanding into a reference to the entity being expanded itself).
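
As a minimal sketch of what this rule excludes (the entity names a and b are hypothetical), the following two declarations refer to each other, so expanding a reference to either entity would lead back to the entity being expanded:

```sgml
<!-- invalid when referenced: a expands to a reference to b, and b to a reference to a -->
<!ENTITY a "text of a, followed by &b;">
<!ENTITY b "text of b, followed by &a;">
```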

General entity references are expanded anywhere in content, regular attribute specifications, and replacement content of general entities, except in CDATA marked sections, CDATA content, data text entities (CDATA/SDATA entities), and attributes declared with data attribute specifications.

General entity references (as opposed to parameter entity references) aren't expanded in markup declarations.

General entities are lazily fetched at the time(s) an entity reference is parsed in content. When processing an entity declaration with replacement text containing references to further entities, no check is performed whether referenced entities are declared and/or accessible. In particular, unlike parameter entities, at declaration time, replacement text for general entities may contain references to other entities that aren't themselves declared (yet).

See also general entity examples.

External general entities

Rather than specifying the replacement text for an entity literally, it's also possible to specify that replacement text should be retrieved from an external resource (such as a file or via HTTP) by declaring the entity as follows:

<!ENTITY ent SYSTEM "filename.txt">

where the part beginning with SYSTEM ... (containing a file name in the example) is an identifier.

Data text entities

For entities declared as follows

<!ENTITY ent CDATA "escaped replacement text">

or, equivalently,

<!ENTITY ent SDATA "escaped replacement text">

entity references are expanded into the respective literal replacement text without further interpretation of the replacement text as markup. If the replacement text contains characters or character sequences that would be interpreted as markup delimiters (such as the < or & characters), then those characters will be expanded into character entity references.

Consequently, general entity references and tags aren't recognized in data text entities; note, however, that the replacement text literal in a data text entity declaration is subject to parameter entity replacement.
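
For illustration, and assuming a hypothetical entity named snippet, a data text entity can hold text that looks like markup without that text being parsed as markup when the entity is referenced:

```sgml
<!ENTITY snippet CDATA "<b>shown literally</b>">
...
<p>The string &snippet; is treated as plain text rather than as a b element.</p>
```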

In sgmljs.net, CDATA and SDATA data text entities are treated identically.

Processing instruction data text entities

Apart from CDATA and SDATA, the PI keyword can also be used in data text entity declarations.

This variant introduces an entity containing a processing instruction, and is the only variant that can also be used with parameter entities.

References to PI data text entities can only be used in a context where a processing instruction can be used; specifically, PI data text general entity references can't be used in attribute values.
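
A sketch of a PI data text entity and its use between two elements follows; the entity name pagebreak and the instruction text are hypothetical, and the reference is expected to be processed as if the processing instruction had been written at that position:

```sgml
<!ENTITY pagebreak PI "pagebreak">
...
<p>Last paragraph on this page.</p>
&pagebreak;
<p>First paragraph on the next page.</p>
```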

External data text entities

In sgmljs.net, an external data text entity is declared using the syntax for CDATA and SDATA data entities, explained below.

Character entity references

Character entity references are strings of the form &#NNNNNN where NNNNNN is a decimal number, or of the form &#xMMMMMM where MMMMMM is a hexadecimal number. The number refers to the code point in the document character set (Unicode) represented by the character entity reference.

Character entity references are passed as-is to the output; all browsers and markup processing tools are expected to be able to handle character entity references.
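
For example, the decimal and hexadecimal forms in each of the following lines refer to the same code point:

```sgml
<p>&#65; and &#x41; both denote the letter A (U+0041)</p>
<p>&#8211; and &#x2013; both denote an en dash (U+2013)</p>
```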

Parameter entities

Entity declarations with a % character following the ENTITY keyword introduce parameter entities. Where general entity declarations define replacement text for content, parameter entity declarations define replacement text for use in markup declarations.

For example, the following document type declaration set contains a declaration for the idattr parameter entity. The parameter entity is then referenced in each of the subsequent attribute list declarations.

<!DOCTYPE doc [
	<!ENTITY % idattr "id ID #IMPLIED">
	<!ELEMENT doc - - (#PCDATA|p|ul|a)>
	<!ELEMENT p - - (#PCDATA)>
	<!ELEMENT ul - - (li+)>
	<!ELEMENT li - - (#PCDATA)>
	<!ELEMENT a - - (#PCDATA)>
	<!ATTLIST doc %idattr>
	<!ATTLIST p %idattr>
	<!ATTLIST ul %idattr>
	<!ATTLIST li %idattr>
	<!ATTLIST a href CDATA #IMPLIED %idattr>
]>
...

Similar to general entity references, the %idattr parameter entity reference is expanded into the replacement text

id ID #IMPLIED

so that all elements will have the same id attribute declaration as a result.

Furthermore, the a element will have the href attribute in addition to the id attribute. Note that the purpose of reusing an attribute declaration can also be achieved by using a name group - a list of element names - in an ATTLIST declaration (and furthermore could also be achieved using WebSGML's #ALL keyword in place of an element name or name group).
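
As a sketch of the name group alternative, the attribute list declarations of the example could be written without a parameter entity as follows (the a element keeps its own declaration, since it needs the additional href attribute):

```sgml
<!ATTLIST (doc|p|ul|li) id ID #IMPLIED>
<!ATTLIST a href CDATA #IMPLIED
            id   ID    #IMPLIED>
```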

A parameter entity reference must begin with the % character. A parameter entity declaration must have whitespace between the % character and the subsequent parameter entity name.

Apart from reusing parts of declaration text, parameter entities are used in particular for

  • customizing a generic external declaration set by overriding default declarations for parameter entities in the internal declaration set; see declaration sets

  • serving as placeholders for keywords in marked sections

  • designing declaration set text for reuse in general.

Unlike general entities, parameter entities are fetched eagerly as soon as an external parameter entity declaration is processed. Therefore, it is an error for the replacement text of a parameter entity to contain unresolved references to (other) parameter entities; references to parameter entities already declared in a prior declaration (in markup declaration text order), on the other hand, are recognized and expanded in parameter entity replacement text.

Parameter entities can also be used for fetching external content when external content can't or shouldn't be fetched multiple times as would be the case for external general entities, for example when fetching an external service response into a parameter entity for use in multiple references, or when fetching from the standard input or from a network stream.

Parameter entity references are expanded in the replacement text for general entities (as well as in all other markup declarations, except in system identifier literals). This means that any parameter entity value can be re-declared (copied) as a general entity by placing a parameter entity reference into the replacement text for a general entity.
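
A minimal sketch of such a re-declaration, using the hypothetical names product-name and product:

```sgml
<!ENTITY % product-name "Example Product">
<!-- the parameter entity reference is expanded when this declaration is processed -->
<!ENTITY product "%product-name;">
```

A reference to &product; in content would then expand to the text Example Product.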

Note that parameter (or general) entity references aren't expanded in system identifier literals (of markup declarations using external identifiers, such as entity and notation declarations). To construct a system identifier from a parameter entity, an additional, derived parameter entity is declared consisting of a reference to the parameter entity to construct from, with leading and trailing quote characters added; the derived parameter entity is then used as system identifier literal.
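
The technique described above might be sketched as follows; the entity names and the file path are hypothetical, and the sketch merely follows the description rather than a verified sgmljs.net test case:

```sgml
<!ENTITY % location "chapters/intro.sgm">
<!-- derived parameter entity: the value of location wrapped in quote characters -->
<!ENTITY % location-literal "'%location;'">
<!-- the derived entity takes the place of the system identifier literal -->
<!ENTITY intro SYSTEM %location-literal;>
```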

See also parameter entity examples.

External Parameter Entities

Like general entity declarations, parameter entity declarations can point to a system identifier (a file or network location to fetch character data from), rather than providing inline replacement text as parameter literal.
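
For example, shared declarations kept in a separate file can be pulled into a declaration set by declaring and then referencing an external parameter entity. In this sketch, the file name shared.dtd is hypothetical and is assumed to contain, among others, the declaration for the p element:

```sgml
<!DOCTYPE doc [
	<!ENTITY % shared-decls SYSTEM "shared.dtd">
	%shared-decls;
	<!ELEMENT doc - - (p+)>
]>
```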

System-specific Entities

An entity declaration with omitted system identifier literal but containing SYSTEM, such as the following

    <!ENTITY ent SYSTEM>

declares an entity which is resolved by default to the filename ent. The file is searched for in the same directory as the file declaring it (the resolved value or the directory to search can be changed using runtime parameters).

Any entity that can be declared as external entity (general, data and parameter entities) can be declared system-specific.

Implied Entities

When IMPLYDEF ENTITY YES is specified in the SGML declaration, general entity references to undeclared entities will be resolved as system-specific entities. This means there is no need to specify an entity declaration at all; entities can be referenced right away provided the entity name can be resolved as a file name, or another resolution rule has been provided as invocation parameter.

Parameter entities, on the other hand, must always be declared. Note, however, that external data text entities can't be declared system-specific.

Data Entities

Entities can be declared to be in a notation as follows (where we first declare a notation to reference its name in the entity declaration):

<!NOTATION somenotation SYSTEM "some-notation-identifier">
<!ENTITY someent SYSTEM "some-entity" NDATA somenotation>

Entities declared like this are not considered SGML character data and won't be expanded into replacement text when used in an entity reference.

Instead, the SGML processor just reproduces entity references to these as-is; special processing can be implemented and associated with a notation (ie. with a public identifier of a notation) via notation handlers and the SGML API. A standard notation handler is provided by the templating feature.

Data entities declared using the CDATA or SDATA keywords in place of NDATA, on the other hand, will be expanded into the respective replacement text when used as entity reference:

<!NOTATION somenotation SYSTEM "some-notation-identifier">
<!ENTITY ent SYSTEM "some-entity" CDATA somenotation>

An entity reference to ent will be expanded into the text contained in the "some-entity" file; as with data text entities, special characters such as < or & are escaped in the replacement text, and not treated as markup delimiters.

See also data entity examples.

Providing values for data attributes

If a notation has data attributes, values for the data attributes can (or must, if no #FIXED or default values are provided) be specified as shown in the following example:

<!NOTATION n SYSTEM "some system id">
<!ATTLIST #NOTATION n x CDATA #IMPLIED y CDATA #IMPLIED>
<!ENTITY e SYSTEM "another system id" NDATA n [ x="val1" y="val2" ]>

where the first two declarations establish a notation with data attributes x and y, and the NDATA entity declaration for the e entity demonstrates the syntax for providing data attribute values.

Short references

Short references are a facility to replace short spans of punctuation marks and other characters in text content, such as dots, commas, tabs, brackets, spaces, and others, by entity references in a context-dependent way. For example, short references can be used to replace a sequence of two hyphen-minus characters (--) by an ndash character (roughly, a dash the width of an n character). Using short references, text can be typed using the hyphen-minus characters entered via standard keyboard keys, yet can be rendered using the typographically and semantically more desirable ndash character where appropriate.

Short reference map declaration

<!SHORTREF shortref-map-name shortref-delimiter replacement-entity-name
                            [shortref-delimiter replacement-entity-name] ...>

Short references are declared in a short reference map declaration for a named short reference map as shown in the following example, which declares that a sequence of two hyphen-minus characters should be replaced by a reference to the ndash-ent entity, which in turn maps to the character entity reference for ndash (Unicode code point 8211 in decimal) in text portions when the my-shortref-map short reference map is active:

<!ENTITY ndash-ent "&#8211;">
<!SHORTREF my-shortref-map
	"--" ndash-ent>

As shown, a short reference map maps short reference delimiters to general entity names, rather than to replacement text directly.

Short reference use declaration

<!USEMAP shortref-map-name element-name>

or

<!USEMAP shortref-map-name name-group>

To then activate the my-shortref-map short reference map within the text content of a specific element (P in this example) or a group of elements, a short reference use declaration is used:

<!USEMAP my-shortref-map P>
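
Following the second syntax form given above, the same map could be associated with several element types at once by using a name group; the element names in this sketch are merely illustrative:

```sgml
<!USEMAP my-shortref-map (p|li)>
```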

Short reference map declarations can map more than a single short reference delimiter to an entity, as shown in the following example, which, in addition to mapping double hyphen-minus characters, also maps quotation mark (U+0022 QUOTATION MARK) characters to typographic citation mark characters (U+201C LEFT DOUBLE QUOTATION MARK, represented by &#8220; as decimal character entity reference), which might be typographically more appealing, depending on the text language and typographic conventions:

<!ENTITY ndash-ent "&#8211;">
<!ENTITY curlyquot-ent "&#8220;">
<!SHORTREF enhanced-typography
	"--" ndash-ent
	'"'  curlyquot-ent>
<!USEMAP enhanced-typography p>

Of course, when the HTML predefined entities are declared in the SGML declaration (such as when processing .html or .md files, or when a SGML declaration activating the HTML predefined entities is put in the file to process, as shown here), a short reference map can directly refer to predefined entities, rather than having to declare ndash-ent and curlyquot-ent in the prolog:

<!SGML HTML PUBLIC "+//IDN sgml.net//SD SGML declaration body for HTML//EN">
<!DOCTYPE body [
	<!ELEMENT body - - (p+)>
	<!ELEMENT p - - (#PCDATA)>
	<!SHORTREF enhanced-typography
	           "--" ndash
	           '"'  ldquo>
	<!USEMAP enhanced-typography p>
]>
<body>
<p>"Murder" she said -- aka '4:50 from Paddington'</p>
</body>

The example imports the predefined entities for HTML, declares a tiny HTML-like vocabulary, and declares a single short reference map, enhanced-typography, which is made current within the content of p elements by the short reference use declaration.

Invoking sgmlproc on the above content will produce

<body>
<p>&#8220;Murder&#8220; she said &#8211; aka '4:50 from Paddington'</p>
</body>

Short reference use declaration in content

As the example shows, both double quote characters contained in the text will be replaced by &ldquo; character entity references. But typically, it will be desired to replace quote characters in pairs such that only the quote character beginning a quote will produce &ldquo; (U+201C LEFT DOUBLE QUOTATION MARK), while quote characters ending a quote will produce &rdquo; (U+201D RIGHT DOUBLE QUOTATION MARK).

To achieve this, a short reference use declaration can be placed inline in content, as opposed to in the DTD. The following example places a short reference use declaration in content (in addition to using a short reference use declaration in the DTD) to toggle into a short reference map in which the replacement text for the double quote character maps to &rdquo;, thus properly closing the quotation:

<!SGML HTML PUBLIC "+//IDN sgml.net//SD SGML declaration body for HTML//EN">
<!DOCTYPE doc [
	<!ELEMENT doc - - (sub+)>
	<!ELEMENT sub - - (#PCDATA)>
	<!SHORTREF quotation-formatting1 '"' ldquo>
	<!SHORTREF quotation-formatting2 '"' rdquo>
	<!USEMAP quotation-formatting1 sub>
]>
<doc>
	<sub>
		"<!USEMAP quotation-formatting2>Murder" she said
	</sub>
</doc>

When placed in content, a short reference use declaration doesn't associate one or more element names with a short reference map, but instead immediately makes the specified short reference map the current one. The short reference map remains current as long as the element in which the short reference use declaration is placed remains current, and is reset (and possibly re-established from a short reference use declaration in the DTD) when another element becomes current.

As shown, this isn't very useful yet; the desired effect could be achieved much more simply by just using the character entity references for typographic quotation marks directly in content. However, a short reference use declaration can also be placed into the replacement entity text for a short reference map itself, as shown next:

<!SGML HTML PUBLIC "+//IDN sgml.net//SD SGML declaration body for HTML//EN">
<!DOCTYPE doc [
	<!ELEMENT doc - - (sub+)>
	<!ELEMENT sub - - (#PCDATA)>
	<!ENTITY quotation-open "&#8220;<!USEMAP quotation-formatting2>">
	<!ENTITY quotation-close "&#8221;">
	<!SHORTREF quotation-formatting1 '"' quotation-open>
	<!SHORTREF quotation-formatting2 '"' quotation-close>
	<!USEMAP quotation-formatting1 sub>
]>
<doc>
	<sub>
		"Murder" she said
	</sub>
</doc>

As before, SGML will, upon encountering the first double quote character, place the &#8220; replacement entity text into the result document; then SGML will place <!USEMAP quotation-formatting2> into the result document as well, which will make SGML immediately switch to replacing the subsequent, second double-quote character by &#8221;.

A short reference use declaration in content can contain the literal string #EMPTY in place of a short reference map name. Upon encountering such a short reference use declaration in content, SGML will disable recognizing and replacing any short reference delimiters until a new short reference map is made current, either by another short reference use declaration in content, or by opening or closing elements such that an element context becomes current which does have a short reference map associated via a regular short reference use declaration in the DTD.
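
As a sketch, assuming a short reference map that replaces double quote characters is current within p (as in the earlier examples), the first pair of quote characters below would be subject to replacement, while the second pair would be left as-is:

```sgml
<p>"typographic quotes" up to here, <!USEMAP #EMPTY>but "plain quotes" from here on</p>
```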

Replacement text containing markup

While the solution sketched above solves the basic problem of placing typographic quotation marks around portions of text, manually placing short reference use declarations in content to toggle short reference maps is too limiting when the text span to put into quotation marks needs to contain further span-level markup for formatting. For example, quoting a mathematical expression containing superscripted text (eg. for representing exponentiation) or other formatting can only be achieved by meticulously placing short reference use declarations into the short reference maps of all elements which should be accepted in quotations.

As an alternative, the replacement entities of a short reference map can contain markup tags:

<!SGML HTML PUBLIC "+//IDN sgml.net//SD SGML declaration body for HTML//EN">
<!DOCTYPE body [
	<!ELEMENT body - - (p+)>
	<!ELEMENT p - - (#PCDATA|quot)+>
	<!ELEMENT quot - - (#PCDATA)>
	<!ENTITY start-quote "<quot>">
	<!ENTITY end-quote "</quot>">
	<!SHORTREF in-p
		"--" mdash
		'"'  start-quote>
	<!SHORTREF in-quote
		'"'  end-quote>
	<!USEMAP in-p p>
	<!USEMAP in-quote quot>
]>
<body>
<p>"Murder" she said -- aka '4:50 from Paddington'</p>
</body>

While the example markup in itself doesn't put quotation marks around quoted text, it delegates formatting quotations for presentation to HTML/CSS or another mechanism able to attach element-specific formatting and processing rules, such as SGML templating.

Generating start- and end-element tags via short references is so common that a (general) entity declaration can optionally be declared with a bracketed text type of STARTTAG to put STAGO (<) and TAGC (>) characters before and after, respectively, the replacement text, forming a start-element tag. Likewise, when a replacement entity named in a short reference map is declared to have bracketed text type ENDTAG, then ETAGO (</) and TAGC (>) are put around the replacement text when referenced in content. For example, the following two declarations can be used in place of those in the example before to produce exactly the same result:

<!ENTITY start-quote STARTTAG "quot">
<!ENTITY end-quote ENDTAG "quot">

Reference concrete syntax short reference delimiters

The following short reference delimiters can be used as the literal to be recognized and replaced in short reference map declarations:

&#TAB;
&#RE;
&#RS;
&#RS;B
&#RS;&#RE;
&#RS;B&#RE;
B&#RE;
&#SPACE;
B
BB
"
#
%
'
(
)
*
+
,
-
--
:
;
=
@
[
]
^
_
{
|
}
~

where

B

recognizes one or more space or tab characters

BB

recognizes two or more space or tab characters

&#TAB;

recognizes a single tab character

&#SPACE;

recognizes a single space character

&#RS; and &#RE;

recognize a record-start and a record-end character, respectively; see notes below

&#RS;B

recognizes a newline followed by a blank character

&#RS;&#RE;

recognizes an empty line (two consecutive newline characters)

&#RS;B&#RE;

recognizes a line containing a single space character

B&#RE;

recognizes a blank character followed by a newline character

other characters

are recognized when the respective character occurs verbatim in content

Current shortref map

A short reference map is current while the associated element is the top-most one. Sub-elements descending from an element with an associated short reference map, like all other elements, don't have a current short reference map unless declared explicitly to have one in a short reference use declaration. That is, the current short reference map isn't inherited by child elements.

Short reference maps associated with elements inferred at a context position by opening contextually required elements to accommodate character data are not asserted to be current by sgmljs.net SGML. For example, in the following instance

<!DOCTYPE doc [
	<!ELEMENT doc - - (sub+)>
	<!ELEMENT sub O O (subsub+)>
	<!ELEMENT subsub O - (#PCDATA)>
	<!ENTITY start-sub STARTTAG "sub">
	<!ENTITY end-subsub ENDTAG "subsub">
	<!SHORTREF in-doc "&#RS;" start-sub>
	<!SHORTREF in-subsub "|" end-subsub>
	<!USEMAP in-doc doc>
	<!USEMAP in-subsub subsub>
]>
<doc>
sometext |</doc>

sgmljs.net SGML will

  • within doc element content, produce a sub start-element tag (as per the start-sub replacement entity of the in-doc short reference map)

  • within sub content, infer a subsub start-element tag (by subsub's tag omission indicators and the rules for tag inference)

but will not

  • recognize | (the vertical bar character) in the sometext | text span, since the short reference map for the inferred element subsub only becomes current after inference of the subsub start-element tag.

Note that the behaviour of short reference maps in the presence of tag omission/inference is regarded as ambiguous in the SGML specification and is thus not fully portable across SGML systems.

Record boundary insertion

For recognition of the &#RS; and &#RE; short reference delimiters, SGML considers input text to consist of records, which are text lines starting with a record start (RS) and ending in a record end (RE) character. Actual text content fetched via parsed entities will most of the time contain just newline characters (mapped to an SGML RS character) or other character sequences as line terminators. For input data without RE characters, RE characters are inserted into text content prior to short reference delimiter recognition at the text position right before RS characters, except at the end of a text chunk or the end of an external entity, where line termination characters are removed altogether and replaced by a single RE character.

Short reference delimiters can thus use the &#RS; and &#RE; characters as reliable signals/anchors for applying substitutions at the beginning and end of text lines instead of having to rely on newlines/line terminators which are placed after every text line, even those not followed by further text. For example, to parse text content formatted as tab-separated (or comma-separated) values, declarations such as the following can be used:

<!DOCTYPE tsv [
	<!ELEMENT tsv - - (record+)>
	<!ELEMENT record - - (field1,field2,field3)>
	<!ELEMENT field1 - - (#PCDATA)>
	<!ELEMENT field2 - - (#PCDATA)>
	<!ELEMENT field3 - - (#PCDATA)>
	<!ENTITY start-fields "<record><field1>">
	<!ENTITY end-record "</field3></record>">
	<!ENTITY end-field1 "</field1><field2>">
	<!ENTITY end-field2 "</field2><field3>">
	<!SHORTREF in-tsv "&#RS;" start-fields>
	<!SHORTREF in-field1 "&#TAB;" end-field1>
	<!SHORTREF in-field2 "&#TAB;" end-field2>
	<!SHORTREF in-field3 "&#RE;" end-record>
	<!USEMAP in-tsv tsv>
	<!USEMAP in-field1 field1>
	<!USEMAP in-field2 field2>
	<!USEMAP in-field3 field3>
]>
<tsv>
A       B       C
1       2       3
</tsv>

(where the spaces between the A, B, C and the 1, 2, 3 items represent single tab characters).

Note: the notion of records is applicable to any input data parsed by SGML, not only to element content in the presence of short reference declarations. However, sgmljs.net SGML only considers record boundaries in the context of short reference delimiter recognition, and otherwise behaves according to the FEATURES OTHER KEEPRSRE YES setting in the SGML declaration.

Order of recognition of short reference delimiters

Within a short reference map with multiple delimiters, the declared delimiters are matched against input text data in an order honoring their relative nominal length and specificity as follows:

Short reference delimiter literals are compared against already declared (in document order) short reference delimiter literals of the same short reference map based on the number of significant tokens such that &#RS;, &#RE;, &#SPACE;, and &#TAB; are each counted as a single token, and all other single characters (including B characters) are each also counted as a single token. Delimiter literals having more tokens are considered more specific, and will be matched prior to delimiter literals with fewer tokens. When two delimiter literals have the same number of tokens, their respective tokens are compared individually, such that if a token is B, and the corresponding token in the delimiter to compare is either &#SPACE; or &#TAB;, then the token with &#SPACE; or &#TAB; is considered more specific (and will be matched against text input before the delimiter with the B token). For short reference delimiter literals with equal length and specificity, the declaration order will be used as fallback to determine the order of matching text input against short reference delimiters.

Note: this will make a literal ending in either &#TAB; or &#SPACE; match a string with potentially unmatched subsequent space characters, even when the whole sequence, including the unmatched subsequent space characters, could be matched by a short reference delimiter ending in B rather than &#SPACE; or &#TAB; and otherwise being identical (eg. because B represents one or more blanks). That is, the comparison ranks the specificity of a short reference delimiter higher than the maximal length of a matched text span.

Note this is also considered behaviour not sufficiently elaborated in the SGML specification, and prone to limited portability across SGML systems.
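
As a sketch of how the ordering rules of this section play out (the map and entity names are hypothetical, and the replacement texts are mere placeholders), consider a map declaring three delimiters in the following order:

```sgml
<!ENTITY ent-blanks "[blanks]">
<!ENTITY ent-tab    "[tab]">
<!ENTITY ent-indent "[indent]">
<!SHORTREF ordering-demo
	"B"      ent-blanks
	"&#TAB;" ent-tab
	"&#RS;B" ent-indent>
```

Going by the rules above, &#RS;B would be tried first since it has two tokens; of the two single-token delimiters, &#TAB; would be tried before B since it is considered more specific, even though B is declared first.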

Marked Sections

Marked sections are for including or ignoring, respectively, a portion of SGML prolog or content, optionally depending on the value of a parameter entity.

For example, the following document contains a marked section around a content portion:

<!DOCTYPE test [
	<!ELEMENT test - - (#PCDATA|a)*>
	<!ELEMENT a - - (#PCDATA)>
	<!ENTITY % condition "INCLUDE">
]>
<test>
	The following hyperlink is included or
	ignored based on the `condition` parameter
	entity:
	<![ %condition [
		<a>Hyperlink text</a>
	]]>
</test>

The SGML processor will reproduce the <a>Hyperlink text</a> text in its output because the effective value of the %condition parameter entity is INCLUDE; if it were IGNORE instead, the document would be treated as if <a>Hyperlink text</a> weren't contained in the document.

Moreover, the document prolog may contain marked sections, too. In the following document, the attribute declaration will only be applied if the condition parameter entity has the value INCLUDE:

<!DOCTYPE test [
	<!ELEMENT test - - (#PCDATA|a)>
	<!ENTITY % condition "INCLUDE">
	<![ %condition [
		<!ATTLIST test testatt CDATA #REQUIRED>
	]]>
]>
<test testatt="some text">Some other text</test>

A further use case for marked sections (CDATA and RCDATA marked sections) is to prevent interpretation of markup delimiters in portions of text.
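
For instance, a text about markup itself can show tags and entity references literally by wrapping them in CDATA marked sections, as in this sketch:

```sgml
<p>A start-element tag looks like <![ CDATA [<p>]]>,
and an entity reference looks like <![ CDATA [&text;]]>.</p>
```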

A marked section

  • begins with the character sequence <![,

  • followed by zero or more marked section keywords (or parameter entity references expanding to keywords),

  • followed by the [ character (possibly with whitespace before and/or after),

  • followed by the marked section text, and

  • closed with the character sequence ]]>.

Keywords have the following meaning:

INCLUDE

means the portion wrapped in the marked section will be included; the marked section effectively is replaced by the wrapped marked section text

IGNORE

means the marked section is ignored, ie. skipped

TEMP

is equivalent to IGNORE; offers a way to mark up editorial content such as author comments without having to use IGNORE

CDATA

the marked section text is interpreted as verbatim text without interpreting markup delimiters and entity references

RCDATA

the marked section text is interpreted as verbatim text without interpreting markup delimiters except general and character entity reference start characters

If no keyword is encountered (ie. if the parameter entity is expanded into blank text or if a construct such as <![[ text ]]> is used), the marked section will be treated as if INCLUDE were specified.

If multiple keywords are encountered (if the parameter entity expands to multiple keywords, or if multiple parameter entities are used, each of which expands into a keyword) and IGNORE is among them, the marked section is treated as if it were an IGNORE section. That is, IGNORE has highest precedence, followed by TEMP, CDATA, and INCLUDE.
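
A short sketch of the precedence rule, using a hypothetical draft parameter entity alongside a literal keyword:

```sgml
<!ENTITY % draft "IGNORE">
...
<![ INCLUDE %draft; [
	<p>This paragraph is skipped, since IGNORE takes precedence over INCLUDE.</p>
]]>
```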

Marked sections other than CDATA and RCDATA marked sections can be nested up to four levels (ie. marked section text can contain further marked sections, etc.).

Marked sections can contain any SGML construct valid in the context where the marked section is placed.

Marked sections only apply to top-level SGML constructs, and can't be used within e.g. attributes.

Note that it generally doesn't make sense to create a marked section and use parameter entities to switch parsing behaviours between CDATA and either INCLUDE or IGNORE because of how CDATA marked sections are parsed.

Note that sgmljs.net SGML doesn't support external entities in RCDATA marked sections; as a workaround, it's possible to pull external content into a parameter entity, then reference that parameter entity in the replacement text literal of a general entity, and then reference that general entity in an RCDATA marked section.
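
The workaround might be sketched as follows; the entity names and the file name snippet.txt are hypothetical, and the sketch follows the description above rather than a verified test case:

```sgml
<!DOCTYPE doc [
	<!ELEMENT doc - - (#PCDATA)>
	<!-- pull the external content into a parameter entity -->
	<!ENTITY % external-text SYSTEM "snippet.txt">
	<!-- copy it into the replacement text of a general entity -->
	<!ENTITY snippet "%external-text;">
]>
<doc><![ RCDATA [ &snippet; ]]></doc>
```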

See marked section examples.

Identifiers

Literals used for file and other resource names of external entities, declaration sets, notations, or other SGML components are called identifiers in SGML terminology.

Apart from system identifiers which were already used above, SGML also has public identifiers. Public identifiers don't name a physically existing or otherwise accessible resource, but identify a symbolic resource known to the SGML processing system out of band instead of or in addition to a system identifier.

For example, the DTD for HTML 4 (containing declarations for all markup features understood by web browsers up until HTML 5 became generally accepted as standard) can be referenced via the public identifier -//W3C//DTD HTML 4.01//EN without reference to a physical location of a DTD file. Using a public, well-known identifier for this purpose is appropriate since a web browser is usually hard-coded to interpret a particular markup language (such as HTML, SVG, and MathML), and isn't designed to render dynamic markup languages at runtime. Using a system identifier, on the other hand, isn't beneficial here since it would have to be treated as a constant rather than an actually accessible resource by browsers anyway.

Since the introduction of SGML, Uniform Resource Locators (URLs) and variants have become widely used for locating and identifying resources on the web, similar to the purpose of system and public identifiers, respectively. Hence, SGML has been extended to allow the use of URLs as both system and public identifiers. While any URL can be used as system identifier as long as a resource can be located using it, public identifiers also need to include an owner identifier as a prefix which identifies a naming authority and a public text type which identifies the role of the virtual resource identified within a DTD. Therefore, URLs for public identifiers (e.g. for formal public identifiers) are required to have the particular syntax described below.

The following examples show how to declare entities with public identifiers, with system identifiers, and with both public and system identifiers, respectively:

<!ENTITY ent PUBLIC "pubid">
<!ENTITY ent SYSTEM "sysid">
<!ENTITY ent PUBLIC "pubid" "sysid">

Declarations for notations with public, with system, and with both public and system identifiers look very similar:

<!NOTATION n PUBLIC "pubid">
<!NOTATION n SYSTEM "sysid">
<!NOTATION n PUBLIC "pubid" "sysid">

System Identifiers

In most cases, a system identifier is just a path string such as "a/b/c" (using the forward-slash character as separator). Like with URLs used in HTML href or src attributes, the path is resolved relative to the SGML document or DTD from which it is referenced. Hence, a path string can be used both to reference a file (when processing a local SGML file) and a resource accessed via e.g. the HTTP protocol (when accessing a remote SGML document via a network).
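
The relative resolution described above follows the same pattern as relative URL references. As a rough illustration only (this is not sgmljs.net SGML's resolver, and the document locations are made up for the example), Python's standard urllib.parse.urljoin resolves the path string a/b/c against a local and a remote document location:

```python
from urllib.parse import urljoin

# Hypothetical locations of the SGML document containing the reference
local_doc = "file:///home/user/docs/doc.sgm"
remote_doc = "http://example.org/docs/doc.sgm"

# The same path string names a local file or an HTTP resource,
# depending on where the referencing document was loaded from.
local_target = urljoin(local_doc, "a/b/c")
remote_target = urljoin(remote_doc, "a/b/c")

print(local_target)
print(remote_target)
```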

Formal System Identifiers

If support for formal system identifiers is enabled (which it is by default), a system identifier can also take the form of a string such as

<url base='http://localhost/dir'>file

called a formal system identifier.

The syntax for formal system identifiers resembles the syntax for markup elements with optional attributes. The string used as the pseudo-"element" of a formal system identifier, however, must be declared in a storage manager notation declaration rather than an element declaration, and the pseudo-"attributes" of a formal system identifier must be declared as data attributes of the storage manager notation.
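
To make the tag-like structure concrete, the following sketch splits a formal system identifier into the storage manager notation name, its pseudo-attributes, and the remaining storage object identifier. parse_fsi is a hypothetical helper written for this illustration; it handles only the simple quoted-attribute form shown above and is not how any SGML parser is known to implement it.

```python
import re

# Hypothetical helper: matches a leading pseudo-start-tag such as
# <url base='http://localhost/dir'> with quoted attribute values only.
_NAME = r"[A-Za-z][\w.-]*"
_VALUE = r"(?:'[^']*'|\"[^\"]*\")"
_TAG = re.compile(r"<\s*(" + _NAME + r")((?:\s+" + _NAME + r"\s*=\s*" + _VALUE + r")*)\s*>")
_ATTR = re.compile(r"(" + _NAME + r")\s*=\s*(?:'([^']*)'|\"([^\"]*)\")")

def parse_fsi(sysid):
    """Return (notation, attributes, storage object id), or None if informal."""
    m = _TAG.match(sysid)
    if m is None:
        return None  # plain (informal) system identifier such as a/b/c
    attrs = {}
    for name, single, double in _ATTR.findall(m.group(2)):
        attrs[name] = single or double
    return m.group(1), attrs, sysid[m.end():]

print(parse_fsi("<url base='http://localhost/dir'>file"))
```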

Note that not all kinds of formal system identifiers are supported in all system identifier roles as indicated in subsequent sections.

Storage manager notations

The osfile, url, osfd, and literal storage manager notations are part of the "storage manager notation starter set" as defined in ISO 10744:1997 (HyTime 2nd ed.), and are also available for use in third-party SGML systems such as (Open)SP.

sgmljs.net SGML has the additional storage manager notations exec, strftime, strptime, script, and sql as described below. For portability of SGML documents, usage of these storage manager notations must be declared in an FSI processing instruction in the declaration set where they are used, whereas osfile, url, osfd, and literal must not be declared in an FSI processing instruction.

URL storage manager notation

The URL storage manager notation provides access to resources that can be addressed using a Uniform Resource Locator as defined in RFC 3986/RFC 6874. Note that while a URL formal system identifier can represent resources in a large variety of storage protocols and representation schemes, sgmljs.net SGML can only access URLs having the http: or https: scheme/protocol.

URL storage manager notation identifiers without an explicit scheme will be interpreted relative to the URL of the SGML file in which an entity or other resource making use of the FSI is declared. In the cases discussed within sgmljs.net SGML reference manuals, this will either be a file: URL or an http:/https: URL. The interpretation of a scheme-less URL storage manager notation identifier is just the same as with an informal system identifier as used in the examples for external entity declarations. However, a URL imposes specific requirements with respect to encoding of special characters in resource names.

The URL storage manager notation behaves as if declared as follows:

<!NOTATION url 
	PUBLIC "-//IETF/RFC1738//NOTATION
	        FSISM PORTABLE Uniform Resource Locator//EN">
<!ATTLIST #NOTATION url
	base CDATA #IMPLIED>

The url storage manager notation can be used as a derived storage manager notation in a custom storage manager declaration.

For example, the following declaration declares the value of the ent parameter entity as the URL formed by resolving image1.png relative to a site-wide used path for storage of images, rather than relative to the document in which the declaration is placed:

<!NOTATION myurl SYSTEM>
<!ATTLIST #NOTATION myurl
	superdcn NAME #FIXED url
	base CDATA #FIXED "/images">
<?IS10744 FSIDR myurl>
<!ENTITY % ent
	SYSTEM "<myurl>image1.png">

The superdcn attribute has a #FIXED value of url, declaring to SGML that it should be treated as a storage manager notation derived from the built-in url storage manager notation.

OSFILE storage manager notation

A storage manager notation identifier can begin with <osfile>, followed by a string interpreted as file name; this option is used to override interpretation of the identifier as URL path (such as when interpretation of URL percent-encoding is undesired).

OSFD storage manager notation

A storage manager notation identifier can begin with <osfd>, followed by a file descriptor number in the range 0-4. The file descriptor number corresponds to one of those specified by POSIX (IEEE Std 1003.1-2008) for the standard file descriptors of a Unix process. For example, <osfd>0 represents the standard input, <osfd>1 represents the standard output, and <osfd>2 represents the standard error output of the Unix process of the SGML processor parsing the SGML document. Usage of osfd storage manager notation system identifiers is explained in templating.

LITERAL storage manager notation

The system identifier for an entity declaration can be a string beginning with <literal>, followed by literal replacement text for the entity; this form of system identifier is functionally equivalent to using the literal text as replacement text in an entity declaration.

For example, the following declarations result in general entities expanding to the same value:

<!entity e "replacement text">
<!entity f system "<literal>replacement text">

EXEC storage manager notation

The system identifier of a parameter entity declaration can contain a string beginning with <exec and specifying an executable Unix shell command in its cmd pseudo-attribute. The value of a parameter entity declared with an exec formal system identifier is the output (Unix standard output) of the command being executed.

For example, the following declaration establishes the file-listing parameter entity containing character data produced as output by the Unix ls file listing program (with *.txt as parameter):

<!entity % file-listing system "<exec cmd='ls *.txt'>">

The program is executed with the directory of the SGML file containing the entity declaration as current working directory.

The program input is declared as the element content of the exec storage manager pseudo-element; for example, the following declaration establishes the str entity containing the result of replacing b by h on the input Abba supplied as Unix standard input to the tr program for replacing characters:

<!entity % str system "<exec cmd='tr b h'>Abba">
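
The effect of this declaration can be approximated outside of SGML. The sketch below is only an analogy, assuming a Unix system with the tr program on the PATH; it runs the same command through a shell with Abba as standard input and captures the standard output, which corresponds to what would become the replacement text.

```python
import subprocess

# Run `tr b h` through the shell with "Abba" as standard input,
# mirroring cmd='tr b h' and the inline content of the exec FSI.
result = subprocess.run(
    "tr b h",
    shell=True,
    input="Abba",
    capture_output=True,
    text=True,
    check=True,
)
replacement_text = result.stdout
print(replacement_text)  # Ahha
```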

Note that while the value for cmd can also be specified using single-quote characters, the declaration syntax puts restrictions on the simultaneous use of the single- and double-quote shell meta characters in cmd values.

Moreover, the exec storage manager notation isn't generally available in web processing contexts for security reasons (and where it is available, will be restricted in terms of file access and available commands).

The exec storage manager notation behaves as if declared as follows:

<!NOTATION exec
	PUBLIC "+//IDN sgml.net//NOTATION
	        FSISM POSIX Shell Command Language//EN">
<!ATTLIST #NOTATION exec
	cmd CDATA #REQUIRED
	in CDATA #IMPLIED>

As an alternative to specifying the characters that comprise the standard input for the command inline, input can be specified using the in parameter. The in parameter can have the value <osfd>0, in which case the standard input stream of the SGML processor parsing the SGML document declaring the parameter entity is used/inherited as standard input for the command being executed.

The exec storage manager notation can be used as a derived storage manager notation in a custom storage manager declaration. Data attributes declared for a custom storage manager declaration deriving from exec are interpreted as Unix environment variables to be set in the execution environment of cmd.

For example, the following declaration declares the value of the ent parameter entity as the output of executing echo $PARAM on a Unix command line shell with PARAM set to the value MyParam:

<!NOTATION custom SYSTEM>
<!ATTLIST #NOTATION custom
	superdcn NAME #FIXED exec
	PARAM CDATA #IMPLIED>
<?IS10744 FSIDR exec custom>
<!ENTITY % ent
	SYSTEM "<custom cmd='echo $PARAM' PARAM='MyParam'>">

As shown, the use of the exec storage manager notation, as well as any custom storage manager notations deriving from it, must be declared in an FSI processing instruction such as

<?IS10744 FSIDR exec custom>
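
The mapping of data attributes to environment variables can be pictured with an ordinary subprocess call. This is a sketch under the assumption of a Unix shell, not a description of sgmljs.net SGML internals; whether a trailing newline from echo is kept in the replacement text is not covered here.

```python
import os
import subprocess

# The PARAM data attribute becomes an environment variable of the
# command; the rest of the environment is inherited unchanged.
env = dict(os.environ, PARAM="MyParam")
result = subprocess.run(
    "echo $PARAM",
    shell=True,
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
replacement_text = result.stdout
print(replacement_text.strip())  # MyParam
```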

STRPTIME and STRFTIME storage manager notations

The strptime and strftime storage manager notations implement date parsing and formatting, respectively, according to POSIX specifications. strftime formats a Unix epoch time (the time in seconds since the "epoch", ie. since January 1st, 1970) into a human-readable date/time representation, and strptime does the reverse.

As implied by its name, strftime implements parts of the POSIX (IEEE Std 1003.1-2008) date and time template format described as part of the ISO C standard. See http://pubs.opengroup.org/onlinepubs/009695399/functions/strftime.html for reference.

A POSIX strftime/strptime date and time template consists of conversion specifiers for the format of individual date and time components intermixed with space or punctuation characters and is used for parsing a given date representation such as Sat Jan 23, 2010 into the Unix epoch time numerical representation, as well as formatting a numerical representation into a human readable form.

Compared to the full POSIX specification for strftime and strptime, only the following subset of conversion specifiers is supported:

  • %a (three-letter day of week in the international locale; eg. "Mon")

  • %b (three-letter month of the year in the international locale; eg. "Apr")

  • %Z and %z at the end of the format literal (the literal letter "Z")

  • %m (month of the year as a two-digit decimal value with leading zero)

  • %Y (year as four-digit decimal value)

  • %d (day of the month)

  • %H (hour of the day)

  • %M (minute of the hour)

  • %S (second of the minute)

In addition to the conversion specifier character %, the $ character is supported as an alternative. The % character can be an unfortunate choice when a strptime or strftime storage manager notation system identifier literal is assembled from parameter entities, since % also introduces parameter entity references.

Note that the system identifier in an entity declaration literal itself isn't subject to parameter entity expansion. Instead, the complete identifier literal construct, including surrounding quotation characters, must be expanded from a parameter entity at the place where the system identifier literal construct is expected in an entity declaration to make use of entity expansion on system identifiers.

The strptime and strftime date and time template format according to the POSIX specification requires both the conversion specifiers in the date/time pattern and the matching text tokens in the parsed or formatted value to be separated by space or punctuation character tokens.

For example, whereas %d %m %Y is a valid template, %d%m%Y is not, since the value to format or parse needs to present the converted components separated by space or punctuation characters. However, as an exception to this rule, in sgmljs.net SGML,

  • the common sequences %Y%m and %y%m are supported when appearing as the only conversion specifiers (eg. with other text being either absent or consisting of just leading and/or trailing boilerplate text)

  • the ISO 8601 date/time formats and the format used for date/time representation in RFC 2616 (the HTTP/1.1 specification) are supported

As an example for parsing an HTTP date, the parameter entity declaration

<!ENTITY % d SYSTEM "<strptime fmt='%a, %d %b %Y %H:%M:%S %Z'>Tue, 26 Mar 1996 22:20:12 GMT">

will result in the value 827878812 (the number of seconds from the epoch to the given date/time).
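
The value can be cross-checked independently of SGML. The following sketch uses Python's own strptime (which supports the same conversion specifiers for this template) and assumes the default C/English locale for the %a and %b names; it yields the count of seconds since January 1st, 1970 UTC.

```python
from datetime import datetime, timezone

fmt = "%a, %d %b %Y %H:%M:%S %Z"
text = "Tue, 26 Mar 1996 22:20:12 GMT"

# strptime returns a naive datetime for %Z here, so UTC is attached
# explicitly before converting to epoch seconds.
parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
epoch_seconds = int(parsed.timestamp())
print(epoch_seconds)  # 827878812
```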

For parsing an ISO 8601 combined date/time, an entity declaration similar to the following can be used:

<!ENTITY % d SYSTEM "<strptime fmt='%Y-%m-%dT%H:%M:%S%z'>1996-03-26T22:20:12Z">
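
For comparison, the same ISO 8601 template can be exercised in Python (3.7 or later, where %z accepts the literal Z). This is an illustration of the round trip between a textual date and epoch seconds, not of sgmljs.net SGML's implementation.

```python
from datetime import datetime, timezone

# Parsing direction (strptime): text to epoch seconds
parsed = datetime.strptime("1996-03-26T22:20:12Z", "%Y-%m-%dT%H:%M:%S%z")
epoch_seconds = int(parsed.timestamp())

# Formatting direction (strftime): epoch seconds back to text
formatted = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime(
    "%Y-%m-%dT%H:%M:%SZ"
)
print(epoch_seconds, formatted)
```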

SCRIPT storage manager notation

The script storage manager notation is used to invoke an ECMAScript function and make the result of its execution available as the replacement text for a parameter entity. Note there's no support for using a script FSI directly in the declaration of a general entity; general entities can receive a script-generated value only by copying from a parameter entity (eg. by referencing a parameter entity declared using a script FSI in the replacement text for a general entity).

An entity declaration using a script FSI takes either of the following forms:

<!ENTITY % e SYSTEM "<script>return 'hello'">

or

<!ENTITY % e SYSTEM "<script module='mymodule' function='myfunction'>">

script FSIs specifying the ECMAScript code text directly as storage object identifier (as inline content of the <script> pseudo-element) are always executed synchronously. Their value is the result returned from executing the specified ECMAScript code as if it were evaluated by constructing an ECMAScript Function object and invoking it with any storage manager notation data attributes bound to the accordingly named function parameters.

Script code text specified inline can be dynamically executed or can be bundled, depending on the level of script support in the sgmljs.net SGML application binary. Bundled code is code that becomes part of an sgmljs.net SGML application binary at build time, rather than at the time of processing SGML. When an sgmljs.net SGML application binary is built with bundled functions, inline code text presented in a script FSI is compared to the script code text that was provided at build time, and is expected to match the prebuilt code text exactly. For script code to be bundled into an sgmljs.net SGML application binary, it must be presented as part of an FSI definition document in an IS10744 fsidr processing instruction.

script FSIs specifying values for the module and function storage manager notation attributes are interpreted as references to a bundled (static-like) CommonJS module. The parameter replacement text is obtained by invoking the function specified in the function data attribute within the module specified in the module data attribute.

Like bundled inline-provided code text, script FSIs specifying values for module and function must be declared as custom notation storage managers in an IS10744 fsidr processing instruction.

Invoked modules are interpreted as ECMAScript CommonJS modules with the system object being declared as member of the ECMAScript global object. Specifically, data attributes specified as part of the FSI can be accessed using the system.env map. The result of executing module code to become the replacement text of the declared parameter entity is expected to be asynchronously written to the system.stdout text stream via its write() member function, with a final call to system.stdout.end() to continue processing of the SGML document the declaration is part of.

Bundling CommonJS modules and detailed instructions are specific to a target platform.

SQL storage manager notation pro

The sql storage manager is used to fetch content from SQL databases into SGML parameter entities; like <script> FSIs, <sql> FSIs can't be used directly in general entity declarations.

The replacement text assigned to the parameter entity is the result of executing an SQL query or statement, formatted either as delimiter-separated values (with the vertical bar character as column delimiter by default, for compatibility with markdown table conventions) or as a single atomic text string (if only a single attribute value is fetched in the SQL query).

The sql storage manager notation, when used in an IS10744 FSIDR processing instruction, and in the absence of an explicit notation declaration, is declared as

<!NOTATION sql
	PUBLIC "ISO/IEC 10744:1992//NOTATION SQL Storage Object Specification//EN">

<!ATTLIST #NOTATION sql
            connectstr CDATA #IMPLIED
            headings   (OFF|ON) ON
            underline  (OFF|ON) ON
            colsep     CDATA "|">

The storage object identifier of an <sql> FSI

  • can contain SQL statements and additional directives to control output (cf. the common SQL script syntax supported by the sqlproc tools)

  • can reference declared non-standard custom data attributes of the sql notation or a custom data storage manager notation by using the & ampersand character.

For example, the content of the query-result parameter entity in the following declaration is the result of querying the NAMES database table for all names with a particular gender_cd value:

<!ATTLIST #NOTATION sql gender_cd CDATA #IMPLIED>
<!ENTITY % query-result
	SYSTEM "<sql gender_cd='0'>SELECT NAME FROM NAMES WHERE GENDER_CD = '&gender_cd';">

Note that while &gender_cd might look like a general entity reference here, substitution of &gender_cd by the actual value for gender_cd applies rules for safe/injection-free text substitution in SQL and will only substitute references in SQL text literal content (eg. portions enclosed in single quote characters).
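
The idea of injection-free substitution and of vertical-bar-separated output can be sketched with Python's sqlite3 module. Everything here is an assumption made for illustration: the in-memory NAMES table, its rows, and the run_query helper are invented, bound parameters stand in for whatever substitution rules the sql storage manager applies, and headings and underlines are left out.

```python
import sqlite3

# In-memory stand-in for the NAMES table assumed by the example above
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE NAMES (NAME TEXT, GENDER_CD TEXT)")
db.executemany(
    "INSERT INTO NAMES VALUES (?, ?)",
    [("Alice", "0"), ("Bob", "1"), ("Carol", "0")],
)

def run_query(sql, params, colsep="|"):
    """Execute sql with bound parameters; join columns with colsep, rows with newlines."""
    # Bound parameters keep attribute values out of the SQL text itself,
    # so a value cannot change the structure of the statement.
    rows = db.execute(sql, params).fetchall()
    return "\n".join(colsep.join(str(col) for col in row) for row in rows)

names = run_query(
    "SELECT NAME FROM NAMES WHERE GENDER_CD = :gender_cd ORDER BY NAME",
    {"gender_cd": "0"},
)
table = run_query("SELECT NAME, GENDER_CD FROM NAMES ORDER BY NAME", {})
print(names)
print(table)
```

A value such as 0' OR '1'='1 passed for gender_cd is treated as a plain string by the bound parameter and matches no row.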

Custom storage manager notations

Unlike built-in storage manager notations, custom storage manager notations are notations with user-defined names declared as derived from either the url or the exec storage manager notation with (typically) fixed data attributes.

See EXEC storage manager notation for an example of a custom storage manager notation.

Formal system identifier processing instruction

As already explained, use of the non-standard exec, strptime, strftime, and script storage manager notations must be declared in an FSI processing instruction to make the functionality available to entity declarations:

<?IS10744 FSIDR exec strptime strftime>

An IS10744 FSIDR processing instruction must be placed in every declaration set (document type declaration set or link process declaration set) making use of a non-standard storage manager notation.

An IS10744 FSIDR processing instruction can reference an FSI definition document using the syntax shown in the following example:

<?IS10744 FSIDR exec strptime strftime FSIDefDoc="fsidd.declarations">

(where fsidd.declarations represents a file name and is to be replaced by the name of the actual file to use).

The referenced FSI definition document is expected to contain custom notation declarations and their associated data attribute declarations. If no FSI definition document is specified, custom storage manager notations are expected to be declared in the document and declaration set wherein the FSI processing instruction is placed.

The form of an IS10744 FSIDR processing instruction with an FSI definition document is used for organizing storage manager notations as coordinated resource access methods for larger sets of documents such as web sites.

Public Identifiers

A public identifier is a sequence of the ASCII characters A through Z, a through z, the decimal digits, the characters (, ), +, , (comma), . (dot), /, :, =, ?, -, and the space, newline and carriage-return characters.
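
The character set above translates directly into a pattern check. The sketch below transcribes the list from the preceding paragraph into a regular expression; is_public_identifier is a hypothetical helper, not a function of any SGML toolkit.

```python
import re

# Letters, digits, ( ) + , . / : = ? - and space, newline, carriage return,
# as enumerated in the paragraph above.
_PUBID = re.compile(r"[A-Za-z0-9()+,./:=?\- \n\r]*")

def is_public_identifier(text):
    """Return True if text consists only of permitted public identifier characters."""
    return _PUBID.fullmatch(text) is not None

print(is_public_identifier("-//W3C//DTD HTML 4.01//EN"))  # True
print(is_public_identifier("my_dtd#1"))                   # False
```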

Formal Public Identifiers

If FORMAL YES is specified in the SGML declaration, a public identifier must have the following syntax:

  1. Owner identifier

    • Either the string "ISO" followed by a string made of digits and : (colon) characters, followed by the string //

    • or the characters + or - followed by the string //, followed by a string not containing the / (slash) character, followed by the string //

  2. Public text class

    One of the following strings, directly following the preceding string //:

    • CAPACITY

    • CHARSET

    • DOCUMENT

    • DTD

    • ELEMENTS

    • ENTITIES

    • LPD

    • NONSGML

    • NOTATION

    • SHORTREF

    • SUBDOC

    • SYNTAX

    • TEXT

  3. Public text description

    A string of characters not containing the / (slash) character

  4. The string //

  5. Public text designating sequence (for CHARSET public text) or public text language (for other public text classes)

    A string of characters not containing the / (slash) character

  6. The string //

  7. Public text display version

    A string of characters not containing the / (slash) character

Except for CHARSET public text, the components following public text description are optional; if the optional components are omitted, the public identifier ends in the public text description.

The public text display version is optional; if the public text display version is omitted, the public identifier ends in the public text designating sequence or public text language.

As examples for public identifiers, here are the public identifiers of HTML 4 and SGML, respectively:

-//W3C//DTD HTML 4.01//EN

ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN
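
The component structure listed above can be made tangible by splitting the two example identifiers at their // separators. parse_fpi is a hypothetical helper for illustration; it follows the numbered list in this section and does not cover every detail of ISO 8879.

```python
PUBLIC_TEXT_CLASSES = {
    "CAPACITY", "CHARSET", "DOCUMENT", "DTD", "ELEMENTS", "ENTITIES", "LPD",
    "NONSGML", "NOTATION", "SHORTREF", "SUBDOC", "SYNTAX", "TEXT",
}

def parse_fpi(fpi):
    """Split a formal public identifier into owner, class, description, etc."""
    parts = fpi.split("//")
    if parts[0] in ("+", "-"):
        # "+" or "-" prefix, then the owner name
        owner, rest = "//".join(parts[:2]), parts[2:]
    elif parts[0].startswith("ISO"):
        owner, rest = parts[0], parts[1:]
    else:
        raise ValueError("unrecognized owner identifier")
    if not rest:
        raise ValueError("missing public text class")
    text_class, _, description = rest[0].partition(" ")
    if text_class not in PUBLIC_TEXT_CLASSES:
        raise ValueError("unknown public text class: " + text_class)
    return {
        "owner": owner,
        "class": text_class,
        "description": description,
        # public text language, or designating sequence for CHARSET
        "language": rest[1] if len(rest) > 1 else None,
        "version": rest[2] if len(rest) > 2 else None,
    }

print(parse_fpi("-//W3C//DTD HTML 4.01//EN"))
```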

URN syntax for formal public identifiers

If URN YES is specified in the SGML declaration, any public identifier in a markup declaration can also be declared using an alternative URN syntax (in addition to the standard syntax for public identifiers when FORMAL YES is specified).

Examples for the public identifiers in URN syntax corresponding to those in standard public identifier syntax above are as follows:

urn:publicid:-:W3C:DTD+HTML+4.01:EN

urn:publicid:ISO+8879%3A1986:NOTATION+Standard+Generalized+Markup+Language+(SGML):EN
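
The correspondence between the two notations is mechanical. The following sketch applies the transcription rules of the publicid URN namespace (RFC 3151) as the author of this example understands them: // becomes :, whitespace runs become +, and a few reserved characters are percent-encoded. fpi_to_urn is a hypothetical helper and is not claimed to be how sgmljs.net SGML performs the mapping.

```python
# Single-character transcriptions; "//" and "::" are handled separately
_ESCAPES = {
    " ": "+", "+": "%2B", ":": "%3A", "/": "%2F", ";": "%3B",
    "'": "%27", "?": "%3F", "#": "%23", "%": "%25",
}

def fpi_to_urn(fpi):
    """Transcribe a public identifier into urn:publicid: syntax."""
    text = " ".join(fpi.split())  # trim and collapse whitespace runs
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "//":
            out.append(":")
            i += 2
        elif pair == "::":
            out.append(";")
            i += 2
        else:
            out.append(_ESCAPES.get(text[i], text[i]))
            i += 1
    return "urn:publicid:" + "".join(out)

print(fpi_to_urn("-//W3C//DTD HTML 4.01//EN"))
```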

Declaration Sets

Each markup declaration is part of a declaration set. A declaration set is either a document type declaration set or a link process declaration set (until now we have only considered document type declaration sets, see Templating Reference for link process declaration sets).

Any SGML document prolog consists of one or more named declaration sets, as in the following example:

    <!DOCTYPE D [
        ... markup declarations ...
    ]>
    <!DOCTYPE E [
        ... markup declarations ...
    ]>
    ... document content ...

Note that standard SGML only allows multiple DTDs to occur if either the CONCUR YES or LINK IMPLICIT YES or LINK EXPLICIT YES n features are active in the SGML declaration.

Internal and external subset

Via parameter entities, a declaration subset can reference markup declarations stored in other text files or external resources. In the following example, markup declarations from the e parameter entity, declared to contain the content of external-declarations.dtd, are included in the DTD:

<!doctype D [
	... other declarations ...

	<!entity % e system "external-declarations.dtd">
	%e
]>
... document content ...

The following is an alternative syntax for achieving the same:

<!doctype D system "external-declarations.dtd" [
	... other declarations ...
]>
... document content ...

An identifier specification such as system "external-declarations.dtd" is called the external subset identifier and is interpreted by SGML such that the markup declarations located or identified by it are included at the end of the declaration set being declared.

The set of markup declarations introduced via an external subset identifier is called the external subset, as opposed to the internal subset, which is the set of markup declarations that appear bracketed within the [ and ] delimiters of a DTD or LPD.

As a consequence of the external subset being processed after the internal subset, the internal subset can preempt ("override") entity declarations (but not other markup declarations) in the external subset. In the following example, the x entity is declared both in the internal-subset-preemption-example.sgm document and in external-declarations.dtd:

<!-- external-declarations.dtd -->
<!entity x "This">

<!-- internal-subset-preemption-example.sgm -->
<!doctype d system "external-declarations.dtd" [
  <!entity x "That">
]>
<d>&x</d>

The declaration in the internal subset gets processed first and sets the replacement text value for x (to That); the declaration for x in external-declarations.dtd is ignored, because a declaration for x is already established when external-declarations.dtd is processed.
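
The first-declaration-wins rule can be modelled in a few lines. The sketch below is a simplified model with hypothetical names; it treats each subset as an ordered list of (entity name, replacement text) pairs and processes the internal subset before the external one.

```python
def declare_entities(declarations):
    """Process (name, replacement text) pairs in order; the first declaration wins."""
    entities = {}
    for name, text in declarations:
        # A later declaration of an already declared entity is ignored
        entities.setdefault(name, text)
    return entities

internal_subset = [("x", "That")]
external_subset = [("x", "This")]

# The internal subset is processed before the external subset
entities = declare_entities(internal_subset + external_subset)
print(entities["x"])  # That
```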

Note that the term "parameter entity" is due to this feature; it emphasizes that the internal subset "parametrizes" or "configures" external subset defaults such that settings more specific to the document instance apply.

SGML allows the external subset to be specified by any kind of identifier, ie. allows it to be specified as a public identifier (or as both a system and public identifier), but sgmljs.net SGML can't resolve public identifiers for external subsets except for the HTML DTD described in SGML Web Reference, and always requires a system identifier otherwise.

Though, technically, the result of using an external subset specification is the same as that of using an explicit parameter entity declaration and reference as in the initial example, applications may interpret the syntactic representation as an external subset identifier specially; for example, in (the "lax" variant of) templating, only the formal external subset identifier, rather than merely an identifier for a named parameter entity, establishes eligibility of document fragments for inclusion into master documents.

System-specific external subsets

An external subset, for either a DOCTYPE or a LINKTYPE declaration set (explained in templating), can be specified as a system-specific entity declaration using the following syntax:

<!DOCTYPE x SYSTEM>

or

<!LINKTYPE y ... SYSTEM>

sgmljs.net SGML resolves system-specific external declaration sets by accessing a file having a name derived from an implied declaration set name or document element, looked up in the same directory as the instance file referencing the system-specific external subset:

  • the declaration text for an external document type declaration subset referenced via <!DOCTYPE #IMPLIED SYSTEM> is looked up as e.dtd, where e is the document element name of the instance referencing the external subset

  • the declaration text for an external document type declaration subset referenced via <!DOCTYPE x SYSTEM>, where x is a regular element and document type name, is accessed from the file x.dtd

  • the declaration text for an external link process declaration subset referenced via <!LINKTYPE y ... SYSTEM> is accessed from the file y.lpd

Processing is aborted with error if a file for declaration text as described can't be accessed.

Note that using SYNTAX NAMECASE GENERAL has the consequence that the base file name (x, or y in the examples) will be converted to all-uppercase when accessing a derived file name (but .dtd or .lpd file name suffixes are always used with lowercase letters).
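
The file name derivation described in this section can be summarized as a small function. subset_file_name is a hypothetical helper that only mirrors the rules listed above (directory lookup and error handling are left out).

```python
def subset_file_name(name, kind="DOCTYPE", namecase_general=False):
    """Derive the file name for a system-specific external subset.

    name is the document type (or document element) name for DOCTYPE,
    or the link type name for LINKTYPE.
    """
    suffix = {"DOCTYPE": ".dtd", "LINKTYPE": ".lpd"}[kind]
    # With NAMECASE GENERAL in effect, the base name is uppercased
    base = name.upper() if namecase_general else name
    return base + suffix  # the suffix always stays lowercase

print(subset_file_name("x"))                         # x.dtd
print(subset_file_name("y", kind="LINKTYPE"))        # y.lpd
print(subset_file_name("x", namecase_general=True))  # X.dtd
```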

SGML declaration pro

Using an optional SGML declaration, it's possible to specify general properties of a document instance such as its character set, the characters used for markup delimiters, whether, and which, markup minimization features such as tag omission are used, among other things.

An SGML declaration body is a piece of plain text (as described in detail further below) contained either directly in a document instance (in an SGML declaration at the beginning of the document instance) or stored in an external entity and referenced via an identifier in an SGML declaration reference at the beginning of the document instance.

A conformant SGML processor isn't required to be able to process an SGML declaration; if it isn't, the information contained in an SGML declaration is provided for manual inspection and comparison against the SGML declaration(s) and features supported by the processing system.

sgmljs.net SGML, as much as possible, is designed to avoid the necessity of having to bother with SGML declarations, by

  • inferring applicable SGML declarations from file name suffixes or other out-of-band information such as HTTP/IANA media types

  • supporting the use of SGML declaration references in place of full SGML declarations (as described below)

  • allowing XML declarations to act as SGML declaration.

Note certain sgmljs.net SGML tools/builds lack support for parsing SGML declarations altogether.

The section Basic SGML declaration shows the beginning of a document instance with the traditional basic SGML declaration, asserting to the processor that the document instance is using the reference concrete syntax and other basic settings.

sgmljs.net SGML, while accepting the Basic SGML declaration below, doesn't support all features requested in this declaration (namely certain markup minimization forms requested with SHORTTAG YES mostly interesting from a historical perspective) and can't claim full conformance insofar as support for these legacy features is mandated for conformance.

For sgmljs.net SGML, the preferred SGML declaration syntax to use is the one introduced with the WebSGML (ISO 8879:1986 Annex K) revision of SGML as explained further below, which can express presence or absence of legacy features on a more granular level, and hence can more readily represent sgmljs.net SGML's feature set.

The Annex K revision of SGML has extended both the SGML declaration syntax as well as that of markup declarations; use of any WebSGML extension is indicated by using "ISO 8879:1986 (WWW)" as minimum data/literal for the initial part of an SGML declaration body. The WebSGML additions include essential changes for parsing XML and HTML. In sgmljs.net SGML, WebSGML additions are available even in the absence of an SGML declaration.

Note that SGML declaration settings are only discussed insofar as they are supported by sgmljs.net SGML:

  • sgmljs.net SGML is designed to accept markup in the reference concrete syntax (with supported WebSGML additions to cover XML and HTML as explained below), which covers basically all angle-bracket markup languages, including the fundamental syntax used by XML and HTML

  • while the SGML declaration in principle allows redefinition of function characters and delimiters (such as the < character), and of reserved names, this isn't supported by sgmljs.net SGML

  • only UTF-8-encoded document instances are consistently supported across all regular sgmljs.net SGML tools.

Basic SGML declaration

Note that an SGML declaration is rarely used in the basic form given below even in English-speaking countries because of its restriction of the set of usable characters in the document instance to just the IRV/ASCII characters.

Whitespace (space, tab, and newline characters) and SGML comments (text between -- character sequences) aren't significant in SGML declarations and are only provided for formatting (in the sense that any whitespace sequence can be replaced by a single space character).

<!SGML "ISO 8879:1986"
	CHARSET
		BASESET "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
		DESCSET
			0 9 UNUSED
			9 2 9
			11 2 UNUSED
			13 1 13
			14 18 UNUSED
			32 95 32
			127 1 UNUSED
	CAPACITY PUBLIC "ISO 8879:1986//CAPACITY Reference//EN"
	SCOPE DOCUMENT
	SYNTAX PUBLIC "ISO 8879:1986//SYNTAX Reference//EN"
	FEATURES
		MINIMIZE
			DATATAG NO
			OMITTAG YES
			RANK NO
			SHORTTAG YES
		LINK
			SIMPLE NO
			IMPLICIT NO
			EXPLICIT NO
		OTHER
			CONCUR NO
			SUBDOC NO
			FORMAL NO
	APPINFO NONE>
<!-- document prolog and content following here ... -->
"ISO 8879:1986" (minimum data)

indicates the SGML declaration syntax revision being used

sgmljs.net SGML supports all released revisions (ie. also "ISO 8879:1986 (ENR)" or "ISO 8879:1986 (WWW)" in addition to "ISO 8879:1986")

the "ISO 8879:1986 (ENR)" minimum literal asserts that the extensions to the SGML declaration syntax introduced with ISO 8879 Annex J can be used; see Extended naming rules

"ISO 8879:1986 (WWW)" asserts that, in addition, those from ISO 8879 Annex K can be used; see WebSGML

for sgmljs.net SGML, use of "ISO 8879:1986 (WWW)" is always recommended

CHARSET BASESET ... (document base character set)

this asserts that IRV/ASCII character set is used in the document instance as base character set

the literal is a formal public identifier containing the text designation sequence ESC 2/5 4/0 representing the escape sequence to (virtually) switch to the IRV coding system

SGML uses the escape sequences registered with the International register of character sets to be used with escape sequences (which complies with ISO/IEC 2022:1986 and ISO/IEC 2022:1994, respectively) to identify character sets

sgmljs.net SGML recognizes the public identifiers for character sets as listed in Base Character Set

for an in-depth description, see also Character Sets and Encodings

DESCSET ... (described set of the document base character set)

represents the "described set" of characters (of the base character set) used in a document instance

in the basic SGML declaration above, contains a list of character ranges with the following meaning: 0 9 UNUSED means the character numbers 0 through 8 (9 characters) are unused, ie. asserted not to occur in a document instance; 9 2 9 means the character numbers 9 and 10 (2 characters) should be treated as character number 9 (the tab character), and 10, respectively (the last text token in a described set portion, if it is a number, is interpreted to mean that the described character range is mapped to the range starting at the specified number); and similar for the other described set portions

the majority of characters of the set is described in portion 32 95 32, meaning the character range 32 through 126 (95 characters) is mapped "to itself" (ie. to the range starting at 32)
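
The reading of described set portions given above can be sketched in a few lines of Python. This is only an illustration and not part of sgmljs.net SGML; the names Portion, BASIC_DESCSET and expand_descset are made up for this example, and only the numeric and UNUSED forms of a portion are handled.

```python
from typing import Dict, List, Optional, Tuple, Union

# One described-set portion: (first character number, count, target), where
# target is either "UNUSED" or the first character number in the base set.
Portion = Tuple[int, int, Union[int, str]]

# The DESCSET portions of the basic SGML declaration shown above.
BASIC_DESCSET: List[Portion] = [
    (0, 9, "UNUSED"),
    (9, 2, 9),
    (11, 2, "UNUSED"),
    (13, 1, 13),
    (14, 18, "UNUSED"),
    (32, 95, 32),
    (127, 1, "UNUSED"),
]


def expand_descset(portions: List[Portion]) -> Dict[int, Optional[int]]:
    """Map each described character number to a base-set number (None = UNUSED)."""
    mapping: Dict[int, Optional[int]] = {}
    for first, count, target in portions:
        for offset in range(count):
            if target == "UNUSED":
                mapping[first + offset] = None
            else:
                mapping[first + offset] = int(target) + offset
    return mapping


charset = expand_descset(BASIC_DESCSET)
# Character numbers that may occur in a document instance.
used = sorted(n for n, base in charset.items() if base is not None)
```

With the portions above, used contains the character numbers 9, 10, 13, and 32 through 126; all other numbers below 128 are marked as unused.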

CAPACITY PUBLIC "ISO 8879:1986//CAPACITY Reference//EN" (capacity set)

contains a public identifier for the reference capacity set

a capacity set contains upper bounds for global run-time capacities the processing system is expected to arrange for, such as the maximal number of entities declared in a document instance

these parameters can also be declared directly in the SGML declaration body; see below for an example

these parameters are ignored by sgmljs.net SGML but are honored and checked against actual use in document prologs by eg. (Open)SP SGML

SCOPE DOCUMENT (concrete syntax scope)

asserts that the document character set is used both in the prolog as well as in content

for all intents and purposes, SCOPE DOCUMENT is always the used setting for document instances processed with sgmljs.net SGML (SCOPE SYNTAX is only of historic interest)

for an explanation of the concept of a syntax character set (as opposed to the document character set), see below

SYNTAX PUBLIC "ISO 8879:1986//SYNTAX Reference//EN" (concrete syntax)

is a reference to the public identifier of the concrete syntax, the content of which is explained below

FEATURES MINIMIZE ... (minimization features)

contains the minimization features asserted to be used by the document instance

note that while SHORTTAG YES is accepted by sgmljs.net SGML, the only form of short tag minimization supported by sgmljs.net SGML is SGML's so-called Null-end tag and NET-enabling Start-tag minimization, and only insofar as it is necessary to support XML-style empty elements

FEATURES LINK ... (link type features)

contains the link type (LPD) features asserted to be used by the document instance

sgmljs.net SGML supports all link types, but at most 2 simultaneously active implicit link types by default

for an explanation of link type processing, see Templating

FEATURES OTHER ... (other features)

of the "other" features, sgmljs.net SGML only supports FORMAL NO and FORMAL YES (and WebSGML's URN YES as explained with other WebSGML additions)

Concrete Syntax

The concrete syntax fragment (the SYNTAX ... portion as shown above) references a public identifier which acts as if it contained the following code text (which could also be pasted verbatim in place of the concrete syntax fragment above for the same effect):

SYNTAX
	SHUNCHAR
		CONTROLS
		0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
		18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
	BASESET "ISO 646IRV:1991//CHARSET
		 International Reference Version (IRV)//ESC 2/8 4/2"
	DESCSET 0 128 0
FUNCTION
	RE            13
	RS            10
	SPACE         32
	TAB SEPCHAR    9
NAMING  LCNMSTRT ""
	UCNMSTRT ""
	LCNMCHAR ".-_:"
	UCNMCHAR ".-_:"
	NAMECASE GENERAL YES
	         ENTITY  NO
DELIM	GENERAL  SGMLREF
	SHORTREF SGMLREF
NAMES	SGMLREF
QUANTITY SGMLREF
SYNTAX SHUNCHAR ...

contains a list of shunned characters; for the purpose of this exposition, these are the same as those marked UNUSED in the described syntax character set

the set of shunned characters includes the IRV/ASCII control characters (CONTROLS)

SYNTAX BASESET .../DESCSET ...

contains the syntax-reference character set (the character set used to describe the concrete syntax); the general construction of the described set from character ranges in the base set is analogous to that of the document character set

SYNTAX FUNCTION ...

contains assignments of SGML delimiter function roles to characters

SYNTAX NAMING ...

defines the characters accepted for name tokens and the rules for case-folding

NAMECASE GENERAL YES ENTITY NO activates SGML's traditional case-folding behaviour (namely that elements, attributes, and all other name tokens except entity names, for the purpose of validation and tag inference, are treated as if specified in uppercase letters, even if specified in lower- or mixed case in content)

SYNTAX DELIM GENERAL ...

contains assignments of characters to delimiter roles such as needed for the < character to be interpreted as a STAGO (start-element tag open) delimiter

SGMLREF selects the standard delimiters (which assign the STAGO delimiter as described and expected in most markup languages)

SYNTAX DELIM SHORTREF

contains assignments of characters to shortref delimiter roles; note that sgmljs.net SGML doesn't support user-definable shortref delimiters

SGMLREF selects the standard shortref delimiters

SYNTAX NAMES SGMLREF

asserts that the standard reserved keywords (such as DOCTYPE, ELEMENT) are used in markup declarations

only SGMLREF as specified here is supported in sgmljs.net SGML

SYNTAX QUANTITY ...

contains declarations of upper bounds for certain quantities asserted by a document instance, such as the maximal number of attributes declared on an element

these parameters are ignored by sgmljs.net SGML, but are honored and checked against actual use in document prologs by eg. (Open)SP SGML

WebSGML extensions

WebSGML delimiters

WebSGML extends the syntax for the delimiter section in the SGML declaration (ie. adds the HCRO and NESTC delimiters):

DELIM	GENERAL  SGMLREF
	HCRO     "&#38;#x" -- ampersand --
	NESTC    "/"
	NET      ">"
	...
DELIM GENERAL HCRO

the HCRO delimiter is used in numeric character references to indicate that the number portion is interpreted as a hexadecimal rather than a decimal literal (and to allow the letters A through F and a through f to occur in numeric character references); for example, &#xa; represents the U+000A LINE FEED character

in sgmljs.net SGML, the HCRO delimiter cannot be redeclared
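
As an illustration of the difference between decimal and hexadecimal numeric character references, the following Python sketch resolves both forms in a string. It is a simplified stand-in for what a parser does, not the sgmljs.net implementation; the function name resolve_char_refs is made up, the closing ; is treated as optional, and only the lowercase x form of HCRO is recognized (an assumption of this sketch).

```python
import re

# "&#" followed by either "x" plus hexadecimal digits (HCRO form) or decimal
# digits (CRO form), with an optional closing ";" (the REFC delimiter).
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));?")


def resolve_char_refs(text: str) -> str:
    """Replace numeric character references by the characters they denote."""
    def replace(match: "re.Match[str]") -> str:
        hex_digits, dec_digits = match.group(1), match.group(2)
        if hex_digits is not None:
            number = int(hex_digits, 16)   # hexadecimal literal
        else:
            number = int(dec_digits, 10)   # decimal literal
        return chr(number)

    return _CHAR_REF.sub(replace, text)
```

For example, resolve_char_refs("a&#xa;b") and resolve_char_refs("a&#10;b") both yield the letter a, a LINE FEED character, and the letter b.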

DELIM GENERAL NESTC

the NESTC delimiter is introduced to capture XML's empty element syntax within SGML's definitorial framework with respect to delimiter characters

while use of the NESTC delimiter nominally depends on the definition of the NET delimiter (the null-end tag delimiter), and changing either the declaration for NESTC or NET, or both, is admitted in SGML in general, sgmljs.net SGML requires that the delimiter roles for NESTC and NET, if assigned at all, match those given above

For all intents and purposes within sgmljs.net SGML, for processing XML-style empty elements (including bogus XML-like empty elements in HTML), the delimiter section should be treated as an opaque string and must have exactly (up to space characters and comments) the form given above.

Predefined character entities

WebSGML adds a facility to define character entities without using entity declarations as a means to capture XML's and HTML's behaviour in this respect.

For example, the predefined character entities and their represented character numbers for XML are as follows:

ENTITIES
	"amp"  38
	"lt"   60
	"gt"   62
	"quot" 34
	"apos" 39

See the SGML declaration for XML for a complete example of an SGML declaration making use of predefined character entities.

Note that ISO 8879 Annex K requires that all mapped-to characters are contained in the syntax-reference character set, not just the document character set.
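
The effect of such an ENTITIES table can be pictured as a plain lookup from entity name to character number. The following Python sketch assumes the XML table shown above; the names XML_PREDEFINED and resolve_predefined are hypothetical and the reference syntax is simplified to &name; with a mandatory closing ;.

```python
import re

# Predefined character entities for XML, as in the ENTITIES fragment above.
XML_PREDEFINED = {"amp": 38, "lt": 60, "gt": 62, "quot": 34, "apos": 39}

# A general entity reference with a mandatory ";" (simplified).
_ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9._:-]*);")


def resolve_predefined(text: str, table=XML_PREDEFINED) -> str:
    """Replace references to predefined character entities in a single pass."""
    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name in table:
            return chr(table[name])
        return match.group(0)  # leave references to other entities untouched

    return _ENTITY_REF.sub(replace, text)
```

Because replacement happens in a single pass, the text &amp;lt; resolves to the five characters &lt; rather than to <.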

WebSGML Features

WebSGML's extensions to the FEATURES section (only mentioned as far as supported in sgmljs.net SGML) include

  • unbundling of SHORTTAG minimization features, meaning that certain shorttag minimization features can be switched on individually, rather than just collectively via FEATURES MINIMIZE SHORTTAG YES (which switches on all shorttag minimization features, among them those only used in historic shortform practices)

  • MINIMIZE IMPLYDEF ... options to allow WebSGML to process document instances lacking declarations for elements, attributes, and other components declarable in DTDs (which was generally not possible prior to the Annex K SGML revision)

A WebSGML FEATURES declaration portion can look as follows:

FEATURES

	MINIMIZE
		DATATAG NO
		OMITTAG NO
		RANK    NO
		SHORTTAG
			STARTTAG
				EMPTY    NO
				UNCLOSED NO
				NETENABL IMMEDNET
			ENDTAG
				EMPTY    NO
				UNCLOSED NO
		ATTRIB
			DEFAULT  YES
			OMITNAME NO
			VALUE    NO
		EMPTYNRM YES

	IMPLYDEF
		ATTLIST  YES
		DOCTYPE  NO
		ELEMENT  YES
		ENTITY   NO
		NOTATION YES

	LINK
		SIMPLE   NO
		IMPLICIT NO
		EXPLICIT NO

	OTHER
		CONCUR   NO
		SUBDOC   NO
		FORMAL   NO
		URN      NO
		KEEPRSRE YES
		VALIDITY NOASSERT
		ENTITIES
			REF      ANY
			INTEGRAL YES
FEATURES MINIMIZE SHORTTAG STARTTAG EMPTY NO

asserts that a document instance doesn't make use of empty start-element tags

a value of YES enables use of empty start-element tags as explained in Empty element minimization

FEATURES MINIMIZE SHORTTAG STARTTAG ... UNCLOSED NO

asserts a document instance's use of certain historic shortform syntax

for sgmljs.net SGML, these must be NO

FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET

expresses, together with DELIM GENERAL NESTC and DELIM GENERAL NET as explained above, that an XML-style empty element is recognized as a short form of specifying the equivalent sequence of a start- and an end-element tag

note that, in addition, FEATURES MINIMIZE EMPTYNRM YES must be declared for being able to use XML-style empty-elements for elements with declared content EMPTY

sgmljs.net SGML accepts this setting only in combination with the settings for DELIM GENERAL NESTC and DELIM GENERAL NET as discussed above

FEATURES MINIMIZE SHORTTAG ENDTAG EMPTY NO

asserts that a document instance doesn't make use of empty end-element tags

a value of YES enables use of empty end-element tags as explained in Empty element minimization

FEATURES MINIMIZE SHORTTAG ENDTAG ... UNCLOSED NO

these features (also for supporting historic markup shortform practices) must both have the value NO for sgmljs.net SGML

FEATURES MINIMIZE ATTRIB DEFAULT

expresses whether default values can be omitted in attribute specifications (ie. with the expectation that default values as declared in the attribute declarations are implied)

FEATURES MINIMIZE ATTRIB OMITNAME

expresses whether attribute names (and the VI delimiter) can be omitted in attribute specifications (ie. as in using name tokens for enumerated attributes, provided a name token can be uniquely identified among those declared on the attributes of an element, including those declared on #ALL elements)

FEATURES MINIMIZE ATTRIB VALUE

expresses whether quotation characters (LIT and LITA delimiters) can be omitted around attribute values consisting entirely of name characters, even on undeclared attributes

FEATURES MINIMIZE EMPTYNRM YES/NO

expresses whether elements with declared content EMPTY or implied-EMPTY elements (those having a content reference attribute specified) are allowed to have end-element tags (if YES)

FEATURES IMPLYDEF ATTLIST YES/NO

expresses that it isn't an error to specify undeclared attributes (if YES)

an undeclared attribute is treated as if it were declared CDATA #IMPLIED

FEATURES IMPLYDEF DOCTYPE YES/NO

expresses that it isn't an error if a document type declaration is absent from a document instance (if YES)

if FEATURES IMPLYDEF DOCTYPE YES is declared, and a document type declaration is absent, the document instance is treated as if <!DOCTYPE #IMPLIED SYSTEM> were present; the external subset is retrieved by forming a system identifier from the document element (the first element encountered), subject to SYNTAX NAMECASE GENERAL, then appending .dtd to it, and interpreting the resulting string as system identifier relative to the system identifier of the document instance being processed

note that if FEATURES IMPLYDEF ELEMENT YES is declared, then a document type declaration is also allowed to be absent; but if, in addition, IMPLYDEF DOCTYPE NO is declared, an absent document type declaration is treated as if <!DOCTYPE #IMPLIED> had been specified (that is, its external subset is assumed to be empty)
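
The construction of the implied external subset's system identifier can be sketched as follows. This is an illustration under stated assumptions, not the actual resolution code: the function name implied_dtd_system_id is made up, "subject to SYNTAX NAMECASE GENERAL" is assumed to mean that the document element name is folded to uppercase when NAMECASE GENERAL YES is in effect, and relative resolution is approximated with URL joining from the Python standard library.

```python
from urllib.parse import urljoin


def implied_dtd_system_id(document_system_id: str,
                          document_element: str,
                          namecase_general: bool = True) -> str:
    """Form the system identifier used for an implied external subset."""
    # Assumption: with NAMECASE GENERAL YES the element name is case-folded
    # to uppercase before it is used to form the identifier.
    name = document_element.upper() if namecase_general else document_element
    # Append ".dtd" and resolve relative to the document's own identifier.
    return urljoin(document_system_id, name + ".dtd")
```

Under these assumptions, a document at http://example.org/docs/page.sgm with document element article would have its external subset looked up at http://example.org/docs/ARTICLE.dtd, or at http://example.org/docs/article.dtd without case folding.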

FEATURES IMPLYDEF ELEMENT YES/NO

expresses that it isn't an error to use an undeclared element (if YES or ANYOTHER)

moreover, expresses that it isn't an error if a document type isn't present; see above

FEATURES IMPLYDEF ELEMENT YES has the effect that undeclared elements are implied as if declared - O ANY

FEATURES IMPLYDEF ELEMENT ANYOTHER expresses that, in addition, directly nesting undeclared elements isn't intended for the document instance, and has the effect that an end-element tag (closing the open element) before a start-element tag is inferred, if the element beginning with the start-element would otherwise be treated as direct child content of an element with the same element name
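
The end-element tag inference described for ANYOTHER can be pictured with a small Python sketch operating on a list of tag events. It is deliberately simplified: every element is treated as undeclared, content is ignored, and the event representation and the function name apply_anyother are made up for this example.

```python
from typing import List, Tuple

Event = Tuple[str, str]  # ("start" or "end", element name)


def apply_anyother(events: List[Event]) -> List[Event]:
    """Infer an end tag when an element would directly contain a same-named element."""
    open_elements: List[str] = []
    result: List[Event] = []
    for kind, name in events:
        if kind == "start":
            if open_elements and open_elements[-1] == name:
                # The new element would become a direct child of an element
                # with the same name: close the open element first.
                result.append(("end", name))
                open_elements.pop()
            open_elements.append(name)
        elif kind == "end":
            if open_elements and open_elements[-1] == name:
                open_elements.pop()
        result.append((kind, name))
    return result
```

Given the events for <li><li></li>, the sketch produces the events for <li></li><li></li>, whereas with IMPLYDEF ELEMENT YES the second li would be nested inside the first.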

FEATURES IMPLYDEF ENTITY YES/NO

expresses that an entity reference for an undeclared entity is treated as if it were declared system-specific (ie. declared <!ENTITY ... SYSTEM>)

the data character content of the entity reference, if it is used in a parameter or general entity reference (other than a data text entity reference), is retrieved by interpreting the entity name (subject to SYNTAX NAMECASE ENTITY) as system identifier

FEATURES IMPLYDEF NOTATION YES/NO

expresses that it isn't an error if an undeclared notation is used

ignored by sgmljs.net SGML (notations must always be declared)

FEATURES OTHER URN YES/NO

if FEATURES OTHER FORMAL YES is declared (and if ISO 8879:1986 (WWW) is used as minimum data), only then can FEATURES OTHER URN YES also be used

FEATURES OTHER URN YES enables the URL/URN syntax for public identifiers (as an alternative to the standard formal public identifier syntax)

FEATURES OTHER KEEPRSRE YES/NO

this has the effect that SGML's traditional behaviour with respect to suppression of newlines and space characters is switched off

SGML's traditional behaviour (simplified) is, in text records consisting entirely of a start-element tag, character data, and an end-element tag for the same element as in the start-element tag, not to report initial space and trailing space characters, including the trailing RE (newline) character, as data characters, when the start- and end-element tag is for a declared (rather than included) element (ie. because such space and newline characters are considered insignificant, and present for markup text formatting purposes only, in such a way that every markup element is started on its own line)

sgmljs.net SGML only supports YES as the value for this setting, meaning that all space and newline characters will always be reported as character data (note that behaviour for supporting KEEPRSRE NO isn't specified for undeclared elements in WebSGML)

FEATURES OTHER VALIDITY TYPE/NOASSERT

asserts that the document is considered valid with respect to the notions expressed next, and is used to indicate that validation with respect to the desired validation level should be performed by sgmljs.net SGML

VALIDITY NOASSERT means that no content model validation, but only balancedness-checking (appealing insofar to XML's wellformedness criteria) is performed; VALIDITY TYPE means that regular validation and tag inference is performed

FEATURES OTHER ENTITIES ...

asserts certain characteristics of a document's use of entities appealing to notions introduced with XML

ignored by sgmljs.net SGML

Character Sets and Encodings

A character set, in SGML, is a mapping of character numbers to characters. A character can be referred to by a character name such as those used by ISO 10646 (aka. Unicode); for example the character rendered in this text just here: &, can be referred to as the character named AMPERSAND.

Having a name, a character, in SGML and also ISO 10646, is considered a concept existing independently of its conventional graphical rendering and of its character number in a particular character set. But in order to refer to a character, a character number and thus, a definition of a character set must be established in a context where it's impractical to refer to characters by their names.

Hence, while SGML conceptually defines an abstract syntax as a mapping of characters (rather than character numbers) to markup delimiters and other function roles, an abstract syntax can only be expressed as a mapping from character numbers in a character set to markup delimiters and other character roles.

SGML uses the term syntax-reference character set to refer to the character set used in an SGML declaration body for assigning meaning to characters as SGML markup delimiters and other character function roles. Furthermore SGML uses the term concrete syntax to refer to the larger portion of an SGML declaration which contains the mapping, including the declaration of the syntax-reference character set it is using.

Given a concrete syntax, an SGML parser is supposed to assess the characters represented by input character data (using the document character set), then assess whether the concrete syntax defines a delimiter or other role to it (depending on context). For the latter, the SGML parser must map a character presented to it in the document character set to the equivalent character in the syntax-reference character set. The SGML declaration itself doesn't contain a mapping between character sets, hence the SGML parser must rely on built-in character set information available to it.

Thus, even if the syntax-reference base character set can be theoretically different from the document base character set (unless the concrete syntax is embedded in the document instance itself, see below), an SGML parser must still be able to establish a mapping for all characters in the document base character set to a character in the syntax-reference base character set.

SGML was originally devised at a time when a generally accepted character set wasn't yet established for referring to characters. Today, of course, the Universal Coded Character Set (UCS, defined in ISO/IEC 10646, and also known as Unicode) is used for this purpose. Since ISO/IEC 10646 contains over 120,000 code points (character numbers), if it is used as a document instance's base character set (which it should be), there just doesn't exist a character set other than UCS itself with the same coverage. For this reason, the distinction between document and syntax-reference character set is irrelevant in practice, but nevertheless requisite knowledge to explain the character set notions in the SGML declaration.

Nevertheless, some of the concepts related to constructing a customized described set by remapping UCS character planes or communicating the purpose of private-use characters can be useful for special applications (ie. precisely because of its coverage, merely specifying UCS as document character set isn't helpful in communicating which character ranges are actually used in a document or required for a particular application, font face or variant, printer or other equipment, vertical, etc.). Note, however, that sgmljs.net SGML doesn't include integrated facilities for checking and/or remapping in regular builds.

For all intents and purposes within sgmljs.net SGML, the Universal Coded Character set is used as a base character set for both the document as well as syntax-reference character set.

In the basic SGML declaration above, the International Reference Version character set is used (which is the only character set supported by regular sgmljs.net SGML builds in addition to UCS). International Reference Version or IRV is the term used in international specifications to refer to the ISO/IEC 646 character set known as US-ASCII (technically, the version referenced by ISO 646:1983 differs from US-ASCII, and from that referenced by ISO 646IRV:1991, but not in a way relevant to SGML). IRV contains the first 128 code points of UCS, which uses the usual encoding of the US-ASCII character set into bytes interpreted as binary numbers for its character numbers.

Document encoding

A character number is different from a representation of a character as a bit pattern within a particular character encoding such as UTF-8 (even though a character number can be algorithmically determined from a UTF-8 representation); rather, character numbers and character sets are purely organizational concepts to identify and otherwise refer to characters in general.
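
The distinction can be made concrete with a few lines of Python, used here only for illustration: the character number of a character stays the same, while its byte representation depends on the chosen encoding.

```python
char = "\u20ac"                          # the EURO SIGN character

number = ord(char)                       # character number (UCS code point)
utf8_bytes = char.encode("utf-8")        # byte representation in UTF-8
utf16_bytes = char.encode("utf-16-be")   # byte representation in UTF-16 (big endian)

# Decoding either byte sequence yields the same character, and hence the
# same character number, again.
same = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-be") == char
```

Here number is 8364 (hexadecimal 20AC), while the UTF-8 representation consists of the three bytes E2 82 AC and the UTF-16 representation of the two bytes 20 AC.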

For being able to read an SGML declaration, of course, an SGML parser must be able to interpret the bytes of an entity according to an encoding of a character set. The character set encoding of a document instance can't be meaningfully stated in its SGML declaration (if it has one), if the SGML declaration is part of the document instance itself (ie. because the SGML declaration must use the same encoding as the document instance, hence the processor still needs additional out of band information with respect to the encoding, else wouldn't be able to read the SGML declaration).

Having to deal with SGML declarations, which are a somewhat archaic, but in any case inconvenient format for conveying processing parameters to an SGML processor, only to find out that such a basic fact about a document instance as its character encoding can't effectively be expressed using it is considered unfortunate. Moreover, having to resort to out-of-band information such as command line processing options or similar in order to be able to parse a document is considered inadequate for SGML, especially with respect to SGML's attractiveness for archival purposes where it is deemed desirable to manifest a document character encoding.

Within established SGML technology, there are the following plausible mechanisms to inform the SGML parser about the character encoding used by a document instance and of bootstrapping an SGML parser into applying a desired character decoding:

  • SGML itself normatively references ISO 2022 code switching techniques as code extension facility; using this mechanism allows an SGML processor to start out in a mode where it accepts only IRV/ASCII characters, and then (virtually) "switch" into the desired mode of accepting eg. a UTF-8 encoding of UCS, based on the designating sequence of the public identifier of the document's base character set (subject to the ISO 8879 provisions with respect to delimiter recognition, this can also be extended to other multi-byte encodings)

  • using a wrapper document instance and referring to a main document instance via an entity reference; the referenced entity is declared as an external entity using a formal system identifier which admits additional metadata such as character encoding (cf. the bctf parameter of the ISO 10744 extended facilities, eg. FSIDR); this technique can be used with eg. (Open)SP, but isn't supported with sgmljs.net SGML right now even though sgmljs.net SGML supports ISO 10744 FSIDR in general

  • using an SGML catalog, which can associate an SGML declaration with a document instance without having to place an SGML declaration or declaration reference in the document instance

sgmljs.net SGML only supports the first mechanism of the discussed techniques, and only for UTF-8; the alternatives are discussed for information only.

Note that since sgmljs.net SGML uses ISO 2022 to learn about the character encoding of a document, the listing of supported character sets given below includes designating sequences which represent UTF-8 character encodings.

Note when a character encoding is changed, this has no bearing on the character set, ie. the character numbers used in numeric character references; this is apparent eg. with HTML, which even when served over HTTP with ISO-8859 encoding (which used to be the standard encoding before HTML5) can contain numeric character references that still will be interpreted as UCS code points.
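
The HTML case can be reproduced with the Python standard library, used here purely as an illustration: a byte stream encoded in ISO-8859-1 cannot represent the EURO SIGN directly, but a numeric character reference to its UCS code point still resolves to that character after decoding.

```python
import html

# A document fragment as it might be served in ISO-8859-1: the e-acute is a
# single byte, the EURO SIGN is written as a numeric character reference.
raw = "caf\u00e9 costs &#8364;5".encode("iso-8859-1")

text = raw.decode("iso-8859-1")    # undo the character encoding
resolved = html.unescape(text)     # resolve references as UCS code points
```

The reference &#8364; denotes code point 8364 regardless of the encoding the bytes were transmitted in.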

For an overview of ISO 2022, please refer to ECMA-35 Character Code Structure and Extension Techniques, which is identical to ISO/IEC 2022:1994 and made available by ECMA International for public access.

Base Character Set

The following public identifiers are recognized as character sets by sgmljs.net SGML:

  • ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0

  • ISO 646IRV:1991//CHARSET International Reference Version (IRV)//ESC 2/8 4/2

  • ISO Registration Number 177//CHARSET ISO/IEC 10646:2003 UTF-8 Level 1//ESC 2/5 4/7

  • ISO Registration Number 177//CHARSET ISO/IEC 10646:2003 UTF-8 Level 2//ESC 2/5 4/8

  • ISO Registration Number 177//CHARSET ISO/IEC 10646:2003 UTF-8 Level 3//ESC 2/5 4/9

  • ISO Registration Number 177//CHARSET ISO/IEC 10646:2003 UCS with implementation Level 3//ESC 2/5 2/15 4/6

  • ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6

Note that even though UCS-4, as used in the last public identifier/designation sequence in the list, denotes an alternate UCS encoding, this particular public identifier is interpreted to denote just the UCS character set, and acts exactly the same as the UTF-8 designation sequences.
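
The designating sequence at the end of each public identifier uses the column/row notation of ISO 2022 (ECMA-35), where a token such as 2/5 denotes the byte with high nibble 2 and low nibble 5. The following Python sketch turns the text form into the corresponding byte sequence; the function name designating_sequence is made up for this example, and the sketch says nothing about how sgmljs.net SGML processes these identifiers internally.

```python
def designating_sequence(public_id: str) -> bytes:
    """Convert the trailing 'ESC c/r c/r ...' field of a public identifier to bytes."""
    # The designating sequence is the last "//"-separated field.
    tokens = public_id.rsplit("//", 1)[-1].split()
    if not tokens or tokens[0] != "ESC":
        raise ValueError("no designating sequence found in: " + public_id)
    sequence = bytearray([0x1B])  # the ESC control character
    for token in tokens[1:]:
        column, row = token.split("/")
        # Column/row notation: column is the high nibble, row the low nibble.
        sequence.append(int(column) * 16 + int(row))
    return bytes(sequence)


irv = designating_sequence(
    "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0")
utf8 = designating_sequence(
    "ISO Registration Number 177//CHARSET ISO/IEC 10646:2003 UTF-8 Level 1//ESC 2/5 4/7")
```

For instance, ESC 2/5 4/7 corresponds to the bytes 1B 25 47, that is, ESC followed by the characters % and G.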

Syntax-Reference Character Set

The document character set is the term used by SGML to refer to the character set used by a document instance.

The syntax-reference character set is the character set used for an SGML concrete syntax declaration. As shown in the basic declaration for SGML, the concrete syntax fragment can conceptually (but not actually) be stored in another entity and then referenced from the SGML declaration.

As also discussed in the introduction, a concrete syntax hence needs its own character set definition, independent of the document character set used by a document instance referencing the concrete syntax.

If a concrete syntax definition isn't referenced via a public identifier, but is presented embedded in the SGML declaration code text of the document itself, then it of course must be using the same character set as the document character set of the document which it is part of.

For all intents and purposes, a character number as used in sgmljs.net SGML is a single UCS (ISO 10646 or, equivalently, Unicode) code point, independently of the document encoding (such as UTF-8) being used. Apart from character numbers in the SGML declaration, UCS code points are also used in character entity references in a document instance.

The SGML declaration code text itself is always using (just) the IRV/ASCII character set, and when referring to a character number, is using either a character literal (when the character number/code point is contained in IRV/ASCII and is a graphic character) or, alternatively, a character number (when it is not, or when the author chooses to use a number rather than a literal for specifying it).

Naming

In SGML, the definition of permitted characters for names and name tokens

  • of generic identifiers of elements, attributes, and notations, and,

  • values of attributes with declared value ENTITY, ID, IDREF, NAME, NMTOKEN, or NOTATION, and the attributes with declared value ENTITIES, IDREFS, NAMES, or NMTOKENS for specifying multiple space-separated name tokens, and

  • entity names

is controlled uniformly in the NAMING section of the used SGML declaration, meaning the declaration is applicable for all these names and name-like constructs at once.

SGML distinguishes name start characters, which can appear as the first character of a name token, from name characters, which can appear anywhere in a name token. Specifically, the digits can't start a name token. By default, unless more characters are added to the set of name characters or name start characters, respectively, as explained next, the upper and lowercase IRV letters are accepted as name start characters, while the digits are accepted in addition as name characters.

Name tokens can be normalized into an uppercase form for the purpose of validation and tag inference (and output, if any), provided that the mapping can be specified for each character or character range individually (eg. rather than by reference to Unicode case conversion procedures), using the LCNMSTRT, UCNMSTRT, LCNMCHAR, and UCNMCHAR parameters.

These parameters each contain either a quoted parameter literal containing characters (as character literals), or a space-separated list of character numbers or character ranges, and have the following meaning:

LCNMSTRT (lowercase name start characters)

describes lowercase characters used as name start characters in addition to the IRV lowercase and uppercase letters

UCNMSTRT (uppercase name start characters)

describes exactly as many characters as LCNMSTRT, and contains the uppercase letter for the corresponding lowercase letter in LCNMSTRT at the same position

LCNMCHAR (lowercase name characters)

describes lowercase characters used as name characters in addition to the lowercase name start characters in LCNMSTRT

UCNMCHAR (uppercase name characters)

describes exactly as many characters as LCNMCHAR and contains the uppercase letter for the corresponding lowercase letter in LCNMCHAR at the same position

The SGML declaration admits case folding/canonicalization to be switched on for these two groups of name tokens individually

  • entities (SYNTAX NAMECASE ENTITY YES/NO)

  • and for all other name token uses (SYNTAX NAMECASE GENERAL YES/NO)

but not for more granular subsets of the other name tokens.
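
How the four naming parameters and NAMECASE interact can be sketched in Python as follows. This is a simplified model and not the sgmljs.net implementation: the parameters are assumed to be given as parameter literals (strings) only, and the names make_name_folder, is_name and fold_name are made up for this example.

```python
import string


def make_name_folder(lcnmstrt="", ucnmstrt="", lcnmchar=".-_:", ucnmchar=".-_:"):
    """Build name checking and case folding functions from naming parameters."""
    if len(lcnmstrt) != len(ucnmstrt) or len(lcnmchar) != len(ucnmchar):
        raise ValueError("LC and UC parameters must describe equally many characters")
    # Built-in folding of the IRV letters, plus the declared positional pairs.
    fold = dict(zip(string.ascii_lowercase, string.ascii_uppercase))
    fold.update(zip(lcnmstrt, ucnmstrt))
    fold.update(zip(lcnmchar, ucnmchar))
    start_chars = set(string.ascii_letters) | set(lcnmstrt) | set(ucnmstrt)
    name_chars = start_chars | set(string.digits) | set(lcnmchar) | set(ucnmchar)

    def is_name(token: str) -> bool:
        # A name starts with a name start character, followed by name characters.
        return (bool(token) and token[0] in start_chars
                and all(c in name_chars for c in token))

    def fold_name(token: str, namecase: bool = True) -> str:
        if not is_name(token):
            raise ValueError("not a name: " + repr(token))
        return "".join(fold.get(c, c) for c in token) if namecase else token

    return is_name, fold_name


# Naming parameters of the reference concrete syntax shown earlier.
is_name, fold_name = make_name_folder()
```

With these functions, fold_name("xml:lang") models NAMECASE GENERAL YES and yields XML:LANG, while fold_name("nbsp", namecase=False) models NAMECASE ENTITY NO and leaves the entity name unchanged; is_name("2col") is false because a digit can't start a name.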

Extended Naming Rules

When extended naming rules are used, as indicated by the "ISO 8879:1986 (ENR)" (or the "ISO 8879:1986 (WWW)") minimum literal/data, the naming section of an SGML declaration can contain the additional NAMESTRT and NAMECHAR parameters.

Moreover, extended naming rules enable character ranges to be used with naming parameters, rather than just lists of individual character numbers.

A naming section making use of extended naming rules can look as follows:

NAMING	LCNMSTRT ""
	UCNMSTRT ""
	NAMESTRT ""
	LCNMCHAR ""    
	UCNMCHAR ""
	NAMECHAR ".-_:"

The effect of using NAMESTRT and NAMECHAR, respectively, is that the declaration is treated as if the value for NAMESTRT had been used in both LCNMSTRT and UCNMSTRT; likewise, the NAMECHAR value is interpreted as if the parameter literal had been used in both LCNMCHAR and UCNMCHAR.

When using extended naming, the literals for the LCNMSTRT, UCNMSTRT, LCNMCHAR, and UCNMCHAR parameters are left empty in the SGML declaration.

SGML's uppercase bias

Note that SGML has a built-in preference for the uppercase form of characters if NAMECASE GENERAL YES is applied, in that

  • the lowercase and uppercase letters are always considered both name and name start characters (cf. ISO 8879 Clause 189); ie. these cannot be excluded from the set of admissible characters for name tokens at all

  • likewise, the LCNMSTRT, UCNMSTRT, LCNMCHAR, and UCNMCHAR SGML declaration parameters (and NAMESTRT/NAMECHAR introduced with the extended naming rules according to ISO 8879 Annex J), which define a larger character set for name tokens than that of the SGML reference concrete syntax along with the associated lowercase-to-uppercase mapping rules, can only contain characters in addition to the IRV/ASCII letters and digits; in particular, no customized uppercase letter can be mapped for the letters in IRV/ASCII; this is enshrined in ISO 8879 Clause 198, 22 which reads "A character assigned to LCNMCHAR, UCNMCHAR, LCNMSTRT, or UCNMSTRT cannot be an LC Letter, UC Letter, Digit, RE, RS, SPACE, or SEPCHAR"; consequently, a lowercase IRV/ASCII letter is always case-folded with built-in SGML rules when NAMECASE GENERAL YES is effective

SGML's uppercase bias isn't affected by ISO 8879 Annex J, which only alters the rule for the NAMING production so as to allow character ranges instead of just single character specifications, and also adds NAMESTRT and NAMECHAR as a convenient, but not essential, short form for specifying the values of LCNMSTRT, UCNMSTRT, LCNMCHAR, and UCNMCHAR when the upper- and lowercase variants are identical.

Note that uppercase is only conceptually the preferred form, ie. for the purpose of defining SGML validation in the specification text. SGML applications are free to output or otherwise convey markup as they see fit. ISO 8879 doesn't put any constraints on these, nor does it define a canonical SGML processing application or API apart from defining the line-oriented ESIS SGML representation used by ISO 8879's test case suite (the Grove in-memory representation of SGML, which in many ways is the predecessor to W3C's DOM API, isn't part of ISO 8879).

Hence, whether uppercase or lowercase is used internally by an SGML processor, or whether the processor makes this distinction at all, doesn't have any consequences for the externally observable behaviour of SGML applications as far as ISO 8879 is concerned. For example, sgmljs.net SGML has an option to output HTML markup in lowercase form, while being able to process SGML with NAMECASE GENERAL YES without restrictions.
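
To illustrate what the combination of NAMECASE GENERAL YES and NAMECASE ENTITY NO means for a document, consider the following sketch. The document type, element names, and entity names are made up for illustration, and a declaration with these two NAMECASE settings (such as the reference concrete syntax) is assumed to be in effect.

```sgml
<!DOCTYPE doc [
  <!ELEMENT doc  - - (para+)>
  <!ELEMENT para - - (#PCDATA)>
  <!-- two distinct entities, since entity names aren't folded -->
  <!ENTITY note "lowercase entity">
  <!ENTITY NOTE "uppercase entity">
]>
<DOC>
<para>uses the &note;</PARA>
<Para>uses the &NOTE;</para>
</doc>
```

Here, doc and DOC, as well as para, Para, and PARA, all denote the same element type because general names are folded, whereas note and NOTE refer to two different entities.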

SGML declaration for HTML5

The SGML declaration for HTML5, applied by sgmljs.net SGML by default when eg. processing .html files or an entity with text/html media type fetched via HTTP, is explained in detail in the HTML5.1 DTD reference.

SGML declaration for HTML4

As an example of a plausible SGML declaration, the following start of an HTML document contains a variant of the (historic) SGML declaration for HTML 4.0. It differs from the official SGML declaration of HTML 4.01 only in its use of the extended WebSGML FEATURES declaration syntax to match actual HTML usage.

Note that the declaration as shown here doesn't declare HTML predefined entities for space reasons, and thus can't be used for HTML content containing entity references; the variant of the SGML declaration for HTML5 for use with the full W3C HTML5.2 DTD does contain these and other declarations, though.

<!SGML "ISO 8879:1986 (WWW)"

  -- based on the SGML declaration for HTML 4.01 --
 
CHARSET
         BASESET   "ISO Registration Number 177//CHARSET
                    ISO/IEC 10646-1:1993 UCS-4 with
                    implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET 0       9       UNUSED
                 9       2       9
                 11      2       UNUSED
                 13      1       13
                 14      18      UNUSED
                 32      95      32
                 127     1       UNUSED
                 128     32      UNUSED
                 160     55136   160
                 55296   2048    UNUSED  -- SURROGATES --
                 57344   1056768 57344

CAPACITY        SGMLREF
                TOTALCAP        150000
                GRPCAP          150000
                ENTCAP          150000

SCOPE    DOCUMENT
SYNTAX
         SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
	           17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
         BASESET  "ISO 646IRV:1991//CHARSET
                   International Reference Version
                   (IRV)//ESC 2/8 4/2"
         DESCSET  0 128 0

         FUNCTION
                  RE            13
                  RS            10
                  SPACE         32
                  TAB SEPCHAR    9

         NAMING   LCNMSTRT ""
                  UCNMSTRT ""
                  LCNMCHAR ".-_:"    
                  UCNMCHAR ".-_:"
                  NAMECASE GENERAL YES
                           ENTITY  NO
         DELIM    GENERAL  SGMLREF
                  HCRO     "&#38;#x" -- ampersand --
                  NESTC    "/"
                  NET      ">"
                  SHORTREF SGMLREF
         NAMES    SGMLREF
         QUANTITY SGMLREF
                  ATTCNT   120     -- increased for HTML 5 --
                  ATTSPLEN 65536   -- These are the largest values --
                  LITLEN   65536   -- permitted in the declaration --
                  NAMELEN  65536   -- Avoid fixed limits in actual --
                  PILEN    65536   -- implementations of HTML UA's --
                  TAGLVL   100
                  TAGLEN   65536
                  GRPGTCNT 150
                  GRPCNT   150     -- increased for HTML 5 --

FEATURES
        MINIMIZE DATATAG  NO
                 OMITTAG  YES
                 RANK     NO
                 SHORTTAG
                          STARTTAG EMPTY    NO
                                   UNCLOSED NO
                                   NETENABL IMMEDNET
                          ENDTAG   EMPTY    NO
                                   UNCLOSED NO
                          ATTRIB   DEFAULT  YES
                                   OMITNAME YES
                                   VALUE    YES
                 EMPTYNRM YES
                 IMPLYDEF ATTLIST  YES
                          DOCTYPE  NO
                          ELEMENT  YES
                          ENTITY   NO
                          NOTATION NO
         LINK
                 SIMPLE   NO
                 IMPLICIT NO
                 EXPLICIT NO
         OTHER
                 CONCUR   NO
                 SUBDOC   NO
                 FORMAL   NO
                 URN      NO
                 KEEPRSRE YES
                 VALIDITY NOASSERT
                 ENTITIES
                          REF      ANY
                          INTEGRAL NO
APPINFO NONE
>
<!DOCTYPE HTML>
...
CHARSET ...

the declaration uses the same CHARSET BASESET character set declaration as the SGML declaration for HTML 4.01; the declaration admits most Unicode characters; in practice, any valid UTF-8 byte sequence in content or attribute values is admitted, but note sgmljs.net doesn't enforce this and will admit any byte in content or attributes, whether it forms part of a valid UTF-8 byte sequence or not (except those having a special delimiter role in SGML such as the < character)

SYNTAX ... BASESET ... DESCSET ...

the declaration restricts generic identifiers (used for element, attribute, notation, declaration set, and entity names) to the IRV (ASCII) characters A through Z, a through z, and the decimal digits; in addition, the characters . (dot), - (hyphen), _ (underscore), and : (colon) are accepted as the second and subsequent characters, but not as the first character of generic identifiers (note, however, that the SGML processor doesn't enforce these rules)

FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET, FEATURES MINIMIZE EMPTYNRM YES

the declaration is suited for inclusion of SVG and/or MathML as HTML 5 "foreign elements"; specifically, XML-style empty elements are accepted in SVG and MathML; moreover, HTML 5 "self-closing" tags are accepted in HTML content as well (irrespective of whether those are declared "void" elements in the HTML 5 spec); also, the NESTC and NET delimiters have been declared as appropriate for XML; note that XML predefined entities are not declared

IMPLYDEF ELEMENT YES

undeclared elements will be accepted and treated as if they were declared <!ELEMENT elmt - - ANY>; according to this setting, only contents of elements which are declared in a DTD will be validated (but see FEATURES OTHER VALIDITY NOASSERT which effectively switches off any content model validation)

QUANTITY SGMLREF ...

quantities have been adapted so that (Open)SP SGML processing tools can process DTDs for HTML 5; quantity declarations are not required for sgmljs.net SGML

FEATURES OTHER KEEPRSRE YES

newline and carriage return characters will be preserved in content (rather than being interpreted according to SGML rules for omissible whitespace); note that sgmljs.net doesn't support another setting for KEEPRSRE

FEATURES OTHER VALIDITY NOASSERT

no content model or attribute validation is performed; only balancedness of start-element and end-element tags is checked, including checks for elements with declared content EMPTY, which may or may not have an end-element tag or a "self-closing" start-element tag
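
As a sketch of the kind of markup that the SHORTTAG, EMPTYNRM, and IMPLYDEF settings described above are meant to admit, consider the following document. It's a made-up illustration rather than tested processor output, and assumes that the SGML declaration shown above is in effect and that no DTD is supplied.

```sgml
<!DOCTYPE html>
<html>
<body>
<p>First line<br/>second line</p>
<svg width="20" height="20">
  <circle cx="10" cy="10" r="8"/>
</svg>
</body>
</html>
```

The br and circle elements use XML-style empty element tags, made possible by NETENABL IMMEDNET together with the NESTC and NET delimiter assignments; all elements and attributes are undeclared and are accepted by way of IMPLYDEF ELEMENT YES and IMPLYDEF ATTLIST YES.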

SGML declaration for XML 1.0 (Fourth Ed. or earlier)

The following declaration is applied by default if a file being processed has an .xml suffix, or begins with an XML declaration, or begins with this declaration.

The XML Fifth Ed. and the XML 1.1 specification revisions have extended the set of admissible characters in name tokens to cover almost all UCS code points, hence the declaration text for these revisions can be much shorter. However, these newer XML specifications are widely considered not representative of actual XML usage, and no official ISO/IEC 8879 SGML declaration for these newer XML versions has been released yet.

For interoperability, only use of the official SGML declaration for XML 1.0 exactly (up to whitespace and comments) as given here is supported, and use of variant declarations is strongly discouraged, until a new official or at least generally accepted SGML declaration for XML is established.

<!SGML "ISO 8879:1986 (WWW)"

     -- SGML Declaration for XML 1.0 --

     -- from: 
        Final text of revised Web SGML Adaptations Annex (TC2) to ISO 8879:1986
        ISO/IEC JTC1/SC34 N0029: 1998-12-06
        Annex L.2 (informative): SGML Declaration for XML

        changes made to accommodate validation are noted with 'VALID:'
     --

     CHARSET
         BASESET "ISO Registration Number 177//CHARSET
                 ISO/IEC 10646-1:1993 UCS-4 with implementation
                 level 3//ESC 2/5 2/15 4/6"
         DESCSET
                 0        9  UNUSED
                 9        2       9
                11        2  UNUSED
                13        1      13
                14       18  UNUSED
                32       95      32
               127        1  UNUSED
               128       32  UNUSED
               160    55136     160
             55296     2048  UNUSED  -- surrogates --
             57344     8190   57344
             65534        2  UNUSED  -- FFFE and FFFF --
             65536  1048576   65536

     CAPACITY NONE  -- Capacities are not restricted in XML --

     SCOPE DOCUMENT

     SYNTAX
         SHUNCHAR NONE
         BASESET "ISO Registration Number 177//CHARSET
                 ISO/IEC 10646-1:1993 UCS-4 with implementation
                 level 3//ESC 2/5 2/15 4/6"
         DESCSET
             0 1114112 0
         FUNCTION
             RE    13
             RS    10
             SPACE 32
             TAB   SEPCHAR 9
         NAMING
             LCNMSTRT ""
             UCNMSTRT ""
             NAMESTRT
                 58 95 192-214 216-246 248-305 308-318 321-328
                 330-382 384-451 461-496 500-501 506-535 592-680
                 699-705 902 904-906 908 910-929 931-974 976-982
                 986 988 990 992 994-1011 1025-1036 1038-1103
                 1105-1116 1118-1153 1168-1220 1223-1224
                 1227-1228 1232-1259 1262-1269 1272-1273
                 1329-1366 1369 1377-1414 1488-1514 1520-1522
                 1569-1594 1601-1610 1649-1719 1722-1726
                 1728-1742 1744-1747 1749 1765-1766 2309-2361
                 2365 2392-2401 2437-2444 2447-2448 2451-2472
                 2474-2480 2482 2486-2489 2524-2525 2527-2529
                 2544-2545 2565-2570 2575-2576 2579-2600
                 2602-2608 2610-2611 2613-2614 2616-2617
                 2649-2652 2654 2674-2676 2693-2699 2701
                 2703-2705 2707-2728 2730-2736 2738-2739
                 2741-2745 2749 2784 2821-2828 2831-2832
                 2835-2856 2858-2864 2866-2867 2870-2873 2877
                 2908-2909 2911-2913 2949-2954 2958-2960
                 2962-2965 2969-2970 2972 2974-2975 2979-2980
                 2984-2986 2990-2997 2999-3001 3077-3084
                 3086-3088 3090-3112 3114-3123 3125-3129
                 3168-3169 3205-3212 3214-3216 3218-3240
                 3242-3251 3253-3257 3294 3296-3297 3333-3340
                 3342-3344 3346-3368 3370-3385 3424-3425
                 3585-3630 3632 3634-3635 3648-3653 3713-3714
                 3716 3719-3720 3722 3725 3732-3735 3737-3743
                 3745-3747 3749 3751 3754-3755 3757-3758 3760
                 3762-3763 3773 3776-3780 3904-3911 3913-3945
                 4256-4293 4304-4342 4352 4354-4355 4357-4359
                 4361 4363-4364 4366-4370 4412 4414 4416 4428
                 4430 4432 4436-4437 4441 4447-4449 4451 4453
                 4455 4457 4461-4462 4466-4467 4469 4510 4520
                 4523 4526-4527 4535-4536 4538 4540-4546 4587
                 4592 4601 7680-7835 7840-7929 7936-7957
                 7960-7965 7968-8005 8008-8013 8016-8023 8025
                 8027 8029 8031-8061 8064-8116 8118-8124 8126
                 8130-8132 8134-8140 8144-8147 8150-8155
                 8160-8172 8178-8180 8182-8188 8486 8490-8491
                 8494 8576-8578 12295 12321-12329 12353-12436
                 12449-12538 12549-12588 19968-40869 44032-55203

             LCNMCHAR ""
             UCNMCHAR ""
             NAMECHAR
                 45-46 183 720-721 768-837 864-865 903 1155-1158
                 1425-1441 1443-1465 1467-1469 1471 1473-1474
                 1476 1600 1611-1618 1632-1641 1648 1750-1764
                 1767-1768 1770-1773 1776-1785 2305-2307 2364
                 2366-2381 2385-2388 2402-2403 2406-2415
                 2433-2435 2492 2494-2500 2503-2504 2507-2509
                 2519 2530-2531 2534-2543 2562 2620 2622-2626
                 2631-2632 2635-2637 2662-2673 2689-2691 2748
                 2750-2757 2759-2761 2763-2765 2790-2799
                 2817-2819 2876 2878-2883 2887-2888 2891-2893
                 2902-2903 2918-2927 2946-2947 3006-3010
                 3014-3016 3018-3021 3031 3047-3055 3073-3075
                 3134-3140 3142-3144 3146-3149 3157-3158
                 3174-3183 3202-3203 3262-3268 3270-3272
                 3274-3277 3285-3286 3302-3311 3330-3331
                 3390-3395 3398-3400 3402-3405 3415 3430-3439
                 3633 3636-3642 3654-3662 3664-3673 3761
                 3764-3769 3771-3772 3782 3784-3789 3792-3801
                 3864-3865 3872-3881 3893 3895 3897 3902-3903
                 3953-3972 3974-3979 3984-3989 3991 3993-4013
                 4017-4023 4025 8400-8412 8417 12293 12330-12335
                 12337-12341 12441-12442 12445-12446 12540-12542

             NAMECASE
                 GENERAL NO
                 ENTITY  NO
         DELIM
             GENERAL  SGMLREF
             HCRO     "&#38;#x"
                      -- Ampersand followed by "#x" (without quotes) --
             NESTC    "/"
             NET      ">"
             PIC      "?>"
             SHORTREF NONE

         NAMES
             SGMLREF

         QUANTITY
             NONE -- Quantities are not restricted in XML --

         ENTITIES
             "amp"  38
             "lt"   60
             "gt"   62
             "quot" 34
             "apos" 39

     FEATURES
         MINIMIZE
             DATATAG NO
             OMITTAG NO
             RANK    NO
             SHORTTAG
                 STARTTAG
                     EMPTY    NO
                     UNCLOSED NO
                     NETENABL IMMEDNET
                 ENDTAG
                     EMPTY    NO
                     UNCLOSED NO
                 ATTRIB
                     DEFAULT  YES
                     OMITNAME NO
                     VALUE    NO
             EMPTYNRM  YES
             IMPLYDEF
                 ATTLIST  YES
                 DOCTYPE  NO
                 ELEMENT  YES
                 ENTITY   NO
                 NOTATION YES
         LINK
             SIMPLE   NO
             IMPLICIT NO
             EXPLICIT NO
         OTHER
             CONCUR   NO
             SUBDOC   NO
             FORMAL   NO
             URN      NO
             KEEPRSRE YES
             VALIDITY NOASSERT
             ENTITIES
                 REF      ANY
                 INTEGRAL YES

     APPINFO NONE

     SEEALSO "ISO 8879//NOTATION Extensible Markup Language (XML) 1.0//EN"
>
<!DOCTYPE ...>
...
BASESET

see notes above

SYNTAX ... BASESET ... DESCSET ...

irrespective of the range restrictions expressed in the declaration, the processor admits all valid XML 1.0 Fifth Edition (or XML 1.1) generic identifiers

This declaration also has the following notable settings:

  • SYNTAX NAMECASE GENERAL NO

  • SYNTAX NAMECASE ENTITY NO

  • FEATURES MINIMIZE OMITTAG NO

  • FEATURES MINIMIZE RANK NO

  • FEATURES MINIMIZE IMPLYDEF DOCTYPE NO

  • FEATURES MINIMIZE IMPLYDEF ELEMENT YES

  • FEATURES MINIMIZE IMPLYDEF ATTLIST YES

  • FEATURES MINIMIZE IMPLYDEF ENTITY NO

  • FEATURES MINIMIZE SHORTTAG ATTRIB OMITNAME NO

  • FEATURES MINIMIZE SHORTTAG ATTRIB VALUE NO

  • FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET

  • FEATURES OTHER VALIDITY NOASSERT

  • FEATURES OTHER KEEPRSRE YES

  • added requirements (SEEALSO) "ISO 8879//NOTATION Extensible Markup Language (XML) 1.0//EN"

Default SGML declaration

If processing a file with suffix .sgm, a declaration with the following settings is applied:

  • SYNTAX NAMECASE GENERAL YES

  • SYNTAX NAMECASE ENTITY NO

  • FEATURES MINIMIZE RANK YES

  • FEATURES MINIMIZE OMITTAG NO

  • FEATURES MINIMIZE IMPLYDEF DOCTYPE NO

  • FEATURES MINIMIZE IMPLYDEF ELEMENT YES

  • FEATURES MINIMIZE IMPLYDEF ATTLIST YES

  • FEATURES MINIMIZE IMPLYDEF ENTITY NO

  • FEATURES MINIMIZE EMPTYNRM YES

  • FEATURES MINIMIZE SHORTTAG ATTRIB DEFAULT YES

  • FEATURES MINIMIZE SHORTTAG ATTRIB OMITNAME NO

  • FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET

  • FEATURES OTHER VALIDITY NOASSERT

  • FEATURES OTHER KEEPRSRE YES

  • FEATURES OTHER FORMAL YES

  • FEATURES OTHER URN YES

In addition, the default SGML declaration has the following link processing related settings:

LINK
        SIMPLE   YES 99
        IMPLICIT YES
        EXPLICIT YES 2

These settings enable link processing and templating.

SGML declaration for markdown

If processing a file with suffix .md, a declaration with the same settings as an .sgm file is applied. In addition,

  • predefined entities for HTML are active

  • the strings <file:, <http:, <mailto:, <:, #, ##, ###, and others are declared as SHORTREF delimiters (note that sgmljs.net doesn't support custom SHORTREF declarations, but the presentation of markdown as a SHORTREF application nominally requires these declarations, even though versions of (Open)SP SGML don't enforce their presence when declaring shortref maps)

Note that, as with .sgm files, validation isn't enabled (left at its default of NOASSERT).

Using a public declaration reference such as the following

<!SGML MARKDOWN PUBLIC "+//IDN sgml.net//SD Markdown//EN">

in place of a full SGML declaration (where the MARKDOWN declaration set name, but not the +//IDN sgml.net//SD Markdown//EN public identifier can be chosen arbitrarily) enables markdown processing from any processed file, not just those ending in .md.
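
For example, a file with an arbitrary suffix could start as follows. The declaration set name MD is chosen freely here to show that only the public identifier matters; the markdown text itself is a made-up sample, and whether further declarations are needed for a particular setup isn't covered by this sketch.

```sgml
<!SGML MD PUBLIC "+//IDN sgml.net//SD Markdown//EN">
# Section Title

Body text in a first paragraph.
```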