SGML

Syntax Reference

Introduction

SGML, like HTML (which is based on SGML), is a text format built on the idea of organizing information by tagging, or marking up, text. SGML is a meta-language for describing markup vocabularies such as HTML, along with their parsing rules.

Consider the following basic HTML document:

<html>
<head>
	<title>Page Title</title>
</head>
<body>
	<h1>Section Title</h1>
	<p>Body Text with <a href="otherdoc.html">link to another document</a>.</p>
	<footer>Page Footer</footer>
</body>
</html>

The element grammar for this document can be described as a SGML Document Type Definition (DTD) as follows:

<!ELEMENT html - - (head?,body)>
<!ELEMENT head - - (title?)>
<!ELEMENT title - - (#PCDATA)>
<!ELEMENT body - - (h1,p+,footer?)>
<!ELEMENT h1 - - (#PCDATA)>
<!ELEMENT p - - (#PCDATA|a)*>
<!ELEMENT a - - (#PCDATA)>
<!ELEMENT footer - - (#PCDATA)>

In this grammar, the regular expression head?,body means that the content of the html element is expected to consist of an optional head element, followed by a body element. Both the head and the body element have grammar rules for their own content in turn. #PCDATA means that text is expected at the respective position.

Given such a markup grammar and other declarations, SGML can

  • check the markup of a given document or a larger collection of documents, and enforce presence or absence of tags or attributes
  • infer tags and attributes not present in a document but desired for content delivery (as used for automatically adding boilerplate and structuring content in web applications, and to simplify content creation)
  • attach processing to elements or more complex contexts for generating dynamic web content, or for other template processing applications in content production

SGML can be used for

  • content authoring and workflow organization using straightforward concepts such as files and folders, as well as more sophisticated declarative techniques for web applications
  • content delivery over the web, with rich facilities for fetching and preparing content from databases or web services, and for integration into mainstream web application stacks
  • sanitizing potentially malicious user content in dynamic web applications or content production processes (injection prevention)
  • searching, transforming, analyzing and otherwise processing web content and other markup documents

The following sections describe the markup declarations that can be used in a DTD, and their effect on the respective markup constructs in document content.

Elements

The general form of an element declaration is

<!ELEMENT element-name [rank] [tag-omission-rules] content [exceptions]>

or

<!ELEMENT name-group [rank] [tag-omission-rules] content [exceptions]>

where

element-name
is a single element name to declare

see basic element declaration examples

name-group
is a list of element names to declare

a name group has the form (element1|element2|...|elementN).

see element declaration examples using name groups

rank (optional)

is a non-negative decimal number which is treated as a rank suffix that the declared element must have when used in content

the element name is treated as a rank stem, rather than a complete element name, if a rank suffix is specified

an element declared with rank having a rank suffix specified in content (ie. ending in a number in a start-element tag), sets the implied rank suffix for any element tag in subsequent content

for an element declared with rank having its rank suffix omitted in content, the effective rank suffix is that of the most recent element declared in the same declaration that has a rank suffix specified; the most recent element doesn't necessarily have to be a parent element, but can be any preceding element

it's an error if the first occurrence of an element declared with rank in a document instance has its rank suffix omitted
with respect to rank minimization, sgmljs.net treats all elements declared with the same rank suffix in a DTD as if those were declared in the same declaration; ie. a rank suffix is not only inferred from prior elements declared in the same declaration, but from any prior element having a rank declared and specified in content

an element declared with rank is referenced by its rank stem and rank suffix as concatenated name from other declarations; e.g. an element declared with rank stem abc and rank suffix 3 is referenced as abc3 in content model expressions of other element declarations where the element may occur as content model token

note that using element ranks does not in itself enable uses such as e.g. automatically assigning/incrementing header levels based on tag nesting levels; instead, rank omission always infers from the most recently specified rank: see rank-examples

tag-omission-rules (optional)

- - means both start- and end-tag must be specified

- O means the end-tag can be left out

O - means the start-tag can be left out

O O means both start- and end-tag may be left out

in the above syntax rules O refers to the letter O, and - to the minus character

there must be whitespace between the specifier for start- and end-tag omission rules

the tag-omission-rules specification may be left out altogether in which case they default to - -

see Tag Inference for applying tag omission rules
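
As an illustration of the omission specifiers above, the following minimal sketch declares a hypothetical list element whose item children may omit their end-tags. The element names are made up for this example, and it assumes that FEATURES MINIMIZE OMITTAG YES is in effect.

```sgml
<!DOCTYPE list [
	<!-- both tags of list must be present -->
	<!ELEMENT list - - (item+)>
	<!-- the end-tag of item may be left out -->
	<!ELEMENT item - O (#PCDATA)>
]>
<list>
	<item>first
	<item>second
</list>
```

Each item would be expected to be closed implicitly, either by the next item start-tag or by the list end-tag.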

content

either a Content Model, with surrounding parentheses

or ANY, allowing any content

or EMPTY, which forbids the element to have content

or CDATA, which will make element content parse as character data

or RCDATA, which will make element content parse as replaceable character data, ie. as character data in which general entity references are expanded into the respective entity replacement text

see below for detailed explanation

exceptions

an expression of the form -(exclusions) +(inclusions) where either the exclusions- or the inclusions-part, or both, can be omitted

if both the exclusion- and the inclusion-part are specified, then the inclusion-part must follow the exclusion-part

inclusions is a single element or a name group (a list of elements) allowed to occur anywhere and arbitrarily often in descendant content in addition to elements specified in the content model

exclusions is a single element or a name group of elements not allowed to occur in descendant content, even though allowed by the content model or included by an element declaration for a parent element
if an element is excluded, it can't be included by an element declaration for a descendant element (the inclusion is ignored)
an element that is required in a content model can't be excluded
it's an error for an element to be both excluded and included in the same declaration
if an element occurs at a position where it matches a model group token, and is also in the set of included elements, then it is accepted as content model token (inclusion of the element is ignored)

exceptions can only be specified for elements having a content model or for elements with declared content ANY

See also exception examples.
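
A minimal sketch of exceptions, using hypothetical element names: doc includes note anywhere in its descendant content, while note excludes itself so that notes can't be nested.

```sgml
<!-- note may occur anywhere below doc, in addition to the content models -->
<!ELEMENT doc  - - (p+) +(note)>
<!ELEMENT p    - - (#PCDATA|em)*>
<!ELEMENT em   - - (#PCDATA)>
<!-- a note must not contain another note, despite the inclusion on doc -->
<!ELEMENT note - - (#PCDATA|em)* -(note)>
```

With these declarations, content such as <p>text <note>aside</note></p> would be expected to be accepted, whereas a note start-tag inside an open note is not accepted as content of that note.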

Declared content

An element declared ANY, EMPTY, CDATA, or RCDATA is said to have declared content.

ANY content

When an element is declared to have ANY content, any content (character data or any nested elements, subject to the effective value of IMPLYDEF ELEMENT in the SGML declaration) may occur between the element start- and end-tag.

EMPTY content

When an element is declared to have EMPTY content, it must be specified

  • either just in start-element tags (ie. end-element tags can't be used for that element at all), or,
  • if EMPTYNRM YES is specified in the SGML declaration, with an optional end-element tag immediately following the start-element tag.

Note that if, in addition to FEATURES MINIMIZE EMPTYNRM YES, also FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET is specified in the SGML declaration, and / and > are declared to have the NESTC and NET delimiter roles, respectively, then any element having no content (regardless of whether the element is declared EMPTY), can be specified as an XML-style empty element, ie. can be abbreviated by <element/>, instead of having to specify <element></element> (see SGML declaration for details).

Note that sgmljs.net supports only the characters stated above for the NESTC and NET delimiter roles (or no assignment to these delimiters at all). Moreover, sgmljs.net restricts supported combinations of the FEATURES MINIMIZE EMPTYNRM and the FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET SGML declaration properties to have either the values stated above, or to both have the value NO. The first combination, introduced with the WebSGML revision of the SGML standard, corresponds to modern polyglot markup writing (and is used by default in sgmljs.net), while the latter corresponds to the traditional SGML authoring style.

Note that, if processing XML, empty elements are required to either have end-element tags, or to be specified as XML-style empty elements (ie. as <element/>).

Note that apart from declaring an element to have EMPTY content, an element must also have empty content when a #CONREF attribute is specified on it; see #CONREF in attribute default values.

CDATA content

Elements declared CDATA contain unparsed character data as child content.

The & (ampersand) character has no special meaning in content of elements declared CDATA: character sequences looking like named entity references aren't expanded to replacement text, and are, like character entity references, reproduced as-is to result markup.

A < (less-than) character terminates content of elements declared CDATA, just like regular elements declared with content models.

Note that the CDATA reserved word is also used as a declared value in attribute declarations and entity declarations, and as a keyword in marked sections.

See declared content examples.

See also EMPTYNRM examples.
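
The following sketch shows declared content in element declarations. The element names are hypothetical and only loosely modelled on HTML.

```sgml
<!-- br never has content and is written as a start-tag only -->
<!ELEMENT br     - O EMPTY>
<!-- content of script is character data, not parsed for entity references -->
<!ELEMENT script - - CDATA>
<!-- div accepts character data and any declared element -->
<!ELEMENT div    - - ANY>
```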

Content models

A content model specifies the sequence of sub-elements and/or character data content that an element's child content is expected to have.

It is specified by a content model expression. For example, the content model expression a, b?, c* describes a sequence consisting of a single a element, followed by an optional b element, followed optionally by a sequence of any number of c elements.

A content model expression is an expression constructed from content model tokens and compositors, with optional grouping and nesting of subexpressions in parentheses.
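
For illustration, the expression a, b?, c* from the example above could appear in a declaration as follows. The element names x, a, b, and c are placeholders.

```sgml
<!DOCTYPE x [
	<!ELEMENT x - - (a, b?, c*)>
	<!ELEMENT (a|b|c) - - (#PCDATA)>
]>
<x><a>one</a><c>two</c><c>three</c></x>
```

Here b is left out and c occurs twice, both of which the content model permits.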

Content model tokens

Content model tokens are either

  • element names declared in the same or another element declaration within the same declaration set, or
  • the #PCDATA token representing character data being allowed at the position in the content model expression where it is specified

Compositors

A compositor is one of the following characters, listed along with the compositor's application to operand elements and/or compound subexpressions, and its semantics:

operand? (zero-or-one compositor)
means "zero or one" of the element or content model subexpression to which it applies
the operand element or content model subexpression to which the compositor applies is written to the left of the compositor
operand* (Kleene star compositor)
means "zero or more" of the element or content model subexpression to which it applies
the operand element or content model subexpression to which the compositor applies is written to the left of the compositor
operand+ (plus compositor)
means "one or more" of the operand element or content model expression to which it applies
the operand element or content model subexpression to which the compositor applies is written to the left of the compositor

an expression such as a+ where a is an element or subexpression, is equivalent to a,a*

left-operand, right-operand (comma compositor)
means "a sequence of the left, followed by the right operand" element name or content model subexpression
(operand) (grouping)
expressions can be grouped in parentheses such that they can be used as operands to higher level compositors; when parentheses are omitted, content model expressions are parsed left-to-right, ie. a compositor to the left of an operand takes precedence over a compositor to the right
left-operand & right-operand (allgroups-compositor)
means any sequence of the operand elements or subexpressions, provided that any one element or subexpression occurs at most once in total
when applied to an operand subexpression that has the "zero-or-one" compositor as top-most compositor, that operand subexpression isn't required to occur, but if it occurs, it must occur at most once anywhere in the content of the element being declared

the content model expression a & b & c is equivalent to the content model expression (a,((b,c)|(c,b)))|(b,((a,c)|(c,a)))|(c,((a,b)|(b,a)))

Note:

In sgmljs.net SGML, operands of the allgroup compositor must be either

  • a single element name, or
  • a subexpression having the zero-or-one compositor, the sole operand of which is a single element name.

More complex operands for the allgroup compositor aren't supported.
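
A sketch of an allgroup that stays within these restrictions, loosely modelled on the head element of older HTML DTDs (treat the element names as illustrative):

```sgml
<!-- title is required, base and meta are optional, any order is accepted -->
<!ELEMENT head  - - (title & base? & meta?)>
<!ELEMENT title - - (#PCDATA)>
<!ELEMENT (base|meta) - O EMPTY>
```

Content such as <base><title>T</title> or <title>T</title><meta> would then be expected to match the content model of head.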

If #PCDATA is specified as content token, it is implicitly treated as if (#PCDATA)* were specified, ie. parsed character data is always optional in content models.

Content models must be unambiguous, ie. any content token must be uniquely matched without looking ahead at subsequent content tokens for disambiguation. For example, the content model

(a,b)|(a,c)

is ambiguous, since element a can be matched as the beginning of either (a,b) or (a,c). On the other hand, the equivalent content model expression

a,(b|c)

is unambiguous.

See also compositor parsing precedence examples.

Tag Inference

For automatic generation of required elements not present in content, FEATURES MINIMIZE OMITTAG YES must be enabled in the SGML declaration (which it is by default, except when processing XML).

In the following description of SGML tag inference, trivial actions on special conditions aren't described, such as on

  • ANY content models, or, equivalently, implied-ANY elements; implied-ANY elements are elements having child elements with implied element declarations ie. undeclared elements (when allowed to occur via IMPLYDEF ELEMENT YES)

  • EMPTY elements, or, equivalently, implied-EMPTY elements (elements governed by content references)

  • inference of document elements (which is just a special case of general start-tag inference).

See also tag inference examples.

Actions performed on a start-element tag, or on parsed character data

  1. Close definitely completed elements

    Definitely completed elements are those whose required elements have all been parsed by previous actions, in the sequence declared in their content model, such that only an end-element tag for the enclosing (definitely completed) element is accepted at the context position.

    A model group ending in an optional content token or in a content token with one-or-more compositor can't be definitely completed, and isn't considered for automatic closing.

    It's an error if a definitely completed element's end-tag isn't omissible at this point, because a start-element action cannot be accommodated at the context position.

  2. Check if the start-element tag or parsed character data is accepted at the context position; that is, check it's accepted at the current position in the model group and isn't excluded via exclusion exceptions
  3. Open contextually required elements

    Elements are contextually required if the content model of the enclosing element accepts a single element at the context position as a required element.

    The element to accommodate is not influential in opening elements here, only the state of model group(s) already opened is considered.

  4. If a contextually required element is opened, and matches the content token to accommodate, tag inference is completed for this action

Additional rules

The following actions are performed by sgmljs.net SGML in addition (these and similar recovery actions are also performed by third party SGML parsers such as SP, but are reported as recoverable errors by those parsers, whereas sgmljs.net SGML performs these actions silently):

At step 3, if it isn't possible to open a contextually required element, and

  • the element to accommodate is declared as having rank, and
  • the element's rank suffix to accommodate is higher (numerically larger) than that of the parent element (or the parent has no rank, in which case it is treated as having rank 0), and
  • there's a single transition over a ranked element from the context state, and
  • the rank of that single transitioned-over element is the same as that of the element to accommodate, and
  • the start-tag of the ranked element to transition over is omissible

then that single transitioned-over element is opened as if it were contextually required.

Moreover, if it isn't possible to open either a contextually required element or a rank-implied element as described, the parent element is closed, if it is potentially completed (see definition below).

At step 2, if the element to accommodate isn't accepted at the context position due to exclusion exceptions, close as many potentially completed parent elements as necessary until it is (ie. until no more exclusion exceptions apply to the element to accommodate, if possible).

Actions performed on an end-element tag

  1. Close potentially completed elements

    Potentially completed elements are those whose required elements have all been parsed by previous actions, in the sequence declared in their content model; as opposed to definitely completed elements, the model group may allow further optional elements, or end in a content token (or in a nested model group) having the one-or-more compositor.

  2. If the end-element to accommodate matches the most recently closed element, tag inference is completed for this action
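
The start-tag and end-tag actions described above can be illustrated with a small, hypothetical HTML-like document type in which most tags are omissible. This is a sketch assuming FEATURES MINIMIZE OMITTAG YES; the DTD is made up for the example.

```sgml
<!DOCTYPE html [
	<!-- start- and end-tags of html, head and body may be omitted -->
	<!ELEMENT html  O O (head, body)>
	<!ELEMENT head  O O (title)>
	<!ELEMENT title - - (#PCDATA)>
	<!ELEMENT body  O O (p+)>
	<!-- only the end-tag of p may be omitted -->
	<!ELEMENT p     - O (#PCDATA)>
]>
<title>Hello</title>
<p>First paragraph
<p>Second paragraph
```

On the title start-tag, html and head would be expected to be opened as contextually required elements. On the first p start-tag, head is definitely completed and is closed, and body is opened as contextually required. The second p start-tag closes the first p, and at the end of the document the remaining open elements are closed because their end-tags are omissible.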

Attributes

Declarations for attribute lists take the form

<!ATTLIST element-name attribute-name declared-value [default-value]
                      [attribute-name declared-value [default-value]] ...>

or

<!ATTLIST name-group attribute-name declared-value [default-value]
                    [attribute-name declared-value [default-value]] ...>

or

<!ATTLIST #ALL attribute-name declared-value [default-value]
                    [attribute-name declared-value [default-value]] ...>

where

element-name
is a single element name to declare attributes for
name-group
is a list of element names to declare attributes for

a name group has the form (element1|element2|...|elementN).

#ALL

declares the attribute on all (declared or undeclared) elements when used in place of element-name or name-group

attribute-name
is the name of the attribute to declare
declared-value
is one of the following possible lexical value types
  • an enumerated value type

  • CDATA, allowing any quoted string to be used as attribute value

  • ENTITY, allowing a name token declared as entity name in the same declaration set; the token doesn't need quoting

  • ENTITIES, allowing, in addition to ENTITY, a space-separated list of name tokens declared as entity names; when actually specifying more than a single entity name in content, the attribute value must be quoted

  • ID, allowing a name token, which must be unique among all name tokens used as ID in a document, and which establishes an ID value for reference by IDREF or IDREFS attribute

  • IDREF, allowing a name token used as ID in the same document; the token doesn't need quoting

  • IDREFS, allowing, in addition to IDREF, a space-separated list of name tokens declared as ID attribute value; when actually specifying more than a single ID value in content, the attribute value must be quoted

  • NAME, allowing a name token; the token doesn't need quoting

  • NAMES, allowing, in addition to NAME, a space-separated list of name tokens

  • NMTOKEN, allowing, in addition to NAME, a token beginning with . (dot), - (minus), or _ (underscore), whereas NAME allows these characters to occur only at the second or subsequent position in the attribute value

  • NMTOKENS, allowing, in addition to NAMES, a list of tokens, each of which may begin with . (dot), - (minus), or _ (underscore)

  • NOTATION, allowing the attribute value to have a notation name specified in the enumerated list of permitted notation names

  • NUMBER, allowing a sequence of digits as attribute value

  • NUMBERS, allowing, in addition to NUMBER, a list of numerical values to occur

  • NUTOKEN, allowing a sequence of digits, followed by a sequence of letters (such as 64px)

  • NUTOKENS, allowing, in addition to NUTOKEN, a list of NUTOKEN tokens

A single attribute list declaration can declare one or more attributes for one or more elements (when using the name group declaration variant).

Conversely, attributes of the same element can also be declared in multiple attribute list declarations (from potentially multiple declaration sets). However, a given attribute can effectively be declared at most once for a given element: multiple declarations of the same attribute on an element aren't rejected, but only the first declaration in document order (and, by extension, in the order in which declaration sets are processed) becomes effective, while later declarations are ignored.

See attribute declaration and use examples.

See also element and attribute preemption (redeclaration) examples.

See also NMTOKEN redeclaration examples.
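
As a minimal sketch of an attribute list declaration and its use in content, consider the following hypothetical img element. The element and attribute names are assumptions for illustration, and the unquoted attribute values assume that the SGML declaration in effect permits them.

```sgml
<!DOCTYPE doc [
	<!ELEMENT doc - - (img+)>
	<!ELEMENT img - O EMPTY>
	<!ATTLIST img
		src   CDATA  #REQUIRED
		alt   CDATA  #IMPLIED
		width NUMBER #IMPLIED
		id    ID     #IMPLIED>
]>
<doc>
	<img src="logo.png" alt="Company logo" width=64 id=logo>
</doc>
```

The CDATA values are quoted, while the NUMBER and ID values are plain tokens and are written without quotes here.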

Default value

The default value is either

  • (for enumerated values) one of the enumerated values
  • (for NOTATION attributes) one of the enumerated notation names

  • (for other attributes) an attribute value literal; only needs quotes if the default value isn't a name token
  • the string #REQUIRED, which means the attribute must be specified, and must have a value

  • the string #IMPLIED, which means the attribute doesn't have to be specified (is optional)

  • the string #CONREF, which means that, if the attribute is specified, then the element on which it is specified is treated as if it were declared EMPTY

The string #FIXED may be specified before default values of the first, second, or third form above. When specified, the attribute either must have the default value, or mustn't be used at all on the respective element.

Note that assigning template entities to attributes declared #CONREF can have additional semantics to the effect that the element on which the #CONREF attribute is specified gets replaced by external content.
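
The default value forms listed above can be sketched in a single attribute list declaration. The element doc and all attribute names are hypothetical.

```sgml
<!ATTLIST doc
	lang    NAME          en
	status  (draft|final) draft
	version CDATA         #FIXED "1.0"
	ref     IDREF         #IMPLIED>
```

In this sketch, lang and status take the defaults en and draft when not specified, version can only ever have the value 1.0, and ref is optional without a default.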

Enumerated values

An attribute declaration such as

    <!ATTLIST elmt attr (val1|val2|val3) val1>

declares the attribute attr on element elmt.

The attribute can have the value val1, val2, or val3, and its default value (its value when not specified on the element explicitly) is val1.

Element wildcards (WebSGML)

Using the #ALL keyword, it's possible to declare one or more attributes on all elements; depending on whether undeclared elements are allowed (eg. by using IMPLYDEF ELEMENT YES or IMPLYDEF ELEMENT ANYOTHER as explained below), attributes declared in an attribute list declaration with #ALL can also be used on undeclared elements.

An attribute can be declared both in an #ALL attribute list as well as in a regular attribute list for a single element or an element namegroup at the same time. If an attribute is declared both on an individual element and on #ALL elements, its usage must satisfy both declarations.

For example, an attribute can be declared to have an enumerated value in an #ALL attribute list, and can be declared to have a #FIXED value in an attribute list declaration for an individual element. In this way, it's possible to model a common design pattern in DTDs, wherein an attribute is declared on an individual element in a more specific way than in a generic #ALL attribute declaration, while the generic #ALL declaration still expresses a baseline declaration and common requirement for the attribute's use across all elements used in a document.

It's a design error (and reported by sgmljs.net SGML as attribute validation error on actual attribute use), if an attribute is declared both as an #ALL attribute and as an attribute on an individual element, when the two declarations are not satisfiable simultaneously. For example, a #FIXED value for an attribute declared in an #ALL attribute declaration can't be refined by declaring a different #FIXED value on an individual element for the same attribute.

The order of an #ALL declaration relative to an attribute declaration of an individual element for the same attribute isn't significant and doesn't change the interpretation of attribute declarations. Moreover, #ALL attribute declarations always apply to all elements of the document type and DTD containing the declaration, irrespective of whether element declarations are placed before or after the respective #ALL attribute declaration in document order (or are present at all).

Note sgmljs.net doesn't support WebSGML's other keywords (such as #IMPLICIT) on attribute declarations in place of #ALL. Moreover, #ALL isn't supported for data attributes (ie. attributes of notations; see below).
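
The refinement pattern described above might be sketched as follows. The attribute dir and the element rtlnote are hypothetical, and the sketch assumes that both declarations are checked together as described in this section.

```sgml
<!-- baseline: every element may carry a dir attribute -->
<!ATTLIST #ALL dir (ltr|rtl) #IMPLIED>
<!-- refinement: on rtlnote, dir can only be rtl -->
<!ATTLIST rtlnote dir (ltr|rtl) #FIXED rtl>
```

Since rtl is one of the values enumerated in the #ALL declaration, the two declarations can be satisfied at the same time.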

Notations

A notation, in general SGML terms, is a representation format for data such as the image formats PNG, GIF, or JPEG, or a text format such as TeX for typesetting mathematics.

Notation markup can be used to specify content in a different data representation format than SGML, either embedded in a SGML document, or as a reference to an external resource.

In sgmljs.net, the notation construct is also used to provide custom processing on markup for a broad class of applications such as content formatting and filtering; see templating.

A notation is declared as follows:

<!NOTATION notation-name identifier>

where

notation-name
is the name of the notation to declare
identifier

is the public and/or system identifier for the notation, as used to identify the notation by either the built-in notations (SGML, SQL, SPARQL, etc.), or by external custom notation handlers; see identifiers

Notation attributes can be used to markup a piece of inline text as "in a notation": in the following example, the characters \sqrt{2} are marked up as TeX-formatted math:

<!doctype example [
	<!element example (math)+>
	<!element math CDATA>
	<!attlist math format notation (tex) #implied>
	<!notation tex public "TeX">
]>
<example>
	<math format=tex>\sqrt{2}</math>
</example>

Note this is only an example of how to specify inline notation data; the use of the ad-hoc public identifier TeX here won't cause sgmljs.net SGML to execute TeX instructions.

Note that when using notation attributes, the content restrictions and entity expansion behaviour declared in the element declaration for the element on which it is declared and specified apply unchanged.

The syntax for declaring (and specifying values for) NOTATION declared attributes is very similar to that of enumerated values; see attribute examples.

For using notations with external entities, see entities.

Data attributes

Like elements, notations can have attributes. Data attributes are used to configure properties of external data entities, or of inline notational content; see templating for details.

Data attributes are declared as follows:

<!ATTLIST #NOTATION notation-name attribute-name declared-value default-value
                                 [attribute-name declared-value default-value] ...>

For data attributes, the same rules as for element attributes apply, with the following exception:

  • data attributes can't have a declared value of ID, IDREF, IDREFS, NOTATION, ENTITY, or ENTITIES (however, special rules apply for templating)
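
A sketch of data attributes on a hypothetical png notation follows. The ad-hoc public identifier and the attribute names are assumptions, and the entity declaration uses the standard SGML syntax for external data entities with a data attribute specification in square brackets, which is not otherwise covered in this section.

```sgml
<!NOTATION png PUBLIC "PNG">
<!ATTLIST #NOTATION png
	width  NUMBER #IMPLIED
	height NUMBER #IMPLIED>
<!-- external data entity in the png notation, with data attribute values -->
<!ENTITY logo SYSTEM "logo.png" NDATA png [ width=64 height=64 ]>
```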

Entities

An entity, in SGML, is a stream of character data.

An entity declaration introduces a name for an entity for subsequent use in the SGML prolog or in content. Parsed entities (see general entities) are used for entity references, which are replaced by the entity's character data on processing. Unparsed entities (see data entities) are used as values of ENTITY (or ENTITIES) attributes and for templating, and are processed in an entity type-specific way.

General entities

The purpose of general entities is to reuse some text at multiple places in a document by placing entity references to a shared declared general entity as follows:

<!DOCTYPE doc [
	<!ENTITY text "some <i>reusable</i> text">
	<!ELEMENT doc - - (p+)>
	<!ELEMENT i - - (#PCDATA)>
	<!ELEMENT p - - (#PCDATA|i)*>
]>
<doc>
	<p>First use of the "text" entity follows: &text</p>
	<p>Second use of the "text" entity follows: &text</p>
</doc>

In the example, &text is a reference to the previously declared text (general) entity, and will expand to the string some <i>reusable</i> text in place.

Any markup contained in the entity replacement text will be interpreted as if it had been part of the text in which the entity reference is placed. This means that replacement text can contain tags (or any other SGML content construct such as marked sections, processing instructions, etc.). It may also contain further entity references in turn, which will be expanded in place recursively.

However, valid replacement text for an entity must not contain references to the entity being replaced itself (or, transitively, contain an entity reference expanding into a reference to the entity being expanded itself).

General entity references are expanded anywhere in content, attribute specifications, and replacement content of general entities, except in CDATA marked sections, CDATA content, CDATA attributes and data text entities (CDATA entities). General entity references (as opposed to parameter entity references) aren't expanded in markup declarations.

General entities are lazily fetched at the time(s) an entity reference is parsed in content. When processing an entity declaration with replacement text containing references to further entities, no check is performed whether referenced entities are declared and/or accessible. In particular, unlike parameter entities, at declaration time, replacement text for general entities may contain references to other entities that aren't themselves declared (yet).

See also general entity examples.

External general entities

Rather than specifying the replacement text for an entity literally, it's also possible to specify that replacement text should be retrieved from an external resource (such as a file or an HTTP entity) by declaring the entity as follows:

<!ENTITY ent SYSTEM "filename.txt">

where the part beginning with SYSTEM ... (containing a file name in the example) is an identifier.
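
As a sketch of how such an entity might be used, the following hypothetical document pulls in the content of a file named intro.sgm, which is assumed to contain one or more p elements. The file name and the element names are made up for this example.

```sgml
<!DOCTYPE doc [
	<!ENTITY intro SYSTEM "intro.sgm">
	<!ELEMENT doc - - (p+)>
	<!ELEMENT p - - (#PCDATA)>
]>
<doc>
	&intro;
	<p>Text following the included file</p>
</doc>
```

The reference &intro; would be expected to be replaced by the content of the external file, which is then parsed as if it had been written in place.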

Data text entities

For entities declared as follows

<!ENTITY ent CDATA "escaped replacement text">

or, equivalently,

<!ENTITY ent SDATA "escaped replacement text">

entity references are expanded into the respective literal replacement text without further interpretation of the replacement text as markup. If the replacement text contains characters or character sequences that would be interpreted as markup (such as the < or & characters), then those characters will be expanded into character entity references.

Consequently, general entity references and tags aren't recognized in data text entities; note, however, that the replacement text literal in a data text entity declaration is subject to parameter entity replacement.

In sgmljs.net, SDATA behaves the same as CDATA (in legacy SGML processing systems, SDATA, or "specific character data", was used to control SGML output serialization of data text, among other things).
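
A minimal sketch of a data text entity, with a hypothetical entity name, shows replacement text that looks like a tag but isn't treated as one:

```sgml
<!DOCTYPE doc [
	<!ENTITY tagname CDATA "<br>">
	<!ELEMENT doc - - (#PCDATA)>
]>
<doc>A line break is written as &tagname; in HTML</doc>
```

The reference would be expected to yield the literal characters <br> as text rather than a br element, so the document stays valid even though doc only allows character data.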

Processing instruction data text entities

Apart from CDATA and SDATA, also the PI keyword can be used in data text entity declarations.

This variant introduces an entity containing a processing instruction, and is the only variant that can also be used with parameter entities.

References to PI data text entities can only be used in a context where a processing instruction can be used; specifically, PI data text general entity references can't be used in attribute values.
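
A minimal sketch of a PI data text entity, assuming a hypothetical processing instruction named pagebreak that some downstream application understands; the entity is referenced in content, where a processing instruction is allowed:

```sgml
<!DOCTYPE doc [
	<!ELEMENT doc - - (#PCDATA)>
	<!-- "pagebreak" is a made-up instruction used for illustration only -->
	<!ENTITY pagebreak PI "pagebreak">
]>
<doc>First page &pagebreak; second page</doc>
```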

External data text entities

In sgmljs.net, an external data text entity is declared using the syntax for CDATA and SDATA data entities, explained below.

Character entity references

Character entity references are strings of the form &#NNNNNN where NNNNNN is a decimal number, or of the form &#xMMMMMM where MMMMMM is a hexadecimal number. The number refers to the code point in the document character set (Unicode) represented by the character entity reference.

Character entity references are passed as-is to the output; all browsers and markup processing tools are expected to be able to handle character entity references.
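
For illustration, the following fragment (the p element is an assumption) uses the decimal and the hexadecimal form for the same code point, U+00E9; both references are terminated with a ; character here, as is customary in HTML and XML:

```sgml
<p>caf&#233; and caf&#xE9; refer to the same character</p>
```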

Parameter entities

Entity declarations with a % character following the ENTITY keyword introduce parameter entities. Where general entity declarations define replacement text for content, parameter entities define replacement text in markup declarations.

For example, the following document type declaration set contains a declaration for the idattr parameter entity. The parameter entity is then referenced twice in further declarations.

<!DOCTYPE doc [
	<!ENTITY % idattr "id ID #IMPLIED">
	<!ELEMENT doc - - (#PCDATA|p|ul|a)>
	<!ELEMENT p - - (#PCDATA)>
	<!ELEMENT ul - - (li+)>
	<!ELEMENT li - - (#PCDATA)>
	<!ELEMENT a - - (#PCDATA)>
	<!ATTLIST doc %idattr>
	<!ATTLIST p %idattr>
	<!ATTLIST ul %idattr>
	<!ATTLIST li %idattr>
	<!ATTLIST a href CDATA #IMPLIED %idattr>
]>
...

Similar to general entity references, the %idattr parameter entity reference is expanded into the replacement text

id ID #IMPLIED

so that all elements will have the same id attribute declaration as a result.

Furthermore, the a element will have the href attribute in addition to the id attribute. Note that the purpose of reusing an attribute declaration can also be achieved by using a name group - a list of element names - in an ATTLIST declaration (and furthermore could also be achieved using WebSGML's #ALL keyword in place of an element name or name group).
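
As a sketch of the name group alternative mentioned above, the attribute declarations of the example could be written without a parameter entity roughly as follows (the a element keeps its own declaration because it has an additional attribute):

```sgml
<!-- one declaration covering all elements that only have the id attribute -->
<!ATTLIST (doc|p|ul|li) id ID #IMPLIED>
<!-- the a element declares href in addition to id -->
<!ATTLIST a href CDATA #IMPLIED id ID #IMPLIED>
```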

A parameter entity reference must begin with the % character. A parameter entity declaration must have whitespace between the % character and the subsequent parameter entity name.

Apart from reusing parts of declaration text, parameter entities are used in particular for

  • customizing a generic external declaration set by overriding default declarations for parameter entities in the internal declaration set; see declaration sets

  • providing placeholders for keywords in marked sections

  • designing declaration set text for reuse in general.

Unlike general entities, parameter entities are fetched eagerly as soon as an external parameter entity declaration is processed. Therefore, it is an error for the replacement text of a parameter entity to contain unresolved references to (other) parameter entities; references to parameter entities already declared in a prior declaration (in markup declaration text order), on the other hand, are recognized and expanded in parameter entity replacement text.

Parameter entities can also be used for fetching external content when external content can't or shouldn't be fetched multiple times (as is the case for external general entities), for example when fetching an external service response into a parameter entity for multiple references, or when fetching from the standard input or a network stream.

Parameter entity references are expanded in the replacement text for general entities (as well as in any other markup declaration); this means that any parameter entity value can be re-declared (copied) as a general entity by placing a parameter entity reference into the replacement text for a general entity.
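
A minimal sketch of copying a parameter entity value into a general entity; the entity name product and its value are hypothetical. Parameter and general entities have separate name spaces, so both can share a name:

```sgml
<!DOCTYPE doc [
	<!ELEMENT doc - - (#PCDATA)>
	<!ENTITY % product "Example Product">
	<!-- the parameter entity reference is expanded while this declaration is processed -->
	<!ENTITY product "%product;">
]>
<doc>Welcome to &product;</doc>
```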

Note that parameter (or general) entity references aren't expanded in system identifier literals (of markup declarations using external identifiers, such as entity and notation declarations). To construct a system identifier from a parameter entity, an additional, derived parameter entity is declared consisting of a reference to the parameter entity to construct from, with leading and trailing quote characters added; the derived parameter entity is then used as system identifier literal.
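
The construction just described might look as follows. This is a sketch under assumptions: the entity names and the path are made up, and the exact form accepted may depend on the processor:

```sgml
<!-- the value to construct the system identifier from -->
<!ENTITY % target "chapters/intro.sgm">
<!-- derived parameter entity: the value wrapped in quote characters -->
<!ENTITY % quotedtarget '"%target;"'>
<!-- the derived parameter entity stands in for the system identifier literal -->
<!ENTITY intro SYSTEM %quotedtarget;>
```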

See also parameter entity examples.

External Parameter Entities

Like general entity declarations, parameter entity declarations can point to a system identifier (a file or network location to fetch character data from), rather than providing inline replacement text as parameter literal.
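
For example, assuming a hypothetical file common-declarations.dtd that holds shared markup declarations, an external parameter entity can be declared and then referenced to pull those declarations into the declaration set:

```sgml
<!DOCTYPE doc [
	<!-- common-declarations.dtd is an assumed file name -->
	<!ENTITY % common SYSTEM "common-declarations.dtd">
	%common;
	<!ELEMENT doc - - (#PCDATA)>
]>
<doc>Text</doc>
```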

System-specific Entities

An entity declaration with omitted system identifier literal but containing SYSTEM, such as the following

    <!ENTITY ent SYSTEM>

declares an entity which is resolved by default to the filename ent. The file is searched for in the same directory as the file declaring it (the resolved value or the directory to search can be changed using runtime parameters).

Any entity that can be declared as external entity (general, data and parameter entities) can be declared system-specific.

Implied Entities

When IMPLYDEF ENTITY YES is specified in the SGML declaration, general entity references to undeclared entities will be resolved as system-specific entities. This means there is no need to specify an entity declaration at all; entities can be referenced right away provided the entity name can be resolved as a file name, or another resolution rule has been provided as invocation parameter.

Parameter entities, on the other hand, must always be declared. Note, however, that external data text entities can't be declared system-specific.

Data Entities

Entities can be declared to be in a notation as follows (where we first declare a notation to reference its name in the entity declaration):

<!NOTATION somenotation SYSTEM "some-notation-identifier">
<!ENTITY someent SYSTEM "some-entity" NDATA somenotation>

Entities declared like this are not considered SGML character data and won't be expanded into replacement text when used in an entity reference.

Instead, the SGML processor just reproduces entity references to these entities as-is; special processing can be implemented and associated with a notation (ie. with a public identifier of a notation) via notation handlers and the SGML API. A standard notation handler is provided by the templating feature.

Data entities declared using the CDATA or SDATA keywords in place of NDATA, on the other hand, will be expanded into the respective replacement text when used as entity reference:

<!NOTATION somenotation SYSTEM "some-notation-identifier">
<!ENTITY ent SYSTEM "some-entity" CDATA somenotation>

An entity reference to ent will be expanded into the text contained in the "some-entity" file; as with data text entities, special characters such as < or & are escaped in the replacement text, and not treated as markup delimiters.

See also data entity examples.

Providing values for data attributes

If a notation has data attributes, values for the data attributes can (or must, if no #FIXED or default values are provided) be specified as shown in the following example:

<!NOTATION n SYSTEM "some system id">
<!ATTLIST #NOTATION n x CDATA #IMPLIED y CDATA #IMPLIED>
<!ENTITY e SYSTEM "another system id" NDATA n [ x="val1" y="val2" ]>

where the first two declarations establish a notation with data attributes x and y, and the NDATA entity declaration for the e entity demonstrates the syntax for providing data attribute values.

Marked Sections

Marked sections are for including or ignoring, respectively, a portion of SGML prolog or content, optionally depending on the value of a parameter entity.

For example, the following document contains a marked section around a content portion:

<!DOCTYPE test [
	<!ELEMENT test - - (#PCDATA|a)>
	<!ENTITY % condition "INCLUDE">
]>
<test>
	The following hyperlink is included or
	ignored based on the `condition` parameter
	entity:
	<![ %condition [
		<a>Hyperlink text</a>
	]]>
</test>

The SGML processor will reproduce the <a>Hyperlink text</a> text in its output because the effective value of the %condition parameter entity is INCLUDE; if it were IGNORE instead, the document would be treated as if <a>Hyperlink text</a> weren't contained in the document.

Moreover, the document prolog may contain marked sections, too. In the following document, the attribute declaration will only be applied if the condition parameter entity has the value INCLUDE:

<!DOCTYPE test [
	<!ELEMENT test - - (#PCDATA|a)>
	<!ENTITY % condition "INCLUDE">
	<![ %condition [
		<!ATTLIST test testatt CDATA #REQUIRED>
	]]>
]>
<test testatt="some text">Some other text</test>

A further use case for marked sections (CDATA and RCDATA marked sections) is to prevent interpretation of markup delimiters in portions of text.

A marked section

  • begins with the character sequence <![,

  • followed by one or more marked section keywords,

  • followed by the [ character (possibly with whitespace before and/or after),

  • followed by the marked section text, and
  • closed with the character sequence ]]>.

Keywords have the following meaning:

INCLUDE
means the portion wrapped in the marked section will be included; the marked section effectively is replaced by the wrapped marked section text
IGNORE
means the marked section is ignored, ie. skipped
TEMP
is equivalent to IGNORE; offers a way to mark up editorial content such as author comments without having to use IGNORE
CDATA
the marked section text is interpreted as verbatim text without interpreting markup delimiters and entity references
RCDATA
the marked section text is interpreted as verbatim text without interpreting markup delimiters except general and character entity reference start characters
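
As a sketch of the CDATA and RCDATA keywords described above (the entity name name and the element name doc are hypothetical), the first marked section is expected to keep both the tag and the entity reference verbatim, while the second keeps the tag verbatim but expands the entity reference:

```sgml
<!DOCTYPE doc [
	<!ELEMENT doc - - (#PCDATA)>
	<!ENTITY name "World">
]>
<doc>
	<![ CDATA [<b>&name;</b> is kept verbatim, tag and reference included]]>
	<![ RCDATA [<b>&name;</b> keeps the tag verbatim, but expands the reference]]>
</doc>
```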

If no keyword is encountered (ie. if the parameter entity is expanded into blank text or if a construct such as <![[ text ]]> is used), the marked section will be treated as if INCLUDE were specified.

If multiple keywords are encountered (if the parameter entity expands to multiple keywords, or if multiple parameter entities are used, each of which expands into a keyword), the keyword with the highest precedence applies: IGNORE has the highest precedence, followed by TEMP, CDATA, and INCLUDE. For example, if IGNORE is among the keywords, the marked section is treated as if it were an IGNORE section.

Marked sections other than CDATA and RCDATA marked sections can be nested up to four levels (ie. marked section text can contain further marked sections, etc.).

Marked sections can contain any SGML construct valid in the context where the marked section is placed.

Marked sections only apply to top-level SGML constructs, and can't be used within e.g. attributes.

Note that it generally doesn't make sense to create a marked section and use parameter entities to switch parsing behaviours between CDATA and either INCLUDE or IGNORE because of how CDATA marked sections are parsed.

Note that sgmljs.net SGML doesn't support external entities in RCDATA marked sections; as a workaround, it's possible to pull external content into a parameter entity, then reference that parameter entity in the replacement text literal of a general entity, and then reference that general entity in an RCDATA marked section.
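
The workaround can be sketched as follows; the file name snippet.txt and the entity names are assumptions for illustration:

```sgml
<!DOCTYPE doc [
	<!ELEMENT doc - - (#PCDATA)>
	<!-- pull the external content into a parameter entity -->
	<!ENTITY % fetched SYSTEM "snippet.txt">
	<!-- copy the fetched text into the replacement text of a general entity -->
	<!ENTITY snippet "%fetched;">
]>
<doc><![ RCDATA [&snippet;]]></doc>
```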

See marked section examples.

Identifiers

Literals used for file and other resource names of external entities, declaration sets, notations, or other SGML components are called identifiers in SGML terminology.

Apart from system identifiers which were already used above, SGML also has public identifiers. Public identifiers don't name a physically existing or otherwise accessible resource, but identify a symbolic resource known to the SGML processing system out of band instead of or in addition to a system identifier.

For example, the DTD for HTML 4 (containing declarations for all markup features understood by web browsers up until HTML 5 became generally accepted as standard) can be referenced via the public identifier -//W3C//DTD HTML 4.01//EN without reference to a physical location of a DTD file. Using a public, well-known identifier for this purpose is appropriate since a web browser is usually hard-coded to interpret a particular markup language (such as HTML, SVG, and MathML), and isn't designed to render dynamic markup languages at runtime. Using a system identifier, on the other hand, isn't beneficial here since it would have to be treated as a constant rather than an actually accessible resource by browsers anyway.

Since the introduction of SGML, Uniform Resource Locators (URLs) and variants have become widely used for locating and identifying resources on the web, similar to the purpose of system and public identifiers, respectively. Hence, SGML has been extended to allow the use of URLs as both system and public identifiers. While any URL can be used as system identifier as long as a resource can be located using it, public identifiers also need to include an owner identifier as a prefix which identifies a naming authority and a public text type which identifies the role of the virtual resource identified within a DTD. Therefore, URLs for public identifiers (e.g. for formal public identifiers) are required to have the particular syntax described below.

The following examples show how to declare entities with public identifiers, with system identifiers, and with both public and system identifiers, respectively:

<!ENTITY ent PUBLIC "pubid">
<!ENTITY ent SYSTEM "sysid">
<!ENTITY ent PUBLIC "pubid" "sysid">

Declarations for notations with public, with system, and with both public and system identifiers look very similar:

<!NOTATION n PUBLIC "pubid">
<!NOTATION n SYSTEM "sysid">
<!NOTATION n PUBLIC "pubid" "sysid">

System Identifiers

In most cases, a system identifier is just a path string such as "a/b/c" (using the forward-slash character as separator). As with URLs used in HTML href or src attributes, the path is resolved relative to the SGML document or DTD from which it is referenced. Hence, a path string can be used both to reference a file (when processing a local SGML file) and a resource accessed via e.g. the HTTP protocol (when accessing a remote SGML document via a network).

Formal System Identifiers

Apart from using just a path, a system identifier can also be

  • a URL (but note that accessing resources via arbitrary URL schemes isn't supported), or
  • a string beginning with <osfile>, followed by a string interpreted as file name; this option is used to override interpretation of the identifier as URL path (such as when interpretation of URL percent-encoding is undesired)

  • a string beginning with <osfd>, followed by a file descriptor number in the range 0-4; the purpose and usage of this syntax is explained in templating.

  • for an entity declaration, a string beginning with <literal>, followed by literal replacement text for the entity; this form of system identifier is functionally equivalent to using an inline parameter literal as replacement text in an entity declaration

    For example, the following declarations result in general entities expanding to the same value:

    <!entity e "replacement text">
    <!entity f system "<literal>replacement text">
    

Note: the syntax for formal system identifiers was introduced with the HyTime standard, 2nd Ed., and is also supported by other SGML processing systems. It models system identifiers as pseudo-elements and extends well to further system identifier use cases/features (e.g. by allowing pseudo-attributes).

Note that formal system identifiers might not be supported in all system identifier roles.
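
As a sketch of the <osfile> form listed above, the following two declarations name the same hypothetical path; the first is subject to interpretation as a URL path, while the second asks for the string to be taken as a file name as written:

```sgml
<!-- plain path: treated like a URL path, so percent-encoding may be interpreted -->
<!ENTITY a SYSTEM "reports/100%25.txt">
<!-- <osfile> prefix: the remainder is interpreted as a file name -->
<!ENTITY b SYSTEM "<osfile>reports/100%25.txt">
```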

Public Identifiers

A public identifier is a sequence of the ASCII characters A through Z, a through z, the decimal digits, the characters ' (apostrophe), (, ), +, , (comma), . (dot), /, :, =, ?, -, and the space, newline and carriage-return characters.

Formal Public Identifiers

If FORMAL YES is specified in the SGML declaration, a public identifier must have the following syntax:

  1. Owner identifier

    • Either the string "ISO" followed by a string made of digits and : (colon) characters, followed by the string //

    • or the characters + or - followed by the string //, followed by a string not containing the / (slash) character, followed by the string //

  2. Public text class

    One of the following strings, directly following the preceding string //:

    • CAPACITY

    • CHARSET

    • DOCUMENT

    • DTD

    • ELEMENTS

    • ENTITIES

    • LPD

    • NONSGML

    • NOTATION

    • SHORTREF

    • SUBDOC

    • SYNTAX

    • TEXT

  3. Public text description

    A string of characters not containing the / (slash) character

  4. The string //

  5. Public text designating sequence (for CHARSET public text) or public text language (for other public text classes)

    A string of characters not containing the / (slash) character;

  6. The string //

  7. Public text display version

    A string of characters not containing the / (slash) character

Except for CHARSET public text, the components following public text description are optional; if the optional components are omitted, the public identifier ends in the public text description.

The public text display version is optional; if the public text display version is omitted, the public identifier ends in the public text designating sequence or public text language.

As examples of public identifiers, here are the public identifiers of HTML 4 and SGML, respectively:

-//W3C//DTD HTML 4.01//EN

ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)
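
To make the components concrete, the following sketch breaks the HTML 4 public identifier down in a comment and then uses it as the identifier of a document type declaration; the - prefix marks an unregistered owner identifier, the + prefix a registered one:

```sgml
<!-- components of the identifier used below:
     "-//W3C"    owner identifier (unregistered owner W3C)
     "DTD"       public text class
     "HTML 4.01" public text description
     "EN"        public text language
     (no public text display version is given) -->
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
```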

URN syntax for formal public identifiers

If URN YES is specified in the SGML declaration, any public identifier in a markup declaration can also be declared using an alternative URN syntax (in addition to the standard syntax for public identifiers when FORMAL YES is specified).

Examples for the public identifiers in URN syntax corresponding to those in standard public identifier syntax above are as follows:

urn:publicid:-:W3C:DTD+HTML+4.01:EN

urn:publicid:ISO+8879%3A1986:NOTATION+Standard+Generalized+Markup+Language+(SGML)

Declaration Sets

Each markup declaration is part of a declaration set. A declaration set is either a document type declaration set or a link process declaration set (until now we have only considered document type declaration sets, see Templating Reference for link process declaration sets).

Any SGML document prolog consists of one or more named declaration sets, as in the following example:

    <!DOCTYPE D [
        ... markup declarations ...
    ]>
    <!DOCTYPE E [
        ... markup declarations ...
    ]>
    ... document content ...

Note that standard SGML only allows multiple DTDs to occur if either the CONCUR YES or LINK IMPLICIT YES or LINK EXPLICIT YES n features are active in the SGML declaration.

Internal and external subset

Via parameter entities, a declaration subset can reference markup declarations stored in other text files or external resources. In the following example, markup declarations from the e parameter entity, declared to contain the content of external-declarations.dtd, are included in the DTD:

<!doctype D [
	... other declarations ...

	<!entity % e system "external-declarations.dtd">
	%e
]>
... document content ...

The following is an alternative syntax for achieving the same:

<!doctype D system "external-declarations.dtd" [
	... other declarations ...
]>
... document content ...

An identifier specification such as system "external-declarations.dtd" following after <!doctype D is called the external subset identifier and is interpreted by SGML such that the markup declarations located or identified by it are included at the end of the declaration set being declared.

The set of markup declarations introduced via an external subset identifier is called the external subset, as opposed to the internal subset, which is the set of markup declarations contained in the [ and ] delimiters of a DTD (or LPD).

As a consequence of the external subset being processed after the internal subset, the internal subset can preempt ("override") entity declarations (but not other markup declarations) in the external subset. In the following example, the x entity is declared both in the internal-subset-preemption-example.sgm document and in external-declarations.dtd:

<!-- external-declarations.dtd -->
<!entity x "This">

<!-- internal-subset-preemption-example.sgm -->
<!doctype d system "external-declarations.dtd" [
  <!entity x "That">
]>
<d>&x</d>

The declaration in the internal subset gets processed first and sets the replacement text value for x (to That); the declaration for x in external-declarations.dtd is ignored, because a declaration for x is already established when external-declarations.dtd is processed.

Note that the term "parameter entity" is due to this feature; it emphasizes that the internal subset "parametrizes" or "configures" external subset defaults such that settings more specific to the document instance apply.

SGML allows the external subset to be specified by any kind of identifier, ie. allows it to be specified as a public identifier (or as both a system and public identifier), but sgmljs.net SGML can't resolve public identifiers for external subsets except for the HTML DTD described in SGML Web Reference, and always requires a system identifier otherwise.

Though, technically, the result of using an external subset specification is the same as that of using an explicit parameter entity declaration and reference as in the initial example, applications may treat the syntactic representation as an external subset identifier specially; for example, in (the "lax" variant of) templating, only the formal external subset identifier, rather than merely an identifier for a named parameter entity, establishes eligibility of document fragments for inclusion into master documents.

SGML declaration

Using an optional SGML declaration, it's possible to specify general properties of a document instance such as its character set, the characters used for markup delimiters, whether, and which, markup minimization features such as tag omission are used, among other things.

An SGML declaration body is a piece of plain text (as described in detail further below) contained either directly in a document instance (in an SGML declaration at the beginning of the document instance) or stored in an external entity and referenced via an identifier in an SGML declaration reference at the beginning of the document instance.

A conformant SGML processor isn't required to be able to process an SGML declaration; if it isn't, the information contained in an SGML declaration is provided for manual inspection and comparison against the SGML declaration(s) and features supported by the processing system.

sgmljs.net SGML, as much as possible, is designed to avoid the necessity of having to bother with SGML declarations, by

  • inferring applicable SGML declarations from file name suffixes or other out-of-band information such as HTTP/IANA media types
  • supporting the use of SGML declaration references in place of full SGML declarations (as described below)
  • allowing XML declarations to act as SGML declaration.

Note that certain sgmljs.net SGML tools/builds lack support for parsing SGML declarations altogether.

The section Basic SGML declaration below shows the beginning of a document instance with the traditional basic SGML declaration, asserting to the processor that the document instance uses the reference concrete syntax and other basic settings.

sgmljs.net SGML, while accepting the Basic SGML declaration below, doesn't support all features requested in this declaration (namely certain markup minimization forms requested with SHORTTAG YES mostly interesting from a historical perspective) and can't claim full conformance insofar as support for these legacy features is mandated for conformance.

For sgmljs.net SGML, the preferred SGML declaration syntax to use is the one introduced with the WebSGML (ISO 8879:1986 Annex K) revision of SGML as explained further below, which can express presence or absence of legacy features on a more granular level, and hence can more readily represent sgmljs.net SGML's feature set.

The Annex K revision of SGML has extended both the SGML declaration syntax and that of markup declarations; use of any WebSGML extension is indicated by using "ISO 8879:1986 (WWW)" as minimum data/literal for the initial part of an SGML declaration body. The WebSGML additions include essential changes for parsing XML and HTML. In sgmljs.net SGML, WebSGML additions are available even in the absence of an SGML declaration.

Note that SGML declaration settings are only discussed insofar as they are supported by sgmljs.net SGML:

  • sgmljs.net SGML is designed to accept markup in the reference concrete syntax (with supported WebSGML additions to cover XML and HTML as explained below), which covers basically all angle-bracket markup languages, including the fundamental syntax of XML and HTML

  • while the SGML declaration in principle allows redefinition of function characters and delimiters (such as the < character), and of reserved names, this isn't supported by sgmljs.net SGML

  • UTF-8-encoded document instances are the only ones consistently supported across all regular sgmljs.net SGML builds/targets.

Basic SGML declaration

Note that an SGML declaration is rarely used in the basic form given below, even in English-speaking countries, because of its restriction of the set of usable characters in the document instance to just the IRV/ASCII characters.

Whitespace (space, tab, and newline characters), as well as SGML comments (text between -- character sequences), aren't significant in SGML declarations and are only provided for formatting (in the sense that any whitespace sequence can be replaced by a single space character).

<!SGML "ISO 8879:1986"
	CHARSET
		BASESET "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
		DESCSET
			0 9 UNUSED
			9 2 9
			11 2 UNUSED
			13 1 13
			14 18 UNUSED
			32 95 32
			127 1 UNUSED
	CAPACITY PUBLIC "ISO 8879:1986//CAPACITY Reference//EN"
	SCOPE DOCUMENT
	SYNTAX PUBLIC "ISO 8879:1986//SYNTAX Reference//EN"
	FEATURES
		MINIMIZE
			DATATAG NO
			OMITTAG YES
			RANK NO
			SHORTTAG YES
		LINK
			SIMPLE NO
			IMPLICIT NO
			EXPLICIT NO
		OTHER
			CONCUR NO
			SUBDOC NO
			FORMAL NO
	APPINFO NONE>
<!-- document prolog and content following here ... -->
"ISO 8879:1986" (minimum data)
indicates the SGML declaration syntax revision being used

sgmljs.net SGML supports all released revisions (ie. also "ISO 8879:1986 (ENR)" or "ISO 8879:1986 (WWW)" in addition to "ISO 8879:1986")

the "ISO 8879:1986 (ENR)" minimum literal asserts that the extensions to the SGML declaration syntax introduced with ISO 8879 Annex J can be used; see Extended naming rules

"ISO 8879:1986 (WWW)" asserts that, in addition, those from ISO 8879 Annex K can be used; see WebSGML

for sgmljs.net SGML, use of "ISO 8879:1986 (WWW)" is always recommended

CHARSET BASESET ... (document base character set)
this asserts that IRV/ASCII character set is used in the document instance as base character set

the literal is a formal public identifier containing the public text designating sequence ESC 2/5 4/0 representing the escape sequence to (virtually) enable the IRV coding system

SGML uses the escape sequences registered with the International register of character sets to be used with escape sequences (which complies with ISO 2022:1986 and ISO/IEC 2022:1994, respectively) to identify character sets

sgmljs.net SGML recognizes the public identifiers for character sets as listed in Base Character Set

for an in-depth description, see also Character Sets and Encodings

DESCSET ... (described set of the document base character set)
represents the "described set" of characters (of the base character set) used in a document instance

in the basic SGML declaration above, contains a list of character ranges with the following meaning: 0 9 UNUSED means the character number 0 through 8 (9 characters) are unused, ie. asserted not to occur in a document instance; 9 2 9 means the character numbers 9 and 10 (2 characters) should be treated as character number 9 (the tab character), and 10, respectively (the last text token in a described set portion, if it is a number, is interpreted to mean that the described character range is mapped to the range starting at the specified number); and similar for the other described set portions

the majority of characters of the set is described in portion 32 95 32, meaning the character range 32 through 126 (95 characters) is mapped "to itself" (ie. to the range starting at 32)
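
As a worked reading of the described set, the fragment from the basic SGML declaration is repeated here with SGML comments added; each portion starting at character number s and spanning n characters covers the numbers s through s + n - 1:

```sgml
DESCSET
	0    9 UNUSED -- characters 0 through 8 are unused --
	9    2 9      -- characters 9 and 10 map to 9 and 10 --
	11   2 UNUSED -- characters 11 and 12 are unused --
	13   1 13     -- character 13 maps to itself --
	14  18 UNUSED -- characters 14 through 31 are unused --
	32  95 32     -- characters 32 through 126 map to themselves --
	127  1 UNUSED -- character 127 is unused --
```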

CAPACITY PUBLIC "ISO 8879:1986//CAPACITY Reference//EN" (capacity set)
contains a public identifier for the reference capacity set
a capacity set contains upper bounds for global run-time capacities the processing system is expected to arrange for, such as the maximal number of entities declared in a document instance
these parameters can also be declared directly in the SGML declaration body; see below for an example
these parameters are ignored by sgmljs.net SGML but are honored and checked against actual use in document prologs by eg. (Open)SP SGML
SCOPE DOCUMENT (concrete syntax scope)
asserts that the document character set is used both in the prolog as well as in content

for all intents and purposes, SCOPE DOCUMENT is always the setting used for document instances processed with sgmljs.net SGML (SCOPE SYNTAX is only of historic interest)

for an explanation of the concept of a syntax character set (as opposed to the document character set), see below
SYNTAX PUBLIC "ISO 8879:1986//SYNTAX Reference//EN" (concrete syntax)
is a reference to the public identifier of the concrete syntax, the content of which is explained below
FEATURES MINIMIZE ... (minimization features)
contains the minimization features asserted to be used by the document instance

note that while SHORTTAG YES is accepted by sgmljs.net SGML, the only form of short tag minimization supported by sgmljs.net SGML is SGML's so-called Null-end tag and NET-enabling Start-tag minimization, and only insofar as it is necessary to support XML-style empty elements

FEATURES LINK ... (link type features)
contains the link type (LPD) features asserted to be used by the document instance
sgmljs.net SGML supports all link types, but at most 2 simultaneously active implicit link types by default

for an explanation of link type processing, see Templating

FEATURES OTHER ... (other features)

of the "other" features, sgmljs.net SGML only supports FORMAL NO and FORMAL YES (and WebSGML's URN YES as explained with other WebSGML additions)

Concrete Syntax

The concrete syntax fragment (the SYNTAX ... portion as shown above) references a public identifier which acts as if it contained the following code text (which could also be pasted verbatim in place of the concrete syntax fragment above for the same effect):

SYNTAX
	SHUNCHAR
		CONTROLS
		0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
		18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
	BASESET "ISO 646IRV:1991//CHARSET
		 International Reference Version (IRV)//ESC 2/8 4/2"
	DESCSET 0 128 0
FUNCTION
	RE            13
	RS            10
	SPACE         32
	TAB SEPCHAR    9
NAMING  LCNMSTRT ""
	UCNMSTRT ""
	LCNMCHAR ".-_:"
	UCNMCHAR ".-_:"
	NAMECASE GENERAL YES
	         ENTITY  NO
DELIM	GENERAL  SGMLREF
	SHORTREF SGMLREF
NAMES	SGMLREF
QUANTITY SGMLREF
SYNTAX SHUNCHAR ...

contains a list of shunned characters; for the purpose of this exposition, these are the same as those marked UNUSED in the described syntax character set

the set of shunned characters includes the IRV/ASCII control characters (CONTROLS)

SYNTAX BASESET .../DESCSET ...

contains the syntax-reference character set (the character set used to describe the concrete syntax); the general construction of the described set from character ranges in the base set is analogous to that of the document character set

SYNTAX FUNCTION ...
contains assignments of SGML delimiter function roles to characters
SYNTAX NAMING ...

defines the characters accepted for name tokens and the rules for case-folding

NAMECASE GENERAL YES ENTITY NO activates SGML's traditional case-folding behaviour (namely that elements, attributes, and all other name tokens except entity names, for the purpose of validation and tag inference, are treated as if specified in uppercase letters, even if specified in lower- or mixed case in content)

see also below
SYNTAX DELIM GENERAL ...

contains assignments of characters to delimiter roles such as needed for the < character to be interpreted as a STAGO (start-tag open) delimiter

SGMLREF selects the standard delimiters (which assign the STAGO delimiter as described and expected in most markup languages)

SGMLREF is assumed

SYNTAX DELIM SHORTREF
contains assignments of characters to shortref delimiter roles; note that sgmljs.net SGML doesn't support user-definable shortref delimiters

SGMLREF selects the standard shortref delimiters

SYNTAX NAMES SGMLREF

asserts that the standard reserved keywords (such as DOCTYPE, ELEMENT) are used in markup declarations

only SGMLREF as specified here is supported in sgmljs.net SGML

SYNTAX QUANTITY ...
contains declarations of upper bounds for certain quantities asserted by a document instance, such as the maximal number of attributes declared on an element
these parameters are ignored by sgmljs.net SGML, but are honored and checked against actual use in document prologs by eg. (Open)SP SGML

WebSGML extensions

WebSGML delimiters

WebSGML extends the syntax for the delimiter section in the SGML declaration (ie. adds the HCRO and NESTC delimiters):

DELIM	GENERAL  SGMLREF
	HCRO     "&#38;#x" -- ampersand --
	NESTC    "/"
	NET      ">"
	...
DELIM GENERAL HCRO

the HCRO delimiter is used in numeric character references to indicate that the number portion is interpreted as a hexadecimal rather than a decimal literal (and to allow the letters A through F and a through f to occur in numeric character references); for example, &#xa; represents the U+000A LINE FEED character

in sgmljs.net SGML, the HCRO delimiter cannot be redeclared
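
The difference between decimal and hexadecimal numeric character references can be illustrated with a minimal Python sketch. It is not part of sgmljs.net SGML; the function name decode_numeric_refs is hypothetical, and the sketch assumes that references are always closed with ";" and that the HCRO string is matched literally with a lowercase x:

```python
import re

# "&#" followed by digits is a decimal reference (CRO);
# "&#x" followed by hex digits is a hexadecimal reference (HCRO).
NUMERIC_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_refs(text):
    """Replace numeric character references by the characters they denote."""
    def replace(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        if hex_digits is not None:
            number = int(hex_digits, 16)   # HCRO: hexadecimal literal
        else:
            number = int(dec_digits, 10)   # CRO: decimal literal
        return chr(number)                 # character number -> character
    return NUMERIC_REF.sub(replace, text)


# Both forms below denote U+000A LINE FEED.
assert decode_numeric_refs("&#xa;") == decode_numeric_refs("&#10;") == "\n"
```
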

DELIM GENERAL NESTC

the NESTC delimiter is introduced to capture XML's empty element syntax within SGML's definitorial framework with respect to delimiter characters

while use of the NESTC delimiter nominally depends on the definition of NET delimiter (the null-end tag delimiter), and changing either the declaration for NESTC or NET, or both, is admitted in SGML in general, sgmljs.net SGML requires that the delimiter roles for NESTC and NET, if assigned at all, match those given above

For all intents and purposes within sgmljs.net SGML, for processing XML-style empty elements (including bogus XML-like empty elements in HTML), the delimiter section should be treated as an opaque string and must have exactly (up to space characters and comments) the form given above.

Predefined character entities

WebSGML adds a facility to define character entities without using entity declarations as a means to capture XML's and HTML's behaviour in this respect.

For example, the predefined character entities and their represented character numbers for XML are as follows:

ENTITIES
	"amp"  38
	"lt"   60
	"gt"   62
	"quot" 34
	"apos" 39

See the SGML declaration for XML for a complete example of an SGML declaration making use of predefined character entities.

Note that ISO 8879 Annex K requires that all mapped-to characters are contained in the syntax-reference character set, not just the document character set.
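
As an illustration of what the ENTITIES section expresses, the following Python sketch resolves references to the five predefined XML entities using the character numbers listed above. The names PREDEFINED_ENTITIES and expand_predefined are hypothetical, and the regular expression used for entity names is a simplification, not the naming rules of any particular SGML declaration:

```python
import re

# Entity name -> character number, as in the ENTITIES section for XML.
PREDEFINED_ENTITIES = {"amp": 38, "lt": 60, "gt": 62, "quot": 34, "apos": 39}

# Simplified pattern for a general entity reference such as &amp;
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9._:-]*);")


def expand_predefined(text):
    """Replace references to predefined entities; leave all others as-is."""
    def replace(match):
        number = PREDEFINED_ENTITIES.get(match.group(1))
        if number is None:
            return match.group(0)   # not predefined: needs a declaration
        return chr(number)
    return ENTITY_REF.sub(replace, text)


assert expand_predefined("a &lt; b &amp;&amp; c &gt; d") == "a < b && c > d"
```

Replacement text is not rescanned in this sketch, so &amp;lt; yields the four characters &lt; rather than <.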

WebSGML Features

WebSGML's extensions to the FEATURES section (only mentioned as far as supported in sgmljs.net SGML) include

  • unbundling of SHORTTAG minimization features, meaning that certain shorttag minimization features can be switched on individually, rather than just collectively via FEATURES MINIMIZE SHORTTAG YES (which switches on all shorttag minimization features, among them those only used in historic shortform practices)

  • MINIMIZE IMPLYDEF ... options to allow WebSGML to process document instances lacking declarations for elements, attributes, and other components declarable in DTDs (which was generally not possible prior to the Annex K SGML revision)

A WebSGML FEATURES declaration portion can look as follows:

FEATURES

	MINIMIZE
		DATATAG NO
		OMITTAG NO
		RANK    NO
		SHORTTAG
			STARTTAG
				EMPTY    NO
				UNCLOSED NO
				NETENABL IMMEDNET
			ENDTAG
				EMPTY    NO
				UNCLOSED NO
		ATTRIB
			DEFAULT  YES
			OMITNAME NO
			VALUE    NO
		EMPTYNRM YES

	IMPLYDEF
		ATTLIST  YES
		DOCTYPE  NO
		ELEMENT  YES
		ENTITY   NO
		NOTATION YES

	LINK
		SIMPLE   NO
		IMPLICIT NO
		EXPLICIT NO

	OTHER
		CONCUR   NO
		SUBDOC   NO
		FORMAL   NO
		URN      NO
		KEEPRSRE YES
		VALIDITY NOASSERT
		ENTITIES
			REF      ANY
			INTEGRAL YES
FEATURES MINIMIZE SHORTTAG STARTTAG EMPTY NO UNCLOSED NO
asserts a document instance's use of certain historic shortform syntax

for sgmljs.net SGML, these must be NO

FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET

expresses, together with DELIM GENERAL NESTC and DELIM GENERAL NET as explained above, that an XML-style empty element is recognized as a short form of specifying the equivalent sequence of a start- and an end-element tag

note that, in addition, FEATURES MINIMIZE EMPTYNRM YES must be declared for being able to use XML-style empty-elements for elements with declared content EMPTY

sgmljs.net SGML accepts this setting only in combination with the settings for DELIM GENERAL NESTC and DELIM GENERAL NET as discussed above
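
The equivalence between an XML-style empty element and a start-element tag followed by an end-element tag can be sketched in Python as follows. This is a deliberately naive, regex-based illustration (the name expand_empty_elements is hypothetical); it assumes that attribute values contain neither < nor >, and it is not how an SGML parser recognizes NESTC and NET:

```python
import re

# Start-tag whose NESTC "/" is immediately followed by NET ">".
EMPTY_ELEMENT = re.compile(r"<([A-Za-z][A-Za-z0-9._:-]*)((?:\s[^<>]*?)?)/>")


def expand_empty_elements(markup):
    """Rewrite <name attrs/> into the equivalent <name attrs></name>."""
    return EMPTY_ELEMENT.sub(r"<\1\2></\1>", markup)


assert expand_empty_elements("<p>a<br/>b</p>") == "<p>a<br></br>b</p>"
```
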

FEATURES MINIMIZE SHORTTAG ENDTAG EMPTY NO UNCLOSED NO

these features (also for supporting historic markup shortform practices) must both have the value NO for sgmljs.net SGML

FEATURES MINIMIZE ATTRIB DEFAULT
expresses whether default values can be omitted in attribute specifications (ie. with the expectation that default values as declared in the attribute declarations are implied)
FEATURES MINIMIZE ATTRIB OMITNAME

expresses whether attribute names (and the VI delimiter) can be omitted in attribute specifications (ie. as in using name tokens for enumerated attributes, provided a name token can be uniquely identified among those declared on the attributes of an element, including those declared on #ALL elements)

FEATURES MINIMIZE ATTRIB VALUE

expresses whether quotation characters (LIT and LITA delimiters) can be omitted around attribute values consisting entirely of name characters, even on undeclared attributes

FEATURES EMPTYNRM YES/NO

expresses whether elements with declared content EMPTY or implied-EMPTY elements (those having a content reference attribute specified) are allowed to have end-element tags (if YES)

FEATURES IMPLYDEF ATTLIST YES/NO

expresses that it isn't an error to specify undeclared attributes (if YES)

an undeclared attribute is treated as if it were declared CDATA #IMPLIED

FEATURES IMPLYDEF DOCTYPE YES/NO

expresses that it isn't an error if a document type declaration is absent from a document instance (if YES)

if FEATURES IMPLYDEF DOCTYPE YES is declared, and a document type declaration is absent, the document instance is treated as if <!DOCTYPE #IMPLIED SYSTEM> were present; the external subset is retrieved by forming a system identifier from the document element (the first element encountered), subject to SYNTAX NAMECASE GENERAL, then appending .dtd to it, and interpreting the resulting string as system identifier relative to the system identifier of the document instance being processed

note that if FEATURES IMPLYDEF ELEMENT YES is declared, then a document type declaration is also allowed to be absent; but if, in addition, IMPLYDEF DOCTYPE NO is declared, an absent document type declaration is treated as if <!DOCTYPE #IMPLIED> had been specified (that is, its external subset is assumed to be empty)
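
The derivation of the implied external subset's system identifier described above can be sketched in Python. The function name implied_dtd_system_id is hypothetical, and the sketch makes two assumptions that are not confirmed by this text: that case folding under NAMECASE GENERAL YES yields the uppercase form of the document element name in the resulting identifier, and that system identifiers are plain slash-separated relative paths:

```python
import posixpath


def implied_dtd_system_id(document_system_id, document_element,
                          namecase_general=True):
    """Form the system identifier of the implied external subset."""
    # Assumption: NAMECASE GENERAL YES folds the name to uppercase.
    name = document_element.upper() if namecase_general else document_element
    # Append ".dtd" and resolve relative to the document's own identifier.
    directory = posixpath.dirname(document_system_id)
    return posixpath.join(directory, name + ".dtd")


assert implied_dtd_system_id("docs/page.sgm", "article") == "docs/ARTICLE.dtd"
assert implied_dtd_system_id("docs/page.sgm", "article", False) == "docs/article.dtd"
```
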

FEATURES IMPLYDEF ELEMENT YES/NO/ANYOTHER

expresses that it isn't an error to use an undeclared element (if YES or ANYOTHER)

moreover, expresses that it isn't an error if a document type declaration isn't present; see above

FEATURES IMPLYDEF ELEMENT YES has the effect that undeclared elements are implied as if declared - O ANY

FEATURES IMPLYDEF ELEMENT ANYOTHER expresses that, in addition, directly nesting undeclared elements isn't intended for the document instance, and has the effect that an end-element tag (closing the open element) before a start-element tag is inferred, if the element beginning with the start-element would otherwise be treated as direct child content of an element with the same element name
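
The end-element tag inference described for IMPLYDEF ELEMENT ANYOTHER can be sketched over a simplified stream of tag events. The event representation and the name infer_end_tags are assumptions made for illustration only; the sketch covers just the rule that an undeclared element is not directly nested within an element of the same name:

```python
def infer_end_tags(events):
    """events: list of ("start", name) or ("end", name) tuples
    for undeclared elements; returns the list with inferred end tags."""
    open_elements = []   # stack of currently open element names
    output = []
    for kind, name in events:
        if kind == "start":
            # A start tag that would become a direct child of an element
            # with the same name closes that element first.
            if open_elements and open_elements[-1] == name:
                output.append(("end", name))
                open_elements.pop()
            open_elements.append(name)
            output.append(("start", name))
        else:
            if open_elements and open_elements[-1] == name:
                open_elements.pop()
            output.append(("end", name))
    return output


# <item><item></item> is treated like <item></item><item></item>
assert infer_end_tags([("start", "item"), ("start", "item"), ("end", "item")]) == [
    ("start", "item"), ("end", "item"), ("start", "item"), ("end", "item")]
```
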

FEATURES IMPLYDEF ENTITY YES/NO

expresses that an entity reference for an undeclared entity is treated as if it were declared system-specific (ie. declared <!ENTITY ... SYSTEM>)

the data character content of the entity reference, if it is used in a parameter or general entity reference (other than a data text entity reference), is retrieved by interpreting the entity name (subject to SYNTAX NAMECASE ENTITY) as system identifier

FEATURES IMPLYDEF NOTATION YES/NO
expresses that it isn't an error if an undeclared notation is used
ignored by sgmljs.net SGML (notations must always be declared)
FEATURES OTHER URN YES/NO

if FEATURES OTHER FORMAL YES is declared (and if ISO 8879:1986 (WWW) is used as minimum data), only then can FEATURES OTHER URN YES also be used

FEATURES OTHER URN YES enables the URL/URN syntax for public identifiers (as an alternative to the standard formal public identifier syntax)

FEATURES OTHER KEEPRSRE YES/NO
this has the effect that SGML's traditional behaviour with respect to suppression of newlines and space characters is switched off
SGML's traditional behaviour (simplified) is, in text records consisting entirely of a start-element tag, character data, and an end-element tag for the same element as in the start-element tag, not to report initial space and trailing space characters, including the trailing RE (newline) character, as data characters, when the start- and end-element tag is for a declared (rather than included) element (ie. because such space and newline characters are considered insignificant, and present for markup text formatting purposes only, in such a way that every markup element is started on its own line)

sgmljs.net SGML only supports YES as the value for this setting, meaning that all space and newline characters will always be reported as character data (note that behaviour for supporting KEEPRSRE NO isn't specified for undeclared elements in WebSGML)

FEATURES VALIDITY TYPE/NOASSERT
asserts that the document is considered valid with respect to the notions expressed next, and indicates the validation level that sgmljs.net SGML should apply

VALIDITY NOASSERT means that no content model validation, but only balancedness-checking (appealing insofar to XML's wellformedness criteria) is performed; VALIDITY TYPE means that regular validation and tag inference is performed

FEATURES OTHER ENTITIES ...
asserts certain characteristics of a document's use of entities appealing to notions introduced with XML
ignored by sgmljs.net SGML

Character Sets and Encodings

A character set, in SGML, is a mapping of character numbers to characters. A character can be referred to by a character name such as those used by ISO 10646 (aka. Unicode); for example the character rendered in this text just here: &, can be referred to as the character named AMPERSAND.

Having a name, a character, in SGML and also ISO 10646, is considered a concept existing independently of its conventional graphical rendering and of its character number in a particular character set. But in order to refer to a character, a character number and thus, a definition of a character set must be established in a context where it's impractical to refer to characters by their names.

Hence, while SGML conceptually defines an abstract syntax as a mapping of characters (rather than character numbers) to markup delimiters and other function roles, an abstract syntax can only be expressed as a mapping from character numbers in a character set to markup delimiters and other character roles.

SGML uses the term syntax-reference character set to refer to the character set used in an SGML declaration body for assigning meaning to characters as SGML markup delimiters and other character function roles. Furthermore SGML uses the term concrete syntax to refer to the larger portion of an SGML declaration which contains the mapping, including the declaration of the syntax-reference character set it is using.

Given a concrete syntax, an SGML parser is supposed to assess the characters represented by input character data (using the document character set), then assess whether the concrete syntax assigns a delimiter or other role to them (depending on context). For the latter, the SGML parser must map a character presented to it in the document character set to the equivalent character in the syntax-reference character set. The SGML declaration itself doesn't contain a mapping between character sets, hence the SGML parser must rely on built-in character set information available to it.

Thus, even if the syntax-reference base character set can theoretically be different from the document base character set (unless the concrete syntax is embedded in the document instance itself, see below), an SGML parser must still be able to establish a mapping for all characters in the document base character set to a character in the syntax-reference base character set.

SGML was originally devised at a time when a generally accepted character set wasn't yet established for referring to characters. Today, of course, the Universal Coded Character Set (UCS, defined in ISO/IEC 10646, and also known as Unicode) is used for this purpose. Since ISO/IEC 10646 contains over 120,000 code points (character numbers), if it is used as a document instance's base character set (which it should be), there just doesn't exist a character set other than UCS itself with the same coverage. For this reason, the distinction between document and syntax-reference character set is irrelevant in practice, but nevertheless requisite knowledge to explain the character set notions in the SGML declaration.

Nevertheless, some of the concepts related to constructing a customized described set by remapping UCS character planes or communicating the purpose of private-use characters can be useful for special applications (ie. precisely because of its coverage, merely specifying UCS as document character set isn't helpful in communicating which character ranges are actually used in a document or required for a particular application, font face or variant, printer or other equipment, vertical, etc.). Note, however, that sgmljs.net SGML doesn't include integrated facilities for checking and/or remapping in regular builds.

For all intents and purposes within sgmljs.net SGML, the Universal Coded Character set is used as a base character set for both the document as well as syntax-reference character set.

In the basic SGML declaration above, the International Reference Version character set is used (which is the only character set supported by regular sgmljs.net SGML builds in addition to UCS). International Reference Version or IRV is the term used in international specifications to refer to the ISO/IEC 646 character set known as US-ASCII (technically, the version referenced by ISO 646:1983 differs from US-ASCII, and from that referenced by ISO 646IRV:1991, but not in a way relevant to SGML). IRV contains the first 128 code points of UCS, which uses the usual encoding of the US-ASCII character set into bytes interpreted as binary numbers for its character numbers.

Document encoding

A character number is different from a representation of a character as a bit pattern within a particular character encoding such as UTF-8 (even though a character number can be algorithmically determined from a UTF-8 representation); rather, character numbers and character sets are purely organizational concepts to identify and otherwise refer to characters in general.
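
The distinction between a character number and its encoded representation can be made concrete with a few lines of Python, using only the standard library. The character chosen is an arbitrary example:

```python
import unicodedata

char = "\u00e9"                      # LATIN SMALL LETTER E WITH ACUTE

# The character number (UCS code point) is independent of any encoding.
number = ord(char)                   # 233

# The same character has different byte representations per encoding.
utf8_bytes = char.encode("utf-8")          # two bytes: C3 A9
latin1_bytes = char.encode("iso-8859-1")   # one byte:  E9

# Decoding either byte sequence yields the same character number again.
assert ord(utf8_bytes.decode("utf-8")) == number
assert ord(latin1_bytes.decode("iso-8859-1")) == number

# Characters also have names that are independent of their numbers.
assert unicodedata.name("&") == "AMPERSAND"
```
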

For being able to read an SGML declaration, of course, an SGML parser must be able to interpret the bytes of an entity according to an encoding of a character set. The character set encoding of a document instance can't be meaningfully stated in its SGML declaration (if it has one), if the SGML declaration is part of the document instance itself (ie. because the SGML declaration must use the same encoding as the document instance, hence the processor still needs additional out of band information with respect to the encoding, else wouldn't be able to read the SGML declaration).

Having to deal with SGML declarations, which are a somewhat archaic, but in any case inconvenient format for conveying processing parameters to an SGML processor, only to find out that such a basic fact about a document instance as its character encoding can't effectively be expressed using it is considered unfortunate. Moreover, having to resort to out-of-band information such as command line processing options or similar in order to be able to parse a document is considered inadequate for SGML, especially with respect to SGML's attractiveness for archival purposes where it is deemed desirable to manifest a document character encoding.

Within established SGML technology, there are the following plausible mechanisms to inform the SGML parser about the character encoding used by a document instance and of bootstrapping an SGML parser into applying a desired character decoding:

  • SGML itself normatively references ISO 2022 code switching techniques as code extension facility; using this mechanism allows an SGML processor to start out in a mode where it accepts only IRV/ASCII characters, and then (virtually) "switches" into the desired mode of accepting eg. a UTF-8 encoding of UCS, based on the designating sequence of the public identifier of the document's base character set (subject to the ISO 8879 provisions with respect to delimiter recognition, this can be extended to other multi-byte encodings as well)
  • using a wrapper document instance that refers to a main document instance via an entity reference; the referenced entity is declared as an external entity using a formal system identifier which admits additional metadata such as character encoding and similar (cf. the bctf parameter of the ISO 10744 extended facilities, eg. FSIDR); this technique can be used with eg. (Open)SP, but isn't supported by sgmljs.net SGML right now, even though sgmljs.net SGML supports ISO 10744 FSIDR in general

  • using an SGML catalog, which can associate an SGML declaration to a document instance without having to place an SGML declaration or declaration reference in a document instance

sgmljs.net SGML only supports the first mechanism of the discussed techniques, and only for UTF-8; the alternatives are discussed for information only.

Note that since sgmljs.net SGML uses ISO 2022 to learn about the character encoding of a document, the listing of supported character sets given below includes designating sequences which represent UTF-8 character encodings.

Note when a character encoding is changed, this has no bearing on the character set, ie. the character numbers used in numeric character references; this is apparent eg. with HTML, which even when served over HTTP with ISO-8859 encoding (which used to be the standard encoding before HTML5) can contain numeric character references that still will be interpreted as UCS code points.

For an overview of ISO 2022, please refer to ECMA-35 Character Code Structure and Extension Techniques, which is identical to ISO/IEC 2022:1994 and made available by ECMA International for public access.

Base Character Set

The following public identifiers are recognized as character sets by sgmljs.net SGML:

  • ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0

  • ISO 646IRV:1991//CHARSET International Reference Version (IRV)//ESC 2/8 4/2

  • ISO Registration Number 177//CHARSET ISO/IEC 10646:2003 UTF-8 Level 1//ESC 2/5 4/7

  • ISO Registration Number 177//CHARSET ISO/IEC 10646:2003 UTF-8 Level 2//ESC 2/5 4/8

  • ISO Registration Number 177//CHARSET ISO/IEC 10646:2003 UTF-8 Level 3//ESC 2/5 4/9

  • ISO Registration Number 177//CHARSET ISO/IEC 10646:2003 UCS with implementation Level 3//ESC 2/5 2/15 4/6

  • ISO Registration Number 177//CHARSET ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6

Note that even though UCS-4, as used in the last public identifier/designation sequence in the list, denotes an alternate UCS encoding, this particular public identifier is interpreted to denote just the UCS character set, and acts exactly the same as the UTF-8 designation sequences.
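
The designating sequences at the end of these public identifiers are written in the column/row notation of ISO 2022, where c/r stands for the byte with value c * 16 + r and ESC stands for the byte 27 (hexadecimal 1B). A minimal Python sketch for turning such a sequence into bytes follows; the function name designating_sequence_bytes is hypothetical:

```python
def designating_sequence_bytes(sequence):
    """Convert eg. "ESC 2/5 4/7" into the byte sequence it denotes."""
    result = bytearray()
    for part in sequence.split():
        if part == "ESC":
            result.append(0x1B)                       # ESCAPE control character
        else:
            column, row = part.split("/")
            result.append(int(column) * 16 + int(row))  # column/row notation
    return bytes(result)


public_id = ("ISO Registration Number 177//CHARSET "
             "ISO/IEC 10646:2003 UTF-8 Level 1//ESC 2/5 4/7")

# The designating sequence is the last "//"-separated field.
sequence = public_id.rsplit("//", 1)[1]
assert designating_sequence_bytes(sequence) == b"\x1b%G"
```
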

Syntax-Reference Character Set

The document character set is the term used by SGML to refer to the character set used by a document instance.

The syntax-reference character set is the character set used for an SGML concrete syntax declaration. As shown in the basic declaration for SGML, the concrete syntax fragment can conceptually (but not actually) be stored in another entity and then referenced from the SGML declaration.

As also discussed in the introduction, a concrete syntax hence needs its own character set definition, independent of the document character set used by a document instance referencing the concrete syntax.

If a concrete syntax definition isn't referenced via a public identifier, but is presented embedded in the SGML declaration code text of the document itself, then it of course must use the same character set as the document character set of the document which it is part of.

For all intents and purposes, a character number as used in sgmljs.net SGML is a single UCS (ISO 10646 or, equivalently, Unicode) code point, independently of the document encoding (such as UTF-8) being used. Apart from character numbers in the SGML declaration, UCS code points are also used in character entity references in a document instance.

The SGML declaration code text itself is always using (just) the IRV/ASCII character set, and when referring to a character number, is using either a character literal (when the character number/code point is contained in IRV/ASCII and is a graphic character) or, alternatively, a character number (when it is not, or when the author chooses to use a number rather than a literal for specifying it).

Naming

In SGML, the definition of permitted characters for names and name tokens

  • of generic identifiers of elements, attributes, and notations, and,
  • values of attributes with declared value ENTITY, ID, IDREF, NAME, NMTOKEN, or NOTATION, and the attributes with declared value ENTITIES, IDREFS, NAMES, or NMTOKENS for specifying multiple space-separated name tokens, and

  • entity names

is controlled uniformly in the NAMING section of the used SGML declaration, meaning the declaration is applicable for all these names and name-like constructs at once.

SGML distinguishes name start characters, which can appear as the first character of a name token, from name characters, which can appear anywhere in a name token. Specifically, the digits can't start a name token. By default (that is, unless more characters are added to the set of name characters or name start characters, respectively, as explained next), the upper- and lowercase IRV letters are accepted as name start characters, while the digits are accepted in addition as name characters.
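
The distinction between name start characters and name characters can be sketched as a small Python check. The function name is_name and its parameters are hypothetical; the default for extra_chars merely mirrors the LCNMCHAR/UCNMCHAR value ".-_:" used in the declarations shown in this text:

```python
import string


def is_name(token, extra_start="", extra_chars=".-_:"):
    """Check a token against name start and name character sets."""
    # Name start characters: the IRV letters plus any declared additions.
    start = set(string.ascii_letters) | set(extra_start)
    # Name characters: name start characters, digits, declared additions.
    chars = start | set(string.digits) | set(extra_chars)
    return (bool(token)
            and token[0] in start
            and all(c in chars for c in token))


assert is_name("h1")            # a digit is fine after the first character
assert not is_name("1h")        # but a digit can't start a name
assert is_name("xml:lang")      # ":" is accepted as a name character
assert not is_name("-x")        # "-" is not a name start character
```
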

Name tokens can be normalized into an uppercase form for the purpose of validation and tag inference (and output, if any), provided that the mapping can be specified for each character or character range individually (eg. rather than by reference to Unicode case conversion procedures), using the LCNMSTRT, UCNMSTRT, LCNMCHAR, and UCNMCHAR parameters.

Each of these parameters contains either a quoted parameter literal containing characters (as character literals), or a space-separated list of character numbers or character ranges. The parameters have the following meaning:

LCNMSTRT (lowercase name start characters)
describes lowercase characters used as name start characters in addition to the IRV lowercase and uppercase letters
UCNMSTRT (uppercase name start characters)

describes exactly as many characters as LCNMSTRT, and contains the uppercase letter for the corresponding lowercase letter in LCNMSTRT at the same position

LCNMCHAR (lowercase name characters)

describes lowercase characters used as name characters in addition to the lowercase name start characters in LCNMSTRT

UCNMCHAR (uppercase name characters)

describes exactly as many characters as LCNMCHAR and contains the uppercase letter for the corresponding lowercase letter in LCNMCHAR at the same position

The SGML declaration admits case folding/canonicalization to be switched on for these two groups of name tokens individually

  • entities (SYNTAX NAMECASE ENTITY YES/NO)

  • and for all other name token uses (SYNTAX NAMECASE GENERAL YES/NO)

but not for more granular subsets of the other name tokens.
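
The position-wise pairing of the lowercase and uppercase naming parameters, and the separate NAMECASE switches, can be sketched in Python as follows. The names build_fold_table and fold_name are hypothetical, the parameters are assumed to be given as literals of individual characters, and the example character é is an arbitrary addition chosen for illustration:

```python
import string


def build_fold_table(lcnmstrt, ucnmstrt, lcnmchar, ucnmchar):
    """Build a lowercase -> uppercase mapping from the naming parameters."""
    if len(lcnmstrt) != len(ucnmstrt) or len(lcnmchar) != len(ucnmchar):
        raise ValueError("lowercase and uppercase parameters must pair up")
    # The IRV letters are always folded with the built-in mapping.
    table = {ord(lower): upper for lower, upper
             in zip(string.ascii_lowercase, string.ascii_uppercase)}
    # Additional characters are folded position by position.
    for lower, upper in zip(lcnmstrt + lcnmchar, ucnmstrt + ucnmchar):
        table[ord(lower)] = upper
    return table


def fold_name(name, table, namecase=True):
    """Apply folding if the applicable NAMECASE setting is YES."""
    return name.translate(table) if namecase else name


table = build_fold_table("é", "É", ".-_:", ".-_:")
assert fold_name("résumé-x", table) == "RÉSUMÉ-X"            # NAMECASE GENERAL YES
assert fold_name("nbsp", table, namecase=False) == "nbsp"    # NAMECASE ENTITY NO
```
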

Extended Naming Rules

When extended naming rules are used, as indicated by the "ISO 8879:1986 (ENR)" (or the "ISO 8879:1986 (WWW)") minimum literal/data, the naming section of an SGML declaration contains the additional NAMESTRT and NAMECHAR parameters.

Moreover, extended naming rules enable character ranges to be used with naming parameters, rather than just lists of individual character numbers.

A naming section making use of extended naming rules can look as follows:

NAMING	LCNMSTRT ""
	UCNMSTRT ""
	NAMESTRT ""
	LCNMCHAR ""    
	UCNMCHAR ""
	NAMECHAR ".-_:"

The effect of using NAMESTRT and NAMECHAR, respectively, is that the declaration is treated as if the value for NAMESTRT had been used in both LCNMSTRT and UCNMSTRT; likewise, the NAMECHAR value is interpreted as if the parameter literal had been used in both LCNMCHAR and UCNMCHAR.

When using extended naming, the literals for the LCNMSTRT, UCNMSTRT, LCNMCHAR, and UCNMCHAR parameters are left empty in the SGML declaration.
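
The way NAMESTRT and NAMECHAR stand in for the four traditional parameters can be sketched as a small Python function. The dictionary representation of the naming section and the name expand_extended_naming are assumptions made for illustration:

```python
def expand_extended_naming(naming):
    """Express NAMESTRT/NAMECHAR in terms of the four traditional parameters."""
    namestrt = naming.get("NAMESTRT", "")
    namechar = naming.get("NAMECHAR", "")
    return {
        # NAMESTRT counts for both the lowercase and the uppercase set.
        "LCNMSTRT": naming.get("LCNMSTRT", "") + namestrt,
        "UCNMSTRT": naming.get("UCNMSTRT", "") + namestrt,
        # Likewise for NAMECHAR.
        "LCNMCHAR": naming.get("LCNMCHAR", "") + namechar,
        "UCNMCHAR": naming.get("UCNMCHAR", "") + namechar,
    }


# The naming section shown above, as a dictionary.
naming = {"LCNMSTRT": "", "UCNMSTRT": "", "NAMESTRT": "",
          "LCNMCHAR": "", "UCNMCHAR": "", "NAMECHAR": ".-_:"}
expanded = expand_extended_naming(naming)
assert expanded["LCNMCHAR"] == expanded["UCNMCHAR"] == ".-_:"
```
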

SGML's uppercase bias

Note that SGML has a built-in preference for the uppercase form of characters if NAMECASE GENERAL YES is applied, in that

  • the lowercase and uppercase letters are always considered both name and name start characters (cf. ISO 8879 Clause 189); ie. these cannot be excluded from the set of admissible characters for name tokens at all

  • likewise, the definition of a larger character set for name tokens versus those in the SGML reference concrete syntax and the associated lowercase-to-uppercase mapping rules afforded by the LCNMSTRT, UCNMSTRT, LCNMCHAR, and UCNMCHAR SGML declaration parameters (and NAMESTRT/NAMECHAR introduced with the extended naming rules according to ISO 8879 Annex J) can only contain characters in addition to the IRV/ASCII letters and digits; in particular, for the letters in IRV/ASCII, no customized uppercase letter can be mapped; this is enshrined in ISO 8879 Clause 198, 22 which reads "A character assigned to LCNMCHAR, UCNMCHAR, LCNMSTRT, or UCNMSTRT cannot be an LC Letter, UC Letter, Digit, RE, RS, SPACE, or SEPCHAR"; consequently, a lowercase IRV/ASCII letter is always case-folded with built-in SGML rules when NAMECASE GENERAL YES is effective

SGML's uppercase bias isn't affected by ISO 8879 Annex J, which only alters the rule for the NAMING production so as to allow character ranges instead of just single character specifications, and also adds NAMESTRT and NAMECHAR as a short, but not essential, form of specifying the values of LCNMSTRT, UCNMSTRT, LCNMCHAR, and UCNMCHAR when the upper- and lowercase variants are identical.

Note that uppercase is only conceptually the preferred form, ie. for the purpose of defining SGML validation in the specification text. SGML applications are free to output or otherwise convey markup as they see fit. ISO 8879 doesn't put any constraints on these, nor defines a canonical SGML processing application or API apart from defining the line-oriented ESIS SGML representation used by ISO 8879's test case suite and some older Perl programs (the Grove in-memory representation, which in many ways is the predecessor to W3C's DOM API, isn't part of ISO 8879).

Hence, whether uppercase or lowercase is used internally by an SGML processor, or whether the processor makes this distinction at all, doesn't have any consequences for the externally observable behaviour of SGML applications as far as ISO 8879 is concerned. For example, sgmljs.net SGML has an option to output HTML markup in lowercase form, while being able to process SGML with NAMECASE GENERAL YES without restrictions.

SGML declaration for HTML5

The SGML declaration for HTML5, applied by sgmljs.net SGML by default when eg. processing .html files or an entity with text/html media type fetched via HTTP, is explained in detail in the HTML5.1 DTD reference.

SGML declaration for HTML4

As an example for a plausible SGML declaration, the following start of an HTML document contains a variant for the (historic) SGML declaration for HTML 4.0. It differs from the official SGML declaration of HTML 4.01 only in its use of the extended WebSGML FEATURES declaration syntax to match actual HTML usage.

Note that the declaration as shown here doesn't declare HTML predefined entities for space reasons, and thus can't be used for HTML content containing entity references; the variant of the SGML declaration for HTML5 for use with the Permissive HTML5.1 DTD does contain these and other declarations, though.

<!SGML "ISO 8879:1986 (WWW)"

  -- based on the SGML declaration for HTML 4.01 --
 
CHARSET
         BASESET   "ISO Registration Number 177//CHARSET
                    ISO/IEC 10646-1:1993 UCS-4 with
                    implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET 0       9       UNUSED
                 9       2       9
                 11      2       UNUSED
                 13      1       13
                 14      18      UNUSED
                 32      95      32
                 127     1       UNUSED
                 128     32      UNUSED
                 160     55136   160
                 55296   2048    UNUSED  -- SURROGATES --
                 57344   1056768 57344

CAPACITY        SGMLREF
                TOTALCAP        150000
                GRPCAP          150000
                ENTCAP          150000

SCOPE    DOCUMENT
SYNTAX
         SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
	           17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
         BASESET  "ISO 646IRV:1991//CHARSET
                   International Reference Version
                   (IRV)//ESC 2/8 4/2"
         DESCSET  0 128 0

         FUNCTION
                  RE            13
                  RS            10
                  SPACE         32
                  TAB SEPCHAR    9

         NAMING   LCNMSTRT ""
                  UCNMSTRT ""
                  LCNMCHAR ".-_:"    
                  UCNMCHAR ".-_:"
                  NAMECASE GENERAL YES
                           ENTITY  NO
         DELIM    GENERAL  SGMLREF
                  HCRO     "&#38;#x" -- ampersand --
                  NESTC    "/"
                  NET      ">"
                  SHORTREF SGMLREF
         NAMES    SGMLREF
         QUANTITY SGMLREF
                  ATTCNT   120     -- increased for HTML 5 --
                  ATTSPLEN 65536   -- These are the largest values --
                  LITLEN   65536   -- permitted in the declaration --
                  NAMELEN  65536   -- Avoid fixed limits in actual --
                  PILEN    65536   -- implementations of HTML UA's --
                  TAGLVL   100
                  TAGLEN   65536
                  GRPGTCNT 150
                  GRPCNT   150     -- increased for HTML 5 --

FEATURES
        MINIMIZE DATATAG  NO
                 OMITTAG  YES
                 RANK     NO
                 SHORTTAG
                          STARTTAG EMPTY    NO
                                   UNCLOSED NO
                                   NETENABL IMMEDNET
                          ENDTAG   EMPTY    NO
                                   UNCLOSED NO
                          ATTRIB   DEFAULT  YES
                                   OMITNAME YES
                                   VALUE    YES
                 EMPTYNRM YES
                 IMPLYDEF ATTLIST  YES
                          DOCTYPE  NO
                          ELEMENT  YES
                          ENTITY   NO
                          NOTATION NO
         LINK
                 SIMPLE   NO
                 IMPLICIT NO
                 EXPLICIT NO
         OTHER
                 CONCUR   NO
                 SUBDOC   NO
                 FORMAL   NO
                 URN      NO
                 KEEPRSRE YES
                 VALIDITY NOASSERT
                 ENTITIES
                          REF      ANY
                          INTEGRAL NO
APPINFO NONE
>
<!DOCTYPE HTML>
...
CHARSET ...

the declaration uses the same CHARSET BASESET character set declaration as the SGML declaration for HTML 4.01; the declaration admits most Unicode characters; in practice, any valid UTF-8 byte sequence in content or attribute values is admitted, but note sgmljs.net doesn't enforce this and will admit any byte in content or attributes, whether it forms part of a valid UTF-8 byte sequence or not (except those having a special delimiter role in SGML such as the < character)
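
The coverage described here can be checked mechanically. The following Python sketch is illustrative only and is not part of sgmljs.net: the triples are copied from the CHARSET DESCSET rows of the declaration above, and the helper names check_contiguous and is_used are made up for this example.

```python
# (start, count, base) triples copied from the CHARSET DESCSET rows of the
# HTML 5 declaration above; None stands for UNUSED.
HTML5_DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 32, None),
    (160, 55136, 160),
    (55296, 2048, None),       # surrogates
    (57344, 1056768, 57344),
]


def check_contiguous(descset):
    """Return the first code point past the described range.

    Raises ValueError if the rows leave a gap or overlap each other.
    """
    expected = 0
    for start, count, _base in descset:
        if start != expected:
            raise ValueError(f"row starting at {start} should start at {expected}")
        expected = start + count
    return expected


def is_used(descset, code_point):
    """True if code_point falls in a row that is not marked UNUSED."""
    for start, count, base in descset:
        if start <= code_point < start + count:
            return base is not None
    return False


if __name__ == "__main__":
    print(hex(check_contiguous(HTML5_DESCSET)))
    print(is_used(HTML5_DESCSET, ord("<")), is_used(HTML5_DESCSET, 0xD800))
```

The rows add up without gaps to the end of the Unicode code space; only C0/C1 control characters, DEL, and the surrogate block are marked UNUSED.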

SYNTAX ... BASESET ... DESCSET ...

the declaration restricts generic identifiers (used for element, attribute, notation, declaration set, and entity names) to the ASCII characters A through Z and a through z and the decimal digits; in addition, the characters . (dot), - (hyphen), _ (underscore), and : (colon) are accepted as the second and subsequent characters, but not as the first character of generic identifiers (note, however, that the SGML processor doesn't enforce these rules)
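
As a rough illustration of the naming rule, the following Python sketch tests whether a string is a name under the NAMING settings above. It is not taken from sgmljs.net (which, as noted, doesn't enforce the rule), and the function name is_declared_name is hypothetical. It assumes that digits, like the LCNMCHAR/UCNMCHAR additions, are not allowed in first position, as in the SGML reference concrete syntax; it ignores the NAMELEN quantity.

```python
import re

# First character: an ASCII letter.
# Later characters: ASCII letters, digits, and the declared additions . - _ :
# Assumption: digits may not start a name (SGML reference concrete syntax).
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9._:-]*")


def is_declared_name(name: str) -> bool:
    """True if name matches the naming rule sketched above."""
    return NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["h1", "xlink:href", "data-x", "-foo", "1a", ""]:
        print(repr(candidate), is_declared_name(candidate))
```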

FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET, FEATURES MINIMIZE EMPTYNRM YES

the declaration is suited for inclusion of SVG and/or MathML as HTML 5 "foreign elements"; specifically, XML-style empty elements are accepted in SVG and MathML; moreover, HTML 5 "self-closing" tags are accepted in HTML content as well (irrespective of whether those are declared "void" elements in the HTML 5 spec); also, the NESTC and NET delimiters have been declared as appropriate for XML; note that XML predefined entities are not declared

IMPLYDEF ELEMENT YES

undeclared elements will be accepted and treated as if they were declared <!ELEMENT elmt - - ANY>; according to this setting, only contents of elements which are declared in a DTD will be validated (but see FEATURES OTHER VALIDITY NOASSERT which effectively switches off any content model validation)

QUANTITY SGMLREF ...

quantities have been adapted so that (Open)SP SGML processing tools can process DTDs for HTML 5; quantity declarations are not required for sgmljs.net SGML

FEATURES OTHER KEEPRSRE YES

newline and carriage return characters will be preserved in content (rather than being interpreted according to SGML rules for omissible whitespace); note that sgmljs.net doesn't support another setting for KEEPRSRE

FEATURES OTHER VALIDITY NOASSERT

no content model or attribute validation is performed; only balancedness of start-element and end-element tags is checked, including checks for elements with declared content EMPTY, which may or may not have an end-element tag or a "self-closing" start-element tag

SGML declaration for XML 1.0 (Fourth Ed. or earlier)

The following declaration is applied by default if a file being processed has an .xml suffix, or begins with an XML declaration, or begins with this declaration.

The XML Fifth Ed. and the XML 1.1 specification revisions have extended the set of admissible characters in name tokens to cover almost all UCS code points, hence the declaration text for these revisions can be much shorter. However, these newer XML specifications are widely considered not representative of actual XML usage, and no official ISO/IEC 8879 SGML declaration for these newer XML versions has been released yet.

For interoperability, only the official SGML declaration for XML 1.0, used exactly as given here (up to whitespace and comments), is supported; use of variant declarations is strongly discouraged until a new official or at least generally accepted SGML declaration for XML is established.

<!SGML "ISO 8879:1986 (WWW)"

     -- SGML Declaration for XML 1.0 --

     -- from: 
        Final text of revised Web SGML Adaptations Annex (TC2) to ISO 8879:1986
        ISO/IEC JTC1/SC34 N0029: 1998-12-06
        Annex L.2 (informative): SGML Declaration for XML

        changes made to accommodate validation are noted with 'VALID:'
     --

     CHARSET
         BASESET "ISO Registration Number 177//CHARSET
                 ISO/IEC 10646-1:1993 UCS-4 with implementation
                 level 3//ESC 2/5 2/15 4/6"
         DESCSET
                 0        9  UNUSED
                 9        2       9
                11        2  UNUSED
                13        1      13
                14       18  UNUSED
                32       95      32
               127        1  UNUSED
               128       32  UNUSED
               160    55136     160
             55296     2048  UNUSED  -- surrogates --
             57344     8190   57344
             65534        2  UNUSED  -- FFFE and FFFF --
             65536  1048576   65536

     CAPACITY NONE  -- Capacities are not restricted in XML --

     SCOPE DOCUMENT

     SYNTAX
         SHUNCHAR NONE
         BASESET "ISO Registration Number 177//CHARSET
                 ISO/IEC 10646-1:1993 UCS-4 with implementation
                 level 3//ESC 2/5 2/15 4/6"
         DESCSET
             0 1114112 0
         FUNCTION
             RE    13
             RS    10
             SPACE 32
             TAB   SEPCHAR 9
         NAMING
             LCNMSTRT ""
             UCNMSTRT ""
             NAMESTRT
                 58 95 192-214 216-246 248-305 308-318 321-328
                 330-382 384-451 461-496 500-501 506-535 592-680
                 699-705 902 904-906 908 910-929 931-974 976-982
                 986 988 990 992 994-1011 1025-1036 1038-1103
                 1105-1116 1118-1153 1168-1220 1223-1224
                 1227-1228 1232-1259 1262-1269 1272-1273
                 1329-1366 1369 1377-1414 1488-1514 1520-1522
                 1569-1594 1601-1610 1649-1719 1722-1726
                 1728-1742 1744-1747 1749 1765-1766 2309-2361
                 2365 2392-2401 2437-2444 2447-2448 2451-2472
                 2474-2480 2482 2486-2489 2524-2525 2527-2529
                 2544-2545 2565-2570 2575-2576 2579-2600
                 2602-2608 2610-2611 2613-2614 2616-2617
                 2649-2652 2654 2674-2676 2693-2699 2701
                 2703-2705 2707-2728 2730-2736 2738-2739
                 2741-2745 2749 2784 2821-2828 2831-2832
                 2835-2856 2858-2864 2866-2867 2870-2873 2877
                 2908-2909 2911-2913 2949-2954 2958-2960
                 2962-2965 2969-2970 2972 2974-2975 2979-2980
                 2984-2986 2990-2997 2999-3001 3077-3084
                 3086-3088 3090-3112 3114-3123 3125-3129
                 3168-3169 3205-3212 3214-3216 3218-3240
                 3242-3251 3253-3257 3294 3296-3297 3333-3340
                 3342-3344 3346-3368 3370-3385 3424-3425
                 3585-3630 3632 3634-3635 3648-3653 3713-3714
                 3716 3719-3720 3722 3725 3732-3735 3737-3743
                 3745-3747 3749 3751 3754-3755 3757-3758 3760
                 3762-3763 3773 3776-3780 3904-3911 3913-3945
                 4256-4293 4304-4342 4352 4354-4355 4357-4359
                 4361 4363-4364 4366-4370 4412 4414 4416 4428
                 4430 4432 4436-4437 4441 4447-4449 4451 4453
                 4455 4457 4461-4462 4466-4467 4469 4510 4520
                 4523 4526-4527 4535-4536 4538 4540-4546 4587
                 4592 4601 7680-7835 7840-7929 7936-7957
                 7960-7965 7968-8005 8008-8013 8016-8023 8025
                 8027 8029 8031-8061 8064-8116 8118-8124 8126
                 8130-8132 8134-8140 8144-8147 8150-8155
                 8160-8172 8178-8180 8182-8188 8486 8490-8491
                 8494 8576-8578 12295 12321-12329 12353-12436
                 12449-12538 12549-12588 19968-40869 44032-55203

             LCNMCHAR ""
             UCNMCHAR ""
             NAMECHAR
                 45-46 183 720-721 768-837 864-865 903 1155-1158
                 1425-1441 1443-1465 1467-1469 1471 1473-1474
                 1476 1600 1611-1618 1632-1641 1648 1750-1764
                 1767-1768 1770-1773 1776-1785 2305-2307 2364
                 2366-2381 2385-2388 2402-2403 2406-2415
                 2433-2435 2492 2494-2500 2503-2504 2507-2509
                 2519 2530-2531 2534-2543 2562 2620 2622-2626
                 2631-2632 2635-2637 2662-2673 2689-2691 2748
                 2750-2757 2759-2761 2763-2765 2790-2799
                 2817-2819 2876 2878-2883 2887-2888 2891-2893
                 2902-2903 2918-2927 2946-2947 3006-3010
                 3014-3016 3018-3021 3031 3047-3055 3073-3075
                 3134-3140 3142-3144 3146-3149 3157-3158
                 3174-3183 3202-3203 3262-3268 3270-3272
                 3274-3277 3285-3286 3302-3311 3330-3331
                 3390-3395 3398-3400 3402-3405 3415 3430-3439
                 3633 3636-3642 3654-3662 3664-3673 3761
                 3764-3769 3771-3772 3782 3784-3789 3792-3801
                 3864-3865 3872-3881 3893 3895 3897 3902-3903
                 3953-3972 3974-3979 3984-3989 3991 3993-4013
                 4017-4023 4025 8400-8412 8417 12293 12330-12335
                 12337-12341 12441-12442 12445-12446 12540-12542

             NAMECASE
                 GENERAL NO
                 ENTITY  NO
         DELIM
             GENERAL  SGMLREF
             HCRO     "&#38;#x"
                      -- Ampersand followed by "#x" (without quotes) --
             NESTC    "/"
             NET      ">"
             PIC      "?>"
             SHORTREF NONE

         NAMES
             SGMLREF

         QUANTITY
             NONE -- Quantities are not restricted in XML --

         ENTITIES
             "amp"  38
             "lt"   60
             "gt"   62
             "quot" 34
             "apos" 39

     FEATURES
         MINIMIZE
             DATATAG NO
             OMITTAG NO
             RANK    NO
             SHORTTAG
                 STARTTAG
                     EMPTY    NO
                     UNCLOSED NO
                     NETENABL IMMEDNET
                 ENDTAG
                     EMPTY    NO
                     UNCLOSED NO
                 ATTRIB
                     DEFAULT  YES
                     OMITNAME NO
                     VALUE    NO
             EMPTYNRM  YES
             IMPLYDEF
                 ATTLIST  YES
                 DOCTYPE  NO
                 ELEMENT  YES
                 ENTITY   NO
                 NOTATION YES
         LINK
             SIMPLE   NO
             IMPLICIT NO
             EXPLICIT NO
         OTHER
             CONCUR   NO
             SUBDOC   NO
             FORMAL   NO
             URN      NO
             KEEPRSRE YES
             VALIDITY NOASSERT
             ENTITIES
                 REF      ANY
                 INTEGRAL YES

     APPINFO NONE

     SEEALSO "ISO 8879//NOTATION Extensible Markup Language (XML) 1.0//EN"
>
<!DOCTYPE ...>
...
BASESET

see notes above

SYNTAX ... BASESET ... DESCSET ...

irrespective of the range restrictions expressed in the declaration, the processor admits all valid XML 1.0 Fifth Edition (or XML 1.1) generic identifiers
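
To make the NAMESTRT and NAMECHAR notation in the declaration concrete, the following Python sketch parses such a list of decimal code points and ranges and tests membership. It is an illustration only and not how sgmljs.net works (which, as stated, doesn't apply these range restrictions). The helper names are made up, and NAMESTRT_EXCERPT holds just the first five entries of the NAMESTRT list rather than the full list.

```python
def parse_ranges(text):
    """Parse 'N' and 'N-M' tokens into a list of inclusive (first, last) pairs."""
    ranges = []
    for token in text.split():
        first, _sep, last = token.partition("-")
        ranges.append((int(first), int(last or first)))
    return ranges


def in_ranges(ranges, code_point):
    """True if code_point lies within any of the inclusive ranges."""
    return any(first <= code_point <= last for first, last in ranges)


# Excerpt only: the first five entries of NAMESTRT in the declaration above.
NAMESTRT_EXCERPT = parse_ranges("58 95 192-214 216-246 248-305")

if __name__ == "__main__":
    for ch in [":", "_", "\u00c9", "\u00d7", "\u00f7"]:
        print(repr(ch), in_ranges(NAMESTRT_EXCERPT, ord(ch)))
```

Note that the ASCII letters don't appear in the list: they are name start characters in any SGML concrete syntax, so NAMESTRT only lists the additions. The gaps at 215 and 247 exclude the multiplication and division signs.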

This declaration also has the following notable settings:

  • SYNTAX NAMECASE GENERAL NO

  • SYNTAX NAMECASE ENTITY NO

  • FEATURES MINIMIZE OMITTAG NO

  • FEATURES MINIMIZE RANK NO

  • FEATURES MINIMIZE IMPLYDEF DOCTYPE NO

  • FEATURES MINIMIZE IMPLYDEF ELEMENT YES

  • FEATURES MINIMIZE IMPLYDEF ATTLIST YES

  • FEATURES MINIMIZE IMPLYDEF ENTITY NO

  • FEATURES MINIMIZE SHORTTAG ATTRIB OMITNAME NO

  • FEATURES MINIMIZE SHORTTAG ATTRIB VALUE NO

  • FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET

  • FEATURES OTHER VALIDITY NOASSERT

  • FEATURES OTHER KEEPRSRE YES

  • added requirements (SEEALSO) "ISO 8879//NOTATION Extensible Markup Language (XML) 1.0//EN"

Default SGML declaration

If processing a file with suffix .sgm, a declaration with the following settings is applied:

  • SYNTAX NAMECASE GENERAL YES

  • SYNTAX NAMECASE ENTITY NO

  • FEATURES MINIMIZE RANK YES

  • FEATURES MINIMIZE OMITTAG NO

  • FEATURES MINIMIZE IMPLYDEF DOCTYPE NO

  • FEATURES MINIMIZE IMPLYDEF ELEMENT YES

  • FEATURES MINIMIZE IMPLYDEF ATTLIST YES

  • FEATURES MINIMIZE IMPLYDEF ENTITY NO

  • FEATURES MINIMIZE EMPTYNRM YES

  • FEATURES MINIMIZE SHORTTAG ATTRIB DEFAULT YES

  • FEATURES MINIMIZE SHORTTAG ATTRIB OMITNAME NO

  • FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET

  • FEATURES OTHER VALIDITY NOASSERT

  • FEATURES OTHER KEEPRSRE YES

  • FEATURES OTHER FORMAL YES

  • FEATURES OTHER URN YES

In addition, the default SGML declaration has the following link processing related settings:

LINK
        SIMPLE   YES 99
        IMPLICIT YES
        EXPLICIT YES 2

These settings enable link processing and templating.

SGML declaration for markdown

If processing a file with suffix .md, a declaration with the same settings as an .sgm file is applied. In addition,

  • predefined entities for HTML are active
  • the strings <file:, <http:, <mailto:, <:, #, ##, ###, and others are declared as SHORTREF delimiters (note sgmljs.net doesn't support custom SHORTREF declarations, but the presentation of markdown as a SHORTREF application nominally requires these declarations, even though versions of (Open)SP SGML don't enforce their presence when declaring shortref maps)

Note that, as with .sgm files, validation isn't enabled (left at its default of NOASSERT).

Using a public declaration reference such as the following

<!SGML MARKDOWN PUBLIC "+//IDN markdown.org//SD Markdown//EN">

in place of a full SGML declaration (where the MARKDOWN declaration set name, but not the +//IDN markdown.org//SD Markdown//EN public identifier, can be chosen arbitrarily) enables markdown processing for any processed file, not just those ending in .md. Moreover, this declaration enables the following settings in addition to those enabled for .md files:

  • FEATURES MINIMIZE OMITTAG YES

  • FEATURES MINIMIZE RANK YES

  • FEATURES OTHER VALIDITY TYPE