SGML, like HTML (which is based on SGML), is a text format built on the idea of organizing information by tagging, or marking up, text. SGML is a meta-language for describing markup vocabularies such as HTML and their parsing rules.
Consider the following basic HTML document:
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>Section Title</h1>
<p>Body Text with <a href="otherdoc.html">link to another document</a>.</p>
<footer>Page Footer</footer>
</body>
</html>
The element grammar for this document can be described as an SGML Document Type Definition (DTD) as follows:
<!ELEMENT html - - (head?,body)>
<!ELEMENT head - - (title?)>
<!ELEMENT title - - (#PCDATA)>
<!ELEMENT body - - (h1,p+,footer?)>
<!ELEMENT h1 - - (#PCDATA)>
<!ELEMENT p - - (#PCDATA|a)*>
<!ELEMENT a - - (#PCDATA)>
<!ELEMENT footer - - (#PCDATA)>
In this grammar, the content model expression head?,body means that the content of the html element is expected to consist of an optional head element followed by a body element. The head and body elements have grammar rules for their own content in turn. #PCDATA means that text is expected at the respective position.
Given such a markup grammar and other declarations, SGML can
check the markup of a given document or a larger collection of documents, and enforce the presence or absence of tags or attributes
infer tags and attributes not present in a document but desired for content delivery (as used for automatically adding boilerplate and structuring content in web applications, and to simplify content creation)
attach processing to elements or more complex contexts for generating dynamic web content, or for other template processing applications in content production
SGML can be used for
content authoring and workflow organization using straightforward concepts such as files and folders, as well as more sophisticated declarative techniques for web applications
content delivery over the web, with rich facilities for fetching and preparing content from databases or web services, and for integration into mainstream web application stacks
sanitizing potentially malicious user content in dynamic web applications or content production processes (injection prevention)
searching, transforming, analyzing and otherwise processing web content and other markup documents
The following sections describe the markup declarations that can be used in a DTD, and their effect on the respective markup constructs in document content.
The general form of an element declaration is
<!ELEMENT element-name [rank] [tag-omission-rules] content [exceptions]>
or
<!ELEMENT name-group [rank] [tag-omission-rules] content [exceptions]>
where
element-name is a single element name to declare
name-group is a list of element names to declare; a name group has the form (element1|element2|...|elementN)
rank is a non-negative decimal number which is treated as a rank suffix that the declared element must have when used in content
the element name is treated as a rank stem, rather than a complete element name, if a rank suffix is specified
an element declared with rank having a rank suffix specified in content (ie. ending in a number in a start-element tag), sets the implied rank suffix for any element tag in subsequent content
for an element declared with rank having its rank suffix omitted in content, the effective rank suffix is that of the most recent element declared in the same declaration that has a rank suffix specified; the most recent element doesn't necessarily have to be a parent element, but can be any preceding element
it's an error if the first occurrence of an element declared with rank in a document instance has its rank suffix omitted
with respect to rank minimization, sgmljs.net treats all elements declared with the same rank suffix in a DTD as if those were declared in the same declaration; ie. a rank suffix is not only inferred from prior elements declared in the same declaration, but from any prior element having a rank declared and specified in content
an element declared with rank is referenced from other declarations by the concatenation of its rank stem and rank suffix; e.g. an element declared with rank stem abc and rank suffix 3 is referenced as abc3 in content model expressions of other element declarations where the element may occur as content model token
note that using element ranks does not in itself enable uses such as e.g. automatically assigning/incrementing header levels based on tag nesting levels; instead, rank omission always infers from the most recently specified rank: see rank-examples
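The rank omission rule can be sketched as follows. This is an illustrative model only, not the sgmljs.net implementation: the function name resolve_ranks and its parameters are hypothetical, tag names are assumed to consist of letters followed by an optional numeric suffix, and a single "current rank" is tracked across all ranked stems (the simplification described above for sgmljs.net).

```python
import re
from typing import Optional

def resolve_ranks(tags, ranked_stems):
    """Fill in omitted rank suffixes on start-tag names.

    tags: start-tag names in document order, e.g. ["h1", "p", "h"].
    ranked_stems: rank stems declared with rank, e.g. {"h"} (hypothetical input).
    """
    current: Optional[str] = None  # most recently specified rank suffix
    resolved = []
    for tag in tags:
        m = re.fullmatch(r"([A-Za-z]+)(\d+)?", tag)
        if m is None or m.group(1) not in ranked_stems:
            resolved.append(tag)  # not a ranked element: keep unchanged
            continue
        stem, suffix = m.group(1), m.group(2)
        if suffix is not None:
            current = suffix  # an explicit suffix sets the implied rank
        elif current is None:
            # first occurrence of a ranked element must specify its suffix
            raise ValueError(f"rank suffix omitted on first use of {stem!r}")
        resolved.append(stem + (suffix or current))
    return resolved
```

Note how an omitted suffix always resolves to the most recently specified one; nothing is incremented based on nesting depth.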
tag-omission-rules is one of the following
- - means both start- and end-element tag must be specified
- O means the end-element tag can be left out
O - means the start-element tag can be left out
O O means both start- and end-element tag may be left out
in the above syntax rules, O refers to the letter O, and - to the minus character
there must be whitespace between the specifiers for start- and end-tag omission
the tag-omission-rules specification may be left out altogether, in which case it defaults to - -
see Tag Inference for applying tag omission rules
content is either a Content Model, with surrounding parentheses
or ANY, allowing any content
or EMPTY, which forbids the element to have content
or CDATA, which makes element content parse as character data
or RCDATA, which makes element content parse as character data, with general entity references being expanded into the respective entity replacement text
see below for detailed explanation
exceptions is an expression of the form -(exclusions) +(inclusions)
where either the exclusions part or the inclusions part, or both, can be omitted
if both the exclusions part and the inclusions part are specified, then the inclusions part must follow the exclusions part
inclusions is a single element or a name group (a list of elements) allowed to occur anywhere and arbitrarily often in descendant content in addition to elements specified in the content model
exclusions is a single element or a name group of elements not allowed to occur in descendant content, even though allowed by the content model or included by an element declaration for a parent element
if an element is excluded, it can't be included by an element declaration for a descendant element (the inclusion is ignored)
an element that is required in a content model can't be excluded
it's an error for an element to be both excluded and included in the same declaration
if an element occurs at a position where it matches a model group token, and is also in the set of included elements, then it is accepted as content model token (inclusion of the element is ignored)
exceptions can only be specified for elements having a content model or for elements with declared content ANY
See also exception examples.
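The interplay of inclusions and exclusions along the chain of open elements can be sketched as follows. This is a simplified model under stated assumptions, not the parser's actual logic: the function name exception_status and the representation of each open element as a pair of sets are hypothetical, and the content model itself is not consulted here.

```python
def exception_status(ancestors, name):
    """Classify an element name against the exceptions of all open elements.

    ancestors: list of (inclusions, exclusions) pairs of sets, outermost
    element first, ending with the element whose content is being parsed.
    Returns "excluded", "included", or "model-only" (meaning the element is
    only allowed if the content model itself accepts it).
    """
    excluded, included = set(), set()
    for inclusions, exclusions in ancestors:
        included |= inclusions
        excluded |= exclusions
    if name in excluded:
        # an exclusion wins over any inclusion, whether declared on a
        # parent or on a descendant of the excluding element
        return "excluded"
    if name in included:
        return "included"
    return "model-only"
```

For instance, an inclusion declared on a descendant of an excluding element has no effect in this model, matching the rule that such an inclusion is ignored.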
An element declared ANY, EMPTY, or CDATA is said to have declared content.
ANY content
When an element is declared to have ANY content, any content (character data or any nested elements, subject to the effective value of IMPLYDEF ELEMENT in the SGML declaration) may occur between the element's start- and end-element tags.
EMPTY content
When an element is declared to have EMPTY content, it must be specified either just by start-element tags (ie. end-element tags can't be used for that element at all), or, if EMPTYNRM YES is specified in the SGML declaration, with an optional end-element tag immediately following the start-element tag.
Note that if, in addition to FEATURES MINIMIZE EMPTYNRM YES, also FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET is specified in the SGML declaration, and / and > are declared to have the NESTC and NET delimiter roles, respectively, then any element having no content (regardless of whether the element is declared EMPTY) can be specified as an XML-style empty element, ie. can be abbreviated as <element/> instead of having to specify <element></element> (see SGML declaration for details).
Note that sgmljs.net supports only the characters stated above for the NESTC and NET delimiter roles (or no assignment to these delimiters at all). Moreover, sgmljs.net restricts supported combinations of the FEATURES MINIMIZE EMPTYNRM and FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET SGML declaration properties to have either the values stated above, or to both have the value NO. The first combination, introduced with WebSGML (the Annex K revision of SGML), corresponds to modern polyglot markup writing (and is used by default in sgmljs.net), while the latter corresponds to the traditional SGML authoring style.
Note that, when processing XML, empty elements are required to either have end-element tags, or to be specified as XML-style empty elements (ie. as <element/>).
Note that apart from declaring an element to have EMPTY content, an element must also have empty content when a #CONREF attribute is specified on it; see #CONREF in attribute default values.
CDATA content
Elements declared CDATA contain unparsed character data as child content.
The & (ampersand) character has no special meaning in content of elements declared CDATA: character sequences looking like named entity references aren't expanded to replacement text, and are, like character entity references, reproduced as-is in result markup.
A < (less-than) character followed by a valid name start character terminates content of elements declared CDATA, just as for regular elements declared with content models.
Note that the CDATA reserved word is also used as declared value in attribute declarations and entity declarations, and as keyword in marked sections.
A content model specifies the sequence of sub-elements and/or character data content that an element's child content is expected to have.
It is specified by a content model expression. For example, the content model expression a, b?, c* describes a sequence consisting of a single a element, followed by an optional b element, followed optionally by a sequence of any number of c elements.
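As an informal analogy (not how an SGML parser is implemented), the content model a, b?, c* behaves like a regular expression over the sequence of child element names. The sketch below assumes each child element is abbreviated to a single letter; the names MODEL and matches are hypothetical.

```python
import re

# One letter per child element: the content model "a, b?, c*"
# corresponds to the regular expression "ab?c*".
MODEL = re.compile(r"ab?c*")

def matches(children: str) -> bool:
    """Return True if the sequence of child elements satisfies the model."""
    return MODEL.fullmatch(children) is not None
```

With this model, the child sequences a, ab, acc and abccc are accepted, while b, abb and ca are rejected.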
A content model expression is an expression constructed from content model tokens and compositors, with optional grouping and nesting of subexpressions in parentheses.
Content model tokens are either
element names declared in the same or another element declaration within the same declaration set, or
the #PCDATA token, representing parsed character data being allowed at the position in the content model expression where it is specified
A compositor is one of the following characters, listed along with the compositor's application to operand elements and/or compound subexpressions, and its semantics:
operand? (zero-or-one compositor)
means "zero or one" of the element or content model subexpression to which it applies
the operand element or content model subexpression to which the compositor applies is written to the left of the compositor
operand* (Kleene star compositor)
means "zero or more" of the element or content model subexpression to which it applies
the operand element or content model subexpression to which the compositor applies is written to the left of the compositor
operand+ (plus compositor)
means "one or more" of the operand element or content model subexpression to which it applies
the operand element or content model subexpression to which the compositor applies is written to the left of the compositor
an expression such as a+, where a is an element or subexpression, is equivalent to a,a*
left-operand, right-operand (comma compositor)
means "a sequence of the left operand followed by the right operand", where each operand is an element name or content model subexpression
(operand) (grouping)
expressions can be grouped in parentheses such that they can be used as operands to higher level compositors; when parentheses are omitted, content model expressions are parsed left-to-right, ie. a compositor to the left of an operand takes precedence over a compositor to the right
left-operand & right-operand (allgroup compositor)
means any sequence of the operand elements or subexpressions, provided that any one element or subexpression occurs at most once in total
when applied to an operand subexpression that has the "zero-or-one" compositor as top-most compositor, that operand subexpression isn't required to occur, but if it occurs, it must occur at most once anywhere in the content of the element being declared
the content model expression a & b & c is equivalent to the content model expression (a,((b,c)|(c,b)))|(b,((a,c)|(c,a)))|(c,((a,b)|(b,a)))
Note: In sgmljs.net SGML, operands of the allgroup compositor must be either
a single element name, or
a subexpression having the zero-or-one compositor, the sole operand of which is a single element name.
More complex operands for the allgroup compositor aren't supported.
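The expansion of a & b & c into sequences and alternatives can be checked mechanically. The sketch below is an illustration only: it writes the expansion, with every inner alternative fully parenthesized, as a regular expression with one letter per element, and compares it against all orderings of the three elements. The names EXPANSION and all_group_sequences are hypothetical.

```python
import re
from itertools import permutations

# a & b & c expanded into sequences and alternatives, as a regex:
# (a,((b,c)|(c,b))) | (b,((a,c)|(c,a))) | (c,((a,b)|(b,a)))
EXPANSION = re.compile(r"a(bc|cb)|b(ac|ca)|c(ab|ba)")

def all_group_sequences(names):
    """All child sequences accepted by an allgroup of required elements."""
    return {"".join(p) for p in permutations(names)}
```

Every one of the six orderings returned by all_group_sequences("abc") matches EXPANSION, and incomplete or repeated sequences such as ab or aab do not.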
If #PCDATA is specified as content token, it is implicitly treated as if (#PCDATA)* were specified, ie. parsed character data is always optional in content models.
Content models must be unambiguous, ie. any content token must be uniquely matched without looking ahead at subsequent content tokens for disambiguation. For example, the content model
(a,b)|(a,c)
is ambiguous, since element a can be matched as the beginning of either (a,b) or (a,c). On the other hand, the equivalent content model expression
a,(b|c)
is unambiguous.
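A very reduced version of this check can be sketched for the special case of flat alternatives of required tokens. This is an assumption-laden simplification: a full ambiguity check also has to consider optional tokens, repetition and nested groups, and the function name first_token_conflicts is hypothetical.

```python
def first_token_conflicts(alternatives):
    """Return the tokens that begin more than one alternative.

    alternatives: non-empty token sequences, e.g. [["a", "b"], ["a", "c"]]
    for the content model (a,b)|(a,c).
    """
    seen, conflicts = set(), set()
    for sequence in alternatives:
        first = sequence[0]
        if first in seen:
            conflicts.add(first)  # same first token in two alternatives
        seen.add(first)
    return conflicts
```

For (a,b)|(a,c) the function reports a as conflicting; after factoring a out, the remaining alternatives b|c have distinct first tokens.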
For automatic generation of required elements not present in content, FEATURES MINIMIZE OMITTAG YES must be enabled in the SGML declaration (which it is by default, except when processing XML).
In the following description of SGML tag inference, trivial actions on special conditions aren't described, such as on
ANY content models, or, equivalently, implied-ANY elements; implied-ANY elements are elements having child elements with implied element declarations, ie. undeclared elements (when allowed to occur via IMPLYDEF ELEMENT YES)
EMPTY elements, or, equivalently, implied-EMPTY elements (elements governed by content references)
inference of document elements (which is just a special case of general start-element tag inference).
See also tag inference examples.
1. Close definitely completed elements
Definitely completed elements are those whose required elements have all been parsed by previous actions, in the sequence declared in their content model declaration, such that only an end-element tag for the enclosing (definitely completed) element is accepted at the context position.
A model group ending in an optional content token or in a content token with one-or-more compositor can't be definitely completed, and isn't considered for automatic closing.
It's an error if a definitely completed element's end-element tag isn't omissible at this point, because a start-element action cannot be accommodated at the context position.
2. Check if the start-element tag or parsed character data is accepted at the context position; that is, check that it's accepted at the current position in the model group and isn't excluded via exclusion exceptions
3. Open contextually required elements
Elements are contextually required if the content model of the enclosing element accepts a single element at the context position as a required element.
The element to accommodate is not influential in opening elements here, only the state of model group(s) already opened is considered.
If a contextually required element is opened, and matches the content token to accommodate, tag inference is completed for this action
The following actions are performed by sgmljs.net SGML in addition (these and similar recovery actions are also performed by third party SGML parsers such as SP, but are reported as recoverable errors by those parsers, whereas sgmljs.net SGML performs these actions silently):
At step 3, if it isn't possible to open a contextually required element, and
the context is immediately below the document element (such that inferring an end-element tag at the context position will close the document element, and logically end the document), and
the element to accommodate is not declared to have rank or ends with a numeric token (see next section), and
there's a single transition over an element from the context state, and
the start-element tag of the element to transition over is omissible
then that single transitioned-over element is opened as if it were contextually required.
At step 3, if it isn't possible to open a contextually required element, and
the element to accommodate is declared as having rank, and
the element's rank suffix to accommodate is higher (numerically larger) than that of the parent element (or the parent has no rank, in which case it is treated as having rank 0), and
there's a single transition over a ranked element from the context state, and
the rank of that single transitioned-over element is the same as that of the element to accommodate, and
the start-element tag of the ranked element to transition over is omissible
then that single transitioned-over element is opened as if it were contextually required.
Moreover, if it isn't possible to open either a contextually required element or a rank-implied element as described, the parent element is closed, if it is potentially completed (see definition below).
At step 2, if the element to accommodate isn't accepted at the context position due to exclusion exceptions, close as many potentially completed parent elements as necessary until it is (ie. until no more exclusion exceptions apply to the element to accommodate, if possible).
Close potentially completed elements
Potentially completed elements are those whose required elements have all been parsed by previous actions, in the sequence declared in its content model declaration; as opposed to definitely completed elements, the model group may allow further optional elements, or end in a content token (or in a nested model group) having the one-or-more compositor.
If the end-element to accommodate matches the most recently closed element, tag inference is completed for this action
As an additional minimization feature, SGML supports empty start- and end-element tags, ie. tags in which the element name is omitted. This feature doesn't require any special markup declaration and can be applied to any element (except to the start-element tag of the document element), subject to the FEATURES MINIMIZE SHORTTAG STARTTAG EMPTY and FEATURES MINIMIZE SHORTTAG ENDTAG EMPTY SGML declaration settings, respectively.
An empty start-element tag is treated as if it were a start-element tag for the most recently closed element. For example, the empty start-element tag <> in the following markup text
<foo>
<bar>...</bar>
<>...</bar>
</foo>
is interpreted as a <bar> start-element tag.
An empty end-element tag is treated as if it were an end-element tag for the context element (ie. the nearest unclosed element). For example, </> is equivalent to </bar> in the following markup text:
<foo>
<bar>...</>
</foo>
Note: the SGML terms empty start-element tag and empty end-element tag are used for the <> and </> tokens. An XML-style empty element token, on the other hand, represents a different concept.
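The two resolution rules can be sketched over a pre-tokenized stream. This is an illustrative model only: the function name expand_empty_tags is hypothetical, the input is assumed to be well nested with no other tag omission, and tags are assumed to carry no attributes.

```python
def expand_empty_tags(tokens):
    """Replace <> and </> by full tags.

    tokens: list such as ["<foo>", "<bar>", "...", "</>", "</foo>"];
    tokens not starting with "<" are treated as text and passed through.
    """
    open_elements = []   # stack of currently open element names
    last_closed = None   # name of the most recently closed element
    out = []
    for tok in tokens:
        if tok == "<>":
            if last_closed is None:
                raise ValueError("empty start-element tag without a closed element")
            tok = f"<{last_closed}>"
        elif tok == "</>":
            if not open_elements:
                raise ValueError("empty end-element tag without an open element")
            tok = f"</{open_elements[-1]}>"
        if tok.startswith("</"):
            last_closed = tok[2:-1]
            open_elements.pop()  # relies on the well-nestedness assumption
        elif tok.startswith("<"):
            open_elements.append(tok[1:-1])
        out.append(tok)
    return out
```

Applied to the two markup examples above, <> resolves to <bar> and </> resolves to </bar>.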
Declarations for attribute lists take the form
<!ATTLIST element-name attribute-name declared-value [default-value]
[attribute-name declared-value [default-value]] ...>
or
<!ATTLIST name-group attribute-name declared-value [default-value]
[attribute-name declared-value [default-value]] ...>
or
<!ATTLIST #ALL attribute-name declared-value [default-value]
[attribute-name declared-value [default-value]] ...>
where
element-name is a single element name to declare attributes for
name-group is a list of element names to declare attributes for; a name group has the form (element1|element2|...|elementN)
#ALL declares the attribute on all (declared or undeclared) elements when used in place of element-name or name-group
attribute-name is the name of the attribute to declare
declared-value is one of the following possible lexical value types
CDATA, allowing any quoted string to be used as attribute value
ENTITY, allowing a name token declared as entity name in the same declaration set; the token doesn't need quoting
ENTITIES, allowing, in addition to ENTITY, a space-separated list of name tokens declared as entity names; when actually specifying more than a single entity name in content, the attribute value must be quoted
ID, allowing a name token, which must be unique among all name tokens used as ID in a document, and which establishes an ID value for reference by IDREF or IDREFS attributes
IDREF, allowing a name token used as ID in the same document; the token doesn't need quoting
IDREFS, allowing, in addition to IDREF, a space-separated list of name tokens declared as ID attribute values; when actually specifying more than a single ID value in content, the attribute value must be quoted
NAME, allowing a name token; the token doesn't need quoting
NAMES, allowing, in addition to NAME, a space-separated list of name tokens
NMTOKEN, allowing, in addition to NAME, a token beginning with . (dot), - (minus), or _ (underscore), whereas NAME allows these characters to occur only at the second or subsequent position in the attribute value
NMTOKENS, allowing, in addition to NAMES, a list of tokens, each of which may begin with . (dot), - (minus), or _ (underscore)
NOTATION, allowing the attribute value to have a notation name specified in the enumerated list of permitted notation names
NUMBER, allowing a sequence of digits as attribute value
NUMBERS, allowing, in addition to NUMBER, a list of numerical values to occur
NUTOKEN, allowing a sequence of digits, followed by a sequence of letters (such as 64px)
NUTOKENS, allowing, in addition to NUTOKEN, a list of NUTOKEN tokens
[data attribute specification], allowing for custom data attribute checks and value normalization (see Data attribute specification)
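The lexical token types above can be approximated with regular expressions. The following is a rough ASCII-only sketch, not the checks sgmljs.net performs: which characters count as name characters actually depends on the SGML declaration, the patterns assume that NMTOKEN may also begin with a digit and that NUTOKEN is a digit followed by any name characters (a superset of the digits-then-letters form described above), and the names PATTERNS and check_value are hypothetical.

```python
import re

NAME_CHARS = r"A-Za-z0-9._-"  # assumed set of name characters (ASCII only)

PATTERNS = {
    "NAME":    re.compile(rf"[A-Za-z][{NAME_CHARS}]*"),
    "NMTOKEN": re.compile(rf"[{NAME_CHARS}]+"),
    "NUMBER":  re.compile(r"[0-9]+"),
    "NUTOKEN": re.compile(rf"[0-9][{NAME_CHARS}]*"),
}

def check_value(declared_value, value):
    """Check a value against NAME(S), NMTOKEN(S), NUMBER(S) or NUTOKEN(S)."""
    plural = declared_value not in PATTERNS          # e.g. "NAMES"
    pattern = PATTERNS[declared_value[:-1] if plural else declared_value]
    tokens = value.split() if plural else [value]    # list forms are space-separated
    return bool(tokens) and all(pattern.fullmatch(t) for t in tokens)
```

For example, -abc is rejected as NAME but accepted as NMTOKEN, and 64px is accepted as NUTOKEN but rejected as NUMBER.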
A single attribute list declaration can declare one or more attributes for one or more elements (when using the name group declaration variant).
Conversely, attributes of the same element can also be declared in multiple attribute list declarations (from potentially multiple declaration sets). However, the same attribute can be effectively declared at most once for a given element across all applicable attribute list declarations: multiple declarations for the same attribute on a given element aren't rejected, but only the first declaration in document order (and, by extension, in the order in which declaration sets are processed) becomes effective, while later declarations are ignored.
See attribute declaration and use examples.
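The "first declaration wins" rule can be modelled in a few lines. This is a sketch under the assumption that declarations have already been flattened into (element, attribute, declaration) triples in processing order; the function name effective_attributes is hypothetical.

```python
def effective_attributes(attlists):
    """Return the effective declaration for each (element, attribute) pair.

    attlists: iterable of (element, attribute, declaration) triples in the
    order in which the attribute list declarations are processed.
    """
    effective = {}
    for element, attribute, declaration in attlists:
        # setdefault keeps the first declaration and ignores later ones
        effective.setdefault((element, attribute), declaration)
    return effective
```

A later redeclaration of the same attribute on the same element is silently dropped rather than reported as an error.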
The default value is either
(for enumerated values) one of the enumerated values
(for NOTATION attributes) one of the enumerated notation names
(for other attributes) an attribute value literal; it only needs quotes if the default value isn't a name token
the token #REQUIRED, which means the attribute must be specified, and must have a value
the token #IMPLIED, which means the attribute doesn't have to be specified (is optional)
the token #CONREF, which means that, if the attribute is specified, then the element on which it is specified is treated as if it were declared EMPTY
The token #FIXED may be specified before default values of the first, second, or third form above. When specified, the attribute either must have the default value, or mustn't be used at all on the respective element.
Note that assigning template entities to attributes declared #CONREF can have additional semantics to the effect that the element on which the #CONREF attribute is specified gets replaced by external content.
An attribute declaration such as
<!ATTLIST elmt attr (val1|val2|val3) val1>
declares the attribute attr on element elmt. The attribute can have either of the values val1, val2, or val3, and its default value (its value when not specified on the element explicitly) is val1.
Using the #ALL keyword, it's possible to declare one or more attributes on all elements; depending on whether undeclared elements are allowed (eg. by using IMPLYDEF ELEMENT YES or IMPLYDEF ELEMENT ANYOTHER as explained below), attributes declared in an attribute list declaration with #ALL can also be used on undeclared elements.
An attribute can be declared both in an #ALL attribute list and in a regular attribute list for a single element or an element name group at the same time. If an attribute is declared both on an individual element and on #ALL elements, its usage must satisfy both declarations. For example, an attribute can be declared to have an enumerated value in an #ALL attribute list, and can be declared to have a #FIXED value in an attribute list declaration for an individual element. In this way, it's possible to model a common design pattern in DTDs, wherein an attribute is declared on an individual element in a more specific way than by the generic declaration for the attribute in an #ALL attribute declaration, while the generic #ALL declaration still expresses a baseline declaration and common requirement for the attribute's use across all elements used in a document.
It's a design error (reported by sgmljs.net SGML as attribute validation error on actual attribute use) if an attribute is declared both as an #ALL attribute and as an attribute on an individual element, when the two declarations are not satisfiable simultaneously. For example, a #FIXED value for an attribute declared in an #ALL attribute declaration can't be refined by declaring a different #FIXED value on an individual element for the same attribute.
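The requirement that a value satisfy both the #ALL declaration and the element-specific declaration can be sketched as a conjunction of two checks. The representation of a declaration as a dictionary with optional "enum" and "fixed" keys, and the names satisfies and valid_use, are assumptions made for this illustration only.

```python
def satisfies(declaration, value):
    """Check a value against one simplified attribute declaration.

    declaration: dict with optional keys "enum" (set of permitted values)
    and "fixed" (the single permitted value of a #FIXED attribute).
    """
    if "enum" in declaration and value not in declaration["enum"]:
        return False
    if "fixed" in declaration and value != declaration["fixed"]:
        return False
    return True

def valid_use(all_declaration, element_declaration, value):
    """A value is valid only if it satisfies both declarations."""
    return satisfies(all_declaration, value) and satisfies(element_declaration, value)
```

With an enumerated #ALL declaration and a #FIXED element declaration, only the fixed value passes; with two different #FIXED values, no value can pass, which corresponds to the design error described above.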
The order of an #ALL declaration relative to an attribute declaration of an individual element for the same attribute isn't significant and doesn't change the interpretation of attribute declarations. Moreover, #ALL attribute declarations always apply to all elements of the document type and DTD containing the declaration, irrespective of whether element declarations are placed before or after the respective #ALL attribute declaration in document order (or are present at all).
Note that sgmljs.net doesn't support WebSGML's other keywords (such as #IMPLICIT) on attribute declarations in place of #ALL. Moreover, #ALL isn't supported for data attributes (ie. attributes of notations; see below).
In addition to the built-in parsing types for attributes as described above, attributes can be declared to have custom data types (this form of declaration makes use of notations explained in the next section).
For example, the following declarations
<!NOTATION html5-form-input
PUBLIC "+//IDN www.w3c.org/TR/html5//NOTATION HTML 5 Form Input Types//EN">
<!ATTLIST elmt attr DATA html5-form-input>
declare the attr attribute to have a lexical type identified with a notation having the public identifier +//IDN www.w3c.org/TR/html5//NOTATION HTML 5 Form Input Types//EN. This public identifier represents the collection of lexical datatypes specified by HTML 5 form input validation, and imposes validation and value normalization on attributes (and on plain text content of CDATA and SDATA data entities) declared to be in that notation.
WebSGML allows specifying attributes for the data library notation such as in
<!ATTLIST elmt attr DATA html5-form-input [ type="email" ]>
or
<!ATTLIST elmt attr DATA html5-form-input [ pattern="XYZ\d+" ]>
In the absence of the type or pattern attribute, sgmljs.net will behave as if text (the most basic HTML 5 form input validation type) had been specified for the type attribute. text accepts any text value as content, and the value normalization applied is restricted to removing newlines, if present.
See form input value checking for more details on lexical value checking.
A notation, in general SGML terms, is a representation format for data such as the image formats PNG, GIF, or JPEG, or a text format such as TeX for typesetting mathematics.
Notation markup can be used to specify content in a different data representation format than SGML, either embedded in a SGML document, or as a reference to an external resource.
In sgmljs.net, the notation construct is also used to provide custom processing on markup for a broad class of applications such as content formatting and filtering; see templating.
A notation is declared as follows:
<!NOTATION notation-name identifier>
where
notation-name is the name of the notation to declare
identifier is the public and/or system identifier for the notation, used to identify the notation as either a built-in notation (SGML, SQL, SPARQL, etc.) or an external custom notation; see identifiers
Notation attributes can be used to mark up a piece of inline text as "in a notation": in the following example, the characters \sqrt{2} are marked up as TeX-formatted math:
<!doctype example [
<!element example (math)+>
<!element math CDATA>
<!attlist math format notation (tex) #implied>
<!notation tex public "TeX">
]>
<example>
<math format=tex>\sqrt{2}</math>
</example>
Note this is only an example of how to specify inline notation data; the use of the ad-hoc public identifier TeX here won't cause sgmljs.net SGML to execute TeX instructions.
Note that when using notation attributes, the content restrictions and entity expansion behaviour declared in the element declaration for the element on which it is declared and specified apply unchanged.
The syntax for declaring (and specifying values for) NOTATION declared attributes is very similar to that of enumerated values; see attribute examples.
For using notations with external entities, see entities.
Like elements, notations can have attributes. Data attributes are used to configure properties of external data entities, or of inline notational content; see templating for details.
Data attributes are declared as follows:
<!ATTLIST #NOTATION notation-name attribute-name declared-value default-value
[attribute-name declared-value default-value] ...>
For data attributes, the same rules as for element attributes apply, with the following exceptions:
data attributes can't have a declared value of ID, IDREF, IDREFS, NOTATION, ENTITY, or ENTITIES (however, special rules apply for templating)
unlike element attributes, data attributes must be declared and aren't subject to MINIMIZE IMPLYDEF ATTLIST YES when declared in the SGML declaration
An entity, in SGML, is a stream of character data.
An entity declaration introduces a name for an entity for subsequent use in the SGML prolog or in content. Parsed entities (see general entities) are used for entity references, which are replaced by the entity's character data on processing. Unparsed entities (see data entities) are used as values of ENTITY (or ENTITIES) attributes for templating, or are processed in other entity type-specific ways.
The purpose of general entities is to support reuse of text at multiple places in a document, by placing references to a shared, declared general entity as follows:
<!DOCTYPE doc [
<!ENTITY text "some <i>reusable</i> text">
<!ELEMENT doc - - (p+)>
<!ELEMENT i - - (#PCDATA)>
<!ELEMENT p - - (#PCDATA|i)*>
]>
<doc>
<p>First use of the "text" entity follows: &text</p>
<p>Second use of the "text" entity follows: &text</p>
</doc>
In the example, &text is a reference to the previously declared text (general) entity, and will expand to the string some <i>reusable</i> text in place.
Any markup contained in the entity replacement text will be interpreted as if it had been part of the text in which the entity reference is placed. This means that replacement text can contain tags (or any other SGML content construct such as marked sections, processing instructions, etc.). It may also contain further entity references in turn, which will be expanded in place recursively.
However, valid replacement text for an entity must not contain references to the entity being replaced itself (or, transitively, contain an entity reference expanding into a reference to the entity being expanded itself).
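Recursive expansion with detection of self-references can be sketched as follows. This is an illustration rather than the parser's actual algorithm: the function name expand, the dictionary of entities, and the simplified reference syntax (an ampersand, a name of ASCII letters, digits, dots and hyphens, and an optional semicolon) are assumptions, and undeclared references are simply left in place.

```python
import re

# Simplified general entity reference: &name with optional closing ";"
REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);?")

def expand(text, entities, active=()):
    """Expand entity references in text, recursively.

    entities: dict mapping entity names to replacement text.
    active: names of the entities currently being expanded, used to
    detect direct or transitive self-references.
    """
    def replace(match):
        name = match.group(1)
        if name in active:
            raise ValueError(f"entity {name!r} references itself")
        if name not in entities:
            return match.group(0)  # leave undeclared references untouched
        return expand(entities[name], entities, active + (name,))
    return REF.sub(replace, text)
```

Because references are only resolved when encountered, an entity's replacement text may mention entities that are added to the dictionary later, which mirrors the lazy behaviour of general entities.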
General entity references are expanded anywhere in content, regular attribute specifications, and replacement content of general entities, except in CDATA marked sections, CDATA content, data text entities (CDATA/SDATA entities), and attributes declared with data attribute specifications.
General entity references (as opposed to parameter entity references) aren't expanded in markup declarations.
General entities are lazily fetched at the time(s) an entity reference is parsed in content. When processing an entity declaration with replacement text containing references to further entities, no check is performed whether referenced entities are declared and/or accessible. In particular, unlike parameter entities, at declaration time, replacement text for general entities may contain references to other entities that aren't themselves declared (yet).
See also general entity examples.
Rather than specifying the replacement text for an entity literally, it's also possible to specify that replacement text should be retrieved from an external resource (such as a file or via HTTP) by declaring the entity as follows:
<!ENTITY ent SYSTEM "filename.txt">
where the part beginning with SYSTEM ...
(containing a file name in
the example) is an identifier.
For entities declared as follows
<!ENTITY ent CDATA "escaped replacement text">
or, equivalently,
<!ENTITY ent SDATA "escaped replacement text">
entity references are expanded into the respective literal replacement
text without further interpretation of the replacement text as markup.
If the replacement text contains characters or character sequences
that would be interpreted as markup delimiters (such as the <
or &
characters), then those characters will be expanded into
character entity references.
Consequently, general entity references and tags aren't recognized in data text entities; note, however, that the replacement text literal in a data text entity declaration is subject to parameter entity replacement.
In sgmljs.net, CDATA
and SDATA
data text entities are treated
identically.
Apart from CDATA
and SDATA
, the PI
keyword can also be used in data
text entity declarations.
This variant introduces an entity containing a processing instruction, and is the only variant that can also be used with parameter entities.
References to PI
data text entities can only be used in a context
where a processing instruction can be used; specifically, PI
data
text general entity references can't be used in attribute values.
In sgmljs.net, an external data text entity is declared using
the syntax for CDATA
and SDATA
data entities,
explained below.
Character entity references are strings of the form &#NNNNNN
where
NNNNNN
is a decimal number, or of the form &#xMMMMMM
where MMMMMM
is a hexadecimal number. The number refers to the code point in the
document character set (Unicode) represented by the character entity
reference.
Character entity references are passed as-is to the output; all browsers and markup processing tools are expected to be able to handle character entity references.
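Since character entity references are passed through unchanged, decoding them is left to the consumer of the output. As a rough sketch of what such a consumer might do (the function name `decode_char_refs` is hypothetical, and the closing `;` is treated as optional here), decimal and hexadecimal references can be mapped to code points as follows:

```python
import re

# Decimal (&#NNNN;) or hexadecimal (&#xMMMM;) character entity references.
CHAR_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));?")

def decode_char_refs(text):
    """Replace each character entity reference by the character it denotes."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code_point)
    return CHAR_REF.sub(replace, text)

# Both forms below denote U+2013 EN DASH.
print(decode_char_refs("1990&#8211;1999") == decode_char_refs("1990&#x2013;1999"))
```

Named function character references such as `&#TAB;` are deliberately not matched by this pattern.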
Entity declarations with a %
character following the ENTITY
keyword introduce parameter entities. Where general entity
declarations define replacement text for content, parameter entities
define replacement text in markup declarations.
For example, the following document type declaration set
contains a declaration for the idattr
parameter entity. The
parameter entity is then referenced twice in further declarations.
<!DOCTYPE doc [
<!ENTITY % idattr "id ID #IMPLIED">
<!ELEMENT doc - - (#PCDATA|p|ul|a)>
<!ELEMENT p - - (#PCDATA)>
<!ELEMENT ul - - (li+)>
<!ELEMENT li - - (#PCDATA)>
<!ELEMENT a - - (#PCDATA)>
<!ATTLIST doc %idattr>
<!ATTLIST p %idattr>
<!ATTLIST ul %idattr>
<!ATTLIST li %idattr>
<!ATTLIST a href CDATA #IMPLIED %idattr>
]>
...
Similar to general entity references, the %idattr
parameter entity
reference is expanded into the replacement text
id ID #IMPLIED
so that all elements will have the same id
attribute declaration
as a result.
Furthermore, the a
element will have the href
attribute in
addition to the id
attribute. Note that the purpose of reusing
an attribute declaration can also be achieved by using a name group
- a list of element names - in an ATTLIST
declaration (and furthermore
could also be achieved using WebSGML's #ALL
keyword in place of an
element name or name group).
A parameter entity reference must begin with the %
character.
A parameter entity declaration must have whitespace
between the %
character and the subsequent parameter entity name.
Apart from reusing parts of declaration text, parameter entities are used in particular for
customizing a generic external declaration set by overriding default declarations for parameter entities in the internal declaration set; see declaration sets
as placeholder for keywords in marked sections
designing declaration set text for reuse in general.
Unlike general entities, parameter entities are fetched eagerly as soon as an external parameter entity declaration is processed. Therefore, it is an error for the replacement text of a parameter entity to contain unresolved references to (other) parameter entities; references to parameter entities already declared in a prior declaration (in markup declaration text order), on the other hand, are recognized and expanded in parameter entity replacement text.
Parameter entities can also be used for fetching external content when external content can't or shouldn't be fetched multiple times as would be the case for external general entities, for example when fetching an external service response into a parameter entity for use in multiple references, or when fetching from the standard input or from a network stream.
Parameter entity references are expanded in the replacement text for general entities (as well as in all other markup declarations except system identifier literals). This means that any parameter entity value can be re-declared (copied) as a general entity by placing a parameter entity reference into the replacement text for a general entity.
Note that parameter (or general) entity references aren't expanded in system identifier literals (of markup declarations using external identifiers, such as entity and notation declarations). To construct a system identifier from a parameter entity, an additional, derived parameter entity is declared consisting of a reference to the parameter entity to construct from, with leading and trailing quote characters added; the derived parameter entity is then used as system identifier literal.
See also parameter entity examples.
Like general entity declarations, parameter entity declarations can point to a system identifier (a file or network location to fetch character data from), rather than providing inline replacement text as parameter literal.
An entity declaration with omitted system identifier literal
but containing SYSTEM
, such as the following
<!ENTITY ent SYSTEM>
declares an entity which is resolved by default to the filename
ent
. The file is searched for in the same directory as the file
declaring it (the resolved value or the directory to search can
be changed using runtime parameters).
Any entity that can be declared as external entity (general, data and parameter entities) can be declared system-specific.
When IMPLYDEF ENTITY YES
is specified in the SGML declaration,
general entity references to undeclared entities will be resolved
as system-specific entity. This means there is no need to specify
an entity declaration at all; entities can be referenced right away
provided the entity name can be resolved as file name, or another
resolution rule has been provided as invocation parameter.
Parameter entities, on the other hand, must always be declared. Note, however, that external data text entities can't be declared system-specific.
Entities can be declared to be in a notation as follows (where we first declare a notation to reference its name in the entity declaration):
<!NOTATION somenotation SYSTEM "some-notation-identifier">
<!ENTITY someent SYSTEM "some-entity" NDATA somenotation>
Entities declared like this are not considered SGML character data and won't be expanded into replacement text when used in an entity reference.
Instead, the SGML processor just reproduces entity references to such entities as-is; special processing can be implemented and associated with a notation (i.e. with a public identifier of a notation) via notation handlers and the SGML API. A standard notation handler is provided by the templating feature.
Data entities declared using the CDATA
or SDATA
keywords in place of NDATA
,
on the other hand, will be expanded into the respective replacement text
when used as entity reference:
<!NOTATION somenotation SYSTEM "some-notation-identifier">
<!ENTITY ent SYSTEM "some-entity" CDATA somenotation>
An entity reference to ent
will be expanded into the text contained
in the "some-entity" file; as with data text entities,
special characters such as <
or &
are escaped in the replacement text,
and not treated as markup delimiters.
See also data entity examples.
If a notation has data attributes, values for the data attributes
can (or must, if no #FIXED
or default values are provided) be
specified as shown in the following example:
<!NOTATION n SYSTEM "some system id">
<!ATTLIST #NOTATION n x CDATA #IMPLIED y CDATA #IMPLIED>
<!ENTITY e SYSTEM "another system id" NDATA n [ x="val1" y="val2" ]>
where the first two declarations establish a notation with
data attributes x
and y
, and the NDATA
entity declaration for
the e
entity demonstrates the syntax for providing data attribute
values.
Short references are a facility to replace short spans of punctuation
marks and other characters in text content, such as dots, commas, tabs,
brackets, spaces, and others, by entity references in a context-dependent way.
For example, short references can be used to replace a sequence of
two hyphen-minus characters (--
) with an ndash
character (roughly, a dash
the width of an n character). Using short references, text can be typed
using the hyphen-minus characters entered via standard keyboard keys,
yet can be rendered using the typographically and semantically more
desirable ndash character where appropriate.
<!SHORTREF shortref-map-name shortref-delimiter replacement-entity-name
[shortref-delimiter replacement-entity-name] ...>
Short references are declared in a short reference map declaration
for a named short reference map as shown in the following example,
which declares that a sequence of two hyphen-minus characters should
be replaced by a reference to the ndash-ent
entity, which in
turn maps to the character entity reference for ndash
(Unicode
code point 8211 in decimal) in text portions when the my-shortref-map
short reference map is active:
<!ENTITY ndash-ent "&#8211;">
<!SHORTREF my-shortref-map
"--" ndash-ent>
As shown, a short reference map maps short reference delimiters to general entity names, rather than to replacement text directly.
<!USEMAP shortref-map-name element-name>
or
<!USEMAP shortref-map-name name-group>
To then activate the my-shortref-map
short reference map within
the text content of a specific element (P
in this example) or a
group of elements, a short reference use declaration is used:
<!USEMAP my-shortref-map P>
Short reference map declarations can map more than a single
short reference delimiter to an entity, as shown in the following
example, which, in addition to mapping double hyphen-minus characters,
also maps quotation mark (U+0022 QUOTATION MARK) characters to
typographic citation mark characters (U+201C LEFT DOUBLE QUOTATION MARK,
represented by &#8220;
as decimal character entity reference),
which might be typographically more appealing, depending on the
text language and typographic conventions:
<!ENTITY ndash-ent "&#8211;">
<!ENTITY curlyquot-ent "&#8220;">
<!SHORTREF enhanced-typography
"--" ndash-ent
'"' curlyquot-ent>
<!USEMAP enhanced-typography p>
Of course, when the HTML predefined entities are declared in
the SGML declaration (such as when processing .html
or .md
files, or when a SGML declaration activating the HTML predefined
entities is put in the file to process, as shown here),
a short reference map can directly refer to predefined entities,
rather than having to declare ndash-ent
and curlyquot-ent
in
the prolog:
<!SGML HTML PUBLIC "+//IDN sgml.net//SD SGML declaration body for HTML//EN">
<!DOCTYPE body [
<!ELEMENT body - - (p+)>
<!ELEMENT p - - (#PCDATA)>
<!SHORTREF enhanced-typography
"--" ndash
'"' ldquo>
<!USEMAP enhanced-typography p>
]>
<body>
<p>"Murder" she said -- aka '4:50 from Paddington'</p>
</body>
The example imports the predefined entities for HTML,
declares a tiny HTML-like vocabulary consisting of the body
and p
elements, and declares a single short reference map,
enhanced-typography
, which maps both the double hyphen-minus and the
quotation mark delimiter to predefined entities. The short reference
map is made current within the content of p
elements by the short reference use declaration.
Invoking sgmlproc
on the above content will produce
<body>
<p>“Murder“ she said – aka '4:50 from Paddington'</p>
</body>
As the example shows, both double quote characters contained in
the text will be replaced by “
character entity references.
But typically, it will be desired to replace quote characters in pairs
such that only the quote character beginning a quote will produce
“
(U+201C LEFT DOUBLE QUOTATION MARK), while quote characters
ending a quote will produce ”
(U+201D RIGHT DOUBLE QUOTATION MARK).
To achieve this, a short reference use declaration can be placed
inline in content, as opposed to in the DTD. The following example places
a short reference use declaration in content (in addition to
using a short reference use declaration in the DTD) to toggle
into a short reference map in which the replacement text for
the double quote character maps to &rdquo
, thus
properly closing the quotation:
<!SGML HTML PUBLIC "+//IDN sgml.net//SD SGML declaration body for HTML//EN">
<!DOCTYPE doc [
<!ELEMENT doc - - (sub+)>
<!ELEMENT sub - - (#PCDATA)>
<!SHORTREF quotation-formatting1 '"' ldquo>
<!SHORTREF quotation-formatting2 '"' rdquo>
<!USEMAP quotation-formatting1 sub>
]>
<doc>
<sub>
"<!USEMAP quotation-formatting2>Murder" she said
</sub>
</doc>
When placed in content, a short reference use declaration doesn't associate one or more element names with a short reference map, but instead immediately makes the specified short reference map the current one. The short reference map remains current as long as the element in which the short reference use declaration is placed remains current, and is reset (and possibly re-established from the short reference use declarations in the DTD) when another element becomes current.
As shown, this isn't very useful yet; the desired effect could be achieved much more simply by just using the character entity references for typographic quotation marks directly in content. However, a short reference use declaration can also be placed into the replacement text of an entity named in a short reference map, as shown next:
<!SGML HTML PUBLIC "+//IDN sgml.net//SD SGML declaration body for HTML//EN">
<!DOCTYPE doc [
<!ELEMENT doc - - (sub+)>
<!ELEMENT sub - - (#PCDATA)>
<!ENTITY quotation-open '&#8220;<!USEMAP quotation-formatting2>'>
<!ENTITY quotation-close "&#8221;">
<!SHORTREF quotation-formatting1 '"' quotation-open>
<!SHORTREF quotation-formatting2 '"' quotation-close>
<!USEMAP quotation-formatting1 sub>
]>
<doc>
<sub>
"Murder" she said
</sub>
</doc>
As before, SGML will, upon encountering the first double quote
character, place the &#8220;
replacement entity text
into the result document; then SGML will encounter
<!USEMAP quotation-formatting2>
in the replacement text as
well, which will make SGML immediately switch to replacing
the subsequent, second double-quote character with &#8221;
.
A short reference use declaration in content can contain
the literal string #EMPTY
in place of a short reference
map name. Upon encountering such a short reference use declaration
in content, SGML will disable recognizing and replacing any short
reference delimiters until a new short reference map is made current
by either another short reference use declaration in content,
or by opening or closing elements such that an element
context becomes current which does have a short reference
map associated via a regular short reference use declaration
in the DTD.
While the above sketched solution solves the basic problem of placing typographic quotation marks around portions of text, manually placing short reference map use declarations in content to toggle short reference maps is too limiting when the text span to put into quotation marks needs to contain further span-level markup for formatting. For example, quoting a mathematical expression containing superscripted text (eg. for representing exponentiation) or other formatting can only be achieved by meticulously placing short reference use declarations into the short reference use maps of all elements which should be accepted in quotations.
As an alternative, the replacement entities of a short reference map can contain markup tags:
<!SGML HTML PUBLIC "+//IDN sgml.net//SD SGML declaration body for HTML//EN">
<!DOCTYPE body [
<!ELEMENT body - - (p+)>
<!ELEMENT p - - (#PCDATA|quot)+>
<!ELEMENT quot - - (#PCDATA)>
<!ENTITY start-quote "<quot>">
<!ENTITY end-quote "</quot>">
<!SHORTREF in-p
"--" mdash
'"' start-quote>
<!SHORTREF in-quote
'"' end-quote>
<!USEMAP in-p p>
<!USEMAP in-quote quot>
]>
<body>
<p>"Murder" she said -- aka '4:50 from Paddington'</p>
</body>
While the example markup in itself doesn't put quotation marks around quoted text, it delegates formatting quotations for presentation to HTML/CSS or another mechanism able to attach element-specific formatting and processing rules, such as SGML templating.
Generating start- and end-element tags via short references
is so common that a (general) entity declaration can
optionally be declared with a bracketed text type of
STARTTAG
to put STAGO (<
) and TAGC (>)
characters
before and after, respectively, replacement text, forming
a start-element tag. Likewise, when a replacement entity
named in a short reference map is declared to have
bracketed text type ENDTAG
, then ETAGO (</
) and TAGC
(>
) are put around the replacement text when referenced
in content. For example, the following two declarations
can be used in place of those in the example before to
produce exactly the same result:
<!ENTITY start-quote STARTTAG "quot">
<!ENTITY end-quote ENDTAG "quot">
The following short reference delimiters can be used as the literal to be recognized and replaced in short reference map declarations:
&#TAB;
&#RE;
&#RS;
&#RS;B
&#RS;&#RE;
&#RS;B&#RE;
B&#RE;
&#SPACE;
B
BB
"
#
%
'
(
)
*
+
,
-
--
:
;
=
@
[
]
^
_
{
|
}
~
where
B
recognizes one or more space or tab characters
BB
recognizes two or more space or tab characters
&#TAB;
recognizes a single tab character
&#SPACE;
recognizes a single space character
&#RS;
and &#RE;
recognize a record-start and a record-end character, respectively; see notes below
&#RS;B
recognizes a newline followed by one or more blank characters
&#RS;&#RE;
recognizes an empty line (two consecutive newline characters)
&#RS;B&#RE;
recognizes a line containing only blank characters
B&#RE;
recognizes one or more blank characters followed by a newline character
all other delimiters
are recognized when the respective character occurs verbatim in content
A short reference map is current while the associated element is the top-most one. Sub-elements descending from an element with an associated short reference map, like all other elements, don't have a current short reference map unless explicitly declared to have one in a short reference use declaration. That is, the current short reference map isn't inherited by child elements.
Short reference maps associated with elements inferred at a context position by opening contextually required elements to accommodate character data are not asserted to be current by sgmljs.net SGML. For example, in the following instance
<!DOCTYPE doc [
<!ELEMENT doc - - (sub+)>
<!ELEMENT sub O O (subsub+)>
<!ELEMENT subsub O - (#PCDATA)>
<!ENTITY start-sub STARTTAG "sub">
<!ENTITY end-subsub ENDTAG "subsub">
<!SHORTREF in-doc "&#RS;" start-sub>
<!SHORTREF in-subsub "|" end-subsub>
<!USEMAP in-doc doc>
<!USEMAP in-subsub subsub>
]>
<doc>
sometext |</doc>
sgmljs.net SGML will
within doc
element content, produce a sub
start-element tag (as per the start-sub
replacement
entity of the in-doc
short reference map)
within sub
content, infer a subsub
start-element tag (by subsub
's tag omission indicators
and the rules for tag inference)
but will not
recognize |
(the vertical bar character)
in the sometext |
text span, since the short reference
map for the inferred element subsub
only becomes current
after inference of the subsub
start-element tag.
Note that the behaviour of short reference maps in the presence of tag omission/inference is regarded as ambiguous in the SGML specification and is thus not fully portable across SGML systems.
For recognition of the &#RS;
and &#RE;
short reference
delimiters, SGML considers input text to consist of records,
which are text lines starting with a record start (RS)
and ending in a record end (RE) character. Actual text
content fetched via parsed entities will most of
the time contain just newline characters (mapped to an
SGML RS character) or other character sequences as line
terminators. For input data without RE characters,
RE characters are inserted into text
content prior to short reference delimiter recognition
at the text position right before RS characters, except
at the end of a text chunk or the end of an external entity,
where line termination characters are removed altogether
and replaced by a single RE character.
Short reference delimiters thus can use the &#RS;
and
&#RE;
characters as reliable signals/anchors for
applying substitutions at the beginning and end of text lines
instead of having to rely on newlines/line terminators
which are placed after every text line, even those not
followed by further text. For example, to parse text
content formatted as tab-separated (or comma-separated)
values, declarations such as the following can be used:
<!DOCTYPE tsv [
<!ELEMENT tsv - - (record+)>
<!ELEMENT record - - (field1,field2,field3)>
<!ELEMENT field1 - - (#PCDATA)>
<!ELEMENT field2 - - (#PCDATA)>
<!ELEMENT field3 - - (#PCDATA)>
<!ENTITY start-fields "<record><field1>">
<!ENTITY end-record "</field3></record>">
<!ENTITY end-field1 "</field1><field2>">
<!ENTITY end-field2 "</field2><field3>">
<!SHORTREF in-tsv "&#RS;" start-fields>
<!SHORTREF in-field1 "&#TAB;" end-field1>
<!SHORTREF in-field2 "&#TAB;" end-field2>
<!SHORTREF in-field3 "&#RE;" end-record>
<!USEMAP in-tsv tsv>
<!USEMAP in-field1 field1>
<!USEMAP in-field2 field2>
<!USEMAP in-field3 field3>
]>
<tsv>
A B C
1 2 3
</tsv>
(where the spaces between the A
, B
, C
and the
1
, 2
, 3
items represent single tab characters).
Note: the notion of records is applicable to any input
data parsed by SGML, not only to element content in the
presence of short reference declarations. However,
sgmljs.net SGML only considers record boundaries in
the context of short reference delimiter recognition,
and otherwise behaves according to the FEATURES OTHER KEEPRSRE YES
setting in the SGML declaration.
Within a short reference map with multiple delimiters, the declared delimiters are matched against input text data in an order honoring their relative nominal length and specificity as follows:
Short reference delimiter literals are compared against
already declared (in document order) short reference delimiter literals
of the same short reference map based on the number of
significant tokens such that &#RS;
, &#RE;
, &#SPACE;
,
and &#TAB;
are each counted as a single token, and all
other single characters (including B
characters) are
each also counted as a single token. Delimiter literals
having more tokens are considered more specific, and
will be matched prior to delimiter literals with
fewer tokens. When two delimiter literals have the
same number of tokens, their respective tokens are
compared individually, such that if a token is
B
, and the corresponding token in the delimiter to
compare is either &#SPACE;
or &#TAB;
, then the token
with &#SPACE;
or &#TAB;
is considered more specific
(and will be matched against text input before the delimiter
with the B
token). For short reference delimiter literals
with equal length and specificity, the declaration order
will be used as fallback to determine the order of
matching text input against short reference delimiters.
Note: this will make a literal ending in either &#TAB;
or &#SPACE;
match a string with potentially unmatched
subsequent space characters, when the whole sequence,
including unmatched subsequent space characters, could be
matched by a short reference delimiter ending in B
rather than &#SPACE;
or &#TAB;
and otherwise being
identical (e.g. because B
represents one or more blanks).
That is, the comparison ranks the specificity of
a short reference delimiter higher than the maximal
length of a matched text span.
Note that this behaviour is also considered insufficiently specified in the SGML specification, and is thus prone to limited portability across SGML systems.
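The ordering rule described above can be sketched in Python as follows. This is only an illustration, not sgmljs.net's implementation; the names `tokens`, `specificity_key`, and `matching_order` are hypothetical. As a simplification, a B token is ranked after any other token at the same position, which coincides with the rule above whenever the other token is &#SPACE; or &#TAB;.

```python
import re

# A token is one of the named function characters, or any single character.
TOKEN = re.compile(r"&#(?:RS|RE|SPACE|TAB);|.", re.DOTALL)

def tokens(delimiter):
    """Split a short reference delimiter literal into significant tokens."""
    return TOKEN.findall(delimiter)

def specificity_key(delimiter):
    """Sort key: more tokens first, then B tokens after non-B tokens."""
    toks = tokens(delimiter)
    return (-len(toks), tuple(1 if t == "B" else 0 for t in toks))

def matching_order(declared):
    """Return delimiters in matching order; sorted() is stable, so the
    declaration order serves as the fallback for equally specific literals."""
    return sorted(declared, key=specificity_key)

print(matching_order(["-", "B", "--", "&#SPACE;"]))
# prints ['--', '-', '&#SPACE;', 'B']
```

Using a stable sort keeps the fallback on declaration order implicit, so no separate tie-breaking step is needed.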
Marked sections are for including or ignoring, respectively, a portion of SGML prolog or content, optionally depending on the value of a parameter entity.
For example, the following document contains a marked section around a content portion:
<!DOCTYPE test [
<!ELEMENT test - - (#PCDATA|a)*>
<!ENTITY % condition "INCLUDE">
]>
<test>
The following hyperlink is included or
ignored based on the `condition` parameter
entity:
<![ %condition [
<a>Hyperlink text</a>
]]>
</test>
The SGML processor will reproduce the <a>Hyperlink text</a>
text
in its output because the effective value of the %condition
parameter entity is INCLUDE
; if it were IGNORE
instead, the
document would be treated as if <a>Hyperlink text</a>
weren't
contained in the document.
Moreover, the document prolog may contain marked sections, too.
In the following document, the attribute declaration will only
be applied if the condition
parameter entity has the value INCLUDE
:
<!DOCTYPE test [
<!ELEMENT test - - (#PCDATA|a)>
<!ENTITY % condition "INCLUDE">
<![ %condition [
<!ATTLIST test testatt CDATA #REQUIRED>
]]>
]>
<test testatt="some text">Some other text</test>
A further use case for marked sections (CDATA
and RCDATA
marked sections)
is to prevent interpretation of markup delimiters in portions of text.
A marked section
begins with the character sequence <![
,
followed by one or more marked section keywords,
followed by the [
character (possibly with whitespace before and/or after),
followed by the marked section text, and
closed with the character sequence ]]>
.
Keywords have the following meaning:
INCLUDE
means the portion wrapped in the marked section will be included; the marked section effectively is replaced by the wrapped marked section text
IGNORE
means the marked section is ignored, ie. skipped
TEMP
is equivalent to IGNORE
; offers a way to mark up editorial
content such as author comments without having to use IGNORE
CDATA
the marked section text is interpreted as verbatim text without interpreting markup delimiters and entity references
RCDATA
the marked section text is interpreted as verbatim text without interpreting markup delimiters except general and character entity reference start characters
If no keyword is encountered (ie. if the parameter entity is expanded
into blank text or if a construct such as <![[ text ]]>
is used),
the marked section will be treated as if INCLUDE
were specified.
If multiple keywords are encountered (if the parameter entity
expands to multiple keywords, or if multiple parameter entities are
used, each of which expands into a keyword) and IGNORE
is among
them, the marked section is treated as if it were an IGNORE
section.
That is, IGNORE
has highest precedence, followed by TEMP
, CDATA
,
and INCLUDE
.
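The keyword resolution just described can be summarized by a short Python sketch. The function name `effective_keyword` is hypothetical, and the sketch only ranks the four keywords for which the text above states a precedence; RCDATA is not placed in that order here, so the sketch rejects keyword lists it can't rank.

```python
# Precedence as stated above: IGNORE first, then TEMP, CDATA, INCLUDE.
PRECEDENCE = ["IGNORE", "TEMP", "CDATA", "INCLUDE"]

def effective_keyword(keywords):
    """Return the keyword governing a marked section.

    `keywords` is the list of keywords found after parameter entity
    expansion; an empty list is treated as INCLUDE.
    """
    if not keywords:
        return "INCLUDE"
    for candidate in PRECEDENCE:
        if candidate in keywords:
            return candidate
    raise ValueError("no ranked keyword among: " + ", ".join(keywords))

print(effective_keyword(["INCLUDE", "IGNORE"]))  # prints IGNORE
```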
Marked sections other than CDATA
and RCDATA
marked sections can be
nested up to four levels (ie. marked section text can contain further
marked sections, etc.).
Marked sections can contain any SGML construct valid in the context where the marked section is placed.
Marked sections only apply to top-level SGML constructs, and can't be used within e.g. attributes.
Note that it generally doesn't make sense to create a marked section
and use parameter entities to switch parsing behaviours between CDATA
and either INCLUDE
or IGNORE
because of how CDATA
marked sections
are parsed.
Note that sgmljs.net SGML doesn't support external entities in RCDATA
marked sections; as a workaround, it's possible to pull external
content into a parameter entity, then reference that parameter entity
in the replacement text literal of a general entity, and then reference
that general entity in an RCDATA
marked section.
Literals used for file and other resource names of external entities, declaration sets, notations, or other SGML components are called identifiers in SGML terminology.
Apart from system identifiers which were already used above, SGML also has public identifiers. Public identifiers don't name a physically existing or otherwise accessible resource, but identify a symbolic resource known to the SGML processing system out of band instead of or in addition to a system identifier.
For example, the DTD for HTML 4 (containing declarations for all markup
features understood by web browsers up until HTML 5 became generally
accepted as standard) can be referenced via the public
identifier -//W3C//DTD HTML 4.01//EN
without reference to a physical
location of a DTD file. Using a public, well-known identifier for this
purpose is appropriate since a web browser is usually hard-coded to interpret
a particular markup language (such as HTML, SVG, and MathML), and
isn't designed to render dynamic markup languages at runtime. Using a
system identifier, on the other hand, isn't beneficial here since it
would have to be treated as a constant rather than an actually accessible
resource by browsers anyway.
Since the introduction of SGML, Uniform Resource Locators (URLs) and variants have become widely used for locating and identifying resources on the web, similar to the purpose of system and public identifiers, respectively. Hence, SGML has been extended to allow the use of URLs as both system and public identifiers. While any URL can be used as system identifier as long as a resource can be located using it, public identifiers also need to include an owner identifier as a prefix which identifies a naming authority and a public text type which identifies the role of the virtual resource identified within a DTD. Therefore, URLs for public identifiers (e.g. for formal public identifiers) are required to have the particular syntax described below.
The following examples show how to declare entities with public identifiers, with system identifiers, and with both public and system identifiers, respectively:
<!ENTITY ent PUBLIC "pubid">
<!ENTITY ent SYSTEM "sysid">
<!ENTITY ent PUBLIC "pubid" "sysid">
Declarations for notations with public, with system, and with both public and system identifiers look very similar:
<!NOTATION n PUBLIC "pubid">
<!NOTATION n SYSTEM "sysid">
<!NOTATION n PUBLIC "pubid" "sysid">
In most cases, a system identifier is just a path string such as
"a/b/c" (using the forward-slash character as separator). Like with URLs
used in HTML href
or src
attributes, the path is resolved relative
to the SGML document or DTD from which it is referenced. Hence, a path
string can be used both to reference a file (when processing a local
SGML file) and a resource accessed via e.g. the HTTP protocol (when
accessing a remote SGML document via a network).
If support for formal system identifiers is enabled (which it is by default), a system identifier can also take the form of a string such as
<url base='http://localhost/dir'>file
called a formal system identifier.
The syntax for formal system identifiers resembles the syntax for markup elements with optional attributes. The string used as the pseudo-"element" of a formal system identifier, however, must be declared in a storage manager notation declaration rather than an element declaration, and the pseudo-"attributes" of a formal system identifier must be declared as data attributes of the storage manager notation.
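To make the element-like shape of a formal system identifier concrete, the following Python sketch splits one into the storage manager notation name, its pseudo-attributes, and the remaining storage object identifier. It is a simplified illustration, not the parser used by sgmljs.net: the function name `parse_fsi` is hypothetical, attribute values are assumed to be quoted, and a `>` character inside an attribute value is not supported.

```python
import re

# "<name attributes>rest"; attribute values must not contain ">" in this sketch.
HEAD = re.compile(r"<([A-Za-z][\w.-]*)([^>]*)>(.*)", re.DOTALL)
ATTR = re.compile(r"([\w.-]+)\s*=\s*(?:'([^']*)'|\"([^\"]*)\")")

def parse_fsi(identifier):
    """Return (notation name, attribute dict, storage object identifier),
    or None if the string isn't shaped like a formal system identifier."""
    match = HEAD.fullmatch(identifier)
    if match is None:
        return None
    name, attr_text, storage_object = match.groups()
    # Exactly one of the two quoted alternatives matches per attribute.
    attrs = {key: single or double
             for key, single, double in ATTR.findall(attr_text)}
    return name, attrs, storage_object

print(parse_fsi("<url base='http://localhost/dir'>file"))
# prints ('url', {'base': 'http://localhost/dir'}, 'file')
```

A plain path such as a/b/c yields None, reflecting that it is an informal system identifier.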
Note that not all kinds of formal system identifiers are supported in all system identifier roles as indicated in subsequent sections.
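As an illustration of the element-like shape of a formal system identifier, the following minimal Python sketch splits an identifier into its storage manager name, pseudo-attributes, and storage object identifier. The function name parse_fsi and the simplified grammar (one storage manager, quoted attribute values only) are assumptions made for this example and don't reflect the actual parser of any SGML system.

```python
import re

# Simplified name grammar; an assumption for this sketch only.
NAME = r"[A-Za-z][A-Za-z0-9._-]*"
ATTR = re.compile(rf"({NAME})\s*=\s*(?:'([^']*)'|\"([^\"]*)\")")
FSI = re.compile(
    rf"<({NAME})((?:\s+{NAME}\s*=\s*(?:'[^']*'|\"[^\"]*\"))*)\s*>(.*)",
    re.S,
)


def parse_fsi(system_id):
    """Return (storage manager, attributes, object id), or None for a plain path."""
    match = FSI.fullmatch(system_id)
    if match is None:
        return None  # informal system identifier such as "a/b/c"
    name, attr_text, object_id = match.groups()
    attrs = {}
    for attr in ATTR.finditer(attr_text):
        # group(2) holds a single-quoted value, group(3) a double-quoted one
        attrs[attr.group(1)] = attr.group(2) if attr.group(2) is not None else attr.group(3)
    return name, attrs, object_id


example = parse_fsi("<url base='http://localhost/dir'>file")
# example is ("url", {"base": "http://localhost/dir"}, "file")
```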
The osfile, url, osfd, and literal storage manager notations are part of the "storage manager notation starter set" as defined in ISO 10744:1997 (HyTime 2nd ed.), and are also available for use in third-party SGML systems such as (Open)SP.
sgmljs.net SGML has the additional storage manager notations exec, strftime, strptime, script, and sql as described below. For portability of SGML documents, usage of these additional storage manager notations must be declared in an FSI processing instruction in the declaration set where they are used, whereas osfile, url, osfd, and literal must not be declared in an FSI processing instruction.
The URL storage manager notation provides access to resources that can be addressed using a Uniform Resource Locator as defined in RFC 3986/RFC 6874. Note that while a URL formal system identifier can represent resources in a large variety of storage protocols and representation schemes, sgmljs.net SGML can only access URLs having the http: or https: scheme/protocol.
URL storage manager notation identifiers without an explicit scheme are interpreted relative to the URL of the SGML file in which an entity or other resource making use of the FSI is declared. In the cases discussed within sgmljs.net SGML reference manuals, this will either be a file: URL or an http:/https: URL. The interpretation of a scheme-less URL storage manager notation identifier is the same as that of an informal system identifier as used in the examples for external entity declarations. However, a URL imposes specific requirements with respect to encoding of special characters in resource names.
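The relative resolution described above follows the usual URL reference resolution rules. As a rough sketch, Python's standard urllib.parse module can show how a scheme-less identifier combines with the URL of the declaring file, and how special characters are percent-encoded. The URLs and file names below are made up for illustration, and resolve_system_id is a hypothetical helper, not part of any SGML tool.

```python
from urllib.parse import quote, urljoin


def resolve_system_id(declaring_url, system_id):
    """Resolve a scheme-less identifier against the URL of the declaring file."""
    return urljoin(declaring_url, system_id)


# Declaring document fetched over HTTP: the path resolves next to the document.
remote = resolve_system_id("http://localhost/dir/doc.sgm", "a/b/c")

# Declaring document read from a local file: the same rule yields a file: URL.
local = resolve_system_id("file:///home/user/doc.sgm", "chapter1.sgm")

# A space in a resource name has to be percent-encoded in a URL.
encoded = quote("my image.png")
```

Here remote is http://localhost/dir/a/b/c, local is file:///home/user/chapter1.sgm, and encoded is my%20image.png.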
The URL storage manager notation behaves as if declared as follows:
<!NOTATION url
PUBLIC "-//IETF/RFC1738//NOTATION
FSISM PORTABLE Uniform Resource Locator//EN">
<!ATTLIST #NOTATION url
base CDATA #IMPLIED>
The url storage manager notation can be used as a derived storage manager notation in a custom storage manager declaration.
For example, the following declarations declare the value of the ent parameter entity as the URL formed by resolving image1.png relative to a site-wide path for storage of images, rather than relative to the document in which the declaration is placed:
<!NOTATION myurl SYSTEM>
<!ATTLIST #NOTATION myurl
superdcn NAME #FIXED url
base CDATA #FIXED "/images">
<?IS10744 FSIDR myurl>
<!ENTITY % ent
SYSTEM "<myurl>image1.png">
The superdcn attribute has a #FIXED value of url, declaring to SGML that myurl should be treated as a storage manager notation derived from the built-in url storage manager notation.
A storage manager notation identifier can begin with <osfile>, followed by a string interpreted as a file name; this option is used to override interpretation of the identifier as a URL path (such as when interpretation of URL percent-encoding is undesired).
A storage manager notation identifier can begin with <osfd>, followed by a file descriptor number in the range 0-4. The file descriptor number corresponds to one of those specified by POSIX (IEEE Std 1003.1-2008) for the standard file descriptors of a Unix process.
For example, <osfd>0 represents the standard input, <osfd>1 represents the standard output, and <osfd>2 represents the standard error output of the Unix process of the SGML processor parsing the SGML document.
Usage of osfd storage manager notation system identifiers is explained in templating.
The system identifier for an entity declaration can be a string beginning with <literal>, followed by literal replacement text for the entity; this form of system identifier is functionally equivalent to using the literal text as replacement text in an entity declaration.
For example, the following declarations result in general entities expanding to the same value:
<!entity e "replacement text">
<!entity f system "<literal>replacement text">
The system identifier of a parameter entity declaration can contain a string beginning with <exec and specifying an executable Unix shell command in its cmd pseudo-attribute. The value of a parameter entity declared with an exec formal system identifier is the output (Unix standard output) of the command being executed.
For example, the following declaration establishes the file-listing parameter entity containing character data produced as output by the Unix ls file listing program (with *.txt as parameter):
<!entity % file-listing system "<exec cmd='ls *.txt'>">
The program is executed with the directory of the SGML file containing the entity declaration as current working directory.
The program input is declared as the element content of the exec storage manager pseudo-element; for example, the following declaration establishes the str entity containing the result of replacing b by h in the input Abba, supplied as Unix standard input to the tr program for replacing characters:
<!entity % str system "<exec cmd='tr b h'>Abba">
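To make the data flow of the declaration above concrete, here is a minimal Python sketch of the equivalent operation: a shell command receives the storage object identifier text on standard input, and its standard output becomes the entity value. The helper name exec_fsi is hypothetical, and the sketch assumes a Unix-like system where the tr program is available; it isn't how sgmljs.net SGML is implemented.

```python
import subprocess


def exec_fsi(cmd, stdin_text="", cwd=None):
    """Run cmd in a shell, feed stdin_text to it, and return its standard output."""
    done = subprocess.run(
        cmd,
        shell=True,           # cmd is interpreted by the Unix shell
        input=stdin_text,     # text after the <exec ...> pseudo-tag
        capture_output=True,
        text=True,
        cwd=cwd,              # would be the directory of the declaring SGML file
        check=True,
    )
    return done.stdout


value = exec_fsi("tr b h", "Abba")
# value is "Ahha": each b in the input was replaced by h
```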
Note that while the value for cmd can also be specified using single-quote characters, the declaration syntax puts restrictions on the simultaneous use of the single- and double-quote shell meta characters in cmd values. Moreover, the exec storage manager notation isn't generally available in web processing contexts for security reasons (and where it is available, will be restricted in terms of file access and available commands).
The exec storage manager notation behaves as if declared as follows:
<!NOTATION exec
PUBLIC "+//IDN sgml.net//NOTATION
FSISM POSIX Shell Command Language//EN">
<!ATTLIST #NOTATION exec
cmd CDATA #REQUIRED
in CDATA #IMPLIED>
As an alternative to specifying the characters that comprise the standard input for the command inline, input can be specified using the in parameter. The in parameter can have the value <osfd>0, in which case the standard input stream of the SGML processor parsing the SGML document declaring the parameter entity is used/inherited as standard input for the command being executed.
The exec storage manager notation can be used as a derived storage manager notation in a custom storage manager declaration. Data attributes declared for a custom storage manager declaration deriving from exec are interpreted as Unix environment variables to be set in the execution environment of cmd.
For example, the following declarations declare the value of the ent parameter entity as the output of executing echo $PARAM on a Unix command line shell with PARAM set to the value MyParam:
<!NOTATION custom SYSTEM>
<!ATTLIST #NOTATION custom
superdcn NAME #FIXED exec
PARAM CDATA #IMPLIED>
<?IS10744 FSIDR exec custom>
<!ENTITY % ent
SYSTEM "<custom cmd='echo $PARAM' PARAM='MyParam'>">
As shown, the use of the exec storage manager notation, as well as any custom storage manager notations deriving from it, must be declared in an FSI processing instruction such as
<?IS10744 FSIDR exec custom>
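The way data attributes of a derived notation reach the command as environment variables can be pictured with the following Python sketch. It assumes a Unix shell, and the function name exec_with_attributes is invented for this example; the sketch only mirrors the described behavior.

```python
import os
import subprocess


def exec_with_attributes(cmd, attributes):
    """Run cmd in a shell with each data attribute exported as an environment variable."""
    env = dict(os.environ)   # start from the inherited environment
    env.update(attributes)   # e.g. PARAM='MyParam' from the formal system identifier
    done = subprocess.run(
        cmd, shell=True, env=env, capture_output=True, text=True, check=True
    )
    return done.stdout


value = exec_with_attributes("echo $PARAM", {"PARAM": "MyParam"})
# value holds the text MyParam followed by the newline that echo appends
```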
The strptime and strftime storage manager notations implement date parsing and formatting, respectively, according to POSIX specifications. They convert a Unix epoch time (the time in seconds since the "epoch", ie. since January 1st, 1970, 00:00:00 UTC) into a human-readable date/time representation (using strftime), or vice versa (using strptime).
As implied by its name, strftime implements parts of the POSIX (IEEE Std 1003.1-2008) date and time template format described as part of the ISO C standard. See http://pubs.opengroup.org/onlinepubs/009695399/functions/strftime.html for reference.
A POSIX strftime/strptime date and time template consists of conversion specifiers for the format of individual date and time components intermixed with space or punctuation characters. It is used for parsing a given date representation such as Tue Jan 23, 2010 into the Unix epoch time numerical representation, as well as for formatting a numerical representation into a human-readable form.
Compared to the full POSIX specification for strftime and strptime, only the following subset of conversion specifiers is supported:
%a (three-letter day of week in the international locale; eg. "Mon")
%b (three-letter month of the year in the international locale; eg. "Apr")
%Z and %z at the end of the format literal (the literal letter "Z")
%m (month of the year as a two-digit decimal value with leading zero)
%Y (year as four-digit decimal value)
%d (day of the month)
%H (hour of the day)
%M (minute of the hour)
%S (second of the minute)
In addition to the conversion specifier character %, the $ character is supported. The % character can be an unfortunate choice when a strptime or strftime storage manager notation system identifier literal is assembled from parameter entities, since % also introduces parameter entity references.
Note that the system identifier in an entity declaration literal itself isn't subject to parameter entity expansion. Instead, the complete identifier literal construct, including surrounding quotation characters, must be expanded from a parameter entity at the place where the system identifier literal construct is expected in an entity declaration to make use of entity expansion on system identifiers.
The strptime and strftime date and time template format according to the POSIX specification requires both the conversion specifiers in the date/time pattern and the matching text tokens in the parsed or formatted value to be separated by spaces or punctuation character tokens.
For example, whereas %d %m %Y is a valid template, %d%m%Y is not, since its conversion specifiers aren't separated by space or punctuation characters. However, as exceptions to this rule, in sgmljs.net SGML,
the common sequences %Y%m and %y%m are supported when appearing as the only conversion specifiers (eg. with other text being either absent or consisting of just leading and/or trailing boilerplate text)
the ISO 8601 date/time formats and the format used for date/time representation in RFC 2616 (the HTTP/1.1 specification) are supported
As an example for parsing an HTTP date, the parameter entity declaration
<!ENTITY % d SYSTEM "<strptime fmt='%a, %d %b %Y %H:%M:%S %Z'>Tue, 26 Mar 1996 22:20:12 GMT">
will result in the value 827878812 (the number of seconds elapsed between the epoch and the given date/time).
For parsing an ISO 8601 combined date/time, an entity declaration similar to the following can be used:
<!ENTITY % d SYSTEM "<strptime fmt='%Y-%m-%dT%H:%M:%S%z'>1996-03-26T22:20:12Z">
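The two templates above can be cross-checked with Python's datetime module, which accepts the same conversion specifiers. This is only a sketch for verifying the arithmetic, assuming Python 3.7 or later (where %z accepts a literal Z) and the default C locale for day and month names; to_epoch_seconds is a name chosen for this example. It shows that 827878812 is the count of whole seconds since the epoch for the given date.

```python
from datetime import datetime, timezone


def to_epoch_seconds(value, fmt):
    """Parse value with a strptime template and return whole seconds since the epoch."""
    parsed = datetime.strptime(value, fmt)
    if parsed.tzinfo is None:
        # %Z recognizes "GMT" but leaves the result naive, so treat it as UTC
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


http_date = to_epoch_seconds(
    "Tue, 26 Mar 1996 22:20:12 GMT", "%a, %d %b %Y %H:%M:%S %Z"
)
iso_date = to_epoch_seconds("1996-03-26T22:20:12Z", "%Y-%m-%dT%H:%M:%S%z")

# The reverse direction, as strftime would do it:
formatted = datetime.fromtimestamp(827878812, tz=timezone.utc).strftime("%Y-%m-%d")
```

Both http_date and iso_date evaluate to 827878812, and formatted is 1996-03-26.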
The script storage manager notation is used to invoke an ECMAScript function and make the result of its execution available as the replacement text for a parameter entity. Note there's no support for using a script FSI directly in the declaration of a general entity; general entities can receive a script-generated value only by copying from a parameter entity (eg. by referencing a parameter entity declared using a script FSI in the replacement text for the general entity).
An entity declaration using a script FSI takes either of the following forms:
<!ENTITY % e SYSTEM "<script>return 'hello'">
or
<!ENTITY % e SYSTEM "<script module='mymodule' function='myfunction'>">
script FSIs specifying the ECMAScript code text directly as storage object identifier (as inline content of the <script> pseudo-element) are always executed synchronously, and represent the ECMAScript expression returned from executing the specified ECMAScript code as if it were evaluated by constructing an ECMAScript Function object and invoking it with any storage manager notation data attributes bound to the accordingly named function parameters.
Script code text specified inline can be dynamically executed or can be bundled, depending on the level of script support in the sgmljs.net SGML application binary. Bundled code is code that becomes part of an sgmljs.net SGML application binary at build time, rather than at the time of processing SGML. When an sgmljs.net SGML application binary is built with bundled functions, inline code text presented in a script FSI is compared to the script code text that was provided at build time, and is expected to match the prebuilt code text exactly. For script code to be bundled into an sgmljs.net SGML application binary, it must be presented as part of an FSI definition document in an IS10744 fsidr processing instruction.
script FSIs specifying values for the module and function storage manager notation attributes are interpreted as references to a bundled (static-like) CommonJS module. The parameter entity replacement text is obtained by invoking the function specified in the function data attribute within the module specified in the module data attribute.
Like bundled inline-provided code text, script FSIs specifying values for module and function must be declared as custom storage manager notations in an IS10744 fsidr processing instruction.
Invoked modules are interpreted as ECMAScript CommonJS modules with the system object being declared as a member of the ECMAScript global object. Specifically, data attributes specified as part of the FSI can be accessed using the system.env map. The result of executing module code, which becomes the replacement text of the declared parameter entity, is expected to be asynchronously written to the system.stdout text stream via its write() member function, with a final call to system.stdout.end() to continue processing of the SGML document the declaration is part of.
Bundling CommonJS modules and detailed instructions are specific to a target platform.
The sql storage manager is used to fetch content from SQL databases into SGML parameter entities; like <script> FSIs, <sql> FSIs can't be used directly in general entity declarations.
The replacement text assigned to the parameter entity is the result of executing an SQL query or statement, formatted as delimiter-separated values (with the vertical bar character as column delimiter by default for compatibility with markdown table conventions) or as a single atomic text string (if only a single attribute value is fetched in the SQL query).
The sql storage manager notation, when used in an IS10744 FSIDR declaration, and in the absence of an explicit notation declaration, is declared as
<!NOTATION sql
PUBLIC "ISO/IEC 10744:1992//NOTATION SQL Storage Object Specification//EN">
<!ATTLIST #NOTATION sql
connectstr CDATA #IMPLIED
headings (OFF|ON) ON
underline (OFF|ON) ON
colsep CDATA "|">
The storage object identifier of an <sql> FSI
can contain SQL statements and additional directives to control output (cf. the common SQL script syntax supported by the sqlproc tools)
can reference declared non-standard custom data attributes of the sql notation or a custom storage manager notation by using the & ampersand character.
For example, the content of the query-result parameter entity in the following declaration is the result of querying the NAMES database table for all names with a particular gender_cd value:
<!ATTLIST #NOTATION sql gender CDATA #IMPLIED>
<!ENTITY % query-result
SYSTEM "<sql gender='0'>SELECT NAME FROM NAMES WHERE GENDER_CD = '&gender';">
Note that while &gender might look like a general entity reference here, substitution of &gender by the actual value of the gender data attribute applies rules for safe/injection-free text substitution in SQL and will only substitute references in SQL text literal content (eg. portions enclosed in single quote characters).
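The combination of injection-free value substitution and delimiter-separated output can be illustrated with Python's built-in sqlite3 module. This is an analogy only: the table content, the query_as_text helper, and the use of a bound ? parameter in place of an &gender reference are assumptions for this sketch, and the underline option of the sql notation is left out.

```python
import sqlite3


def query_as_text(conn, sql, params=(), colsep="|", headings=True):
    """Run a query and format the result as delimiter-separated lines."""
    cursor = conn.execute(sql, params)  # params are bound safely, never spliced in
    rows = cursor.fetchall()
    if len(rows) == 1 and len(rows[0]) == 1:
        return str(rows[0][0])          # single value: plain text string
    lines = []
    if headings:
        lines.append(colsep.join(col[0] for col in cursor.description))
    lines.extend(colsep.join(str(value) for value in row) for row in rows)
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NAMES (NAME TEXT, GENDER_CD TEXT)")
conn.executemany(
    "INSERT INTO NAMES VALUES (?, ?)",
    [("Ada", "0"), ("Bob", "1"), ("Eve", "0")],
)

result = query_as_text(
    conn, "SELECT NAME FROM NAMES WHERE GENDER_CD = ? ORDER BY NAME", ("0",)
)
pairs = query_as_text(conn, "SELECT NAME, GENDER_CD FROM NAMES ORDER BY NAME")
```

With the sample rows above, result consists of the heading line NAME followed by Ada and Eve on separate lines, and pairs uses the vertical bar between the two columns.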
Unlike built-in storage manager notations, custom storage manager notations are notations with user-defined names declared as derived from either the url or the exec storage manager notation, with (typically) fixed data attributes.
See exec-storage-manager-notation for an example of a custom storage manager notation.
As already explained, use of the non-standard exec, strptime, strftime, and script storage manager notations must be declared in an FSI processing instruction to make the functionality available to entity declarations:
<?IS10744 FSIDR exec strptime strftime>
An IS10744 FSIDR processing instruction must be placed in every declaration set (document type declaration set or link process declaration set) making use of a non-standard storage manager notation.
An IS10744 FSIDR processing instruction can reference an FSI definition document using the syntax shown in the following example:
<?IS10744 FSIDR exec strptime strftime FSIDefDoc="fsidd.declarations">
(where fsidd.declarations represents a file name and is to be replaced by the name of the actual file to use).
The referenced FSI definition document is expected to contain custom notation declarations and their associated data attribute declarations. If no FSI definition document is specified, custom storage manager notations are expected to be declared in the document and declaration set wherein the FSI processing instruction is placed.
The form of an IS10744 FSIDR processing instruction with an FSI definition document is used for organizing storage manager notations as coordinated resource access methods for larger sets of documents such as web sites.
A public identifier is a sequence of the ASCII characters A through Z, a through z, the decimal digits, the characters (, ), +, , (comma), . (dot), /, :, =, ?, -, and the space, newline and carriage-return characters.
If FORMAL YES is specified in the SGML declaration, a public identifier must have the following syntax:
Owner identifier
Either the string "ISO" followed by a string made of digits and : (colon) characters, followed by the string //; or one of the characters + or - followed by the string //, followed by a string not containing the / (slash) character, followed by the string //
Public text class
One of the following strings, directly following the preceding string //:
CAPACITY
CHARSET
DOCUMENT
DTD
ELEMENTS
ENTITIES
LPD
NONSGML
NOTATION
SHORTREF
SUBDOC
SYNTAX
TEXT
Public text description
A string of characters not containing the / (slash) character
The string //
Public text designating sequence (for CHARSET public text) or public text language (for other public text classes)
A string of characters not containing the / (slash) character
The string //
Public text display version
A string of characters not containing the / (slash) character
Except for CHARSET public text, the components following the public text description are optional; if the optional components are omitted, the public identifier ends in the public text description.
The public text display version is optional; if the public text display version is omitted, the public identifier ends in the public text designating sequence or public text language.
As examples for public identifiers, here are the public identifiers of HTML 4 and SGML, respectively:
-//W3C//DTD HTML 4.01//EN
ISO 8879:1986//NOTATION Standard Generalized Markup Language (SGML)//EN
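The component structure described above can be sketched as a small Python function that splits a formal public identifier at its // separators. The function name parse_fpi and the returned dictionary layout are assumptions for illustration; the sketch doesn't check every constraint of the standard (such as the permitted character set).

```python
PUBLIC_TEXT_CLASSES = {
    "CAPACITY", "CHARSET", "DOCUMENT", "DTD", "ELEMENTS", "ENTITIES", "LPD",
    "NONSGML", "NOTATION", "SHORTREF", "SUBDOC", "SYNTAX", "TEXT",
}


def parse_fpi(public_id):
    """Split a formal public identifier into its components."""
    # Collapse runs of whitespace (including newlines) into single spaces first.
    parts = [part.strip() for part in " ".join(public_id.split()).split("//")]
    if parts[0] in ("+", "-"):
        if len(parts) < 3:
            raise ValueError("owner identifier is incomplete")
        owner, rest = parts[0] + "//" + parts[1], parts[2:]
    else:
        owner, rest = parts[0], parts[1:]   # ISO owner identifier
    if not rest:
        raise ValueError("text identifier is missing")
    text_class, _, description = rest[0].partition(" ")
    if text_class not in PUBLIC_TEXT_CLASSES:
        raise ValueError("unknown public text class: " + text_class)
    return {
        "owner": owner,
        "class": text_class,
        "description": description,
        "language": rest[1] if len(rest) > 1 else None,  # or designating sequence
        "version": rest[2] if len(rest) > 2 else None,
    }


html4 = parse_fpi("-//W3C//DTD HTML 4.01//EN")
# html4["owner"] is "-//W3C", html4["class"] is "DTD", html4["language"] is "EN"
```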
If URN YES is specified in the SGML declaration, any public identifier in a markup declaration can also be declared using an alternative URN syntax (in addition to the standard syntax for public identifiers when FORMAL YES is specified).
Examples for the public identifiers in URN syntax corresponding to those in standard public identifier syntax above are as follows:
urn:publicid:-:W3C:DTD+HTML+4.01:EN
urn:publicid:ISO+8879%3A1986:NOTATION+Standard+Generalized+Markup+Language+(SGML):EN
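The correspondence between the two notations in the examples is consistent with the transcription rules of RFC 3151 ("A URN Namespace for Public Identifiers"). Assuming those rules apply, the mapping can be sketched in Python as follows; fpi_to_urn is a name chosen for this example.

```python
# Single-character transcriptions; two-character sequences are handled first.
SINGLE = {
    "+": "%2B", ":": "%3A", "/": "%2F", ";": "%3B", "'": "%27",
    "?": "%3F", "#": "%23", "%": "%25", " ": "+",
}


def fpi_to_urn(public_id):
    """Transcribe a public identifier into urn:publicid: form."""
    text = " ".join(public_id.split())  # trim and collapse whitespace
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "//":
            out.append(":")             # component separator
            i += 2
        elif pair == "::":
            out.append(";")
            i += 2
        else:
            out.append(SINGLE.get(text[i], text[i]))
            i += 1
    return "urn:publicid:" + "".join(out)


html4_urn = fpi_to_urn("-//W3C//DTD HTML 4.01//EN")
# html4_urn is "urn:publicid:-:W3C:DTD+HTML+4.01:EN"
```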
Each markup declaration is part of a declaration set. A declaration set is either a document type declaration set or a link process declaration set (until now we have only considered document type declaration sets, see Templating Reference for link process declaration sets).
Any SGML document prolog consists of one or more named declaration sets, as in the following example:
<!DOCTYPE D [
... markup declarations ...
]>
<!DOCTYPE E [
... markup declarations ...
]>
... document content ...
Note that standard SGML only allows multiple DTDs to occur if the CONCUR YES, LINK IMPLICIT YES, or LINK EXPLICIT YES n feature is active in the SGML declaration.
Via parameter entities, a declaration subset can reference markup declarations stored in other text files or external resources. In the following example, markup declarations from the e parameter entity, declared to contain the content of external-declarations.dtd, are included in the DTD:
<!doctype D [
... other declarations ...
<!entity % e system "external-declarations.dtd">
%e;
]>
... document content ...
The following is an alternative syntax for achieving the same:
<!doctype D system "external-declarations.dtd" [
... other declarations ...
]>
... document content ...
An identifier specification such as system "external-declarations.dtd" is called the external subset identifier and is interpreted by SGML such that the markup declarations located or identified by it are included at the end of the declaration set being declared.
The set of markup declarations introduced via an external subset identifier is called the external subset, as opposed to the internal subset, which is the set of markup declarations that appear bracketed within the [ and ] delimiters of a DTD or LPD.
As a consequence of the external subset being processed after the internal subset, the internal subset can preempt ("override") entity declarations (but not other markup declarations) in the external subset. In the following example, the x entity is declared both in the internal-subset-preemption-example.sgm document and in external-declarations.dtd:
<!-- external-declarations.dtd -->
<!entity x "This">
<!-- internal-subset-preemption-example.sgm -->
<!doctype d system "external-declarations.dtd" [
<!entity x "That">
]>
<d>&x</d>
The declaration in the internal subset gets processed first and sets the replacement text value for x (to That); the declaration for x in external-declarations.dtd is ignored, because a declaration for x is already established when external-declarations.dtd is processed.
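The "first declaration wins" rule can be mimicked with a dictionary in which an entry, once set, is never overwritten. The following Python sketch is only an analogy for the processing order described above; declare_entity and the two declaration lists are made up for the example.

```python
def declare_entity(entities, name, replacement_text):
    """Record an entity declaration unless the name is already declared."""
    entities.setdefault(name, replacement_text)


entities = {}
internal_subset = [("x", "That")]
external_subset = [("x", "This"), ("y", "Other")]

# The internal subset is processed first, then the external subset.
for name, text in internal_subset + external_subset:
    declare_entity(entities, name, text)

# entities["x"] keeps the internal subset value; y comes from the external subset
```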
Note that the term "parameter entity" is due to this feature; it emphasizes that the internal subset "parametrizes" or "configures" external subset defaults such that settings more specific to the document instance apply.
SGML allows the external subset to be specified by any kind of identifier, ie. allows it to be specified as a public identifier (or as both a system and public identifier), but sgmljs.net SGML can't resolve public identifiers for external subsets except for the HTML DTD described in SGML Web Reference, and otherwise always requires a system identifier.
Though, technically, the result of using an external subset specification is the same as that of using an explicit parameter entity declaration and reference as in the initial example, applications may treat the syntactic representation as an external subset identifier specially; for example, in (the "lax" variant of) templating, only the formal external subset identifier, rather than merely an identifier for a named parameter entity, establishes eligibility of document fragments for inclusion into master documents.
An external subset of either a DOCTYPE or a LINKTYPE declaration set (explained in templating) can be specified as a system-specific entity declaration using the following syntax:
<!DOCTYPE x SYSTEM>
or
<!LINKTYPE y ... SYSTEM>
sgmljs.net SGML resolves system-specific external declaration sets by accessing a file having a name derived from an implied declaration set name or document element, looked up in the same directory as the instance file referencing the system-specific external subset:
the declaration text for an external document type declaration subset referenced via <!DOCTYPE #IMPLIED SYSTEM> is looked up as e.dtd, where e is the document element name of the instance referencing the external subset
the declaration text for an external document type declaration subset referenced via <!DOCTYPE x SYSTEM>, where x is a regular element and document type name, is accessed from the file x.dtd
the declaration text for an external link process declaration subset referenced via <!LINKTYPE y ... SYSTEM> is accessed from the file y.lpd
Processing is aborted with an error if a file for declaration text as described can't be accessed.
Note that using SYNTAX NAMECASE GENERAL YES has the consequence that the base file name (x or y in the examples) will be converted to all-uppercase when accessing a derived file name (but the .dtd or .lpd file name suffixes are always used with lowercase letters).
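The naming rules above can be summarized in a few lines of Python. This is a sketch of the described lookup only: declaration_path and its parameters are hypothetical, and POSIX-style paths are assumed.

```python
from pathlib import PurePosixPath


def declaration_path(instance_path, name, suffix="dtd", namecase_general=True):
    """Derive the file expected to hold a system-specific external declaration set.

    name is the document type name (or the document element name for
    #IMPLIED), or the link type name; suffix is "dtd" or "lpd".
    """
    base = name.upper() if namecase_general else name
    # Looked up next to the instance; the suffix always stays lowercase.
    return PurePosixPath(instance_path).parent / (base + "." + suffix)


dtd_file = declaration_path("/docs/site/page.sgm", "x")
lpd_file = declaration_path("/docs/site/page.sgm", "y", "lpd", namecase_general=False)
# dtd_file is /docs/site/X.dtd and lpd_file is /docs/site/y.lpd
```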
Using an optional SGML declaration, it's possible to specify general properties of a document instance such as its character set, the characters used for markup delimiters, whether, and which, markup minimization features such as tag omission are used, among other things.
An SGML declaration body is a piece of plain text (as described in detail further below) contained either directly in a document instance (in an SGML declaration at the beginning of the document instance) or stored in an external entity and referenced via an identifier in an SGML declaration reference at the beginning of the document instance.
A conformant SGML processor isn't required to be able to process an SGML declaration; if it isn't, the information contained in an SGML declaration is provided for manual inspection and comparison against the SGML declaration(s) and features supported by the processing system.
sgmljs.net SGML, as much as possible, is designed to avoid the necessity of having to bother with SGML declarations, by
inferring applicable SGML declarations from file name suffixes or other out-of-band information such as HTTP/IANA media types
supporting the use of SGML declaration references in place of full SGML declarations (as described below)
allowing XML declarations to act as SGML declaration.
Note that certain sgmljs.net SGML tools/builds lack support for parsing SGML declarations altogether.
Basic SGML declaration shows the beginning of a document instance with the traditional basic SGML declaration, asserting, to the processor, that the document instance is using the reference concrete syntax and other basic settings.
sgmljs.net SGML, while accepting the Basic SGML declaration below, doesn't support all features requested in this declaration (namely certain markup minimization forms requested with SHORTTAG YES, which are mostly interesting from a historical perspective) and can't claim full conformance insofar as support for these legacy features is mandated for conformance.
For sgmljs.net SGML, the preferred SGML declaration syntax to use is the one introduced with the WebSGML (ISO 8879:1986 Annex K) revision of SGML as explained further below, which can express presence or absence of legacy features on a more granular level, and hence can more readily represent sgmljs.net SGML's feature set.
The Annex K revision of SGML has extended both the SGML declaration syntax and the syntax of markup declarations; use of any WebSGML extension is indicated by using "ISO 8879:1986 (WWW)" as minimum data/literal for the initial part of an SGML declaration body. The WebSGML additions include essential changes for parsing XML and HTML.
In sgmljs.net SGML, WebSGML additions are available even in the absence of an SGML declaration.
Note that SGML declaration settings are only discussed insofar as they are supported by sgmljs.net SGML:
sgmljs.net SGML is designed to accept markup in the reference concrete syntax (with supported WebSGML additions to cover XML and HTML as explained below), which covers basically all angle-bracket markup languages, including the fundamental syntax used by XML and HTML
while the SGML declaration in principle allows redefinition of function characters and delimiters (such as the < character), and of reserved names, this isn't supported by sgmljs.net SGML
only UTF-8-encoded document instances are consistently supported across all regular sgmljs.net SGML tools.
Note that an SGML declaration is rarely used in the basic form given below even in English-speaking countries because of its restriction of the set of usable characters in the document instance to just the IRV/ASCII characters.
Whitespace (space, tab, and newline characters), as well as SGML comments (text between -- character sequences), aren't significant in SGML declarations and are only provided for formatting (in the sense that any whitespace sequence can be replaced by a single space character).
<!SGML "ISO 8879:1986"
CHARSET
BASESET "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET
0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
CAPACITY PUBLIC "ISO 8879:1986//CAPACITY Reference//EN"
SCOPE DOCUMENT
SYNTAX PUBLIC "ISO 8879:1986//SYNTAX Reference//EN"
FEATURES
MINIMIZE
DATATAG NO
OMITTAG YES
RANK NO
SHORTTAG YES
LINK
SIMPLE NO
IMPLICIT NO
EXPLICIT NO
OTHER
CONCUR NO
SUBDOC NO
FORMAL NO
APPINFO NONE>
<!-- document prolog and content following here ... -->
"ISO 8879:1986"
(minimum data)
indicates the SGML declaration syntax revision being used
sgmljs.net SGML supports all released revisions (ie. also "ISO 8879:1986 (ENR)" or "ISO 8879:1986 (WWW)" in addition to "ISO 8879:1986")
the "ISO 8879:1986 (ENR)" minimum literal asserts that the extensions to the SGML declaration syntax introduced with ISO 8879 Annex J can be used; see Extended naming rules
"ISO 8879:1986 (WWW)" asserts that, in addition, those from ISO 8879 Annex K can be used; see WebSGML
for sgmljs.net SGML, use of "ISO 8879:1986 (WWW)" is always recommended
CHARSET BASESET ... (document base character set)
this asserts that the IRV/ASCII character set is used in the document instance as base character set
the literal is a formal public identifier containing the public text designating sequence ESC 2/5 4/0 representing the escape sequence to (virtually) switch to the IRV coding system
SGML uses the escape sequences registered with the International register of character sets to be used with escape sequences (which complies with ISO 2022:1986 and ISO/IEC 2022:1994, respectively) to identify character sets
sgmljs.net SGML recognizes the public identifiers for character sets as listed in Base Character Set
for an in-depth description, see also Character Sets and Encodings
DESCSET ... (described set of the document base character set)
represents the "described set" of characters (of the base character set) used in a document instance
in the basic SGML declaration above, contains a list of character ranges with the following meaning:
0 9 UNUSED means the character numbers 0 through 8 (9 characters) are unused, ie. asserted not to occur in a document instance;
9 2 9 means the character numbers 9 and 10 (2 characters) should be treated as character numbers 9 (the tab character) and 10, respectively (the last text token in a described set portion, if it is a number, is interpreted to mean that the described character range is mapped to the range starting at the specified number);
and similar for the other described set portions
the majority of characters of the set is described in portion 32 95 32, meaning the character range 32 through 126 (95 characters) is mapped "to itself" (ie. to the range starting at 32)
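The arithmetic of described set portions (start number, count, and either UNUSED or the first target number) can be checked with a short Python sketch. The tuple representation and the function name expand_descset are assumptions for this example; the portions are those of the basic SGML declaration above.

```python
def expand_descset(portions):
    """Expand (start, count, target) portions into a per-character mapping.

    target is either the string "UNUSED" or the base set number that the
    first character of the range maps to; unused characters map to None.
    """
    mapping = {}
    for start, count, target in portions:
        for offset in range(count):
            mapping[start + offset] = None if target == "UNUSED" else target + offset
    return mapping


BASIC_DESCSET = [
    (0, 9, "UNUSED"),
    (9, 2, 9),
    (11, 2, "UNUSED"),
    (13, 1, 13),
    (14, 18, "UNUSED"),
    (32, 95, 32),
    (127, 1, "UNUSED"),
]

mapping = expand_descset(BASIC_DESCSET)
used = sorted(number for number, target in mapping.items() if target is not None)
# used covers 9, 10, 13 and 32 through 126; character 127 is unused
```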
CAPACITY PUBLIC "ISO 8879:1986//CAPACITY Reference//EN"
(capacity set)
contains a public identifier for the reference capacity set
a capacity set contains upper bounds for global run-time capacities the processing system is expected to arrange for, such as the maximal number of entities declared in a document instance
these parameters can also be declared directly in the SGML declaration body; see below for an example
these parameters are ignored by sgmljs.net SGML but are honored and checked against actual use in document prologs by eg. (Open)SP SGML
SCOPE DOCUMENT (concrete syntax scope)
asserts that the document character set is used both in the prolog as well as in content
for all intents and purposes, SCOPE DOCUMENT is always the setting used for document instances processed with sgmljs.net SGML (SCOPE SYNTAX is only of historic interest)
for an explanation of the concept of a syntax character set (as opposed to the document character set), see below
SYNTAX PUBLIC "ISO 8879:1986//SYNTAX Reference//EN" (concrete syntax)
is a reference to the public identifier of the reference concrete syntax, the content of which is explained below
FEATURES MINIMIZE ...
(minimization features)
contains the minimization features asserted to be used by the document instance
note that while SHORTTAG YES
is accepted by sgmljs.net SGML,
the only form of short tag minimization supported by
sgmljs.net SGML is SGML's so-called Null-end tag and NET
-enabling
Start-tag minimization, and only insofar as it is necessary
to support XML-style empty elements
FEATURES LINK ...
(link type features)
contains the link type (LPD) features asserted to be used by the document instance
sgmljs.net SGML supports all link types, but at most 2 simultaneously active implicit link types by default
for an explanation of link type processing, see Templating
FEATURES OTHER ...
(other features)
of the "other" features, sgmljs.net SGML only
supports FORMAL NO
and FORMAL YES
(and WebSGML's
URN YES
as explained with other WebSGML additions)
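The described-set portions discussed above for the document character set can be read as triples of a start number, a count, and either a target number or UNUSED. The following Python sketch only illustrates that reading and is not how sgmljs.net SGML implements it; the names parse_descset and map_char are made up for this example, and the sketch assumes a described set without comments or literal descriptions.

```python
from typing import List, Optional, Tuple

# One described-set portion: (first number, count, target start or None for UNUSED).
Portion = Tuple[int, int, Optional[int]]

def parse_descset(text: str) -> List[Portion]:
    """Parse whitespace-separated 'start count target' triples (hypothetical helper)."""
    tokens = text.split()
    if len(tokens) % 3 != 0:
        raise ValueError("described set must consist of triples")
    portions = []
    for i in range(0, len(tokens), 3):
        start, count, target = tokens[i], tokens[i + 1], tokens[i + 2]
        mapped = None if target == "UNUSED" else int(target)
        portions.append((int(start), int(count), mapped))
    return portions

def map_char(portions: List[Portion], number: int) -> Optional[int]:
    """Return the base-set number for a character number, or None if unused or undescribed."""
    for start, count, target in portions:
        if start <= number < start + count:
            return None if target is None else target + (number - start)
    return None

# Portions as used in the declarations shown in this text.
descset = parse_descset("""
    0 9 UNUSED
    9 2 9
    11 2 UNUSED
    13 1 13
    14 18 UNUSED
    32 95 32
    127 1 UNUSED
""")
```

With these portions, map_char(descset, 10) yields 10, while map_char(descset, 127) yields None because character number 127 is declared UNUSED.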
The concrete syntax fragment (the SYNTAX ...
portion as shown above)
references a public identifier which acts as if it contained the
following code text (which could also be pasted verbatim in place of
the concrete syntax fragment above for the same effect):
SYNTAX
SHUNCHAR
CONTROLS
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
BASESET "ISO 646IRV:1991//CHARSET
International Reference Version (IRV)//ESC 2/8 4/2"
DESCSET 0 128 0
FUNCTION
RE 13
RS 10
SPACE 32
TAB SEPCHAR 9
NAMING LCNMSTRT ""
UCNMSTRT ""
LCNMCHAR ".-_:"
UCNMCHAR ".-_:"
NAMECASE GENERAL YES
ENTITY NO
DELIM GENERAL SGMLREF
SHORTREF SGMLREF
NAMES SGMLREF
QUANTITY SGMLREF
SYNTAX SHUNCHAR ...
contains a list of shunned characters; for
the purpose of this exposition, these are the
same as those marked UNUSED
in the described
syntax character set
the set of shunned characters includes the
IRV/ASCII control characters (CONTROLS
)
SYNTAX BASESET .../DESCSET ...
contains the syntax-reference character set (the character set used to describe the concrete syntax); the general construction of the described set from character ranges in the base set is analogous to that of the document character set
SYNTAX FUNCTION ...
contains assignments of SGML delimiter function roles to characters
SYNTAX NAMING ...
defines the characters accepted for name tokens and the rules for case-folding
NAMECASE GENERAL YES ENTITY NO
activates SGML's
traditional case-folding behaviour (namely that elements,
attributes, and all other name tokens except entity names,
for the purpose of validation and tag inference,
are treated as if specified in uppercase letters,
even if specified in lower- or mixed case in content)
SYNTAX DELIM GENERAL ...
contains assignments of characters to delimiter roles
such as needed for the <
character to be interpreted
as a STAGO
(start-element tag open) delimiter
SGMLREF
selects the standard delimiters (which assign
the STAGO
delimiter as described and expected in most
markup languages)
SYNTAX DELIM SHORTREF
contains assignments of characters to shortref delimiter roles; note that sgmljs.net SGML doesn't support user-definable shortref delimiters
SGMLREF
selects the standard shortref delimiters
SYNTAX NAMES SGMLREF
asserts that the standard reserved keywords (such as DOCTYPE
,
ELEMENT
) are used in markup declarations
only SGMLREF
as specified here is supported in sgmljs.net SGML
SYNTAX QUANTITY ...
contains declarations of upper bounds for certain quantities asserted by a document instance, such as the maximal number of attributes declared on an element
these parameters are ignored by sgmljs.net SGML, but are honored and checked against actual use in document prologs by eg. (Open)SP SGML
WebSGML extends the syntax for the delimiter
section in the SGML declaration (ie. adds the
HCRO
and NESTC
delimiters):
DELIM GENERAL SGMLREF
HCRO "&#x" -- ampersand --
NESTC "/"
NET ">"
...
DELIM GENERAL HCRO
the HCRO
delimiter is used in numeric character references
to indicate that the number portion is interpreted as hexadecimal
rather than decimal literal (and to allow the letters A through F
and a through f to occur in numeric character references);
for example,
&#xA;
represents the U+000A LINE FEED
character
in sgmljs.net SGML, the HCRO
delimiter cannot be redeclared
DELIM GENERAL NESTC
the NESTC
delimiter is introduced to capture XML's empty
element syntax within SGML's definitorial framework with respect
to delimiter characters
while use of the NESTC
delimiter nominally depends on the
definition of NET
delimiter (the null-end tag delimiter),
and changing either the declaration for NESTC
or NET
, or both,
is admitted in SGML in general, sgmljs.net SGML requires
that the delimiter roles for NESTC
and NET
, if assigned
at all, must match those given above in
For all intents and purposes within sgmljs.net SGML, for processing XML-style empty elements (including bogus XML-like empty elements in HTML), the delimiter section should be treated as an opaque string and must have exactly (up to space characters and comments) the form given above.
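Returning to the HCRO delimiter described above, the difference between decimal and hexadecimal numeric character references can be sketched in Python as follows. This is an illustration only, not sgmljs.net's parser: the function name expand_char_refs is hypothetical, and the sketch assumes that every reference is closed by ";" and that the delimiters have exactly the forms "&#" and "&#x".

```python
import re

# "&#" introduces a decimal reference, "&#x" (HCRO) a hexadecimal one;
# this sketch requires the closing ";" in both cases.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def expand_char_refs(text: str) -> str:
    """Replace numeric character references by the characters they denote."""
    def repl(match):
        hex_digits, dec_digits = match.groups()
        number = int(hex_digits, 16) if hex_digits else int(dec_digits, 10)
        return chr(number)
    return _CHAR_REF.sub(repl, text)
```

For example, both expand_char_refs("&#38;") and expand_char_refs("&#x26;") produce the AMPERSAND character.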
WebSGML adds a facility to define character entities without using entity declarations as a means to capture XML's and HTML's behaviour in this respect.
For example, the predefined character entities and their represented character numbers for XML are as follows:
ENTITIES
"amp" 38
"lt" 60
"gt" 62
"quot" 34
"apos" 39
See the XML declaration for XML for a complete example of an SGML declaration making use of predefined character entities.
Note that ISO 8879 Annex K requires that all mapped-to characters are contained in the syntax-reference character set, not just the document character set.
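The effect of predefined character entities such as those listed above can be approximated in Python. The table below repeats the character numbers from the ENTITIES example; the function name expand_predefined and the simplified entity-name pattern are assumptions made for this sketch, which leaves references to unknown entities untouched rather than reporting an error.

```python
import re

# Entity names and character numbers as listed in the ENTITIES example.
PREDEFINED_ENTITIES = {"amp": 38, "lt": 60, "gt": 62, "quot": 34, "apos": 39}

# Simplified pattern for a general entity reference closed by ";".
_ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")

def expand_predefined(text: str) -> str:
    """Replace references to predefined character entities in a single pass."""
    def repl(match):
        number = PREDEFINED_ENTITIES.get(match.group(1))
        # Unknown entities are kept verbatim in this sketch.
        return match.group(0) if number is None else chr(number)
    return _ENTITY_REF.sub(repl, text)
```

Because the replacement text is not scanned again, expand_predefined("&amp;lt;") yields the five characters &lt; rather than a single less-than sign.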
WebSGML's extensions to the FEATURES
section
(only mentioned as far as supported in sgmljs.net SGML)
include
unbundling of SHORTTAG
minimization features,
meaning that certain shorttag minimization features can be
switched on individually, rather than just collectively via
FEATURES MINIMIZE SHORTTAG YES
(which switches on
all shorttag minimization features, among them those
only used in historic shortform practices)
MINIMIZE IMPLYDEF ...
options to allow WebSGML to process
document instances lacking declarations for elements,
attributes, and other components declarable in DTDs (which was
generally not possible prior to the Annex K SGML revision)
A WebSGML FEATURES
declaration portion can look as follows:
FEATURES
MINIMIZE
DATATAG NO
OMITTAG NO
RANK NO
SHORTTAG
STARTTAG
EMPTY NO
UNCLOSED NO
NETENABL IMMEDNET
ENDTAG
EMPTY NO
UNCLOSED NO
ATTRIB
DEFAULT YES
OMITNAME NO
VALUE NO
EMPTYNRM YES
IMPLYDEF
ATTLIST YES
DOCTYPE NO
ELEMENT YES
ENTITY NO
NOTATION YES
LINK
SIMPLE NO
IMPLICIT NO
EXPLICIT NO
OTHER
CONCUR NO
SUBDOC NO
FORMAL NO
URN NO
KEEPRSRE YES
VALIDITY NOASSERT
ENTITIES
REF ANY
INTEGRAL YES
FEATURES MINIMIZE SHORTTAG STARTTAG EMPTY NO
asserts that a document instance doesn't make use of empty start-element tags
a value of YES
enables use of empty start-element tags
as explained in Empty element minimization
FEATURES MINIMIZE SHORTTAG STARTTAG ... UNCLOSED NO
asserts a document instance's use of certain historic shortform syntax
for sgmljs.net SGML, these must be NO
FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET
expresses, together with DELIM GENERAL NESTC
and
DELIM GENERAL NET
as explained above, that
an XML-style empty element is recognized as
a short form of specifying the equivalent
sequence of a start- and an end-element tag
note that, in addition, FEATURES MINIMIZE EMPTYNRM YES
must be declared for being able to use
XML-style empty-elements for elements with
declared content EMPTY
sgmljs.net SGML accepts this setting only
in combination with the settings for
DELIM GENERAL NESTC
and DELIM GENERAL NET
as discussed above
FEATURES MINIMIZE SHORTTAG ENDTAG EMPTY NO
asserts that a document instance doesn't make use of empty end-element tags
a value of YES
enables use of empty end-element tags
as explained in Empty element minimization
FEATURES MINIMIZE SHORTTAG ENDTAG ... UNCLOSED NO
these features (also for supporting historic markup
shortform practices) must both have the value NO
for sgmljs.net SGML
FEATURES MINIMIZE ATTRIB DEFAULT
expresses whether default values can be omitted in attribute specifications (ie. with the expectation that default values as declared in the attribute declarations are implied)
FEATURES MINIMIZE ATTRIB OMITNAME
expresses whether attribute names (and the VI
delimiter)
can be omitted in attribute specifications
(ie. as in using name tokens for enumerated attributes,
provided a name token can be uniquely identified
among those declared on the attributes of an element,
including those declared on #ALL
elements)
FEATURES MINIMIZE ATTRIB VALUE
expresses whether quotation characters (LIT
and LITA
delimiters) can be omitted around attribute values consisting
entirely of name characters, even on undeclared attributes
FEATURES EMPTYNRM YES/NO
expresses whether elements with declared content EMPTY
or implied-EMPTY
elements (those having a content reference
attribute specified) are allowed to have end-element tags
(if YES
)
FEATURES IMPLYDEF ATTLIST YES/NO
expresses that it isn't an error to specify undeclared
attributes (if YES
)
an undeclared attribute is treated as if it were declared
CDATA #IMPLIED
FEATURES IMPLYDEF DOCTYPE YES/NO
expresses that it isn't an error if a document type
declaration is absent from a document instance (if YES
)
if FEATURES IMPLYDEF DOCTYPE YES
is declared, and a
document type declaration is absent, the document instance
is treated as if <!DOCTYPE #IMPLIED SYSTEM>
were present;
the external subset is retrieved by forming a system identifier
from the document element (the first element encountered),
subject to SYNTAX NAMECASE GENERAL
, then
appending .dtd
to it, and interpreting the resulting
string as system identifier relative to the system
identifier of the document instance being processed
note that if FEATURES IMPLYDEF ELEMENT YES
is declared,
then a document type declaration is also allowed to be
absent; but, if in addition, IMPLYDEF DOCTYPE NO
is declared,
an absent document type declaration is treated as if
<!DOCTYPE #IMPLIED>
had been specified (that is,
its external subset is assumed to be empty)
FEATURES IMPLYDEF ELEMENT YES/NO
expresses that it isn't an error to use an undeclared
element (if YES
or ANYOTHER
)
moreover, expresses that it isn't an error if a document type declaration isn't present; see above
FEATURES IMPLYDEF ELEMENT YES
has the effect that
undeclared elements are implied as if declared - O ANY
FEATURES IMPLYDEF ELEMENT ANYOTHER
expresses that,
in addition, directly nesting undeclared elements isn't
intended for the document instance, and has the effect
that an end-element tag (closing the open element)
before a start-element tag is inferred, if the element
beginning with the start-element would otherwise be
treated as direct child content of an element with
the same element name
FEATURES IMPLYDEF ENTITY YES/NO
expresses that an entity reference for an undeclared
entity is treated as if it were declared system-specific
(ie. declared <!ENTITY ... SYSTEM>
)
the data character content of the entity reference,
if it is used in a parameter or general entity
reference (other than a data text entity reference), is
retrieved by interpreting the entity name (subject
to SYNTAX NAMECASE ENTITY
) as system identifier
FEATURES IMPLYDEF NOTATION YES/NO
expresses that it isn't an error if an undeclared notation is used
ignored by sgmljs.net SGML (notations must always be declared)
FEATURES OTHER URN YES/NO
if FEATURES OTHER FORMAL YES
is declared
(and if ISO 8879:1986 (WWW)
is used as minimum data),
only then can FEATURES OTHER URN YES
also be used
FEATURES OTHER URN YES
enables the URL/URN
syntax for public identifiers (as an alternative
to the standard formal public identifier syntax)
FEATURES OTHER KEEPRSRE YES/NO
this has the effect that SGML's traditional behaviour with respect to suppression of newlines and space characters is switched off
SGML's traditional behaviour (simplified) is, in text records consisting entirely of a start-element tag, character data, and an end-element tag for the same element as in the start-element tag, not to report initial space and trailing space characters, including the trailing RE (newline) character, as data characters, when the start- and end-element tag is for a declared (rather than included) element (ie. because such space and newline characters are considered insignificant, and present for markup text formatting purposes only, in such a way that every markup element is started on its own line)
sgmljs.net SGML only supports YES
as the value for
this setting, meaning that all space and newline characters
will always be reported as character data (note that
behaviour for supporting KEEPRSRE NO
isn't specified
for undeclared elements in WebSGML)
FEATURES VALIDITY TYPE/NOASSERT
asserts that the document is considered valid with respect to the notions expressed next, and is used to indicate which level of validation should be performed by sgmljs.net SGML
VALIDITY NOASSERT
means that no content model validation,
but only balancedness-checking (appealing insofar to XML's
wellformedness criteria) is performed; VALIDITY TYPE
means that regular validation and tag inference is
performed
FEATURES OTHER ENTITIES ...
asserts certain characteristics of a document's use of entities appealing to notions introduced with XML
ignored by sgmljs.net SGML
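The derivation of the external subset's system identifier under IMPLYDEF DOCTYPE YES, as described above, can be sketched in Python. This is a hedged illustration rather than the actual resolution code: it assumes that system identifiers behave like URLs, that "subject to SYNTAX NAMECASE GENERAL" means the document element name is folded to uppercase when GENERAL YES is in effect, and the function name implied_dtd_system_id is invented for this example.

```python
from urllib.parse import urljoin

# ASCII-only case folding; Unicode-aware str.upper() is deliberately avoided.
_ASCII_UPPER = str.maketrans("abcdefghijklmnopqrstuvwxyz",
                             "ABCDEFGHIJKLMNOPQRSTUVWXYZ")

def implied_dtd_system_id(document_sysid: str, document_element: str,
                          namecase_general: bool = True) -> str:
    """Form '<document element>.dtd' relative to the document's system identifier."""
    name = document_element
    if namecase_general:
        # Assumption: the name is used in its case-folded (uppercase) form.
        name = name.translate(_ASCII_UPPER)
    return urljoin(document_sysid, name + ".dtd")
```

Under these assumptions, a document at http://example.com/docs/page.sgm whose first element is article would have its external subset looked up at http://example.com/docs/ARTICLE.dtd.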
A character set, in SGML, is a mapping of
character numbers to characters. A character
can be referred to by a character name such as those
used by ISO 10646 (aka. Unicode); for example the
character rendered in this text just here: &
,
can be referred to as the character named
AMPERSAND.
Having a name, a character, in SGML and also ISO 10646, is considered a concept existing independently of its conventional graphical rendering and of its character number in a particular character set. But in order to refer to a character, a character number and thus, a definition of a character set must be established in a context where it's impractical to refer to characters by their names.
Hence, while SGML conceptually defines an abstract syntax as a mapping of characters (rather than character numbers) to markup delimiters and other function roles, an abstract syntax can only be expressed as a mapping from character numbers in a character set to markup delimiters and other character roles.
SGML uses the term syntax-reference character set to refer to the character set used in an SGML declaration body for assigning meaning to characters as SGML markup delimiters and other character function roles. Furthermore SGML uses the term concrete syntax to refer to the larger portion of an SGML declaration which contains the mapping, including the declaration of the syntax-reference character set it is using.
Given a concrete syntax, an SGML parser is supposed to determine the characters represented by input character data (using the document character set), then determine whether the concrete syntax assigns a delimiter or other role to them (depending on context). For the latter, the SGML parser must map a character presented to it in the document character set to the equivalent character in the syntax-reference character set. The SGML declaration itself doesn't contain a mapping between character sets, hence the SGML parser must rely on built-in character set information available to it.
Thus, even if the syntax-reference base character set can theoretically be different from the document base character set (unless the concrete syntax is embedded in the document instance itself, see below), an SGML parser must still be able to establish a mapping for all characters in the document base character set to a character in the syntax-reference base character set.
SGML was originally devised at a time when a generally accepted character set wasn't yet established for referring to characters. Today, of course, the Universal Coded Character Set (UCS, defined in ISO/IEC 10646, and also known as Unicode) is used for this purpose. Since ISO/IEC 10646 contains over 120,000 code points (character numbers), if it is used as a document instance's base character set (which it should be), there just doesn't exist a character set other than UCS itself with the same coverage. For this reason, the distinction between document and syntax-reference character set is irrelevant in practice, but nevertheless requisite knowledge to explain the character set notions in the SGML declaration.
Nevertheless, some of the concepts related to constructing a customized described set by remapping UCS character planes or communicating the purpose of private-use characters can be useful for special applications (ie. precisely because of its coverage, merely specifying UCS as document character set isn't helpful in communicating which character ranges are actually used in a document or required for a particular application, font face or variant, printer or other equipment, vertical, etc.). Note, however, that sgmljs.net SGML doesn't include integrated facilities for checking and/or remapping in regular builds.
For all intents and purposes within sgmljs.net SGML, the Universal Coded Character set is used as a base character set for both the document as well as syntax-reference character set.
In the basic SGML declaration above, the International
Reference Version character set is used (which is the only
character set supported by regular sgmljs.net SGML builds
in addition to UCS). International Reference Version or
IRV is the term used in international specifications to
refer to the ISO/IEC 646 character set known as US-ASCII
(technically, the version referenced by ISO 646:1983
differs from US-ASCII, and from that referenced by
ISO 646IRV:1991
, but not in a way relevant to SGML).
IRV contains the first 128 code points of UCS,
which uses the usual encoding of the US-ASCII character
set into bytes interpreted as binary numbers for its
character numbers.
A character number is different from a representation of a character as a bit pattern within a particular character encoding such as UTF-8 (even though a character number can be algorithmically determined from an UTF-8 representation); rather, character numbers and character sets are purely organizational concepts to identify and otherwise refer to characters in general.
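The distinction between a character number and its encoded representation can be made concrete with a few lines of Python, where ord() returns the UCS code point of a character and str.encode() produces the bytes of a particular encoding. The variable names are chosen for this illustration only.

```python
euro = "\u20ac"                      # EURO SIGN

# The character number (UCS code point) is independent of any encoding.
number = ord(euro)                   # 0x20AC

# The same character has different byte representations per encoding.
utf8_bytes = euro.encode("utf-8")        # three bytes
utf16_bytes = euro.encode("utf-16-be")   # two bytes

# Decoding recovers the character, and with it the character number.
recovered = ord(utf8_bytes.decode("utf-8"))
```

The character number stays 0x20AC throughout; only the byte patterns differ between UTF-8 and UTF-16.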
For being able to read an SGML declaration, of course, an SGML parser must be able to interpret the bytes of an entity according to an encoding of a character set. The character set encoding of a document instance can't be meaningfully stated in its SGML declaration (if it has one), if the SGML declaration is part of the document instance itself (ie. because the SGML declaration must use the same encoding as the document instance, hence the processor still needs additional out of band information with respect to the encoding, else wouldn't be able to read the SGML declaration).
Having to deal with SGML declarations, which are a somewhat archaic, but in any case inconvenient format for conveying processing parameters to an SGML processor, only to find out that such a basic fact about a document instance as its character encoding can't effectively be expressed using it, is considered unfortunate. Moreover, having to resort to out-of-band information such as command line processing options or similar in order to be able to parse a document is considered inadequate for SGML, especially with respect to SGML's attractiveness for archival purposes, where it is deemed desirable to manifest a document character encoding.
Within established SGML technology, there are the following plausible mechanisms to inform the SGML parser about the character encoding used by a document instance and of bootstrapping an SGML parser into applying a desired character decoding:
SGML itself normatively references ISO 2022 code switching techniques as code extension facility; using this mechanism allows an SGML processor to start out in a mode where it accepts only IRV/ASCII characters, and then (virtually) "switches" into the desired mode of accepting eg. a UTF-8 encoding of UCS, based on the designating sequence of the public identifier of the document's base character set (subject to the ISO 8879 provisions with respect to delimiter recognition, this can be extended to other multi-byte encodings as well)
using a wrapper document instance that refers to a main
document instance via an entity reference; the entity
is declared as an external entity using a formal
system identifier which admits additional metadata such as
character encoding and similar; (eg. cf. the bctf
parameter of ISO 10744 extended facilities eg. FSIDR);
this technique can be used with eg. (Open)SP, but
isn't supported with sgmljs.net SGML right now
even though sgmljs.net SGML supports ISO 10744 FSIDR
in general
using an SGML catalog, which can associate an SGML declaration to a document instance without having to place an SGML declaration or declaration reference in a document instance
sgmljs.net SGML only supports the first mechanism of the discussed techniques, and only for UTF-8; the alternatives are discussed for information only.
Note that since sgmljs.net SGML uses ISO 2022 to learn about the character encoding of a document, the listing of supported character sets given below includes designating sequences which represent UTF-8 character encodings.
Note that changing the character encoding has no bearing on the character set, ie. on the character numbers used in numeric character references; this is apparent eg. with HTML, which, even when served over HTTP with ISO-8859 encoding (which used to be the standard encoding before HTML5), can contain numeric character references that are still interpreted as UCS code points.
For an overview of ISO 2022, please refer to ECMA-35 Character Code Structure and Extension Techniques, which is identical to ISO/IEC 2022:1994 and made available by ECMA International for public access.
The following public identifiers are recognized as character sets by sgmljs.net SGML:
ISO 646:1983//CHARSET
International Reference Version (IRV)//ESC 2/5 4/0
ISO 646IRV:1991//CHARSET
International Reference Version (IRV)//ESC 2/8 4/2
ISO Registration Number 177//CHARSET
ISO/IEC 10646:2003 UTF-8 Level 1//ESC 2/5 4/7
ISO Registration Number 177//CHARSET
ISO/IEC 10646:2003 UTF-8 Level 2//ESC 2/5 4/8
ISO Registration Number 177//CHARSET
ISO/IEC 10646:2003 UTF-8 Level 3//ESC 2/5 4/9
ISO Registration Number 177//CHARSET
ISO/IEC 10646:2003 UCS with implementation Level 3//ESC 2/5 2/15 4/6
ISO Registration Number 177//CHARSET
ISO/IEC 10646-1:1993 UCS-4 with implementation level 3//ESC 2/5 2/15 4/6
Note that even though UCS-4, as used in the last public identifier/designation sequence in the list, denotes an alternate UCS encoding, this particular public identifier is interpreted to denote just the UCS character set, and acts exactly the same as the UTF-8 designation sequences.
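The designating sequences at the end of these public identifiers use the column/row notation of ISO 2022 (ECMA-35), in which a position such as 2/5 denotes the byte with value 2 × 16 + 5, and ESC denotes the escape character 0x1B. The following Python sketch converts the notation into the corresponding bytes; the function names are hypothetical, and the sketch assumes the designating sequence is the text after the last "//" of the public identifier.

```python
def designating_sequence(notation: str) -> bytes:
    """Convert column/row notation such as 'ESC 2/5 4/7' into bytes."""
    out = bytearray()
    for token in notation.split():
        if token == "ESC":
            out.append(0x1B)
        else:
            column, row = token.split("/")
            out.append(int(column) * 16 + int(row))
    return bytes(out)

def designating_sequence_of(public_id: str) -> bytes:
    """Extract and convert the last '//'-separated field of a public identifier."""
    return designating_sequence(public_id.rsplit("//", 1)[1])

utf8_level1 = designating_sequence_of(
    "ISO Registration Number 177//CHARSET "
    "ISO/IEC 10646:2003 UTF-8 Level 1//ESC 2/5 4/7"
)
```

Here utf8_level1 is the three-byte sequence ESC % G, and "ESC 2/8 4/2" from the IRV identifier corresponds to ESC ( B.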
The document character set is the term used by SGML to refer to the character set used by a document instance.
The syntax-reference character set is the character set used for an SGML concrete syntax declaration. As shown in the basic declaration for SGML, the concrete syntax fragment can conceptually (but not actually) be stored in another entity and then referenced from the SGML declaration.
As also discussed in the introduction, a concrete syntax hence needs its own character set definition, independent of the document character set used by a document instance referencing the concrete syntax.
If a concrete syntax definition isn't referenced via a public identifier, but is presented embedded in the SGML declaration code text of the document itself, then it of course must be using the same character set as the document character set of the document which it is part of.
For all intents and purposes, a character number as used in sgmljs.net SGML is a single UCS (ISO 10646 or, equivalently, Unicode) code point, independently of the document encoding (such as UTF-8) being used. Apart from character numbers in the SGML declaration, UCS code points are also used in character entity references in a document instance.
The SGML declaration code text itself is always using (just) the IRV/ASCII character set, and when referring to a character number, is using either a character literal (when the character number/code point is contained in IRV/ASCII and is a graphic character) or, alternatively, a character number (when it is not, or when the author chooses to use a number rather than a literal for specifying it).
In SGML, the definition of permitted characters for names and name tokens
of generic identifiers of elements, attributes, and notations, and,
values of attributes with declared value ENTITY
, ID
,
IDREF
, NAME
, NMTOKEN
, or NOTATION
, and the attributes
with declared value ENTITIES
, IDREFS
, NAMES
, or
NMTOKENS
for specifying multiple space-separated name tokens,
and
entity names
is controlled uniformly in the NAMING
section of the used
SGML declaration, meaning the declaration is applicable for all
these names and name-like constructs at once.
SGML distinguishes name start characters, which can appear as the first character of a name token, from name characters, which can appear anywhere in a name token. Specifically, the digits can't start a name token. By default, unless more characters are added to the set of name characters or name start characters, respectively, as explained next, the upper and lowercase IRV letters are accepted as name start characters, while the digits are accepted in addition as name characters.
Name tokens can be normalized into an uppercase form
for the purpose of validation and tag inference (and
output, if any), provided that the mapping can be specified
for each character or character range individually (eg. rather
than by reference to Unicode case conversion procedures),
using the LCNMSTRT
, UCNMSTRT
, LCNMCHAR
, and
UCNMCHAR
parameters.
These parameters each contain either a quoted parameter literal containing characters (as character literals), or a space-separated list of character numbers or character ranges, and have the following meaning:
LCNMSTRT
(lowercase name start characters)
describes lowercase characters used as name start characters in addition to the IRV lowercase and uppercase letters
UCNMSTRT
(uppercase name start characters)
describes exactly as many characters as LCNMSTRT
,
and contains the uppercase letter for the corresponding
lowercase letter in LCNMSTRT
at the same position
LCNMCHAR
(lowercase name characters)
describes lowercase characters used as name characters
in addition to the lowercase name start characters in
LCNMSTRT
UCNMCHAR
(uppercase name characters)
describes exactly as many characters as LCNMCHAR
and contains the uppercase letter for the corresponding
lowercase letter in LCNMCHAR
at the same position
The SGML declaration admits case folding/canonicalization to be switched on for these two groups of name tokens individually
entities (SYNTAX NAMECASE ENTITY YES/NO
)
and for all other name token uses (SYNTAX NAMECASE GENERAL YES/NO
)
but not for more granular subsets of the other name tokens.
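How the four naming parameters and NAMECASE interact can be sketched in Python. The sketch below is a simplified model, not sgmljs.net's implementation: the function names make_naming, is_name, and fold_name are made up, parameters are given as plain strings of characters, and the IRV letters and digits are always included as described above.

```python
import string

def make_naming(lcnmstrt="", ucnmstrt="", lcnmchar=".-_:", ucnmchar=".-_:"):
    """Build name start set, name character set, and case-folding table."""
    if len(lcnmstrt) != len(ucnmstrt) or len(lcnmchar) != len(ucnmchar):
        raise ValueError("lowercase and uppercase parameters must pair up")
    start = set(string.ascii_letters) | set(lcnmstrt) | set(ucnmstrt)
    chars = start | set(string.digits) | set(lcnmchar) | set(ucnmchar)
    # Each lowercase character folds to the uppercase one at the same position.
    fold = str.maketrans(string.ascii_lowercase + lcnmstrt + lcnmchar,
                         string.ascii_uppercase + ucnmstrt + ucnmchar)
    return start, chars, fold

def is_name(name, start, chars):
    """A name begins with a name start character, followed by name characters."""
    return bool(name) and name[0] in start and all(c in chars for c in name)

def fold_name(name, fold, namecase=True):
    """Apply case folding only if the relevant NAMECASE setting is YES."""
    return name.translate(fold) if namecase else name

start, chars, fold = make_naming()
```

With the defaults shown, is_name("xml:lang", start, chars) holds and fold_name("xml:lang", fold) gives XML:LANG, while passing namecase=False (as for entity names under NAMECASE ENTITY NO) leaves the name unchanged.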
When extended naming rules are used, as indicated
by the "ISO 8879:1986 (ENR)"
(or the
"ISO 8879:1986 (WWW)"
) minimum literal/data,
the naming section of an SGML declaration can contain
the additional NAMESTRT
and NAMECHAR
parameters.
Moreover, extended naming rules enable character ranges to be used with naming parameters, rather than just lists of individual character numbers.
A naming section making use of extended naming rules can look as follows:
NAMING LCNMSTRT ""
UCNMSTRT ""
NAMESTRT ""
LCNMCHAR ""
UCNMCHAR ""
NAMECHAR ".-_:"
The effect of using NAMESTRT
and NAMECHAR
, respectively,
is that the declaration is treated as if the value for NAMESTRT
had been used in both LCNMSTRT
and UCNMSTRT
; likewise, the
NAMECHAR
value is interpreted as if the parameter literal
had been used in both LCNMCHAR
and UCNMCHAR
.
When using extended naming, the literals for the LCNMSTRT
,
UCNMSTRT
, LCNMCHAR
, and UCNMCHAR
parameters are left
empty in the SGML declaration.
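The stated effect of NAMESTRT and NAMECHAR can be expressed as a small Python function that expands them into the four traditional parameters. The dictionary representation and the name expand_extended_naming are assumptions made for this sketch; character ranges, which extended naming rules also permit, are not modelled.

```python
def expand_extended_naming(params: dict) -> dict:
    """Treat NAMESTRT/NAMECHAR as if given in both the LC and UC parameters."""
    namestrt = params.get("NAMESTRT", "")
    namechar = params.get("NAMECHAR", "")
    return {
        "LCNMSTRT": params.get("LCNMSTRT", "") + namestrt,
        "UCNMSTRT": params.get("UCNMSTRT", "") + namestrt,
        "LCNMCHAR": params.get("LCNMCHAR", "") + namechar,
        "UCNMCHAR": params.get("UCNMCHAR", "") + namechar,
    }

# The naming section shown above, with empty LC/UC literals.
expanded = expand_extended_naming({
    "LCNMSTRT": "", "UCNMSTRT": "", "NAMESTRT": "",
    "LCNMCHAR": "", "UCNMCHAR": "", "NAMECHAR": ".-_:",
})
```

For the naming section shown above, the result has ".-_:" in both LCNMCHAR and UCNMCHAR, so these characters fold to themselves.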
Note that SGML has a built-in preference for the uppercase
form of characters if NAMECASE GENERAL YES
is applied, in that
the lowercase and uppercase letters are always considered both name and name start characters (cf. ISO 8879 Clause 189); ie. these cannot be excluded from the set of admissible characters for name tokens at all
likewise, the definition of a larger character set for name tokens
versus those in the SGML reference concrete syntax
and the associated lowercase-to-uppercase mapping rules
afforded by the LCNMSTRT
, UCNMSTRT
, LCNMCHAR
, and UCNMCHAR
SGML declaration parameters (and NAMESTRT
/NAMECHAR
introduced
with the extended naming rules according to ISO 8879 Annex J)
can only contain characters in addition to the IRV/ASCII
letters and digits; in particular, for the letters in IRV/ASCII,
no customized uppercase letter can be mapped; this is enshrined in
ISO 8879 Clause 198, 22 which reads "A character assigned to
LCNMCHAR, UCNMCHAR, LCNMSTRT, or UCNMSTRT cannot be an LC Letter,
UC Letter, Digit, RE, RS, SPACE, or SEPCHAR"; consequently,
a lowercase IRV/ASCII letter is always case-folded with
built-in SGML rules when NAMECASE GENERAL YES
is effective
SGML's uppercase bias isn't affected by
ISO 8879 Annex J,
which only alters the rule for the NAMING
production so as
to allow character ranges instead of just single character
specifications, and also adds NAMESTRT
and NAMECHAR
as
a short-form, but not essential form of specifying the
values of LCNMSTRT
, UCNMSTRT
, LCNMCHAR
, and UCNMCHAR
when
the upper- and lowercase variants are identical.
Note that uppercase is only conceptually the preferred form, ie. for the purpose of defining SGML validation in the specification text. SGML applications are free to output or otherwise convey markup as they see fit. ISO 8879 doesn't put any constraints on these, nor defines a canonical SGML processing application or API apart from defining the line-oriented ESIS SGML representation used by ISO 8879's test case suite (the Grove in-memory representation of SGML, which in many ways is the predecessor to W3C's DOM API, isn't part of ISO 8879).
Hence, whether uppercase or lowercase is used internally by an
SGML processor, or whether the processor makes this
distinction at all, doesn't have any consequences for
the externally observable behaviour of SGML applications
as far as ISO 8879 is concerned. For example, sgmljs.net
SGML has an option to output HTML markup in lowercase form,
while being able to process SGML with NAMECASE GENERAL YES
without restrictions.
The SGML declaration for HTML5, applied by sgmljs.net SGML by
default when eg. processing .html
files or an
entity with text/html
media type fetched via HTTP,
is explained in detail in the HTML5.1 DTD reference.
As an example for a plausible SGML declaration,
the following start of an HTML document contains
a variant for the (historic) SGML declaration
for HTML 4.0. It differs from the
official SGML declaration of HTML 4.01
only in its use of the extended WebSGML FEATURES
declaration syntax to match
actual HTML usage.
Note that the declaration as shown here doesn't declare HTML predefined entities for space reasons, and thus can't be used for HTML content containing entity references; the variant of the SGML declaration for HTML5 for use with the full W3C HTML5.2 DTD does contain these and other declarations, though.
<!SGML "ISO 8879:1986 (WWW)"
-- based on the SGML declaration for HTML 4.01 --
CHARSET
BASESET "ISO Registration Number 177//CHARSET
ISO/IEC 10646-1:1993 UCS-4 with
implementation level 3//ESC 2/5 2/15 4/6"
DESCSET 0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
128 32 UNUSED
160 55136 160
55296 2048 UNUSED -- SURROGATES --
57344 1056768 57344
CAPACITY SGMLREF
TOTALCAP 150000
GRPCAP 150000
ENTCAP 150000
SCOPE DOCUMENT
SYNTAX
SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 127
BASESET "ISO 646IRV:1991//CHARSET
International Reference Version
(IRV)//ESC 2/8 4/2"
DESCSET 0 128 0
FUNCTION
RE 13
RS 10
SPACE 32
TAB SEPCHAR 9
NAMING LCNMSTRT ""
UCNMSTRT ""
LCNMCHAR ".-_:"
UCNMCHAR ".-_:"
NAMECASE GENERAL YES
ENTITY NO
DELIM GENERAL SGMLREF
HCRO "&#x" -- ampersand --
NESTC "/"
NET ">"
SHORTREF SGMLREF
NAMES SGMLREF
QUANTITY SGMLREF
ATTCNT 120 -- increased for HTML 5 --
ATTSPLEN 65536 -- These are the largest values --
LITLEN 65536 -- permitted in the declaration --
NAMELEN 65536 -- Avoid fixed limits in actual --
PILEN 65536 -- implementations of HTML UA's --
TAGLVL 100
TAGLEN 65536
GRPGTCNT 150
GRPCNT 150 -- increased for HTML 5 --
FEATURES
MINIMIZE DATATAG NO
OMITTAG YES
RANK NO
SHORTTAG
STARTTAG EMPTY NO
UNCLOSED NO
NETENABL IMMEDNET
ENDTAG EMPTY NO
UNCLOSED NO
ATTRIB DEFAULT YES
OMITNAME YES
VALUE YES
EMPTYNRM YES
IMPLYDEF ATTLIST YES
DOCTYPE NO
ELEMENT YES
ENTITY NO
NOTATION NO
LINK
SIMPLE NO
IMPLICIT NO
EXPLICIT NO
OTHER
CONCUR NO
SUBDOC NO
FORMAL NO
URN NO
KEEPRSRE YES
VALIDITY NOASSERT
ENTITIES
REF ANY
INTEGRAL NO
APPINFO NONE
>
<!DOCTYPE HTML>
...
CHARSET ...
the declaration uses the same CHARSET BASESET character set declaration as the SGML declaration for HTML 4.01; the declaration admits most Unicode characters; in practice, any valid UTF-8 byte sequence in content or attribute values is admitted, but note sgmljs.net doesn't enforce this and will admit any byte in content or attributes, whether it forms part of a valid UTF-8 byte sequence or not (except those having a special delimiter role in SGML such as the < character)
SYNTAX ... BASESET ... DESCSET ...
the declaration restricts generic identifiers (used for element, attribute, notation, declaration set, and entity names) to the IRV (ASCII) characters A through Z, a through z, and the decimal digits; in addition, the characters . (dot), - (hyphen), _ (underscore), and : (colon) are accepted as the second and subsequent characters, but not as the first character of generic identifiers (note, however, that the SGML processor doesn't enforce these rules)
FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET, FEATURES MINIMIZE EMPTYNRM YES
the declaration is suited for inclusion of SVG and/or MathML as HTML 5 "foreign elements"; specifically, XML-style empty elements are accepted in SVG and MathML; moreover, HTML 5 "self-closing" tags are accepted in HTML content as well (irrespective of whether those are declared "void" elements in the HTML 5 spec); also, the NESTC and NET delimiters have been declared as appropriate for XML; note that XML predefined entities are not declared
IMPLYDEF ELEMENT YES
undeclared elements will be accepted and treated as if they were declared <!ELEMENT elmt - - ANY>; according to this setting, only contents of elements which are declared in a DTD will be validated (but see FEATURES OTHER VALIDITY NOASSERT, which effectively switches off any content model validation)
QUANTITY SGMLREF ...
quantities have been adapted so that (Open)SP SGML processing tools can process DTDs for HTML 5; quantity declarations are not required for sgmljs.net SGML
FEATURES OTHER KEEPRSRE YES
newline and carriage return characters will be preserved in content
(rather than being interpreted according to SGML rules for omissible
whitespace); note that sgmljs.net doesn't support another setting
for KEEPRSRE
FEATURES OTHER VALIDITY NOASSERT
no content model or attribute validation is performed; only balancedness of start-element and end-element tags is checked, including checks for elements with declared content EMPTY, which may or may not have an end-element tag or a "self-closing" start-element tag
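The CHARSET DESCSET entries of the HTML declaration above are triples of start code point, count, and either a target code point or UNUSED. As a cross-check of the arithmetic, the following Python sketch (with hypothetical helper names check_contiguous and is_declared) confirms that the ranges are contiguous and end at code point 1114111 (hex 10FFFF), and tests whether a given code point is declared. It only models the declaration text; as noted above, sgmljs.net itself doesn't enforce these ranges.

```python
# Character description triples (start, count, target) copied from the
# CHARSET DESCSET of the HTML declaration above; None stands for UNUSED.
HTML_DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 32, None),
    (160, 55136, 160),
    (55296, 2048, None),       # surrogates
    (57344, 1056768, 57344),
]

def check_contiguous(descset):
    """Verify that each range starts where the previous one ended and
    return the first code point past the last range."""
    expected = 0
    for start, count, _ in descset:
        if start != expected:
            raise ValueError(f"gap or overlap at {start}, expected {expected}")
        expected = start + count
    return expected

def is_declared(code_point, descset=HTML_DESCSET):
    """True if the code point falls into a range that isn't UNUSED."""
    for start, count, target in descset:
        if start <= code_point < start + count:
            return target is not None
    return False

print(hex(check_contiguous(HTML_DESCSET)))          # 0x110000
print(is_declared(ord("A")), is_declared(0xD800))   # True False
```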
The following declaration is applied by default if a file being processed has an .xml suffix, or begins with an XML declaration, or begins with this declaration.
The XML 1.0 Fifth Edition and XML 1.1 specification revisions have extended the set of admissible characters in name tokens to cover almost all UCS code points, hence the declaration text for these revisions can be much shorter. However, these newer XML specifications are widely considered not representative of actual XML usage, and no official ISO/IEC 8879 SGML declarations for these newer XML versions have been released yet.
For interoperability, only the official SGML declaration for XML 1.0, used exactly as given here (up to whitespace and comments), is supported; use of variant declarations is strongly discouraged until a new official or at least generally accepted SGML declaration for XML is established.
<!SGML "ISO 8879:1986 (WWW)"
-- SGML Declaration for XML 1.0 --
-- from:
Final text of revised Web SGML Adaptations Annex (TC2) to ISO 8879:1986
ISO/IEC JTC1/SC34 N0029: 1998-12-06
Annex L.2 (informative): SGML Declaration for XML
changes made to accommodate validation are noted with 'VALID:'
--
CHARSET
BASESET "ISO Registration Number 177//CHARSET
ISO/IEC 10646-1:1993 UCS-4 with implementation
level 3//ESC 2/5 2/15 4/6"
DESCSET
0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
128 32 UNUSED
160 55136 160
55296 2048 UNUSED -- surrogates --
57344 8190 57344
65534 2 UNUSED -- FFFE and FFFF --
65536 1048576 65536
CAPACITY NONE -- Capacities are not restricted in XML --
SCOPE DOCUMENT
SYNTAX
SHUNCHAR NONE
BASESET "ISO Registration Number 177//CHARSET
ISO/IEC 10646-1:1993 UCS-4 with implementation
level 3//ESC 2/5 2/15 4/6"
DESCSET
0 1114112 0
FUNCTION
RE 13
RS 10
SPACE 32
TAB SEPCHAR 9
NAMING
LCNMSTRT ""
UCNMSTRT ""
NAMESTRT
58 95 192-214 216-246 248-305 308-318 321-328
330-382 384-451 461-496 500-501 506-535 592-680
699-705 902 904-906 908 910-929 931-974 976-982
986 988 990 992 994-1011 1025-1036 1038-1103
1105-1116 1118-1153 1168-1220 1223-1224
1227-1228 1232-1259 1262-1269 1272-1273
1329-1366 1369 1377-1414 1488-1514 1520-1522
1569-1594 1601-1610 1649-1719 1722-1726
1728-1742 1744-1747 1749 1765-1766 2309-2361
2365 2392-2401 2437-2444 2447-2448 2451-2472
2474-2480 2482 2486-2489 2524-2525 2527-2529
2544-2545 2565-2570 2575-2576 2579-2600
2602-2608 2610-2611 2613-2614 2616-2617
2649-2652 2654 2674-2676 2693-2699 2701
2703-2705 2707-2728 2730-2736 2738-2739
2741-2745 2749 2784 2821-2828 2831-2832
2835-2856 2858-2864 2866-2867 2870-2873 2877
2908-2909 2911-2913 2949-2954 2958-2960
2962-2965 2969-2970 2972 2974-2975 2979-2980
2984-2986 2990-2997 2999-3001 3077-3084
3086-3088 3090-3112 3114-3123 3125-3129
3168-3169 3205-3212 3214-3216 3218-3240
3242-3251 3253-3257 3294 3296-3297 3333-3340
3342-3344 3346-3368 3370-3385 3424-3425
3585-3630 3632 3634-3635 3648-3653 3713-3714
3716 3719-3720 3722 3725 3732-3735 3737-3743
3745-3747 3749 3751 3754-3755 3757-3758 3760
3762-3763 3773 3776-3780 3904-3911 3913-3945
4256-4293 4304-4342 4352 4354-4355 4357-4359
4361 4363-4364 4366-4370 4412 4414 4416 4428
4430 4432 4436-4437 4441 4447-4449 4451 4453
4455 4457 4461-4462 4466-4467 4469 4510 4520
4523 4526-4527 4535-4536 4538 4540-4546 4587
4592 4601 7680-7835 7840-7929 7936-7957
7960-7965 7968-8005 8008-8013 8016-8023 8025
8027 8029 8031-8061 8064-8116 8118-8124 8126
8130-8132 8134-8140 8144-8147 8150-8155
8160-8172 8178-8180 8182-8188 8486 8490-8491
8494 8576-8578 12295 12321-12329 12353-12436
12449-12538 12549-12588 19968-40869 44032-55203
LCNMCHAR ""
UCNMCHAR ""
NAMECHAR
45-46 183 720-721 768-837 864-865 903 1155-1158
1425-1441 1443-1465 1467-1469 1471 1473-1474
1476 1600 1611-1618 1632-1641 1648 1750-1764
1767-1768 1770-1773 1776-1785 2305-2307 2364
2366-2381 2385-2388 2402-2403 2406-2415
2433-2435 2492 2494-2500 2503-2504 2507-2509
2519 2530-2531 2534-2543 2562 2620 2622-2626
2631-2632 2635-2637 2662-2673 2689-2691 2748
2750-2757 2759-2761 2763-2765 2790-2799
2817-2819 2876 2878-2883 2887-2888 2891-2893
2902-2903 2918-2927 2946-2947 3006-3010
3014-3016 3018-3021 3031 3047-3055 3073-3075
3134-3140 3142-3144 3146-3149 3157-3158
3174-3183 3202-3203 3262-3268 3270-3272
3274-3277 3285-3286 3302-3311 3330-3331
3390-3395 3398-3400 3402-3405 3415 3430-3439
3633 3636-3642 3654-3662 3664-3673 3761
3764-3769 3771-3772 3782 3784-3789 3792-3801
3864-3865 3872-3881 3893 3895 3897 3902-3903
3953-3972 3974-3979 3984-3989 3991 3993-4013
4017-4023 4025 8400-8412 8417 12293 12330-12335
12337-12341 12441-12442 12445-12446 12540-12542
NAMECASE
GENERAL NO
ENTITY NO
DELIM
GENERAL SGMLREF
HCRO "&#x"
-- Ampersand followed by "#x" (without quotes) --
NESTC "/"
NET ">"
PIC "?>"
SHORTREF NONE
NAMES
SGMLREF
QUANTITY
NONE -- Quantities are not restricted in XML --
ENTITIES
"amp" 38
"lt" 60
"gt" 62
"quot" 34
"apos" 39
FEATURES
MINIMIZE
DATATAG NO
OMITTAG NO
RANK NO
SHORTTAG
STARTTAG
EMPTY NO
UNCLOSED NO
NETENABL IMMEDNET
ENDTAG
EMPTY NO
UNCLOSED NO
ATTRIB
DEFAULT YES
OMITNAME NO
VALUE NO
EMPTYNRM YES
IMPLYDEF
ATTLIST YES
DOCTYPE NO
ELEMENT YES
ENTITY NO
NOTATION YES
LINK
SIMPLE NO
IMPLICIT NO
EXPLICIT NO
OTHER
CONCUR NO
SUBDOC NO
FORMAL NO
URN NO
KEEPRSRE YES
VALIDITY NOASSERT
ENTITIES
REF ANY
INTEGRAL YES
APPINFO NONE
SEEALSO "ISO 8879//NOTATION Extensible Markup Language (XML) 1.0//EN"
>
<!DOCTYPE ...>
...
BASESET
see notes above
SYNTAX ... BASESET ... DESCSET ...
irrespective of the range restrictions expressed in the declaration, the processor admits all valid XML 1.0 Fifth Edition (or XML 1.1) generic identifiers
This declaration also has the following notable settings:
SYNTAX NAMECASE GENERAL NO
SYNTAX NAMECASE ENTITY NO
FEATURES MINIMIZE OMITTAG NO
FEATURES MINIMIZE RANK NO
FEATURES MINIMIZE IMPLYDEF DOCTYPE NO
FEATURES MINIMIZE IMPLYDEF ELEMENT YES
FEATURES MINIMIZE IMPLYDEF ATTLIST YES
FEATURES MINIMIZE IMPLYDEF ENTITY NO
FEATURES MINIMIZE SHORTTAG ATTRIB OMITNAME NO
FEATURES MINIMIZE SHORTTAG ATTRIB VALUE NO
FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET
FEATURES OTHER VALIDITY NOASSERT
FEATURES OTHER KEEPRSRE YES
added requirements (SEEALSO) ISO 8879//NOTATION Extensible Markup Language (XML) 1.0//EN
If processing a file with suffix .sgm, a declaration with the following settings is applied:
SYNTAX NAMECASE GENERAL YES
SYNTAX NAMECASE ENTITY NO
FEATURES MINIMIZE RANK YES
FEATURES MINIMIZE OMITTAG NO
FEATURES MINIMIZE IMPLYDEF DOCTYPE NO
FEATURES MINIMIZE IMPLYDEF ELEMENT YES
FEATURES MINIMIZE IMPLYDEF ATTLIST YES
FEATURES MINIMIZE IMPLYDEF ENTITY NO
FEATURES MINIMIZE EMPTYNRM YES
FEATURES MINIMIZE SHORTTAG ATTRIB DEFAULT YES
FEATURES MINIMIZE SHORTTAG ATTRIB OMITNAME NO
FEATURES MINIMIZE SHORTTAG STARTTAG NETENABL IMMEDNET
FEATURES OTHER VALIDITY NOASSERT
FEATURES OTHER KEEPRSRE YES
FEATURES OTHER FORMAL YES
FEATURES OTHER URN YES
In addition, the default SGML declaration has the following link processing related settings:
LINK
SIMPLE YES 99
IMPLICIT YES
EXPLICIT YES 2
These settings enable link processing and templating.
If processing a file with suffix .md, a declaration with the same settings as for an .sgm file is applied. In addition,
predefined entities for HTML are active
the strings <file:, <http:, <mailto:, <:, #, ##, ###, and others are declared as SHORTREF delimiters (note sgmljs.net doesn't support custom SHORTREF declarations, but the presentation of markdown as a SHORTREF application nominally requires these declarations, even though versions of (Open)SP SGML don't enforce their presence when declaring shortref maps)
Note that, as with .sgm files, validation isn't enabled (left at its default of NOASSERT).
Using a public declaration reference such as the following
<!SGML MARKDOWN PUBLIC "+//IDN sgml.net//SD Markdown//EN">
in place of a full SGML declaration (where the MARKDOWN declaration set name, but not the +//IDN sgml.net//SD Markdown//EN public identifier, can be chosen arbitrarily) enables markdown processing for any processed file, not just those ending in .md.
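The default-declaration rules described in this section can be summarized in a short Python sketch. This is not sgmljs.net's implementation: the function name default_declaration, the profile labels, and the order in which the checks are applied are assumptions made for this illustration. The case of a file beginning with the full SGML declaration for XML is left out for brevity.

```python
import os

# Public identifier of the markdown declaration reference quoted above.
MARKDOWN_PUBID = "+//IDN sgml.net//SD Markdown//EN"

def default_declaration(filename, head="", media_type=None):
    """Pick a default declaration profile from the file suffix, the media
    type, and the first characters of the document (`head`).

    The profile labels and the order of the checks are hypothetical."""
    suffix = os.path.splitext(filename)[1].lower()
    start = head.lstrip()
    # A public declaration reference enables markdown for any file.
    if start.startswith("<!SGML") and MARKDOWN_PUBID in start.split(">", 1)[0]:
        return "markdown"
    if suffix == ".md":
        return "markdown"
    # An .xml suffix or a leading XML declaration selects the XML profile.
    if suffix == ".xml" or head.startswith("<?xml"):
        return "xml"
    if suffix == ".html" or media_type == "text/html":
        return "html5"
    if suffix == ".sgm":
        return "sgml-default"
    return None  # case not covered by the text above

print(default_declaration("index.html"))   # html5
print(default_declaration("README.md"))    # markdown
```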