SGML

API Reference

Container and parsing/manipulation routines for properties representing SGML/WebSGML declaration options.

This module consists of the following parts:

  • transcription of the ISO 8879 SGML declaration grammar with Annex K overrides and additions into a regular grammar, using C preprocessor and POSIX regexp syntax for expansion into a regexp constant (sgmldecl-regexpes.js)

  • predicates and functions for syntax-checking a string containing an SGML declaration against the above grammar, and for matching/classifying against HTML/XML/SGML standard declaration syntax subgrammars

  • member fields representing (significant) SGML declaration properties; only those declaration properties are represented which can be meaningfully changed; SGML properties that can't be changed are hard-coded in (variants of) grammar rules

The function initialize_from_arguments() unconditionally overrides any preexisting properties, whereas initialize_defaults(), initialize_for_xml10(), and initialize_from_decl() set a property only if it hasn't been set before.
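
The difference between the two initialization styles can be illustrated with a minimal sketch. The SettingsSketch class, its setIfUnset helper, and the default values are hypothetical stand-ins and not the module's actual implementation; only the two method names mirror this reference.

```javascript
// Hypothetical stand-in for Sgmldecl; only the override semantics are modelled.
class SettingsSketch {
  // Sets a property only if it has not been set before.
  setIfUnset(name, value) {
    if (this[name] === undefined) this[name] = value;
  }

  // Unconditionally overrides any preexisting properties.
  initialize_from_arguments(args) {
    for (const [name, value] of Object.entries(args)) this[name] = value;
  }

  // Fills in defaults without touching properties that are already set.
  // The default values used here are illustrative assumptions.
  initialize_defaults() {
    this.setIfUnset("features_minimize_omittag", "NO");
    this.setIfUnset("syntax_namecase_general", "NO");
  }
}

const decl = new SettingsSketch();
decl.initialize_from_arguments({ features_minimize_omittag: "YES" });
decl.initialize_defaults();
console.log(decl.features_minimize_omittag); // value from the arguments is kept
console.log(decl.syntax_namecase_general);   // filled in from the defaults
```

Calling the two functions in the opposite order would let the arguments win as well, because initialize_from_arguments() overrides whatever is already present.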

The following preprocessor definitions influence code generation:

SGMLDECL_PARSING_SUPPORT : enables general support for SGML declaration parsing (rather than just declaration reference parsing)

XML_SGMLDECL_PARSING_SUPPORT : enables extra regexp checks for the SGML declaration for XML 1.0 (as of WebSGML Annex K). When this isn't defined, parsing just checks the "added requirements" portion of the SGML declaration for the presence of "ISO 8879//NOTATION Extensible Markup Language (XML) 1.0//EN" and, on success, enables fixed XML settings irrespective of whatever else the SGML declaration contains, i.e. other parts of the SGML declaration are ignored. This is switchable via a macro definition so that the excessively long regexp matching the XML SGML declaration doesn't have to be split into smaller parts for mawk and other awks with regexp/automaton size limitations (as is done in the parse() and generate_error_message() functions for the regexps for general syntax and supportedness checks).

SGMLDECL_FEATURES_MINIMIZE_IMPLYDEF_ELEMENT_ANYOTHER : enable support of validation of amply-tagged documents without element declarations

Constructor

new Sgmldecl(errorhandler, locator)

Parameters

Name Type Description
errorhandler Errorhandler
locator Locator

Name Description
added_requirement_public_ids Contains zero or more public identifiers representing "added requirements" as per ISO 8879 Annex K.
features_minimize_emptynrm WebSGML EMPTYNRM option allowing EMPTY elements to have end-tags.
features_minimize_implydef_attlist Contains WebSGML's option value related to whether attributes can be implied.
features_minimize_implydef_doctype Contains WebSGML's option value related to whether doctypes can be implied.
features_minimize_implydef_element Contains WebSGML's option value related to whether elements can be implied.
features_minimize_implydef_element_anyother Contains WebSGML's option value related to whether implied elements can be directly nested.
features_minimize_implydef_entity Contains WebSGML's option value related to whether references to undeclared entities can be implied as system-specific entities.
features_minimize_omittag Contains whether tag omission support is enabled.
features_minimize_rank Contains whether the RANK feature is enabled.
features_minimize_shorttag_attrib_default Contains whether attributes with default values may be omitted (YES/NO).
features_minimize_shorttag_attrib_omitname Contains whether attributes may be omitted if the value token is a unique token value among the token lists (enumerations) of all declared attributes (YES/NO).
features_minimize_shorttag_attrib_values Contains whether attribute values must be placed in (single or double) quotes (NO) or quotes around attribute values may be omitted (YES) (for certain attribute types).
features_minimize_shorttag_starttag_netenabl WebSGML NETENABL option controlling which element instances may have null-end tags.
features_other_formal Contains the option value to SGML's FEATURES OTHER FORMAL clause to control that public identifiers must be FPIs (or URNs if features_other_urn is set).
features_other_keeprsre Contains the option value to the WebSGML FEATURES OTHER KEEPRSRE clause to control that trailing and leading whitespace shouldn't be ignored in mixed content.
features_other_urn Contains the option value to WebSGML's FEATURES OTHER URN clause to control that public identifiers must be URNs (or FPIs if features_other_formal is set).
features_other_validity Contains the option value to the WebSGML FEATURES VALIDITY clause.
predefined_entity_replacement_text Maps predefined character entities to numerical character references.
public_declaration_reference Contains a WebSGML declaration reference public identifier, if any, supplied instead of an SGML declaration body.
syntax_namecase_entity Contains whether names of entities should be converted to uppercase (if YES).
syntax_namecase_general Contains whether names of elements and attributes should be converted to uppercase (if YES).

Name Description
initialize_defaults Populates settings from defaults unless already set.
initialize_for_html Sets sgmldecl features suitable for HTML (HTML5).
initialize_for_markdown Sets sgmldecl features suitable for when the markdown public declaration reference has been used in SGML declaration.
initialize_for_markdown_without_validation Sets sgmldecl features suitable for when markdown is implied by file name suffix.
initialize_for_xml10 Sets values for sgmldecl_ vars suited for XML 1.0 processing.
initialize_from_arguments Populates settings from same-named configuration properties in supplied map.
initialize_from_decl_or_decl_reference Initializes SGML decl from a (possibly multiline) string containing either a full standalone SGML declaration or a declaration reference.
initialize_predefined_entities_for_html Initializes predefined_entity_replacement_text for HTML5.
initialize_predefined_entities_for_xml Initializes predefined_entity_replacement_text for XML.
is_markdown_sgmldecl_publicid Returns whether the supplied string represents the markdown syntax public identifier for use as SGML declaration reference.
is_supported_xml_decl Returns whether the supplied string contains (just) a supported XML declaration.
is_xml_sgmldecl_publicid Returns whether the supplied string represents the XML 1.0 declaration syntax public identifier for use as SGML declaration reference.
save_to_arguments Populates a map with configuration properties representing declaration settings (the inverse of initialize_from_arguments()).

Member Details

added_requirement_public_ids :Object.<string, string>

Contains zero or more public identifiers representing "added requirements" as per ISO 8879 Annex K.

The values carry no meaning; the object is used only as a hash set.

In particular, this may contain the string

ISO 8879//NOTATION Extensible Markup Language (XML) 1.0//EN

(as per Annex L) to control that entity references should end with REFC (semicolon).
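
A minimal sketch of the hash-set usage follows. It assumes the identifier is spelled with a double slash after "ISO 8879"; the requiresRefc helper is hypothetical and only illustrates a membership test, not how the module itself evaluates the setting.

```javascript
// Public identifier quoted in this reference (as per Annex L).
const XML_ADDED_REQUIREMENT =
  "ISO 8879//NOTATION Extensible Markup Language (XML) 1.0//EN";

// Keys are the public identifiers; the values carry no meaning.
const added_requirement_public_ids = Object.create(null);
added_requirement_public_ids[XML_ADDED_REQUIREMENT] = "";

// Hypothetical helper: true if entity references must end with REFC (";").
function requiresRefc(publicIds) {
  return XML_ADDED_REQUIREMENT in publicIds;
}

console.log(requiresRefc(added_requirement_public_ids)); // true
console.log(requiresRefc(Object.create(null)));          // false
```

Using an object without a prototype avoids accidental hits on inherited keys such as "constructor" when testing membership with the in operator.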

TODO: may also control other XMLism not otherwise expressible via WebSGML such as XML's allgroup restrictions

TODO: use this also for HTML5 (ending rules for R?CDATA declared content)

Note: can't use sole "added_requirements" as member name, will get macro-expanded by sgmldecl-regexpes

features_minimize_emptynrm :string

WebSGML EMPTYNRM option allowing EMPTY elements to have end-tags.

YES : allows elements with declared content EMPTY to have end-tags (end tags are controlled by end-tag omission rules)

NO : disallows such (which is SGML's but not our standard behaviour)

Note that, in addition, VALIDITY TAG-TYPE makes EMPTY elements require end-tags (which is wanted for XML but not HTML).

Moreover, EMPTYNRM YES makes SGML accept end-tags for elements which are implied-EMPTY by having a value for a CONREF attribute specified.

features_minimize_implydef_attlist :string

Contains WebSGML's option value related to whether attributes can be implied.

Note this also (for lack of a better option value in WebSGML's extended SGML declaration) enables that a given element type name may occur in more than one attribute list declaration.

features_minimize_implydef_doctype :string

Contains WebSGML's option value related to whether doctypes can be implied.

features_minimize_implydef_element :string

Contains WebSGML's option value related to whether elements can be implied.

Possible values are YES or NO.

If SGMLDECL_FEATURES_MINIMIZE_IMPLYDEF_ELEMENT_ANYOTHER_SUPPORT has been enabled at build time: if IMPLYDEF ELEMENT ANYOTHER has been specified in the SGML declaration or otherwise, this must have the value YES and features_minimize_implydef_element_anyother must also have the value YES.

features_minimize_implydef_element_anyother :string

Contains WebSGML's option value related to whether implied elements can be directly nested.

If this doesn't contain YES (the default), then elements with implied element declarations can be directly nested; if this contains YES (and features_minimize_implydef_element also contains YES), then, upon encountering a start-element tag for an undeclared element, an end-element tag will be implied if an instance of the same element is open at the top of the output stack.
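
The end-tag inference described above can be sketched as follows. The function name, the event strings, and the array used as the stack of open elements are assumptions made for illustration; the actual parser is not shown in this reference.

```javascript
// Hypothetical sketch: returns the events produced for the start-tag of an
// undeclared element, given the stack of currently open elements.
function startUndeclaredElement(name, openElements, anyother) {
  const events = [];
  const top = openElements[openElements.length - 1];
  if (anyother === "YES" && top === name) {
    // The same element is open at the top of the stack: imply its end-tag.
    openElements.pop();
    events.push("end:" + name);
  }
  openElements.push(name);
  events.push("start:" + name);
  return events;
}

const stack = ["doc", "item"];
// With ANYOTHER set to YES, a second <item> first closes the open one.
console.log(startUndeclaredElement("item", stack, "YES"));
// The stack depth is unchanged: "doc", "item".
console.log(stack);
```

With any value other than YES, the same call would push a second item onto the stack, i.e. the elements would be directly nested.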

features_minimize_implydef_entity :string

Contains WebSGML's option value related to whether references to undeclared entities can be implied as system-specific entities.

features_minimize_omittag :string

Contains whether tag omission support is enabled.

Note this also controls whether tag omission flags are expected in an element declaration.

TODO: set this to YES as per SGML defaults (hence the frequent need to switch this off for XML via a dedicated sgmldecl)

features_minimize_rank :string

Contains whether the RANK feature is enabled.

features_minimize_shorttag_attrib_default :string

Contains whether attributes with default values may be omitted (YES/NO).

features_minimize_shorttag_attrib_omitname :string

Contains whether attributes may be omitted if the value token is a unique token value among the token lists (enumerations) of all declared attributes (YES/NO).

This also controls whether tokens in attlist declarations specifying token lists (enumerations) have to be unique to an element ("YES", as in SGML without the WebSGML adaptations), or not ("NO").

TODO: this isn't actually enforced

features_minimize_shorttag_attrib_values :string

Contains whether attribute values must be placed in (single or double) quotes (NO) or quotes around attribute values may be omitted (YES) (for certain attribute types).

If this is NO, then features_minimize_shorttag_attrib_omitname must also be NO.

TODO: why? (only valid for legacy markdown-awk)

features_minimize_shorttag_starttag_netenabl :string

WebSGML NETENABL option controlling which element instances may have null-end tags.

ALL : allows null end-tags on all elements (unsupported)

IMMEDNET : allows null end-tags on elements without content

NO : disallows any null end-tags (unsupported)

This and the features_minimize_emptynrm option, as well as the settings for the NESTC and NET function characters, are WebSGML adaptations for XML; with respect to XML's empty elements, the NESTC and NET delimiter settings and the NETENABL feature work in concert as follows:

  • NESTC / and NET > make SGML generate the expected parse events for XML empty elements like <bla/>: NESTC ends the start-element tag which SGML sees as <bla/ and enables the null-end delimiter >; > then ends the element
  • technically, it would then be possible to have text such as <bla/further stuff>, but NETENABL IMMEDNET allows null end-tags only for elements with empty content (basically, synthesizing XML parsing rules)

Currently, only IMMEDNET or an unset value is supported; the SGML null end-tag feature is always switched on, and is supported only in combination with the net-enabling start-tag close feature as explained above for XML. Moreover, NET and NESTC cannot be redefined, but always have the hard-coded values stated above.
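
The interplay of NESTC, NET and IMMEDNET can be sketched with a single regular expression. This is an illustration under simplifying assumptions (no attributes, a reduced name syntax, hypothetical function and event names) and not the parser's actual tokenizer.

```javascript
// Hard-coded delimiters as stated above: NESTC is "/", NET is ">".
// Group 1 is the element name, group 2 is whatever sits between NESTC and NET.
const NET_ENABLING_START_TAG = /^<([A-Za-z][A-Za-z0-9.-]*)\/([^>]*)>/;

// Hypothetical helper: returns the parse events for an XML-style empty
// element, or null if the text isn't acceptable under IMMEDNET.
function parseImmednetElement(text) {
  const match = NET_ENABLING_START_TAG.exec(text);
  if (match === null) return null;
  const [, name, content] = match;
  // IMMEDNET: the null end-tag must follow the start-tag immediately.
  if (content !== "") return null;
  return ["start:" + name, "end:" + name];
}

console.log(parseImmednetElement("<bla/>"));              // start and end events for bla
console.log(parseImmednetElement("<bla/further stuff>")); // null
```

The second call is rejected because content appears between NESTC and NET, which is exactly the case that IMMEDNET rules out.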

features_other_formal :string

Contains the option value to SGML's FEATURES OTHER FORMAL clause to control that public identifiers must be FPIs (or URNs if features_other_urn is set).

features_other_keeprsre :string

Contains the option value to the WebSGML FEATURES OTHER KEEPRSRE clause to control that trailing and leading whitespace shouldn't be ignored in mixed content.

Only NO (or no value at all) is supported.

TODO: support YES, test cases for NO

features_other_urn :string

Contains the option value to WebSGML's FEATURES OTHER URN clause to control that public identifiers must be URNs (or FPIs if features_other_formal is set).

features_other_validity :string

Contains the option value to the WebSGML FEATURES VALIDITY clause.

TYPE : (the default for SGML) performs full SGML DTD checking

NOASSERT : (the default for XML) doesn't type-check elements and requires MINIMIZE OMITTAG NO

When LEGACY_VALIDITY_SEMANTICS is enabled:

TAG : means that only wellformedness is checked/required; this checks that the instance is fully tagged and that empty elements have an end-tag (suited for XML); a DOCTYPE isn't required

TAG-TYPE : performs full SGML DTD checking and ensures that EMPTY elements have an end-tag (suited for XML with XML DTDs) (can be set on command line to have validation etc. turned on for backward compat with most test cases; see also EMPTYNRM)

Note: this appears syntactically as FEATURES OTHER VALIDITY in the SGML decl

Note: Annex K (the official WebSGML standard) defines validation differently from the above description: Annex K has only the NOASSERT and TYPE options here and performs validation according to the IMPLYDEF options; in particular,

  • the document element of an instance may be different from the DOCTYPE (i.e. if the doctype is implied)
  • WebSGML will always perform validation if a declaration is present; otherwise it will behave according to the IMPLYDEF options

With our implementation right now, the TAG option suppresses any type checking, even if an appropriate declaration is present.

WebSGML's values map into TAG, TYPE, or TAG-TYPE according to the following rules:

  • if VALIDITY is explicitly set to TYPE, this is taken as-is; otherwise
  • if MINIMIZE OMITTAG is set to YES, this forces TYPE; otherwise
  • if IMPLYDEF ELEMENT is set to YES, this forces TAG; otherwise
  • TAG-TYPE is set
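
The mapping rules above can be written down as a small decision function. This is a sketch only: the function name and the shape of its argument object are assumptions, and the real module may evaluate these settings elsewhere.

```javascript
// Hypothetical helper mapping WebSGML settings to the legacy validity value,
// following the four rules listed above in order.
function effectiveValidity({ validity, omittag, implydefElement }) {
  if (validity === "TYPE") return "TYPE";        // explicit TYPE is taken as-is
  if (omittag === "YES") return "TYPE";          // MINIMIZE OMITTAG YES forces TYPE
  if (implydefElement === "YES") return "TAG";   // IMPLYDEF ELEMENT YES forces TAG
  return "TAG-TYPE";                             // otherwise
}

console.log(effectiveValidity({ validity: "TYPE", omittag: "NO", implydefElement: "YES" }));    // TYPE
console.log(effectiveValidity({ validity: "NOASSERT", omittag: "YES", implydefElement: "YES" })); // TYPE
console.log(effectiveValidity({ validity: "NOASSERT", omittag: "NO", implydefElement: "YES" }));  // TAG
console.log(effectiveValidity({ validity: "NOASSERT", omittag: "NO", implydefElement: "NO" }));   // TAG-TYPE
```

The order of the checks matters: an explicit TYPE or an enabled OMITTAG takes precedence over IMPLYDEF ELEMENT.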

initialize_defaults()

Populates settings from defaults unless already set.

Default values represent those for XML parsing.

initialize_for_html()

Sets sgmldecl features suitable for HTML (HTML5).

This mostly just enables predefined HTML entities for now.

TODO: universally useful settings for validation etc. have yet to be worked out depending on use case, i.e. emptynrm, omittag

initialize_for_markdown()

Sets sgmldecl features suitable for when the markdown public declaration reference has been used in SGML declaration.

The instance being parsed is assumed to use a doctype for HTML so that VALIDITY TYPE makes sense.

TODO: maybe IMPLYDEF ELEMENT should be enabled by default so validation isn't enforced?

initialize_for_markdown_without_validation()

Sets sgmldecl features suitable for when markdown is implied by file name suffix.

Doesn't switch on validation.

initialize_for_xml10()

Sets values for sgmldecl_ vars suited for XML 1.0 processing.

initialize_from_arguments(args)

Populates settings from same-named configuration properties in supplied map.

Parameters

Name Type Description
args Object.<string, string>

the map of configuration properties to populate from

initialize_from_decl_or_decl_reference(decl)

Initializes SGML decl from a (possibly multiline) string containing either a full standalone SGML declaration or a declaration reference.

Can be used (just) for declaration reference parsing when the SGMLDECL_PARSING_SUPPORT macro isn't set at build-time.

Parameters

Name Type Description
decl string

string gleaned from the input stream and looking like an SGML declaration or declaration reference

initialize_predefined_entities_for_html()

Initializes predefined_entity_replacement_text for HTML5.

initialize_predefined_entities_for_xml()

Initializes predefined_entity_replacement_text for XML.

TODO: predefined entities aren't populated from an SGML decl; rather, the most commonly used named character entities are hard-coded

is_markdown_sgmldecl_publicid(s): boolean

Returns whether the supplied string represents the markdown syntax public identifier for use as SGML declaration reference.

Parameters

Name Type Description
s string

argument string

Returns

boolean

is_supported_xml_decl(decl): boolean

Returns whether the supplied string contains (just) a supported XML declaration.

Parameters

Name Type Description
decl string

argument string

Returns

boolean

is_xml_sgmldecl_publicid(s): boolean

Returns whether the supplied string represents the XML 1.0 declaration syntax public identifier for use as SGML declaration reference.

Parameters

Name Type Description
s string

argument string

Returns

boolean

predefined_entity_replacement_text :Object.<string, string>

Maps predefined character entities to numerical character references.

TODO: re-check: maps &lt;, &quot; etc. to themselves
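
A minimal sketch of such a map and its use follows. The key format (entity names without "&" and ";") and the replacePredefinedEntities helper are assumptions for illustration; the code points are the standard ones for these five characters.

```javascript
// Assumed shape: entity name -> numeric character reference.
const predefined_entity_replacement_text = {
  lt: "&#60;",
  gt: "&#62;",
  amp: "&#38;",
  quot: "&#34;",
  apos: "&#39;",
};

// Hypothetical helper: replaces references to predefined entities and
// leaves all other entity references untouched.
function replacePredefinedEntities(text, table) {
  return text.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (ref, name) =>
    Object.prototype.hasOwnProperty.call(table, name) ? table[name] : ref
  );
}

console.log(
  replacePredefinedEntities("a &lt; b &amp;&nbsp;c", predefined_entity_replacement_text)
);
// a &#60; b &#38;&nbsp;c
```

References that aren't in the table, such as &nbsp; here, pass through unchanged so that a later stage can resolve or reject them.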

public_declaration_reference :string

Contains a WebSGML declaration reference public identifier, if any, supplied instead of an SGML declaration body.

This is currently used for assuring that the markdown shortref delimiters are usable, and thus that markdown processing is possible. In addition, an instance needs to have a DOCTYPE which includes markdown shortref maps via a parameter entity reference.

TODO: use/recheck-if used to enforce XML 1.0 (and possibly XML 1.1) naming rules

save_to_arguments(args)

Populates a map with configuration properties representing declaration settings (the inverse of initialize_from_arguments()).

Parameters

Name Type Description
args Object.<string, string>

the map to populate
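
The inverse relationship between save_to_arguments() and initialize_from_arguments() can be sketched as a round trip. The SgmldeclSketch class and the reduced SETTING_NAMES list are hypothetical; the real class covers all member fields documented in this reference.

```javascript
// Hypothetical subset of the member fields documented in this reference.
const SETTING_NAMES = [
  "features_minimize_omittag",
  "features_other_validity",
  "syntax_namecase_general",
];

class SgmldeclSketch {
  // Copies same-named properties from the supplied map into this object.
  initialize_from_arguments(args) {
    for (const name of SETTING_NAMES) {
      if (name in args) this[name] = args[name];
    }
  }

  // Copies every setting that has a value into the supplied map.
  save_to_arguments(args) {
    for (const name of SETTING_NAMES) {
      if (this[name] !== undefined) args[name] = this[name];
    }
  }
}

const source = new SgmldeclSketch();
source.initialize_from_arguments({
  features_minimize_omittag: "YES",
  features_other_validity: "TYPE",
});

const saved = {};
source.save_to_arguments(saved);

const copy = new SgmldeclSketch();
copy.initialize_from_arguments(saved);
console.log(copy.features_other_validity); // TYPE
```

Settings that were never set are simply absent from the saved map, so the copy leaves them unset as well.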

syntax_namecase_entity :string

Contains whether names of entities should be converted to uppercase (if YES).

syntax_namecase_general :string

Contains whether names of elements and attributes should be converted to uppercase (if YES).

TODO: check namecase_general on notations, name tokens ...
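
How a parser might apply the two NAMECASE settings is sketched below. The normalizeName helper is hypothetical, and JavaScript's toUpperCase() is used as an approximation of SGML's upper-case substitution.

```javascript
// Hypothetical helper: folds a name to uppercase if the given NAMECASE
// setting is YES, and returns it unchanged otherwise.
function normalizeName(name, namecase) {
  return namecase === "YES" ? name.toUpperCase() : name;
}

// GENERAL applies to element and attribute names, ENTITY to entity names.
const settings = { syntax_namecase_general: "YES", syntax_namecase_entity: "NO" };

console.log(normalizeName("div", settings.syntax_namecase_general)); // DIV
console.log(normalizeName("nbsp", settings.syntax_namecase_entity)); // nbsp
```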