Central SGML parsing module receiving input record events (from a Recordmanager) and emitting SAX events to the registered DocumentHandler, DTDHandler, and LexicalHandler implementations.

Parsing is invoked by appending records (text lines) to be parsed into an internal markup buffer, using append_markup(). append_markup() will normally call parse_markup() to scan for markup constructs (via delimit_markup()) and fire SAX events on recognized and complete markup constructs. Firing SAX events will trigger registered content and other handlers; notably, markup validation is performed by sending events to the registered Validator (which must implement DocumentHandler and other interfaces). parse_markup() will fire as many events as it can scan from the markup buffer, so a single call to append_markup() may result in multiple SAX events being emitted. Conversely, delimit_markup() or parse_markup() may find that the markup buffer contains incomplete markup (such as unterminated character data or incomplete markup declarations), in which case no events will be emitted for that run of parse_markup().

In general, the internal markup buffer is maintained such that it may contain incomplete markup at any time, in which case parse_markup() will attempt to continue where it left off as new markup is fed.
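The buffering behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions: ToyTokenizer, its emit callback and the "tag"/"characters" event names are hypothetical and are not the real Tokenizer; only the method names append_markup() and parse_markup() mirror this page.

```javascript
// Toy model only: ToyTokenizer and its event names are hypothetical.
class ToyTokenizer {
  constructor(emit) {
    this.markup_buffer = "";
    this.emit = emit; // callback standing in for SAX event dispatch
  }

  // Append one input record, then scan whatever is complete.
  append_markup(text) {
    this.markup_buffer += text;
    this.parse_markup();
  }

  // Emit an event for every complete construct; keep an incomplete tail buffered.
  parse_markup() {
    for (;;) {
      const buf = this.markup_buffer;
      if (buf.length === 0) return;
      if (buf[0] === "<") {
        const close = buf.indexOf(">");
        if (close < 0) return; // incomplete tag: wait for more input
        this.emit("tag", buf.slice(1, close));
        this.markup_buffer = buf.slice(close + 1);
      } else {
        const open = buf.indexOf("<");
        if (open < 0) return; // unterminated character data: wait for more input
        this.emit("characters", buf.slice(0, open));
        this.markup_buffer = buf.slice(open);
      }
    }
  }
}

const seen = [];
const toy = new ToyTokenizer((type, value) => seen.push(`${type}:${value}`));
toy.append_markup("<p>Hello <e"); // "<e" is incomplete and stays buffered
toy.append_markup("m>world</em></p>"); // the pending tag is completed and scanned
console.log(seen.join(" | "));
```

In this sketch the first call emits two events and the second emits four, which matches the statement that a single append_markup() call may produce several events or none.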

Markup parsing can be paused if the input contains an entity reference (to a parsed entity) which can't be resolved. When parsing is paused, new input records can still be fed into the markup buffer (which, however, may grow indefinitely until the entity on which it is "stalled" is eventually defined). This feature is used for entities that are defined programmatically, rather than declared in entity declarations, and is also used for asynchronously fetching referenced external parsed entities (see expand_general_entity_references() for details).
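A minimal sketch of the stalling idea follows. StallingExpander, define_entity() and the entity Map are assumptions made for illustration and are not part of the documented API; the real logic lives in expand_general_entity_references() and is not reproduced here.

```javascript
// Hypothetical sketch: StallingExpander is not part of the documented API.
class StallingExpander {
  constructor(entities, emit) {
    this.entities = entities; // Map of entity name -> replacement text
    this.emit = emit; // callback receiving fully expanded text
    this.markup_buffer = "";
    this.unresolved_entity_name = null;
  }

  append_markup(text) {
    this.markup_buffer += text; // input is accepted even while stalled
    this.parse_markup();
  }

  // Called when an entity becomes available, e.g. after an asynchronous fetch.
  define_entity(name, text) {
    this.entities.set(name, text);
    if (this.unresolved_entity_name === name) {
      this.unresolved_entity_name = null;
      this.parse_markup(); // resume where parsing stalled
    }
  }

  parse_markup() {
    const ref = /&([A-Za-z][A-Za-z0-9]*);/;
    for (;;) {
      const m = ref.exec(this.markup_buffer);
      if (!m) break;
      if (!this.entities.has(m[1])) {
        this.unresolved_entity_name = m[1]; // stall: keep everything buffered
        return;
      }
      this.markup_buffer =
        this.markup_buffer.slice(0, m.index) +
        this.entities.get(m[1]) +
        this.markup_buffer.slice(m.index + m[0].length);
    }
    if (this.markup_buffer.length > 0) {
      this.emit(this.markup_buffer);
      this.markup_buffer = "";
    }
  }
}
```

While the sketch is stalled, each append_markup() call only grows markup_buffer, which is the unbounded-growth caveat noted above.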

Tokenizer stores markup definitions (parsed from the document prolog) to a registered Markupdefinitions object. In some cases, Tokenizer also requests back metadata from Markupdefinitions (notably, on SGML CDATA and RCDATA declared content, but other cases as well).

Tokenizer is instantiated with a downstream content handler which will receive events in the input stream, and a validation handler which will also receive those events before the primary content handler. Validator is expected to generate omitted tags, or produce an error event, on its registered content handler such that the events generated from Tokenizer are valid according to the SGML declaration and document type declaration for a given instance. So if Tokenizer and Validator feed events to the same content handler, that content handler will receive a valid event stream, whereas if Validator feeds events into e.g. a null handler, Tokenizer's content handler will receive only the events actually parsed from the input source.
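The two wiring options can be sketched with stand-in classes. Everything below (RecordingHandler, NullHandler, ToyValidator, ToyEventSource, and the rule that a body start tag is inferred before the first p) is hypothetical and only illustrates the order of dispatch described above.

```javascript
// Hypothetical stand-ins; none of these classes are the real implementations.
class RecordingHandler {
  constructor() { this.events = []; }
  startElement(name) { this.events.push(`<${name}>`); }
  endElement(name) { this.events.push(`</${name}>`); }
}

class NullHandler {
  startElement() {}
  endElement() {}
}

// Infers an omitted <body> start tag before the first <p> (toy rule).
class ToyValidator {
  constructor(contenthandler) {
    this.contenthandler = contenthandler;
    this.bodyOpen = false;
  }
  startElement(name) {
    if (name === "body") this.bodyOpen = true;
    if (name === "p" && !this.bodyOpen) {
      this.bodyOpen = true;
      this.contenthandler.startElement("body"); // inferred tag
    }
  }
  endElement() {}
}

// Sends each parsed event to the validation handler first, then downstream.
class ToyEventSource {
  constructor(contenthandler, validationcontenthandler) {
    this.contenthandler = contenthandler;
    this.validationcontenthandler = validationcontenthandler;
  }
  startElement(name) {
    this.validationcontenthandler.startElement(name);
    this.contenthandler.startElement(name);
  }
  endElement(name) {
    this.validationcontenthandler.endElement(name);
    this.contenthandler.endElement(name);
  }
}

// Shared handler: the inferred <body> appears in the same stream as <p>.
const shared = new RecordingHandler();
const source = new ToyEventSource(shared, new ToyValidator(shared));
source.startElement("p");
source.endElement("p");
console.log(shared.events.join(""));
```

Passing new ToyValidator(new NullHandler()) instead leaves only the parsed events in the downstream handler.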

Tokenizer and Validator need to operate on the same outputstack as Validator pushes/pops elements with omitted tags to/from there.

Constructor

new Tokenizer(sgmldecl, encoder, errorhandler, locator, resolver, entitydefinitions, markupdefinitions, attributechecker, docinfo, outputstack, contenthandler, lexhandler, validationcontenthandler, validationlexhandler, dtdhandler, recordmanager, saxeventmanager, prologhandler)

Parameters

Name Type Description
sgmldecl Sgmldecl
encoder Markupencoder
errorhandler Errorhandler
locator Locator
resolver SystemSpecificEntityResolver
entitydefinitions Entitydefinitions
markupdefinitions Markupdefinitions
attributechecker AttributeChecker
docinfo Docinfo
outputstack Object
contenthandler DocumentHandler
lexhandler LexicalHandler

LexicalHandler to send startDtd()/endDtd() events to

validationcontenthandler DocumentHandler
validationlexhandler LexicalHandler
dtdhandler DTDHandler

DTDHandler to send notations and declarations of unparsed entities (from declaration sets) to

recordmanager Recordmanager
saxeventmanager Saxeventmanager
prologhandler Prologhandler

Name Description
active_lpd_names Contains optional linktype names to activate.
bundledfunctions Generated/prebuilt module containing functions configured at build time.
expected_external_dtd_subset_identifier Contains a (system) identifier that the DTD is expected to reference.
ignore_clear_unresolved_entity_name Flag indicating that calls to clear_unresolved_entity_name() (issued from markdown) should be ignored.
no_stalling_at_end_of_markup Flag to indicate that (1) `parse_markup()` should proceed, rather than postpone and reparse, on characters at the end of `markup_buffer`, and (2) a fatal error should be generated if incomplete markup is detected.
no_stalling_on_unresolved_entity Public flag indicating that `parse_markup()` should proceed, rather than return if it is stalled on a link entity reference.
running_as_template_subprocessing_context Flag to indicate that processing is performed on a template (in a template subprocessing context).
system_specific_implied_lpd_names Contains the (comma-separated or single) name(s) of forced additional LPDs, treated as if declared as system-specific LPDs following the LPDs actually present in the document prolog, unless an LPD with that name is explicitly present in the actual prolog.
system_specific_implied_lpd_result_document_type_names Contains the (comma-separated) names of result doctypes of implied LPDs, if any, such that a name represents the result doctype of the implied explicit link process at the corresponding position in system_specific_implied_lpd_names.
system_specific_implied_lpd_source_document_type_names Contains the (comma-separated) names of source doctypes of implied LPDs, if any, such that a name represents the source doctype of the implied explicit link process at the corresponding position in system_specific_implied_lpd_names.

Name Description
append_markup Appends text to markup_buf and calls parse_markup() to process markup_buf.
clear_unresolved_entity_name Resets the unresolved entity name which causes `parse_markup()` to stall, if any.
configure Configuration method.
derive_storage_manager_notation_metadata Copies attribute declarations from super-storage manager notations to derived storage manager notations, traversing through the chain of superdcn-values of the argument storage manager notation name.
end_markup Used after call(s) to `append_markup()` to indicate that no more text will be fed via `append_markup()`.
get_unresolved_entity_name Returns the unresolved entity name which causes `parse_markup()` to stall, if any.
is_data_specification_attribute Returns whether supplied named attribute is a DATA specification attribute of elementtype.
parse_attlist_decl Parses an attribute declaration from a string containing a markup declaration and calls store_element_attribute_decl()/store_data_attribute_decl() with the extracted details for each declared element/attribute combination.
parse_notation_decl Parses and stores a declaration from a string containing a notation declaration.
remove_linefeed_at_end_of_markup Helper function to remove last character from markup_buf if it is a newline character.
reset Resets internal state.
set_debug_emit_ctx_token Sets the string printed as part of debugEmit messages.
set_unresolved_entity_name Sets an arbitrary entity name as unresolved, which will cause the next call to `parse_markup()` to stall.
switchoff_stalling_at_end_of_markup Sets no_stalling_at_end_of_markup.
switchoff_stalling_on_unresolved_entity Sets no_stalling_on_unresolved_entity.

Member Details

active_lpd_names :string

Contains optional linktype names to activate.

If active_lpd_names is set to a string containing one or more (space- or comma-separated) link type names referring to simple or implicit links, then those will be activated (at most one implicit link is activated, though). If, additionally, target_document_type_name is set, then the LPDs needed to result in that doctype name will be activated as described above, and simple links and a single implicit link according to active_lpd_names will be activated on the ultimate output of the explicit link chain. active_lpd_names may also contain explicit link names, in which case explicit link definitions with matching names will be used as part of the link chain. It's not an error if link processes named by tokens in active_lpd_names aren't actually declared, but if a link process declared in the prolog matches a name in active_lpd_names, it will always be activated, and it's an error if it can't be activated (such as when multiple implicit links are attempted to be activated).
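As a small illustration of the space- or comma-separated format only, the helper below splits such a string into names. splitLinkTypeNames is a hypothetical function, not part of the Tokenizer API, and it does not model which links end up activated.

```javascript
// Hypothetical helper: splits a space- or comma-separated list of link type names.
function splitLinkTypeNames(active_lpd_names) {
  return (active_lpd_names || "")
    .split(/[\s,]+/) // any run of whitespace and/or commas separates names
    .filter((name) => name.length > 0);
}

console.log(splitLinkTypeNames("LNK1, LNK2 LNK3"));
```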

append_markup(text)

Appends text to markup_buf and calls parse_markup() to process markup_buf.

Parameters

Name Type Description
text string

bundledfunctions

Generated/prebuilt module containing functions configured at build time.

clear_unresolved_entity_name()

Resets the unresolved entity name which causes parse_markup() to stall, if any.

configure(args)

Configuration method.

Parameters

Name Type Description
args Object.<string, string>

Map of configuration properties

derive_storage_manager_notation_metadata(sm_notation_name)

Copies attribute declarations from super-storage manager notations to derived storage manager notations, traversing through the chain of superdcn-values of the argument storage manager notation name.

At most two levels of superdcn-chain are traversed.

Also sets the superdcn value of the argument notation to the ultimate storage manager notation if a three-step derivation is performed, so that the conditions expected by rewrite_custom_into_base_storage_manager_notation() (e.g. the superdcn value contains a natively supported storage manager notation) are met.

Parameters

Name Type Description
sm_notation_name string

derived storage manager notation to assert attribute declarations for
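The traversal described for derive_storage_manager_notation_metadata() might look roughly like the sketch below. The data model (a Map from notation name to an object with superdcn and attributes), the function name deriveNotationMetadata, and the rule that declarations already present on the derived notation take precedence are all assumptions made for illustration.

```javascript
// Hypothetical data model: Map of name -> { superdcn: string|null, attributes: Map }.
function deriveNotationMetadata(notations, smNotationName) {
  const derived = notations.get(smNotationName);
  if (!derived) return;
  let superName = derived.superdcn;
  let ultimate = null;
  // Walk at most two levels up the superdcn chain.
  for (let level = 0; level < 2 && superName; level++) {
    const sup = notations.get(superName);
    if (!sup) break;
    for (const [attr, decl] of sup.attributes) {
      // Assumption: declarations on the derived notation take precedence.
      if (!derived.attributes.has(attr)) derived.attributes.set(attr, decl);
    }
    ultimate = superName;
    superName = sup.superdcn;
  }
  // Assumption: if a second level was reached, point superdcn at it.
  if (ultimate && ultimate !== derived.superdcn) derived.superdcn = ultimate;
}
```

With a chain custom -> mid -> base, this sketch copies the attribute declarations of mid and base onto custom and leaves custom.superdcn pointing at base.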

end_markup()

Used after call(s) to append_markup() to indicate that no more text will be fed via append_markup(). This drains the output context of any outstanding tag validation/inference or dispatch events, and also demarcates the end of an input entity.

expected_external_dtd_subset_identifier :string

Contains a (system) identifier that the DTD is expected to reference. If this is #IMPLIED, this indicates that we're expecting the base DTD to be <!DOCTYPE ... SYSTEM> (where the doctype can be #IMPLIED or specified), or that we can omit the prolog altogether.

get_unresolved_entity_name()

Returns the unresolved entity name which causes parse_markup() to stall, if any.

ignore_clear_unresolved_entity_name

Flag indicating that calls to clear_unresolved_entity_name() (issued from markdown) should be ignored.

The purpose is to ensure that no asynchronous fetches are triggered from markdown cleanup() processing.

Used from Recordhandler.

is_data_specification_attribute()

Returns whether the supplied named attribute is a DATA specification attribute of elementtype.

no_stalling_at_end_of_markup :number

Flag to indicate that

  • parse_markup() should proceed, rather than postpone and reparse, on characters at the end of markup_buffer
  • a fatal error should be generated if incomplete markup is detected.

Set by end_markup() before invoking the final call to parse_markup() to flush remaining characters in markup buffer, if any.
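The two effects of the flag can be sketched with a toy scanner. scanBuffer and its event strings are hypothetical and do not reproduce the real parse_markup(); the sketch only contrasts stalling with the flush-or-fail behaviour listed above.

```javascript
// Hypothetical sketch; scanBuffer is not part of the documented API.
function scanBuffer(buffer, no_stalling_at_end_of_markup) {
  const events = [];
  let rest = buffer;
  while (rest.length > 0) {
    if (rest[0] === "<") {
      const close = rest.indexOf(">");
      if (close < 0) {
        if (no_stalling_at_end_of_markup) {
          // No more input will arrive, so incomplete markup is fatal.
          throw new Error(`incomplete markup at end of input: ${rest}`);
        }
        break; // stall: keep the partial tag for the next record
      }
      events.push(`tag:${rest.slice(1, close)}`);
      rest = rest.slice(close + 1);
    } else {
      const open = rest.indexOf("<");
      if (open < 0) {
        if (!no_stalling_at_end_of_markup) break; // more text may follow
        events.push(`characters:${rest}`); // flush trailing characters
        rest = "";
      } else {
        events.push(`characters:${rest.slice(0, open)}`);
        rest = rest.slice(open);
      }
    }
  }
  return { events, rest };
}
```

Calling scanBuffer("<p>tail", 0) leaves "tail" buffered, scanBuffer("<p>tail", 1) flushes it as a characters event, and scanBuffer("<p", 1) throws.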

no_stalling_on_unresolved_entity

Public flag indicating that parse_markup() should proceed, rather than return if it is stalled on a link entity reference.

Used from end_markup() and other places to force-flush the markup buffer. Will cause markdown roundtrip text to be written out to the output.

Note: the outputhandler might receive more than one consecutive characters() event in this case.

parse_attlist_decl(declaration_set_name, decl)

Parses an attribute declaration from a string containing a markup declaration and calls store_element_attribute_decl()/store_data_attribute_decl() with the extracted details for each declared element/attribute combination.

Also used for attlists declared in LPDs

Parameters

Name Type Description
declaration_set_name string

name of the document type in which the attribute list declaration occurs

decl string

declaration text to parse as attribute list declaration

parse_notation_decl(declaration_set_name, decl)

Parses and stores a declaration from a string containing a notation declaration.

Like parse_entity_decl(), this is called without expanded parameter entities and uses its own effective markup declarations recording code.

Parameters

Name Type Description
declaration_set_name string

name of declaration set in which the notation decl occurs

decl string

code text of the notation declaration to parse

Returns

the parameter-entity-expanded declaration that was processed

remove_linefeed_at_end_of_markup()

Helper function to remove the last character from markup_buf if it is a newline character. Factored out from end_markup() so that asynchronous and synchronous processing give consistent results (i.e. it is used prior to the final call to parse_markup() which, in the async case, performs some of what end_markup() does in the sync case).

reset()

Resets internal state.

running_as_template_subprocessing_context :number

Flag to indicate that processing is performed on a template (in a template subprocessing context).

Used for warning about entity references to system-specific entities with lower or mixed case names if SYNTAX NAMECASE GENERAL and SYNTAX NAMECASE ENTITY don't match.

set_debug_emit_ctx_token()

Sets the string printed as part of debugEmit messages.

set_unresolved_entity_name()

Sets an arbitrary entity name as unresolved, which will cause the next call to parse_markup() to stall. Used by Recordhandler to force stalling parse_markup() during markdown cleanup.

switchoff_stalling_at_end_of_markup()

Sets no_stalling_at_end_of_markup.

switchoff_stalling_on_unresolved_entity()

Sets no_stalling_on_unresolved_entity.

system_specific_implied_lpd_names :String

Contains the (comma-separated or single) name(s) of forced additional LPDs, treated as if declared as system-specific LPDs following the LPDs actually present in the document prolog, unless an LPD with that name is explicitly present in the actual prolog. system_specific_implied_lpd_source_document_type_names and system_specific_implied_lpd_result_document_type_names contain the corresponding source and result doctypes, respectively.

Implicit link processes don't have a corresponding source or result doctype, and are specified as the last (or only) name in system_specific_implied_lpd_names, without a corresponding name in either system_specific_implied_lpd_source_document_type_names or system_specific_implied_lpd_result_document_type_names.

For example, a set of consistent values for system_specific_implied_lpd_names, system_specific_implied_lpd_source_document_type_names, and system_specific_implied_lpd_result_document_type_names is

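system_specific_implied_lpd_names: LNK1,LNK2,LNK
system_specific_implied_lpd_source_document_type_names: DOC,OUT
system_specific_implied_lpd_result_document_type_names: OUT,OUT2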

which will be treated as though the following link process declarations were present in the SGML prolog (where the document type names are assumed to be present)

<!DOCTYPE DOC [ ... ]>
<!DOCTYPE OUT [ ... ]>
<!DOCTYPE OUT2 [ ... ]>
<!LINKTYPE LNK1 DOC OUT SYSTEM>
<!LINKTYPE LNK2 OUT OUT2 SYSTEM>
<!LINKTYPE LNK OUT2 #IMPLIED SYSTEM>
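The correspondence between the three properties and the implied declarations can be sketched as follows. impliedLinktypeDeclarations is a hypothetical helper, not part of the Tokenizer API. It assumes the trailing implicit link applies to the last result doctype, as in the example above, and deliberately does not model the case of a lone implicit link without any explicit links.

```javascript
// Hypothetical helper, not part of the documented API.
function impliedLinktypeDeclarations(names, sourceNames, resultNames) {
  const split = (s) => (s ? s.split(",").map((n) => n.trim()).filter(Boolean) : []);
  const lpds = split(names);
  const sources = split(sourceNames);
  const results = split(resultNames);
  if (sources.length !== results.length) {
    throw new Error("source and result doctype lists differ in length");
  }
  if (lpds.length < sources.length || lpds.length > sources.length + 1) {
    throw new Error("at most one trailing implicit link name is allowed");
  }
  return lpds.map((name, i) => {
    if (i < sources.length) {
      // Explicit link: source and result doctypes at the same position.
      return `<!LINKTYPE ${name} ${sources[i]} ${results[i]} SYSTEM>`;
    }
    // Trailing name without doctypes: implicit link on the last result doctype.
    if (results.length === 0) {
      throw new Error("base doctype for a lone implicit link is not modelled");
    }
    return `<!LINKTYPE ${name} ${results[results.length - 1]} #IMPLIED SYSTEM>`;
  });
}

console.log(impliedLinktypeDeclarations("LNK1,LNK2,LNK", "DOC,OUT", "OUT,OUT2").join("\n"));
```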

system_specific_implied_lpd_result_document_type_names :String

Contains the (comma-separated) names of result doctypes of implied LPDs, if any, such that a name represents the result doctype of the implied explicit link process at the corresponding position in system_specific_implied_lpd_names.

system_specific_implied_lpd_source_document_type_names :String

Contains the (comma-separated) names of source doctypes of implied LPDs, if any, such that a name represents the source doctype of the implied explicit link process at the corresponding position in system_specific_implied_lpd_names.