State machine for SGML validation and tag inference.

It is implemented as a SAX content and lexical handler that accepts events with omitted tags and emits the inferred tags to its registered contenthandler. The input stream is validated against the definitions from a DTD, if the input has one, and according to the sgmldecl.features_other_validity_... variables.

Validator must be configured with an element stack, which is typically shared with an upstream Tokenizer. Validator learns the name of the document type definition to validate or infer tags against from the start_dtd() events it receives.

Implements DocumentHandler, LexicalHandler.

Constructor

new Validator(sgmldecl, errorhandler, locator, markupdefinitions, attributechecker, outputstack, contenthandler, docinfo)

Parameters

Name Type Description
sgmldecl Sgmldecl
errorhandler Errorhandler
locator Locator
markupdefinitions Markupdefinitions
attributechecker AttributeChecker
outputstack Object
contenthandler DocumentHandler
docinfo Docinfo

Name Description
attempt_to_close_potentially_completed_element Like close_potentially_completed_element(), but doesn't fail (returning a false-y value instead).
check_acceptance Returns whether supplied content token is accepted at the current state.
close_potentially_completed_element Shared subroutine to pop a single element of positions_stack if at a terminal state.
comment No-Op implementation.
endCDATA No-Op implementation.
endDocument Called after the last pushxml() call.
endElement Called before an end-element tag is dispatched.
endEntity No-Op implementation.
endIGNORE No-Op implementation.
endINCLUDE No-Op implementation.
endRCDATA No-Op implementation.
open_contextually_implied_element_below_document_element Opens the single optional element (if any) acceptable at the context position, which must be directly below the document element.
populate_docinfo_exclusion_exceptions Routine used by try_accept_token to establish (docinfo)exclusion_exceptions on code paths where try_accept_not_excluded_token isn't called.
processingInstruction No-Op implementation.
reset Resets internal state.
set_debug_emit_ctx_token Sets the string printed as part of debugEmit messages.
set_document_element_name Explicitly sets document element name ("root element") to expect/infer.
set_document_type_name Explicitly sets document type name (of those available in markupdefinitions) to validate against.
startCDATA No-Op implementation.
startDocument Called before any other events.
startDTD Implementation of lexical handler callback to capture the document type definition name to validate against.
startEntity Validate a data entity as PCDATA.
startIGNORE No-Op implementation.
startINCLUDE No-Op implementation.
startRCDATA No-Op implementation.

Member Details

attempt_to_close_potentially_completed_element(): string

Like close_potentially_completed_element(), but doesn't fail (returning a false-y value instead).

Returns

string

the element of the top-most completed content model (i.e. the element to close), or a false-y value/empty string
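
The difference between the failing and the non-failing variant can be pictured with a small stand-alone model. This is only a sketch: the functions closeOrFail and attemptToClose, and the element/terminal fields, are hypothetical stand-ins and not the Validator's actual internals.

```javascript
// Hypothetical stand-in for the failing variant: returns the element to
// close, or throws when the top-most content model is not at a terminal state.
function closeOrFail(positionsStack) {
  const top = positionsStack[positionsStack.length - 1];
  if (!top || !top.terminal) {
    throw new Error("top-most content model is not completed");
  }
  positionsStack.pop();
  return top.element;
}

// Hypothetical stand-in for the non-failing variant: same result on
// success, but an empty (false-y) string instead of a failure.
function attemptToClose(positionsStack) {
  const top = positionsStack[positionsStack.length - 1];
  if (!top || !top.terminal) return "";
  positionsStack.pop();
  return top.element;
}

// Assumed toy stack: "p" may end here, "doc" may not.
const demoStack = [
  { element: "doc", terminal: false },
  { element: "p", terminal: true },
];
const closed = attemptToClose(demoStack); // "p" is popped and returned
const nothing = attemptToClose(demoStack); // "doc" is not terminal, so ""
```

Because the non-failing variant reports "nothing to close" as a false-y value, a caller can branch on the result with a plain if instead of handling a failure.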

check_acceptance(content_token): string

Returns whether supplied content token is accepted at the current state.

Wraps try_accept_token for export.

Parameters

Name Type Description
content_token string

the content token to check for acceptance

Returns

string

one of the following: the state/position at which content_token is accepted, if it is accepted as a proper subelement of a modelgroup; the string " " (one blank), if an element content_token is accepted via inclusion; the string "  " (two blanks), if a #PCDATA content_token is accepted as declared content; or the empty string, if content_token isn't accepted
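
Since the return value encodes four outcomes in a single string, callers may find it convenient to decode it once. The following is a minimal sketch of such a decoder; describeAcceptance is a hypothetical helper that is not part of Validator, and the sample position string "s3" is made up for illustration.

```javascript
// Hypothetical helper: maps the string returned by check_acceptance()
// to a descriptive label, following the documented convention.
function describeAcceptance(result) {
  if (result === "") return "rejected"; // token isn't accepted
  if (result === " ") return "inclusion"; // one blank
  if (result === "  ") return "declared-content"; // two blanks
  return "modelgroup"; // any other string is a state/position
}

// "s3" is an invented placeholder for a state/position string.
const labels = ["", " ", "  ", "s3"].map(describeAcceptance);
```

Note that the empty string is the only false-y result, so a simple truthiness check is enough when only acceptance versus rejection matters.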

close_potentially_completed_element(): string

Shared subroutine to pop a single element of positions_stack if at a terminal state.

Neither pops outputstack nor calls dispatch_element; the caller must check whether the popped element is expected in the given context and whether end-tag omission is valid for it.

Note that this routine differs from close_definitely_completed_elements() by also closing potentially finished modelgroups, so it is only useful when it is known from context that the current input token isn't accepted in the topmost content model state, or when the current input is an end-element tag or the end of the document.

Returns

string

the element of the top-most completed content model (i.e. the element to close), or fails
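
The distinction between "potentially" and "definitely" completed can be illustrated with a toy model. The reading used here is an assumption, not taken from the implementation: a state counts as potentially completed when the content model may end there, and as definitely completed when, in addition, no further token could be accepted. All names and fields below are hypothetical.

```javascript
// Toy content-model states; field names are illustrative only.
// terminal: the content model may legally end in this state
// accepts:  tokens that could still be accepted from this state
const beforeTitle = { element: "section", terminal: false, accepts: ["title"] };
const afterTitle = { element: "section", terminal: true, accepts: ["para"] };
const afterOnlyChild = { element: "title", terminal: true, accepts: [] };

// Assumed reading of "potentially completed": ending here is allowed.
function isPotentiallyCompleted(state) {
  return state.terminal;
}

// Assumed reading of "definitely completed": ending here is allowed
// and nothing more could follow.
function isDefinitelyCompleted(state) {
  return state.terminal && state.accepts.length === 0;
}
```

Under this reading, afterTitle is potentially but not definitely completed, which is why closing it is only safe once the context shows that the current input cannot continue the element.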

comment()

No-Op implementation.

endCDATA()

No-Op implementation.

endDocument()

Called after the last pushxml() call.

endElement()

Called before an end-element tag is dispatched.

The routine is expected to either change modelgroup states if needed (close open elements etc.) such that the context element (xmlelement) can be popped and dispatched after, or terminate with sgml_fatal_error().

Note that suppress_dispatch plays no role here, as this feature is only used on fully-tagged input XML from the markdown frontend, which never reaches this routine.
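
The contract described above (adjust state so that the context element can be popped, or fail) might be pictured roughly as follows. This is a simplified sketch under stated assumptions: prepareEndElement and the element/terminal fields are hypothetical, a thrown Error stands in for sgml_fatal_error(), and checks for end-tag omission validity are left out.

```javascript
// Hypothetical model: closes open elements above `name` so that `name`
// ends up on top of the stack, ready to be popped by the caller.
// Throws (standing in for sgml_fatal_error) if that isn't possible.
function prepareEndElement(openElements, name) {
  while (openElements.length > 0) {
    const top = openElements[openElements.length - 1];
    if (!top.terminal) {
      throw new Error("content of <" + top.element + "> is incomplete");
    }
    if (top.element === name) return; // caller pops and dispatches
    openElements.pop(); // implicitly close a completed element
  }
  throw new Error("no open element <" + name + "> to close");
}

// Assumed toy stack with all content models at a terminal state.
const openDemo = [
  { element: "doc", terminal: true },
  { element: "section", terminal: true },
  { element: "p", terminal: true },
];
prepareEndElement(openDemo, "section"); // closes "p", leaves "section" on top
```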

endEntity()

No-Op implementation.

endIGNORE()

No-Op implementation.

endINCLUDE()

No-Op implementation.

endRCDATA()

No-Op implementation.

open_contextually_implied_element_below_document_element()

Opens the single optional element (if any) acceptable at the context position, which must be directly below the document element. This applies, for example, when closing the document element is the only alternative and is clearly not plausible given a content_token start-element event to accommodate.

populate_docinfo_exclusion_exceptions()

Routine used by try_accept_token to establish (docinfo)exclusion_exceptions on code paths where try_accept_not_excluded_token isn't called.

processingInstruction()

No-Op implementation.

reset()

Resets internal state.

set_debug_emit_ctx_token()

Sets the string printed as part of debugEmit messages.

set_document_element_name(name)

Explicitly sets document element name ("root element") to expect/infer.

By default, the document type name (either explicitly set or implicitly captured from the first start_dtd event) is considered the document element name. Setting it explicitly is needed for reusing already established metadata (markupdefinitions) in an otherwise fresh processing context for simple templating.

Parameters

Name Type Description
name string

set_document_type_name(name)

Explicitly sets document type name (of those available in markupdefinitions) to validate against. Must be set prior to receiving the first start_dtd() and/or start_element() event; otherwise the name transferred in the first start_dtd will be used by default.

Parameters

Name Type Description
name string

the document type name to set
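
The precedence rules for the document type name and the document element name can be summarized in a small stand-alone model. This is a sketch only: the class DoctypeNameModel and its members are hypothetical and merely mimic the documented defaults. Calling the setter after the first start_dtd is outside the documented contract and is not modeled here.

```javascript
// Hypothetical model of the documented naming defaults; not the Validator.
class DoctypeNameModel {
  constructor() {
    this.typeName = null;
    this.elementName = null;
  }
  // Explicit setting, expected before the first start_dtd event.
  setDocumentTypeName(name) {
    this.typeName = name;
  }
  setDocumentElementName(name) {
    this.elementName = name;
  }
  // The first start_dtd supplies the type name unless one is already set.
  startDTD(name) {
    if (this.typeName === null) this.typeName = name;
  }
  // By default the document type name doubles as the document element name.
  get effectiveElementName() {
    return this.elementName ?? this.typeName;
  }
}

const implicit = new DoctypeNameModel();
implicit.startDTD("html"); // type name captured from the event

const explicit = new DoctypeNameModel();
explicit.setDocumentTypeName("docbook");
explicit.startDTD("html"); // ignored, an explicit name was already set
explicit.setDocumentElementName("section"); // overrides the default root
```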

startCDATA()

No-Op implementation.

startDocument()

Called before any other events.

startDTD()

Implementation of lexical handler callback to capture the document type definition name to validate against.

startEntity()

Validate a data entity as PCDATA.

startIGNORE()

No-Op implementation.

startINCLUDE()

No-Op implementation.

startRCDATA()

No-Op implementation.