HtmlPrag

HtmlPrag provides permissive HTML parsing and emitting capability to Scheme programs. The parser is useful for software agent extraction of information from Web pages, for programmatically transforming HTML files, and for implementing interactive Web browsers. HtmlPrag emits "SHTML," which is an encoding of HTML in SXML, so that conventional HTML may be processed with XML tools such as SXPath. Like Oleg Kiselyov's SSAX-based HTML parser, HtmlPrag provides a permissive tokenizer, but also attempts to recover structure. HtmlPrag also includes procedures for encoding SHTML in HTML syntax.
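As a rough sketch of the parse-and-emit round trip described above: the example below assumes that htmlprag.scm sits in the current directory and that your Scheme implementation can `load` it directly (PLT Scheme users would more likely go through the htmlprag.plt packaging). It relies on the `html->shtml` and `shtml->html` procedures as named in the HtmlPrag documentation; check the documentation for your version before depending on them. The sample markup is made up for illustration.

```scheme
;; Assumption: htmlprag.scm is in the current directory and loadable as-is.
(load "htmlprag.scm")

;; Parse an HTML string into SHTML (HTML encoded as an SXML list structure).
(define page
  (html->shtml
   "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"))

;; Print the resulting SHTML so its structure can be inspected.
(write page)
(newline)

;; Encode a hand-written SHTML fragment back into HTML syntax.
;; The (@ ...) form holds the element's attributes, as in SXML.
(display (shtml->html '(p (@ (class "note")) "Hello, " (b "world"))))
(newline)
```

Because the parser's result is an ordinary list, it can be handed to SXML tools or walked with `car` and `cdr` like any other Scheme data.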

The HtmlPrag parsing behavior is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully, without yielding a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages actually contain syntax errors that would defeat a strict or validating parser. HtmlPrag's handling of errors is intended to generally emulate popular Web browsers' interpretation of the structure of erroneous HTML. We euphemistically term this kind of parse "pragmatic."
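To see the permissive behavior for yourself, a small experiment along these lines may help. It assumes htmlprag.scm is loadable from the current directory and uses `html->shtml` as named in the HtmlPrag documentation; the malformed input string is invented for this sketch. The call is expected to return an SHTML structure instead of signaling a parse error, but the exact nesting the parser recovers is not shown here, since it is best checked against the version you have installed.

```scheme
;; Assumption: htmlprag.scm is in the current directory and loadable as-is.
(load "htmlprag.scm")

;; Deliberately erroneous HTML: an unquoted attribute value, paragraphs
;; that are never closed, and <i>/<b> elements closed in the wrong order.
(define messy-html
  "<p align=center>One<p>Two <i>italic <b>bold</i> tail")

;; Parse it and print whatever structure the parser recovers.
(write (html->shtml messy-html))
(newline)
```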

HtmlPrag also has some support for XHTML. XML namespace qualifiers are currently accepted but stripped from the resulting SHTML. Note that valid XHTML input is of course better handled by a validating XML parser like Kiselyov's SSAX.

HtmlPrag requires R5RS, SRFI-6, and SRFI-23.

See the documentation for more information.

To be added to the moderated scheme-announce email list, ask neil@neilvandyke.org.

The current version of HtmlPrag is 0.16 (2005-12-18).

You can download the file htmlprag.scm, the Scheme source code.

You can download the file htmlprag.html, the documentation in HTML format.

You can download the file htmlprag.pdf, the documentation in PDF format.

You can download the file htmlprag.plt, a packaging for PLT Scheme.

Site © Copyright Neil Van Dyke, All Rights Reserved