NAME

cs::HTML - support for parsing and generating HTML markup


SYNOPSIS

use cs::HTML;


DESCRIPTION

This module supplies methods for decomposing HTML text into a data structure, and methods for converting a Perl-friendly data structure into HTML text.

A cs::HTML object does not represent a tag but a token source from which HTML tokens may be read. See OBJECT CREATION below.


GENERAL FUNCTIONS

singular(tag)

Test whether a tag needs a closing </tag> partner.

mkTok(tag,attrs,subtokens...)

Compose the arguments into a hashref of the form:

{ TAG => tag, ATTRS => attrs, TOKENS => [ subtokens... ] }

If omitted, attrs defaults to {}.
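The shape of the result can be illustrated with a minimal stand-alone sketch. The name mk_tok_sketch is hypothetical and the body is written only from the description above; it is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for mkTok, based only on the documented behaviour.
sub mk_tok_sketch {
    my ($tag, $attrs, @subtokens) = @_;
    $attrs = {} unless defined $attrs;    # attrs defaults to an empty hashref
    return { TAG => $tag, ATTRS => $attrs, TOKENS => [@subtokens] };
}

# A link with one attribute and one text subtoken, and a bare tag.
my $link = mk_tok_sketch('A', { HREF => 'http://example.com/' }, 'a link');
my $rule = mk_tok_sketch('HR');
```

Here $rule ends up with an empty ATTRS hashref and an empty TOKENS array.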

tokUnfold(tokens...)

Take a hand constructed array of HTML tokens, which may contain a mix of tokens in the form

{ TAG => tag, ATTRS => attrs, TOKENS => [ subtokens... ] }

and the form

[ tag, attrs, subtokens... ]

and return a list of these tokens converted into the first form, suitable for analysis.
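The conversion can be sketched as a short recursive routine. This is an illustration only: unfold_sketch is a hypothetical name, and it assumes the array form always carries attrs in second position (as [ tag, attrs, subtokens... ]) and that plain strings are text.

```perl
use strict;
use warnings;

# Hypothetical sketch of the unfolding described above; not the module's code.
sub unfold_sketch {
    my @out;
    for my $t (@_) {
        if (ref $t eq 'ARRAY') {
            # Array form: [ tag, attrs, subtokens... ]
            my ($tag, $attrs, @sub) = @$t;
            push @out, { TAG    => $tag,
                         ATTRS  => $attrs || {},
                         TOKENS => [ unfold_sketch(@sub) ] };
        }
        elsif (ref $t eq 'HASH') {
            # Already in hash form: copy it, unfolding its subtokens too.
            push @out, { %$t,
                         TOKENS => [ unfold_sketch(@{ $t->{TOKENS} || [] }) ] };
        }
        else {
            push @out, $t;    # plain text passes through unchanged
        }
    }
    return @out;
}

my @unfolded = unfold_sketch([ 'P', {}, 'hello ', [ 'B', {}, 'world' ] ]);
```

After the call, $unfolded[0] is a hash-form P token whose second subtoken is a hash-form B token.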

tokFlat(tokens...)

Take some HTML tokens and return a string containing the textual component, with all markup discarded.
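A minimal sketch of the idea, assuming tokens are in the hash form shown earlier and that text appears as plain strings. The name flat_sketch is hypothetical and is not part of the module.

```perl
use strict;
use warnings;

# Hypothetical sketch: concatenate the text leaves, ignoring all tags.
sub flat_sketch {
    my $text = '';
    for my $t (@_) {
        if (ref $t eq 'HASH') {
            $text .= flat_sketch(@{ $t->{TOKENS} || [] });
        }
        elsif (!ref $t) {
            $text .= $t;
        }
    }
    return $text;
}

my $para = { TAG => 'P', ATTRS => {}, TOKENS => [
    'hello ', { TAG => 'B', ATTRS => {}, TOKENS => ['world'] } ] };
my $flat = flat_sketch($para);    # "hello world"
```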

tok2a(doindent,tokens...)

Take a list of HTML tokens and return the HTML text, nicely indented if doindent is 1. If omitted, doindent defaults to 0. In a scalar context returns a single string containing the HTML. In an array context returns an array of strings, each an HTML token.

tok2s(doindent,sink,tokens...)

Take a list of HTML tokens and write the HTML text to sink (a cs::Sink), nicely indented if doindent = 1. If omitted, doindent defaults to 0.

doesTagIndent(tag,currentindent)

If tag is one of the special ones we don't indent, return 0. Otherwise, return currentindent.

unamp(entity)

Convert a character entity name entity (as is found inside &entity; in HTML text) or number (of the form #n) into the corresponding character. Returns the character, or the entity name unchanged if unrecognised.
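The lookup can be sketched as follows. This is a stand-alone illustration: unamp_sketch is a hypothetical name, the table holds only four well-known entities rather than whatever the module recognises, and numeric references are assumed to be decimal.

```perl
use strict;
use warnings;

# A deliberately tiny entity table; the module's own table is not shown here.
my %ENTITY = (amp => '&', lt => '<', gt => '>', quot => '"');

# Hypothetical sketch of the documented behaviour.
sub unamp_sketch {
    my ($entity) = @_;
    return chr($1) if $entity =~ /^#(\d+)$/;          # numeric form: #n
    return $ENTITY{$entity} if exists $ENTITY{$entity};
    return $entity;                                   # unrecognised: unchanged
}

my $lt      = unamp_sketch('lt');       # "<"
my $letter  = unamp_sketch('#65');      # "A"
my $unknown = unamp_sketch('bogus');    # "bogus"
```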

raw2html(text)

Convert plaintext to HTML, converting special characters like < into character entities. Also recognised is nroff-style bold and underline (c<BS>c and _<BS>c respectively).
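A rough sketch of such a conversion, where <BS> is the backspace character (\x08). The name raw2html_sketch is hypothetical, only &, < and > are escaped, and the choice of <B> and <U> as the output tags is an assumption rather than something this page states.

```perl
use strict;
use warnings;

my %RAW_ENT = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');

# Hypothetical sketch: escape special characters and map overstrikes to tags.
sub raw2html_sketch {
    my ($text) = @_;
    my $esc = sub { exists $RAW_ENT{$_[0]} ? $RAW_ENT{$_[0]} : $_[0] };
    my $out = '';
    while (length $text) {
        if ($text =~ s/^_\x08(.)//s) {        # _<BS>c : underline
            $out .= '<U>' . $esc->($1) . '</U>';
        }
        elsif ($text =~ s/^(.)\x08\1//s) {    # c<BS>c : bold
            $out .= '<B>' . $esc->($1) . '</B>';
        }
        else {                                # ordinary character
            $text =~ s/^(.)//s;
            $out .= $esc->($1);
        }
    }
    # Merge adjacent single-character spans into one run.
    $out =~ s{</([BU])><\1>}{}g;
    return $out;
}

my $escaped = raw2html_sketch('a<b');               # "a&lt;b"
my $bold    = raw2html_sketch("b\x08bo\x08o");      # "<B>bo</B>"
```

The overstrike sequences are consumed before escaping so that a bold or underlined special character is still converted to its entity.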

quoteQueryField

Replace spaces with +. Replace special characters with %xx escapes. Used to massage a string for use with a GET HTML query.
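The two replacements can be sketched like this. The name quote_query_sketch is hypothetical, and the set of characters left unescaped (letters, digits, "_", ".", "~" and "-") is an assumption; the module may treat a different set as special. The sketch also assumes byte-oriented input.

```perl
use strict;
use warnings;

# Hypothetical sketch of the quoting described above.
sub quote_query_sketch {
    my ($s) = @_;
    # Escape everything except the assumed safe characters and the space.
    $s =~ s/([^A-Za-z0-9_.~ -])/sprintf('%%%02X', ord $1)/ge;
    # Spaces become plus signs; a literal "+" was already escaped as %2B.
    $s =~ tr/ /+/;
    return $s;
}

my $quoted = quote_query_sketch('a b&c=d');    # "a+b%26c%3Dd"
```

Escaping before the space substitution keeps a literal + in the input distinguishable from a + that stands for a space.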

href(tagline,url,target)

OBSOLETE. Emit HTML text for an <A HREF= anchor.

news2html(text)

Convert text into HTML text in a heuristic fashion, recognising markup and URLs. Hoped to be handy for mail/news->HTML conversion.

msgid2html(message-id)

Emit HTML text with a <A HREF=news:message-id anchor.

editMarkUp(editsub,tokens...)

Walk the tokens, handing each to the subroutine editsub for manipulation.

grepMarkUp(grep,tokens...)

Search the tokens, returning an array of items matching grep. grep is either a subroutine expecting a token as argument or a string naming a tag to match.
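A minimal sketch of the two calling styles, operating on hash-form tokens. The name grep_sketch is hypothetical; the sketch assumes the search descends into the TOKENS of each token and that tag names compare case-insensitively, neither of which is stated on this page.

```perl
use strict;
use warnings;

# Hypothetical sketch: collect tokens accepted by a subroutine or tag name.
sub grep_sketch {
    my ($grep, @tokens) = @_;
    my $match = ref($grep) eq 'CODE'
        ? $grep
        : sub { ref($_[0]) eq 'HASH' && uc($_[0]{TAG}) eq uc($grep) };
    my @hits;
    for my $t (@tokens) {
        push @hits, $t if $match->($t);
        # Assumption: nested tokens are searched as well.
        push @hits, grep_sketch($grep, @{ $t->{TOKENS} || [] })
            if ref($t) eq 'HASH';
    }
    return @hits;
}

my @doc = (
    { TAG => 'P', ATTRS => {}, TOKENS => [
        'see ', { TAG => 'B', ATTRS => {}, TOKENS => ['this'] } ] },
    { TAG => 'b', ATTRS => {}, TOKENS => ['that'] },
);
my @bold  = grep_sketch('B', @doc);                    # match by tag name
my @texts = grep_sketch(sub { !ref $_[0] }, @doc);     # match by subroutine
```

With this input @bold holds the two bold tokens and @texts holds the three text strings.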

nbstr(string,keepWide)

Return an array of HTML tokens with the white space in string replaced with &nbsp;.

UNIMPLEMENTED: if the optional flag keepWide is supplied, uses as many &nbsp;s as spaces in the original text, otherwise uses just one between words.

URLs(string)

OBSOLETE. Return all the URLs referenced by HREF= attributes from the string.


OBJECT CREATION

new cs::HTML source

Attach to the cs::Source object source, ready to return HTML tokens via the Tok method, below.

new cs::HTML SourceType,SourceArgs...

Call new cs::Source SourceType,SourceArgs... to open a cs::Source object and attach, ready to return HTML tokens via the Tok method, below.


OBJECT METHODS

Tok(close,pertok)

Fetch the next HTML token from the source.

close is an optional hashref whose keys name tags which imply a close of an active ("open") tag (for example, an opening <TR> tag implicitly closes any active <TD> tag).

pertok is an optional subroutine to manipulate a tag before it is returned from Tok. It takes the token as argument.

Tok returns completed tags, with nested structure embedded in the TOKENS field of the returned token. Per-markup element parsing should use the cs::SGML module, on which cs::HTML is built.


SEE ALSO

cs::HTML::Form(3), cs::CGI(3), cs::SGML(3), cs::Sink(3), cs::Source(3), cs::Tokenise(3)


AUTHOR

Cameron Simpson <cs@zip.com.au>