HTML::Strip - Perl extension for stripping HTML markup from text.
use HTML::Strip;
my $hs = HTML::Strip->new();
my $clean_text = $hs->parse( $raw_html );
$hs->eof;
This module simply strips HTML-like markup from text in a very quick and brutal manner. It could quite easily be used to strip XML or SGML from text as well; but removing HTML markup is a much more common problem, hence this module lives in the HTML:: namespace.
It is written in XS, and thus about five times quicker than using regular expressions for the same task.
It does not do any syntax checking (if you want that, use HTML::Parser); instead, it merely applies the following rules:
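1. Anything that looks like a tag, or group of tags, is replaced with a single space character. A tag is anything that starts with a < and ends with a >, with the caveat that a > character may appear in either of the following without ending the tag:
    Quote: Quotes start with either a ' or a " character, and end with a matching character not preceded by an odd number of escaping backslashes (i.e. \" does not end the quote but \\\\" does).
    Comment: If the tag starts with an exclamation mark, it is assumed to be a declaration or a comment. Within such tags, > characters do not end the tag if they appear within pairs of double dashes (e.g. <!-- <a href="old.htm">old page</a> --> would be stripped completely). Quotes are not parsed inside a comment either. (That means <!-- comment with ' quote " --> is entirely stripped.)
2. Anything that appears within what we term strip tags is stripped as well. By default, these tags are title, script, style and applet.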
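The rules above can be tried out in a short run. This is a minimal sketch, assuming HTML::Strip is installed; the sample markup and variable names are made up for illustration, and the exact spacing of the results may differ between releases, so no expected output is shown.

```perl
use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

# Rule 1, quotes: the > inside the quoted attribute value should not end the tag.
my $quoted = $hs->parse( '<a href="a>b.html">link text</a>' );
$hs->eof;

# Rule 1, comments: markup and quote characters inside <!-- --> go with the comment.
my $comment = $hs->parse( q{before <!-- <a href="old.htm">old page</a> ' " --> after} );
$hs->eof;

# Rule 2: the contents of a strip tag (script is one by default) are removed too.
my $script = $hs->parse( '<p>visible</p><script>var hidden = 1;</script>' );
$hs->eof;

# Brackets make any leading or trailing spaces visible.
print "[$_]\n" for $quoted, $comment, $script;
```

Calling eof after each parse keeps the three samples independent of one another.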
HTML::Strip maintains state between calls, so you can parse a document in chunks should you wish. If one chunk ends half-way through a tag, quote, comment, or similar, it will remember this and expect the next call to parse to start with the remainder of said tag.
If this is not going to be the case, be sure to call $hs->eof() between calls to $hs->parse(). Alternatively, you may set auto_reset to true in the constructor, or at any time afterwards with set_auto_reset, so that the parser always operates on a one-shot basis (resetting after each parsed chunk).
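The two modes of operation might look like this in practice. This is a sketch only, assuming an installed HTML::Strip recent enough to accept auto_reset in the constructor; the chunk boundaries and sample strings are invented for illustration.

```perl
use strict;
use warnings;
use HTML::Strip;

# Stateful use: the first chunk ends inside a tag, the second finishes it.
my $hs   = HTML::Strip->new();
my $text = '';
$text .= $hs->parse( 'Hello <a hre' );
$text .= $hs->parse( 'f="page.html">world</a>' );
$hs->eof;    # document finished; reset before parsing anything unrelated

# One-shot use: with auto_reset on, each call to parse starts from a clean state.
my $one_shot = HTML::Strip->new( auto_reset => 1 );
my @clean    = map { $one_shot->parse($_) } ( '<b>first</b>', '<i>second</i>' );

print "[$text]\n";
print "[$_]\n" for @clean;
```

The stateful form suits input read in blocks from a file or socket; the one-shot form suits a list of unrelated snippets.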
new()
Constructor. Can optionally take a hash of settings (with keys corresponding to the set_ methods below).
For example, the following is a valid constructor:
my $hs = HTML::Strip->new(
    striptags   => [ 'script', 'iframe' ],
    emit_spaces => 0,
);
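parse()
Takes a string as an argument and returns it stripped of HTML.
eof()
Resets the current state information, ready to parse a new block of HTML.
clear_striptags()
Clears the current set of strip tags.
add_striptag()
Adds the string passed as an argument to the current set of strip tags.
set_striptags()
Takes a reference to an array of strings, which replaces the current set of strip tags.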
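set_emit_spaces()
Takes a boolean value. If set to false, HTML::Strip will not attempt any conversion of tags into spaces. Set to true by default.
set_decode_entities()
Takes a boolean value. If set to false, HTML::Strip will not decode HTML entities. Set to true by default.
filter_entities()
If HTML::Entities is available, this method behaves just like invoking HTML::Entities::decode_entities, except that it respects the current setting of decode_entities.
set_filter()
Sets a filter to be applied after tags have been stripped. It accepts the name of a method (such as filter_entities) or a code ref. By default, its value is filter_entities if HTML::Entities is available, or undef otherwise.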
set_auto_reset()
Takes a boolean value. If set to true, parse resets after each call (equivalent to calling eof). Otherwise, the parser remembers its state from one call to parse to another, until you call eof explicitly. Set to false by default.
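set_debug()
Outputs extensive debugging information on internal state during the parse. Not intended to be used by anyone except the module maintainer.
decode_entities()
filter()
auto_reset()
debug()
Read-only accessors for their respective settings.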
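As an illustration of the strip tag methods and accessors listed above, the following sketch swaps the default strip tags for a custom set. It assumes HTML::Strip is installed; the tag choices and sample markup are arbitrary.

```perl
use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

# Replace the default strip tags so that only iframe contents are dropped.
$hs->set_striptags( [ 'iframe' ] );

# Add a second strip tag on top of that.
$hs->add_striptag( 'noscript' );

# title is no longer a strip tag here, so its text should survive.
my $text = $hs->parse(
    '<title>Kept title</title><iframe>dropped</iframe><noscript>also dropped</noscript>'
);
$hs->eof;

# The accessors report the current settings without changing them.
printf "auto_reset is %s\n", $hs->auto_reset ? 'on' : 'off';
print "[$text]\n";
```

Calling clear_striptags() instead would leave no strip tags at all, so only the tags themselves would be removed and all element contents kept.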
HTML::Strip can output more whitespace than desired, as with the following HTML:
<h1> HTML::Strip </h1> <p> <em> <strong> fast, and brutal </strong> </em> </p>
Which gives the following output:
 HTML::Strip    fast, and brutal   
Thus, you may want to post-filter the output of HTML::Strip to remove excess whitespace (for example, using tr/ / /s;). (This has been improved since previous releases, but is still an issue.)
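One way to do that post-filtering is sketched below in plain Perl. The helper name squeeze_spaces is hypothetical (it is not part of HTML::Strip), and the input string is a hand-written stand-in for stripped output rather than a captured result.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse runs of spaces, then trim both ends.
sub squeeze_spaces {
    my ($text) = @_;
    $text =~ tr/ / /s;        # squeeze repeated spaces into one
    $text =~ s/^ +| +$//g;    # drop leading and trailing spaces
    return $text;
}

# Stand-in for stripped text that carries excess whitespace.
my $stripped = ' HTML::Strip    fast, and brutal   ';
my $tidy     = squeeze_spaces($stripped);

print "[$tidy]\n";    # prints "[HTML::Strip fast, and brutal]"
```

The tr/ / /s step only touches literal spaces; if tabs or newlines also need collapsing, a substitution such as s/\s+/ /g would be the broader alternative.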
Nothing is exported by default.
Alex Bowley <kilinrax@cpan.org>
See also: perl, HTML::Parser, HTML::Entities.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.