Category: Perl Programming

Stripping HTML tags with regular expressions

So I need a regular expression to strip out all HTML tags EXCEPT the ones I've allowed.

I think I almost have it, but I can't get the negation right.

"/</?(^IMG|A|FONT|B|I|U|STRONG|EM|CODE|PRE|H1|H2|H3|H4|H5|H6)(.)>?/i"

Now the ^ isn't negating because it's not in a character class. So how would I negate all those tags (meaning match anything EXCEPT those)?

Also, what's a better alternative to the . match so that they can't just throw a newline in there and **** things up?

As you note, the negation isn't working because '^' only negates inside a character class (the [] construct). Anywhere else it anchors the match to the start of the string.

Possibly you could negate the whole match instead? (Change the =~ to !~ ?)
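
To illustrate those two points, here is a minimal sketch. The variable names and the sample tag names are made up for illustration, and the allowed list is just the one from the question. It shows '^' negating only inside [], and !~ negating a whole match against a list of allowed tag names:

```perl
use strict;
use warnings;

# Allowed tag names, anchored so that "body" does not match "b".
my $allowed = qr/^(?:img|a|font|b|i|u|strong|em|code|pre|h[1-6])$/i;

# Inside [...] a leading ^ negates the class: "any character except >".
my $class_negated = ('abc' =~ /^[^>]+$/) ? 1 : 0;

# Outside a class, ^ is an anchor: this looks for "IMG" at the start of the string.
my $anchor_only = ('xIMG' =~ /^IMG/) ? 1 : 0;

# !~ is true when the pattern does NOT match.
my @tags     = qw(b script H2 body);
my @rejected = grep { $_ !~ $allowed } @tags;

print "rejected: @rejected\n";    # prints "rejected: script body"
```

Note that this only tests tag names that have already been pulled out of the markup; on its own it does not strip anything from a document.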

Also, what's a better alternative to the .* match so that they can't just throw a newline in there and f*** things up?

I've read that [^>]* will match everything (including \n) up to the close of the tag; will this help?
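
A quick way to check this is to compare the two patterns on a tag that contains a newline. This is only a sketch with a made-up input string:

```perl
use strict;
use warnings;

# Hypothetical input: a tag with a newline inside it.
my $html = "<font\ncolor=\"red\">hi</font>";

# Without the /s modifier, "." does not match "\n", so the opening tag survives.
(my $with_dot = $html) =~ s/<.*?>//g;

# [^>] matches any character except ">", newlines included.
(my $with_class = $html) =~ s/<[^>]*>//g;

print "with class: $with_class\n";    # prints "with class: hi"
```

The negated class still stops at the first '>', so a '>' inside a quoted attribute value would end the match too early.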

Hi,
I'm having the same issue and would like to know the answer to this post.
I need a regular expression to strip out all HTML tags EXCEPT the ones I've allowed.
I have made many attempts, but none has succeeded.
Your answer will be highly appreciated.

Regards,
Miss Moon ;)

HTML::TokeParser::Simple does this pretty well.

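use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(string => $html);
my $clean_html = '';
# Tag names to keep; anchored so that e.g. "body" does not match "b".
my $allowed = qr/^(?:img|[pbuia]|font|strong|em|code|pre|h[1-6])$/i;
while ( my $token = $parser->get_token ) {
    # Keep plain text and allowed tags; skip everything else.
    next unless $token->is_text || $token->is_tag($allowed);
    $clean_html .= $token->as_is;
}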

Other alternatives that are more suited to this specific task: HTML::TagFilter, or HTML::Restrict, which looks good too.

Either way, I wouldn't recommend trying to use regular expressions for this.

Hi,
Yes, regular expressions might not be the best solution for this, but in my case I need it to be done with regular expressions. Here are some of my attempts:

regsub -all {<(?!i|b|h3|h4|/i|/b|/h3|/h4)[^>]+>} $html {} html5

or

regsub -all {<(.|\n)[^p]+?>} $html {} html1

But each has its disadvantages. Any ideas for a better solution?

Regards,
Miss Moon

Here you go:

{<(?!i|b|h[1-6]|/i|/b|/h[1-6][\s|>|/])[^>]*>}

Regards,
Miss Moon

Best not to roll your own regular expression for this. It is better to use a time-tested Perl module such as HTML::Parser or HTML::TokeParser. A hand-written regular expression is almost certain to fail on special cases. For instance, Miss Moon, your regexp only tests how the tag name begins, so it leaves tags such as '<br>', '<body>' and '<img src="x">' in place.
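
If a regular expression really is unavoidable, the lookahead approach from earlier in the thread at least needs a word boundary after the tag name. Below is a minimal sketch in Perl, assuming the allowed tags are i, b and h1 to h6 as in that pattern; the sample string is made up, and this is not a substitute for a real parser:

```perl
use strict;
use warnings;

# Hypothetical sample input.
my $html = '<b>bold</b><br><img src="x"><h1 class="foo">Title</h1><script>x</script>';

# Strip every tag whose name is not i, b or h1..h6.
# (?! ... ) is a negative lookahead; \b stops "b" from also matching "br" or "body".
(my $clean = $html) =~ s{<(?!/?(?:i|b|h[1-6])\b)[^>]*>}{}gi;

print "$clean\n";    # prints: <b>bold</b><h1 class="foo">Title</h1>x
```

This still breaks on a '>' inside an attribute value, on comments, and on names like '<b-x>' (the hyphen counts as a word boundary). It also removes only the tags, not the text between them, which is why the modules above remain the safer choice.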

See http://perldoc.perl.org/perlfaq6.html#How-do-I-match-XML%2C-HTML%2C-or-other-nasty%2C-ugly-things-with-a-regex%3F for an official Perl answer.

Also remember Jamie Zawinski's (jwz) famous saying:
'Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.'