Parsing HTML Pages using HTML::Parser
Written by Philip L Yuson   
Who is this for
This article is for those who want to write Perl scripts to remove tags from an HTML file.


What you need to know
You need to know:
Basic Perl scripting
HTML tags


There are times when you will need to read an HTML file and extract a field from that file. Perl has a module called HTML::Parser that simplifies this task.


This module reads an HTML file and lets you define actions to run when it encounters a start tag, the text between tags, and an end tag. You do this by defining subroutines (handlers) that are executed when these events occur. The HTML::Parser documentation lists all the events that can occur during processing. This article covers only the start, text and end events.

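You register the subroutine that handles an event in this format:

event_h => [\&handler, 'tokens']

event is the name of the event (start, text or end in this article), followed by the suffix _h
handler is the name of the subroutine
tokens is a string naming the values to be passed to the subroutine. To pass the tag name to the subroutine, you specify the literal 'tag'. For an end tag, 'tag' includes the leading slash (for example /title).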

This will be clearer in the sample code.
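As a quick illustration ahead of the full sample, here is a minimal sketch of a handler that receives more than one value. It assumes the 'tagname' and 'attr' tokens from the HTML::Parser documentation (the lowercased tag name and a hash reference of the tag's attributes); the HTML string, the example.com URL and the @links array are made up for this illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @links;    # href values collected by the handler (illustrative name)

# The handler receives the values named in the token string, in order:
# first the tag name, then a hash reference of the tag's attributes.
sub start_rtn {
    my ($tagname, $attr) = @_;
    push @links, $attr->{href} if $tagname eq 'a' && defined $attr->{href};
}

# Register the handler for the start event and ask for two values
my $p = HTML::Parser->new(start_h => [\&start_rtn, 'tagname, attr']);

$p->parse('<p>See <a href="http://example.com/">this page</a>.</p>');
$p->eof;    # signal end of input

print "Link: $_\n" foreach @links;
```

The token string is a comma-separated list, so the handler's argument list grows in the same order as the names you give.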

Sample code

The first thing to do is to create an instance of the parser. When you create the instance, you can specify which subroutine handles each event.

# Define module to use
use strict;
use warnings;
use HTML::Parser ();

# Create instance and assign a handler to each event
my $p = HTML::Parser->new(start_h => [\&start_rtn, 'tag'],
                text_h => [\&text_rtn, 'text'],
                end_h => [\&end_rtn, 'tag']);

# Start parsing the following HTML string
$p->parse('<html>
<head>
<title>Sample HTML Page</title>
</head>
<body>
Hello World
This is a test
</body>
</html>');
$p->eof;    # Tell the parser there is no more input

sub start_rtn {
# Execute when start tag is encountered
    foreach (@_) {
       print "===\nStart: $_\n";
    }
}

sub text_rtn {
# Execute when text is encountered
    foreach (@_) {
       print "\tText: $_\n";
    }
}

sub end_rtn {
# Execute when the end tag is encountered
    foreach (@_) {
       print "End: $_\n";
    }
}

Save this and run it. The result will be something like this:


===
Start: html
    Text: 

===
Start: head
    Text: 

===
Start: title
    Text: Sample HTML Page
End: /title
    Text: 

End: /head
    Text: 

===
Start: body
    Text: 
Hello World
This is a test

End: /body
    Text: 

End: /html

Notice that text_rtn is executed for every run of text, including the whitespace between tags. Likewise, every time a start tag is encountered, start_rtn is executed.

What use is this then?
You can write routines that act only when a specific tag is encountered. You can also write routines that act only on text that appears inside a specific tag.
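One way to act only on text inside a specific tag is to keep a flag that the start and end handlers switch on and off. The sketch below is an illustration under that assumption, not the only approach: it pulls the page title out of a document like the one in the sample. The $in_title and $title variables are illustrative names, and 'tagname' is the HTML::Parser token that gives the tag name without a leading slash.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $in_title = 0;     # true while the parser is between <title> and </title>
my $title    = '';    # text collected while the flag is set

my $p = HTML::Parser->new(
    # Switch the flag on when <title> opens
    start_h => [sub { $in_title = 1 if $_[0] eq 'title' }, 'tagname'],
    # Keep text only while inside the title
    text_h  => [sub { $title .= $_[0] if $in_title }, 'text'],
    # Switch the flag off when </title> closes
    end_h   => [sub { $in_title = 0 if $_[0] eq 'title' }, 'tagname'],
);

$p->parse('<html>
<head>
<title>Sample HTML Page</title>
</head>
<body>
Hello World
This is a test
</body>
</html>');
$p->eof;    # signal end of input

print "Title: $title\n";
```

The text handler still runs for every piece of text; the flag simply makes it ignore everything outside the tag of interest.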

In our example, we passed an HTML string to the parser. You can also parse a file by using the parse_file($file) method of the module.
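A minimal sketch of parse_file follows. To stay self-contained it first writes a small HTML document to a temporary file with the core File::Temp module; the file contents and the @tags array are made up for this illustration. In a real script you would pass the name of an existing file instead.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::Parser ();

# Create a temporary HTML file (removed automatically when the script exits)
my ($fh, $file) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} "<html><body><h1>Hello World</h1><p>This is a test</p></body></html>\n";
close $fh or die "Cannot close $file: $!";

my @tags;    # start tags seen, in document order (illustrative name)
my $p = HTML::Parser->new(start_h => [sub { push @tags, shift }, 'tagname']);

# parse_file returns an undefined value if the file cannot be opened
$p->parse_file($file) or die "Cannot parse $file: $!";

print "Start tags: @tags\n";
```

Unlike parse, parse_file signals the end of input itself once the file has been read, so no separate eof call is needed here.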

For more information
To learn more about HTML::Parser, check the Perl documentation for the module (perldoc HTML::Parser).

Copyright: © 2017 Philip Yuson