Parsing HTML Pages using HTML::Parser
Written by Philip L Yuson   
Who is this for
This article is for those who want to write Perl scripts to remove tags from an HTML file.

 

What you need to know
You need to know:
Basic Perl scripting
HTML tags

Introduction

There are times when you will need to read an HTML file and extract a field from that file. Perl has a module called HTML::Parser that simplifies this task.

HTML::Parser

This module reads an HTML document and lets you define actions to take when it encounters a start tag, text, or an end tag. You do this by defining subroutines (handlers) that the parser calls when these events occur. The HTML::Parser documentation lists all the events that can occur during processing. This article covers only the start, text and end events.

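You register the subroutine that handles an event in this format:

event_h => [\&handler, 'argspec']


event is the name of the event (start, text or end), followed by _h
handler is the name of the subroutine to call
argspec is a string naming the values to be passed to the subroutine. To pass the tag name, you specify 'tag'; to pass the text, you specify 'text'. For an end tag, 'tag' includes the leading slash (for example, /title).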

This will be clearer in the sample code.
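As a small sketch of how the argument string controls what a handler receives, the following asks for two values at once. It assumes the 'tagname' and 'attr' argument names from the HTML::Parser documentation; the on_start subroutine, the @seen array and the example URL are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @seen;   # one record per start tag (hypothetical helper structure)

# 'tagname, attr' asks the parser to pass two arguments to the handler:
# the lower-cased tag name and a hash reference of the tag's attributes.
my $p = HTML::Parser->new(
    start_h => [ \&on_start, 'tagname, attr' ],
);

sub on_start {
    my ($tagname, $attr) = @_;
    push @seen, { tag => $tagname, href => $attr->{href} };
}

$p->parse('<p>See <a href="http://example.com/">this page</a>.</p>');
$p->eof;    # signal end of input

for my $rec (@seen) {
    my $href = defined $rec->{href} ? $rec->{href} : '(none)';
    print "$rec->{tag}: href=$href\n";
}
```

The handler receives its arguments in @_ in the same order as they are named in the argument string.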

Sample code

The first thing to do is create an instance of the parser. When you create the instance, you specify which subroutine handles each event.

# Load the module without importing anything
use strict;
use warnings;
use HTML::Parser ();
# Create the instance and register a handler for each event
my $p = HTML::Parser->new(start_h => [\&start_rtn, 'tag'],
                text_h => [\&text_rtn, 'text'],
                end_h => [\&end_rtn, 'tag']);
# Start parsing the following HTML string
$p->parse('
<HTML>
<HEAD>
<TITLE>Sample HTML Page</TITLE>
</HEAD>
<BODY>
Hello World
This is a test
</BODY>
</HTML>');
# Signal end of input so that any buffered text is flushed
$p->eof;

sub start_rtn {
# Executed when a start tag is encountered
    foreach (@_) {
       print "===\nStart: $_\n";
    }
}
sub text_rtn {
# Executed when text is encountered
    foreach (@_) {
       print "\tText: $_\n";
    }
}
sub end_rtn {
# Executed when an end tag is encountered
    foreach (@_) {
       print "End: $_\n";
    }
}


Result
Save this and run it. The result will be something like this:


    Text:

===
Start: html
    Text:

===
Start: head
    Text:

===
Start: title
    Text: Sample HTML Page
End: /title
    Text:

End: /head
    Text:

===
Start: body
    Text:
Hello World
This is a test

End: /body
    Text:

End: /html


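Notice that text_rtn is executed for every run of text between tags, even when that text is only a line break. Likewise, start_rtn is executed every time a start tag is encountered, and end_rtn every time an end tag is encountered.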
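Because the text handler also fires for whitespace-only runs, a script that removes tags will usually want to skip them. The following is a minimal sketch of that idea, not the only way to do it. It assumes the 'dtext' argument name (text with entities decoded) and the unbroken_text option from the HTML::Parser documentation; the @chunks array and the sample markup are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @chunks;   # non-blank pieces of text, in document order

my $p = HTML::Parser->new(
    unbroken_text => 1,   # deliver each run of text in a single event
    text_h => [ sub {
        my ($text) = @_;
        return unless $text =~ /\S/;   # ignore whitespace-only runs
        $text =~ s/^\s+|\s+$//g;       # trim surrounding whitespace
        push @chunks, $text;
    }, 'dtext' ],                      # 'dtext' decodes entities such as &amp;
);

$p->parse("<HTML>\n<BODY>\n<P>Hello <B>World</B></P>\n"
        . "<P>Fish &amp; chips</P>\n</BODY>\n</HTML>");
$p->eof;

print join(' ', @chunks), "\n";
```

Only a text handler is registered here, so start and end tags are simply dropped and the remaining text is what gets printed.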

What use is this then?
You can write routines that act only when a specific tag is encountered. You can also write routines that process text only when it appears inside a specific tag.
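One way to act only on text inside a specific tag is to keep a flag that the start and end handlers switch on and off. The sketch below uses that approach to pull out the page title. The $in_title and $title variables are hypothetical names, and 'tagname' and 'dtext' are argument names taken from the HTML::Parser documentation.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $in_title = 0;    # true while the parser is between <title> and </title>
my $title    = '';

my $p = HTML::Parser->new(
    # 'tagname' passes the lower-cased name without a leading slash,
    # so the same comparison works for start and end tags.
    start_h => [ sub { $in_title = 1 if $_[0] eq 'title' }, 'tagname' ],
    end_h   => [ sub { $in_title = 0 if $_[0] eq 'title' }, 'tagname' ],
    text_h  => [ sub { $title .= $_[0] if $in_title },      'dtext'   ],
);

$p->parse('<HTML><HEAD><TITLE>Sample HTML Page</TITLE></HEAD>'
        . '<BODY>Hello World</BODY></HTML>');
$p->eof;

print "Title: $title\n";
```

The text handler appends to $title rather than assigning to it, so the result is still complete if the parser delivers the title text in more than one piece.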

In our example, we passed an HTML string to the parser. You can also have it read a file by calling the parser's parse_file($file) method, where $file is a file name or an open file handle.
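A short sketch of parse_file follows. To keep it self-contained, it first writes a small page to a temporary file with the core File::Temp module; in a real script $file would be the path to an existing HTML file. The @tags array and the sample markup are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::Parser ();

# Write a small page to a temporary file so the example can run anywhere.
my ($fh, $file) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} "<HTML><BODY><H1>Hello</H1><P>World</P></BODY></HTML>\n";
close $fh or die "Cannot close $file: $!";

my @tags;   # names of the start tags, in document order
my $p = HTML::Parser->new(
    start_h => [ sub { push @tags, $_[0] }, 'tagname' ],
);

# parse_file reads the file and signals end of input itself.
# It returns undef if the named file cannot be opened.
$p->parse_file($file) or die "Cannot parse $file: $!";

print "Start tags: @tags\n";
```

Unlike parse, parse_file needs no separate call to eof afterwards.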

For more information
To learn more about HTML::Parser, see the module's documentation, which you can read by running perldoc HTML::Parser.

 
Copyright: © 2017 Philip Yuson