Perl: Automated Browsing or Testing Web Applications

While working with web applications, there are two cases when the ability to browse pragmatically web pages can be very useful: testing and screen scraping. This Perl script provides the ground basis for both - all you need is to insert the specific code for processing a page.

I was playing with Perl yesterday, when I decided to try this out as an exercise. I was quite familiar with WWW::Mechanize for Ruby, thus the Perl version wasn't all that much different.

The algorithm uses two "stacks" - one for the links to be visited and one for the visited links. Unfortunately this means that the site won't be browsed exactly as a user does - instead, it will be a breadth first traversal. The external links are filtered out and no link will be visited twice. If you need HTTPS support, you might need to install some other modules as well.

#!/usr/bin/perl -w
 
use strict;
use WWW::Mechanize;
use LWP::Protocol::http;
use URI::http;
 
#
# Gets (returns) the links from the specified page.
#	@param agent 	the mechanize agent to be used
#	@param url	the URL for which the links will be retrieved
#
sub get_page_links {
	my ($agent, $url) = @_;
	my @links;
 
	print "- $url\n";
 
	# Catch any errors that might occur.
	eval {
		# Retrieves the page from the specified URL.
		$agent->get($url);
 
		# Takes the links from the current page, filters out
		# the ones that lead to external sites and adds the 
		# valid ones to the result list.
		for my $link ($agent->links) {
			if (accept_link($link->url_abs(), URI->new($url))) {
				push(@links, $link->url_abs()->as_string());
			}
		}
 
		# You can do any other processing of the retrieved page here.
		#
		# processPage($url, $agent->content);
	};
 
	if ($@) {
		print "error: ", $@, "\n";
	}
 
	return @links;
}
 
#
# Main body, the actual browsing starts here.
#
sub browse {
	my ($agent, $start_link) = @_;
	my @to_visit_links = ( $start_link );
	my @visited_links = ();
 
	# Attempt to browse until no more links are in the queue.
	until (@to_visit_links == 0) {
		my $current_link = shift(@to_visit_links);
 
		# Filters the page link to ensure that no link will be 
		# visited twice.
		push(@to_visit_links, 
			get_unvisited_links(get_page_links($agent, $current_link), 
			@visited_links));
 
		# Marks the current link as visited
		push(@visited_links, $current_link);
 
		# Don't overload the servers
		sleep(1);
	}
}
 
#
# Tests whether the given link should be accepted or not. A link is 
# accepted if it belongs to the same domain as the parent one.
#
sub accept_link {
	my ($link, $parent_link) = @_;
 
	# Feel free to add any logic for URL filtering here.
 
	return ($link->authority and ($parent_link->host() eq $link->host()));
}
 
#
# Computes the set difference between the second set (array) and the
# first one. In common words, it returns only the links from the first
# set that are not present in the second one.
#
sub get_unvisited_links {
	my (@new_links, @visited_links) = @_;
	my @result_links;
 
	# This can be improved further by using a hash instead of an
	# array.
	for my $link (@new_links) {
		my $found = 0;
		for my $visited_link (@visited_links) {
			if ($visited_link eq $link) {
				$found = 1;
				last;
			}
		}
 
		if (!$found) {
			push(@result_links, $link);
		}
	}
 
	return @result_links;
}
 
 
#
# Main body.
#
 
@ARGV > 0 or die("Not enough arguments: you need to specify the start URL.");
 
my $mech = WWW::Mechanize->new();
 
browse($mech, $ARGV[0]);

In order to run the script, you will need the WWW::Mechanize module installed, which you can get from your local CPAN mirror.