When testing a web application, it is sometimes useful to batch test a number of URLs. Some applications are also not too friendly towards free browsing, so when testing them it is a good idea to keep some state on the client (cookies, for example).
One of the best tools for link checking is simply wget. It can do almost anything you want, but it is too generic to be convenient for this particular job. This is why I came up with the following script. It reads the links from the specified file and "visits" (i.e. retrieves) them, storing any cookies that are passed back. Thus, all visits appear to come from the same browser, with the same cookies set, and the session is preserved.
For each error encountered, an entry is written to the report file. The retrieved content itself is not saved.
    require "mechanize"

    #
    # Retrieves the links read from the given input file and reports
    # any errors in the report file.
    #
    def checkLinks(agent, inputFile, reportFile)
      failures = 0
      links = 0

      while (line = inputFile.gets)
        # drop the trailing newline and skip blank lines
        url = line.strip
        next if url.empty?

        puts timestamp() + " - getting: " + url
        begin
          # retrieves the content from the specified address
          agent.get(url)
        rescue => e
          reportFile.puts(timestamp() + " - failed to read from #{url} (#{e.message})")
          failures = failures + 1
        end
        links = links + 1
      end

      puts "\nTotal links: #{links}"
      puts "   Failures: #{failures}"
    end

    #
    # Returns the current timestamp
    #
    def timestamp()
      t = Time.now
      return t.strftime("%Y/%m/%d %H:%M:%S")
    end

    # Validates the number of arguments
    if ARGV.length == 1
      # Consult the Mechanize documentation for optional parameters that can
      # be passed to the agent (releases before 1.0 used WWW::Mechanize)
      agent = Mechanize.new
      inputFile = File.open(ARGV[0], "r")
      reportFile = File.open(ARGV[0] + "-report.log", "w")

      # Main body
      checkLinks(agent, inputFile, reportFile)

      # Cleanup
      inputFile.close
      reportFile.close
    else
      puts "Expected exactly one argument."
      puts "Syntax: linktester.rb inputfile"
    end
To run this script, you need the Mechanize gem installed (gem install mechanize). It provides a simple browser engine that keeps cookies (and therefore sessions) and can be used to access web sites programmatically.
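
If you want to try the reporting logic without touching the network or installing the gem, a reduced variant of the loop can be run against a stand-in agent. This is only a sketch: StubAgent, check_links and the example.com addresses are hypothetical names made up for illustration, and the stub merely mimics the one thing the script relies on, namely that the agent's get raises an error when a link cannot be retrieved.

```ruby
require "stringio"

# Hypothetical stand-in for a Mechanize agent: it only responds to #get
# and raises for the addresses that should be counted as failures.
class StubAgent
  def initialize(bad_urls)
    @bad_urls = bad_urls
  end

  def get(url)
    raise "simulated failure" if @bad_urls.include?(url)
    :ok
  end
end

# Reduced variant of the script's loop: visits every non-blank line,
# logs failures to the report and returns [links, failures].
def check_links(agent, input, report)
  links = 0
  failures = 0
  input.each_line do |line|
    url = line.strip
    next if url.empty?
    links += 1
    begin
      agent.get(url)
    rescue StandardError => e
      failures += 1
      report.puts("failed to read from #{url} (#{e.message})")
    end
  end
  [links, failures]
end

# In-memory "files", so nothing is read from or written to disk.
input  = StringIO.new("http://example.com/a\n\nhttp://example.com/b\n")
report = StringIO.new
agent  = StubAgent.new(["http://example.com/b"])

links, failures = check_links(agent, input, report)
puts "Total links: #{links}, failures: #{failures}"
puts report.string
```

Because the loop only calls get on whatever agent it is given, the same method should work unchanged with a real Mechanize agent and real File objects; the stub just makes the failure path easy to trigger on demand.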