Web Browsing With Python

Dec 19, 2007

Python has a stateful web browsing module called mechanize, a third-party package named after Perl’s mature and featureful WWW::Mechanize. Though it isn’t as powerful as the Perl version, mechanize provides an easy-to-use framework for browsing web pages, including interacting with forms and accessing SSL content. The documentation for mechanize on the web is sparse, but reading the source file (/usr/lib/python2.5/site-packages/mechanize/_mechanize.py on my Ubuntu machine) provides some needed insight. Here’s a quick overview of how mechanize operates:

import re
from mechanize import Browser
br = Browser()

# Ignore robots.txt
br.set_handle_robots( False )
# Google demands a user-agent that isn't a robot
br.addheaders = [('User-agent', 'Firefox')]

# Retrieve the Google home page; the browser keeps it as the current page
br.open( "http://google.com" )

# Select the search box and search for 'foo'
br.select_form( 'f' )
br.form[ 'q' ] = 'foo'

# Get the search results by submitting the selected form
br.submit()

# Find the first link to foofighters.com in the search results
resp = None
sitePattern = re.compile( r'www\.foofighters\.com' )
for link in br.links():
    if sitePattern.search( link.url ):
        # Following a link changes the current page, so stop looking
        resp = br.follow_link( link )
        break

# Print the site, if a matching link was found
if resp is not None:
    content = resp.get_data()
    print( content )

That’s pretty straightforward usage. The get_data() method returns the HTML content of the page, which I often split into lines with .split('\n') and then run a regex over line by line.
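
Here’s a rough sketch of that split-and-regex approach. It doesn’t touch the network: the content string below is made-up sample HTML standing in for whatever resp.get_data() returned, and the find_hrefs helper and its href pattern are my own illustration, not part of mechanize.

```python
import re

# Stand-in for resp.get_data(); this sample HTML is invented for illustration.
content = """<html>
<head><title>Example page</title></head>
<body>
<a href="http://www.example.com/tour">Tour dates</a>
<p>No links on this line.</p>
<a href="http://www.example.com/news">News</a>
</body>
</html>"""

# Compile the pattern once, then test it against each line in turn.
href_pattern = re.compile(r'href="([^"]+)"')

def find_hrefs(html):
    """Return the first href value found on each line of html (hypothetical helper)."""
    found = []
    for line in html.split('\n'):
        match = href_pattern.search(line)
        if match:
            found.append(match.group(1))
    return found

hrefs = find_hrefs(content)
for url in hrefs:
    print(url)
```

This only picks up the first match on each line, which is fine for quick scraping of tidy pages. If a page crams several links onto one line, href_pattern.findall( line ) would be the thing to reach for instead.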