Web Browsing With Python


Python has a stateful web browsing module called mechanize, a third-party package named after Perl's mature and featureful WWW::Mechanize. Though it isn't as powerful as the Perl version, mechanize provides an easy-to-use framework for browsing web pages, including interacting with forms and accessing SSL content. The documentation for mechanize on the web is sparse, but reading the source file (/usr/lib/python2.5/site-packages/mechanize/_mechanize.py on my Ubuntu machine) provides some needed insight. Here's a quick overview of how mechanize works:

#!/usr/bin/python
import re
from mechanize import Browser
br = Browser()

# Ignore robots.txt
br.set_handle_robots( False )
# Google demands a user-agent that isn't a robot
br.addheaders = [('User-agent', 'Firefox')]

# Retrieve the Google home page
br.open( "http://google.com" )

# Select the search box and search for 'foo'
br.select_form( 'f' )
br.form[ 'q' ] = 'foo'

# Get the search results
br.submit()

# Find the link to foofighters.com; why did we run a search?
# Compile the pattern once and escape the dots so they match literally
sitePattern = re.compile( r'www\.foofighters\.com' )
resp = None
for link in br.links():
    if sitePattern.search( link.url ):
        resp = br.follow_link( link )
        break

# Print the site, if a matching link was found
if resp is not None:
    content = resp.get_data()
    print( content )

That's a pretty straightforward usage. The get_data() method returns the HTML content of the page; I often find it convenient to run .split('\n') on that and then apply a regex line by line.
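Here's a rough sketch of that split-and-regex step. The helper name matching_lines and the sample HTML are made up for illustration; in real use the string would come from resp.get_data(), and if that hands back bytes you would need to decode it first.

```python
import re

def matching_lines(html, pattern):
    """Return the lines of html that match the regex pattern."""
    regex = re.compile(pattern)
    return [line for line in html.split('\n') if regex.search(line)]

# Stand-in for resp.get_data(); this sample page is hypothetical
sample = (
    '<html>\n'
    '<a href="http://www.foofighters.com/">Foo Fighters</a>\n'
    '<p>bar</p>\n'
    '</html>'
)

# Keep only the lines that contain an href attribute
hits = matching_lines(sample, r'href="([^"]+)"')
for line in hits:
    print(line)
```

This only works well when the markup you care about sits on a single line; for anything messier a real HTML parser is probably the better tool.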

5 Comments

Neato!

I took a different approach to web requests, although your goal was slightly different than mine. To fetch a page, I used a urllib opener. This was necessary for me because Google was returning a 403 for my automated search requests without a spoofed user agent, and urllib supports changing header information.

Cool demonstration - I like your method better, it wins for simplicity.

Here is my different, uglier version: http://www.robertpeaslee.com/~robertp/src/Define.txt

And its web-capable Python Server Pages port (woot): http://www.robertpeaslee.com/~robertp/python/search.html

Keep sharing your code!
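For reference, the urllib opener approach described in the comment above might look something like this. It's only a sketch, not the commenter's actual code: it uses urllib.request from current Python (the 2007 equivalent would have been urllib2), and the make_opener helper and the user-agent string are invented for the example.

```python
import urllib.request

def make_opener(user_agent):
    """Build an opener that sends the given User-agent on every request."""
    opener = urllib.request.build_opener()
    # Replace the default Python-urllib User-agent header
    opener.addheaders = [('User-agent', user_agent)]
    return opener

# Hypothetical user-agent string, chosen only for illustration
opener = make_opener('Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0')

# opener.open(url) would now send that header; no request is made here
print(opener.addheaders)
```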

This code doesn't work: 1. You need a user-agent for Google. 2. You send "submit" twice.

Correct version:

#!/usr/bin/python
import re
from mechanize import Browser
br = Browser()

# Ignore robots.txt
br.set_handle_robots( False )
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11')]

# Retrieve the Google home page, saving the response
br.open( "http://google.com" )

# Select the search box and search for 'foo'
br.select_form( 'f' )
br.form[ 'q' ] = 'foo'
br.submit()

# Get the search results
br.get()

# Find the link to foofighters.com; why did we run a search?
resp = None
for link in br.links():
    siteMatch = re.compile( 'www.foofighters.com' ).search( link.url )
    if siteMatch:
        resp = br.follow_link( link )
        break

# Print the site
content = resp.get_data()
print content

You are most certainly correct; code fixed.

Just wondering what you meant by "not as powerful as the perl version"? I've been looking around and can't find what features are missing. Do you know?

Thanks in advance! -bg

For one thing, documentation. That's probably the biggest failing of Python's web browsing module: it may very well do all of the same things as Perl's WWW::Mechanize, but finding out what is possible requires reading the source. I haven't used Python in the months since writing this post, so I don't even recall what I might have missed when using Python mechanize.

About this Entry

This page contains a single entry by Drew Stephens published on December 19, 2007 5:52 PM.
