Python has a stateful web browsing module called mechanize, named after Perl's mature and featureful WWW::Mechanize. Though it isn't as powerful as the Perl version, mechanize provides an easy-to-use framework for browsing web pages, including interacting with forms and accessing SSL content. The documentation for mechanize on the web is sparse, but reading the source file (/usr/lib/python2.5/site-packages/mechanize/_mechanize.py on my Ubuntu machine) provides some needed insight. Here's a quick overview of how mechanize operates:
#!/usr/bin/python
import re
from mechanize import Browser
br = Browser()
# Ignore robots.txt
br.set_handle_robots( False )
# Google demands a user-agent that isn't a robot
br.addheaders = [('User-agent', 'Firefox')]
# Retrieve the Google home page, saving the response
br.open( "http://google.com" )
# Select the search box and search for 'foo'
br.select_form( 'f' )
br.form[ 'q' ] = 'foo'
# Get the search results
br.submit()
# Find the link to foofighters.com; why did we run a search?
resp = None
sitePattern = re.compile( r'www\.foofighters\.com' )
for link in br.links():
    if sitePattern.search( link.url ):
        resp = br.follow_link( link )
        break
# Print the site, if a matching link was found
if resp is not None:
    content = resp.get_data()
    print content
That's a pretty straightforward usage. The get_data() method gives you the HTML content of the page, which I often split with .split('\n') and then run a regex over line by line.
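As a rough illustration of that split-and-regex step, here is a minimal sketch. It does not touch the network: SAMPLE_HTML is a made-up stand-in for whatever get_data() returns, and find_hrefs and HREF_PATTERN are hypothetical names chosen for this example. It also assumes the content is already a plain string.

```python
import re

# Hypothetical stand-in for the string returned by resp.get_data()
SAMPLE_HTML = """<html>
<head><title>Example page</title></head>
<body>
<a href="http://www.foofighters.com/">Foo Fighters</a>
<p>No links here</p>
<a href="http://example.com/about">About</a>
</body>
</html>"""

# Matches the value of a double-quoted href attribute within one line
HREF_PATTERN = re.compile(r'href="([^"]+)"')


def find_hrefs(content):
    """Return every href value found, scanning the content line by line."""
    hrefs = []
    for line in content.split('\n'):
        hrefs.extend(HREF_PATTERN.findall(line))
    return hrefs


if __name__ == '__main__':
    for href in find_hrefs(SAMPLE_HTML):
        print(href)
```

Line-by-line matching like this is fine for quick scraping, but it will miss tags that span several lines; a real HTML parser is the safer choice for anything less regular.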
Neato!
I took a different approach to web requests, although your goal was slightly different than mine. To fetch a page, I used a urllib opener. This was necessary for me because Google was returning a 403 for my automated search requests without a spoofed user agent, and urllib supports changing header information.
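For readers who haven't seen the opener approach, a minimal sketch of the idea follows. It is not the code linked below: it assumes Python 3's urllib.request (the original would have used the Python 2 urllib modules), and USER_AGENT, build_spoofing_opener and fetch are hypothetical names made up for illustration.

```python
import urllib.request

# Hypothetical user-agent string; any browser-like value could be substituted
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0'


def build_spoofing_opener(user_agent=USER_AGENT):
    """Return an opener that sends the given User-agent with every request."""
    opener = urllib.request.build_opener()
    # Replace the default Python-urllib User-agent header with our own
    opener.addheaders = [('User-agent', user_agent)]
    return opener


def fetch(url, opener=None):
    """Fetch url and return the raw response body as bytes."""
    opener = opener or build_spoofing_opener()
    response = opener.open(url)
    try:
        return response.read()
    finally:
        response.close()
```

Calling fetch() with a URL of your choosing would then send the request with the substituted header; whether a given site accepts it is up to that site.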
Cool demonstration - I like your method better, it wins for simplicity.
Here is my different, uglier version: http://www.robertpeaslee.com/~robertp/src/Define.txt
And its web-capable Python Server Pages port (woot): http://www.robertpeaslee.com/~robertp/python/search.html
Keep sharing your code!
This code doesn't work. 1. You need a user-agent for Google. 2. You send "submit" twice.
Correct version:
#!/usr/bin/python
import re
from mechanize import Browser
br = Browser()
# Ignore robots.txt
br.set_handle_robots( False )
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11')]
# Retrieve the Google home page, saving the response
br.open( "http://google.com" )
# Select the search box and search for 'foo'
br.select_form( 'f' )
br.form[ 'q' ] = 'foo'
br.submit()
# Get the search results
br.get()
# Find the link to foofighters.com; why did we run a search?
resp = None
for link in br.links():
    siteMatch = re.compile( 'www.foofighters.com' ).search( link.url )
    if siteMatch:
        resp = br.follow_link( link )
        break
# Print the site
content = resp.get_data()
print content
You are most certainly correct; code fixed.
Just wondering what you meant by "not as powerful as the Perl version". I've been looking around and can't find what features are missing. Do you know?
Thanks in advance! -bg
For one thing, documentation. That's probably the biggest failing of Python's web browsing module: it may very well do all of the same things as Perl's WWW::Mechanize, but finding out what is possible requires reading the source. I haven't used Python in the months since writing this post, so I don't even recall what I might have missed when using Python mechanize.