Simple Python Web Scraper


I needed a simple html only scraper. (This doesn't use js, won't pull down data via AJAX). I found an example on another site, thetaranights.com, but it wasn't exactly what I needed. It only pulled the data and printed it to screen. I added a list to loop through and auto saving by url name to a html file.

import mechanize \#pip install mechanize

br = mechanize.Browser()  
br.set\_handle\_robots(False)  
br.addheaders = \[("User-agent","Mozilla/5.0 (X11; U;
Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick)
Firefox/3.6.13")\]

sign\_in = br.open("https://this.example.com/login") \#the
login url

br.select\_form(nr = 0) \#accessing form by their index. Since we have
only one form in this example, nr =0.  
\#br.select\_form(name = "form name") Alternatively you may
use this instead of the above line if your form has name attribute
available.

br\["email"\] = "email or username" \#the key
"username" is the variable that takes the username/email value

br\["password"\] = "password" \#the key
"password" is the variable that takes the password value

logged\_in = br.submit() \#submitting the login credentials

logincheck = logged\_in.read() \#reading the page body that is
redirected after successful login

urls =
\["https://this.example.com/some/page","https://this.example.com/some/page2"\]

for url in urls:  
req = br.open(url).read()  
filename = url.split('/')\[-1\] + ".html"  
f = open(filename, 'w')  
f.write(req)  
f.close()

Which produces 2 files:
page.html
page2.html