I needed a simple HTML-only scraper. (It doesn't run JavaScript, so it won't pull down data loaded via AJAX.) I found an example on another site, thetaranights.com, but it wasn't exactly what I needed: it only fetched the page and printed it to the screen. I added a list of URLs to loop through and automatic saving of each page to an HTML file named after its URL.
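The "save by URL name" part boils down to taking the last path segment of each URL and appending `.html`. Here is a rough sketch of just that naming step, assuming Python 3. The helper name `filename_for` is hypothetical, and the handling of trailing slashes, query strings, and path-less URLs (falling back to `index`) is my own assumption; the plain `url.split('/')[-1]` approach does not cover those cases.

```python
from urllib.parse import urlparse


def filename_for(url, extension=".html"):
    """Return a local filename built from the last path segment of `url`."""
    # Look only at the path, which drops any query string or fragment.
    path = urlparse(url).path
    # Strip a trailing slash so ".../page/" still yields "page".
    last_segment = path.rstrip("/").split("/")[-1]
    # Assumption: fall back to "index" when the URL has no path at all.
    return (last_segment or "index") + extension


urls = [
    "https://this.example.com/some/page",
    "https://this.example.com/some/page2",
]
for url in urls:
    print(url, "->", filename_for(url))
```

For simple URLs like the two above, this gives the same names as splitting on `/` directly; the extra handling only matters once a URL ends in a slash or carries a query string.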

    import mechanize  # pip install mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)  # do not fetch or obey robots.txt
    br.addheaders = [("User-agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")]

    sign_in = br.open("https://this.example.com/login")  # the login URL
    br.select_form(nr=0)  # select the form by index; this page has only one form, so nr=0
    # br.select_form(name="form name")  # alternative, if the form has a name attribute
    br["email"] = "email or username"  # "email" is the form field that takes the username/email
    br["password"] = "password"  # "password" is the form field that takes the password
    logged_in = br.submit()  # submit the login credentials
    logincheck = logged_in.read()  # body of the page you are redirected to after login

    urls = ["https://this.example.com/some/page", "https://this.example.com/some/page2"]
    for url in urls:
        req = br.open(url).read()
        filename = url.split('/')[-1] + ".html"  # name the file after the last URL segment
        # 'wb' because read() returns bytes on Python 3
        with open(filename, 'wb') as f:
            f.write(req)

This produces two files: