Retrieving pages from what.cd

Developer (https://www.devze.com), 2023-04-11 20:34. Source: the web.
I'm working on a screen scraper using BeautifulSoup for what.cd using Python. I came across this script while working and decided to look at it, since it seems to be similar to what I'm working on. However, every time I run the script I get a message that my credentials are wrong, even though they are not.

As far as I can tell, the message appears because, when the script logs into what.cd, the site is supposed to return a cookie that lets me request pages later in the script, and that apparently isn't happening. The script fails here:

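# Keep cookies between requests so the login session is remembered.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username,
                               'password' : password})
# Passing data makes this request a POST to the login form.
check = opener.open('http://what.cd/login.php', login_data)
soup = BeautifulSoup(check.read())
# Look for a <span class="warning">, which the script treats as a failed login.
warning = soup.find('span', 'warning')
if warning:
    exit(str(warning)+'\n\nprobably means username or pw is wrong')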

I've tried several ways of authenticating with the site, including FileCookieJar, the script located here, and the Requests module. Each one returns the same HTML page. It says, in short, that "JavaScript is disabled" and "Cookies are disabled", and it also includes an HTML login form.
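
One way to narrow this down is to look at what actually ended up in the cookie jar after a request. The helper below is a minimal sketch, written for Python 3, where cookielib is named http.cookiejar. cookie_summary is a hypothetical name used only for illustration, not part of any library.

```python
import http.cookiejar


def cookie_summary(jar):
    """Return one 'domain path name=value' string per cookie in the jar, sorted."""
    # Iterating over a CookieJar yields the Cookie objects it currently holds.
    return sorted(
        "{} {} {}={}".format(c.domain, c.path, c.name, c.value) for c in jar
    )
```

Calling cookie_summary(cj) right after opener.open(...) shows whether the server set anything. An empty list would suggest that no cookie was stored, which would at least be consistent with the "Cookies are disabled" message.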

I don't really want to mess around with Mechanize, but I don't see any other way to do it at the moment. If anyone can provide any help, it would be greatly appreciated.


After a few more hours of searching, I found the solution to my problem. I'm still not sure why this code works as opposed to the version above, but it does. Here is the code I'm using now:

import urllib
import urllib2
import cookielib

cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)

# Fetch the index page first so the jar can store any cookies the site sets.
request = urllib2.Request("http://what.cd/index.php", None)
f = urllib2.urlopen(request)
f.close()

# POST the credentials; the installed opener sends the stored cookies along.
data = urllib.urlencode({"username": "your-login", "password" : "your-password"})
request = urllib2.Request("http://what.cd/login.php", data)
f = urllib2.urlopen(request)

html = f.read()
f.close()

Credit goes to carl.waldbieser from linuxquestions.org. Thanks to everyone who gave input.
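
For anyone on Python 3, the same two-step flow can be sketched as follows. urllib2 and cookielib became urllib.request and http.cookiejar there, and the POST body has to be bytes. The function names and the URL parameters are hypothetical placeholders, not part of the original script. Whether the initial GET is what makes the login succeed is only a guess: it gives the site a chance to set a cookie before the credentials are posted.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_cookie_opener():
    """Return (opener, jar): an opener that stores and resends cookies."""
    jar = http.cookiejar.LWPCookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def encode_login(username, password):
    """URL-encode the form fields; Python 3 needs the POST body as bytes."""
    fields = {"username": username, "password": password}
    return urllib.parse.urlencode(fields).encode("ascii")


def login(opener, index_url, login_url, username, password):
    """Visit index_url, then POST the credentials to login_url; return the HTML bytes."""
    # Step 1: plain GET so the server can hand out any initial cookies.
    with opener.open(index_url) as response:
        response.read()
    # Step 2: POST the credentials; the opener resends the stored cookies.
    with opener.open(login_url, encode_login(username, password)) as response:
        return response.read()
```

With the URLs from the code above, this would be called as login(opener, "http://what.cd/index.php", "http://what.cd/login.php", "your-login", "your-password") after opener, jar = build_cookie_opener().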
