How to scrape the 'More' portion of the Quora profile page?

Developer https://www.devze.com 2023-04-09 23:47 Source: Internet
To build a list of all topics on Quora, I decided to start by scraping a profile page that follows many topics, e.g. http://www.quora.com/Charlie-Cheever/topics. I scraped the topics from this page, but now I need to scrape the topics from the Ajax content that is loaded when you click the 'More' button at the bottom of the page. I'm trying to find the JavaScript function that is executed when the 'More' button is clicked, but I have had no luck yet. Here are three snippets from the HTML page that may be relevant:

<div class=\"pager_next action_button\" id=\"__w2_mEaYKRZ_more\">More</div>
{\"more_button\": \"mEaYKRZ\"}

\"dPs6zd5\": {\"more_button\": \"more_button\"}

new(PagedListMoreButton)(\"mEaYKRZ\",\"more_button\",{},\"live:ld_c5OMje_9424:cls:a.view.paged_list:PagedListMoreButton:/TW7WZFZNft72w\",{})

Does anyone know the name of the JavaScript function that is executed when the 'More' button is clicked? Any help would be appreciated :)

The Python script (which follows this tutorial) currently looks like this:

#!/usr/bin/python
# Python 2 / BeautifulSoup 3 script.
# Only prints the topics followed by Charlie Cheever that appear on the first page.
import httplib2
from BeautifulSoup import BeautifulSoup
SCRAPING_CONN = httplib2.Http(".cache")

def fetch(url,method="GET"):
    return SCRAPING_CONN.request(url,method)

def extractTopic(s):
    d = {}
    d['url'] = "http://www.quora.com" + s['href']
    d['topicName'] = s.findChildren()[0].string
    return d

def fetch_stories():
    page = fetch(u"http://www.quora.com/Charlie-Cheever/topics")
    soup = BeautifulSoup(page[1])
    stories = soup.findAll('a', 'topic_name')
    topics = [extractTopic(s) for s in stories]
    for t in topics:
        print u"%s, %s\n" % (t['topicName'],t['url'])

fetch_stories()  # prints the topics; the function returns nothing


You can see it in your browser's DOM inspector under Event Listeners. It's an anonymous function that looks like this:

function (){return typeof d!=="undefined"&&!d.event.triggered?d.event.handle.apply(l.elem,arguments):b}

This looks like a difficult website to scrape; you might consider using Selenium.
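
As a rough illustration of the Selenium route, here is a minimal Python 3 sketch. It is not taken from the question and has not been run against Quora. It assumes the 'More' button can still be matched by the pager_next class shown in the first HTML snippet, that the button disappears or is hidden once everything is loaded, and that a fixed pause is long enough for each Ajax request. The names load_full_page, extract_topics, TopicLinkParser, max_clicks and pause are made up for this example.

```python
import time
from html.parser import HTMLParser


class TopicLinkParser(HTMLParser):
    """Collects topic names and URLs from <a class="topic_name" href="..."> links."""

    def __init__(self, base_url="http://www.quora.com"):
        super().__init__()
        self.base_url = base_url
        self.topics = []
        self._href = None   # href of the topic link currently being read, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "topic_name" in classes:
            self._href = attrs.get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            self.topics.append({"topicName": name, "url": self.base_url + self._href})
            self._href = None


def extract_topics(html):
    """Return a list of {'topicName', 'url'} dicts found in the given HTML."""
    parser = TopicLinkParser()
    parser.feed(html)
    return parser.topics


def load_full_page(url, max_clicks=200, pause=2.0):
    """Open the page in a real browser and click 'More' until it is gone.

    Assumption: the button matches div.pager_next and is removed or hidden
    when there is nothing left to load.
    """
    # Imported here so the parsing code above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        for _ in range(max_clicks):
            buttons = [b for b in driver.find_elements(By.CSS_SELECTOR, "div.pager_next")
                       if b.is_displayed()]
            if not buttons:
                break
            buttons[0].click()
            time.sleep(pause)  # crude wait for the Ajax response (assumption)
        return driver.page_source
    finally:
        driver.quit()
```

With this in place, extract_topics(load_full_page("http://www.quora.com/Charlie-Cheever/topics")) would return the topics from the fully expanded page, provided the assumptions above hold. Because the browser does the clicking, there is no need to track down the handler behind the button.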
