Web mining or scraping or crawling? What tool/library should I use? [closed]

Closed. This question is seeking recommendations for books, tools, software libraries, and more. It does not meet Stack Overflow guidelines. It is not currently accepting answers.

We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.

Closed 7 years ago.


I want to crawl and save some webpages as HTML. Say, crawl hundreds of popular websites and simply save their front pages and their "About" pages.

I've looked through many existing questions, but didn't find an answer to this among either the web-crawling or the web-scraping ones.

What library or tool should I use to build the solution? Or are there existing tools that can handle this?


If you go with Python, you might be interested in mechanize and BeautifulSoup.

Mechanize more or less simulates a browser (including options for proxying, faking browser identification, page redirection, etc.) and makes it easy to fetch forms, links, and so on. The documentation is a bit rough/sparse, though.

Some example code (from the mechanize website) to give you an idea:

import mechanize

br = mechanize.Browser()
br.open("http://www.example.com/")
# follow the second link whose text matches the regular expression
html_response = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
print(br.title())
print(html_response)

BeautifulSoup makes it easy to parse HTML content (which you could have fetched with mechanize), and it supports regexes.

Some example code:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html_response)

rows = soup.findAll('tr')
for r in rows[2:]:  # ignore the first two rows
    cols = r.findAll('td')
    print(cols[0].renderContents().strip())  # print the content of the first column

So, the snippets above are pretty much copy-paste ready for printing the content of the first column of every table row on a page.
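
For example, since findAll accepts compiled regular expressions, you could pull out links whose href mentions "about" like this (a small sketch in the same old BeautifulSoup 3 style as above; the variable names are just illustrative):

import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html_response)
# find all <a> tags whose href attribute contains "about", case-insensitively
about_links = soup.findAll('a', href=re.compile(r'about', re.I))
for link in about_links:
    print(link['href'])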


There really is no good off-the-shelf solution here. You are right to suspect that Python is probably the best way to start, because of its incredibly strong support for regular expressions.

In order to implement something like this, strong knowledge of SEO (Search Engine Optimization) would help since effectively optimizing a webpage for search engines tells you how search engines behave. I would start with a site like SEOMoz.

As far as identifying the "About us" page goes, you really only have two options:

a) For each site, get the link to its "About us" page and feed it to your crawler.

b) Parse all the links on the page for certain keywords like "about us", "about", "learn more", or whatever.

If you use option b, be careful: you could get stuck in an infinite loop, since a website will link to the same page many times, especially if the link is in the header or footer; a page may even link back to itself. To avoid this, you'll need to keep a list of visited links and make sure not to revisit them.
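
A minimal sketch of option b in modern Python 3, using only the standard library (the regex, the keyword list, and the find_about_link helper are my own illustrative choices, not something from a particular library):

import re
import urllib.parse
import urllib.request

ABOUT_KEYWORDS = ("about us", "about", "learn more")   # illustrative keyword list

def find_about_link(front_page_url, visited=None):
    # Return the first link whose anchor text looks like an "about" link,
    # skipping URLs we have already seen so we don't loop forever.
    visited = visited if visited is not None else set()
    html = urllib.request.urlopen(front_page_url).read().decode("utf-8", "replace")
    # crude regex for <a href="...">text</a>; a real crawler should use an HTML parser
    for href, text in re.findall(r'<a[^>]+href="([^"]+)"[^>]*>(.*?)</a>', html, re.I | re.S):
        url = urllib.parse.urljoin(front_page_url, href)
        if url in visited:
            continue                     # already seen this URL: skip it
        visited.add(url)
        anchor = re.sub(r"<[^>]+>", "", text).lower()
        if any(keyword in anchor for keyword in ABOUT_KEYWORDS):
            return url
    return None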

Finally, I would recommend having your crawler respect the instructions in the robots.txt file, and it is probably a good idea not to follow links marked rel="nofollow", as these are mostly used on external links. Again, you can learn this and more by reading up on SEO.
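
For robots.txt, Python's standard library already has a parser; a minimal check might look like this (the user agent string and the function name are just placeholders):

import urllib.parse
import urllib.robotparser

def allowed_by_robots(url, user_agent="MyCrawler"):
    # Return True if robots.txt for the URL's host allows this user agent to fetch it.
    parts = urllib.parse.urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("%s://%s/robots.txt" % (parts.scheme, parts.netloc))
    rp.read()                    # fetch and parse robots.txt
    return rp.can_fetch(user_agent, url)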

Regards,


Try out Scrapy. It is a web scraping framework for Python. If you only expect to need a simple Python script, try urllib2.
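
A minimal Scrapy spider for this kind of task might look like the sketch below (assuming a recent Scrapy version; the spider name, URL list, and filename scheme are just illustrative):

import scrapy

class FrontPageSpider(scrapy.Spider):
    name = "frontpages"                          # illustrative spider name
    start_urls = ["http://www.example.com/"]     # replace with your list of sites

    def parse(self, response):
        # save the raw HTML of each fetched page to disk
        filename = response.url.split("//")[-1].rstrip("/").replace("/", "_") + ".html"
        with open(filename, "wb") as f:
            f.write(response.body)

You can run it without creating a full project via scrapy runspider frontpage_spider.py.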


Python plus curl (via the pycurl bindings) is, in my opinion, the best way to implement a crawler.

The following code can crawl 10,000 pages in about 300 seconds on a decent server.

#! /usr/bin/env python
# -*- coding: iso-8859-1 -*-
# vi:ts=4:et
# $Id: retriever-multi.py,v 1.29 2005/07/28 11:04:13 mfx Exp $

#
# Usage: python retriever-multi.py <file with URLs to fetch> [<# of
#          concurrent connections>]
#

import sys
import pycurl

# We should ignore SIGPIPE when using pycurl.NOSIGNAL - see
# the libcurl tutorial for more info.
try:
    import signal
    from signal import SIGPIPE, SIG_IGN
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)
except ImportError:
    pass


# Get args
num_conn = 10
try:
    if sys.argv[1] == "-":
        urls = sys.stdin.readlines()
    else:
        urls = open(sys.argv[1]).readlines()
    if len(sys.argv) >= 3:
        num_conn = int(sys.argv[2])
except:
    print "Usage: %s <file with URLs to fetch> [<# of concurrent connections>]" % sys.argv[0]
    raise SystemExit


# Make a queue with (url, filename) tuples
queue = []
for url in urls:
    url = url.strip()
    if not url or url[0] == "#":
        continue
    filename = "doc_%03d.dat" % (len(queue) + 1)
    queue.append((url, filename))


# Check args
assert queue, "no URLs given"
num_urls = len(queue)
num_conn = min(num_conn, num_urls)
assert 1 <= num_conn <= 10000, "invalid number of concurrent connections"
print "PycURL %s (compiled against 0x%x)" % (pycurl.version, pycurl.COMPILE_LIBCURL_VERSION_NUM)
print "----- Getting", num_urls, "URLs using", num_conn, "connections -----"


# Pre-allocate a list of curl objects
m = pycurl.CurlMulti()
m.handles = []
for i in range(num_conn):
    c = pycurl.Curl()
    c.fp = None
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.MAXREDIRS, 5)
    c.setopt(pycurl.CONNECTTIMEOUT, 30)
    c.setopt(pycurl.TIMEOUT, 300)
    c.setopt(pycurl.NOSIGNAL, 1)
    m.handles.append(c)


# Main loop
freelist = m.handles[:]
num_processed = 0
while num_processed < num_urls:
    # If there is an url to process and a free curl object, add to multi stack
    while queue and freelist:
        url, filename = queue.pop(0)
        c = freelist.pop()
        c.fp = open(filename, "wb")
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, c.fp)
        m.add_handle(c)
        # store some info
        c.filename = filename
        c.url = url
    # Run the internal curl state machine for the multi stack
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # Check for curl objects which have terminated, and add them to the freelist
    while 1:
        num_q, ok_list, err_list = m.info_read()
        for c in ok_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            print "Success:", c.filename, c.url, c.getinfo(pycurl.EFFECTIVE_URL)
            freelist.append(c)
        for c, errno, errmsg in err_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            print "Failed: ", c.filename, c.url, errno, errmsg
            freelist.append(c)
        num_processed = num_processed + len(ok_list) + len(err_list)
        if num_q == 0:
            break
    # Currently no more I/O is pending, could do something in the meantime
    # (display a progress bar, etc.).
    # We just call select() to sleep until some more data is available.
    m.select(1.0)


# Cleanup
for c in m.handles:
    if c.fp is not None:
        c.fp.close()
        c.fp = None
    c.close()
m.close()


If you are going to build a crawler, you need to (Java specific):

  1. Learn how to use the java.net.URL and java.net.URLConnection classes or use the HttpClient library
  2. Understand http request/response headers
  3. Understand redirects (HTTP, HTML meta refresh, and JavaScript)
  4. Understand content encodings (charsets)
  5. Use a good library for parsing badly formed HTML (e.g. NekoHTML (CyberNeko), Jericho, jsoup)
  6. Make concurrent HTTP requests to different hosts, but ensure you issue no more than one request to the same host every ~5 seconds (see the politeness sketch below)
  7. Persist pages you have fetched, so you don't need to refetch them every day if they don't change that often (HBase can be useful).
  8. A way of extracting links from the current page to crawl next
  9. Obey robots.txt

A bunch of other stuff too.
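
For point 6 above, a minimal per-host politeness delay could look like this sketch (in Python rather than Java, purely for brevity; MIN_DELAY and the function name are made up for illustration):

import time
import urllib.parse
from collections import defaultdict

MIN_DELAY = 5.0                    # seconds between requests to the same host (an assumption)
_last_hit = defaultdict(float)     # host -> timestamp of the last request to it

def wait_for_host(url):
    # Sleep just long enough that the same host is never hit more than
    # once every MIN_DELAY seconds; call this right before each fetch.
    host = urllib.parse.urlparse(url).netloc
    elapsed = time.time() - _last_hit[host]
    if elapsed < MIN_DELAY:
        time.sleep(MIN_DELAY - elapsed)
    _last_hit[host] = time.time()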

It's not that difficult, but there are lots of fiddly edge cases (e.g. redirects, detecting encoding (check out Tika)).

For more basic requirements you could use wget. Heritrix is another option, but it is yet another framework to learn.

Identifying "About us" pages can be done using various heuristics:

  1. inbound link text
  2. page title
  3. content on page
  4. URL

If you wanted to be more quantitative about it, you could use machine learning and a classifier (maybe Bayesian).
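
A crude way to combine those four signals before reaching for a classifier (a sketch; the keyword list and weights are arbitrary illustrations):

ABOUT_HINTS = ("about", "about us", "company", "who we are")   # illustrative keywords

def about_score(link_text, page_title, page_text, url):
    # Score how likely a page is an "about" page from the four weak signals above.
    # The weights are arbitrary; a trained classifier would do better.
    weights = {"link_text": 3.0, "title": 2.0, "url": 2.0, "content": 1.0}
    score = 0.0
    if any(h in link_text.lower() for h in ABOUT_HINTS):
        score += weights["link_text"]
    if any(h in page_title.lower() for h in ABOUT_HINTS):
        score += weights["title"]
    if "about" in url.lower():
        score += weights["url"]
    if any(h in page_text.lower() for h in ABOUT_HINTS):
        score += weights["content"]
    return score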

Saving the front page is obviously easier, but front-page redirects (sometimes to different domains, and often implemented as an HTML meta refresh tag or even in JavaScript) are very common, so you need to handle them.
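
Detecting an HTML meta refresh redirect is easy enough with a small regex (a sketch only; it assumes the url= part comes after http-equiv, and JavaScript redirects are a separate, much harder problem):

import re

META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]+'
    r'content=["\']?\d+\s*;\s*url=([^"\'>]+)',
    re.IGNORECASE,
)

def meta_refresh_target(html):
    # Return the target URL of a <meta http-equiv="refresh"> tag, or None if there is none.
    m = META_REFRESH.search(html)
    return m.group(1).strip() if m else None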


Heritrix has a bit of a steep learning curve, but it can be configured so that only the homepage, and a page that "looks like" an About page (using a regex filter), will get crawled.

More open source Java (web) crawlers: http://java-source.net/open-source/crawlers
