How would I start on a single web page, let's say the root of DMOZ.org, and index every single URL linked from it, then store those links in a text file? I don't want the content, just the links themselves. An example would be awesome.
This, for instance, would print out links on this very related (but poorly named) question:
# Python 3 version; needs the beautifulsoup4 package (bs4)
import urllib.request
from bs4 import BeautifulSoup

q = urllib.request.urlopen('https://stackoverflow.com/questions/3884419/')
soup = BeautifulSoup(q.read(), 'html.parser')
for link in soup.find_all('a'):
    if link.has_attr('href'):
        # link.string is None when the tag has no single text child
        print(str(link.string) + " -> " + link['href'])
    elif link.has_attr('id'):
        print("ID: " + link['id'])
    else:
        print("???")
Output:
Stack Exchange -> http://stackexchange.com
log in -> /users/login?returnurl=%2fquestions%2f3884419%2f
careers -> http://careers.stackoverflow.com
meta -> http://meta.stackoverflow.com
...
ID: flag-post-3884419
None -> /posts/3884419/revisions
...
If you insist on reinventing the wheel, use an HTML parser like BeautifulSoup to pull out all the tags. This answer to a similar question is relevant.
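To also cover the "store those links inside a text file" part of the question, here is a minimal sketch that uses only the standard library's `html.parser` instead of BeautifulSoup. The names `LinkCollector`, `extract_links`, `save_links` and `crawl_page` are made up for this illustration, and it only looks at `<a href=...>` tags on a single page (no recursion, no robots.txt handling), so treat it as a starting point rather than a finished crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links like /users/login into absolute URLs
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute link targets found in html, without duplicates."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(parser.links))


def save_links(links, path):
    """Write one link per line to a text file."""
    with open(path, "w", encoding="utf-8") as f:
        for link in links:
            f.write(link + "\n")


def crawl_page(start_url, out_path):
    """Fetch a single page and store its links (hypothetical helper)."""
    with urlopen(start_url) as response:
        page = response.read().decode("utf-8", errors="replace")
    save_links(extract_links(page, start_url), out_path)
```

Calling `crawl_page` with the start URL and an output filename such as `links.txt` (both are assumptions, pick your own) should leave you with one absolute URL per line. Going deeper than one page would mean feeding the saved links back in and keeping a set of already-visited URLs.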
Scrapy is a Python framework for web crawling. Plenty of examples here: http://snippets.scrapy.org/popular/bookmarked/