Scrapy web scraper cannot crawl links

https://www.devze.com 2023-01-11 23:49 Source: web
I'm very new to Scrapy. Here is my spider to crawl twistedweb.

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TwistedWebSpider(BaseSpider):

    name = "twistedweb3"
    allowed_domains = ["twistedmatrix.com"]
    start_urls = [
        "http://twistedmatrix.com/documents/current/web/howto/",
    ]
    rules = (
        Rule(SgmlLinkExtractor(),
            'parse',
            follow=True,
        ),
    )

    def parse(self, response):
        print response.url
        filename = response.url.split("/")[-1]
        filename = filename or "index.html"
        open(filename, 'wb').write(response.body)
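
A quick illustration of why the `or "index.html"` fallback in the last lines is there: a URL that ends in a slash splits to an empty final segment, which would not be a usable filename.

```python
# A URL ending in "/" has an empty final path segment, so the
# "or" fallback substitutes a usable filename.
def filename_for(url):
    return url.split("/")[-1] or "index.html"

print(filename_for("http://twistedmatrix.com/documents/current/web/howto/"))
# -> index.html
print(filename_for("http://twistedmatrix.com/documents/current/web/howto/client.html"))
# -> client.html
```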

When I run scrapy-ctl.py crawl twistedweb3, it fetched only the start URL (index.html).

Getting the index.html content, I tried using SgmlLinkExtractor; it extracts links as I expected, but these links are not followed.

Can you show me where I am going wrong?

Suppose I want to get the CSS and JavaScript files too. How do I achieve this? I mean, how do I get the full website?


The rules attribute belongs to CrawlSpider. Use class MySpider(CrawlSpider). Also, when you use CrawlSpider you must not override the parse method; use a callback with a different name, such as parse_response.

