python/scrapy question: How to avoid endless loops

I am using the web-scraping framework, scrapy, to data mine some sites. I am trying to use the CrawlSpider and the pages have a 'back' and 'next' button. The URLs are in the format

www.qwerty.com/###

where ### is a number that increments every time the next button is pressed. How do I format the rules so that an endless loop doesn't occur?

Here is my rule:

rules = (
    Rule(
        SgmlLinkExtractor(allow='http://not-a-real-site.com/trunk-framework/791'),
        follow=True,
        callback='parse_item',
    ),
)


An endless loop shouldn't happen: Scrapy filters out duplicate URLs by default.
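
A minimal sketch of that default behavior, assuming a modern Scrapy release (response.follow, .getall()) and hypothetical a.back / a.next link markup: the built-in RFPDupeFilter drops any request whose URL fingerprint has already been scheduled, unless the request sets dont_filter=True.

import scrapy

class PagesSpider(scrapy.Spider):
    """Hypothetical spider showing why a back/next cycle terminates."""
    name = 'pages'
    start_urls = ['http://not-a-real-site.com/trunk-framework/791']

    def parse(self, response):
        # Follow both buttons. The 'back' link points at a page that was
        # already scheduled, so the default duplicate filter drops it;
        # only genuinely new 'next' pages get fetched.
        for href in response.css('a.back::attr(href), a.next::attr(href)').getall():
            yield response.follow(href, callback=self.parse)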


What makes you think the program will go into an infinite loop? How have you tested it? Scrapy won't download a URL it has already fetched. Did you try going through all the pages? What happens when you click next on the last page?

You can get into an infinite loop if the site generates a new number every time the next link is pressed. That is a case of broken site code, but you can put a limit on the maximum number of pages in your code to avoid looping indefinitely; see the sketch below.
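
A minimal sketch of such a cap, assuming the page number is the last path segment of the URL (as in the question's www.qwerty.com/### pattern); MAX_PAGES and the a.next selector are made-up names:

import scrapy

MAX_PAGES = 1000  # assumed cap; tune it for the site being crawled

class CappedSpider(scrapy.Spider):
    """Hypothetical spider that stops following 'next' past a page cap."""
    name = 'capped'
    start_urls = ['http://not-a-real-site.com/trunk-framework/791']

    def parse(self, response):
        # The question's URLs end in an incrementing number, e.g. .../791.
        page = int(response.url.rstrip('/').rsplit('/', 1)[-1])
        if page >= MAX_PAGES:
            return  # hard stop, even if the site keeps minting new numbers
        next_href = response.css('a.next::attr(href)').get()  # assumed markup
        if next_href:
            yield response.follow(next_href, callback=self.parse)

As a coarser built-in alternative, the CLOSESPIDER_PAGECOUNT setting (from Scrapy's CloseSpider extension) shuts the whole crawl down after a fixed number of responses.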


You can set a limit on the number of links to follow: use the DEPTH_LIMIT setting.
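
For example, a per-spider override via custom_settings (available since Scrapy 1.0); the spider below is a hypothetical sketch, and the limit could equally live in settings.py:

import scrapy

class LimitedSpider(scrapy.Spider):
    """Hypothetical spider capping crawl depth with DEPTH_LIMIT."""
    name = 'limited'
    start_urls = ['http://not-a-real-site.com/trunk-framework/791']

    # Requests more than 100 hops away from a start URL are discarded by
    # the built-in DepthMiddleware, so a back/next cycle cannot run forever.
    custom_settings = {'DEPTH_LIMIT': 100}

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)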

Alternatively, you can check the current depth in a parse callback:

def parse(self, response):
    # 'depth' is tracked by Scrapy's built-in DepthMiddleware.
    if response.meta['depth'] > 100:
        print('Loop?')
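
Printing only logs the suspicion; to actually stop, one option (an assumption about the desired behavior, using Scrapy's real CloseSpider exception) is to abort the crawl from the callback:

from scrapy.exceptions import CloseSpider

def parse(self, response):
    if response.meta['depth'] > 100:
        raise CloseSpider('suspected endless back/next loop')
    # ... normal item extraction continues here ...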
