python/scrapy question: How to avoid endless loops

I am using the web-scraping framework, scrapy, to data mine some sites. I am trying to use the CrawlSpider and the pages have a 'back' and 'next' button. The URLs are in the format

www.qwerty.com/###

where ### is a number that increments every time the next button is pressed. How do I format the rules so that an endless loop doesn't occur?

Here is my rule:

rules = (
    Rule(
        SgmlLinkExtractor(allow='http://not-a-real-site.com/trunk-framework/791'),
        follow=True,
        callback='parse_item',
    ),
)


An endless loop shouldn't happen: Scrapy filters out duplicate URLs by default.
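
A minimal sketch of that default behavior, assuming a modern Scrapy release (response.follow, .getall()) and hypothetical a.back / a.next link markup: the built-in RFPDupeFilter drops any request whose URL fingerprint has already been scheduled, unless the request sets dont_filter=True.

import scrapy

class PagesSpider(scrapy.Spider):
    """Hypothetical spider showing why a back/next cycle terminates."""
    name = 'pages'
    start_urls = ['http://not-a-real-site.com/trunk-framework/791']

    def parse(self, response):
        # Follow both buttons. The 'back' link points at a page that was
        # already scheduled, so the default duplicate filter drops it;
        # only genuinely new 'next' pages get fetched.
        for href in response.css('a.back::attr(href), a.next::attr(href)').getall():
            yield response.follow(href, callback=self.parse)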


What makes you think the program will go into an infinite loop? How have you tested it? Scrapy won't download a URL it has already fetched. Did you try going through all the pages? What happens when you click next on the last page?

You can get into an infinite loop if the site generates a new number every time the next link is pressed. That is a case of broken site code, but you can put a limit on the maximum number of pages in your code to avoid looping indefinitely; see the sketch below.
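
A minimal sketch of such a cap, assuming the page number is the last path segment of the URL (as in the question's www.qwerty.com/### pattern); MAX_PAGES and the a.next selector are made-up names:

import scrapy

MAX_PAGES = 1000  # assumed cap; tune it for the site being crawled

class CappedSpider(scrapy.Spider):
    """Hypothetical spider that stops following 'next' past a page cap."""
    name = 'capped'
    start_urls = ['http://not-a-real-site.com/trunk-framework/791']

    def parse(self, response):
        # The question's URLs end in an incrementing number, e.g. .../791.
        page = int(response.url.rstrip('/').rsplit('/', 1)[-1])
        if page >= MAX_PAGES:
            return  # hard stop, even if the site keeps minting new numbers
        next_href = response.css('a.next::attr(href)').get()  # assumed markup
        if next_href:
            yield response.follow(next_href, callback=self.parse)

As a coarser built-in alternative, the CLOSESPIDER_PAGECOUNT setting (from Scrapy's CloseSpider extension) shuts the whole crawl down after a fixed number of responses.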


You can set a limit on the number of links to follow: use the DEPTH_LIMIT setting.
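
For example, a per-spider override via custom_settings (available since Scrapy 1.0); the spider below is a hypothetical sketch, and the limit could equally live in settings.py:

import scrapy

class LimitedSpider(scrapy.Spider):
    """Hypothetical spider capping crawl depth with DEPTH_LIMIT."""
    name = 'limited'
    start_urls = ['http://not-a-real-site.com/trunk-framework/791']

    # Requests more than 100 hops away from a start URL are discarded by
    # the built-in DepthMiddleware, so a back/next cycle cannot run forever.
    custom_settings = {'DEPTH_LIMIT': 100}

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)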

Alternatively, you can check the current depth in a parse callback:

def parse(self, response):
    # 'depth' is tracked by Scrapy's built-in DepthMiddleware.
    if response.meta['depth'] > 100:
        print('Loop?')
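
Printing only logs the suspicion; to actually stop, one option (an assumption about the desired behavior, using Scrapy's real CloseSpider exception) is to abort the crawl from the callback:

from scrapy.exceptions import CloseSpider

def parse(self, response):
    if response.meta['depth'] > 100:
        raise CloseSpider('suspected endless back/next loop')
    # ... normal item extraction continues here ...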
