Generating db_gone urls for fetch

In my crawler system, I have set the fetch interval to 30 days. I initially set my user agent to something like "....", and many urls got rejected. After changing my user agent to an appropriate name, I want to fetch the urls that were rejected initially. The problem is that those urls now have the db_gone status, with a retry interval of 45 days, so the generator won't pick them up. In this case, how can I fetch the urls with db_gone status?

Does nutch by default have any option to crawl those db_gone urls alone?

Or do I need to write a separate map-reduce program to collect those urls and use freegen to generate segments for them?


You just need to configure nutch-site.xml with a different refetch interval.

ADDITION

<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value>
  <description>The maximum number of seconds between re-fetches of a page
  (90 days). After this period every page in the db will be re-tried, no
  matter what is its status.</description>
</property>
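
With the default of 90 days shown above, every entry in the crawldb eventually becomes eligible for the generator again, regardless of its status. To pick up your rejected urls sooner than their 45-day retry interval, set db.fetch.interval.max in nutch-site.xml to something below 3888000 seconds (45 days).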
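
Alternatively, if you don't want to wait for the interval cap to elapse, the built-in readdb and freegen tools can target the db_gone entries directly, with no custom map-reduce job. A rough sketch, assuming a Nutch 1.x layout with the crawldb at crawl/crawldb (the directory names here are placeholders, and the -status filter on readdb -dump exists only in recent 1.x releases; on older versions, grep the dump for the status instead):

# Dump only the db_gone records from the crawldb
bin/nutch readdb crawl/crawldb -dump gone_dump -format normal -status db_gone

# Each record starts with the url as the first tab-separated field;
# collect the urls into a seed-style directory for freegen
mkdir -p gone_urls
grep -h '^http' gone_dump/part-* | cut -f1 > gone_urls/urls.txt

# Generate a segment containing exactly these urls
bin/nutch freegen gone_urls crawl/segments

Then run the usual fetch, parse, and updatedb steps against the new segment; once the urls fetch successfully, updatedb will replace their db_gone status in the crawldb.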
