nutch
Crawling engine architecture - Java/Perl integration
I am looking to develop a management and administration solution around our web-crawling Perl scripts. Basically, right now our scripts are saved in SVN and are manually kicked off by SysAdmin/devs etc [Details]
2022-12-14 11:09 Category: Q&A
Crawler: get external website search results
What is the best practice, and which library can I use, to enter a query into the search box of an external website and collect the search results? [Details]
2022-12-13 23:21 Category: Q&A
Configuring Nutch regex-normalize.xml
I am using the Java-based Nutch web-search software. In order to prevent duplicate (URL) results from being returned in my search query results, I am trying to remove (a.k.a. normalize) the expression [Details]
2022-12-11 13:14 Category: Q&A
Nutch issues with crawling a website where the URLs differ only in the parameters passed
I am using Nutch to crawl websites and, strangely, for one of my websites the Nutch crawl returns only two URLs: the home page URL (http://mysite.com/) and one other. [Details]
2022-12-11 13:08 Category: Q&A
How to enable following redirects in Nutch-1.0
I am using Nutch-1.0 and I am getting this log entry: 2009-11-12 22:13:11,093 INFO httpclient.HttpMethodDirector - Redirect requested but followRedirects is disabled. [Details]
2022-12-11 01:35 Category: Q&A
How can I crawl PDF files served on the internet over the HTTP protocol using Nutch-1.0?
I want to know how I can crawl PDF files that are served on the internet over the HTTP protocol using Nutch-1.0. [Details]
2022-12-08 07:29 Category: Q&A