Nutch - Lucene - capture the content of the pages

Developer https://www.devze.com 2023-01-28 17:51 Source: the web
I have crawled a few pages with Nutch (Java). I have also written a Java module with Lucene that executes queries against the indexed documents. I know the index I created with Nutch has fields such as url, weight, and title, but I am interested in capturing the content of each page. How can I do that with Lucene, given that the pages were crawled with Nutch?

Thanks


You need to give more details about what you want to achieve. Nutch already includes a Lucene index, so I wonder why you want another one. Nutch has a JSP front end that you can look at to find out how to query for the content of a field. There is also a cache system, so you can retrieve the cached data of a page, but then you have to parse it and index it again.
