I have crawled a few pages with Nutch (Java). I have also written a Java module with Lucene that executes queries against the indexed documents. I know the index has Nutch fields such as url, weight and title, but I am interested in capturing the content of each page. How can I do that with Lucene, given that I crawled with Nutch?
Thanks
You need to give more details about what you want to achieve. Nutch already includes a Lucene index, so I wonder why you want another one. Nutch has a JSP front-end that you can look at to find out how to query for a field's content. There is also a cache system, so you can retrieve the cached data of a page, but then you have to parse it and index it again.
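To make the last point concrete, here is a minimal sketch of the "parse it again" step, assuming you already have the cached HTML of a page as a string. The class name `CachedPageFields` and the method `toFields` are hypothetical, and the regex-based tag stripping is only a crude stand-in for a real HTML parser such as the one Nutch's parse plugins use. It produces a url/title/content map that you could then copy into a Lucene document with a stored content field, so that the text can be read back from search hits.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Lucene.
public class CachedPageFields {

    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // Captures the text between <title> and </title>.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", FLAGS);

    // Matches whole <script>...</script> and <style>...</style> elements.
    private static final Pattern SCRIPT_STYLE =
            Pattern.compile("<(script|style)[^>]*>.*?</\\1>", FLAGS);

    // Matches any remaining tag.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /**
     * Turns the cached HTML of one page into the fields you would index.
     * Character entities such as &amp; are not decoded in this sketch.
     */
    public static Map<String, String> toFields(String url, String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("url", url);

        Matcher m = TITLE.matcher(html);
        fields.put("title", m.find() ? m.group(1).trim() : "");

        // Drop non-visible elements and the title, then strip the tags.
        String text = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        text = TITLE.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");

        // Collapse runs of whitespace left behind by the removed markup.
        fields.put("content", text.replaceAll("\\s+", " ").trim());
        return fields;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hello</title></head>"
                + "<body><p>Some <b>page</b> text.</p></body></html>";
        System.out.println(toFields("http://example.com/", html));
    }
}
```

Whether you need this at all depends on how the content field was indexed: if it was indexed but not stored, a query can match it but the hit will not return the text, which is why re-parsing the cached page (or reading the parsed text from the Nutch segment) comes up.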