Crawling within a PDF

Developer (https://www.devze.com), 2022-12-30 10:16, source: Internet
I'm developing a tool that searches a given site for a keyword entered by the user. My problem is that it only searches HTML/web pages, not the PDF/MS Word files found on the site.

Can anyone suggest an API or tool, or provide code, that can search the text of a given online PDF, MS Word, or plain-text file?


You could probably use Antiword for Word files.

pdftotext can be used for PDF files.

Both commands are available through apt: sudo apt-get install xpdf-utils antiword (on current Debian/Ubuntu releases, pdftotext is packaged in poppler-utils instead).
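As a rough illustration of how these two commands could be wired into a keyword search, here is a minimal Python sketch. It assumes pdftotext and antiword are installed and on the PATH; the function names (converter_command, extract_text, contains_keyword, search_remote_document) are made up for this example and are not part of either tool. Note that antiword reads the older .doc format, so .docx files would need a different converter.

```python
import shutil
import subprocess
import tempfile
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def converter_command(path):
    """Return the command that prints the document's text to stdout.

    Returns None for plain-text files, which need no conversion.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # A text-file argument of "-" makes pdftotext write to stdout.
        return ["pdftotext", path, "-"]
    if suffix == ".doc":
        # antiword prints the extracted text to stdout by default.
        return ["antiword", path]
    if suffix == ".txt":
        return None
    raise ValueError(f"no converter known for {suffix!r}")


def extract_text(path):
    """Extract plain text from a local PDF, .doc, or .txt file."""
    command = converter_command(path)
    if command is None:
        return Path(path).read_text(errors="replace")
    if shutil.which(command[0]) is None:
        raise RuntimeError(f"{command[0]} is not installed")
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def contains_keyword(text, keyword):
    """Case-insensitive substring match."""
    return keyword.casefold() in text.casefold()


def search_remote_document(url, keyword):
    """Download a document to a temporary file and search it for keyword."""
    suffix = Path(urlparse(url).path).suffix.lower()
    with tempfile.TemporaryDirectory() as tmp:
        local_path = str(Path(tmp) / ("document" + suffix))
        urllib.request.urlretrieve(url, local_path)
        return contains_keyword(extract_text(local_path), keyword)
```

A crawler that already collects links could call search_remote_document(url, keyword) for every link ending in .pdf, .doc, or .txt, alongside its existing HTML search.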


If you are developing in anything that runs on the JVM, you would probably do best with Apache POI for parsing MS Office documents and PDFBox, JPedal, or PDF Clown for parsing PDFs.

For general indexing, you can't go wrong with Lucene and Nutch.
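To make "indexing" concrete, here is a toy inverted index in Python. It only sketches the idea of mapping each word to the documents that contain it; it is not Lucene's or Nutch's API, and the names tokenize, build_index, and search are hypothetical. The input is assumed to be text already extracted from each document.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.casefold())


def build_index(documents):
    """Map each token to the set of document ids that contain it.

    `documents` is a dict of {document id: extracted text}.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results
```

Building the index once and querying it repeatedly avoids re-reading every document for each keyword, which is the main reason to use an indexing library rather than scanning files on every search.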
