I'm developing a tool that searches a given site for a keyword entered by the user. My problem is that it searches only the HTML/web pages, not the PDF/MS Word files found on the site.
Can anyone suggest an API/tool, or provide code, that can search the text of a given online PDF/MS Word/text file?
You could probably use Antiword for Word files.
pdftotext can be used for PDF files.
Both commands are available through apt:
sudo apt-get install xpdf-utils antiword
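As a rough sketch of how the two commands could be wired into a keyword search, the Python below shells out to pdftotext and antiword and searches the text they print. It assumes both tools are installed and on the PATH, and that the online file has already been downloaded to a local path. The names build_command, extract_text and contains_keyword are made up for this illustration. Note that antiword reads the older binary .doc format, so .docx files would need a different tool.

```python
import subprocess
from pathlib import Path

# Maps a file extension to a function that builds the extraction command.
# "pdftotext FILE -" writes the text to stdout; antiword prints to stdout by default.
EXTRACTORS = {
    ".pdf": lambda path: ["pdftotext", path, "-"],
    ".doc": lambda path: ["antiword", path],
}


def build_command(path):
    """Return the extraction command for a file, or None if it is read as plain text."""
    suffix = Path(path).suffix.lower()
    make = EXTRACTORS.get(suffix)
    return make(str(path)) if make else None


def extract_text(path):
    """Extract text from a local PDF, .doc, or plain-text file."""
    command = build_command(path)
    if command is None:
        # No external tool needed: treat the file as plain text.
        return Path(path).read_text(errors="replace")
    # check=True raises CalledProcessError if the tool exits with an error.
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def contains_keyword(text, keyword):
    """Case-insensitive substring search."""
    return keyword.lower() in text.lower()
```

Usage would be something like contains_keyword(extract_text("manual.pdf"), "install"), where manual.pdf is a placeholder for a file your crawler has saved locally.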
If you are developing in anything that runs on the JVM, you would probably do best using Apache POI for parsing MS Office documents, and PDFBox, JPedal or PDF Clown for parsing PDFs.
For general indexing, you won't go wrong with Lucene and Nutch.
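To give an idea of what indexing buys you over scanning every file on each query, here is a toy inverted index in Python. It is only an illustration of the concept, not Lucene's or Nutch's API; the function names and the sample documents are invented, and the text is assumed to have been extracted from the files already.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Build an inverted index: word -> set of ids of documents containing it.

    documents is a dict mapping a document id (e.g. a URL) to its extracted text.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical sample data standing in for text extracted from site files.
documents = {
    "guide.pdf": "Keyword search in PDF files",
    "notes.doc": "Search inside Word files",
}
index = build_index(documents)
matches = search(index, "pdf search")
```

The index is built once, after which each query is a few set lookups. Real engines such as Lucene add ranking, stemming, and on-disk storage on top of this basic idea.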