
Extract text from a PDF (I have a link to the PDF) in Ruby

Developer https://www.devze.com 2023-02-08 11:34 Source: Internet

I have a link like

      http://www.downloads.com/help.pdf

I want to download this and parse it to get the text content.

How do I go about this? I also plan to tag-ize (if there is such a word) the extracted text.


You can either use the pdf-reader gem (the example/text.rb example is simple and worked for me): https://github.com/yob/pdf-reader

Or the command-line utility pdftotext.
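
A minimal sketch of the pdftotext route, using only the Ruby standard library to download the file and shell out. It assumes pdftotext is installed and on the PATH; the method names pdftotext_args and pdf_text_from_url are made up for this illustration, and the "-enc UTF-8" option is just one reasonable choice.

```ruby
require 'open-uri'
require 'open3'
require 'tempfile'

# Build the argument list for pdftotext; the trailing "-" sends the
# extracted text to stdout instead of writing a .txt file.
def pdftotext_args(pdf_path)
  ['pdftotext', '-enc', 'UTF-8', pdf_path, '-']
end

# Download the PDF at `url` into a temporary file and return its text.
# (Hypothetical helper; requires network access and pdftotext.)
def pdf_text_from_url(url)
  Tempfile.create(['download', '.pdf']) do |tmp|
    tmp.binmode
    # Copy the response body to disk so pdftotext can read it by path.
    URI.open(url, 'rb') { |remote| IO.copy_stream(remote, tmp) }
    tmp.flush

    text, error, status = Open3.capture3(*pdftotext_args(tmp.path))
    raise "pdftotext failed: #{error}" unless status.success?
    text
  end
end

# Usage:
#   puts pdf_text_from_url('http://www.downloads.com/help.pdf')
```

Copying into an explicit Tempfile sidesteps the fact that open-uri hands back an in-memory StringIO for small responses, which has no path to pass to an external tool.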


The Yomu gem will also be able to extract the text from a PDF (as well as other MIME types) for you.

require 'yomu'
Yomu.new(file_path).text


You can also take a look at DocRipper, a gem I maintain, that provides a Ruby interface for text extraction from a number of document formats including PDF, doc, docx and sketch.

DocRipper uses pdftotext under the hood and avoids Java dependencies.

require 'doc_ripper'

DocRipper::rip('/path/to/file.pdf') # => "Pdf text"

You can read remote files using the Ruby standard library:

require 'open-uri'
require 'doc_ripper'

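# Use URI.open: plain Kernel#open no longer accepts URLs on Ruby 3.0+
tmp_file = URI.open("some_uri")
# Note: responses under about 10 KB come back as a StringIO, which has no #path
DocRipper::rip(tmp_file.path)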
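
As for the "tag-ize" part of the question, one simple approach (an assumption about what is wanted, not a DocRipper feature) is to treat the most frequent non-trivial words as tags. The sketch below is plain Ruby 2.7+; extract_tags and its stop-word list are hypothetical and only illustrative.

```ruby
# Return the `limit` most frequent words in `text`, ignoring stop words.
# Ties are broken alphabetically so the result is deterministic.
def extract_tags(text, limit: 5,
                 stop_words: %w[a an and are as at be by for from in is it of on or that the this to with])
  # Lower-case the text and keep words of two or more letters.
  words = text.downcase.scan(/[a-z][a-z'-]+/)
  counts = (words - stop_words).tally
  counts.sort_by { |word, count| [-count, word] }
        .first(limit)
        .map(&:first)
end

# Usage:
#   extract_tags(DocRipper::rip('/path/to/file.pdf'), limit: 10)
```

For anything beyond a quick pass, a stemmer or a larger stop-word list would likely give cleaner tags.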