How can I extract the first paragraph of a PDF document using Perl's CAM::PDF?

Developer https://www.devze.com 2022-12-09 01:37 Source: Internet
How can I extract the first paragraph of a PDF document using Perl's CAM::PDF?


print CAM::PDF->new('file.pdf')->getPageText(1);

will get you all of the text from page 1. However, CAM::PDF is definitely not the best tool for this particular job (I'm the author). I added text extraction on a whim, just to see if I could do it.


Plain PDF really is not a markup language. Text is simply drawn at specific locations on the page, so the file itself has no notion of a paragraph. There is something called Tagged PDF, and if your documents are tagged, your job might be easier.

If the text in your PDF is stored as text and not as images, I would be inclined to run the documents through a PDF-to-text converter and grab the first chunk of text from its output.
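
As a rough sketch of that converter-based approach, the script below pipes the first page through the `pdftotext` command-line tool (from Poppler or Xpdf, assumed to be installed and on the PATH) and returns the first blank-line-delimited block of text. The helper name `first_paragraph` is hypothetical, and treating a blank line as a paragraph break is only a heuristic: depending on the document, the first block may well be a title or a page header instead of a real paragraph.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first non-empty block of text,
# where blocks are separated by one or more blank lines.
sub first_paragraph {
    my ($text) = @_;
    return unless defined $text;

    $text =~ s/\r\n?/\n/g;    # normalize line endings
    $text =~ s/\f/\n\n/g;     # treat form feeds (page breaks) as block breaks

    for my $block (split /\n\s*\n/, $text) {
        $block =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next unless length $block;   # skip empty blocks
        $block =~ s/\s*\n\s*/ /g;    # join wrapped lines into one line
        return $block;
    }
    return;    # no text found
}

# Usage sketch: perl first_para.pl file.pdf
if (@ARGV) {
    my $file = shift @ARGV;

    # -l 1 stops after page 1; the trailing '-' sends the text to stdout.
    open(my $fh, '-|', 'pdftotext', '-l', '1', $file, '-')
        or die "Cannot run pdftotext: $!";
    my $text = do { local $/; <$fh> };
    close $fh;

    my $para = first_paragraph($text);
    print defined $para ? "$para\n" : "No text found\n";
}
```

The same helper could be fed the string returned by `getPageText(1)`, although whether that output contains blank lines between paragraphs depends on how the PDF was produced.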
