What is a Ruby web crawler library that allows XPath access and the equivalent of "save as webpage, complete"?


I don't need to crawl the whole internet; I just need to open a few URLs, extract other URLs from them, and then save some pages in a way that they can be browsed on disk later. What library would be appropriate for that?


Mechanize is very good for this sort of thing.

http://mechanize.rubyforge.org/mechanize/

In particular this page will help:

http://mechanize.rubyforge.org/mechanize/GUIDE_rdoc.html
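For the task in the question, a minimal Mechanize sketch might look like this (the URL and output filename are placeholders, not anything from the question):

require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://some.web.site')  # placeholder URL

# Every link on the page, as href strings.
hrefs = page.links.map(&:href)

# Write the fetched page to disk so it can be opened later.
page.save('some_web_site.html')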


Under the covers, Mechanize uses Nokogiri to parse documents. Here's a simple version using Open-URI and Nokogiri directly to read a page, extract all the links, and write the HTML.

Added example:

require 'open-uri'
require 'nokogiri'

# URI.open is the open-uri entry point; on modern Rubies, Kernel#open
# no longer handles URLs.
doc = Nokogiri::HTML(URI.open('http://some.web.site'))

Accessing the links is easy. This uses CSS accessors:

hrefs = (doc/'a[href]').map{ |a| a['href'] }

This uses XPath to do the same thing:

hrefs = (doc/'//a[@href]').map{ |a| a['href'] }
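The hrefs extracted either way are often relative. A small sketch resolving them against the page URL with Ruby's standard URI library (the base URL is the placeholder from above; hrefs that URI cannot parse are simply skipped):

require 'uri'

base = 'http://some.web.site'
absolute = hrefs.map { |href|
  URI.join(base, href).to_s rescue nil  # nil for malformed hrefs
}.compact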

Saving the content is easy. Open a file and ask Nokogiri to spit the document out as HTML (File.open rather than File.new, because File.new ignores the block):

File.open('some_web_site.html', 'w') { |fo| fo.puts doc.to_html }
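The question also asks for the equivalent of "save as webpage, complete", which means fetching the page's assets and relinking them before saving. A rough sketch for images only, continuing from the doc above (the assets directory, file-naming scheme, and error handling are all illustrative assumptions):

require 'fileutils'
require 'uri'

FileUtils.mkdir_p('assets')

doc.css('img[src]').each_with_index do |img, i|
  src   = URI.join('http://some.web.site', img['src']).to_s
  local = File.join('assets', "img_#{i}#{File.extname(URI(src).path)}")
  File.open(local, 'wb') { |f| f.write URI.open(src).read }
  img['src'] = local  # point the saved HTML at the local copy
rescue StandardError
  next  # skip assets that fail to download or parse
end

File.open('some_web_site.html', 'w') { |fo| fo.puts doc.to_html }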
