What is a Ruby web crawler library that allows XPath access and the equivalent of "save as webpage, complete"?


I don't need to crawl the whole internet; I just need to open a few URLs, extract other URLs from them, and then save some pages in a way that they can be browsed on disk later. What library would be appropriate for that?


Mechanize is very good for this sort of thing.

http://mechanize.rubyforge.org/mechanize/

In particular this page will help:

http://mechanize.rubyforge.org/mechanize/GUIDE_rdoc.html
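For the task in the question, a minimal Mechanize sketch might look like this (the URL and output filename are placeholders, not anything from the question):

require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://some.web.site')  # placeholder URL

# Every link on the page, as href strings.
hrefs = page.links.map(&:href)

# Write the fetched page to disk so it can be opened later.
page.save('some_web_site.html')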


Under the covers, Mechanize uses Nokogiri to parse documents. Here's a simple version using Open-URI and Nokogiri directly to read a page, extract all the links, and write the HTML.

Added example:

require 'open-uri'
require 'nokogiri'

# URI.open is the open-uri entry point; on modern Rubies, Kernel#open
# no longer handles URLs.
doc = Nokogiri::HTML(URI.open('http://some.web.site'))

Accessing the links is easy. This uses CSS accessors:

hrefs = (doc/'a[href]').map{ |a| a['href'] }

This uses XPath to do the same thing:

hrefs = (doc/'//a[@href]').map{ |a| a['href'] }
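The hrefs extracted either way are often relative. A small sketch resolving them against the page URL with Ruby's standard URI library (the base URL is the placeholder from above; hrefs that URI cannot parse are simply skipped):

require 'uri'

base = 'http://some.web.site'
absolute = hrefs.map { |href|
  URI.join(base, href).to_s rescue nil  # nil for malformed hrefs
}.compact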

Saving the content is easy. Open a file and ask Nokogiri to spit the document out as HTML (File.open rather than File.new, because File.new ignores the block):

File.open('some_web_site.html', 'w') { |fo| fo.puts doc.to_html }
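The question also asks for the equivalent of "save as webpage, complete", which means fetching the page's assets and relinking them before saving. A rough sketch for images only, continuing from the doc above (the assets directory, file-naming scheme, and error handling are all illustrative assumptions):

require 'fileutils'
require 'uri'

FileUtils.mkdir_p('assets')

doc.css('img[src]').each_with_index do |img, i|
  src   = URI.join('http://some.web.site', img['src']).to_s
  local = File.join('assets', "img_#{i}#{File.extname(URI(src).path)}")
  File.open(local, 'wb') { |f| f.write URI.open(src).read }
  img['src'] = local  # point the saved HTML at the local copy
rescue StandardError
  next  # skip assets that fail to download or parse
end

File.open('some_web_site.html', 'w') { |fo| fo.puts doc.to_html }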
