Hpricot(html).inner_text.gsub("\r"," ").gsub("\n"," ").split(" 开发者_JAVA百科").join(" ")
hpricot = Hpricot(html)
hpricot.search("script").remove
hpricot.search("link").remove
hpricot.search("meta").remove
hpricot.search("style").remove
found it on http://www.savedmyday.com/2008/04/25/how-to-extract-text-from-html-using-rubyhpricot/
Nokogiri and Hpricot are pretty interchangeable. I.e. Nokogiri(html) is an equivalent of Hpricot(html). Not really sure I understand what the linked article is trying to achieve, but to:
Extract text from HTML body which includes ignoring large white spaces between tags and words.
This would be an easier approach in Hpricot, and remove the need for the hpricot.search("script").remove bits. I.e. Just get the body in the first place:
Hpricot(html).search('body').inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")
And in Nokogiri:
Nokogiri(html).search('body').inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")
 
         
                                         
                                         
                                         
                                        ![Interactive visualization of a graph in python [closed]](https://www.devze.com/res/2023/04-10/09/92d32fe8c0d22fb96bd6f6e8b7d1f457.gif) 
                                         
                                         
                                         
                                         加载中,请稍侯......
 加载中,请稍侯......
      
精彩评论