Scraping English Words using Python

Developer https://www.devze.com 2023-03-14 10:38 Source: Internet

I would like to scrape all English words from, say, the New York Times front page. I wrote something like this in Python:

import re
from urllib import FancyURLopener

class MyOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'            

opener = MyOpener()
url = "http://www.nytimes.com"
h = opener.open(url)
content = h.read()
tokens = re.findall("\s*(\w*)\s*", content, re.UNICODE) 
print tokens

This works okay, but I get HTML keywords such as "img" and "src" as well as English words. Is there a simple way to get only English words when scraping web pages / HTML?

I saw this post, but it only seems to talk about the mechanics of scraping; none of the tools mentioned address how to filter out non-language elements. I am not interested in links, formatting, etc. Just plain words. Any help would be appreciated.

Are you sure you want "English" words -- in the sense that they appear in some dictionary? For example, if you scraped an NYT article, would you want to include "Obama" (or "Palin" for you Blue-Staters out there), even though they probably don't appear in any dictionaries yet?

In many cases it is better to parse the HTML (using BeautifulSoup as Bryan suggests) and include only the text nodes (and maybe some aimed-at-humans attributes like "title" and "alt").
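A minimal sketch of the text-node idea, assuming Python 3 and using the standard library's html.parser in place of BeautifulSoup. The names TextExtractor and extract_words and the sample markup are made up for illustration. It keeps text nodes plus "title" and "alt" values, and skips the contents of script and style elements:

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects human-readable text: text nodes plus title/alt attributes."""

    SKIP = {"script", "style"}        # elements whose content is not prose
    HUMAN_ATTRS = {"title", "alt"}    # attributes aimed at human readers

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        for name, value in attrs:
            if name in self.HUMAN_ATTRS and value:
                self.chunks.append(value)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside <script> or <style>.
        if not self._skip_depth:
            self.chunks.append(data)


def extract_words(html):
    """Return runs of letters found in the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # [^\W\d_]+ matches letters only (no digits, no underscores).
    return re.findall(r"[^\W\d_]+", " ".join(parser.chunks))


# Hypothetical sample markup, not taken from any real page.
sample = ('<p class="lead">Hello <img src="a.png" alt="red apple"> world</p>'
          '<script>var x = 1;</script>')
words = extract_words(sample)
print(words)
```

Tag names, attribute names such as "src" and "class", and the script body never reach the word list, because only text nodes and the two whitelisted attributes are collected.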

You would need some sort of English dictionary reference. A simple way of doing this would be to use a spellchecker. PyEnchant comes to mind.

From the PyEnchant website:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>>

In your case, perhaps something along the lines of:

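import enchant
d = enchant.Dict("en_US")
# skip empty tokens: the regex above can produce "", which d.check() rejects with an error
english_words = [tok for tok in tokens if tok and d.check(tok)]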

If that's not enough and you don't want "English words" that may appear in an HTML tag (such as an attribute) you could probably use BeautifulSoup to parse out only the important text.

Html2Text can be a good option.

import html2text

print html2text.html2text(your_html_string)

I love using the lxml library for this:

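# adapted from http://lxml.de/lxmlhtml.html#examples
import urllib
from lxml.html import fromstring

def first_text_by_class(url, class_name):
    # text of the first element carrying class_name, or None if there is none
    content = urllib.urlopen(url).read()
    doc = fromstring(content)
    els = doc.find_class(class_name)
    if els:
        return els[0].text_content()

# e.g. first_text_by_class('http://microformats.org/', some_class_name)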

Then, to ensure the scraped words are only English words, you could look them up in a dictionary you load from a text file, or use NLTK, which comes with many cool corpora and language processing tools.
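As a rough sketch of the text-file lookup (Python 3), assuming a word list with one word per line: the helper names load_wordlist and keep_known are hypothetical, and an in-memory io.StringIO stands in for the real file so the example is self-contained.

```python
import io


def load_wordlist(lines):
    """Build a lowercase set from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def keep_known(tokens, wordlist):
    """Keep tokens whose lowercase form appears in the word list."""
    return [tok for tok in tokens if tok.lower() in wordlist]


# Stand-in for open("words.txt"); the file name and contents are hypothetical.
fake_file = io.StringIO("the\nnews\ntoday\nimage\n")
wordlist = load_wordlist(fake_file)

tokens = ["The", "news", "img", "src", "today", "Obama"]
english = keep_known(tokens, wordlist)
print(english)
```

A set makes each lookup cheap even for a large list. Note that a proper noun such as "Obama" is dropped unless the list happens to contain it, which is the trade-off raised in the first answer.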

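You can replace every tag, i.e. everything matching <.*?>, with nothing or a space. Use the re module, and make sure you understand greedy and non-greedy pattern matching. You need non-greedy for this, because the greedy <.*> matches from the first < all the way to the last > on a line.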

Then once you have stripped all the tags, apply the strategy you were using.
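A small sketch of that approach in Python 3, with a made-up HTML string, showing why the non-greedy form matters. Treat it as an illustration only: a regex like this is easily fooled by comments, script bodies, or attribute values that contain ">".

```python
import re

# Hypothetical one-line HTML fragment for illustration.
html = '<p>Breaking <b>news</b> from <a href="/world">the world</a></p>'

# Greedy: .* runs to the LAST '>' on the line, so everything is swallowed.
greedy = re.sub(r"<.*>", " ", html)

# Non-greedy: .*? stops at the FIRST '>', so each tag is removed on its own.
stripped = re.sub(r"<.*?>", " ", html)

# \w+ (one or more) avoids the empty matches that \w* would produce.
tokens = re.findall(r"\w+", stripped)

print(repr(greedy))
print(tokens)
```

Here the greedy substitution leaves only a single space, while the non-greedy one leaves the five words between the tags.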