Sort words by their usage

Developer (https://www.devze.com), 2023-04-13 03:19, source: web
I have a list of English words (approx. 10,000) and I'd like to sort them by how often they are used in literature, newspapers, blogs, etc. Can I sort them in Python or another language? I have heard about NLTK, which is the closest library I know of that could help. Or is this a task for another tool?

thank you


Python and NLTK are well suited to sorting your wordlist, as NLTK comes with several corpora of the English language, from which you can extract frequency information.

The following code prints a given wordlist in order of word frequency in the Brown corpus:

import nltk
from nltk.corpus import brown

# one-time setup if the corpus is not installed yet: nltk.download("brown")
wordlist = ["corpus", "house", "the", "Peter", "asdf"]
# collect frequency information from the Brown corpus; might take a few seconds
freqs = nltk.FreqDist(w.lower() for w in brown.words())
# sort wordlist by word frequency, most frequent first
wordlist_sorted = sorted(wordlist, key=lambda x: freqs[x.lower()], reverse=True)
# print the sorted list
for w in wordlist_sorted:
    print(w)

Output:

the
house
Peter
corpus
asdf

If you want to use a different corpus or get more information, have a look at chapter 2 of the NLTK book.
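
As a minimal sketch of the sorting step on its own: the function below takes any word-to-count mapping that returns 0 for missing words (in NLTK 3, FreqDist behaves like a collections.Counter, so it should fit) and also reports the words that never occur, such as "asdf" above. The name split_by_frequency and the toy counts are made up for illustration; they are not real Brown corpus figures.

```python
from collections import Counter

def split_by_frequency(words, freqs):
    """Return (known, unknown): known words sorted most frequent first,
    unknown words (count 0) in their original order."""
    known = [w for w in words if freqs[w.lower()] > 0]
    unknown = [w for w in words if freqs[w.lower()] == 0]
    # list.sort() is stable, so words with equal counts keep their input order
    known.sort(key=lambda w: freqs[w.lower()], reverse=True)
    return known, unknown

# Toy counts standing in for a real corpus (illustrative numbers only)
toy_freqs = Counter({"the": 1000, "house": 50, "peter": 20, "corpus": 2})

known, unknown = split_by_frequency(["corpus", "house", "the", "Peter", "asdf"], toy_freqs)
print(known)    # ['the', 'house', 'Peter', 'corpus']
print(unknown)  # ['asdf']
```

Keeping the unknown words separate makes it easier to spot typos or words the chosen corpus simply does not cover.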


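You can use collections.Counter to count how often each word occurs in your own text. The code is then as easy as:

import collections

words = get_iterable_or_list_of_words()  # placeholder: supplying the words is up to you
c = collections.Counter(words)
# most_common() returns (word, count) pairs, most frequent first
print(c.most_common())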
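
If the counts should come from your own text rather than a ready-made corpus, a rough sketch could look like this. The sample text, the regex tokenizer, and the variable names are assumptions for illustration; real text usually needs more careful tokenization.

```python
import re
from collections import Counter

# Hypothetical sample text; in practice, read your own files here
text = "The cat sat on the mat. The dog saw the cat."

# Crude tokenizer: lowercase everything, then take runs of letters and apostrophes
tokens = re.findall(r"[a-z']+", text.lower())
counts = Counter(tokens)

wordlist = ["dog", "cat", "the", "zebra"]
# A Counter returns 0 for missing words, so they end up last
ranked = sorted(wordlist, key=lambda w: counts[w.lower()], reverse=True)
print(ranked)  # ['the', 'cat', 'dog', 'zebra']
```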


I don't know much about natural language processing, but I think Python is an ideal language for this purpose.

A Google search for "Python natural language" found:

http://www.nltk.org/

A search of StackOverflow found this answer:

Python or Java for text processing (text mining, information retrieval, natural language processing)

Which in turn linked to Pattern:

http://www.clips.ua.ac.be/pages/pattern

You might want to take a look at Pattern, which seems promising.

Good luck and have fun!
