We are building a database of scientific papers and analyzing the abstracts. The goal is to be able to say "Interest in this topic has gone up 20% from last year". I've already tried keyword analysis and haven't really liked the results. So now I am trying to move on to phrases and the proximity of words to each other, and I realize I am in over my head. Can anyone point me to a better solution, or at the very least give me a good term to google to learn more?
The language used is Python, but I don't think that really affects your answer. Thanks in advance for the help.
It is a big subject, but a good introduction to NLP like this can be found with the NLTK toolkit. It is intended for teaching and works with Python, i.e. it is good for dabbling and experimenting. There is also a very good open source book (also available in paper form from O'Reilly) on the NLTK website.
This is just a guess; I'm not sure if this approach will work. If you're looking at phrases and the proximity of words, perhaps you can build up a Markov chain? That way you can get an idea of how often certain words or phrases follow others (how many preceding words are considered depends on the order of your Markov chain).
So you build a Markov chain and frequency distribution for the year 2009. Then you build another one at the end of 2010 and compare the frequencies of certain phrases and words. You might have to normalize the text though (for example, lowercase it and strip punctuation), and compare relative rather than raw frequencies, since the number of abstracts will differ from year to year.
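To make the idea above a bit more concrete, here is a minimal sketch using only the standard library. It counts word pairs (bigrams, i.e. the transitions of a first-order Markov chain) per year and compares their relative frequencies. The function names, the simple regex tokenizer, and the toy abstracts are all made up for illustration; a real pipeline would probably want proper tokenization and stop-word handling (for example via NLTK).

```python
import re
from collections import Counter


def tokenize(text):
    # Crude normalization (an assumption): lowercase, keep alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def bigram_counts(abstracts):
    # Count adjacent word pairs within each abstract (never across abstracts).
    counts = Counter()
    for abstract in abstracts:
        tokens = tokenize(abstract)
        counts.update(zip(tokens, tokens[1:]))
    return counts


def relative_frequency(counts, bigram):
    # Share of all bigrams in the corpus that are this bigram.
    total = sum(counts.values())
    return counts[bigram] / total if total else 0.0


def transition_probability(counts, current, following):
    # First-order Markov estimate of P(following | current).
    from_current = sum(n for (first, _), n in counts.items() if first == current)
    return counts[(current, following)] / from_current if from_current else 0.0


def percent_change(old, new):
    # Returns None when the phrase did not occur in the earlier year.
    if old == 0:
        return None
    return (new - old) / old * 100.0


# Toy data, purely hypothetical.
abstracts_2009 = ["Gene expression in yeast", "Protein folding in yeast"]
abstracts_2010 = ["Gene expression in mice", "Gene expression profiling"]

counts_2009 = bigram_counts(abstracts_2009)
counts_2010 = bigram_counts(abstracts_2010)

phrase = ("gene", "expression")
change = percent_change(
    relative_frequency(counts_2009, phrase),
    relative_frequency(counts_2010, phrase),
)
print(phrase, "changed by", change, "percent")
```

Dividing by the total number of bigrams is what keeps a year with more abstracts from automatically looking like "more interest". The same counters can be extended to trigrams or longer phrases by zipping more shifted copies of the token list.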
Other than that, natural language processing techniques in general come to mind (there is a lot of literature on the topic!).