How do I count the number of words (text) in an HTML source

Developer https://www.devze.com 2023-03-06 05:46 Source: Internet

I have some HTML documents for which I need to return the number of words in the document. This count should only include actual text (so no HTML tags, e.g. html, br, etc.).

Any ideas how to do this? Naturally, I would prefer to re-use some code.

Thanks,

Assaf


  • Strip out the HTML tags and get the text content; jsoup can do this for you.

  • Read the file line by line, keep a Map<String, Integer> wordToCountMap, and update it as you go.
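
The map-based suggestion can be sketched roughly as follows. This is only an illustration, assuming the text has already been extracted from the HTML; the class and method names (WordFrequency, countWords, totalWords) are hypothetical, and punctuation is deliberately left attached to the tokens to keep the sketch short.

```java
import java.util.HashMap;
import java.util.Map;

class WordFrequency {
    // Counts how often each whitespace-separated token occurs in already-extracted text.
    // Assumption: tokens are lower-cased, and punctuation is NOT stripped.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> wordToCountMap = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input splits into a single empty token
            }
            wordToCountMap.merge(token.toLowerCase(), 1, Integer::sum);
        }
        return wordToCountMap;
    }

    // The total word count is the sum of the per-word counts.
    static int totalWords(Map<String, Integer> counts) {
        int total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        return total;
    }
}
```

If only the total is needed, the map is more than required, but it makes it cheap to also report the most frequent words.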


Solution with jsoup

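import org.jsoup.Jsoup;

// Returns the number of whitespace-separated words in the visible text of the HTML.
private int countWords(String html) {
    // text() drops the tags and normalizes runs of whitespace
    String text = Jsoup.parse(html).text().trim();

    // splitting an empty string yields one empty token, so handle that case explicitly
    return text.isEmpty() ? 0 : text.split("\\s+").length;
}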


I would add an extra step to Jigar's answer:

  • Parse out the document text using jsoup, Jericho or dom4j.
  • Tokenise the resulting text. This depends on your definition of a "word". It is unlikely to be as simple as splitting on whitespace, and you will need to deal with punctuation etc. So take a look at the various tokenisers available, e.g. from the Lucene or Stanford NLP projects. Here are some simple examples you will encounter:

    "Today I'm going to New York!" - Is "I'm" one word or two? What about "New York"?

    "We applied two meta-filters in the analysis" - Is "meta-filters" one word or two?

    And what about badly formatted text, e.g. a missing space at the end of a sentence:

    "So we went there.And on arrival..."

    Tokenising is tricky...

  • Iterate through your tokens and count them up, e.g. using a HashMap.
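
To make the "definition of a word" concrete, here is a minimal sketch of one possible policy, written with a plain regular expression from the JDK rather than a Lucene or Stanford NLP tokeniser. The policy itself is an assumption for illustration: a word is a run of letters or digits, and an apostrophe or hyphen between two such runs keeps them joined, so "I'm" and "meta-filters" count as one word each while "there.And" is split in two. The class name SimpleTokeniser is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleTokeniser {
    // Letters/digits, optionally joined by an apostrophe (straight or curly) or a hyphen.
    // This is one illustrative policy, not a general-purpose tokeniser.
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{N}]+(?:['\\u2019-][\\p{L}\\p{N}]+)*");

    static List<String> tokenise(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }
}
```

Under this policy the three example sentences above yield 6, 7 and 7 tokens respectively; tokenise(text).size() is then the word count. "New York" still counts as two words, which is the kind of case where a proper NLP tokeniser or entity recogniser would be needed.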