Split string into meaningful words

Developer https://www.devze.com 2023-03-26 17:13 Source: Internet
I am developing an application in Java which will parse an XML file, retrieve keywords from it and store them in my database. Users can then search for these keywords and retrieve the related data.

Now the problem is that the XML file has words like "literacy_male", "infantmortalityrate_female" etc. For the first one I can split the words at "_" before storing, but for the second one I am not sure how I can split the word into meaningful words.

I am using Apache Lucene to do the full text search.


One possibility is to increase the index size by adding all substrings of each string. So for "abc" you will store: "a","b","c","ab","bc","abc" (that is O(n^2) strings for a string of length n).
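A minimal sketch of how those substrings could be generated before they are handed to the indexer. The class and method names are made up for illustration, and the Lucene indexing step itself is left out:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class SubstringTerms {
    // Returns every distinct non-empty substring of the keyword, shortest first.
    static Set<String> allSubstrings(String keyword) {
        Set<String> terms = new LinkedHashSet<>();
        for (int len = 1; len <= keyword.length(); len++) {
            for (int start = 0; start + len <= keyword.length(); start++) {
                terms.add(keyword.substring(start, start + len));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(allSubstrings("abc")); // [a, b, c, ab, bc, abc]
    }
}
```

A string of length n has n(n+1)/2 substrings, so a 19-character keyword such as "infantmortalityrate" already produces up to 190 terms.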

One more possibility is using wildcards. Index whatever you have, and search for:
<term>*,a*<term>*,...,z*<term>* instead of <term>. It will take a LOT more time, but it will not increase the index size.
Note: it is necessary to search for so many terms because you CANNOT use a wildcard as the first character of a term.
a*<term>* means: search for all terms that start with a, followed by zero or more characters, then <term>, and then zero or more characters again.

More info about terms and wildcards in Lucene: http://lucene.apache.org/java/2_0_0/queryparsersyntax.html
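As a rough sketch, the expanded list of wildcard queries could be built like this. The helper names are hypothetical, and the sketch assumes every indexed term starts with a lowercase letter a-z; joining the alternatives with OR is just one way to send them as a single query string:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that only builds query strings; it does not call Lucene.
class WildcardExpansion {
    // Builds term*, a*term*, b*term*, ..., z*term* (27 patterns in total).
    static List<String> expand(String term) {
        List<String> queries = new ArrayList<>();
        queries.add(term + "*");
        for (char c = 'a'; c <= 'z'; c++) {
            queries.add(c + "*" + term + "*");
        }
        return queries;
    }

    // Assumption: the alternatives are combined with the query parser's OR operator.
    static String toQueryString(String term) {
        return String.join(" OR ", expand(term));
    }

    public static void main(String[] args) {
        System.out.println(toQueryString("rate"));
    }
}
```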

EDIT:

A combination of those will provide (in my opinion) the best solution:
index all suffixes of the string, and then for each term (and not for the whole query!) search for <term>* instead of <term>. If the term exists as a substring, it is also a prefix of at least one indexed suffix, so the prefix search will find it.

For example: if you have "lifeexpectancy", you will index:
"lifeexpectancy","ifeexpectancy","feexpectancy","eexpectancy",...,"y"
For the same example, when you want to search for life expectancy, you will search for life* expectancy*
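The idea can be tried out without Lucene. The sketch below uses an in-memory TreeMap as a stand-in for the index (an assumption made purely for illustration), stores every suffix of each keyword, and answers a <term>* search with a sorted-range lookup:

```java
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory stand-in for the index; class and method names are hypothetical.
class SuffixIndex {
    // Maps each suffix to the keywords it was taken from.
    private final TreeMap<String, Set<String>> index = new TreeMap<>();

    // Indexes all suffixes of the keyword: n entries for a keyword of length n.
    void add(String keyword) {
        for (int i = 0; i < keyword.length(); i++) {
            index.computeIfAbsent(keyword.substring(i), k -> new TreeSet<>()).add(keyword);
        }
    }

    // Emulates the prefix query "term*": collects every suffix that starts with term.
    Set<String> search(String term) {
        Set<String> hits = new TreeSet<>();
        for (Set<String> keywords : index.subMap(term, term + Character.MAX_VALUE).values()) {
            hits.addAll(keywords);
        }
        return hits;
    }

    public static void main(String[] args) {
        SuffixIndex idx = new SuffixIndex();
        idx.add("lifeexpectancy");
        idx.add("literacy");
        System.out.println(idx.search("expectancy")); // [lifeexpectancy]
    }
}
```

Compared with storing all substrings, this keeps the index at n entries per keyword instead of O(n^2).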


There's no purely algorithmic way to accomplish your goal, nor is there a way to do it with high reliability. You'd basically need a dictionary of "meaningful" words to search, and "peel" each word off a long combo by searching the dictionary for the longest word that is a prefix of what remains. But you can run amok if, e.g., you have "workmanhours" and you parse it into "workman" "hours" when it maybe should be "work" "man" "hours".
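A small sketch of that "peel off the longest prefix" idea, using a tiny made-up dictionary. It is greedy and does not backtrack, so it shows exactly the "workmanhours" pitfall described above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper; the dictionary contents are invented for the example.
class GreedySplitter {
    // Repeatedly takes the longest dictionary word that prefixes the remaining text.
    // Returns an empty list when it gets stuck (no backtracking is attempted).
    static List<String> split(String text, Set<String> dictionary) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = text.length();
            while (end > pos && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            if (end == pos) {
                return List.of(); // nothing in the dictionary matches here
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("work", "workman", "man", "hours", "infant", "mortality", "rate");
        System.out.println(split("infantmortalityrate", dict)); // [infant, mortality, rate]
        System.out.println(split("workmanhours", dict));        // [workman, hours]
    }
}
```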

You could possibly finesse your search scheme by indexing selected character sequences rather than words. E.g., build an index of all sequences that start with a leading vowel and then similarly strip your search terms down to a leading vowel.


You'll need to set some rules about how the XML file must be formatted in order to get this working.

I guess you can't manipulate the XML-File (or it is already created and populated)?

If you can (or it's being generated by your code), you'll need to set some rules like

  • Keywords are separated by a comma (,)
  • Keywords have no spaces but use _ instead

With these rules, you'll be able to write a parser which can make sense of your keyword strings.

If you can't do that, you'll need to parse each keyword, try the different parsings (like "split by _") and see which one produces the best output. But this will be challenging and will take time.

Please also add a sample of your XML-file to your original question.


Computers are not intelligent; they only understand what you tell 'em. So it would be easier if you maintained some standard while generating your XML file. Otherwise I don't think there is any way to convert "infantmortalityrate" into "infant+mortality+rate".


If you had a database of the strings that can be contained in that string, you could do this:

Split the string by the separators you can identify (like _ , - ...), and after that break each part into pieces, where no piece is shorter than the shortest string in the DB.

For example, if you have a string of 10 chars and the shortest string in the DB is 4 chars, you can get these combos:

4,6
5,5
6,4
10

but not 4,4,2 or something like that.

After that you can look up each piece in the DB, and if every piece is present you can say the string is divided into "meaningful words".

But without that database, or with a dictionary that is too general, you can get stuck on this, or it could be almost impossible.
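A sketch of this procedure, with an in-memory Set standing in for the database (an assumption for the example, as are all names below). It splits on the separators first, then searches for a division of each part in which every piece is in the set, using the shortest entry's length to skip hopeless cuts:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper; "db" stands in for the database of known strings.
class DbSplitter {
    static List<String> splitKeyword(String keyword, Set<String> db) {
        // Length of the shortest known string, never below 1.
        int minLen = Math.max(1, db.stream().mapToInt(String::length).min().orElse(1));
        List<String> result = new ArrayList<>();
        for (String part : keyword.split("[_,\\-]+")) {
            if (part.isEmpty()) continue;
            List<String> pieces = segment(part, db, minLen);
            if (pieces == null) result.add(part); // no valid division: keep the part whole
            else result.addAll(pieces);
        }
        return result;
    }

    // Returns a division of part into known strings, or null if there is none.
    static List<String> segment(String part, Set<String> db, int minLen) {
        if (part.isEmpty()) return new ArrayList<>();
        for (int end = minLen; end <= part.length(); end++) {
            int rest = part.length() - end;
            if (rest != 0 && rest < minLen) continue; // leftover too short (the "4,4,2" case)
            String head = part.substring(0, end);
            if (!db.contains(head)) continue;
            List<String> tail = segment(part.substring(end), db, minLen);
            if (tail != null) {
                tail.add(0, head);
                return tail;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> db = Set.of("infant", "mortality", "rate", "female", "literacy", "male");
        System.out.println(splitKeyword("infantmortalityrate_female", db));
    }
}
```

Unlike a greedy longest-prefix split, this version backtracks, so a wrong first cut does not make the whole part fail.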


Yes, it is possible to split a string into words even if there are no separator characters. This can be solved quite efficiently, in close to O(n). Consider using a prefix-matching regular expression and extracting word by word from your string. You can check this tool as well: http://code.google.com/p/graph-expression/wiki/RegexpOptimization

There is a more robust approach (more effective because it uses global optimisation rather than local optimisation like the previous one): a spell-checking automaton that searches for the most probable split of the string. Check this tutorial on how it's done for Chinese word strings: http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html
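To illustrate what "most probable split" means, here is a much simpler technique than the automaton in the tutorial: dynamic programming over a unigram word-count table. It does not use the LingPipe API, the counts are invented, and this version does O(n^2) substring lookups rather than running in near-linear time:

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: scores a split by the sum of log word probabilities.
class ProbableSplit {
    // Returns the highest-scoring split, or an empty list if none exists.
    static List<String> split(String text, Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) total += c;
        int n = text.length();
        double[] best = new double[n + 1]; // best[i]: best score for text[0..i)
        int[] back = new int[n + 1];       // back[i]: start of the last word in that split
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                Integer count = counts.get(text.substring(start, end));
                if (count == null || best[start] == Double.NEGATIVE_INFINITY) continue;
                double score = best[start] + Math.log((double) count / total);
                if (score > best[end]) {
                    best[end] = score;
                    back[end] = start;
                }
            }
        }
        LinkedList<String> words = new LinkedList<>();
        if (best[n] == Double.NEGATIVE_INFINITY) return words;
        for (int end = n; end > 0; end = back[end]) {
            words.addFirst(text.substring(back[end], end));
        }
        return words;
    }

    public static void main(String[] args) {
        // Invented counts: "workman" is rare, its parts are common.
        Map<String, Integer> counts = Map.of("work", 500, "man", 400, "hours", 300, "workman", 2);
        System.out.println(split("workmanhours", counts)); // [work, man, hours]
    }
}
```

Because the whole string is scored at once, a rare long word such as "workman" loses to the more probable "work" + "man", which a longest-prefix split cannot do.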
