I have a list of words in a file. It might contain contractions like "who's", "didn't", etc. When reading from it, I need to expand them into their proper forms, like "who is" and "did not". This has to be done in Java, and without losing much time.
This is actually for handling such queries during a search that uses Solr.
Below is some sample code I tried, using a hash map:
Map<String, String> con = new HashMap<String, String>();
con.put("'s", " is");
con.put("'d", " would");
con.put("'re", " are");
con.put("'ll", " will");
con.put("n't", " not");
con.put("'nt", " not");
String str = "where'd you're you'll would'nt hello";
String[] words = str.split(" ");
for (int i = 0; i < words.length; i++) {
    int index = words[i].lastIndexOf('\'');
    // Only rewrite words that contain an apostrophe; keep looping past the others.
    if (index > -1) {
        // Look up the suffix that starts at the apostrophe, e.g. "'d" or "'re".
        String temp = words[i].substring(index);
        if (con.containsKey(temp)) {
            temp = con.get(temp);
        }
        words[i] = words[i].substring(0, index) + temp;
    }
    System.out.println(words[i]);
}
If your concern is that queries containing, e.g., "who's" should find documents containing, e.g., "who is", then you should look at using a stemmer, which is designed exactly for this purpose.
You can easily add a stemmer by configuring it as a filter in your Solr config. See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
Edit:
A SnowballPorterFilterFactory will probably do the job for you.
Following on from @James Jithin's last remark:
- the "'s" -> " is" transform is incorrect if the word is a possessive form.
- the "'d" -> " would" transform is incorrect in archaic forms, where the "'d" can be a contraction of "ed".
- the "'nt" -> " not" transform is not correct because this is really just a misspelling of the "n't" contraction. (I mean "wo'nt" is just plain wrong ... isn't it?)
So, to my mind, the best way to implement this would be to enumerate the small number of contractions that are common and valid, and leave the rest alone. This also has the advantage that you can implement it with a simple string match rather than a suffix match.
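As a rough illustration of the enumerate-and-match idea, here is a minimal sketch. The class name, the method name, and the particular entries in the table are all hypothetical; a real list would need to be chosen for your data. It also assumes words are separated by single spaces and carry no attached punctuation.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class ContractionExpander {
    // Hypothetical, deliberately short table of whole-word contractions.
    private static final Map<String, String> CONTRACTIONS = new HashMap<>();
    static {
        CONTRACTIONS.put("didn't", "did not");
        CONTRACTIONS.put("wouldn't", "would not");
        CONTRACTIONS.put("won't", "will not");
        CONTRACTIONS.put("you're", "you are");
        CONTRACTIONS.put("you'll", "you will");
    }

    // Replaces each word that appears in the table; every other word,
    // including possessives such as "script's", is left untouched.
    public static String expand(String text) {
        String[] words = text.split(" ");
        for (int i = 0; i < words.length; i++) {
            String replacement = CONTRACTIONS.get(words[i].toLowerCase(Locale.ROOT));
            if (replacement != null) {
                words[i] = replacement;
            }
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // prints: you are sure the script's output did not change
        System.out.println(expand("you're sure the script's output didn't change"));
    }
}
```

Because the lookup is on the whole word, irregular forms like "won't" can map straight to "will not", which a suffix rule cannot do.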
Using the logic you have, the code can be written as:
Map<String, String> con = new HashMap<String, String>();
con.put("'s", " is");
con.put("'d", " would");
con.put("'re", " are");
con.put("'ll", " will");
con.put("n't", " not");
con.put("'nt", " not");
String str = "where'd you're you'll would'nt hello";
for (Map.Entry<String, String> entry : con.entrySet()) {
    // The trailing \b only lets the suffix match at the end of a word.
    str = str.replaceAll(entry.getKey() + "\\b", entry.getValue());
}
But suppose the word is "script's", which shows possession: changing it to "script is" alters the meaning.
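To make the possessive problem concrete, here is a small sketch that applies only the "'s" rule to two inputs. The class and method names are made up for the demonstration.

```java
class PossessiveDemo {
    // Applies just the "'s" -> " is" suffix rule, anchored at a word end.
    public static String expandApostropheS(String text) {
        return text.replaceAll("'s\\b", " is");
    }

    public static void main(String[] args) {
        // A genuine contraction: prints "who is there"
        System.out.println(expandApostropheS("who's there"));
        // A possessive: prints "the script is output", which changes the meaning
        System.out.println(expandApostropheS("the script's output"));
    }
}
```

The rule has no way to tell the two cases apart from the suffix alone, so the second result is wrong.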