Testing if a string contains one of several thousand substrings

Developer https://www.devze.com 2023-04-06 09:02 Source: Internet
I'm going to be running through live Twitter data and attempting to pull out tweets that mention, for example, movie titles. Assuming I have a list of ~7000 hard-coded movie titles I'd like to check against, what's the best way to select the relevant tweets? This project is in its infancy, so I'm open to looking into any solution (i.e. language agnostic). Any help would be greatly appreciated.

Update: I'd be curious if anyone has any insight into how the Yahoo! Placemaker API solves this problem. It can take a text string and return a geocoded JSON result of all the locations mentioned in it.


You could try Wu and Manber's A Fast Algorithm For Multi-Pattern Searching.

The multi-pattern matching problem lies at the heart of virus scanning, so you might look to scanner implementations for inspiration. ClamAV, for example, is open source and some papers have been published describing its algorithms:

Lin, Lin and Lai: A Hybrid Algorithm of Backward Hashing and Automaton Tracking for Virus Scanning (a variant of Wu-Manber; the paper is behind the IEEE paywall).

Cha, Moraru, et al: SplitScreen: Enabling Efficient, Distributed Malware Detection


If you use compiled regular expressions, it should be pretty fast, especially if you combine many titles into a single expression.
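As a rough illustration of the single-expression idea, here is a minimal Python sketch. The `TITLES` sample and the helper names `build_title_regex` and `find_titles` are made up for this example, and the case-insensitive, word-boundary matching is an assumption about what counts as a "mention", not something the question specifies.

```python
import re

# Hypothetical sample; the real list would hold the ~7000 titles.
TITLES = ["Inception", "The Matrix", "Up", "Toy Story"]

def build_title_regex(titles):
    # Put longer titles first so that, at the same position, a longer
    # title is tried before any shorter title that is its prefix.
    ordered = sorted(titles, key=len, reverse=True)
    # re.escape keeps punctuation inside titles from acting as regex syntax.
    alternation = "|".join(re.escape(t) for t in ordered)
    # \b assumes each title starts and ends with a word character.
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

# Compile once and reuse the compiled pattern for every incoming tweet.
TITLE_RE = build_title_regex(TITLES)

def find_titles(tweet):
    # Return each matched title exactly as it appears in the tweet.
    return [m.group(0) for m in TITLE_RE.finditer(tweet)]
```

Short titles such as "Up" will also match the ordinary word "up", so some filtering after the match would probably still be needed.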


Efficiently searching for many terms in a long character sequence would require a specialized algorithm to avoid testing for every term at every position.

But since it sounds like you have short strings with a known pattern, you should be able to use something fairly simple. Store the set of titles you care about in a hash table or tree. Parse out "string1" and "string2" from each tweet using a regex, and test whether they are contained in the set.
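A minimal sketch of that parse-then-look-up approach, assuming the known pattern is "string1 is better than string2". The title set, the punctuation stripped by `normalize`, and the function names are all hypothetical choices for illustration.

```python
import re

# Hypothetical sample set, stored lowercase so lookups ignore case.
TITLES = {"inception", "the matrix", "toy story"}

# Assumed tweet shape: "<string1> is better than <string2>".
COMPARE_RE = re.compile(
    r"(?P<string1>.+?)\s+is better than\s+(?P<string2>.+)",
    re.IGNORECASE,
)

def normalize(text):
    # Trim surrounding whitespace and common punctuation, then lowercase.
    return text.strip(" \t\n.,!?\"'").lower()

def titles_compared(tweet):
    """Return the parsed strings that are known titles, in tweet order."""
    m = COMPARE_RE.search(tweet)
    if m is None:
        return []
    candidates = (normalize(m.group("string1")), normalize(m.group("string2")))
    # Each membership test is a single hash lookup, whatever the set size.
    return [c for c in candidates if c in TITLES]
```

This only recognizes a title when it makes up the whole of one side of the phrase. A tweet like "I think Inception is better than The Matrix" would need extra trimming of the left-hand side before the lookup.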


Working off what erickson suggested, the most feasible search is for the fixed phrase ("is better than" in your example), then checking the results for one of the 7,000 terms. You could instead narrow the set by creating 7,000 searches for "[movie] is better than" and then filtering manually on the second movie, but you'll probably hit the search rate limit pretty quickly.

You could speed up the searching by using a dedicated search service like Solr instead of using text parsing. You might be able to pull out titles quickly using some natural language processing service (OpenCalais?), but that would be better suited to batch processing.


For simultaneously searching for a large number of possible targets, the Rabin-Karp algorithm can often be useful.
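To make the Rabin-Karp suggestion concrete, here is a small Python sketch of a multi-pattern variant. It groups patterns by length and slides one rolling hash over the text per distinct length. The constants `BASE` and `MOD` and the function names are arbitrary choices for this example, and a production version would likely tune them differently.

```python
from collections import defaultdict

BASE = 256            # arbitrary polynomial base (assumption)
MOD = 1_000_000_007   # large prime modulus (assumption)

def _hash(s):
    # Polynomial hash of s, reduced modulo MOD.
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

def rabin_karp_multi(text, patterns):
    """Return sorted (position, pattern) pairs for every occurrence."""
    # Group patterns by length, then by hash, so one rolling pass
    # per distinct length checks all patterns of that length at once.
    by_length = defaultdict(lambda: defaultdict(set))
    for p in patterns:
        if p:
            by_length[len(p)][_hash(p)].add(p)

    matches = []
    for m, table in by_length.items():
        if m > len(text):
            continue
        high = pow(BASE, m - 1, MOD)  # weight of the character leaving the window
        h = _hash(text[:m])
        for i in range(len(text) - m + 1):
            if h in table:
                window = text[i:i + m]
                # Compare the actual text to rule out hash collisions.
                if window in table[h]:
                    matches.append((i, window))
            if i + m < len(text):
                # Roll the window: drop text[i], append text[i + m].
                h = ((h - ord(text[i]) * high) * BASE + ord(text[i + m])) % MOD
    return sorted(matches)
```

Because the text is scanned once per distinct pattern length, a list with many different title lengths loses some of the benefit. Matching here is exact and case-sensitive.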
