
Multiple strings matching performance

https://www.devze.com 2023-04-07 01:24 (source: web)
I have an artists table with over 100,000 records that I use to match against an array (between 1 and several thousands) of artists submitted by the user. My current query looks like this:

SELECT id FROM artists WHERE lower(name) IN (downcase_artists)

This does the job fine, but I'm wondering whether it could be made faster. Query time varies from a few hundred ms to as much as 10 whole seconds when it's matching thousands of artists. The name column is indexed. (Does that even make a difference on string columns?)

I was thinking maybe something like Redis would speed this up? By keeping a key-value store of the artist name and its corresponding id?

Is there any other option that I'm missing that would speed this up?

EDIT: as James suggested, I tried implementing an all_artists cached method (using the Memcache add-on on Heroku) and matching my strings against it:

artist_ids = self.all_cached.select{|a| downcase_array.include?(a.name)}.collect(&:id)

I shaved off a little db query time, but total request time increased drastically:

Before: Completed 200 OK in 1853ms (Views: 164.2ms | ActiveRecord: 1476.3ms)  
After: Completed 200 OK in 12262ms (Views: 169.2ms | ActiveRecord: 1200.6ms)

I'm getting similar results when I run it locally:

Before: Completed 200 OK in 405ms (Views: 75.6ms | ActiveRecord: 135.4ms)
After: Completed 200 OK in 3205ms (Views: 245.1ms | ActiveRecord: 126.5ms)

Putting the ActiveRecord times aside, it looks like taking the string matching off the query worsened my issue (and that's with as few as 100 strings). Or am I missing something?

I also had a look at full-text search engines such as Sphinx, but they definitely sound like overkill, since I'm only searching through a single column...

EDIT 2: I finally managed to bring the request time down:

Before: Completed 200 OK in 1853ms (Views: 164.2ms | ActiveRecord: 1476.3ms)  
Now: Completed 200 OK in 226ms (Views: 127.2ms | ActiveRecord: 48.7ms)

using a Redis store of JSON strings (see full answer).


The usage of IN can be quite costly, if I remember right. How about this:

caches_action :gather_artist_ids

def gather_artist_ids
  @all_artists = Artist.all(:select => "id, name")
end

then, wherever you want the query:

@downcase_artists = ["joe schmo", "sally sue", ...]
@requested_artists = @all_artists.select { |a| @downcase_artists.include?(a.name.downcase) }.collect(&:id)

You could cache the :gather_artist_ids action and have your sweeper trigger only on after_create, after_update, and after_destroy.
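One caveat with the select above: Array#include? inside select is O(n·m) over artists and query strings, which can dominate once both lists grow. A Hash keyed by lowercased name makes each lookup O(1). A sketch in plain Ruby (the artist data here is hypothetical):

```ruby
# Cached artists, as they might come back from Artist.all(:select => "id, name").
artists = [
  { id: 1, name: "The Reign of Kindo" },
  { id: 2, name: "Dream Theater" },
  { id: 3, name: "A.C.T" },
]

# Build the index once (e.g. alongside the cache), keyed by lowercased name.
ids_by_name = artists.each_with_object({}) { |a, h| h[a[:name].downcase] = a[:id] }

# Matching is now one Hash lookup per submitted name; misses come back as nil.
downcase_array = ["dream theater", "a.c.t", "unknown artist"]
artist_ids = ids_by_name.values_at(*downcase_array).compact
# artist_ids == [2, 3]
```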

MongoDB: I use MongoDB via Mongoid and have 1.51 million records in it and a regex search /usersinput/i takes less than 100ms with an index where needed. It's exceptionally fast.


Since you're storing the artists' names in lower case, and you're searching for the full artist name, this should work for you. I will state the Redis commands; you should easily find the corresponding API calls in your client (use redis-cli first, it will clear things up for you).

Suppose your artists table has 3 records: 'The Reign of Kindo', 'Dream Theater' and 'A.C.T', with corresponding ids 1, 2 and 3.

The basic idea is to load that table into a sorted set. Each member's score will be the artist's id, and the member string will be the artist's name:


Loading phase, filling up the sorted set with all artists (note the lower case):

ZADD artists 1 "the reign of kindo"
ZADD artists 2 "dream theater"
ZADD artists 3 "a.c.t"

Now I will query Redis for the first two bands. This time the idea is to build a temporary sorted set (query:10), which will be intersected with the artists sorted set.

Why not just use query as the key? I assign each query an (arbitrary) id, so there are no collisions between concurrent user searches! We can also refer to the id later when caching the result set for a period (more on that below).

The use of : as a delimiter is a recommended convention (look here).


Query phase, filling up the query sorted set.

ZADD query:10 0 "the reign of kindo"
ZADD query:10 0 "dream theater"
ZINTERSTORE result_query:10 2 artists query:10 WEIGHTS 1 0
EXPIRE result_query:10 600

The score for the query sorted set doesn't matter, so it can be 0.

With ZINTERSTORE, we store in result_query:10 the intersection of 2 keys, artists and query:10. But there is a catch! How are the scores from both keys combined into the final sorted set?

Answer: Redis by default sums them.

Now, we can use the WEIGHTS attribute to zero the scores we don't want. So WEIGHTS 1 0 is saying that only the score for artists will be summed.

Now we have the matching artists in result_query:10, which EXPIRE makes last for 10 minutes (600 seconds). You can figure out a smart way to use this cache =)


Getting the result set

So doing all of the above, you can get the desired result with ZRANGE:

redis> zrange result_query:10 0 -1 withscores
1) "the reign of kindo"
2) "1"
3) "dream theater"
4) "2"

The interval 0 -1 means get all members, and the WITHSCORES option makes ZRANGE return each member's id (its score) together with its string.
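For intuition, the whole pipeline can be simulated in plain Ruby, with member => score hashes standing in for the two sorted sets (no Redis server involved; this only mirrors the semantics of the commands above):

```ruby
# In-memory stand-ins for the two sorted sets (member => score).
artists = {
  "the reign of kindo" => 1,
  "dream theater"      => 2,
  "a.c.t"              => 3,
}
query = { "the reign of kindo" => 0, "dream theater" => 0 }

# ZINTERSTORE ... WEIGHTS 1 0: keep members present in both sets and
# combine scores as 1 * artist_score + 0 * query_score (i.e. keep the id).
result_query = artists.select  { |name, _| query.key?(name) }
                      .map     { |name, id| [name, 1 * id + 0 * query[name]] }
                      .sort_by { |_, score| score }

# Equivalent of ZRANGE result_query:10 0 -1 WITHSCORES:
result_query # => [["the reign of kindo", 1], ["dream theater", 2]]
```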

Hope that all makes sense. It's only the tip of the iceberg for Redis. Good benchmarking and see ya!


I'd consider a full-text search engine (Sphinx, Ferret, Lucene, etc.), some of which end up giving you more interesting search capabilities, unless you'll only ever want to search on artist name.

Then I'd consider just keeping a wad of memory available to perma-cache the names and hit that instead of the DB.


Remove the function "lower(..)" from the query.
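That takes the function out of the predicate so a plain index on name can apply, but it also makes the match case-sensitive. If the database is PostgreSQL (an assumption; the question doesn't say which database is in use), an expression index keeps the lower() call and still allows an index scan:

```sql
-- Expression index (PostgreSQL syntax): lets the planner use an index
-- for WHERE lower(name) IN (...) instead of scanning the whole table.
CREATE INDEX index_artists_on_lower_name ON artists (lower(name));
```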


I ended up using Redis to store not only artist ids and names but the whole json response I return to the user. My Redis hash looks like this:

{"all_artists" => {"artistname1" => "json_response1", "artistname2" => "json_response2", ...}}

I do the matching using the following (redis-rb gem):

REDIS.hmget("all_artists", *downcase_array)

That returns all the json strings (including the artist id, name, and upcoming concerts) for the corresponding artists without ever hitting the db. I'm obviously updating the Redis hash every time artists or concerts are updated.
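A minimal sketch of that lookup, with a plain Ruby Hash standing in for the Redis hash (the artist data is hypothetical; with redis-rb it would be REDIS.hmget, which likewise returns nil for missing fields, in request order):

```ruby
require "json"

# Stand-in for the Redis hash "all_artists": field = lowercased artist name,
# value = the cached JSON response for that artist.
all_artists = {
  "dream theater" => { "id" => 2, "name" => "Dream Theater", "concerts" => [] }.to_json,
}

downcase_array = ["dream theater", "unknown artist"]

# Like HMGET, values_at returns nil for missing fields, in request order.
responses = all_artists.values_at(*downcase_array)
matched   = responses.compact.map { |json| JSON.parse(json) }
# matched.first["id"] # => 2
```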

And the resulting time difference (for 100 artists):

Before: Completed 200 OK in 1853ms (Views: 164.2ms | ActiveRecord: 1476.3ms)  
Now: Completed 200 OK in 226ms (Views: 127.2ms | ActiveRecord: 48.7ms)

There is still some optimization left to be done but the string matching is definitely out of the way now.
