
Gapless secondary ID for fetching random rows in MySQL

Developer https://www.devze.com 2023-03-03 12:55 Source: web

In several projects I've been working on, I have encountered the need to fetch random rows from large (>1M rows) tables. With tables this large, ORDER BY rand() LIMIT 1 is not an option, as it will quickly bring the database to its knees.

The usual solution has been to generate a random number between MIN(id) and MAX(id) and select that row directly. However, if there are big gaps in the id sequence, this will require either lots of re-rolls or using WHERE id >= :myrandomnumber, which will lead to rows that follow large gaps getting significantly more hits than average.

I've been thinking to solve this problem by creating a new indexed column solely for randomizing purposes, say id2. This column would always be a gapless sequence between 1 and the number of rows in the table.

Question: What would be the best way to keep this sequence gapless?

The first solution that comes to mind is creating a helper table recycled_ids that will contain the columns tablename and id2. Whenever a row is deleted from tablename, the id2 of that row is inserted into recycled_ids. When new rows are inserted, the id2 is either taken from recycled_ids or, if none are available, a new one is created. Is there a simpler way?
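As a sketch of that bookkeeping (pure Python standing in for the tablename and recycled_ids tables; all names here are illustrative, not from any framework), assuming deletes push the freed id2 onto a free list and inserts pop from it before minting a new value:

```python
import random

class GaplessTable:
    """Simulates a table whose rows carry a secondary id2, kept dense
    by recycling the id2 values of deleted rows (the recycled_ids
    helper table from the text plays the role of the free list)."""

    def __init__(self):
        self.rows = {}          # id2 -> row data
        self.recycled_ids = []  # id2 values freed by deletes
        self.next_id2 = 1       # high-water mark for fresh id2 values

    def insert(self, data):
        # Reuse a recycled id2 if one is available, otherwise mint a new one.
        if self.recycled_ids:
            id2 = self.recycled_ids.pop()
        else:
            id2 = self.next_id2
            self.next_id2 += 1
        self.rows[id2] = data
        return id2

    def delete(self, id2):
        # Return the freed id2 to the pool so a later insert can fill the hole.
        del self.rows[id2]
        self.recycled_ids.append(id2)

    def random_row(self):
        # Draw id2 uniformly; re-roll the (transient) holes left by
        # deletes that have not been refilled by an insert yet.
        while True:
            id2 = random.randint(1, self.next_id2 - 1)
            if id2 in self.rows:
                return self.rows[id2]
```

Note the one wrinkle: between a delete and the next insert the sequence does have a hole, hence the re-roll loop in random_row; under the insert-heavy workload of the benchmark below those holes stay few and short-lived.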

Bonus questions: Are there ORMs or frameworks that already do this or otherwise have efficient random row selection? Is this an existing pattern, and does it have a name?


Update: I wrote a quick benchmark for this and ran it against a table with 125,000 rows and 30,000 gaps between them. The results are pretty promising:

Fetch a random row 100 times using id2: 0.0234689712524 seconds
Fetch a random row 100 times using ORDER BY rand() LIMIT 1: 54.992347002 seconds

When inserting the test data, I removed one random row for every five rows inserted. The sequence stays gapless the whole time.

// Insert $amount rows, deleting one random row after every fifth insert
for ($i = 1; $i <= $amount; $i++) {
    insert_row();
    if ($i % 5 == 0) {
        delete_random_row();
    }
}

Running that loop again with $amount = 10000 takes 9 seconds on my low-end vserver. That's 0.009 seconds per row and it includes deleting a random row every five iterations. It does get slower as the table grows, but fetching a random row does not.

My original questions still apply.


Here's how I'd do it -

  1. Select the MAX(id) from your table
  2. In PHP (or whatever language you're using), generate a random integer between 1 and MAX(id)
  3. SELECT * FROM table WHERE id >= $random ORDER BY id ASC LIMIT 1
  4. If 3 returns nothing, SELECT * FROM table WHERE id < $random ORDER BY id DESC LIMIT 1

This avoids running any queries that would be brutally slow. It also avoids the extra column, which would be a nasty job to keep gapless!
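Those four steps can be simulated in Python over a sorted list of ids standing in for the indexed primary key (bisect playing the role of the B-tree lookup; the function name is illustrative):

```python
import bisect
import random

def fetch_nearest(ids):
    """Steps 1-4 from the answer above: draw a random value in
    [1, MAX(id)], take the first id at or above it, and fall back
    to the largest id below it if nothing matches."""
    ids = sorted(ids)
    r = random.randint(1, ids[-1])    # steps 1-2: random in [1, MAX(id)]
    i = bisect.bisect_left(ids, r)    # step 3: WHERE id >= r ORDER BY id LIMIT 1
    if i < len(ids):
        return ids[i]
    return ids[-1]                    # step 4: WHERE id < r ORDER BY id DESC LIMIT 1
```

Running this over a gappy id set makes the bias from the question visible: with ids like [1, 2, 100], the id right after the gap absorbs almost all of the probability mass.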


Ranking to the rescue I'd say.

SET @rank := 0;

SELECT * FROM
  (
  SELECT t.*, @rank := @rank + 1 AS ranking FROM table1 t
  ) s
WHERE s.ranking = $random;
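In effect the subquery numbers every row 1..N on the fly, so ranks are always gapless regardless of holes in id; the trade-off is that every fetch scans all N rows. A Python equivalent of what the query computes:

```python
import random

def fetch_by_rank(ids):
    """What the ranking query computes: number the rows 1..N in id
    order, then return the row whose rank equals a random draw in
    [1, N]. Building the ranking touches every row on each fetch."""
    ranked = {rank: row for rank, row in enumerate(sorted(ids), start=1)}
    return ranked[random.randint(1, len(ranked))]
```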
