I'm looking for suggestions for scaling a points leaderboard system. I already have a working version using a very normalized strategy. This first version was essentially a table which looked something like this.
UserPoints - PK: (UserId,Date) +------------+--------+---------------------+ | UserId | Points | Date | +------------+--------+---------------------+ | 1 | 10 | 2011-03-17 07:16:36 | | 2 | 35 | 2011-03-17 08:09:26 | | 3 | 40 | 2011-03-17 08:05:36 | | 1 | 65 | 2011-03-17 09:01:37 | | 2 | 16 | 2011-03-17 10:12:35 | | 3 | 64 | 2011-03-17 12:51:33 | | 1 | 300 | 2011-03-17 12:19:21 | | 2 | 1200 | 2011-03-17 13:24:13 | | 3 | 510 | 2011-03-17 17:29:32 | +------------+--------+---------------------+
I then have a stored procedure which basically does a GroupBy UserID and Sums the Points. I can also pass @StartDate and @EndDate parameters to create a leaderboard for a specific time period. For example, time windows for Top Users for the Day / Week / Month / Lifetime.
This seemed to work well with a moderate amount of data, but things became noticeably slower as the number of points records passed a million or so. The test data I'm working with is just over a million point records created by about 500 users distributed over a timespan of 3 months.
Is there a different way to approach this? I have experimented with denormalizing the data by pre-grouping the points into hour datetime buckets to reduce the number of rows. But I'm starting to think the real problem I need to worry about is the increasing number of users that need to be accounted for in the leaderboard. The time window sizes will generally be small but more and more users will start generating points within any given window.
Unfortunately I don't have access to 'Jobs' since I'm using SQL Azure and the Agent is not available (yet). But, I am open to the idea of scaling this using a different storage system if you are convincing enough.
My past work experience tells me I should look into data warehousing since this is almost a reporting problem. But at the same time I need it to be as real-time as possible.
Update
Ultimately, I would like to support custom leaderboards that could span from Monday 8am - Friday 6pm every week. But that's down the road and why I'm trying to not get too fancy with the aggregation. I'm wi开发者_JS百科lling to settle with basic Day/Week/Month/Year/AllTime windows for now.
The tricky part is that I really can't store them denormalized because I need these windows to be TimeZone convertible. The system is mult-tenant and therefore all data is stored as UTC. The problem is a week starts at different hours for different customers. Aggregating the sums together will cause some points to fall into the wrong buckets.
here are a few thoughts:
- Sticking with SQL Azure: you can have another table, PointsTotals. Every time you add a row to your UserPoints table, also increment the TotalPoints value for a given UserId in PointsTotals (or insert a new row if they don't have a row to increment). Now you always have totals computed for each UserId.
- Going with Azure Table Storage: Create a UserPoints table, with Partition Key being userId. This keeps all of a user's points rows together, where you'd easily be able to sum them. And... you can borrow the idea from suggestion #1, creating a separate PointsTotals table, with PartitionKey being UserId and RowKey probably being the total points.
If it were my problem, I'd ignore the timestamps and store the user and points totals by day
I decided to go with the idea of storing points along with a timespan (StartDate and EndDate columns) localized to the customer's current TimeZone setting. I realized an extra benefit with this is that I can 'purge' old leaderboard round data after a few monts without affecting the lifetime total of points.
精彩评论