开发者

Efficiently count percentage of occurrence in MongoDB

开发者 https://www.devze.com 2023-02-07 02:56 出处:网络
So, I\'m tinkering with MongoDB, and I\'m trying to get the count() aggregation queryi开发者_Go百科ng to scale properly, to allow me to easily calculate the percentage of occurrence of certain values

So, I'm tinkering with MongoDB, and I'm trying to get the count() aggregation queryi开发者_Go百科ng to scale properly, to allow me to easily calculate the percentage of occurrence of certain values in the document across the collection.

I have a document with a structure like:

{
    foo : 'bar',
    moo : 'cow',
    values : {
        alpha : true,
        beta : false,
        gamma : false,
        delta : true ... (many more)
    }
}

Now, I have several thousand of these documents, and I want to efficiently calculate the percentage of true (or the percentage of false) of all the values in the values object (and in my case, there are ~50). ie, what percentage of the time alpha is true, beta is true, etc.

I started naively with count(), but it seems like it only allows one query at a time, so that led me to do this (using the PHP Mongo class, but its basically just a regular count() function:

 $array_of_keys = array('alpha', 'beta', 'gamma', 'delta'...);
 for($i=0;$i<count($array_of_keys);$i++){
    $array_of_keys = [...]
    for($i=0;$i<count($array_of_keys);$i++){

$false  = intval($collection->count(array($array_of_keys[$i]=>false)));
$true  = intval($collection->count(array($array_of_keys[$i]=>true)));
}

But even with a very small number of records (around 100), this took 9 seconds.

What's the best approach for this?


Here is a simple MapReduce that will do what you want:

map = function() {
    for (var key in this.values){
        emit(key, {count:1, trues: (this.values[key] ? 1 : 0)});
    }
}

reduce = function(key, values){
    var out = values[0];
    for (var i=1; i < values.length; i++){
        out.count += values[i].count;
        out.trues += values[i].trues;
    }
    return out;
}

finalize = function(key, value){
    value.ratio = value.trues / value.count;
    return value;
}

db.runCommand({mapReduce:'collection',
               map:map,
               reduce:reduce,
               finalize:finalize,
               out:'counts'
               })

db.counts.findOne({_id:'alpha'})
{_id: 'alpha', value: {count: 100, trues: 52, ratio: 0.52}}

You could also do an upsert like this when you insert into your main collection which will give you a real-time view into your data:

for (var key in this.values){
    db.counts.update({_id:key},
                     {$inc:{count:1, trues: (this.values[key] ? 1 : 0)}},
                     true);
}

In fact, you could even combine these methods. Do a one-time MapReduce batch job to populate the counts collection and then use upserts to keep it up to date.

0

精彩评论

暂无评论...
验证码 换一张
取 消