n-largest elements in a sequence (need to retain duplicates)_问答_开发者

I need to find the n largest elements in a list of tuples. Here is an example for top 3 elements.

# I have a list of tuples of the form (category-1, category-2, value)
# For each category-1, ***values are already sorted descending by default***
# The list can potentially be approximately a million elements long.
lot = [('a', 'x1', 10), ('a', 'x2', 9), ('a', 'x3', 9), 
       ('a', 'x4',  8), ('a', 'x5', 8), ('a', 'x6', 7),
       ('b', 'x1', 10), ('b', 'x2', 9), ('b', 'x3', 8), 
       ('b', 'x4',  7), ('b', 'x5', 6), ('b', 'x6', 5)]

# This is what I need. 
# A list of tuple with top-3 largest values for each category-1
ans = [('a', 'x1', 10), ('a', 'x2', 9), ('a', 'x3', 9), 
       ('a', 'x4', 8), ('a', 'x5', 8),
       ('b', 'x1', 10), ('b', 'x2', 9), ('b', 'x3', 8)]

I tried using heapq.nlargest. However it only returns the first 3 largest elements and doesn't return duplicates. For example,

heapq.nlargest(3, [10, 10, 10, 9, 8, 8, 7, 6])
# returns
[10, 10, 10]
# I need
[10, 10, 10, 9, 8, 8]

I can only think of a brute force approach. This is what I have and it works.

res, prev_t, count = [lot[0]], lot[0], 1
for t in lot[1:]:
    if t[0] == prev_t[0]:
        count = count + 1 if t[2] != prev_t[2] else count
        if count <= 3:
            res.append(t)   
    else:
        count = 1
        res.append(t)
    prev_t = t

print res

Any other ideas on how I can implement this?

EDIT: timeit results for a list of 1 million elements show that mhyfritz's solution runs in 1/3rd the 开发者_运维技巧time of brute force. Didn't want to make the question too long. So added more details in my answer.

I take it from your code snippet that lot is grouped w.r.t. category-1. Following should work then:

from itertools import groupby, islice
from operator import itemgetter

ans = []
for x, g1 in groupby(lot, itemgetter(0)):
    for y, g2 in islice(groupby(g1, itemgetter(2)), 0, 3):
        ans.extend(list(g2))

print ans
# [('a', 'x1', 10), ('a', 'x2', 9), ('a', 'x3', 9), ('a', 'x4', 8), ('a', 'x5', 8),
#  ('b', 'x1', 10), ('b', 'x2', 9), ('b', 'x3', 8)]

If you already have the input data sorted that way then is very probably that your solution is a little better than the heapq based one.

Your algorithm complexity is O(n) while the heapq based one is conceptually O(n * log(3)) and it will probably need more passes over the data to arrange it properly.

Some additional details ... I timed both mhyfritz's excellent solution that uses itertools and and my code (brute-force).

Here are the timeit results for n = 10 and for a list with 1 million elements.

# Here's how I built the sample list of 1 million entries.
lot = []
for i in range(1001):
    for j in reversed(range(333)):
        for k in range(3):
            lot.append((i, 'x', j))

# timeit Results for n = 10
brute_force = 6.55s
itertools = 2.07s
# clearly the itertools solution provided by mhyfritz is much faster.

In case anyone is curious, here is a trace of how his code works.

+ Outer loop - x, g1
| a [('a', 'x1', 10), ('a', 'x2', 9), ('a', 'x3', 9), ('a', 'x4', 8), ('a', 'x5', 8), ('a', 'x6', 7)]
+-- Inner loop - y, g2
  |- 10 [('a', 'x1', 10)]
  |- 9 [('a', 'x2', 9), ('a', 'x3', 9)]
  |- 8 [('a', 'x4', 8), ('a', 'x5', 8)]
+ Outer loop - x, g1
| b [('b', 'x1', 10), ('b', 'x2', 9), ('b', 'x3', 8), ('b', 'x4', 7), ('b', 'x5', 6), ('b', 'x6', 5)]
+-- Inner loop - y, g2
  |- 10 [('b', 'x1', 10)]
  |- 9 [('b', 'x2', 9)]
  |- 8 [('b', 'x3', 8)]

How's about this? It doesn't exactly return your desired result, since it reverse-sorts on y.

# split lot by first element of values
lots = defaultdict(list)
for x, y, z in lot:
    lots[x].append((y, z))

ans = []
for x, l in lots.iteritems():
    # find top-3 unique values
    top = nlargest(3, set(z for (y, z) in l))
    ans += [(x, y, z) for (z, y) in sorted([(z, y) for (y, z) in l
                                                   if z in top],
                                           reverse=True)]

print ans

from collections import *

categories = defaultdict(lambda: defaultdict(lambda: set()))
for t in myTuples:
    cat1,cat2,val = t
    categories[cat1][val].add(t)

def onlyTopThreeKeys(d):
    keys = sorted(d.keys())[-3:]
    return {k:d[k] for k in keys}

print( {cat1:onlyTopThreeKeys(sets) for cat1,sets in categories.items()} )

Result:

{'a': {8: {('a', 'x5', 8), ('a', 'x4', 8)},
       9: {('a', 'x3', 9), ('a', 'x2', 9)},
       10: {('a', 'x1', 10)}},
 'b': {8: {('b', 'x3', 8)}, 
       9: {('b', 'x2', 9)}, 
       10: {('b', 'x1', 10)}}}

flat list: I did the method above because it gives you more information. To just get a flat list, use closures to emit results with onlyTopThreeKeys:

from collections import *

def topTiedThreeInEachCategory(tuples):
    categories = defaultdict(lambda: defaultdict(lambda: set()))
    for t in myTuples:
        cat1,cat2,val = t
        categories[cat1][val].add(t)

    reap = set()

    def sowTopThreeKeys(d):
        keys = sorted(d.keys())[-3:]
        for k in keys:
            for x in d[k]:
                reap.add(x)
    for sets in categories.values():
        sowTopThreeKeys(sets)

    return reap

Result:

>>> topTiedThreeInEachCategory(myTuples)
{('b', 'x2', 9), ('a', 'x1', 10), ('b', 'x3', 8), ('a', 'x2', 9), ('a', 'x4', 8), ('a', 'x3', 9), ('a', 'x5', 8), ('b', 'x1', 10)}

You can also use itertools.groupby if your input is guaranteed to be sorted as in your sample input, but this will cause your code to break if the sorting ever changes.

This is the idea, make a dict with the value you want to sort by as the key and a list of the tuples that have that value as the values.

Then sort the items of the dict by the keys, get the items from the top, extract their values and join them.

Quick, ugly code:

>>> sum(
        map(lambda x: x[1],
            sorted(dict([(x[2], filter(lambda y: y[2] == x[2], lot))
                for x in lot]).items(),
                reverse=True)[:3]),
    [])

7: [('a', 'x1', 10),
 ('b', 'x1', 10),
 ('a', 'x2', 9),
 ('a', 'x3', 9),
 ('b', 'x2', 9),
 ('a', 'x4', 8),
 ('a', 'x5', 8),
 ('b', 'x3', 8)]

Just to give you some ideas, hope it helps. If you need some clarification ask in the comments