Python sort unique list of lists' items

Developer https://www.devze.com 2023-04-08 14:54 Source: Internet
I can't seem to find a question on SO about my particular problem, so forgive me if this has been asked before!

Anyway, I'm writing a script to loop through a set of URLs and give me a list of unique URLs with unique parameters.

The trouble I'm having is actually comparing the parameters to eliminate multiple duplicates. It's a bit hard to explain, so some examples are probably in order:

Say I have a list of URLs like this:

  • hxxp://www.somesite.com/page.php?id=3&title=derp
  • hxxp://www.somesite.com/page.php?id=4&title=blah
  • hxxp://www.somesite.com/page.php?id=3&c=32&title=thing
  • hxxp://www.somesite.com/page.php?b=33&id=3

I have it parsing each URL into a list of lists, so eventually I have a list like this:

sort = [['id', 'title'], ['id', 'c', 'title'], ['b', 'id']]
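For reference, the parsing step described above could look roughly like the sketch below. It is not the asker's actual code: the helper name `param_names` and the variables `urls` and `param_lists` are made up for illustration, and it assumes only the parameter names (not their values) matter.

```python
from urllib.parse import urlsplit, parse_qsl

def param_names(url):
    """Return the query parameter names of `url`, in order of appearance."""
    query = urlsplit(url).query
    # keep_blank_values=True so that empty parameters such as "?debug=" are not dropped
    return [name for name, _value in parse_qsl(query, keep_blank_values=True)]

# The example URLs from the question (hypothetical variable name)
urls = [
    "hxxp://www.somesite.com/page.php?id=3&title=derp",
    "hxxp://www.somesite.com/page.php?id=4&title=blah",
    "hxxp://www.somesite.com/page.php?id=3&c=32&title=thing",
    "hxxp://www.somesite.com/page.php?b=33&id=3",
]

param_lists = []
for url in urls:
    names = param_names(url)
    if names not in param_lists:  # skip exact duplicates such as the second URL
        param_lists.append(names)

print(param_lists)
```

The first two URLs share the same parameter names, which is why four URLs yield only three lists.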

I need to figure out a way to give me just 2 lists in my list at that point:

new = [['id', 'c', 'title'], ['b', 'id']]

As of right now I've got a bit of code that sorts it out a little. I know I'm close, and I've been slamming my head against this for a couple of days now :(. Any ideas?

Thanks in advance! :)

EDIT: Sorry for not being clear! This script is aimed at finding unique entry points for web applications post-spidering. Basically if a URL has 3 unique entry points

['id', 'c', 'title']

I'd prefer that to the same link with 2 unique entry points, such as:

['id', 'title']

So I need my new list of lists to eliminate the one with 2 and prefer the one with 3 ONLY if the smaller variables are in the larger set. If it's still unclear let me know, and thank you for the quick responses! :)


I'll assume that subsets are considered "duplicates" (non-commutatively, of course)...

Start by converting each query into a set and ordering them all from largest to smallest. Then add each query to a new list if it isn't a subset of an already-added query. Since any set is a subset of itself, this logic covers exact duplicates:

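a = []
# Visit the queries as sets, largest first, so supersets are added before their subsets
for q in sorted((set(q) for q in sort), key=len, reverse=True):
    # Keep q only if no already-kept set contains it (this also drops exact duplicates)
    if not any(q.issubset(Q) for Q in a):
        a.append(q)
a = [list(q) for q in a] # Back to lists, if you want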
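One thing to watch for: going through set() and back to list() does not preserve the original order of the parameter names. If that order matters, a possible variation is sketched below. It uses the same largest-first subset check but keeps the original lists and only builds sets for the comparison. The function name `drop_subsets` is made up for this example.

```python
def drop_subsets(param_lists):
    """Return the lists whose names are not all contained in another kept list."""
    kept = []
    # Largest first, so a superset is always seen before any of its subsets.
    # sorted() is stable, so lists of equal length keep their original relative order.
    for names in sorted(param_lists, key=len, reverse=True):
        # A set is a subset of itself, so exact duplicates are dropped here too
        if not any(set(names) <= set(other) for other in kept):
            kept.append(names)
    return kept

sort = [['id', 'title'], ['id', 'c', 'title'], ['b', 'id']]
new = drop_subsets(sort)
print(new)  # [['id', 'c', 'title'], ['b', 'id']]
```

This compares every candidate against every kept list, which should be fine for the handful of parameter sets a single page is likely to have.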