
SQLAlchemy: Efficient way of merging duplicate tags

https://www.devze.com 2023-04-12 01:46 (source: web)

Let's say I have a database full of Tag objects. Each Tag has an id and a name. In the beginning of making the database I allowed for case sensitive Tags, however, I later realized I didn't need/want this capability, so I started forcing each name to be lowercase before storing the Tag.

Now I have all these remnants of different names which would now be stored under the same Tag but previously weren't. For example,

Trendy, trendy
NotHalfBad, Nothalfbad, nothalfbad
SQL, sql, Sql

I am using Python and SQLAlchemy. I have created a function to clean up this mess that looks something like this:

todelete = []

for t1 in Session.query(Tag):
    if t1 not in todelete: # If we haven't already encountered this tag
        for t2 in Session.query(Tag).filter_by(name_insensitive=t1.name):
            if t1.id != t2.id:
                merge(t1,t2) # Calls a function I made that merges the two tags
                todelete.append(t2) # t2 is the duplicate; `tag` was undefined here
                Session.commit()

# Mark everything for deletion
for tag in todelete:
    Session.delete(tag)

# Now commit the deletes
Session.commit()

This is horribly inefficient. Is there a better way?


If it's a tool for one-time use, do you really have to care about efficiency? Just let it run for a minute (or several), rather than spending even more time optimizing it.

That being said, queries are more expensive than Python loops, so loading all the Tags into a list first, and then looping over that list both times, should speed things up:

tags = Session.query(Tag).all()  # one query, then pure in-memory loops
todelete = []

for t1 in tags:
    if t1 not in todelete:
        for t2 in tags:
            if t2.id != t1.id and t2.name_insensitive == t1.name:
                merge(t1, t2)
                todelete.append(t2)
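Even in memory, the nested loop is still O(n²). Since the goal is just to group tags by lowercased name, a single pass with a dictionary does the same work in O(n). A sketch in plain Python (the Tag class and merge() below are stand-ins for your own):

```python
class Tag:
    """Stand-in for the real SQLAlchemy Tag model."""
    def __init__(self, id, name):
        self.id = id
        self.name = name

def merge(canonical, duplicate):
    """Stand-in for your merge(): move the duplicate's
    associations over to the canonical tag."""
    pass

tags = [Tag(1, "Trendy"), Tag(2, "trendy"), Tag(3, "SQL"), Tag(4, "sql")]

canonical = {}  # lowercased name -> the tag we keep
todelete = []   # duplicates to delete afterwards

for tag in tags:
    key = tag.name.lower()
    if key in canonical:
        # Already saw a tag with this name: merge into it.
        merge(canonical[key], tag)
        todelete.append(tag)
    else:
        # First occurrence becomes the canonical tag.
        canonical[key] = tag

# canonical now holds one tag per name; todelete holds the rest.
```

The first tag encountered for each name becomes the canonical one; everything after it is merged and queued for deletion, with no inner loop at all.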

Also, remove the commit call in the loop. Not only is it expensive, but if some other process changes the DB, the list of tags you're looping over could get out of sync.
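Alternatively, for a one-time cleanup you can push the whole merge into the database instead of looping in Python. A minimal sketch against SQLite — the tag and post_tag tables here are hypothetical, so adapt the names to the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE post_tag (post_id INTEGER, tag_id INTEGER);
    INSERT INTO tag VALUES (1, 'SQL'), (2, 'sql'), (3, 'Trendy');
    INSERT INTO post_tag VALUES (10, 1), (11, 2), (12, 3);
""")

# Repoint every association at the lowest id among
# case-insensitive duplicates of its tag's name.
conn.execute("""
    UPDATE post_tag SET tag_id = (
        SELECT MIN(t2.id) FROM tag t1 JOIN tag t2
            ON lower(t1.name) = lower(t2.name)
        WHERE t1.id = post_tag.tag_id
    )
""")

# Lowercase the names, then drop every tag that is no longer
# the lowest-id holder of its name.
conn.execute("UPDATE tag SET name = lower(name)")
conn.execute("""
    DELETE FROM tag WHERE id NOT IN (
        SELECT MIN(id) FROM tag GROUP BY name
    )
""")
conn.commit()
```

One caveat: if a post carried both 'SQL' and 'sql', the first UPDATE leaves two identical rows in post_tag; a follow-up DELETE of exact duplicates would be needed in that case.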

Of course, the proper way to make things more efficient is to profile first and then concentrate on the specific hot spots. You should do that if you're serious about performance.

