How to speed up comparing MD5 hashes in a database

I have a database full of classified adverts for computers for sale that have come from many different sites. The database is populated from XML files received from the individual advertising sites, which are then parsed and placed in a central table in the database.

The XML files have the following attributes for each computer: Make, Model, HD Size, RAM Size, Processor Speed, Price, Location etc.

The central database table then has the same columns, plus an extra one at the start which is an ID field for our own use.

Because the adverts are created by the public, people often place the same advert on more than one of our partner sites, thereby creating duplicate adverts for the same computer, and I need to identify the adverts that are duplicated in the database.

The problem with identifying the adverts is that there is no unique field (unlike, say, a car with a Reg Number).

An idea I've had is to add an extra column to the table that holds an MD5 hash of the contents of the other columns. When the XML is parsed, the MD5 hash is created of each advert and then added as a final column.
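
Roughly, that hashing step might look like this (a Python sketch; the field names, separator and dictionary shape are placeholders, not the real schema):

    import hashlib

    def advert_fingerprint(advert):
        # Hash the fields in a fixed order with a separator so that, e.g.,
        # ("ab", "c") and ("a", "bc") don't produce the same string.
        fields = ["make", "model", "hd_size", "ram_size",
                  "processor_speed", "price", "location"]
        joined = "|".join(str(advert.get(field, "")) for field in fields)
        return hashlib.md5(joined.encode("utf-8")).hexdigest()

    print(advert_fingerprint({"make": "Dell", "model": "XPS 13", "hd_size": "500GB",
                              "ram_size": "8GB", "processor_speed": "2.4GHz",
                              "price": "450", "location": "Leeds"}))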

Once the records (100k upwards) have been added to the table, a query is run to identify any matching MD5 hashes, but this takes too long and often the query times out (even if the timeout has been extended).
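
For reference, the duplicate check is essentially a GROUP BY on the hash column; something like this sketch (sqlite3 purely for illustration, table and column names are placeholders):

    import sqlite3

    conn = sqlite3.connect("adverts.db")   # assumes the central table already exists
    duplicates = conn.execute("""
        SELECT content_md5, COUNT(*) AS copies
        FROM adverts
        GROUP BY content_md5
        HAVING COUNT(*) > 1
    """).fetchall()
    for md5_value, copies in duplicates:
        print(md5_value, copies)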

My question then is: Is the MD5 hash route the best one? If so, how would I speed the querying up? If not, what would be the best way of identifying the duplicate adverts?

Thanks,


We use MD5 hashes to quickly identify rows and have hundreds of millions of rows of data, so I would say this is an appropriate choice.

Do you have an index on the column holding the MD5? Note that it could be a non-unique index if you want to hold the duplicate submissions in the table, or a unique index if you want to prevent a duplicate from being inserted.
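
Either variant is a one-off DDL statement; a minimal sketch (placeholder names, sqlite3 used only to keep it runnable):

    import sqlite3

    conn = sqlite3.connect("adverts.db")

    # Non-unique index: duplicates stay in the table, but grouping or joining
    # on the hash becomes an index operation instead of a full table scan.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_adverts_md5 ON adverts (content_md5)")

    # Or a unique index, if duplicates should be rejected at insert time:
    # conn.execute("CREATE UNIQUE INDEX idx_adverts_md5 ON adverts (content_md5)")

    conn.commit()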

If you're still not getting the speed you need, you can consider using a 64-bit hash. Some people do that for very high volume apps since it cuts the size of the indexed field in half. I doubt you'll need to do that for the volume you mention.

Keep in mind that the slightest change to the text of the ad will result in a new MD5 value (even an extra space). If there can be formatting changes, you may want to normalize the data before doing the MD5, e.g. by removing all whitespace and punctuation and converting the text to a consistent case.
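
For example, a normalization step along these lines (just a sketch of the idea) makes the hash insensitive to such cosmetic differences:

    import re

    def normalize(value):
        value = value.lower()                # consistent casing
        return re.sub(r"[^\w]", "", value)   # drop whitespace and punctuation

    # Both spellings normalize to "dellxps13" and would therefore hash identically.
    assert normalize("Dell XPS-13 ") == normalize("dell xps 13")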


Seems like adding an index on the hash column can help.


Well, since you're only interested in finding duplicates, MD5 is probably not the best choice here. Remember that MD5 was engineered as a cryptographic hash, and speed wasn't the foremost goal (in fact, many modern secure hashes are deliberately made SLOW by design!).

I'd personally implement a simple hash myself and use that. As noted by Eric J., you have to normalize your data anyhow before using it, and after the normalizing step you just run it through your hash function and use the result.

The simplest one would be to handle all fields as strings and just use the usual string hash algorithm:

s[0]*KEY^(n-1) + s[1]*KEY^(n-2) + ... + s[n-1]

with KEY usually being some small prime number (IIRC, for a typical English dictionary 31 or 49 gives the fewest collisions, but since your hash is computed over several fields that probably won't matter). This is simple and fast to implement - and it also means you use a word-sized hash, which should also be faster.
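
For illustration only (in Python, with KEY = 31 and the result folded to 64 bits), the formula above evaluated in Horner form:

    KEY = 31                # small prime, as mentioned above
    MASK = (1 << 64) - 1    # keep the hash word-sized (64-bit)

    def simple_hash(fields):
        h = 0
        for ch in "|".join(str(f) for f in fields):
            h = (h * KEY + ord(ch)) & MASK   # Horner evaluation of the polynomial
        return h

    print(simple_hash(["Dell", "XPS 13", "500GB", "8GB", "2.4GHz", "450", "Leeds"]))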

Anyway, back to your actual problem: adding a (non-unique!) index will be the simplest solution, but I'd test whether it's faster to create the index only after all the records have been added (the DB then has to sort the data once, but the inserts themselves are faster) - you'd have to benchmark that.
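
A rough sketch of that bulk-load ordering (placeholder names and sample data; the point is only that the index is built after the inserts):

    import sqlite3

    conn = sqlite3.connect("adverts.db")
    conn.execute("DROP INDEX IF EXISTS idx_adverts_md5")   # no index maintenance during the load

    rows = [
        # (make, model, hd_size, ram_size, processor_speed, price, location, content_md5)
        ("Dell", "XPS 13", "500GB", "8GB", "2.4GHz", "450", "Leeds",
         "d41d8cd98f00b204e9800998ecf8427e"),
    ]
    conn.executemany(
        "INSERT INTO adverts (make, model, hd_size, ram_size, processor_speed,"
        " price, location, content_md5) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        rows,
    )

    # Build the index once, after all rows are in place.
    conn.execute("CREATE INDEX idx_adverts_md5 ON adverts (content_md5)")
    conn.commit()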
