Find rows that have 3 out of 5 fields in common - how to speed up query?_问答_开发者

The query below works great but is slow. In a table of about 7500 rows it takes about 30s to execute. How could I speed it up?

The goal is to find "almost duplicate" rows within the same table. When there are 3 out of 5 fields matching we have a hit.

SELECT 
originalTable.id,
originalTable.lastname,
originalTable.firstname,开发者_如何转开发
originalTable.address,
originalTable.city,
originalTable.email

FROM
address as originalTable,
address as compareTable

WHERE

# do not find the same record
originalTable.id != compareTable.id and

# at least 3 out of those 5 should match
(originalTable.firstname = compareTable.firstname) +
(originalTable.lastname = compareTable.lastname)  +
(originalTable.address = compareTable.address and originalTable.address != '')  +
(originalTable.city = compareTable.city and originalTable.city != '')  +
(originalTable.email = compareTable.email and originalTable.email != '')
>= 3


GROUP BY
originalTable.id

ORDER BY
originalTable.lastname asc,
originalTable.firstname asc,
originalTable.city asc

Thanks for any optimization hints.

A Cartesian product is required here, that's true. I came up with the following solution:

CREATE TABLE address_dups(INDEX (is_duplicate)) ENGINE=MEMORY   
SELECT 
  originalTable.id,
  compareTable.id,
(
  (originalTable.firstname = compareTable.firstname) +
  (originalTable.lastname = compareTable.lastname)  +
  (originalTable.address = compareTable.address and originalTable.address != '')  +
  (originalTable.city = compareTable.city and originalTable.city != '')  +
  (originalTable.email = compareTable.email and originalTable.email != '')
  >= 3
) AS is_duplicate
FROM 
address as originalTable,
address as compareTable
WHERE originalTable.id != compareTable.id;

SELECT * FROM address_dups WHERE is_duplicate = 1;

This will give you for each row ID the fuzzy duplicate row IDs you request as well.

Your comparison as already noted will require the Cartesian... but only a PARTIAL. Since you are requiring a value in both your first and last name fields, I would have an index on AT LEAST the last name, first name. Then, add a WHERE clause to your condition on partial of last name only... say first 2-3 characters. This way, it will only Cartesian against those in the same name prefix vs the rest. No sense in comparing a "Bill Jones" to "Tonya Smith". However, you might be interested in "Bill Jones" vs "William Jones" at common address, city and/or email. Consider the following name portions for Cartesian comparison.

(names fictitious for sample)
ID  Last     First
1   Adams    Brian
2   Adams    Marsha
3   Andrews  Jeff
4   Brown    Steve
5   Johns    Dave
6   Johnson  Bill
7   Johnson  William

Both the "Adams" would be compared if you only qualified the left 3 of each last name in your where clause. "Andrews" and "Brown" would have no comparison match to anyone. Then, the 3 starting with "Joh" would be Cartesian tested...

Now, add in one MORE to your where clause... Since you have an ID column, make sure that too is part of your where clause. Where the ID of the second table is ALWAYS greater than the one you are on. Ex: When comparing the "Adams" names. You will already know if the ID1 was compared to ID2 as a duplicate or not (not, in this case), so why go backwards and re-compare ID2 to ID1.

So, this sample of 7 records will result in comparisons of

1-2
2-no more to compare against
3-no more to compare against
4-no more to compare against
5-6
5-7
6-7
7-no more to compare against

So a final where would be something like (including the ID that was a close match as basis to look back to.. You could even get all the columns as "MatchFirstName, MatchLastName, MatchCity, etc" just for previewing purposes... )

SELECT 
      originalTable.id,
      originalTable.lastname,
      originalTable.firstname,
      originalTable.address,
      originalTable.city,
      originalTable.email,
      compareTable.ID as MatchID
    FROM
      address as originalTable,
      address as compareTable
    WHERE
          originalTable.ID < CompareTable.ID
      AND left( originalTable.LastName, 3 ) = left( CompareTable.LastName, 3 )
      AND (originalTable.firstname = compareTable.firstname) 
        + (originalTable.lastname = compareTable.lastname) 
        + (originalTable.address = compareTable.address and originalTable.address != '')  
        + (originalTable.city = compareTable.city and originalTable.city != '')  
        + (originalTable.email = compareTable.email and originalTable.email != '') >= 3