How can I speed up a diff between tables?

Developer https://www.devze.com 2023-03-12 10:53 Source: web

I am diffing two tables in PostgreSQL. It takes a long time because each table is ~13GB. My current queries are:

SELECT * FROM tableA EXCEPT SELECT * FROM tableB;

and

SELECT * FROM tableB EXCEPT SELECT * FROM tableA;

When I diff the two (unindexed) tables, it takes 1:40 hours (1 hour and 40 minutes). To get both the new and the removed rows I need to run the query twice, bringing the total time to 3:20 hours.

I ran PostgreSQL's EXPLAIN on the query to see what it was doing. It looks like it sorts the first table, then the second, then compares them. That made me think that if I indexed the tables they would be presorted and the diff query would be much faster.

Indexing each table took 45 minutes. Once indexed, each diff took 1:35 hours. Why do the indexes shave only 5 minutes off each diff? I would have expected the time to drop by more than half, since in the unindexed queries I am sorting each table twice (I need to run the query twice).

Since one of these tables will not change much, it only needs to be indexed once; the other will be updated daily. So the total runtime for the indexed method is 45 minutes for the index plus 2 x 1:35 for the diff, giving a total of 3:55 hours, almost 4 hours.

What am I doing wrong here? I can't possibly see why my net diff time is larger with the index than without it.

This is loosely related to my other question: Postgresql UNION takes 10 times as long as running the individual queries

EDIT: Here is the schema for the two tables; they are identical except for the table name.

CREATE TABLE bulk.blue
(
  "partA" text NOT NULL,
  "type" text NOT NULL,
  "partB" text NOT NULL
)
WITH (
  OIDS=FALSE
);


In the statements above you are not using the indexes.

You could do something like:

SELECT * FROM tableA a
  FULL OUTER JOIN tableB b ON a.someID = b.someID

You could then extend the same statement to show which rows are missing from either table:

SELECT * FROM tableA a
  FULL OUTER JOIN tableB b ON a.someID = b.someID
  WHERE a.someID IS NULL OR b.someID IS NULL;

This should give you the rows that are missing from table A or from table B.
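
Applied to the schema posted in the question, the same idea might look like the sketch below. It assumes that tableA and tableB both have the three NOT NULL columns "partA", "type" and "partB" from the posted CREATE TABLE; there is no someID column in that schema, so the join is on all three columns. The side label and the COALESCE output columns are illustrative additions, not something from the question.

```sql
-- Sketch: both directions of the diff in a single pass.
-- Assumes tableA and tableB share the (partA, type, partB) schema from the
-- question and that every column is NOT NULL, so a NULL on one side can only
-- mean "no matching row on that side".
SELECT
  CASE WHEN b."partA" IS NULL THEN 'only in tableA'
       ELSE 'only in tableB'
  END AS side,
  COALESCE(a."partA", b."partA") AS "partA",
  COALESCE(a."type",  b."type")  AS "type",
  COALESCE(a."partB", b."partB") AS "partB"
FROM tableA a
FULL OUTER JOIN tableB b
  ON  a."partA" = b."partA"
  AND a."type"  = b."type"
  AND a."partB" = b."partB"
WHERE a."partA" IS NULL
   OR b."partA" IS NULL;
```

One difference from EXCEPT: EXCEPT removes duplicate rows from its result, while this join returns an unmatched row once for every copy that exists in its table.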


Confirm your indexes are being used (they likely are not in such a generic EXCEPT statement). You are not joining on specific columns, so without an explicit join the query is unlikely to be optimized:

http://www.postgresql.org/docs/9.0/static/indexes-examine.html

This will help you read the EXPLAIN ANALYZE output more clearly:

http://explain.depesz.com

Also, make sure you run ANALYZE on the table after you create the index if you want it to perform well right away.
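
A minimal sketch of that check, using the tableA and tableB names from the question:

```sql
-- Refresh the planner statistics after building the indexes.
ANALYZE tableA;
ANALYZE tableB;

-- Plain EXPLAIN only shows the plan. EXPLAIN ANALYZE would also execute the
-- query, which on these tables means waiting for the full diff to run.
EXPLAIN
SELECT * FROM tableA
EXCEPT
SELECT * FROM tableB;
```

If the plan still shows a sequential scan and a sort on each table, the indexes are not being used for this query.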


The queries as specified require a comparison of every column of the tables.

For example, if tableA and tableB each have five columns, then the query has to compare tableA.col1 to tableB.col1, tableA.col2 to tableB.col2, ... tableA.col5 to tableB.col5.

If just a few columns uniquely identify a record, rather than all the columns in the table, then joining the tables on those specific columns will improve your performance.

The above statement assumes that a primary key has not been created. If a primary key has been defined to indicate which columns uniquely identify a record, then I believe the EXCEPT statement would take that into consideration.
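
As a rough sketch of joining on the identifying columns only: the query below assumes, purely for illustration, that ("partA", "type") uniquely identifies a record. The question does not say which columns are unique, so substitute the real key.

```sql
-- Sketch: rows in tableA whose key has no counterpart in tableB.
-- ASSUMPTION: ("partA", "type") uniquely identifies a row; this is a
-- hypothetical key chosen for illustration, not taken from the question.
SELECT a.*
FROM tableA a
WHERE NOT EXISTS (
  SELECT 1
  FROM tableB b
  WHERE b."partA" = a."partA"
    AND b."type"  = a."type"
);
```

Swapping tableA and tableB gives the other direction. Note that this compares keys only: a row whose key exists in both tables but whose "partB" differs is not reported, unlike with EXCEPT on all columns.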


  • What kind of index did you apply? Indexes mainly help with WHERE conditions and joins on the indexed columns. If you're doing a SELECT *, you're grabbing all the fields, and the index is probably doing nothing except taking up space and adding a little more work for the database engine.

  • Instead of SELECT *, you can try selecting only your unique fields and creating an index on those fields.

  • You can also use an OUTER JOIN to show rows from both tables that did not match on the unique fields.
  • You may want to consider clustering your tables.
  • What version of Postgres are you running?
  • When was the last time you vacuumed?
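
The indexing, clustering and vacuuming suggestions above could be tried roughly as follows. The index name tablea_key_idx is made up for this sketch, and indexing all three columns is an assumption, since the question does not say which fields are unique.

```sql
-- Sketch: index the comparison columns, physically order the table by that
-- index, then refresh statistics. tablea_key_idx is a hypothetical name.
CREATE INDEX tablea_key_idx ON tableA ("partA", "type", "partB");

-- CLUSTER rewrites the whole table in index order and holds an exclusive
-- lock while it runs, so it suits the table that rarely changes.
CLUSTER tableA USING tablea_key_idx;

-- Reclaim dead rows and update planner statistics.
VACUUM ANALYZE tableA;

-- Answers the version question.
SELECT version();
```

The CLUSTER ... USING form needs PostgreSQL 8.3 or later.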

Other than the above, 13GB is pretty large, so you'll want to check your config settings. It shouldn't take hours to run that, unless you don't have enough memory on your system.
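
One setting worth checking for a sort-heavy query like this is work_mem, which limits how much memory each sort may use before it spills to disk. The value below is only an illustrative assumption and should be sized to the memory actually available.

```sql
-- Inspect the current memory settings.
SHOW work_mem;
SHOW shared_buffers;

-- Raise the per-sort memory limit for this session only.
-- '512MB' is an illustrative value, not a recommendation.
SET work_mem = '512MB';
```

SET changes the value for the current session only, so it can be tried on the diff query without touching postgresql.conf.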
