Improving RODBC-Postgres Write Performance_问答_开发者

Improving RODBC-Postgres Write Performance

开发者 https://www.devze.com 2023-02-27 01:23 出处：网络

I\'ve recently begun using RODBC to connect to PostgreSQL as I couldn\'t get RPostgreSQL to compile and run in Windows x64.I\'ve found that read performance is similar between the two packages, but wr

I've recently begun using RODBC to connect to PostgreSQL as I couldn't get RPostgreSQL to compile and run in Windows x64. I've found that read performance is similar between the two packages, but write performance is not. For example, using RODBC (where z is a ~6.1M row dataframe):

library(RODBC)
con <- odbcConnect("PostgreSQL84")

#autoCommit=FALSE seems to speed things up
odbcSetAutoCommit(con, autoCommit = FALSE)
system.time(sqlSave(con, z, "ERASE111", fast = TRUE))

user  system elapsed
275.34  369.86 1979.59 

odbcEndTran(con, commit = TRUE)
odbcCloseAll()

Whereas for the开发者_如何学Go same ~6.1M row dataframe using RPostgreSQL (under 32-bit):

library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname="gisdb", user="postgres", password="...")
system.time(dbWriteTable(con, "ERASE222", z))

user  system elapsed 
467.57   56.62  668.29 

dbDisconnect(con)

So, in this test, RPostgreSQL is about 3X as fast as RODBC in writing tables. This performance ratio seems to stay more-or-less constant regardless of the number of rows in the dataframe (but the number of columns has far less effect). I do notice that RPostgreSQL uses something like COPY <table> FROM STDIN while RODBC issues a bunch of INSERT INTO <table> (columns...) VALUES (...) queries. I also notice that RODBC seems to choose int8 for integers, while RPostgreSQL chooses int4 where appropriate.

I need to do this kind of dataframe copy often, so I would very sincerely appreciate any advice on speeding up RODBC. For example, is this just inherent in ODBC, or am I not calling it properly?

It seems there is no immediate answer to this, so I'll post a kludgy workaround in case it is helpful for anyone.

Sharpie is correct--COPY FROM is by far the fastest way to get data into Postgres. Based on his suggestion, I've hacked together a function that gives a significant performance boost over RODBC::sqlSave(). For example, writing a 1.1 million row (by 24 column) dataframe took 960 seconds (elapsed) via sqlSave vs 69 seconds using the function below. I wouldn't have expected this since the data are written once to disk then again to the db.

library(RODBC)
con <- odbcConnect("PostgreSQL90")

#create the table
createTab <- function(dat, datname) {

  #make an empty table, saving the trouble of making it by hand
  res <- sqlSave(con, dat[1, ], datname)
  res <- sqlQuery(con, paste("TRUNCATE TABLE",datname))

  #write the dataframe
  outfile = paste(datname, ".csv", sep = "")
  write.csv(dat, outfile)
  gc()   # don't know why, but memory is 
         # not released after writing large csv?

  # now copy the data into the table.  If this doesn't work,
  # be sure that postgres has read permissions for the path
  sqlQuery(con,  
  paste("COPY ", datname, " FROM '", 
    getwd(), "/", datname, 
    ".csv' WITH NULL AS 'NA' DELIMITER ',' CSV HEADER;", 
    sep=""))

  unlink(outfile)
}

odbcClose(con)