Optimal database design in terms of query speed to store matrices from R

I have hundreds of matrices that need to be used in R, and most of them are around 45000x350 each. What I'd like to do is find an optimal database software choice and schema to store the data in, so that I can call subsets of the matrices from the database. The objective is to make extracting the data as fast as possible.

As a base here is code that creates 5 matrices similar to what I'm dealing with:

if(!"zoo" %in% installed.packages()[,1]) { install.packages("zoo") }
require("zoo", quietly=TRUE)

numSymbols <- 45000
numVariables <- 5
rDatePattern <- "%d/%m/%Y"
startDate <- "31/12/1982"
endDate <- "30/09/2011"
startYearMonth <- as.yearmon(startDate,format=rDatePattern)
alphaNumeric <- c(1:9,toupper(letters))             
numMonths <- (as.yearmon(endDate,format=rDatePattern)-startYearMonth)*12
numValues <- numSymbols*(numMonths+1)

# end-of-month dates: first day of month x after the start, minus one day
dateVector <- sapply(1:(numMonths+1), function(x) {as.character(format(as.Date(startYearMonth+x*1/12,frac=0)-1,rDatePattern))})

symbolNames <- sapply(1:numSymbols,function(x) {as.character(paste((sample(alphaNumeric,7)),collapse=""))})
                    
for(i in 1:numVariables) {
    assign(paste("Variable",i,sep="_"),matrix(sample(c(rnorm(numValues/2),rep(NA,numValues/2))),
                                              nrow=numSymbols,
                                              ncol=(numMonths+1),
                                              dimnames=list(symbolNames,dateVector)))
}

Basically all the matrices will have about half the values filled with doubles and the rest NAs.

# > ls()[grepl("Variable_",ls())]
# [1] "Variable_1" "Variable_2" "Variable_3" "Variable_4" "Variable_5"

# > dim(Variable_1)
# [1] 45000   346

# > Variable_1[1:10,1:3]
#                  31/12/1982          31/01/1983           28/02/1983
# AF3HM5V                  NA                  NA -1.15076100366945755
# DL8TVIY                  NA                  NA -1.59412257037490046
# JEFDYPO                  NA                  NA                   NA
# 21ZV689                  NA                  NA -0.31095014405320764
# RP1DZHB -1.0571670785223215 -0.7206356272944392 -0.84028668343265112
# N6DUSZC                  NA                  NA -1.31113363079930023
# XG3ZA1W                  NA  0.8531074740045220  0.06797987526470438
# W1JCXIE  0.2782029710832690 -1.2668560986048898                   NA
# X3RKT2B  1.5220172324681460 -1.0460218516729356                   NA
# 3EUB8VN -0.9405417187846803  1.1151437940206490  1.60520458945005262

I want to be able to store these in a database. An RDBMS would be the default option, but I'm willing to look at other options. The main requirement is quick querying, be it for the whole matrix or for a subset of it, e.g. 2000 symbols and 100 dates.

The current solution I've been using is saving each matrix as an RData file, loading in the whole matrix and then subsetting it down to what I need. This is really quick, but I feel a database design would be more beneficial for scaling the matrices in terms of symbols and dates, and for backing up the data.
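
For reference, the current approach boils down to something like this (using the sample objects generated above; the file name is just an example):

# current approach: one RData file per matrix, loaded whole and then subset
save(Variable_1, file="Variable_1.RData")

load("Variable_1.RData")                          # pulls the full 45000 x 346 matrix into RAM
subsetData <- Variable_1[symbolNames[1:2000], dateVector[1:100]]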

What I've tried so far in terms of RDBMS options:

A)

- Fields: Symbol, Variable, Date, Value
- Separate and clustered indices on all fields but Value.
- Data needs to be "melted"/pivoted to an m x n matrix for R (crazy memory-inefficient)
- Average query for a normal sample into R: 4-8 minutes

B)

- Each variable in a separate table.
- Fields: Symbol, Date, Value
- Separate and clustered indices on all fields but Value.
- Views added to cache common subsets (not sure it helped at all...)
- Data needs to be "melted"/pivoted to an m x n matrix for R (crazy memory-inefficient; a sketch of this step follows below)
- Average query for a normal sample into R: 3-5 minutes
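
To make the melt/pivot step concrete, here is roughly what retrieving and reshaping a subset looks like for this schema (the DSN, table name and query are illustrative, not my exact code):

require("RODBC")

channel <- odbcConnect("MatrixDB")                 # hypothetical DSN
longData <- sqlQuery(channel,
    "SELECT Symbol, Date, Value FROM Variable_1_Long
      WHERE Symbol IN ('AF3HM5V','RP1DZHB')")
odbcClose(channel)

# pivot the long result into a symbols x dates matrix; the long data frame and
# the wide matrix coexist in memory, which is where the inefficiency comes from
pivotSymbols <- unique(as.character(longData$Symbol))
pivotDates   <- unique(as.character(longData$Date))
subsetMatrix <- matrix(NA_real_, nrow=length(pivotSymbols), ncol=length(pivotDates),
                       dimnames=list(pivotSymbols, pivotDates))
subsetMatrix[cbind(as.character(longData$Symbol), as.character(longData$Date))] <- longData$Value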

C) [Should maybe have tried a column-based database here]

- Symbols and dates stored separately, mapping to row and column numbers only
- Each variable in a separate table with symbols for rows and dates for columns
- Really bad for how the data maps to disk when scaling rows and columns.
- Data already in correct format for R
- Average query for a normal sample into R: 1-3 minutes

In comparison with the above database setups, loading the whole variable from the RData files takes 5 seconds locally and 20 seconds over the network. All the database times above are over the network.

Is there anything I can do to make the database route come anywhere close to the binary file speeds?

Maybe one of the tabular NoSQL databases is what I need?

How does that scale in terms of additional symbols + dates?

Any help from someone who's dealt with something similar would be appreciated.

Update: Thought I'd post an update to this. In the end I went with Iterator's suggestion: the data is now hosted in bigmemory memory-mapped files, with RData copies kept for quick ad-hoc checking, and it is also exported to CSV and pulled into SQL Server for backup purposes. Any database solution was too slow to be used by multiple users. Using RODBC against SQL Server was also painfully slow; moving data between R and SQL Server via CSV worked acceptably, but it defeated the purpose.
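
For completeness, converting one of the in-memory matrices to a file-backed bigmemory object looks roughly like this (file names are examples, and I'm assuming the dimnames carry over so the character subsetting in the benchmark below works):

require("bigmemory")

# one backing file + descriptor per variable; attach.big.matrix() later memory-maps
# the backing file instead of reading the whole matrix into RAM
bigVar <- as.big.matrix(Variable_1,
                        backingfile    = "NVAR_1.bin",
                        descriptorfile = "NVAR_1.desc",
                        backingpath    = workingDirectory)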

Also, for reference, byte-compiling the load method for bigmemory does have an impact. Here are the results of my load test for RData vs bigmemory.

workingDirectory <- "/Users/Hans/92 Speed test/"
require("bigmemory")
require("compiler")
require("rbenchmark")

# Load each RData matrix in 'folder', subset it to the requested sedols/dates,
# and return the names of the variables that were loaded
LoadVariablesInFolder <- function(folder, sedols, dates) {
    filesInFolder <- dir(folder)
    filesToLoad <- filesInFolder[grepl(".*NVAR_.*\\.RData",filesInFolder)]
    filesToLoad <- paste(folder,filesToLoad,sep="/")
    variablesThatWereLoaded <- c()
    for(fToLoad in filesToLoad) {
        loadedVar <- load(fToLoad)
        assign(loadedVar,get(loadedVar)[sedols,dates])
        gc() -> ans
        variablesThatWereLoaded <- c(variablesThatWereLoaded,loadedVar)
        rm(list=c(loadedVar))
    }
    return(variablesThatWereLoaded)
}

cLoadVariablesInFolder <- cmpfun(LoadVariablesInFolder)

# Attach each file-backed big.matrix via its .desc descriptor and pull
# only the requested sedols/dates subset into memory
BigMLoadVariablesInFolder <- function(folder, sedols, dates) {
    workD <- getwd()
    setwd(folder)
    filesInFolder <- dir(folder)
    filesToLoad <- filesInFolder[grepl(".*NVAR_.*\\.desc",filesInFolder)]
    variablesThatWereLoaded <- c()
    for(fToLoad in filesToLoad) {
        tempVar <- attach.big.matrix(dget(fToLoad))
        loadedVar <- gsub(".*(NVAR_\\d+).*","\\1",fToLoad,perl=TRUE)
        assign(loadedVar,tempVar[sedols,dates])
        variablesThatWereLoaded <- c(variablesThatWereLoaded,loadedVar)
        rm(list=c(loadedVar,"tempVar"))
        gc() -> ans
    }
    setwd(workD)
    return(variablesThatWereLoaded)
}

cBigMLoadVariablesInFolder <- cmpfun(BigMLoadVariablesInFolder)

testCases <- list(
                list(numSedols=1000,numDates=120),
                list(numSedols=5000,numDates=120),
                list(numSedols=50000,numDates=120),
                list(numSedols=1000,numDates=350),
                list(numSedols=5000,numDates=350),
                list(numSedols=50000,numDates=350))

load(paste(workingDirectory,"dates.cache",sep="/"))
load(paste(workingDirectory,"sedols.cache",sep="/"))

for (testCase in testCases) {
    results <- benchmark(LoadVariablesInFolder(folder=workingDirectory,sedols=sedols[1:testCase$numSedols],dates=dates[1:testCase$numDates]),
              cLoadVariablesInFolder(folder=workingDirectory,sedols=sedols[1:testCase$numSedols],dates=dates[1:testCase$numDates]),
              BigMLoadVariablesInFolder(folder=workingDirectory,sedols=sedols[1:testCase$numSedols],dates=dates[1:testCase$numDates]),
              cBigMLoadVariablesInFolder(folder=workingDirectory,sedols=sedols[1:testCase$numSedols],dates=dates[1:testCase$numDates]),
              columns=c("test", "replications","elapsed", "relative"),
              order="relative", replications=3)
    cat("Results for testcase:\n")
    print(testCase)
    print(results)
}

Basically, the smaller the subset, the more is gained, because you don't spend time loading in the whole matrix. Loading the whole matrix, however, is slower from bigmemory than from RData; I guess that's the conversion overhead:

# Results for testcase:
# $numSedols
# [1] 1000

# $numDates
# [1] 120

                                                                                                                              # test
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 3  BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1      LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2     cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
  # replications             elapsed           relative
# 4            3   6.799999999999955  1.000000000000000
# 3            3  14.389999999999986  2.116176470588247
# 1            3 235.639999999999986 34.652941176470819
# 2            3 250.590000000000032 36.851470588235543
# Results for testcase:
# $numSedols
# [1] 5000

# $numDates
# [1] 120

                                                                                                                              # test
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 3  BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1      LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2     cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
  # replications             elapsed           relative
# 4            3   7.080000000000155  1.000000000000000
# 3            3  32.730000000000018  4.622881355932105
# 1            3 249.389999999999873 35.224576271185654
# 2            3 254.909999999999854 36.004237288134789
# Results for testcase:
# $numSedols
# [1] 50000

# $numDates
# [1] 120

                                                                                                                              # test
# 3  BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2     cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1      LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
  # replications           elapsed          relative
# 3            3 146.3499999999999 1.000000000000000
# 4            3 148.1799999999998 1.012504270584215
# 2            3 238.3200000000002 1.628425008541171
# 1            3 240.4600000000000 1.643047488896482
# Results for testcase:
# $numSedols
# [1] 1000

# $numDates
# [1] 350

                                                                                                                              # test
# 3  BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1      LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2     cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
  # replications            elapsed          relative
# 3            3  83.88000000000011 1.000000000000000
# 4            3  91.71000000000004 1.093347639484977
# 1            3 235.69000000000005 2.809847401049115
# 2            3 240.79999999999973 2.870767763471619
# Results for testcase:
# $numSedols
# [1] 5000

# $numDates
# [1] 350

                                                                                                                              # test
# 3  BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 2     cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1      LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
  # replications           elapsed          relative
# 3            3 135.6999999999998 1.000000000000000
# 4            3 155.8900000000003 1.148784082535008
# 2            3 233.3699999999999 1.719749447310245
# 1            3 240.5599999999995 1.772733971997051
# Results for testcase:
# $numSedols
# [1] 50000

# $numDates
# [1] 350

                                                                                                                              # test
# 2     cLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 1      LoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 3  BigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
# 4 cBigMLoadVariablesInFolder(folder = workingDirectory, sedols = sedols[1:testCase$numSedols], dates = dates[1:testCase$numDates])
  # replications           elapsed          relative
# 2            3 236.5000000000000 1.000000000000000
# 1            3 237.2100000000000 1.003002114164905
# 3            3 388.2900000000000 1.641818181818182
# 4            3 393.6300000000001 1.664397463002115


I would strongly recommend using HDF5. I assume that your data is complex enough that a variety of bigmemory files (i.e. memory-mapped matrices) would not easily satisfy your needs (see note 1), but HDF5 is just short of the speed of memory-mapped files. See this longer answer to another question to understand how I compare HDF5 and .RDat files.

Most notably, the fact that HDF5 supports random access means that you should be able to get substantial speed improvements.
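
As a rough illustration of that random access, here is a sketch using the Bioconductor rhdf5 bindings (the chunk size and file name are arbitrary assumptions, and note that HDF5 is addressed by integer indices, so the symbol/date name lookup has to live in R):

require("rhdf5")

# write one matrix to an HDF5 dataset, chunked so that subsets can be read
# without touching the rest of the file
h5createFile("variables.h5")
h5createDataset("variables.h5", "Variable_1",
                dims=dim(Variable_1), storage.mode="double",
                chunk=c(1000, 50), level=4)
h5write(Variable_1, "variables.h5", "Variable_1")

# random access: read only the first 2000 rows and 100 columns
subsetData <- h5read("variables.h5", "Variable_1", index=list(1:2000, 1:100))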

Another option, depending on your willingness to design your own binary format, is to use readBin and writeBin, though this doesn't have all of the nice features that HDF5 has, including parallel I/O, version information, portability, etc.
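
A rough sketch of what that might look like (entirely illustrative; you would have to store the dimnames yourself, and subsetting rows rather than columns would need strided reads):

# write the matrix column-major as raw 8-byte doubles
con <- file("Variable_1.bin", "wb")
writeBin(as.vector(Variable_1), con, size=8)
close(con)

# random access to a block of columns: seek to the first requested column,
# then read the contiguous block (R stores matrices column-major)
nRows <- 45000; firstCol <- 101; nCols <- 50
con <- file("Variable_1.bin", "rb")
seek(con, where=(firstCol-1)*nRows*8)
blockData <- matrix(readBin(con, what="double", n=nRows*nCols, size=8),
                    nrow=nRows, ncol=nCols)
close(con)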


Note 1: If you have just a few types per row, i.e. 1 character and the rest are numeric, you can simply create 2 memory mapped matrices, one of which is for characters, the other for numeric values. This will allow you to use bigmemory, mwhich, bigtabulate and lots of other nice functions in the bigmemory suite. I'd give that a reasonable effort, as it's a very easy system to smoothly integrate with lots of R code: the matrix need never enter memory, just whatever subsets you happen to need, and many instances can access the same files simultaneously. What's more, it is easy to parallelize access using multicore backends for foreach(). I used to have an operation that would take about 3 minutes per .Rdat file: about 2 minutes to load, about 20 seconds to subselect what I needed, about 10 seconds to analyze, and about 30 seconds to save the results. After switching to bigmemory, I got down to about 10 seconds to analyze and about 5-15 seconds on the I/O.
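
For instance, parallel access might look something like this (a sketch assuming a file-backed matrix with a descriptor file NVAR_1.desc and the doMC backend; each worker attaches the same backing file rather than copying the data):

require("bigmemory")
require("foreach")
require("doMC")
registerDoMC(cores=4)

bigDesc <- dget("NVAR_1.desc")                     # descriptor written by as.big.matrix()
columnMeans <- foreach(j=1:346, .combine=c) %dopar% {
    m <- attach.big.matrix(bigDesc)                # memory-maps the backing file, no copy
    mean(m[, j], na.rm=TRUE)
}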


Update 1: I overlooked the ff package - this is another good option, though it is a lot more complex than bigmemory.


Maybe the database design of the TSdbi package is inspiring...

For a NoSQL solution, HDF5 might be an option. I do not know much about it, though.


You could also delve into the uses/internals of sqldf. It seems they got their database work pretty well sorted out. There are also a lot of interesting examples given on their page.
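
For what it's worth, a minimal example of the sqldf style of access (the data frame here is an illustrative long-format stand-in, not the schema above):

require("sqldf")

# sqldf runs SQL against data frames via an on-the-fly SQLite database
longData <- data.frame(Symbol=c("AF3HM5V","RP1DZHB","RP1DZHB"),
                       Date=c("31/12/1982","31/12/1982","31/01/1983"),
                       Value=c(NA, -1.057, -0.721))
sqldf("SELECT Symbol, Date, Value FROM longData WHERE Symbol = 'RP1DZHB'")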
