开发者

Algorithm to Merge-Join Large files

开发者 https://www.devze.com 2023-03-27 16:04 出处:网络
Assume that I\'ve got four large files (too large to bring into memory even individually) that has information I need to process.I intend to produce a single application level object (Record) from eac

Assume that I've got four large files (too large to bring into memory even individually) that has information I need to process. I intend to produce a single application level object (Record) from each line in file #1. Files 2-4 each have additional pieces of information required to compose this Record object. For example, the file structure may be as follows:

File #1:

key, description

File #2:

key, metadata, size

File #3:

origin, rate, key

File #4:

key, startDate, endDate

Each file has a single column (of known position within a line) that represents a unique key. This key is shared acros开发者_Python百科s files, but there is no guarantee that each key that exists in any one file, exists in the others, meaning we'll only process the subset of keys that exist in all. The rows of the files are not sorted. Can you devise an algorithm to produce the app-level objects by processing these files?


Using a key-value store database

Databases are the best tools to handle datasets larger than your memory. Put your files into a key-value store (NoSQL DB like CouchDB or Cassandra will be great). Solve your problem using key queries.

Using sort and binary search

If you can't use databases, sort your file according to the key column (this can be easily done using GNU sort). Than you can access your files in nlogn time using the key. Iterate over the largest file, and process each record using calls to the other files. This way your disk reads are likely to ba cached.


You could dump everything into a database (actually, a plain SQL one would be fine), and then delete the "incomplete" records.

To do it file-wise, you can do this:

  • Sort all files by the id key
  • open all sorted files
  • read the first record from each file
  • if you don't have 4 "matching" records, discard the one with the lowest id until you do
  • merge the 4 "matching" records
  • rinse and repeat
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号