MongoDB import and deciphering changed rows


I have a large CSV file which contains over 30 million rows. I need to load this file on a daily basis and identify which of the rows have changed. Unfortunately there is no unique key field, but it's possible to use four of the fields together to make a row unique. Once I have identified the changed rows I will then want to export the data. I have tried a traditional SQL Server solution, but the performance is so slow it's not going to work. Therefore I have been looking at MongoDB; it managed to import the file in about 20 minutes (which is fine). Now, I don't have any experience using MongoDB, and more importantly I don't know the best practices. So my idea is the following:

  1. As a one-off, import the data into a collection using mongoimport.

  2. Copy all of the unique _id values generated by Mongo and put them in a separate collection.

  3. Import the new data into the existing collection using upsert fields, which should create a new id for each new and changed row (see the sketch below).

  4. Compare the 'copy' to the new collection to list out all the changed rows.

  5. Export changed data.

This seems workable to me, but I am hoping there is a much better way to tackle the problem.
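
A minimal sketch of what I have in mind for steps 1 and 3 is below, assuming the CSV has a header row and that key1, key2, key3 and key4 stand in for the four identifying columns (the database, collection and field names here are only placeholders, and the exact upsert flags vary by mongoimport version):

# Step 1: one-off import, assuming a header row in the CSV
mongoimport --db mydb --collection rows --type csv --headerline --file big_file.csv

# Step 3: daily load, upserting on the four identifying fields
# (newer mongoimport versions spell this --mode upsert instead of --upsert)
mongoimport --db mydb --collection rows --type csv --headerline --file new_file.csv \
    --upsert --upsertFields key1,key2,key3,key4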


Use unix sort and diff.

Sort the files on disk:

sort -o new_file.csv -t ',' big_file.csv
sort -o old_file.csv -t ',' yesterday.csv

diff new_file.csv old_file.csv

The commands may need some tweaking.
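
If you only want the rows that are new or changed in today's file (rather than the full two-sided diff), comm is another option; this sketch assumes the sorted output filenames from the commands above:

# comm needs both inputs sorted; with -13 it prints only the lines unique to
# the second file, i.e. rows that are new or have changed since yesterday
comm -13 old_file.csv new_file.csv > changed_rows.csv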

You can also use MySQL to import the file via

http://dev.mysql.com/doc/refman/5.1/en/load-data.html (LOAD DATA INFILE)

and then create a KEY (or primary key) on the four fields.

Then load yesterday's file into a different table and use a couple of SQL statements to compare the two tables...
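
A rough sketch of that route, assuming a database mydb, tables rows_today and rows_yesterday keyed on four columns key1..key4 plus a value column val1 (all of these names are placeholders, and LOCAL loading has to be enabled on the server):

mysql mydb <<'SQL'
-- load today's file (the table must already exist with matching columns)
LOAD DATA LOCAL INFILE 'new_file.csv'
INTO TABLE rows_today
FIELDS TERMINATED BY ',';

-- rows that are new today, or whose non-key value has changed
SELECT t.*
FROM rows_today t
LEFT JOIN rows_yesterday y
       ON  y.key1 = t.key1
       AND y.key2 = t.key2
       AND y.key3 = t.key3
       AND y.key4 = t.key4
WHERE y.key1 IS NULL
   OR y.val1 <> t.val1;
SQL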

But, diff will work best!

-daniel
