
Alternative to preventing duplicates in importing CSV to CouchDB

https://www.devze.com 2023-03-03 13:01 (source: web)
I have 2 big CSV files with millions of rows. Because those 2 CSVs come from MySQL, I want to merge the 2 tables into one document in CouchDB.


What is the most efficient way to do this?

My current method is:

  1. import 1st CSV
  2. import 2nd CSV

To prevent duplication, the program searches the documents by key for each row. After the row is found, the document is updated with the columns from the 2nd CSV.

The problem is that searching for each row takes a very long time. While importing the 2nd CSV it updates about 30 documents/second, and I have about 7 million rows. A rough calculation says the whole import will take about 64 hours.

Thank You


It sounds like you have a "primary key" that you know from the row (or you can compute it from the row). That is ideal as the document _id.
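To illustrate the point above, here is a minimal Python sketch of mapping CSV rows to CouchDB documents keyed by their natural primary key. The column name `user_id` is an assumption for illustration; substitute whatever key your MySQL tables share.

```python
import csv
import io

# Hypothetical CSV; "user_id" stands in for your real primary key column.
sample = "user_id,name\n42,Alice\n43,Bob\n"

def row_to_doc(row):
    # Use the natural key as the CouchDB _id so that both CSVs
    # map the same entity onto the same document.
    doc = dict(row)
    doc["_id"] = doc.pop("user_id")
    return doc

docs = [row_to_doc(r) for r in csv.DictReader(io.StringIO(sample))]
```

Because both imports derive `_id` the same way, rows for the same entity collide on the same document, which is exactly what makes the conflict-based merge below possible.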

The problem is, you get 409 Conflict if you try to add the 2nd CSV data but there was already a document with the same _id. Is that correct? (If so, please correct me so I can fix the answer.)

I think there is a good answer for you:

Use _bulk_docs to import everything, then fix the conflicts.

Begin with a clean database.

Use the bulk document API to insert all the rows from the 1st and then the 2nd CSV set, as many as possible per HTTP request (e.g. 1,000 at a time). Bulk inserts are much faster than inserting one-by-one.

Always add "all_or_nothing": true in your _bulk_docs POST data. That will guarantee that every insertion will be successful (assuming no disasters such as power loss or a full disk).
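The two steps above can be sketched in Python as a generator that slices the documents into batches and wraps each batch in a `_bulk_docs` request body. Each payload would then be POSTed to `/db/_bulk_docs`; the batch size of 1,000 matches the suggestion above, and the function name is illustrative.

```python
import json

def bulk_docs_payloads(docs, batch_size=1000):
    # Slice the full document list into batches and wrap each
    # batch in a _bulk_docs body with "all_or_nothing" set,
    # as recommended above (CouchDB 1.x semantics).
    for i in range(0, len(docs), batch_size):
        yield json.dumps({
            "all_or_nothing": True,
            "docs": docs[i:i + batch_size],
        })

# 2,500 dummy docs should produce 3 payloads: 1000 + 1000 + 500.
docs = [{"_id": str(n), "n": n} for n in range(2500)]
payloads = list(bulk_docs_payloads(docs))
```

Posting 1,000 docs per request instead of one amortizes the HTTP and fsync overhead, which is where the one-by-one approach loses most of its time.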

When you are done, some documents will be conflicted, which means you inserted twice for the same _id value. That is no problem. Simply follow this procedure to merge the two versions:

  1. For each _id that has conflicts, fetch it from CouchDB with GET /db/the_doc_id?conflicts=true.
  2. Merge all the values from the conflicting versions into a new final version of the document.
  3. Commit the final, merged document into CouchDB and delete the conflicted revisions. See the CouchDB Definitive Guide section on conflict resolution. (You can use _bulk_docs to speed this up too.)
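Steps 2 and 3 can be sketched as a pure function that merges the conflicting versions and builds the single `_bulk_docs` body that both commits the merge and deletes the losing revisions. The merge rule here (take every non-underscore field, keep the winner's value on a clash) is an assumption; your real merge logic depends on your data.

```python
def resolve_conflict(winning_doc, conflicting_docs):
    # Step 2: merge fields from all conflicting revisions into the
    # winning revision. Fields starting with "_" (_id, _rev) are
    # metadata and must not be copied across.
    merged = dict(winning_doc)
    for doc in conflicting_docs:
        for key, value in doc.items():
            if not key.startswith("_"):
                merged.setdefault(key, value)
    # Step 3: delete the losing revisions in the same request by
    # posting deletion stubs alongside the merged document.
    deletions = [{"_id": d["_id"], "_rev": d["_rev"], "_deleted": True}
                 for d in conflicting_docs]
    return {"docs": [merged] + deletions}

winner = {"_id": "some_id", "_rev": "1-aaa", "first_value": "x"}
losers = [{"_id": "some_id", "_rev": "1-bbb", "second_value": "y"}]
payload = resolve_conflict(winner, losers)
```

The returned payload is what you POST to /db/_bulk_docs, mirroring the curl example below: one full merged document plus one `_deleted: true` stub per losing revision.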

Example

Hopefully this will clarify a bit. Note, I installed the *manage_couchdb* couchapp from http://github.com/iriscouch/manage_couchdb. It has a simple view to show conflicts.

$ curl -XPUT -Hcontent-type:application/json localhost:5984/db
{"ok":true}

$ curl -XPOST -Hcontent-type:application/json localhost:5984/db/_bulk_docs --data-binary @-
{ "all_or_nothing": true
, "docs": [ { "_id": "some_id"
            , "first_value": "This is the first value"
            }
          , { "_id": "some_id"
            , "second_value": "The second value is here"
            }
          ]
}
[{"id":"some_id","rev":"1-d1b74e67eee657f42e27614613936993"},{"id":"some_id","rev":"1-d1b74e67eee657f42e27614613936993"}]

$ curl localhost:5984/db/_design/couchdb/_view/conflicts?reduce=false\&include_docs=true
{"total_rows":2,"offset":0,"rows":[
{"id":"some_id","key":["some_id","1-0cb8fd1fd7801b94bcd2f365ce4812ba"],"value":{"_id":"some_id","_rev":"1-0cb8fd1fd7801b94bcd2f365ce4812ba"},"doc":{"_id":"some_id","_rev":"1-0cb8fd1fd7801b94bcd2f365ce4812ba","first_value":"This is the first value"}},
{"id":"some_id","key":["some_id","1-d1b74e67eee657f42e27614613936993"],"value":{"_id":"some_id","_rev":"1-d1b74e67eee657f42e27614613936993"},"doc":{"_id":"some_id","_rev":"1-d1b74e67eee657f42e27614613936993","second_value":"The second value is here"}}
]}

$ curl -XPOST -Hcontent-type:application/json localhost:5984/db/_bulk_docs --data-binary @-
{ "all_or_nothing": true
, "docs": [ { "_id": "some_id"
            , "_rev": "1-0cb8fd1fd7801b94bcd2f365ce4812ba"
            , "first_value": "This is the first value"
            , "second_value": "The second value is here"
            }
          , { "_id": "some_id"
            , "_rev": "1-d1b74e67eee657f42e27614613936993"
            , "_deleted": true
            }
          ]
}
[{"id":"some_id","rev":"2-df5b9dc55e40805d7f74d1675af29c1a"},{"id":"some_id","rev":"2-123aab97613f9b621e154c1d5aa1371b"}]

$ curl localhost:5984/db/_design/couchdb/_view/conflicts?reduce=false\&include_docs=true
{"total_rows":0,"offset":0,"rows":[]}

$ curl localhost:5984/db/some_id?conflicts=true\&include_docs=true
{"_id":"some_id","_rev":"2-df5b9dc55e40805d7f74d1675af29c1a","first_value":"This is the first value","second_value":"The second value is here"}

The final two commands show that there are no conflicts, and the "merged" document is now served as "some_id".


Another option is simply to do what you are doing already, but use the bulk document API to get a performance boost.

For each batch of documents:

  1. POST to /db/_all_docs?include_docs=true with a body like this:

    { "keys": [ "some_id_1"
              , "some_id_2"
              , "some_id_3"
              ]
    }
    
  2. Build your _bulk_docs update depending on the results you get.

    • Doc already exists, you must update it: {"key":"some_id_1", "doc": {"existing":"data"}}
    • Doc does not exist, you must create it: {"key":"some_id_2", "error":"not_found"}
  3. POST to /db/_bulk_docs with a body like this:

    { "docs": [ { "_id": "some_id_1"
                , "_rev": "the _rev from the previous query"
                , "existing": "data"
                , "perhaps": "some more data I merged in"
                }
              , { "_id": "some_id_2"
                , "brand": "new data, since this is the first doc creation"
                }
              ]
    }
    
