Optimising a lot of inserts with JDBC and MySQL

I have to perform a large number of inserts (in this instance 27k) and I want to find an optimal way to do this. Right now this is the code that I have. As you can see I'm using a prepared statement and batches, and I'm executing the batch every 1000 records (I have also tried with a smaller number such as 10 and 100, but the time was again way too long). One thing which is omitted from the query is that there is an auto-generated ID, in case it is of any relevance to the issue:

private void parseIndividualReads(String file, DBAccessor db) {
    BufferedReader reader;
    try {
        Connection con = db.getCon();
        PreparedStatement statement = con.prepareStatement("INSERT INTO `vgsan01_process_log`.`contigs_and_large_singletons` (`seq_id` ,`length` ,`ws_id` ,`num_of_reads`) VALUES (?, ?, ?, ?)");
        long count = 0;
        reader = new BufferedReader(new FileReader(logDir + "/" + file));
        String line;
        while ((line = reader.readLine()) != null) {
            if(count != 0 && count % 1000 == 0)
                statement.executeBatch();

            if (line.startsWith(">")) {
                count++;
                String res[] = parseHeader(line);
                statement.setString(1, res[0]);
                statement.setInt(2, Integer.parseInt(res[1]) );
                statement.setInt(3, id);
                statement.setInt(4, -1);
                statement.addBatch(); 
            }
        }

        statement.executeBatch();
    } catch (FileNotFoundException ex) {
        Logger.getLogger(VelvetStats.class.getName()).log(Level.SEVERE, "Error opening file: " + file, ex);
    } catch (IOException ex) {
        Logger.getLogger(VelvetStats.class.getName()).log(Level.SEVERE, "Error reading from file: " + file, ex);
    } catch (SQLException ex) {
        Logger.getLogger(VelvetStats.class.getName()).log(Level.SEVERE, "Error inserting individual statistics " + file, ex);
    }
}

Any other tips regarding what might be changed in order to speed up the process would be appreciated. A single insert statement doesn't carry much information - I'd say no more than 50 characters across all 4 columns.

EDIT:

Okay, following the advice given, I have restructured the method as follows. The speed-up is immense. You could also play with the 1000 value, which might yield better results:

private void parseIndividualReads(String file, DBAccessor db) {
    BufferedReader reader;
    PrintWriter writer;
    try {
        Connection con = db.getCon();
        con.setAutoCommit(false);
        Statement st = con.createStatement();
        StringBuilder sb = new StringBuilder(10000);

        reader = new BufferedReader(new FileReader(logDir + "/" + file));
        writer = new PrintWriter(new BufferedWriter(new FileWriter(logDir + "/velvet-temp-contigs", true)), true);
        String line;
        long count = 0;
        while ((line = reader.readLine()) != null) {

            if (count != 0 && count % 1000 == 0) {
                sb.deleteCharAt(sb.length() - 1);
                st.executeUpdate("INSERT INTO `vgsan01_process_log`.`contigs_and_large_singletons` (`seq_id` ,`length` ,`ws_id` ,`num_of_reads`) VALUES " + sb);
                sb.delete(0, sb.capacity());
                count = 0;
            }
            //we basically build a giant VALUES (),(),()... string that we use for insert
            if (line.startsWith(">")) {
                count++;
                String res[] = parseHeader(line);
                sb.append("('" + res[0] + "','" + res[1] + "','" + id + "','" + "-1'" + "),");
            }
        }

        //insert all the remaining stuff
        sb.deleteCharAt(sb.length() - 1);
        st.executeUpdate("INSERT INTO `vgsan01_process_log`.`contigs_and_large_singletons` (`seq_id` ,`length` ,`ws_id` ,`num_of_reads`) VALUES " + sb);
        con.commit();
    } catch (FileNotFoundException ex) {
        Logger.getLogger(VelvetStats.class.getName()).log(Level.SEVERE, "Error opening file: " + file, ex);
    } catch (IOException ex) {
        Logger.getLogger(VelvetStats.class.getName()).log(Level.SEVERE, "Error reading from file: " + file, ex);
    } catch (SQLException ex) {
        Logger.getLogger(VelvetStats.class.getName()).log(Level.SEVERE, "Error working with mysql", ex);
    }
}


You have other solutions.

  1. Use LOAD DATA INFILE (see the MySQL documentation); a rough sketch of this is given below.
  2. If you want to do it in Java, use a single Statement that inserts many rows per query: "INSERT INTO mytable (col1,col2) VALUES (val1,val2),(val3,val4),..."

But I would recommend the 1st solution.
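
For reference, here is a minimal sketch of what option 1 could look like from JDBC. The file path, the tab-separated layout, and the helper method name are assumptions - adjust them to whatever format you write your temporary file in:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper: bulk-load a tab-separated dump of the four columns.
private void loadFromFile(Connection con, String path) throws SQLException {
    Statement st = con.createStatement();
    try {
        // LOCAL makes the JDBC client read the file, so it does not have to be
        // readable by the MySQL server itself (local_infile may need to be
        // enabled on the server and/or the driver for this to work).
        st.execute("LOAD DATA LOCAL INFILE '" + path + "' "
                + "INTO TABLE `vgsan01_process_log`.`contigs_and_large_singletons` "
                + "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' "
                + "(`seq_id`, `length`, `ws_id`, `num_of_reads`)");
    } finally {
        st.close();
    }
}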


The fastest way to do what you want to do is to load directly from the file (http://dev.mysql.com/doc/refman/5.5/en/load-data.html).

There are some problems with loading from a file - firstly, the file needs to be readable by the server, which often isn't the case. Error handling can be a pain, and you can end up with inconsistent or incomplete data if the data in your file does not meet the schema's expectations.

It also depends on what the actual bottleneck is - if the table you're inserting into has a lot of other activity going on, you may be better off using INSERT DELAYED (http://dev.mysql.com/doc/refman/5.5/en/insert-delayed.html).
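
For illustration, an INSERT DELAYED statement is just a regular insert with the DELAYED keyword. A rough sketch with placeholder values follows; note that in MySQL 5.5, DELAYED only works with certain storage engines (e.g. MyISAM), not InnoDB:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch only: the server queues the row and returns immediately instead of
// waiting for the insert to complete. The parameter values are placeholders.
private void insertDelayed(Connection con, String seqId, int length, int wsId) throws SQLException {
    Statement st = con.createStatement();
    try {
        st.executeUpdate("INSERT DELAYED INTO `vgsan01_process_log`.`contigs_and_large_singletons` "
                + "(`seq_id`, `length`, `ws_id`, `num_of_reads`) "
                + "VALUES ('" + seqId + "', " + length + ", " + wsId + ", -1)");
    } finally {
        st.close();
    }
}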

The official line on speeding up inserts is here: http://dev.mysql.com/doc/refman/5.5/en/insert-speed.html


Depending on the structure of your data, you have a potential bug in your "execute batch every 1000 iterations" logic.

If the frequency of lines beginning with ">" is low, then you could have an instance where the following happens (leading to lots of unnecessary executeBatch calls):

line in data file      events in program
-----------------------------------------------------------
> some data            (count=999)
> some more data       (count=1000)
another line           (execute batch, count=1000)
more unprocessed       (execute batch, count=1000)
some more              (execute batch, count=1000)

So I would move the if(count != 0 && count % 1000 == 0) inside the if (line.startsWith(">")) block.
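
In other words, something along these lines (same variables as in the original method; the flush now only happens right after a record has actually been added to the batch):

while ((line = reader.readLine()) != null) {
    if (line.startsWith(">")) {
        count++;
        String res[] = parseHeader(line);
        statement.setString(1, res[0]);
        statement.setInt(2, Integer.parseInt(res[1]));
        statement.setInt(3, id);
        statement.setInt(4, -1);
        statement.addBatch();

        // flush every 1000 added records instead of checking on every input line
        if (count % 1000 == 0) {
            statement.executeBatch();
        }
    }
}
statement.executeBatch(); // flush whatever is left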

Note: I'm not sure whether this can happen in your data, or how much of a speed-up it would give you.
