Sorting a 100MB XML file with Java?

Developer https://www.devze.com 2023-02-20 06:34 Source: Internet
How long does sorting a 100MB XML file with Java take?

The file has items with the following structure, and I need to sort them by event:

<doc>
    <id>84141123</id>
    <title>kk+ at Hippie Camp</title>
    <description>photo by SFP</description>
    <time>18945840</time>
    <tags>elphinstone tribalharmonix vancouver intention intention7 newyears hippiecamp bc sunshinecoast woowoo kk kriskrug sunglasses smoking unibomber møtleykrüg </tags>
    <geo></geo>
    <event>47409</event>
</doc>

I'm on an Intel Dual Duo Core with 4GB RAM.

Minutes? Hours?

Thanks


Here are the timings for a similar task executed using Saxon XQuery on a 100MB input file.

Saxon-EE 9.3.0.4J from Saxonica
Java version 1.6.0_20
Analyzing query from {for $i in //item order by location return $i}
Analysis time: 195 milliseconds
Processing file:/e:/javalib/xmark/xmark100.xml
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/e:/javalib/xmark/xmark100.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 6158 milliseconds
Tree size: 4787932 nodes, 79425460 characters, 381878 attributes
Execution time: 3.466s (3466ms)
Memory used: 471679816

So: about 6 seconds for parsing the input file and building a tree, 3.5 seconds for sorting it. That's invoked from the command line, but invoking it from Java will get very similar performance. Don't try to code the sort yourself - it's only a one-line query, and you are very unlikely to match the performance of an optimized XQuery engine.
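For the doc structure in the question, the equivalent query would presumably be something like for $d in //doc order by xs:integer($d/event) return $d. The JDK does not ship an XQuery engine, so as a dependency-free stand-in the sketch below swaps in the JDK's built-in XSLT 1.0 processor with xsl:sort instead of Saxon. It only illustrates the idea of letting an engine do the sort; the class name XsltEventSort and the stylesheet are assumptions of mine, and the timings above do not apply to it.

```java
import java.io.File;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltEventSort {
    // Copies the root element and its attributes, then emits the <doc> children
    // ordered by the numeric value of <event>. Anything else directly under the
    // root (including whitespace) is dropped by this stylesheet.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/*'>"
        + "<xsl:copy>"
        + "<xsl:copy-of select='@*'/>"
        + "<xsl:for-each select='doc'>"
        + "<xsl:sort select='event' data-type='number'/>"
        + "<xsl:copy-of select='.'/>"
        + "</xsl:for-each>"
        + "</xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Runs the stylesheet with whatever XSLT processor the JDK provides.
    static void sortByEvent(Source in, Result out) throws TransformerException {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        t.transform(in, out);
    }

    // String-to-string variant for small inputs; wraps the checked exception.
    static String sortByEvent(String xml) {
        StringWriter out = new StringWriter();
        try {
            sortByEvent(new StreamSource(new StringReader(xml)), new StreamResult(out));
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    // Usage: java XsltEventSort input.xml output.xml
    public static void main(String[] args) throws TransformerException {
        sortByEvent(new StreamSource(new File(args[0])), new StreamResult(new File(args[1])));
    }
}
```

The data-type='number' attribute matters here: without it the events would be compared as text, so 30 would sort before 5.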


I would say minutes. You should be able to do that completely in memory, so with a SAX parser it would be read, sort, write. That should not be a problem for your hardware.
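A minimal sketch of that read, sort, write idea using the JDK's SAX parser. It assumes each doc contains only flat text elements as in the question, that every doc has a numeric event, and that the root element's attributes can be dropped; the class and method names (SaxEventSort, DocCollector, sortByEvent) are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventSort {
    // Collects every <doc> as an ordered map of child element name to text.
    static class DocCollector extends DefaultHandler {
        String root;                                   // name of the outermost element
        final List<Map<String, String>> docs = new ArrayList<>();
        private Map<String, String> current;
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (root == null) root = qName;
            if (qName.equals("doc")) current = new LinkedHashMap<>();
            text.setLength(0);                         // start collecting this element's text
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);            // text may arrive in several chunks
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("doc")) {
                docs.add(current);
                current = null;
            } else if (current != null) {
                current.put(qName, text.toString());
            }
        }
    }

    // The parser decodes entities, so text has to be re-escaped on output.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Reads all docs into memory, sorts them by numeric <event>, writes them back out.
    static void sortByEvent(Reader in, Writer out) {
        try {
            DocCollector c = new DocCollector();
            SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(in), c);
            // List.sort is stable, so docs sharing an event keep their original order.
            c.docs.sort(Comparator.comparingLong(
                    (Map<String, String> d) -> Long.parseLong(d.get("event").trim())));
            out.write("<" + c.root + ">\n");
            for (Map<String, String> d : c.docs) {
                out.write("<doc>\n");
                for (Map.Entry<String, String> e : d.entrySet()) {
                    out.write("    <" + e.getKey() + ">" + escape(e.getValue())
                            + "</" + e.getKey() + ">\n");
                }
                out.write("</doc>\n");
            }
            out.write("</" + c.root + ">\n");
            out.flush();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For a real file you would pass something like Files.newBufferedReader(Paths.get("my-file.xml")) and a matching buffered writer, assuming the file is UTF-8.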


I think a problem like this would be better sorted using serialisation.

  1. Deserialise the XML file into an ArrayList of 'doc'.

  2. Using plain Java code, sort on the event field and store the sorted ArrayList in another variable.

  3. Serialise the sorted 'doc' ArrayList out to a file.
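The three steps above could be sketched roughly as follows, using the JDK's StAX reader and writer for the (de)serialisation. The Doc class, the method names, and the rootName parameter are assumptions for illustration (the question does not show the root element), and fields are kept as strings except for event.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

class DocListSort {
    // Hypothetical value class mirroring the <doc> element from the question.
    static class Doc {
        String id, title, description, time, tags, geo;
        long event;
    }

    // Step 1: deserialise every <doc> into an ArrayList.
    static List<Doc> read(Reader in) throws XMLStreamException {
        List<Doc> docs = new ArrayList<>();
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        Doc cur = null;
        while (r.hasNext()) {
            int type = r.next();
            if (type == XMLStreamConstants.START_ELEMENT) {
                String name = r.getLocalName();
                if (name.equals("doc")) {
                    cur = new Doc();
                } else if (cur != null) {
                    String text = r.getElementText();   // consumes up to the child's end tag
                    switch (name) {                      // unknown children are ignored
                        case "id": cur.id = text; break;
                        case "title": cur.title = text; break;
                        case "description": cur.description = text; break;
                        case "time": cur.time = text; break;
                        case "tags": cur.tags = text; break;
                        case "geo": cur.geo = text; break;
                        case "event": cur.event = Long.parseLong(text.trim()); break;
                    }
                }
            } else if (type == XMLStreamConstants.END_ELEMENT && r.getLocalName().equals("doc")) {
                docs.add(cur);
                cur = null;
            }
        }
        r.close();
        return docs;
    }

    // Step 2: sort a copy by event, leaving the original list untouched.
    static List<Doc> sortedByEvent(List<Doc> docs) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingLong((Doc d) -> d.event));
        return sorted;
    }

    // Step 3: serialise the list back to XML under the given root element.
    static void write(List<Doc> docs, String rootName, Writer out) throws XMLStreamException {
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartDocument();
        w.writeStartElement(rootName);
        for (Doc d : docs) {
            w.writeStartElement("doc");
            String[][] fields = {{"id", d.id}, {"title", d.title}, {"description", d.description},
                    {"time", d.time}, {"tags", d.tags}, {"geo", d.geo},
                    {"event", Long.toString(d.event)}};
            for (String[] f : fields) {
                w.writeStartElement(f[0]);
                w.writeCharacters(f[1] == null ? "" : f[1]);   // missing fields become empty
                w.writeEndElement();
            }
            w.writeEndElement();
        }
        w.writeEndElement();
        w.writeEndDocument();
        w.flush();
        w.close();
    }

    // Convenience wrapper for small inputs; wraps the checked exception.
    static String sortXml(String xml, String rootName) {
        try {
            StringWriter out = new StringWriter();
            write(sortedByEvent(read(new StringReader(xml))), rootName, out);
            return out.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Holding typed objects instead of raw text makes it easy to sort on any other field later by changing only the comparator in step 2.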


If you do it in memory, you should be able to do this in under 10 seconds. You would be pushing it to do this in under 2 seconds, because it will spend about that much time reading from and writing to disk.

This program should use no more than 4-5 times the original file size in memory, about 500 MB in your case.

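import java.io.File;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.commons.io.FileUtils;

// Splitting on <doc> and </doc> leaves the record bodies at the odd indices.
String[] records = FileUtils.readFileToString(new File("my-file.xml"), StandardCharsets.UTF_8).split("</?doc>");
// Keep the records in a list: a map keyed by event would silently drop records that share an event.
List<String> docs = new ArrayList<String>();
for (int i = 1; i < records.length; i += 2) {
    docs.add(records[i]);
}
// Stable sort on the number between <event> and </event> ("<event>" is 7 characters long).
docs.sort(Comparator.comparingLong((String record) -> {
    int pos1 = record.indexOf("<event>");
    int pos2 = record.indexOf("</event>", pos1 + 7);
    return Long.parseLong(record.substring(pos1 + 7, pos2).trim());
}));

// Reassemble: original prefix, sorted records, original suffix.
StringBuilder sb = new StringBuilder(records[0]);
for (String s : docs) {
    sb.append("<doc>").append(s).append("</doc>");
}
sb.append(records[records.length - 1]);
FileUtils.writeStringToFile(new File("my-output-file.xml"), sb.toString(), StandardCharsets.UTF_8);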