I have a 1 GB XML file. How can I split it into smaller, well-formed XML files using Java?
Here is an example:
<records>
  <record id="001">
    <name>john</name>
  </record>
 ....
</records>
Thanks.
I would use a StAX parser for this situation. It will prevent the entire document from being read into memory at one time.
- Advance the XMLStreamReader to the local root element of the sub-fragment.
- You can then use the javax.xml.transform APIs to produce a new document from this XML fragment. This will advance the XMLStreamReader to the end of that fragment.
- Repeat step 1 for the next fragment.
Code Example
For the following XML, write each "statement" element to a file named after the value of its "account" attribute:
<statements>
   <statement account="123">
      ...stuff...
   </statement>
   <statement account="456">
      ...stuff...
   </statement>
</statements>
This can be done with the following code:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
public class Demo {
    public static void main(String[] args) throws Exception  {
        new File("out").mkdirs(); // The output directory must exist before transforming
        XMLInputFactory xif = XMLInputFactory.newInstance();
        // Use a byte stream rather than a FileReader so the parser takes the
        // encoding from the XML declaration instead of the platform default
        try (InputStream in = new FileInputStream("input.xml")) {
            XMLStreamReader xsr = xif.createXMLStreamReader(in);
            xsr.nextTag(); // Advance to statements element
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer t = tf.newTransformer();
            // Loop ends when nextTag() reaches the closing statements tag
            while (xsr.nextTag() == XMLStreamConstants.START_ELEMENT) {
                File file = new File("out/" + xsr.getAttributeValue(null, "account") + ".xml");
                // Copies the current statement element and advances the reader past it
                t.transform(new StAXSource(xsr), new StreamResult(file));
            }
            xsr.close();
        }
    }
}
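The example above writes one file per fragment. If the goal is fewer, larger files, the same streaming idea can batch several fragments into each output document. The following is only a sketch under assumptions, not part of the original answer: the class name `ChunkSplitter`, the method `split`, the `recordsPerFile` parameter, and the `chunk-N.xml` file names are all hypothetical, and it assumes every direct child of the root element is a record to be copied. It uses the StAX event API (`XMLEventReader` and `XMLEventWriter`) and re-emits the root start tag in each chunk so that every output file is well-formed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: batches the children of the root element into chunk files.
public class ChunkSplitter {

    /** Writes at most recordsPerFile root children per file; returns the number of files written. */
    public static int split(InputStream in, Path outDir, int recordsPerFile)
            throws IOException, XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLOutputFactory xof = XMLOutputFactory.newInstance();
        XMLEventFactory ef = XMLEventFactory.newInstance();
        Files.createDirectories(outDir);

        StartElement root = null;   // remembered so each chunk can repeat it
        OutputStream out = null;
        XMLEventWriter writer = null;
        int depth = 0, inChunk = 0, files = 0;

        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement()) {
                depth++;
                if (depth == 1) {
                    root = e.asStartElement();
                    continue;
                }
                if (depth == 2 && writer == null) {
                    // First record of a new chunk: open a file and write the root start tag
                    out = Files.newOutputStream(outDir.resolve("chunk-" + files + ".xml"));
                    writer = xof.createXMLEventWriter(out, "UTF-8");
                    writer.add(ef.createStartDocument("UTF-8", "1.0"));
                    writer.add(root);
                    files++;
                }
            }
            // Copy everything that belongs to a record (depth 2 or deeper)
            if (depth >= 2 && writer != null) {
                writer.add(e);
            }
            if (e.isEndElement()) {
                if (depth == 2 && ++inChunk == recordsPerFile) {
                    finish(writer, out, ef, root);
                    writer = null;
                    inChunk = 0;
                }
                depth--;
            }
        }
        if (writer != null) {
            finish(writer, out, ef, root); // last, partially filled chunk
        }
        reader.close();
        return files;
    }

    // Closes the root element and the document, then releases the file
    private static void finish(XMLEventWriter w, OutputStream out, XMLEventFactory ef,
                               StartElement root) throws IOException, XMLStreamException {
        QName n = root.getName();
        w.add(ef.createEndElement(n.getPrefix(), n.getNamespaceURI(), n.getLocalPart()));
        w.add(ef.createEndDocument());
        w.close();
        out.close();
    }
}
```

A call such as `ChunkSplitter.split(new FileInputStream("input.xml"), Paths.get("out"), 10000)` would then put up to 10000 records in each file. Whitespace between records is dropped because only events inside a record are copied.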
Try this, using Saxon-EE 9.3.
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:mode streamable="yes"/>
    <xsl:template match="record">
      <xsl:result-document href="record-{@id}.xml">
        <xsl:copy-of select="."/>
      </xsl:result-document>
    </xsl:template>
</xsl:stylesheet>
The software isn't free, but if it saves you a day's coding you can easily justify the investment. (Apologies for the sales pitch).
DOM, StAX, and SAX will all work, but each has its own pros and cons.
- With DOM, you can't hold all of the data in memory for a file this size.
- Programming control is easiest with DOM, then StAX, and then SAX.
- A combination of SAX and DOM is a better option.
- Using a framework that already does this may be the best option. Have a look at Smooks: http://www.smooks.org
Hope this helps.
I respectfully disagree with Blaise Doughan. SAX is not only hard to use but also very slow. With VTD-XML, you can use XPath to simplify the processing logic (a 10x code reduction is very common), and it is also much faster because there is no redundant encoding/decoding conversion. Below is the Java code using VTD-XML:
import java.io.FileOutputStream;
import com.ximpleware.*;
public class split {
    public static void main(String[] args) throws Exception {
        VTDGen vg = new VTDGen();
        // parseFile reads a local file; parseHttpUrl expects a URL
        if (vg.parseFile("c:\\xml\\input.xml", true)) {
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("/records/record");
            byte[] xml = vn.getXML().getBytes(); // Raw bytes of the parsed document
            int j = 0;
            while (ap.evalXPath() != -1) {
                // Lower 32 bits hold the fragment's byte offset, upper 32 bits its length
                long l = vn.getElementFragment();
                try (FileOutputStream fos = new FileOutputStream("out" + j + ".xml")) {
                    fos.write(xml, (int) l, (int) (l >> 32));
                }
                j++;
            }
        }
    }
}
 
         
                                         
                                         
                                         