
Partially Parsing XML File Without XMLParser in JAVA

Developer (https://www.devze.com), 2023-03-16 12:02. Source: the web

So I found out it is possible to use a buffered reader/writer to copy an XML file word for word to a new XML file. However, I was wondering whether it would be possible to scrape out only a portion of the document?

For example, looking at this example:

<?xml version="1.0" encoding="UTF-8"?>
<BookCatalogue xmlns="http://www.publishing.org">
  <w:pStyle w:val="TOAHeading" />
  <Book>
    <Title>Yogasana Vijnana: the Science of Yoga</Title>
    <Author>Dhirendra Brahmachari</Author>
    <Date>1966</Date>
    <ISBN>81-40-34319-4</ISBN>
    <Publisher>Dhirendra Yoga Publications</Publisher>
    <Cost currency="INR">11.50</Cost>
  </Book>
  <Book>
    <Title>The First and Last Freedom</Title>
    <v:imagedata r:id="rId7" o:title="" croptop="10523f" cropbottom="11721f" /> 
    <Author>J. Krishnamurti</Author>
    <Date>1954</Date>
    <ISBN>0-06-064831-7</ISBN>
    <Publisher>Harper &amp; Row</Publisher>
    <Cost currency="USD">2.95</Cost>
  </Book>
<w:pStyle w:val="TOAHeading2" />
</BookCatalogue> 

Sorry if this is not proper XML; I just added the tidbits from the document I was looking at to this sample I found. But basically, if I wanted to look for an instance of "heading" (in this case, 3rd line -> TOAHeading), then scrape everything from that heading down until another instance of "heading" is found and copy it to another XML file, is that possible? Furthermore, if I wanted to make that a temporary file and only keep it if an instance of "image" (in this case, 14th line) is found, is that possible as well? I'm trying to do this in the simplest way possible, so does anyone have any ideas or experience with this? Thanks in advance.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class IPDriver
{
    public static void main(String[] args) throws IOException
    {
        // try-with-resources closes both files (and flushes the writer) even if the copy fails.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("C:/Documents and Settings/user/workspace/Intern Project/Proposals/Converted Proposals/Extracted Items/ProposalOne/word/document.xml"), "UTF-8"));
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("C:/Documents and Settings/user/workspace/Intern Project/Proposals/Converted Proposals/Extracted Items/ProposalOne/word/tempdocument.xml"), "UTF-8")))
        {
            String line;

            while ((line = reader.readLine()) != null)
            {
                writer.write(line);
                // readLine() strips the line terminator, so write one back.
                writer.newLine();
            }
        }
    }
}

Example From My Actual XML Document

- <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="address">
- <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="Street">
- <w:r w:rsidRPr="00822244">
  <w:t>6841 Benjamin Franklin Drive</w:t> 
  </w:r>
  </w:smartTag>
  </w:smartTag>
  </w:p>
- <w:p w:rsidR="00B41602" w:rsidRPr="00822244" w:rsidRDefault="00B41602" w:rsidP="007C3A42">
- <w:pPr>
  <w:pStyle w:val="Address" /> 
  </w:pPr>
- <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="City">
- <w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="place">

Just your basic document.xml file from a .docx


You will probably want to read about Java XML parsers. The two classic types are SAX parsers and DOM parsers.

SAX parsers are event-based: the parser scans over the XML file for you and calls a set of callback methods that you have defined, such as startElement() and endElement(). SAX parsers are efficient for very large XML files because they never hold the whole document in memory.
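As a rough sketch of the callback style, the handler below collects the w:val of every w:pStyle element it sees. It uses the SAX parser that ships with the JDK; the class name StyleCollector and the tiny inline document are made up for illustration, and the parser is left at its default (not namespace-aware) setting so that qName is the raw "w:pStyle".

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: remembers the w:val of each <w:pStyle> start tag.
final class StyleCollector extends DefaultHandler {
    private final List<String> styles = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Called by the parser once per opening tag, in document order.
        if ("w:pStyle".equals(qName)) {
            styles.add(attrs.getValue("w:val"));
        }
    }

    static List<String> collect(String xml) throws Exception {
        StyleCollector handler = new StyleCollector();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.styles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:body xmlns:w=\"urn:example\">"
                + "<w:pStyle w:val=\"TOAHeading\"/><w:p/><w:pStyle w:val=\"Address\"/>"
                + "</w:body>";
        System.out.println(collect(xml));
    }
}
```

For a real document.xml you would pass a FileInputStream instead of the in-memory string; the handler itself stays the same.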

DOM parsers read the entire XML document into memory, and then you can query the resulting DOM object by calling methods like getElementsByTagName("w:pStyle"). DOM parsers tend to be a bit easier to work with, but use more memory than SAX parsers.
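A minimal sketch of the DOM approach, again with the JDK's built-in parser. The class name DomStyleQuery and both helper methods are hypothetical; the factory is left at its default (not namespace-aware), which is why the prefixed names "w:pStyle" and "v:imagedata" can be looked up literally.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: loads the whole document, then queries it.
final class DomStyleQuery {
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Returns the w:val attribute of every <w:pStyle>, in document order.
    static List<String> styleValues(Document doc) {
        NodeList nodes = doc.getElementsByTagName("w:pStyle");
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(((Element) nodes.item(i)).getAttribute("w:val"));
        }
        return values;
    }

    // True if the document contains at least one <v:imagedata> element.
    static boolean hasImage(Document doc) {
        return doc.getElementsByTagName("v:imagedata").getLength() > 0;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<doc xmlns:w=\"urn:w\" xmlns:v=\"urn:v\">"
                + "<w:pStyle w:val=\"TOAHeading\"/><v:imagedata/></doc>");
        System.out.println(styleValues(doc) + " image=" + hasImage(doc));
    }
}
```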

There will be a bit of a learning curve, but these are the standard ways of processing XML in Java. There are also libraries designed to simplify the standard APIs, such as JDOM.


I've seen a lot of technically-correct suggestions, but your request (when taken as-written) suggests to me that you have the following requirements:

  • Start parsing at a case-insensitive (and potentially PARTIAL) match of an attribute value; in your case you wanted to match "heading" to the second half of "TOAHeading".
  • Parse from that odd starting condition down to a matching (and equally odd) ending condition.

If I understood your requirements, you basically want to do a totally unstructured parse of a very structured piece of data (XML markup). In that case, an XML parser, an XSLT transform, a DOM parser, or anything else written against the XML spec is going to be a pain in the ass to bend to your needs.

You'll need to do a case-insensitive scan of your document contents until you get your match, then pull all the characters between that match and an ending match.

If the documents aren't huge (say 1 MB or smaller), just read the whole thing into memory as a String and either make quick-and-dirty use of indexOf for the differently cased versions of what you want, OR read the whole thing into a char[] and write some more efficient scanning code that does a case-insensitive match for the starting value you want to begin parsing at.
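A rough sketch of that in-memory scan, under the assumption that the start and end markers are the same word (here "heading"). HeadingSlicer and its two methods are hypothetical names. Instead of trying indexOf on several cased variants, it walks the string with String.regionMatches in ignore-case mode.

```java
// Hypothetical helper for an unstructured, case-insensitive slice of a document.
final class HeadingSlicer {
    // Index of the first case-insensitive occurrence of needle at or after from, or -1.
    static int indexOfIgnoreCase(String haystack, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= haystack.length(); i++) {
            if (haystack.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Text from the first marker up to (not including) the next marker,
    // or to the end of the text if there is no second one; null if no marker at all.
    static String sliceBetween(String text, String marker) {
        int start = indexOfIgnoreCase(text, marker, 0);
        if (start < 0) {
            return null;
        }
        int end = indexOfIgnoreCase(text, marker, start + marker.length());
        return text.substring(start, end < 0 ? text.length() : end);
    }

    public static void main(String[] args) {
        String text = "<a w:val=\"TOAHeading\"/><b>x</b><a w:val=\"toaHEADING2\"/>";
        System.out.println(sliceBetween(text, "heading"));
    }
}
```

Note that a slice taken this way starts and ends in the middle of tags, so the result is a text fragment, not a well-formed XML file.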

If I misunderstood your requirement and it is actually much more structured than it sounded in your description above, then please use one of the other suggestions that is more focused on true XML parsing. I am just putting this solution out there in the off chance that it was as random as you made it out to seem.

(NOTE: I'm not saying it's BAD, I've just never seen that request before. You have your own reasons for needing to do that and we'll just try and help ;)


The proper way to do this would be to use an XSLT transform that emits everything except what you don't want. This is just what XSLT is meant to do.
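To give a feel for the "emit everything except X" pattern, here is a small sketch using the XSLT 1.0 processor bundled with the JDK (javax.xml.transform). The stylesheet is the usual identity template plus one empty template that swallows an element. The class name DropElements, the choice of Date as the unwanted element, and the namespace-free input are all assumptions for illustration; elements in a namespace (as in the question's sample) would need a prefixed match pattern.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example: copy a document but leave out every <Date> element.
final class DropElements {
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        // Identity template: copy every attribute and node as-is.
        + "<xsl:template match=\"@*|node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@*|node()\"/></xsl:copy>"
        + "</xsl:template>"
        // Empty template: matching elements produce no output.
        + "<xsl:template match=\"Date\"/>"
        + "</xsl:stylesheet>";

    static String transform(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform("<Book><Title>T</Title><Date>1954</Date></Book>"));
    }
}
```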

Don't parse this by hand; it will lead to failure. And definitely don't even think of using regular expressions; that will lead to epic failure.

If you can't get your head around XSLT (it is a paradigm shift from procedural coding), ask for help here, or fall back on a traditional XML parsing library. For your use case you will probably have to use some DOM-based parser; I prefer JDOM.


If you are sure that your XML looks like this, you can simply compare each line with <w:pStyle w:val="TOAHeading" />, and then start outputting the following lines, until you find a line which matches <w:pStyle w:val="TOAHeading2" />.
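A sketch of that line-by-line comparison, assuming each marker tag sits on its own line exactly as in the sample. LineRangeCopier and copyRange are hypothetical names. The method copies from the start line (inclusive) up to the end line (exclusive) and reports whether any copied line contained the image marker, so the caller can decide whether to keep the output.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper: copies the lines between two marker lines.
final class LineRangeCopier {
    // Returns true if any copied line contains imageMarker.
    static boolean copyRange(Reader in, Writer out, String startLine, String endLine,
                             String imageMarker) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        boolean copying = false;
        boolean sawImage = false;
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();   // ignore indentation when comparing
            if (!copying) {
                copying = trimmed.equals(startLine);
            } else if (trimmed.equals(endLine)) {
                break;                      // stop before the closing marker
            }
            if (copying) {
                out.write(line);
                out.write('\n');
                sawImage |= line.contains(imageMarker);
            }
        }
        out.flush();
        return sawImage;
    }

    public static void main(String[] args) throws IOException {
        String xml = "<x>\n  <w:pStyle w:val=\"TOAHeading\" />\n<Book>\n"
                + "<v:imagedata r:id=\"rId7\" />\n</Book>\n<w:pStyle w:val=\"TOAHeading2\" />\n";
        StringWriter out = new StringWriter();
        boolean keep = copyRange(new StringReader(xml), out,
                "<w:pStyle w:val=\"TOAHeading\" />", "<w:pStyle w:val=\"TOAHeading2\" />", "imagedata");
        System.out.println("keep=" + keep);
        System.out.print(out);
    }
}
```

For the temporary-file part of the question, one option is to write to a file created with Files.createTempFile and call Files.delete on it when copyRange returns false.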

But why would you do this? It is fragile to any formatting changes. Use an XML parser (and an XML writer); it makes life much easier.
