Parsing XML in Java from a WordPress feed

Source: https://www.devze.com, 2023-04-11 05:45 (reposted from the web)
private void parseXml(String urlPath) throws Exception {
    URL url = new URL(urlPath);
    URLConnection connection = url.openConnection();
    DocumentBuilder db = DOCUMENT_BUILDER_FACTORY.newDocumentBuilder();

    final Document document = db.parse(connection.getInputStream());
    XPath xPathEvaluator = XPATH_FACTORY.newXPath();
    XPathExpression nameExpr = xPathEvaluator.compile("rss/channel/item/title");
    NodeList trackNameNodes = (NodeList) nameExpr.evaluate(document, XPathConstants.NODESET);
    for (int i = 0; i < trackNameNodes.getLength(); i++) {
        Node trackNameNode = trackNameNodes.item(i);
        System.out.println(String.format("Blog Entry Title: %s", trackNameNode.getTextContent()));
        XPathExpression artistNameExpr = xPathEvaluator.compile("rss/channel/item/content:encoded");
        NodeList artistNameNodes = (NodeList) artistNameExpr.evaluate(trackNameNode, XPathConstants.NODESET);
        for (int j=0; j < artistNameNodes.getLength(); j++) {
            System.out.println(String.format(" - Artist Name: %s", artistNameNodes.item(j).getTextContent()));
        }
    }
}

I have this code for parsing the title and content from the default WordPress XML feed. The only problem is that when I try to get the content of a blog entry, the XML tag is <content:encoded>, and I do not understand how to retrieve this data.


The tag <content:encoded> means an element with the local name encoded in the XML namespace bound to the prefix content. The XPath evaluator is probably unable to resolve the content prefix to its namespace, which I think is http://purl.org/rss/1.0/modules/content/ from a quick Google search.
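
To make the prefix/namespace relationship concrete, here is a minimal sketch that parses a tiny hand-written feed and inspects the element with plain DOM calls. The class name PrefixDemo, the sample XML, and the namespace URI are assumptions for illustration; check the xmlns:content declaration in your own feed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class PrefixDemo {
    // Hypothetical one-item feed; the namespace URI is assumed, not taken from a real feed.
    static final String XML =
        "<rss version=\"2.0\" xmlns:content=\"http://purl.org/rss/1.0/modules/content/\">"
        + "<channel><item><title>First post</title>"
        + "<content:encoded>Hello</content:encoded>"
        + "</item></channel></rss>";

    // Returns { prefix, local name, namespace URI } of the first <content:encoded> element.
    static String[] describeEncoded(boolean namespaceAware) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(namespaceAware);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
        // getElementsByTagName matches on the qualified name, prefix included.
        Element encoded = (Element) doc.getElementsByTagName("content:encoded").item(0);
        return new String[] {
            encoded.getPrefix(), encoded.getLocalName(), encoded.getNamespaceURI()
        };
    }

    public static void main(String[] args) throws Exception {
        System.out.println(String.join(" | ", describeEncoded(true)));
    }
}
```

With a namespace-aware parser the element reports its prefix, local name, and namespace URI separately; that URI is what an XPath prefix has to be mapped to.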

To get it to resolve, you'll need to do the following:

  1. Ensure your DocumentBuilderFactory has setNamespaceAware(true) called on it after construction, otherwise all namespaces are discarded during parsing.
  2. Write an implementation of javax.xml.namespace.NamespaceContext to resolve the prefix to its namespace.
  3. Call XPath#setNamespaceContext() with your implementation.
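
The three steps above can be sketched roughly as follows. This is an illustration under assumptions, not a drop-in replacement for the code in the question: the class and method names (FeedContentExample, ContentNamespaceContext, extractContents) and the sample feed are made up, and the namespace URI should be confirmed against your feed. It also evaluates content:encoded relative to each item node rather than relative to the title node.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FeedContentExample {
    // Assumed namespace URI for the "content" prefix; verify it in your feed.
    static final String CONTENT_NS = "http://purl.org/rss/1.0/modules/content/";

    // Hypothetical stand-in for a WordPress feed.
    static final String SAMPLE_FEED =
        "<rss version=\"2.0\" xmlns:content=\"" + CONTENT_NS + "\">"
        + "<channel><item><title>First post</title>"
        + "<content:encoded><![CDATA[<p>Hello</p>]]></content:encoded>"
        + "</item></channel></rss>";

    // Step 2: map the prefix used in XPath expressions to its namespace URI.
    static final class ContentNamespaceContext implements NamespaceContext {
        @Override
        public String getNamespaceURI(String prefix) {
            if (prefix == null) {
                throw new IllegalArgumentException("prefix must not be null");
            }
            return "content".equals(prefix) ? CONTENT_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return CONTENT_NS.equals(namespaceURI) ? "content" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            String prefix = getPrefix(namespaceURI);
            return prefix == null
                    ? Collections.<String>emptyIterator()
                    : Collections.singletonList(prefix).iterator();
        }
    }

    // Returns the text of <content:encoded> for every <item> in the feed.
    static List<String> extractContents(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // Step 1: keep namespace information.
        Document document = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new ContentNamespaceContext()); // Step 3.

        NodeList items = (NodeList) xpath.evaluate(
                "/rss/channel/item", document, XPathConstants.NODESET);
        List<String> contents = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            // Relative path, evaluated against the current <item> node.
            contents.add(xpath.evaluate("content:encoded", items.item(i)));
        }
        return contents;
    }

    public static void main(String[] args) throws Exception {
        for (String content : extractContents(SAMPLE_FEED)) {
            System.out.println("Blog Entry Content: " + content);
        }
    }
}
```

To read a live feed instead of the sample string, the InputSource could wrap connection.getInputStream() from the original parseXml method.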


You could also try XStream, which is a good, easy-to-use XML library. It leaves you almost no work when parsing known XML structures.

PS: Their site is currently offline, use Google Cache to see it =P
