Parse HTML with XML::LibXML without touching entities

Developer https://www.devze.com 2023-04-12 05:32 Source: web
I'm using XML::LibXML to parse a chunk of HTML in order to change the title attribute of all the anchor elements. The problem is that XML::LibXML tampers with unencoded entities, and changes e.g. '&' to '&amp;' in the URL params in the href attributes.

How do I tell XML::LibXML not to encode or decode any of these entities?

#!/usr/bin/perl -w

use strict;
use XML::LibXML;

my $parser = XML::LibXML->new(recover => 2);

my $html = '
<div>
    <span>this & that &amp; what?</span>
    <a title="link1" href="http://url.com/foo?a=1&b=2">Link1</a>
    <a title="link2" href="http://url.com/foo?a=1&b=2">Link2</a>
</div>';

my $doc = $parser->load_html(string => $html);

for my $node ($doc->findnodes('//*[@title]')) {
    $node->setAttribute('title', 'newtitle');
}

print $doc->toString(), "\n";

__END__

This produces the following output:

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
    <span>this &amp; that &amp; what?</span>
    <a title="newtitle" href="http://url.com/foo?a=1&amp;b=2">Link1</a>
    <a title="newtitle" href="http://url.com/foo?a=1&amp;b=2">Link2</a>
</div></body></html>

As you can see, XML::LibXML has altered the URLs, and also the text inside the span tag!


You are mistaken. The URL did not change. Both the original HTML and the generated HTML produce the same URL (http://url.com/foo?a=1&b=2). The HTML is different, but the URL is not.

The same goes for the text in the span. Both the original HTML and the generated HTML produce the same text (this & that & what?). The HTML is different, but the text displayed is not.

To my knowledge, there's no way to control which characters XML::LibXML's toString escapes. Apparently, it chooses to escape «&» even when that's not technically required in HTML.

And why not? There's no harm in having "&" escaped.

«this & that &amp; what?» and «this &amp; that &amp; what?» mean the same in HTML.

«href="http://url.com/foo?a=1&amp;b=2"» and «href="http://url.com/foo?a=1&b=2"» mean the same in HTML.
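One way to check this claim yourself is to parse both spellings and compare the decoded attribute values. The following is only a minimal sketch: the helper name first_href is made up for illustration, and it assumes the same recover => 2 parser setup used in the question. If the two forms really mean the same thing, the two href values should compare equal.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Same lenient parser setup as in the question.
my $parser = XML::LibXML->new(recover => 2);

# Hypothetical helper: return the decoded href of the first <a> element
# found in an HTML fragment.
sub first_href {
    my ($html) = @_;
    my $doc = $parser->load_html(string => $html);
    my ($anchor) = $doc->findnodes('//a');
    return $anchor->getAttribute('href');
}

# One fragment with a bare ampersand, one with an escaped ampersand.
my $raw     = first_href('<a href="http://url.com/foo?a=1&b=2">Link</a>');
my $escaped = first_href('<a href="http://url.com/foo?a=1&amp;b=2">Link</a>');

print "raw:     $raw\n";
print "escaped: $escaped\n";
print $raw eq $escaped ? "same URL\n" : "different URL\n";
```

getAttribute returns the value after entity decoding, which is what a browser would use when following the link, so comparing those values is more meaningful than comparing the serialized markup.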

PS — If you want to produce HTML, you should be using ->toStringHTML(), not ->toString(). The latter produces XML.
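As a rough sketch of that suggestion, here is a trimmed-down version of the question's script that serializes with toStringHTML() instead of toString(). The shortened HTML fragment and the re-parse step at the end are additions for illustration, not part of the original script. The ampersand in the href will most likely still come out escaped, but as argued above, that does not change the URL.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new(recover => 2);

# Shortened version of the fragment from the question.
my $html = '<div><a title="link1" href="http://url.com/foo?a=1&b=2">Link1</a></div>';
my $doc  = $parser->load_html(string => $html);

# Rewrite the title attribute of every anchor that has one.
for my $node ($doc->findnodes('//a[@title]')) {
    $node->setAttribute('title', 'newtitle');
}

# Serialize as HTML rather than XML (no XML declaration is emitted).
my $out = $doc->toStringHTML();
print $out;

# Re-parse the serialized output to confirm the decoded URL is unchanged.
my $again    = $parser->load_html(string => $out);
my ($anchor) = $again->findnodes('//a');
my $href     = $anchor->getAttribute('href');
print "href after round trip: $href\n";
```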
