UPDATE 2: The author has already identified the problem: http://htmlpurifier.org/phorum/read.php?3,5088,5113
UPDATE: Issue appears to be exclusive to version 4.2.0. I have downgraded to 4.1.0 and it works. Thank you for all your help. Author of package notified.
I am scraping some pages like:
http://form.horseracing.betfair.com/horse-racing/010108/Catterick_Bridge-GB-Cat/1215
According to W3C validation it is valid XHTML Strict.
I am then using http://htmlpurifier.org/ to purify the HTML before loading it into a DOMDocument. However, it returns only a single line of content.
Output:
12:15 Catterick Bridge - Tuesday 1st January 2008 - Timeform | Betfair
Code:
echo $content; # all good
$purifier = new \HTMLPurifier();
$content = $purifier->purify($content);
echo $content; # all bad
BTW, it works for data sourced from another site. As you say, it leaves just the title, and it does so for every page from this domain.
Related Links
- HTMLPurifier dies when the following code is run through it (unanswered question on similar topic)
You should not need HTML Purifier. The DOMDocument class will take care of everything for you. However, it will trigger a warning on invalid HTML, so just do this:
$doc = new DOMDocument();
@$doc->loadHTML($content);
Then the warning will be suppressed, and you can do what you wish with the HTML.
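If you would rather not silence everything with @, one alternative is libxml_use_internal_errors(), which collects the parser's complaints instead of emitting warnings. A minimal sketch, assuming a made-up $content string in place of the scraped page:

```php
<?php
// Hypothetical sample markup standing in for the scraped page.
$content = '<html><head><title>Demo</title></head>'
    . '<body><p>Unclosed <b>bold<div>text</div></body></html>';

// Collect libxml errors internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($content);

// Keep the collected errors for inspection, then reset the error buffer.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The document is still usable even if the parser complained.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
echo $title, "\n";
echo count($errors), " parse problem(s) recorded\n";
```

This keeps the parse problems available in $errors if you ever want to log them, whereas @ discards them.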
If you are scraping links, I would recommend that you use SimpleXMLElement::xpath(). That is much easier than working with the DOMDocument, although it needs well-formed XML, which valid XHTML Strict should be. An example of that:
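$xml = new SimpleXMLElement($content);
// '//a' matches links at any depth; a bare 'a' only matches children of the root element.
// For XHTML with a default namespace, call registerXPathNamespace() and use a prefixed path instead.
$result = $xml->xpath('//a/@href');
print_r($result);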
You can write much more complex XPath expressions that select by class name, id, and other attributes. This is often more convenient than traversing a DOMDocument by hand (DOMDocument supports the same queries through DOMXPath).
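As a rough illustration of selecting by id and class, here is a sketch that parses with DOMDocument and then hands the tree to SimpleXML via simplexml_import_dom(). The markup, the "results" id and the "runner" class are invented for the example and are not taken from the Betfair pages:

```php
<?php
// Hypothetical markup; the id and class names are made up for illustration.
$content = '<html><body>'
    . '<div id="results">'
    . '<a class="runner" href="/horse/1">One</a>'
    . '<a class="other" href="/horse/2">Two</a>'
    . '</div>'
    . '<a class="runner" href="/horse/3">Three</a>'
    . '</body></html>';

// Parse leniently as HTML, keeping parser warnings out of the output.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($content);
libxml_clear_errors();

// Convert the DOM tree so SimpleXMLElement::xpath() can be used on it.
$xml = simplexml_import_dom($doc);

// Links with class "runner" that sit inside the element with id "results".
$hrefs = [];
foreach ($xml->xpath('//div[@id="results"]//a[@class="runner"]/@href') as $attr) {
    $hrefs[] = (string) $attr;
}
print_r($hrefs);
```

Note that @class="runner" is an exact string match. For elements carrying several classes, something like contains(concat(" ", normalize-space(@class), " "), " runner ") is the usual workaround.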