I'm scraping a page, but before the content is echoed I would like to edit the links in it.
What is the best way to do this?
I'm currently using Simple HTML DOM Parser:
// create HTML DOM
$html = file_get_html('http://myurl.com');
// remove all images
foreach ($html->find('img') as $e) {
    $e->outertext = '';
}
// remove all font tags
foreach ($html->find('font') as $e) {
    $e->outertext = '';
}
// find() with an index returns a single element (or null), not an array,
// so take the third td with align=left directly instead of looping
$e = $html->find('td[align=left]', 2);
if ($e) {
    echo $e->innertext;
}
There is this bit in one of the URLs:
<a target="retailer" href="/cgi-bin/redirect.cgi?name=Storm%20Computers&linkid=2&newurl=http%3A%2F%2Fwww.stormcomputers.com.au%2Fcatalog%2Findex.php%3FcPath%3D38_364&query=sandy%20bridge&uca=208-0-0&kwi=&rpos=2" title="Storm Computers: Click to see item">$149.00</a>
I would like to change this to:
<a href="http%3A%2F%2Fwww.stormcomputers.com.au%2Fcatalog%2Findex.php%3FcPath%3D38_364&query=sandy%20bridge&uca=208-0-0&kwi=&rpos=2">$149.00</a>
(i.e. keep only the part after &newurl=)
Thanks.
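I'm not familiar with the parser you're using, but something like this might work:
foreach ($html->find('a') as $link) {
    // parse_url() returns null when the URL has no query component
    $query = parse_url($link->href, PHP_URL_QUERY);
    if (!is_string($query)) {
        continue;
    }
    // parse_str() returns nothing; it fills $params with the decoded variables
    parse_str($query, $params);
    if (isset($params['newurl'])) {
        $link->href = $params['newurl'];
    }
}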
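To see what that produces without the parser, here is a minimal standalone sketch run against the href from the question. The helper name `extract_newurl` is made up for illustration and is not part of Simple HTML DOM. Note that `parse_str()` URL-decodes values, so the result is the plain target URL rather than the percent-encoded string.

```php
<?php
// Hypothetical helper (not part of any library): returns the decoded value
// of the 'newurl' query parameter, or null when the link has none.
function extract_newurl(string $href): ?string
{
    // PHP_URL_QUERY makes parse_url() return just the query string,
    // or null if the URL has no query (false if the URL is malformed).
    $query = parse_url($href, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null;
    }

    // parse_str() fills $params with the URL-decoded query variables.
    parse_str($query, $params);

    return (isset($params['newurl']) && is_string($params['newurl']))
        ? $params['newurl']
        : null;
}

// The href from the question, split across lines for readability.
$href = '/cgi-bin/redirect.cgi?name=Storm%20Computers&linkid=2'
      . '&newurl=http%3A%2F%2Fwww.stormcomputers.com.au%2Fcatalog%2Findex.php%3FcPath%3D38_364'
      . '&query=sandy%20bridge&uca=208-0-0&kwi=&rpos=2';

echo extract_newurl($href), "\n";
// prints: http://www.stormcomputers.com.au/catalog/index.php?cPath=38_364
```

The trailing `query`, `uca`, `kwi` and `rpos` values are parsed as separate parameters of the redirect script, so they are not part of the returned URL.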
Find the links with DOM. After that just use explode to split the href string.
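// $href is the href attribute of one link, e.g. $link->href
$split_href = explode('&newurl=', $href);
if (count($split_href) > 1) {
    $newurl = $split_href[1];
}
I don't think you need a regex here, since it would be slower than a plain string split.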
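As a rough standalone sketch of the split approach, using the href from the question: the helper name `after_newurl` is hypothetical, and the limit argument of 2 is my addition so that a second `&newurl=` later in the string would not cut the result short.

```php
<?php
// Hypothetical helper: returns everything after the first '&newurl=' marker,
// still percent-encoded, or null when the marker is not present.
function after_newurl(string $href): ?string
{
    // A limit of 2 splits on the first occurrence only.
    $parts = explode('&newurl=', $href, 2);

    return count($parts) === 2 ? $parts[1] : null;
}

// The href from the question, split across lines for readability.
$href = '/cgi-bin/redirect.cgi?name=Storm%20Computers&linkid=2'
      . '&newurl=http%3A%2F%2Fwww.stormcomputers.com.au%2Fcatalog%2Findex.php%3FcPath%3D38_364'
      . '&query=sandy%20bridge&uca=208-0-0&kwi=&rpos=2';

echo after_newurl($href), "\n";
// prints: http%3A%2F%2Fwww.stormcomputers.com.au%2Fcatalog%2Findex.php%3FcPath%3D38_364&query=sandy%20bridge&uca=208-0-0&kwi=&rpos=2
```

This gives exactly the string asked for in the question, but it has two limits. It misses links where newurl is the first parameter (`?newurl=`). The result is also still percent-encoded, so used as an href a browser would most likely treat it as a relative path rather than an absolute URL.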
You can use a regular expression to find all the links and then parse_url() and parse_str() to rebuild the link.
For example:
// $html must be the page source as a string here, not a parser object
if (preg_match_all('/<a\s[^>]*href="([^"]+)"/i', $html, $matches)) {
    // at this point, $matches is a multidimensional array where
    // index 0 is an array of all matches of the full pattern,
    // and index 1 is an array of all captured href values
    foreach ($matches[1] as $link) {
        // parse the link
        if ($parsed_link = parse_url($link)) {
            // see the documentation of parse_url() for the various
            // array keys it produces; here we take the 'query' value
            // (falling back to '' for links without one) and pass it
            // to parse_str(), which breaks a URL query string into
            // individual variables and stores them in $arguments
            parse_str($parsed_link['query'] ?? '', $arguments);
            // now, we want the value of the 'newurl' query parameter
            // from the original url
            if (isset($arguments['newurl'])) {
                $new_url = $arguments['newurl'];
                // do whatever you want with $new_url
            }
        }
    }
}
This is certainly not the only way to do this, but there is some value in using the language's built-in features for consistency and readability. I didn't put much thought into the regular expression above for finding links, so it does not handle special cases. If the links in your document are not well formed, you may need to modify that expression to handle extra whitespace, misplaced quotes, etc.
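For what it's worth, here is one way such a more tolerant expression might look, combined with preg_replace_callback() so the href is rewritten in place. This is only a sketch under my own assumptions: the function name `rewrite_redirect_links` is made up, the pattern accepts other attributes before href, optional whitespace around the `=`, and either quote style, and it will still miss unquoted attribute values.

```php
<?php
// Hypothetical helper: replaces each redirect href in an HTML string with the
// decoded value of its 'newurl' parameter; other links are left unchanged.
function rewrite_redirect_links(string $html): string
{
    // Group 1: '<a ... href=' prefix, group 2: the quote character,
    // group 3: the href value up to the matching closing quote.
    $pattern = '/(<a\b[^>]*?\bhref\s*=\s*)(["\'])(.*?)\2/is';

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            $query = parse_url($m[3], PHP_URL_QUERY);
            if (!is_string($query)) {
                return $m[0]; // no query string: keep the link as it is
            }
            parse_str($query, $args);
            if (!isset($args['newurl']) || !is_string($args['newurl'])) {
                return $m[0]; // no newurl parameter: keep the link as it is
            }
            // Escape the decoded URL again so it is safe inside the attribute.
            return $m[1] . $m[2] . htmlspecialchars($args['newurl'], ENT_QUOTES) . $m[2];
        },
        $html
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

$sample = '<a target="retailer" href=\'/r.cgi?a=1&newurl=http%3A%2F%2Fexample.com%2F\' title="x">$1</a>';

echo rewrite_redirect_links($sample), "\n";
// prints: <a target="retailer" href='http://example.com/' title="x">$1</a>
```

The sample link and its `example.com` target are invented for the demonstration. Unlike the question's desired output, the target and title attributes are kept here.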