I'm trying to scrape images off of Wikipedia. What good is freely licensed media if you can't get it? Original script is here.
If you put this
http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png
in Firefox, it is immediately transformed into
http://upload.wikimedia.org/wikipedia/commons/2/26/的-bw.png
so that when you save the image, it's saved as 的-bw.png
Simple enough, eh? Now how do I get PHP to do that? Just guessing, I tried utf8_decode($fileName), but I'm getting the wrong Chinese characters.
$src= "http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png";
$pngData = file_get_contents($src);
$fileName = basename($src);
file_put_contents($fileName, $pngData);
Any help appreciated, as I really have no idea where to go from here.
Have you tried urldecode()?
<?php
$url = 'http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png';
$parts = explode('/', $url);
$title = $parts[count($parts)-1]; //get last section
$title = urldecode($title);
?>
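Putting the pieces together, here is a minimal sketch of how the decoded name could be derived before saving. The helper name local_name_from_url is hypothetical, not part of PHP or the original script. It uses rawurldecode() rather than urldecode() so that a literal "+" in a file name is not turned into a space, and it replaces any slashes that appear after decoding so that a name such as "b%2Fc.png" cannot point into another directory. Whether the saved name displays correctly still depends on the file system and OS encoding.

```php
<?php
// Hypothetical helper: derive a local file name from a percent-encoded URL.
function local_name_from_url(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        throw new InvalidArgumentException("No path in URL: $url");
    }

    // Take the last path segment while it is still percent-encoded (pure ASCII).
    $pos  = strrpos($path, '/');
    $last = $pos === false ? $path : substr($path, $pos + 1);

    // Turn %E7%9A%84 back into the raw UTF-8 bytes of the character.
    $name = rawurldecode($last);

    // Decoding may introduce path separators or NUL bytes; neutralise them.
    return str_replace(['/', '\\', "\0"], '_', $name);
}

$src = 'http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png';
echo local_name_from_url($src), "\n"; // 的-bw.png

// The download itself is left commented out so the example runs offline:
// $pngData = file_get_contents($src);
// if ($pngData !== false) {
//     file_put_contents(local_name_from_url($src), $pngData);
// }
```

The only change to the script in the question is the file name passed to file_put_contents(); the URL given to file_get_contents() stays percent-encoded.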
SquirrelMail contains a nice function in its sources to convert UTF-8 text to numeric HTML entities:
<?php
// Adapted from SquirrelMail's charset_decode_utf_8(). ereg() and the /e
// modifier were removed in PHP 7, so this version uses preg_match() and
// preg_replace_callback() instead. Four-byte sequences are left unchanged.
function charset_decode_utf_8($string) {
    /* Only do the slow convert if there are 8-bit characters */
    if (!preg_match('/[\x80-\xFF]/', $string))
        return $string;
    // decode three byte unicode characters
    $string = preg_replace_callback('/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/',
        function ($m) {
            return '&#' . ((ord($m[1]) - 224) * 4096 + (ord($m[2]) - 128) * 64 + (ord($m[3]) - 128)) . ';';
        },
        $string);
    // decode two byte unicode characters
    $string = preg_replace_callback('/([\xC0-\xDF])([\x80-\xBF])/',
        function ($m) {
            return '&#' . ((ord($m[1]) - 192) * 64 + (ord($m[2]) - 128)) . ';';
        },
        $string);
    return $string;
}
?>
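If the mbstring extension is available, the same kind of conversion can probably be done without hand-written byte arithmetic. The sketch below is an alternative under that assumption, not SquirrelMail's code; the wrapper name to_numeric_entities is made up for illustration.

```php
<?php
// Assumes the mbstring extension is loaded.
// Hypothetical wrapper: turn every non-ASCII character into a decimal entity.
function to_numeric_entities(string $utf8): string
{
    // Convert map: [first code point, last code point, offset, mask].
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

// %E7%9A%84 is the UTF-8 encoding of U+7684, which is 30340 in decimal.
echo to_numeric_entities(rawurldecode('%E7%9A%84-bw.png')), "\n"; // &#30340;-bw.png
```

Entities are useful when printing the name into an HTML page; for the name of the file on disk, the decoded UTF-8 string is what you want.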