I'm having real trouble understanding the specification and guidelines on how to properly escape and encode a URL for submission in a sitemap.
In the sitemaps.org (entity escaping) examples, they have an example URL:
http://www.example.com/ümlat.php&q=name
Which when UTF-8 encoded ends up as (according to them):
http://www.example.com/%C3%BCmlat.php&q=name
However, when I try this (rawurlencode) on PHP I end up with:
http%3A%2F%2Fwww.example.com%2F%C3%BCmlat.php%26q%3Dname
I've sort of beaten this by using this function found on PHP.net
$entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40',
'%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%23', '%5B', '%5D');
$replacements = array('!', '*', "'", "(", ")", ";", ":", "@", "&", "=", "+",
"$", ",", "/", "?", "#", "[", "]");
$string = str_replace($entities, $replacements, rawurlencode($string));
but according to someone I spoke to (Kohana BDFM), this interpretation is wrong. Honestly, I'm so confused I don't even know what's right.
What's the correct way to encode a URL for use in the sitemap?
Relevant RFC 3986
The problem is that http://www.example.com/ümlat.php&q=name is not a valid URL.
(The grammar below is from RFC 1738, which is obsolete but serves its purpose here. RFC 3986 does allow more characters, but no harm is done by escaping characters that don't need escaping.)
httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
uchar = unreserved | escape
unreserved = alpha | digit | safe | extra
safe = "$" | "-" | "_" | "." | "+"
extra = "!" | "*" | "'" | "(" | ")" | ","
escape = "%" hex hex
search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
So every character must be escaped unless it is one of ;:@&=$-_.+!*'(), or an alphanumeric (0-9, a-z, A-Z), or part of an existing escape sequence (e.g. %A0 or, equivalently, %a0). The ? character can appear at most once. The / character can appear in the path portion, but not in the query string. The convention for encoding the other characters is to compute their UTF-8 representation and escape that byte sequence.
Your algorithm should (assuming the host part is not a problem...):
- extract the path part
- extract the query string part
- for each of those, look for invalid characters
- encode those characters in UTF-8
- pass the result to rawurlencode
- replace the character in the URL with the result of rawurlencode
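The steps above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the function names escape_url_part and escape_sitemap_url are made up for this example, the input is assumed to be an absolute http(s) URL that is already UTF-8 encoded, the host part is assumed to need no escaping, and any #fragment is simply dropped.

```php
<?php
// Hypothetical helper: percent-encode the characters that the RFC 1738
// grammar quoted above does not allow in a path or query string.
function escape_url_part(string $part, string $extraAllowed = ''): string
{
    // Allowed as-is: alphanumerics, "safe", "extra", ";:@&=", plus any
    // caller-supplied characters (e.g. "/" for the path).
    $allowed = preg_quote('$-_.+!*\'(),;:@&=' . $extraAllowed, '~');
    // Match either an existing %XX escape (kept untouched) or a single
    // disallowed byte (passed to rawurlencode).
    $pattern = '~(%[0-9A-Fa-f]{2})|[^A-Za-z0-9' . $allowed . ']~';
    return preg_replace_callback($pattern, function (array $m): string {
        return ($m[1] ?? '') !== '' ? $m[0] : rawurlencode($m[0]);
    }, $part);
}

// Hypothetical helper: split the URL into host, path and query string,
// then escape the path and the query string separately.
function escape_sitemap_url(string $url): string
{
    if (!preg_match('~^(https?://[^/?#]*)([^?#]*)(?:\?([^#]*))?~i', $url, $m)) {
        return $url; // not an absolute http(s) URL: leave it alone
    }
    $result = $m[1] . escape_url_part($m[2], '/'); // "/" is legal in the path
    if (isset($m[3])) {
        $result .= '?' . escape_url_part($m[3]);   // but not in the query string
    }
    return $result;
}

// "\xC3\xBC" is the UTF-8 byte sequence for "ü".
$escaped = escape_sitemap_url("http://www.example.com/\xC3\xBCmlat.php&q=name");
echo $escaped, "\n";
// http://www.example.com/%C3%BCmlat.php&q=name

// Entity-escape the result before writing it into the sitemap XML.
echo htmlspecialchars($escaped, ENT_QUOTES | ENT_XML1, 'UTF-8'), "\n";
// http://www.example.com/%C3%BCmlat.php&amp;q=name
```

Because the input is matched byte by byte, each byte of a multi-byte UTF-8 character is escaped on its own, which gives the same result as escaping the whole character. Note that rawurlencode leaves ~ unescaped (in line with RFC 3986), so that one character passes through even though RFC 1738 does not list it. The final htmlspecialchars call covers the entity-escaping step that sitemaps.org asks for on top of URL escaping.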