URL regex issues

Developer (https://www.devze.com), 2023-01-11 04:49. Source: Internet
I'm using this regex to search for URLs:

(((ht|f)tp(s?))\://)?(www.|[a-zA-Z].)[a-zA-Z0-9\-\.]+\.(com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\;\?\'\\\+&%\$#\=~_\-]+))*

The only problem is that it treats "you ca" as a URL. How do I change it so that there HAS to be a period before the ending (in this case the 'ca'), so that 'you ca' no longer matches but 'you.ca' does?


Parsing URIs with regexes is a hard problem.

Either use a library like Regexp::Common::URI or prepare to spend a lot of time studying a number of RFCs. Parsing URIs is far from trivial, and there are many subtle mistakes to be made.
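Regexp::Common::URI is a Perl module. As a rough analogue in Python, one could lean on the standard library's urllib.parse instead of a hand-written regex. The sketch below is only an illustration: the function name extract_urls, the set of allowed schemes, and the punctuation that gets stripped are all assumptions, and it only finds URLs that carry an explicit scheme, whereas the regex in the question makes the scheme optional.

```python
from urllib.parse import urlsplit

# Assumption: only these schemes count as URLs for this sketch.
ALLOWED_SCHEMES = {"http", "https", "ftp"}

def extract_urls(text):
    """Return whitespace-separated tokens that parse as URLs
    with an allowed scheme and a non-empty host."""
    urls = []
    for token in text.split():
        # Assumption: trim common surrounding punctuation before parsing.
        candidate = token.strip(".,;!?()\"'")
        try:
            parts = urlsplit(candidate)
            host = parts.hostname
        except ValueError:
            # urlsplit rejects some malformed inputs, e.g. a broken IPv6 host.
            continue
        if parts.scheme in ALLOWED_SCHEMES and host:
            urls.append(candidate)
    return urls
```

With this approach a phrase like "you ca" is never considered, because neither word parses with a scheme and a host.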


You forgot to escape the periods in the (www.|[a-zA-Z].) block.
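To see why an unescaped period matters, here is a minimal Python sketch. It uses two simplified, hypothetical patterns rather than the full regex from the question; the helper name finds and the shortened TLD list are assumptions for illustration. An unescaped "." matches any character, including a space, while "\." matches only a literal period.

```python
import re

# Simplified, hypothetical patterns (not the full regex from the question).
# Unescaped: "." matches any character, so a space passes as the separator.
unescaped = re.compile(r"[a-zA-Z0-9\-]+.(com|net|org|ca|uk)\b")
# Escaped: "\." matches only a literal period.
escaped = re.compile(r"[a-zA-Z0-9\-]+\.(com|net|org|ca|uk)\b")

def finds(pattern, text):
    """Return the matched substring, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

print(finds(unescaped, "you ca"))  # the space is accepted as "any character"
print(finds(escaped, "you ca"))    # no literal period, so no match
print(finds(escaped, "you.ca"))    # literal period present
```

It is also worth checking that the backslashes survive whatever string literal holds the pattern. In a language where "\." inside an ordinary string collapses to ".", the regex engine sees the unescaped form.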


I use a freeware tool to check my regexes: http://www.weitz.de/regex-coach/

Perhaps it will be helpful to you.


John Gruber's regexp is the best I have found so far for finding URLs. See the article on his blog: An Improved Liberal, Accurate Regex Pattern for Matching URLs. It's in use in lots of production code. There are two versions: one matches any URL, while the other matches only http/https URLs.


You can use a quantifier for the period character, so '\.{1}' would require exactly one period before whatever follows.

It isn't necessary for debugging this problem, but it may help to know about it. It's just more explicit, and '{1}' is bigger than a dot, so it also serves as a separator in long, ugly regexes where, during debugging, you might accidentally put a "+" or "*" next to the dot.
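As a quick check in Python, '\.{1}' behaves the same as a plain '\.'. The two patterns below are made up for illustration and are not taken from the question.

```python
import re

# Both patterns require exactly one literal period between "you" and "ca".
plain = re.compile(r"you\.ca")
explicit = re.compile(r"you\.{1}ca")

for text in ["you.ca", "you ca", "you..ca"]:
    # fullmatch returns a match object only if the whole string matches.
    print(text, bool(plain.fullmatch(text)), bool(explicit.fullmatch(text)))
```

The quantifier changes nothing about what matches. It only makes the "exactly one" intent visible in the pattern.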
