I have a regular expression that I use to find links in some HTML:
links = re.compile('<a(.+?)href=(?:"|\')?((?:https?://|/)[^\'"]+)(?:"|\')?(.*?)>(.+?)</a>', re.I).findall(data)
It takes a long time on certain pages. Any optimization advice?
One page that it chokes on is http://freeyourmindonline.net/Blog/
Is there any reason you aren't using an HTML parser? With something like BeautifulSoup, you can get all the links without an ugly regex like that.
I'd suggest using BeautifulSoup for this task.
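To illustrate the parser approach, here is a minimal sketch. Note that it swaps in the standard library's html.parser module rather than BeautifulSoup, so that it runs without installing anything; the idea is the same. The names LinkCollector and extract_links are made up for this example, and the filter on http://, https:// and / prefixes is my reading of what the question's regex was trying to keep.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for <a> tags with an absolute or root-relative href."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Assumed filter, mirroring the question: http(s) URLs or paths starting with "/".
            if href and href.lower().startswith(("http://", "https://", "/")):
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link we are tracking.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = ('<p><a class="nav" href="/about">About <b>us</b></a> '
          '<a href=\'http://example.com/\'>Example</a> '
          '<a name="top">no href</a></p>')
print(extract_links(sample))
# [('/about', 'About us'), ('http://example.com/', 'Example')]
```

A parser reads the page once and never backtracks, so its running time should not blow up on awkward markup the way a regex with several lazy wildcards can.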
How about handling the hrefs more directly?
re_href = re.compile(r"""<\s*a(?:[^>]+?)href=("[^"]*(\\"[^"]*)*"|'[^']*(\\'[^']*)*'|[^\s>]*)[^>]*>""", re.I)
That takes about 0.007 seconds, compared with your findall, which takes 38.694 seconds on my computer.
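For anyone wanting to try that pattern, a small usage sketch follows. The helper name find_hrefs and the sample string are made up for illustration. Since the pattern has three groups, findall would return tuples, so this uses finditer and takes group 1 only.

```python
import re

# Pattern copied from the answer above.
re_href = re.compile(r"""<\s*a(?:[^>]+?)href=("[^"]*(\\"[^"]*)*"|'[^']*(\\'[^']*)*'|[^\s>]*)[^>]*>""", re.I)


def find_hrefs(html):
    # Group 1 is the raw attribute value, still wrapped in its quotes if it had any.
    return [m.group(1).strip("\"'") for m in re_href.finditer(html)]


sample = '<a href="http://example.com/a">A</a> <a class="x" href=\'/b\'>B</a> <a href=/c>C</a>'
print(find_hrefs(sample))
# ['http://example.com/a', '/b', '/c']
```

Unlike the regex in the question, this returns every href value (not only http(s) and root-relative ones) and does not capture the link text, so filter the results afterwards if you need that.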
 
         
                                         
                                         
                                         