Python regular expression to strip script tags

Developer (https://www.devze.com), 2023-02-04 03:32. Source: the web
I'm a little scared to ask this for fear of retribution from the SO "You can't parse HTML with regular expressions" cult. Why does re.subn(r'<(script).*?</\1>', '', data, re.DOTALL) not strip the multiline 'script' but only the two single-line ones at the end, please?

Thanks, HC

>>> import re
>>> data = """\
<nothtml> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 
    <script type="text/javascript"> 
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.7.1',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script> 
    <script type="text/javascript" src="../_static/jquery.js"></script> 
    <script type="text/javascript" src="../_static/doctools.js"></script>
"""

>>> print (re.subn(r'<(script).*?</\1>', '', data, re.DOTALL)[0])
<nothtml> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 
    <script type="text/javascript"> 
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.7.1',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script> 


Leaving aside the question of whether this is a good idea in general, the problem with your example is that the fourth parameter to re.subn is count. There's no flags parameter in Python 2.6, although one was introduced as a fifth parameter in Python 2.7. Since re.DOTALL is just the integer 16, your call asks for at most 16 substitutions with no flags set. Instead you can put (?s) in your regular expression for the same effect (at the start, which is where newer Python versions require inline flags to go):

>>> print (re.subn(r'(?s)<(script).*?</\1>', '', data)[0])

<nothtml> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 




>>>

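... or if you're using Python 2.7 or later, this should work:

>>> print (re.subn(r'<(script).*?</\1>', '', data, 0, re.DOTALL)[0])

... i.e. inserting 0 as the count parameter, so that re.DOTALL lands in the flags position. Passing it by keyword, as flags=re.DOTALL, avoids the mix-up altogether.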
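
To see the count/flags mix-up in isolation, here is a minimal sketch. The sample string and the variable names are made up for illustration; only re.subn and re.DOTALL come from the question. It assumes Python 3, where flags can be passed by keyword.

```python
import re

# Hypothetical sample: one multi-line script block and one single-line one.
sample = (
    "<head>\n"
    "<script>\n"
    "var x = 1;\n"
    "</script>\n"
    "<script src='a.js'></script>\n"
    "</head>\n"
)
pattern = r'<(script).*?</\1>'

# Equivalent to the call in the question: re.DOTALL ends up as count,
# so '.' still does not match newlines and the multi-line block survives.
wrong, n_wrong = re.subn(pattern, '', sample, count=int(re.DOTALL))

# Passing the flag by keyword makes '.' match newlines as intended.
right, n_right = re.subn(pattern, '', sample, flags=re.DOTALL)

print(int(re.DOTALL))    # the flag is just an integer
print(n_wrong, n_right)  # number of substitutions made by each call
```

The first call only removes the single-line tag, which matches the behaviour described in the question.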


Just in case it's of interest, I thought I'd add an additional answer showing two ways of doing this with lxml, which I've found very nice for parsing HTML. (lxml is one of the alternatives that the author of BeautifulSoup suggests, in light of the problems with the most recent version of the latter library.)

The point of adding the first example is that it's really very simple and should be much more robust than using a regular expression to remove the tags. In addition, if you want to do any more complex processing of the document, or if the HTML you're parsing is malformed, you have a valid document tree that you can manipulate programmatically.

Remove all script tags

This example is based on the HTMLParser example from lxml's documentation:

from io import StringIO

from lxml import etree

broken_html = '''
<html> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 
    <script type="text/javascript"> 
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.7.1',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script> 
    <script type="text/javascript" src="../_static/jquery.js"></script>
'''

# The HTML parser tolerates the missing closing tags
parser = etree.HTMLParser()
tree = etree.parse(StringIO(broken_html), parser)

# Detach every <script> element from its parent
for s in tree.xpath('//script'):
    s.getparent().remove(s)

# tostring() returns bytes on Python 3, so decode before printing
print(etree.tostring(tree.getroot(), pretty_print=True).decode())

That produces this output:

<html>
  <head>
    <title>Regular Expression HOWTO &#8212; Python v2.7.1 documentation</title>
  </head>
</html>

Use lxml's Cleaner module

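On the other hand, since it looks as if you're trying to remove awkward tags like <script>, perhaps the Cleaner class from lxml's lxml.html.clean module will also do other things you'd like. (From lxml 5.2 onwards this module needs the separate lxml_html_clean package to be installed.)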

from lxml.html.clean import Cleaner

broken_html = '''
<html> 
  <head> 
    <title>Regular Expression HOWTO &mdash; Python v2.7.1 documentation</title> 
    <script type="text/javascript"> 
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.7.1',
        COLLAPSE_MODINDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script> 
    <script type="text/javascript" src="../_static/jquery.js"></script>
'''

cleaner = Cleaner(page_structure=False)
print(cleaner.clean_html(broken_html))

... which produces the output:

<html><head><title>Regular Expression HOWTO — Python v2.7.1 documentation</title></head></html>

(N.B. I've changed nothtml in your example to html. With your original, method 1 works fine but wraps everything in <html><body>, whereas method 2 doesn't work, for reasons I don't have time to figure out right now :))


To remove script blocks, style blocks and then all remaining HTML tags, you can use re.

import re

def stripTags(text):
  # re.DOTALL lets '.' match newlines, so multi-line blocks are removed too
  scripts = re.compile(r'<(script).*?</\1>', re.DOTALL)
  css = re.compile(r'<(style).*?</\1>', re.DOTALL)
  tags = re.compile(r'<.*?>', re.DOTALL)

  # Remove whole script/style blocks first, then any tags that remain
  text = scripts.sub('', text)
  text = css.sub('', text)
  text = tags.sub('', text)
  return text

It works easily.
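
One caveat with this kind of pattern, sketched below: it is case-sensitive, so markup written as <SCRIPT> slips through unless re.IGNORECASE is added. The helper name regex_strip_scripts and the sample string are hypothetical, made up for this illustration.

```python
import re

def regex_strip_scripts(text, ignore_case=False):
    """Remove <script>...</script> blocks with a regex (illustrative helper)."""
    flags = re.DOTALL
    if ignore_case:
        flags |= re.IGNORECASE
    return re.sub(r'<(script).*?</\1>', '', text, flags=flags)

upper = "<p>hi</p><SCRIPT>alert(1)</SCRIPT>"

# Without IGNORECASE the uppercase tag does not match and is left in place.
print(regex_strip_scripts(upper))

# With IGNORECASE the block is removed.
print(regex_strip_scripts(upper, ignore_case=True))
```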


The short answer is: don't do that. Use Beautiful Soup or ElementTree to get rid of them. Parse your data as HTML or XML. Regular expressions are fragile here and are the wrong answer to this problem.
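
If installing a third-party library isn't an option, the same parse-don't-regex idea can be sketched with the standard library's html.parser instead (a swapped-in technique, not Beautiful Soup or ElementTree). The class and function names below are hypothetical, and this is only a rough sketch: it re-emits the document while skipping <script> elements, and it silently drops comments and the doctype.

```python
from html.parser import HTMLParser

class ScriptStripper(HTMLParser):
    """Re-emit the markup, skipping everything inside <script> elements."""

    def __init__(self):
        # Keep entities such as &mdash; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':          # tag names arrive lowercased
            self.in_script = True
        elif not self.in_script:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if tag != 'script' and not self.in_script:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == 'script':
            self.in_script = False
        elif not self.in_script:
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.out.append('&%s;' % name)

    def handle_charref(self, name):
        if not self.in_script:
            self.out.append('&#%s;' % name)

def strip_scripts(html):
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)

doc = "<p>a &amp; b</p><SCRIPT>\nvar s = '<b>';\n</SCRIPT><br/><p>c</p>"
print(strip_scripts(doc))
```

Because the parser tracks when it is inside a script element, tag case and line breaks inside the script no longer matter.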
