How do I use BeautifulSoup to strip the <p> tags and just deliver the text back into the soup?

I'm trying to replace any <p> tags with just the contents in my soup. This is in the middle of other processing that I'm doing using BeautifulSoup.

This is slightly different to a similar question on extracting the text.

Example input:

... </p> ... <p>Here is some text</p> ... and some more

Desired output:

... ... Here is some text ... and some more

And what would I do if I only want to do that processing in, say, a div of class="content"?

I don't seem to have my BeautifulSoup head on yet!
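
For what it's worth, here is the sort of thing I've been fumbling toward, sketched with BeautifulSoup 4's unwrap() (which replaces a tag with its contents). I'm not sure it's the idiomatic way:

from bs4 import BeautifulSoup   # assuming BeautifulSoup 4 (bs4) is available

html = '<div class="content"><p>Here is some text</p> and some more</div>'
soup = BeautifulSoup(html, 'html.parser')

# unwrap() removes the tag itself but keeps its contents in the tree
for p in soup.find_all('p'):
    p.unwrap()
print(soup)   # <div class="content">Here is some text and some more</div>

# to restrict this to a div of class="content", search within that div instead:
# content = soup.find('div', class_='content')
# for p in content.find_all('p'):
#     p.unwrap()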


I didn't use BeautifulSoup, but something similar can be done with the built-in HTMLParser library. This is a class I built to parse input HTML and convert the tags to a different "markup" that I needed.

# Python 2: stdlib modules needed by the class below
from HTMLParser import HTMLParser
import htmlentitydefs

class BaseHTMLProcessor(HTMLParser):
    def reset(self):                       
        # extend (called by HTMLParser.__init__)
        self.pieces = []
        HTMLParser.reset(self)

    def handle_starttag(self, tag, attrs):
        # called for each start tag
        # attrs is a list of (attr, value) tuples
        # e.g. for <pre class="screen">, tag="pre", attrs=[("class", "screen")]
        # This handler does not try to reconstruct the original tag; it maps a
        # handful of known tags to the target markup and drops everything else
        # (attributes are ignored).
        # Note that improperly embedded non-HTML code (like client-side Javascript)
        # may be parsed incorrectly by the ancestor, causing runtime script errors.
        # All non-HTML code must be enclosed in HTML comment tags (<!-- code -->)
        # to ensure that it will pass through this parser unaltered (in handle_comment).
        if tag == 'b': 
            v = r'%b[1]'
        elif tag == 'li': 
            v = r'%f[1]'
        elif tag == 'strong': 
            v = r'%b[1]%i[1]'
        elif tag == 'u': 
            v = r'%u[1]'
        elif tag == 'ul': 
            v = r'%n%'
        else:
            v = ''
        self.pieces.append("{0}".format(v))

    def handle_endtag(self, tag):
        # called for each end tag, e.g. for </pre>, tag will be "pre"
        # (tag never includes the leading slash)
        # Convert the end tag to the corresponding markup.
        if tag == 'li':
            v = r'%f[0]'
        elif tag == 'b':
            v = r'%b[0]'
        elif tag == 'strong': 
            v = r'%b[0]%i[0]'
        elif tag == 'u': 
            v = r'%u[0]'
        elif tag == 'ul': 
            v = ''
        elif tag == 'br': 
            v = r'%n%' 
        else: 
            v = '' # it matched but we don't know what it is! assume it's invalid html and strip it
        self.pieces.append("{0}".format(v))

    def handle_charref(self, ref):         
        # called for each character reference, e.g. for "&#160;", ref will be "160"
        # Reconstruct the original character reference.
        self.pieces.append("&#%(ref)s;" % locals())

    def handle_entityref(self, ref):       
        # called for each entity reference, e.g. for "&copy;", ref will be "copy"
        # Reconstruct the original entity reference.
        self.pieces.append("&%(ref)s" % locals())
        # standard HTML entities are closed with a semicolon; other entities are not
        if ref in htmlentitydefs.entitydefs:
            self.pieces.append(";")

    def handle_data(self, text):           
        # called for each block of plain text, i.e. outside of any tag and
        # not containing any character or entity references
        # Store the original text verbatim.
        output = text.replace("\xe2\x80\x99","'").split('\r\n')
        for count,item in enumerate(output):
            output[count] = item.strip()
        self.pieces.append(''.join(output))

    def handle_comment(self, text):        
        # called for each HTML comment, e.g. <!-- insert Javascript code here -->
        # Reconstruct the original comment.
        # It is especially important that the source document enclose client-side
        # code (like Javascript) within comments so it can pass through this
        # processor undisturbed; see comments in handle_starttag for details.
        self.pieces.append("<!--%(text)s-->" % locals())

    def handle_pi(self, text):             
        # called for each processing instruction, e.g. <?instruction>
        # Reconstruct original processing instruction.
        self.pieces.append("<?%(text)s>" % locals())

    def handle_decl(self, text):
        # called for the DOCTYPE, if present, e.g.
        # <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        #     "http://www.w3.org/TR/html4/loose.dtd">
        # Reconstruct original DOCTYPE
        self.pieces.append("<!%(text)s>" % locals())

    def output(self):              
        """Return processed HTML as a single string"""
        return "".join(self.pieces)

To use the class, just import it (or include it in your source). Then in your code use these lines:

# 'input' is assumed to be an iterable of HTML lines (e.g. an open file)
parser = BaseHTMLProcessor()
for line in input:
    parser.feed(line)
    parser.close()
    output = parser.output()
    parser.reset()
    print output

It works by tokenizing the input stream. Each piece of HTML it comes across is dealt with by the appropriate handler method. So <p><b>This is bold text!</b></p> would trigger handle_starttag twice, then handle_data once, then handle_endtag twice. Finally, when the output method is called, it returns the processed pieces joined back together.
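
As a rough sanity check of that sequence, feeding the snippet above through the class (with the tag mappings as written) gives:

parser = BaseHTMLProcessor()
parser.feed('<p><b>This is bold text!</b></p>')   # 2 start tags, 1 data block, 2 end tags
parser.close()
print parser.output()   # %b[1]This is bold text!%b[0]  (<p> maps to nothing, <b> to %b[...])
parser.reset()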
