Python / BeautifulSoup: How to look directly beneath a code comment?

Developer https://www.devze.com 2023-02-22 05:00 Source: web
I'm parsing some web pages with BeautifulSoup and trying to work within the library (instead of trying to solve everything with a brute-force regex).

The page I'm looking at is structured like this:

<!--comment-->
<div>a</div>
<div>b</div>
<div>c</div>
<!--comment-->
<div>a</div>
<div>b</div>
<div>c</div>

I want to parse each section individually. Is there a way to tell BeautifulSoup to break the page down into the areas between identical comments?

Thanks!


Comments are nodes, like anything else:

# Updated for Python 3 and the bs4 package (BeautifulSoup 4)
from bs4 import BeautifulSoup, Comment, Tag

soup = BeautifulSoup("""<!--comment--><div>a</div><div>b</div><div>c</div>
                        <!--comment--><div>a</div><div>b</div><div>c</div>""",
                     "html.parser")

comments = soup.find_all(string=lambda elm: isinstance(elm, Comment))
for comment in comments:
    next_sib = comment.next_sibling
    # Walk forward until the next comment or the end of the siblings
    while next_sib is not None and not isinstance(next_sib, Comment):
        # Whitespace between tags is a text node, so only handle real tags.
        # Append next_sib to a list, dictionary, etc, etc and
        # do what you want with it
        if isinstance(next_sib, Tag):
            print(next_sib)
        next_sib = next_sib.next_sibling

EDIT:

This doesn't check whether the comments are identical (that is, whether their text matches), but you can handle that by comparing each comment's text with the text of the previous comment before processing its block.
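One way to handle comments with matching text is to key each section on the comment's text, so blocks that follow identical comments end up together. This is only a sketch, assuming the bs4 package (BeautifulSoup 4) on Python 3; the function name sections_by_comment and the sample markup are made up for illustration.

```python
from collections import defaultdict

from bs4 import BeautifulSoup, Comment, Tag

# Hypothetical sample markup: two comments share the text "first"
HTML = """<!--first--><div>a</div><div>b</div>
<!--second--><div>c</div>
<!--first--><div>d</div>"""


def sections_by_comment(html):
    """Map each comment's text to the text of the tags that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    sections = defaultdict(list)
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        key = comment.strip()
        node = comment.next_sibling
        # Collect siblings until the next comment or the end of the siblings
        while node is not None and not isinstance(node, Comment):
            if isinstance(node, Tag):  # ignore whitespace text nodes
                sections[key].append(node.get_text())
            node = node.next_sibling
    return dict(sections)


result = sections_by_comment(HTML)
print(result)  # {'first': ['a', 'b', 'd'], 'second': ['c']}
```

If you would rather keep each block separate, append a (key, tags) pair to a list instead of merging on the key.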


I do not see any high-level API in BeautifulSoup for getting hold of the comment nodes directly. Instead, you have to walk over the parse tree yourself.

See 1

The examples there show that you can check whether a node is an instance of the 'Comment' class. That's about all you get.
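Walking the tree and checking each node's class might look like the sketch below. It assumes the bs4 package (BeautifulSoup 4) on Python 3, and the sample markup is invented for the example.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup with one top-level and one nested comment
soup = BeautifulSoup("<!--one--><div>a<!--two--></div><p>b</p>", "html.parser")

# .descendants yields every node in document order, tags and strings alike;
# keep only the nodes that are instances of Comment.
found = [node.strip() for node in soup.descendants if isinstance(node, Comment)]
print(found)  # ['one', 'two']
```

Because the walk covers all descendants, it also picks up comments nested inside other tags, not just top-level ones.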

Another scary idea:

You could render the document with soup.prettify(), parse the generated output line by line, check each line for a comment, and manually feed the lines that follow back into BeautifulSoup.
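A rough sketch of that line-by-line idea follows, assuming the bs4 package (BeautifulSoup 4) on Python 3 and assuming prettify() emits each comment on a line of its own. The helper name split_on_comment_lines and the sample markup are hypothetical, and a comment that spans several lines would not be recognised.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup shaped like the page in the question
HTML = """<!--comment-->
<div>a</div>
<div>b</div>
<!--comment-->
<div>c</div>"""


def split_on_comment_lines(html):
    """Re-parse the prettified lines that follow each single-line comment."""
    pretty = BeautifulSoup(html, "html.parser").prettify()
    chunks, current = [], None
    for line in pretty.splitlines():
        stripped = line.strip()
        if stripped.startswith("<!--") and stripped.endswith("-->"):
            current = []  # a comment line starts a new chunk
            chunks.append(current)
        elif current is not None:
            current.append(line)  # lines before the first comment are dropped
    return [BeautifulSoup("\n".join(chunk), "html.parser") for chunk in chunks]


sections = split_on_comment_lines(HTML)
texts = [[div.get_text(strip=True) for div in s.find_all("div")] for s in sections]
print(texts)
```

This parses the document twice and depends on the exact text layout, so walking the siblings of each comment is usually the more robust choice.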
