lxml: how to discard all <li> elements containing a link with particular class?

Developer https://www.devze.com 2023-03-24 11:36 Source: web

As is often the case, I'm struggling with the lack of proper lxml documentation (note to self: should write a proper lxml tutorial and get lots of traffic!).

I want to find all <li> items that do not contain an <a> tag with a particular class.

For example:

<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>

I'd like to get hold of only the <li> that does not contain a link with class new, and I'd like to get hold of the text inside <small>. In other words, 'pudding'.

Can anyone help?

Thanks!


import lxml.html as lh

content='''\
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>
'''

tree=lh.fromstring(content)
for elt in tree.xpath('//li[not(descendant::a[@class="new"])]/small/text()'):
    print(elt)

# pudding

The XPath has the following meaning:

//                        # starting from the document root, look at all descendants
li[                       # select <li> elements that
    not(descendant::a[    # do not have a descendant <a>
        @class="new"])]   # whose class attribute is exactly "new"
    /small                # then select their <small> children
    /text()               # and return the text of those nodes
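
One caveat worth sketching: @class="new" compares the whole attribute value, so a link written as class="new external" would not be recognised. The example below is only an illustration under that assumption (the "new external" value is made up and does not appear in the question); it pads the class list with spaces and tests for the token " new " instead.

```python
import lxml.html as lh

# Same markup as in the question, except that the second link carries
# two classes ("new external"). That extra class is a made-up assumption.
content = '''\
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new external">St Marcellin</a></li>
</ul>
'''

tree = lh.fromstring(content)

# Exact comparison: "new external" is not equal to "new",
# so the cheese item is NOT filtered out here.
exact = tree.xpath('//li[not(descendant::a[@class="new"])]/small/text()')

# Token match: pad the normalized class list with spaces and look for " new ".
token = tree.xpath(
    '//li[not(descendant::a['
    'contains(concat(" ", normalize-space(@class), " "), " new ")'
    '])]/small/text()'
)

print(exact)
print(token)
```

If the links only ever carry a single class, as in the question, the simpler @class="new" test is enough.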


Quickly hacked together this code:

from lxml import etree
from lxml.cssselect import CSSSelector  # needs the separate cssselect package

content = r"""
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>"""

html = etree.HTML(content)

# 'li > a.new' only matches links that are direct children of an <li>
bad_sel = CSSSelector('li > a.new')
good_sel = CSSSelector('li > small')

# <li> elements to exclude, then the <small> elements whose <li> is not among them
bad = [item.getparent() for item in bad_sel(html)]
good = [item for item in good_sel(html) if item.getparent() not in bad]

for item in good:
    print(item.text)

# pudding

It first builds a list of items you do not want, and then it builds the ones you do want by excluding the bad ones.
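
Since the title talks about discarding the <li> elements, the same idea can also be applied to the tree itself. This is only a rough sketch, not part of either answer above: it selects the unwanted <li> elements with XPath and detaches each one from its parent, so whatever is left in the tree is what you want.

```python
import lxml.html as lh

content = '''\
<ul>
<li><small>pudding</small>: peaches and <a href="/cream">cream</a></li>
<li><small>cheese</small>: Epoisses and <a href="/st-marcellin" class="new">St Marcellin</a></li>
</ul>
'''

tree = lh.fromstring(content)

# Collect the <li> elements that DO contain <a class="new"> ...
unwanted = tree.xpath('//li[descendant::a[@class="new"]]')

# ... and detach each one from its parent <ul>.
# xpath() returns a plain list, so removing while looping is safe.
for li in unwanted:
    li.getparent().remove(li)

remaining = tree.xpath('//li/small/text()')
print(remaining)
```

This modifies the parsed tree in place, so it only makes sense if you do not need the removed items afterwards.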
