Python and Beautiful Soup - Search for tag a, return following tag b's until tag A is found

开发者 https://www.devze.com 2023-04-03 04:22 Source: Internet
I have 2 variables, one with 'last volume' and the other with 'last issue'.

The HTML I am dealing with contains a list of all volumes and issues, most recent first.

I need to return the href links for all volumes and issues that are newer than what I have on file.

Using the example below, if my last volume is 13 and my last issue is 1, I would need to return the hrefs for Volume 13, Issue 2 and Volume 14, Issue 1.

I am having a hard time with this because each volume heading sits in its own <li>, separate from the issue links that belong to it.

Here is what I have so far:

HTML:

<ul class="bobby">
<li><strong>Volume 14</strong></li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, September 2011">Issue 1, September 2011</a>          
</li>
<li><strong>Volume 13</strong></li> 
<li class="">
<a href="/content/ben/cchts/2010/00000013/00000002" title="Issue 2, December 2010">Issue 2, December 2010</a>
</li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, November 2011">Issue 1, November 2011</a>
</li>
</ul>

Script snippet:

# Assumes `import re` and an existing BeautifulSoup object named `soup`.
results = soup.find('ul', attrs={'class': 'bobby'})

# temp until I get it reading from file
lastVol = '13'
# find the last volume
findlastVol = results.findNext('strong', text=re.compile('Volume ' + lastVol))

# temp until I get it reading from file
lastIss = '2'
# find the last issue
findlastIss = findlastVol.findNext('a', text=re.compile('Issue ' + lastIss))

So I can get to the tag for the last volume and issue on file, but I have had several failed attempts at traversing back up and stopping at the first issue...

Or starting at the top and traversing down until that volume and issue condition is met...

Can someone please give me some assistance? Thanks.


I think you are looking for findPrevious, which you could use this way:

import BeautifulSoup
import re

content='''
<ul class="bobby">
<li><strong>Volume 14</strong></li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, September 2011">Issue 1, September 2011</a>          
</li>
<li><strong>Volume 13</strong></li> 
<li class="">
<a href="/content/ben/cchts/2010/00000013/00000002" title="Issue 2, December 2010">Issue 2, December 2010</a>
</li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, November 2011">Issue 1, November 2011</a>
</li>
</ul>
'''

last_volume=13
last_issue=1

soup=BeautifulSoup.BeautifulSoup(content)
results = soup.find('ul', attrs={'class' : 'bobby'})
for a_string in results.findAll('a', text=re.compile('Issue')):
    volume=a_string.findPrevious(text=re.compile('Volume'))
    volume=int(re.search(r'(\d+)',volume).group(1))
    issue=int(re.search(r'(\d+)',a_string).group(1))
    href=a_string.parent['href']
    if (volume > last_volume) or (volume == last_volume and issue > last_issue):
        print(volume,issue,href)

yields

(14, 1, u'/content/ben/cchts/2011/00000014/00000001')
(13, 2, u'/content/ben/cchts/2010/00000013/00000002')
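
For readers on Python 3 without Beautiful Soup, the same idea (pair each issue link with the nearest preceding volume heading, then keep the pairs that are newer than the ones on file) can be sketched with the standard library's html.parser. This is a swapped-in technique, not the answer's method. The names NewIssueParser, find_newer and SAMPLE are hypothetical, and the sketch assumes headings read "Volume N" inside <strong> and link text reads "Issue N, ..." as in the sample HTML.

```python
import re
from html.parser import HTMLParser

SAMPLE = """
<ul class="bobby">
<li><strong>Volume 14</strong></li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001">Issue 1, September 2011</a>
</li>
<li><strong>Volume 13</strong></li>
<li class="">
<a href="/content/ben/cchts/2010/00000013/00000002">Issue 2, December 2010</a>
</li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001">Issue 1, November 2011</a>
</li>
</ul>
"""


class NewIssueParser(HTMLParser):
    """Collect (volume, issue, href) for issues newer than the ones on file."""

    def __init__(self, last_volume, last_issue):
        super().__init__()
        self.last_seen = (last_volume, last_issue)
        self.volume = None      # most recent "Volume N" heading seen so far
        self.href = None        # href of the <a> currently open, if any
        self.in_strong = False  # True while inside a <strong> element
        self.newer = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.in_strong = True
        elif tag == "a":
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False
        elif tag == "a":
            self.href = None

    def handle_data(self, data):
        text = data.strip()
        match = re.search(r"(\d+)", text)
        if match is None:
            return
        number = int(match.group(1))
        if self.in_strong and text.startswith("Volume"):
            self.volume = number
        elif self.href and self.volume is not None and text.startswith("Issue"):
            # Tuples compare element by element: volume first, then issue.
            if (self.volume, number) > self.last_seen:
                self.newer.append((self.volume, number, self.href))


def find_newer(html, last_volume, last_issue):
    """Return the issues in `html` that are newer than (last_volume, last_issue)."""
    parser = NewIssueParser(last_volume, last_issue)
    parser.feed(html)
    return parser.newer


if __name__ == "__main__":
    for volume, issue, href in find_newer(SAMPLE, 13, 1):
        print(volume, issue, href)
```

Comparing (volume, issue) tuples replaces the two-part condition in the loop above: a later volume always wins, and the issue number only matters when the volumes are equal.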


from BeautifulSoup import BeautifulSoup
content = '''<ul class="bobby">
<li><strong>Volume 14</strong></li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, September 2011">Issue 1, September 2011</a>
</li>
<li><strong>Volume 13</strong></li> 
<li class="">
<a href="/content/ben/cchts/2010/00000013/00000002" title="Issue 2, December 2010">Issue 2, December 2010</a>
</li>
<li class="">
<a href="/content/ben/cchts/2011/00000014/00000001" title="Issue 1, November 2011">Issue 1, November 2011</a>
</li>
</ul>
'''
soup = BeautifulSoup(content)
last_vol = 13
last_issue = 1

res = soup.find('ul', {"class": "bobby"})
lis = res.findAll('li')
for j in lis:
    if j.find('strong') is not None:
        # Heading item: "Volume 14"[7:] is the volume number
        vol = int(j.contents[0].string[7:])
    elif (vol > last_vol) or (vol == last_vol and int(j.contents[1]['href'][33:]) > last_issue):
        # Issue item: contents[1] is the <a>, and href[33:] is the
        # zero-padded issue number at the end of the path
        print "Volume\t:%d" % vol
        print j.contents[1].string
        print "href\t:%s" % j.contents[1]['href']

Gives

Volume  :14  
Issue 1, September 2011  
href    :/content/ben/cchts/2011/00000014/00000001  
Volume  :13  
Issue 2, December 2010  
href    :/content/ben/cchts/2010/00000013/00000002 
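
The fixed offsets in this answer ([7:] for the heading and [33:] for the href) only hold while every heading and path has exactly the sample's length. A minimal sketch of a less brittle extraction follows, assuming the issue number is always the last path segment of the href, as in the sample. The helper names volume_from_heading and issue_from_href are hypothetical.

```python
import re


def volume_from_heading(text):
    """Return the number in a heading such as 'Volume 14', or None if absent."""
    match = re.search(r"Volume\s+(\d+)", text)
    return int(match.group(1)) if match else None


def issue_from_href(href):
    """Return the last path segment of an href as an int.

    Assumes the segment is a zero-padded issue number, e.g. '00000002'.
    """
    return int(href.rstrip("/").rsplit("/", 1)[-1])
```

In the loop above, these could stand in for the two slicing expressions, for example volume_from_heading(j.contents[0].string) and issue_from_href(j.contents[1]['href']).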