BeautifulSoup or regex HTML table to data structure?

开发者 (devze.com) https://www.devze.com 2023-01-16 16:08 Source: web

Related topics: python regex

I've got an HTML table that I'm trying to parse information from. However, some of the cells span multiple rows/columns, so what I would like to do is use something like BeautifulSoup to parse the table into some type of Python structure. I'm thinking of just using a list of lists, so I would turn something like

<tr>
  <td>1,1</td>
  <td>1,2</td>
</tr>
<tr>
  <td>2,1</td>
  <td>2,2</td>
</tr>

into

[['1,1', '1,2'],
 ['2,1', '2,2']]

I think that should be fairly straightforward. However, there are some slight complications because some of the cells span multiple rows/cols. Plus there's a lot of completely unnecessary information:

    <td ondblclick="DoAdd('/student_center/sc_all_rooms/d05/09/2010/editformnew?display=W&amp;style=L&amp;positioning=A&amp;adddirect=yes&amp;accessid=CreateNewEdit&amp;filterblock=N&amp;popeditform=yes&amp;returncalendar=student_center/sc_all_rooms')"
     class="listdefaultmonthbg" 
     style="cursor:crosshair;" 
     width="5%" 
     nowrap="1" 
     rowspan="1">
       <a class="listdatelink" 
          href="/student_center/sc_all_rooms/d05/09/2010/edit?style=L&amp;display=W&amp;positioning=A&amp;filterblock=N&amp;adddirect=yes&amp;accessid=CreateNewEdit">Sep 5</a>
    </td>

And what the code really looks like is even worse. All I really need out of there is:

<td rowspan="1">Sep 5</td>

Two rows later, there is a <td> with a rowspan of 17. For multi-row spans I was thinking something like this:

<tr>
  <td rowspan="2">Sep 5</td>
  <td>Some event</td>
</tr>
<tr>
  <td>Some other event</td>
</tr>

would end up like this:

[["Sep 5", "Some event"],
 [None, "Some other event"]]

There are multiple tables on the page, and I can already find the one I want; I'm just not sure how to parse out the information I need. I know I can use BeautifulSoup's "renderContents", but in some cases there are link tags that I need to get rid of (while keeping the text).

I was thinking of a process something like this:

Find table
Count rows in tables (len(table.findAll('tr'))?)
Create list
Parse table into list (BeautifulSoup syntax???)
???
Profit! (Well, it's a purely internal program, so not really... )
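
The steps above can be sketched roughly as follows. This is only one possible approach, not the asker's or any answerer's code: to stay free of third-party packages it swaps BeautifulSoup for the standard library's html.parser, and the names TableParser and table_to_grid are made up for illustration. It assumes a single, non-nested table with well-formed rowspan/colspan values. Cells covered by a span are filled with None, and tags inside a cell (such as the links) are dropped while their text is kept.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect (text, rowspan, colspan) for every cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one list of cell tuples per <tr>
        self._cell = None    # text fragments of the open cell, or None
        self._span = (1, 1)  # (rowspan, colspan) of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            attrs = dict(attrs)
            self._span = (int(attrs.get("rowspan") or 1),
                          int(attrs.get("colspan") or 1))
            self._cell = []
        # Any other tag (for example <a>) is ignored, so only its text survives.

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())  # collapse whitespace
            self.rows[-1].append((text, *self._span))
            self._cell = None


def table_to_grid(html):
    """Return the table as a list of lists, with None in spanned-over cells."""
    parser = TableParser()
    parser.feed(html)
    parser.close()

    grid = []
    carry = {}  # column index -> rows below still covered by a rowspan
    for cells in parser.rows:
        row, col, queue = [], 0, list(cells)
        while queue or any(c >= col for c in carry):
            if col in carry:              # covered by a rowspan from above
                row.append(None)
                carry[col] -= 1
                if not carry[col]:
                    del carry[col]
                col += 1
            elif queue:                   # place the next real cell
                text, rowspan, colspan = queue.pop(0)
                for offset in range(colspan):
                    row.append(text if offset == 0 else None)
                    if rowspan > 1:
                        carry[col + offset] = rowspan - 1
                col += colspan
            else:                         # gap before a carried column
                row.append(None)
                col += 1
        grid.append(row)
    return grid


html = """
<table>
<tr>
  <td rowspan="2"><a class="listdatelink" href="#">Sep 5</a></td>
  <td>Some event</td>
</tr>
<tr>
  <td>Some other event</td>
</tr>
</table>
"""
print(table_to_grid(html))
# [['Sep 5', 'Some event'], [None, 'Some other event']]
```

Keeping a per-column counter of rows still covered by a rowspan avoids having to know the table's width in advance, which is why the row count from the second step is not needed here.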

There was a recent discussion in the Python group on LinkedIn about a similar issue, and apparently lxml is the most recommended Pythonic parser for HTML pages.

http://www.linkedin.com/groupItem?view=&gid=25827&type=member&item=27735259&qid=d2948a0e-6c0c-4256-851b-5e7007859553&goback=.gmp_25827

You'll probably need to identify the table by some attribute, such as its id or name.

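# BeautifulSoup 4 (package "bs4"); under BeautifulSoup 3 this was
# "from BeautifulSoup import BeautifulSoup".
from bs4 import BeautifulSoup

data = """
<table>
<tr>
  <td>1,1</td>
  <td>1,2</td>
</tr>
<tr>
  <td>2,1</td>
  <td>2,2</td>
</tr>
</table>
"""

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data, "html.parser")

for table in soup.find_all('table'):
    for tr in table.find_all('tr'):
        # get_text() keeps only the text of each cell, dropping nested tags
        print([td.get_text(" ", strip=True) for td in tr.find_all('td')])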

Edit: What should the program do if there are multiple links?

Ex:

<td><a href="#">A</a> B <a href="#">C</a></td>
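
One possible answer, assuming the asker only wants the visible text: join every text fragment in the cell in document order and drop the tags. The sketch below uses the standard library's html.parser rather than BeautifulSoup, and the names TextOnly and cell_text are hypothetical helpers, not part of any library.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep the text of a fragment and ignore every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def cell_text(fragment):
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    # Join the fragments in document order, then collapse runs of whitespace.
    return " ".join("".join(parser.parts).split())


print(cell_text('<td><a href="#">A</a> B <a href="#">C</a></td>'))
# A B C
```

If the link targets matter as well, the same parser could record the href values in handle_starttag, but that goes beyond what the question asks for.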
