Regex: Optional HTML tags in HTML?

Developer https://www.devze.com 2023-04-09 15:26 Source: web
I need to parse some values from HTML. I'm using the following regex to pull out some groups, but it breaks when there are optional tags in the middle of the HTML. I need a rule that pulls out the values from each repeated block of the HTML page, even when the optional tags are included.

 onclick="return raise('SelectFare', new SelectFareEventArgs(1, 3, 'F'))" required="true" requiredError="Please select a flight and fare in every market."></td><td>Regular Fare</td><td>Adult<br></td><td align="right" style="font-size:110%;">91.99 EUR<br><div style="font-style: italic; font-size: 10px;">Only<span style="color: red;"> 4 </span>seats left at this fare</div></td><td></td><td><b>Fri</b>30 Sep 11<br><b>Flight</b>FR 818</td><td>15:10 Depart<br>16:15 Arrive</td></tr><tr id="1_2011_8_30_23_45_00"><td><div class="planeImg1" title="Click to select this fare on this flight"></div></td><td><input

For example, the optional <div style="font-style: italic; font-size: 10px;">Only<span style="color: red;"> 4 </span>seats left at this fare</div> section of this is messing it up.

tr><tr id="1_2011_9_21_16_05_00"><td><div class="planeImg1" title="Click to select this fare on this flight"></div></td><td><input id="AvailabilityInputFRSelectView_RadioButtonMkt1Fare2" type="radio" name="AvailabilityInputFRSelectView$market1" value="H~HDIS1~XXXC~~RoundFrom|FR~ 816~ ~~DUB~10/21/2011 14:55~EDI~10/21/2011 16:05" onclick="return raise('SelectFare', new SelectFareEventArgs(1, 2, 'H'))" required="true" requiredError="Please select a flight and fare in every market."></td><td>No Taxes</td><td>Adult<br></td><td align="right" style="font-size:110%;"><strike style="color:#F00;font-size:80%;"><b style="color: #999;">22.99 EUR</b></strike>
                             (-35%)
                          <br>14.94 EUR<br></td><td></td><td><b>Fri</b>21 Oct 11<br><b>Flight</b>FR 816</td><td>14:55 Depart<br>16:05 Arrive</td></tr><tr id="1_2011_9_21_16_15_00"><td><div class="planeImg1" title="Click

The

<strike . . </strike>. . (-35%). . <br>14.94 EUR<br></td>

part of the HTML above is messing it up as well.

This is the regex I'm trying (and various other versions!!):

"Please select(?:.*?)<td>(.*?)</td><td>(.*?)<br></td><td align=\"right\" style=\"font-size:110%;\">(.*?)<br>(.*?)<br>(?:.*?)</b>(.*?)<br><b>Flight</b>(.*?)</td><td>(.*?)<br>(.*?)</td>"

I'd appreciate any help at all on this, or even a reference to learning how to parse out optional HTML tags altogether.

Thanks.


You can't reliably parse (X)HTML with a regex, so don't try. Use a proper parser that builds a Document Object Model (DOM) for you. Since you tagged your question with JavaScript, I recommend using jQuery to build an object graph of your HTML, like this:

var $document = $(html);

This $document object can now be operated on with methods like $document.find() to dig out the elements you want from the HTML.
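
To make that concrete, here is a rough sketch of how one fare row could be read once it is a DOM element. It is an illustration, not a tested solution for the real page: the helper names (`clean`, `parseFareCells`, `parseFareRow`) are made up, and the cell order is assumed from the two sample rows in the question. Because it works on each cell's text rather than its markup, the optional `<div>` and `<strike>` tags stop mattering.

```javascript
// Cell order assumed from the sample rows in the question:
// 0 icon, 1 radio button, 2 fare type, 3 passenger, 4 price, 5 empty,
// 6 day/date/flight, 7 departure/arrival times.

// Collapse whitespace so line breaks inside a cell do not matter.
function clean(text) {
  return text.replace(/\s+/g, " ").trim();
}

// Hypothetical helper: turn the text of one row's cells into an object.
// Returns null when the cells do not look like a fare row.
function parseFareCells(cellTexts) {
  const cells = cellTexts.map(clean);
  if (cells.length < 8) return null;

  // A discounted row lists the old price first, so keep the last price found.
  const prices = cells[4].match(/\d+(?:\.\d+)? [A-Z]{3}/g) || [];
  const dateFlight = cells[6].match(
    /^([A-Za-z]{3})\s*(\d{1,2} [A-Za-z]{3} \d{2})\s*Flight\s*(.+)$/
  );
  const times = cells[7].match(
    /(\d{1,2}:\d{2})\s*Depart\s*(\d{1,2}:\d{2})\s*Arrive/
  );
  if (prices.length === 0 || !dateFlight || !times) return null;

  return {
    fareType: cells[2],
    passenger: cells[3],
    price: prices[prices.length - 1],
    day: dateFlight[1],
    date: dateFlight[2],
    flight: dateFlight[3],
    depart: times[1],
    arrive: times[2],
  };
}

// Works on any DOM <tr> element, including the ones inside a jQuery object.
function parseFareRow(row) {
  const texts = Array.from(row.querySelectorAll("td"), (td) => td.textContent);
  return parseFareCells(texts);
}
```

With the $document object from above, something like $document.find("tr[id]").get().map(parseFareRow) should give one object per fare row, assuming every fare row is a tr with an id as in the samples. The small regexes that remain only pick apart short plain-text strings, which is a job regex handles well.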
