
Regex fails to match for no obvious reason

Source: https://www.devze.com, 2023-01-06 02:29 (reposted from the web)

Consider the following two regular expression snippets and the dummy HTML they should match:

Apparently, I can only post one link until I get more reputation, so the link below contains the three links I referenced above:


http://pastebin.com/Qj1uxfdk

The difference between the two snippets, if anyone is wondering, is that (((.{2,20}?), (.{2,20}?))?) has been removed about halfway through.

The first snippet does not match the text, but the second one does, and I cannot figure out why. I tried putting a dummy expression that should match anything in its place (such as (.{1})?) and it still fails to match it, but when I remove it, it suddenly matches again.

I've been toiling with this stupid expression for the last 4 hours and I'm about at my wits' end. Can anybody help?


I am terribly sorry; I know this answer won't be much appreciated, for various reasons, but I feel I have to say it anyway.

It seems to me that you are probably using the wrong tool. I suggest you use a real parser that is intended to parse (X)HTML/XML. I think HTML contains far more subtleties than you can realistically catch with a regular expression. I haven't written any PHP for quite some time, but I am sure it has the necessary tools to do the parsing for you (maybe this?).

Of course it is exciting to do everything yourself, but it is more practical to take advantage of what's been done (and tested) for you.

I hope you will keep this in mind.

PS: Yes, I know that the usual "Do not parse XML with regex" statement is extremely trite, but that doesn't stop it from being true in the majority of cases.


Since you seem to know that regex isn't really the right tool for parsing HTML, why do you still try to?

DOM, for example, isn't as hard as you might think.
A basic example of getting all the td's in your HTML:

$html = <<< EOL
<tr><td nowrap class="border_on_rbl"><a href="employee_view.html?employee_id=1337">bloblaw</td><td nowrap class="border_on_rb">Loblaw, Bob</b></td><td nowrap class="border_on_rb">Lawyer</td>
<td nowrap class="border_on_rb">Legal</td>
<td nowrap class="border_on_rb">person4</td><td nowrap class="border_on_rb"></td><td nowrap class="border_on_rb">Bluth, Maeby</td><td nowrap class="border_on_rb"><a href=mailto:bloblaw@theplanet.com>bloblaw@theplanet.com</a></td><td nowrap class="border_on_rb">555.555.5555</td><td nowrap class="border_on_rb">1337</td></tr>
EOL;

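// Collect libxml's warnings about the malformed markup instead of printing them.
libxml_use_internal_errors(true);

// loadHTML() is an instance method; calling it statically fails on PHP 8.
$dom = new DOMDocument();
$dom->loadHTML($html);

// Print the text content of every <td> cell.
$tds = $dom->getElementsByTagName('td');
foreach ($tds as $td) {
    echo $td->nodeValue.'<br>';
}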

Take some time to read the manual and a few tutorials or articles about DOM, and regex problems when parsing HTML (and not just HTML) should largely go away.
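To go one step further than listing cells, here is a minimal sketch of pulling the employee ID and the cell text out of each row with DOMXPath. The shortened row, the $employees array, and its 'id' / 'cells' keys are assumptions made up for illustration; they are not part of the original question, and the real rows have more columns.

```php
<?php
// Hypothetical, trimmed-down row modelled on the sample HTML above.
$html = '<table><tr>'
      . '<td><a href="employee_view.html?employee_id=1337">bloblaw</a></td>'
      . '<td>Loblaw, Bob</td>'
      . '<td>Lawyer</td>'
      . '</tr></table>';

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$employees = [];                    // illustrative result structure

foreach ($xpath->query('//tr') as $tr) {
    // Text of each cell in this row, in document order.
    $cells = [];
    foreach ($xpath->query('./td', $tr) as $td) {
        $cells[] = trim($td->textContent);
    }

    // The ID lives in the query string of the first link in the row.
    $id = null;
    $link = $xpath->query('.//a[@href]', $tr)->item(0);
    if ($link !== null) {
        $query = (string) parse_url($link->getAttribute('href'), PHP_URL_QUERY);
        parse_str($query, $params);
        $id = $params['employee_id'] ?? null;
    }

    $employees[] = ['id' => $id, 'cells' => $cells];
}

print_r($employees);
```

Because the ID is read from the parsed href rather than matched as text, attribute order and extra whitespace in the markup should not matter here.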


It was a bit easier to rewrite it than to debug it, so here's my approach:

preg_match_all(
    '%<tr>[^<]*
      <td[^>]*><a.*?employee_id=(\d*).*?>(\w*)\s*.*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*),\s*(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
      <td[^>]*><a[^>]*>(.*?)</a>.*?&nbsp;</td>[^<]*
      <td[^>]*>(\d{3}\.\d{3}\.\d{4}).*?&nbsp;</td>[^<]*
      <td[^>]*>(\w*).*?&nbsp;</td>[^<]*
    </tr>%sx', 
    $subject, $result, PREG_SET_ORDER);

It works for your example, and you can tweak it if you want more or less validation.
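For reading the result, here is a small sketch of how $result is laid out with PREG_SET_ORDER. It uses only the first two cells of the pattern above, and the two-cell $subject is a made-up stand-in for the real row, so treat it as an illustration, not the full expression.

```php
<?php
// Hypothetical two-cell row; the real rows have more columns.
$subject = '<tr><td><a href="employee_view.html?employee_id=1337">bloblaw</a>&nbsp;</td>'
         . '<td>Loblaw, Bob&nbsp;</td></tr>';

// First two cells of the pattern above; the x flag ignores the layout whitespace.
$pattern = '%<tr>[^<]*
              <td[^>]*><a.*?employee_id=(\d*).*?>(\w*)\s*.*?&nbsp;</td>[^<]*
              <td[^>]*>(\w*),\s*(\w*).*?&nbsp;</td>[^<]*
            </tr>%sx';

$count = preg_match_all($pattern, $subject, $result, PREG_SET_ORDER);

if ($count === false) {
    // false means the engine reported an error (for example the backtrack
    // limit was hit), which is different from "0 matches".
    echo 'PCRE error code: ' . preg_last_error() . "\n";
} else {
    foreach ($result as $row) {
        // $row[0] is the whole match; $row[1], $row[2], ... are the capture groups in order.
        echo "id={$row[1]} login={$row[2]} last={$row[3]} first={$row[4]}\n";
    }
}
```

With PREG_SET_ORDER each element of $result is one matched row, which tends to be easier to loop over than the default grouping by capture group. Checking for false separately is worthwhile, since a long pattern that silently errors out looks exactly like a pattern that fails to match.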
