Reading file in a pattern using awk

开发者 https://www.devze.com 2023-01-05 19:59 Source: Internet

I have an input file in the following format:

<td> Name1 </td>
<td> <span class="test"><a href="url1">Link </a></span></td>
<td> Name2 </td>
<td> <span class="test"><a href="url2">Link </a></span></td>

I want an awk script that reads this file and produces the following output:

url1 Name1
url2 Name2

Can anyone help me out with this trivial-looking problem? Thanks.


Extracting one href per line is relatively simple, as long as the links conform to XHTML standards, there is at most one on a line, and you don't care about the enclosing tags. Perl is easier, though:

$ perl -ne 'print "$1\n" if /href="([^"]+)"/'
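Since the question asks for awk, the same one-href-per-line extraction can be sketched in POSIX awk with match() and substr(). This is only a sketch under the same assumptions (a double-quoted attribute value, at most one href per line); the function name extract_hrefs is made up for illustration.

```shell
# Hypothetical helper: print the value of the first href="..." on each input line.
extract_hrefs() {
  awk '
    # match() sets RSTART and RLENGTH for the leftmost href="..." on the line
    match($0, /href="[^"]+"/) {
      # skip the 6 characters of href=" and drop the closing quote
      print substr($0, RSTART + 6, RLENGTH - 7)
    }
  ' "$@"
}

# Sample input taken from the question
printf '%s\n' \
  '<td> Name1 </td>' \
  '<td> <span class="test"><a href="url1">Link </a></span></td>' |
  extract_hrefs
```

Lines without an href are skipped silently, and file names passed to extract_hrefs are handed straight to awk.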

If you care about enclosing tags or they are not standard conformant, you cannot use regular expressions to parse HTML. It is impossible.

Added: oops, you do care about context, so forget about regexps and use a real HTML parser.


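Here is an awk script that does the job:

awk '
# Link line: strip everything up to the URL and everything after it, then print it with the saved name
/a href=".*"/ { sub(/^.*a href="/, ""); sub(/".*/, ""); print $0, name }
# Every line: remember the second field, which on a name line is the name
                { name = $2 }
' infile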


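This might work:

awk 'BEGIN { i = 1 }
     { line[i++] = $0 }    # store every input line
     END {
       # join each link line with the name line that precedes it
       for (j = 1; j < i; j += 2)
         print line[j+1] line[j]
     }' yourfile | awk '{ split($4, part, "\""); print part[2], $6 }'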


gawk '/^<td>/ {n = $2; getline; print gensub(/.*href="([^"]*).*/,"\\1",1), n}' infile

url1 Name1
url2 Name2


awk 'BEGIN{RS="></td>\n"; FS="> | </|\""}{print $7, $2}' infile

This treats every two lines as one record.
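A multi-character RS is treated as a regular expression by gawk and mawk, but POSIX leaves that behavior unspecified. Where it is not available, the same pairing of two lines can be sketched with a line-parity test instead. This assumes name and link lines strictly alternate with no blank lines between them; pair_lines is a hypothetical name.

```shell
# Hypothetical helper: pair each name line (odd) with the link line that follows it (even).
pair_lines() {
  awk '
    # odd lines hold the name as the second whitespace-separated field
    NR % 2 == 1 { name = $2; next }
    # even lines hold the link: split on double quotes and take the piece after href=
    {
      n = split($0, part, "\"")
      for (k = 1; k < n; k++)
        if (part[k] ~ /href=$/) { print part[k + 1], name; break }
    }
  ' "$@"
}

# Sample input taken from the question
printf '%s\n' \
  '<td> Name1 </td>' \
  '<td> <span class="test"><a href="url1">Link </a></span></td>' \
  '<td> Name2 </td>' \
  '<td> <span class="test"><a href="url2">Link </a></span></td>' |
  pair_lines
```

Splitting on the double quote avoids counting field positions, so the sketch keeps working if the class attribute or the link text changes.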
