Awk selecting data rows

Developer https://www.devze.com 2023-03-22 12:12 Source: Internet

I have to process the following datafile using awk:

YEARS:1995:1996:1997:1998:1999:2000
VISITS
Domain1:259:2549:23695:24889:1240:21202
Domain2:32632:87521:147122:22952:2365:121230
Domain3:5985:92104:921744:43124:74234:68350
Domain4:8321:36520:68712:32102:22003:82100
SIGNUPS
Domain1:212:202:992:1202:986:3253
Domain2:10401:44522:20103:3595:11410:353
Domain3:3695:23230:452030:25052:9858:3020
Domain4:969:24247:9863:24101:5541:3663

For each year and for each domain, I need the total visits and signups. My problem is that I can't find a way to select only the first four and the last four data rows. Can anybody give me a hint on how to achieve that?

Example output (Visits only):

VISITS
Domain1     73834
Domain2     413822
Domain3     1205541
Domain4     249758

        1995    1996    1997    1998    1999    2000
All     47197   218694  1161273 123067  99842   292882

You could match the "VISITS" and "SIGNUPS" rows and set a variable indicating which kind of record you are currently processing.

An example:

BEGIN {
    FS = ":";
}
/^YEARS/ {
    for (i = 2 ; i <= NF; i++) {
        year[i] = $i;
    }
    next;
}
/^VISITS/ {
    mode = "VISITS";
    next;
}
/^SIGNUPS/ {
    mode = "SIGNUPS";
    next;
}
{
    for (i = 2; i <= NF; i++) {
        # output "VISITS"/"SIGNUPS", domain, year, value
        print mode, $1, year[i], $i;
    }
}
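To pick out just one section, the same mode variable can serve as a filter. The following is only a minimal sketch: the helper name section_totals and the awk variable want are made up for illustration, and the question's data is written to a scratch file so the example can be run as-is.

```shell
# Sample data from the question, written to a scratch file.
infile=$(mktemp)
cat > "$infile" <<'EOF'
YEARS:1995:1996:1997:1998:1999:2000
VISITS
Domain1:259:2549:23695:24889:1240:21202
Domain2:32632:87521:147122:22952:2365:121230
Domain3:5985:92104:921744:43124:74234:68350
Domain4:8321:36520:68712:32102:22003:82100
SIGNUPS
Domain1:212:202:992:1202:986:3253
Domain2:10401:44522:20103:3595:11410:353
Domain3:3695:23230:452030:25052:9858:3020
Domain4:969:24247:9863:24101:5541:3663
EOF

# Hypothetical helper: total per domain for one section of the file.
# $1 = section name (VISITS or SIGNUPS), $2 = data file
section_totals() {
  awk -F: -v want="$1" '
    /^YEARS/ { next }                  # header row, not needed here
    NF == 1  { mode = $1; next }       # section marker: VISITS or SIGNUPS
    mode == want {                     # only rows of the requested section
      sum = 0
      for (i = 2; i <= NF; i++) sum += $i
      print $1, sum
    }
  ' "$2"
}

visits_out=$(section_totals VISITS "$infile")
signups_out=$(section_totals SIGNUPS "$infile")
printf '%s\n' "$visits_out"
rm -f "$infile"
```

Because the rows are printed as they are read, the domains come out in the order they appear in the file.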

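awk -F: 'END { out( ) }            # flush the last section
/^YEARS/ {
  # remember the year labels and build the header line
  for ( i = 1; ++i <= NF; ) {
    y[i] = $i
    yh = yh ? yh OFS $i : $i
  }
  ny = NF; next
}
NF == 1 {
  # a one-field line (VISITS, SIGNUPS) starts a new section:
  # print the previous section, if any, then remember the new name
  m && out( ); m = $1
}
{
  ym[y[1]] = "ALL:"                # y[1] is unset, so this key labels the totals row
  for ( i = 1; ++i <= NF; ) {
    d[$1] += $i; ym[y[i]] += $i    # totals per domain and per year
  }
}
function out( ) {                  # "function" is the portable spelling of "func"
  print m
  for ( D in d ) print D, d[D]
  printf "\n%s\n", OFS yh
  for ( i = 0; ++i <= ny; )
    printf "%s", ( ym[y[i]] ( i < ny ? OFS : RS ) )
  # x is unset: print a blank line, then empty both arrays
  print x; split( x, d ); split( x, ym )
}' OFS='\t' infile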

With GNU awk you could use:

delete d; delete ym

instead of:

split( x, d ); split( x, ym )
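The split trick works because split empties the target array before storing the pieces, and an empty string yields no pieces. A small sketch of that idiom, using arbitrary sample values and an explicit "" where the script above uses the unset variable x:

```shell
result=$(awk 'BEGIN {
  d["Domain1"] = 73834; d["Domain2"] = 413822   # arbitrary sample entries
  n = 0; for (k in d) n++
  print "before:", n
  split("", d)                                  # portable way to empty the array
  n = 0; for (k in d) n++
  print "after:", n
}')
printf '%s\n' "$result"
```

This should report two elements before the split and none after it.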

When you say "select only the first four and the last four rows", I assume you mean to process the visits and signups separately:

awk -F: '
$1 == "YEARS"   {for (i=2; i<=NF; i++) {yr[i] = $i}; next}
$1 == "VISITS"  {visits = 1; signups = 0; next}
$1 == "SIGNUPS" {visits = 0; signups = 1; next}
visits { 
  for (i=2; i<=NF; i++) {
    v_d[$1] += $i     # visits by domain
    v_y[yr[i]] += $i  # visits by year
  }
}
signups {
  for (i=2; i<=NF; i++) {
    s_d[$1] += $i     # signups by domain
    s_y[yr[i]] += $i  # signups by year
  }
}
END {
  OFS=FS
  print "VISITS"
  for (d in v_d) print d, v_d[d]
  for (y in v_y) print y, v_y[y]
  print "SIGNUPS"
  for (d in s_d) print d, s_d[d]
  for (y in s_y) print y, s_y[y]
}' infile

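Given your input, this outputs the following. The order in which a for (key in array) loop visits the keys is unspecified, so the lines within each group may appear in a different order: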

VISITS
Domain1:73834
Domain2:413822
Domain3:1205541
Domain4:249758
1999:99842
2000:292882
1995:47197
1996:218694
1997:1161273
1998:123067
SIGNUPS
Domain1:6847
Domain2:90384
Domain3:516885
Domain4:68384
1999:27795
2000:10289
1995:15277
1996:92201
1997:482988
1998:53950
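If the years should come out in the order of the YEARS row, one option is to index the totals by column number instead of by year label and to loop over the columns in END. This is only a sketch of that idea for the visits section; the variable names are illustrative, and the data is again written to a scratch file so it runs on its own.

```shell
# Sample data from the question, written to a scratch file.
infile=$(mktemp)
cat > "$infile" <<'EOF'
YEARS:1995:1996:1997:1998:1999:2000
VISITS
Domain1:259:2549:23695:24889:1240:21202
Domain2:32632:87521:147122:22952:2365:121230
Domain3:5985:92104:921744:43124:74234:68350
Domain4:8321:36520:68712:32102:22003:82100
SIGNUPS
Domain1:212:202:992:1202:986:3253
Domain2:10401:44522:20103:3595:11410:353
Domain3:3695:23230:452030:25052:9858:3020
Domain4:969:24247:9863:24101:5541:3663
EOF

# Visits per year, printed in the column order of the YEARS row.
year_totals=$(awk -F: '
  $1 == "YEARS" { n = NF; for (i = 2; i <= NF; i++) yr[i] = $i; next }
  NF == 1       { visits = ($1 == "VISITS"); next }       # section marker
  visits        { for (i = 2; i <= NF; i++) v[i] += $i }  # sum by column index
  END           { for (i = 2; i <= n; i++) print yr[i] ":" v[i] }
' "$infile")
printf '%s\n' "$year_totals"
rm -f "$infile"
```

The same change applies to the signups arrays. Piping the original output through sort is another option when the grouping under the VISITS and SIGNUPS headings does not need to be preserved.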