开发者

How to match a column name and find out the column position in awk?

开发者 https://www.devze.com 2023-03-29 22:58 出处:网络
I am trying to parse some csv files using awk. I am new to s开发者_运维技巧hell scripting and awk.

I am trying to parse some csv files using awk. I am new to s开发者_运维技巧hell scripting and awk. The csv file i am working on looks something like this :

fnName,minAccessTime,maxAccessTime
getInfo,300,600
getStage,600,800
getStage,600,800
getInfo,250,620
getInfo,200,700
getStage,700,1000
getInfo,280,600

I need to find the average AccessTimes of the different functions.

I have been working with awk and have been able to get the average times provided the exact column numbers are specified like $2, $3 etc.

However I need to have a general script in which if i input "minAccessTime" in the command argument, I need the script to print the average AccessTime (instead of explicitly specifying $2 or $3 while using awk).

I have been googling about this and saw in various forums but none of them seems to work. Can someone tell me how to do this ? It would be of great help !

Thanks in advance!!


This awk script should give you all that you want.

It first evaluates which column you're interested in by using the name passed in as the COLM variable and checking against the first line. It converts this into an index (it's left as the default 0 if it couldn't find the column).

It then basically runs through all other lines in your input file. On all these other lines (assuming you've specified a valid column), it updates the count, sum, minimum and maximum for both the overall data plus each individual function name.

The former is stored in count, sum, min and max. The latter are stored in associative arrays with similar names (with _arr appended).

Then, once all records are read, the END section outputs the information.

NR == 1 {
    for (i = 1; i <= NF; i++) {
        if ($i == COLM) {
            cidx = i;
        }
    }
}

NR > 1 {
    if (cidx > 0) {
        count++;
        sum += $cidx;
        if (count == 1) {
            min = $cidx;
            max = $cidx;
        } else {
            if ($cidx < min) { min = $cidx; }
            if ($cidx > max) { max = $cidx; }
        }

        count_arr[$1]++;
        sum_arr[$1] += $cidx;
        if (count_arr[$1] == 1) {
            min_arr[$1] = $cidx;
            max_arr[$1] = $cidx;
        } else {
            if ($cidx < min_arr[$1]) { min_arr[$1] = $cidx; }
            if ($cidx > max_arr[$1]) { max_arr[$1] = $cidx; }
        }
    }
}

END {
    if (cidx == 0) {
        print "Column '" COLM "' does not exist"
    } else {
        print "Overall:"
        print "   Total records = " count
        print "   Sum of column = " sum
        if (count > 0) {
            print "   Min of column = " min
            print "   Max of column = " max
            print "   Avg of column = " sum / count
        }
        for (task in count_arr) {
            print "Function " task ":"
            print "   Total records = " count_arr[task]
            print "   Sum of column = " sum_arr[task]
            print "   Min of column = " min_arr[task]
            print "   Max of column = " max_arr[task]
            print "   Avg of column = " sum_arr[task] / count_arr[task]
        }
    }
}

Storing that script into qq.awk and placing your sample data into qq.in, then running:

awk -F, -vCOLM=minAccessTime -f qq.awk qq.in

generates the following output, which I'm relatively certain will give you every possible piece of information you need:

Overall:
   Total records = 7
   Sum of column = 2930
   Min of column = 200
   Max of column = 700
   Avg of column = 418.571
Function getStage:
   Total records = 3
   Sum of column = 1900
   Min of column = 600
   Max of column = 700
   Avg of column = 633.333
Function getInfo:
   Total records = 4
   Sum of column = 1030
   Min of column = 200
   Max of column = 300
   Avg of column = 257.5

For `maxAccessTime, you get:

Overall:
   Total records = 7
   Sum of column = 5120
   Min of column = 600
   Max of column = 1000
   Avg of column = 731.429
Function getStage:
   Total records = 3
   Sum of column = 2600
   Min of column = 800
   Max of column = 1000
   Avg of column = 866.667
Function getInfo:
   Total records = 4
   Sum of column = 2520
   Min of column = 600
   Max of column = 700
   Avg of column = 630

And, for xyzzy (a non-existent column), you'll see:

Column 'xyzzy' does not exist


If I understand the requirements correctly, you want the average of a column, and you'd like to specify the column by name.

Try the following script (avg.awk):

BEGIN {
  FS=",";
}

NR == 1 {
  for (i=1; i <= NF; ++i) {
    if ($i == SELECTED_FIELD) {
      SELECTED_COL=i;
    }
  }
}

NR > 1 && $1 ~ SELECTED_FNAME {
  sum[$1] = sum[$1] + $SELECTED_COL;
  count[$1] = count[$1] + 1;
}

END {
  for (f in sum) {
    printf("Average %s for %s: %d\n", SELECTED_FIELD, f, sum[f] / count[f]);
  }
}

and invoke your script like this

awk -v SELECTED_FIELD=minAccessTime -f avg.awk < data.csv

or

awk -v SELECTED_FIELD=maxAccessTime -f avg.awk < data.csv

or

awk -v SELECTED_FIELD=maxAccessTime -v SELECTED_FNAME=getInfo -f avg.awk < data.csv

EDIT:

Rewritten to group by function name (assumed to be first field)

EDIT2:

Rewritten to allow additional parameter to filter by function name (assumed to be first field)

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号