
How to process duplicate columns with conditions

Developer https://www.devze.com 2023-04-07 11:31 Source: web
I need to skip all the rows that share the same value in column one if column 2 is empty in any of them. For the remaining rows, I need to calculate the percentage (ratio) of column 4 over column 3.

Input:


T75PA       2   0   
T75PA   kk  4   1   
T240P       4   3   
T240P   test    3   3   
T240P   test2   3   1   
T245P   rr  8   1   
T245P   rr  33  1   
T226PA  fg  4   2   
T226PA  g   51  38  
T226PA  e   41  34

Expected output:


T245P   rr  8   1   0.125
T245P   rr  33  1   0.03030303
T226PA  fg  4   2   0.5
T226PA  g   51  38  0.745098039
T226PA  e   41  34  0.829268293

awk '
    NR==FNR {if (NF < 4) blank[$1]; next}
    $1 in blank {next}
    {$(NF+1) = $4/$3; print}
' datafile datafile | column -t

Since you say now that the field separator is tab:

awk '
    BEGIN {OFS = FS = "\t"}
    NR==FNR {if ($2 == "") blank[$1]; next}
    $1 in blank {next}
    {$5 = $4/$3; print}
' datafile datafile
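
To try the tab-separated version without a data file at hand, here is a minimal sketch. The file name sample.tsv, the function name filter_ratio, the extra T300X row, and the NA placeholder for a zero in column 3 are all assumptions for illustration; none of them come from the question.

```shell
# Build a small tab-separated sample (sample.tsv is a hypothetical name).
printf 'T75PA\t\t2\t0\n'    >  sample.tsv
printf 'T75PA\tkk\t4\t1\n'  >> sample.tsv
printf 'T245P\trr\t8\t1\n'  >> sample.tsv
printf 'T226PA\tfg\t4\t2\n' >> sample.tsv
printf 'T300X\tzz\t0\t5\n'  >> sample.tsv   # made-up row: column 3 is zero

# Pass 1 records keys that have an empty column 2; pass 2 prints the rest.
filter_ratio() {
    awk '
        BEGIN {OFS = FS = "\t"}
        NR==FNR {if ($2 == "") blank[$1]; next}
        $1 in blank {next}
        {$5 = ($3 == 0 ? "NA" : $4/$3); print}   # NA is an assumed placeholder
    ' "$1" "$1"
}

filter_ratio sample.tsv
```

With this sample, both T75PA rows are dropped, the T245P and T226PA rows get 0.125 and 0.5 appended, and the made-up T300X row gets NA instead of a division-by-zero error.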

I'll assume your data is tab separated. A Perl script something like this (I haven't tested it)...

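use strict;
use warnings;

my @data;
my %counts;
my %blanks;
while( my $line = <STDIN> ) {
    chomp $line;
    my @rec = split( "\t", $line );
    push( @data, \@rec );
    $counts{$rec[0]}++;                       # rows seen for this key
    $blanks{$rec[0]} = 1 if $rec[1] eq '';    # key has an empty column 2
}
foreach my $rec ( @data ) {
    # Keep keys that occur only once or never have an empty column 2.
    if( $counts{$rec->[0]} <= 1 || !$blanks{$rec->[0]} ) {
        print join( "\t", @$rec, $rec->[3] / $rec->[2] ) . "\n";
    }
}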

How about:

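use Modern::Perl;

# Captures: key, column 2 (text or only whitespace), column 3, column 4.
my $re = qr/^([A-Z0-9]+)\s+?(\S+|\s+)\s+(\d+)\s+(\d+)\s*$/;
my $skip = '';
while (<DATA>) {
    chomp;
    if (my @l = $_ =~ /$re/) {
        # Skip a row with an empty column 2 and the rows that follow it with the same key.
        if ($l[1] =~ /^\s+$/ || $skip eq $l[0]) {
            $skip = $l[0];
            next;
        }
        $skip = '';
        my $r = $l[3] / $l[2];
        say "$_\t$r";
    }
}

__DATA__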

T75PA       2   0   
T75PA   kk  4   1   
T240P       4   3   
T240P   test    3   3   
T240P   test2   3   1   
T245P   rr  8   1   
T245P   rr  33  1   
T226PA  fg  4   2   
T226PA  g   51  38  
T226PA  e   41  34

Output:


T245P   rr  8   1       0.125
T245P   rr  33  1       0.0303030303030303
T226PA  fg  4   2       0.5
T226PA  g   51  38      0.745098039215686
T226PA  e   41  34  0.829268292682927


awk 'NF < 4 {rm[$1]; for (i in res) if (key[i] == $1) delete res[i]; next}  # empty column 2: drop this key
!($1 in rm) {key[NR] = $1; res[NR] = $0 "\t" $4/$3}                         # keep the row with its ratio
END {for (i = 1; i <= NR; i++) if (i in res) print res[i]}' file

This ignores every line with fewer than four fields and drops the other rows that share its first column. For all remaining rows, the ratio is calculated, appended to the row, and saved in the array res. After the file has been processed, the entries of res are printed to stdout:


T245P   rr  8   1       0.125
T245P   rr  33  1       0.030303
T226PA  fg  4   2       0.5
T226PA  g   51  38      0.745098
T226PA  e   41  34          0.829268
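
The six significant digits above come from awk's default number-to-string conversion (CONVFMT, "%.6g"), which applies when the ratio is concatenated onto the line. If a fixed precision or an actual percentage is wanted, printf gives explicit control. A minimal sketch of the formatting step only, assuming whitespace-separated rows on stdin that have already been filtered; the function name as_percent, the two-decimal format, and the % suffix are my own choices, not something the question specifies:

```shell
# Append column 4 over column 3 as a percentage with two decimals.
# Assumes every input row has four fields and a non-zero column 3.
as_percent() {
    awk '{printf "%s\t%.2f%%\n", $0, 100 * $4 / $3}'
}

printf 'T245P rr 8 1\nT226PA g 51 38\n' | as_percent
```

For these two rows the appended values are 12.50% and 74.51%.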

HTH Chris



