How to process duplicate columns with conditions

Developer (https://www.devze.com), 2023-04-07 11:31. Source: Internet
I need to skip all the rows that share the same value in column one if column 2 is empty in any of them. For the remaining rows, I need to calculate the ratio of column 4 over column 3. How can I do that?

Input:

T75PA       2   0   
T75PA   kk  4   1   
T240P       4   3   
T240P   test    3   3   
T240P   test2   3   1   
T245P   rr  8   1   
T245P   rr  33  1   
T226PA  fg  4   2   
T226PA  g   51  38  
T226PA  e   41  34

Output:

T245P   rr  8   1   0.125
T245P   rr  33  1   0.03030303
T226PA  fg  4   2   0.5
T226PA  g   51  38  0.745098039
T226PA  e   41  34  0.829268293


awk '
    NR==FNR {if (NF < 4) blank[$1]; next}
    $1 in blank {next}
    {$(NF+1) = $4/$3; print}
' datafile datafile | column -t

Since you say now that the field separator is tab:

awk '
    BEGIN {OFS = FS = "\t"}
    NR==FNR {if ($2 == "") blank[$1]; next}
    $1 in blank {next}
    {$5 = $4/$3; print}
' datafile datafile
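
The sample as displayed here no longer shows its tab characters, so here is a minimal sketch for trying the tab-separated version on data with real tabs. The function name `filter_ratio` and the temporary file are made up for illustration, and `mktemp` is assumed to be available. The awk program itself is the tab-separated one above, unchanged apart from comments.

```shell
#!/bin/sh
# filter_ratio is a hypothetical helper name, not part of the answer above.
# It spools stdin to a temporary file because the two-pass awk program
# has to read its input twice.
filter_ratio() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    awk '
        BEGIN {OFS = FS = "\t"}
        NR==FNR {if ($2 == "") blank[$1]; next}   # pass 1: note keys with an empty column 2
        $1 in blank {next}                        # pass 2: drop every row of those keys
        {$5 = $4/$3; print}                       # otherwise append column 4 / column 3
    ' "$tmp" "$tmp"
    rm -f "$tmp"
}

# Three rows of the sample, rebuilt with real tab characters.
printf 'T75PA\t\t2\t0\nT75PA\tkk\t4\t1\nT245P\trr\t8\t1\n' | filter_ratio
# prints: T245P<TAB>rr<TAB>8<TAB>1<TAB>0.125
```

Both T75PA rows are dropped because one of them has an empty column 2; only the T245P row survives.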


I'll assume your data is tab-separated. A Perl script something like this (I haven't tested it):

my @data;
my %counts;
my %blanks;
while( my $line = <STDIN> )
{
    chomp($line);    # remove only the newline; chop() would eat a real character on an unterminated last line
    my @rec = split( "\t", $line );
    push( @data, \@rec );
    $counts{$rec[0]}++;
    if( $rec[1] eq '' )
    {
        $blanks{$rec[0]}++;
    }
}
foreach my $rec ( @data )
{
    if( $counts{$rec->[0]} <= 1 || !$blanks{$rec->[0]} )
    {
        print join( "\t", @$rec, $rec->[3] / $rec->[2] ) . "\n";
    }
}


How about:

#!/usr/bin/perl
use Modern::Perl;


my $re = qr/^([A-Z0-9]+)\s+?(\S+|\s+)\s+(\d+)\s+(\d+)\s*$/;
my $skip = '';
while (<DATA>) {
    chomp;
    if (my @l = $_ =~ /$re/) {
        if ($l[1] =~ /^\s+$/ || $skip eq $l[0]) {
            $skip = $l[0];
            next;
        }
        $skip = '';
        my $r = $l[3] / $l[2];
        say "$_\t$r";
    }
}

__DATA__
T75PA       2   0   
T75PA   kk  4   1   
T240P       4   3   
T240P   test    3   3   
T240P   test2   3   1   
T245P   rr  8   1   
T245P   rr  33  1   
T226PA  fg  4   2   
T226PA  g   51  38  
T226PA  e   41  34

output:

T245P   rr  8   1       0.125
T245P   rr  33  1       0.0303030303030303
T226PA  fg  4   2       0.5
T226PA  g   51  38      0.745098039215686
T226PA  e   41  34  0.829268292682927


try:

awk 'NF < 4 {for (i in res) if (key[i] == $1) delete res[i]   # column 2 empty: drop buffered rows of this key
    rm[$1]; next}
!($1 in rm) {res[NR] = $0 "\t" $4/$3; key[NR] = $1}
END {for (i = 1; i <= NR; i++) if (i in res) print res[i]}' file

This skips every line whose first column also appears on a line with fewer than four fields (that is, with an empty column 2). For all other lines, the ratio is calculated, appended to the line, and stored in the array res. After the file has been processed, the entries of res are printed to stdout.

Output:

T245P   rr  8   1       0.125
T245P   rr  33  1       0.030303
T226PA  fg  4   2       0.5
T226PA  g   51  38      0.745098
T226PA  e   41  34          0.829268

HTH Chris
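
The question asks for a percentage, while the outputs above show plain ratios. If a percentage is what is wanted, a minimal single-pass sketch along these lines multiplies by 100 and formats the value with printf. The function name `filter_percent`, the two-decimal format, and the choice to skip rows with a zero in column 3 are my own assumptions. Whitespace-separated input is assumed, so an empty column 2 shows up as a row with fewer than four fields.

```shell
#!/bin/sh
# filter_percent is a hypothetical helper; the %.2f format is an arbitrary choice.
# Single pass: rows are buffered so a blank column 2 can disqualify rows seen earlier.
filter_percent() {
    awk '
        NF < 4 { bad[$1] }                      # empty column 2: remember the key
        { line[NR] = $0; key[NR] = $1 }         # buffer every row in input order
        END {
            for (i = 1; i <= NR; i++) {
                if (key[i] in bad) continue     # skip every row of a flagged key
                n = split(line[i], f)           # f[3] and f[4] hold columns 3 and 4
                if (f[3] == 0) continue         # assumption: skip rows that would divide by zero
                printf "%s\t%.2f%%\n", line[i], 100 * f[4] / f[3]
            }
        }
    '
}

# The row with the empty column 2 comes second here, yet both T240P rows are dropped.
printf 'T240P test 3 3\nT240P 4 3\nT226PA g 51 38\n' | filter_percent
# prints: T226PA g 51 38<TAB>74.51%
```

Because everything is held in memory until END, this reads from a pipe and does not depend on the blank row coming first in its group, at the cost of buffering the whole input.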
