perl sort with wrap around support question_问答_开发者

this is a follow up to my previous "perl sort question" The final solution presented there works fine but the logs I want to sort have several wrap around occurrences of the sorting key. In this case I don't want the next segment of logs starting with 0x000xxxx to go to the top of the output file, see example below of the input and the desired output. Notice that the sorting key can repeat itself in 2 different segments and the segment boundary is not easy to find, some 0xfffxxxx entries will pop after a bunch of 0x000xxxx ones. Any ideas? Please refer to "perl sort question" for reference. the current code without wrap support is

#!/usr/bin/perl -w
my $line;
my $lastkey;
my %data;
while($line = <>) {
  chomp $line;
  if ($line =~ /\b(0x\p{AHex}{8})\b/) {
    # Begin a new entry
    #my $unique_key = $1 . $.; # cred to [Brian Gerard][3] for uniqueness
    my $unique_key = hex($1);
    $data{$unique_key} = $line;
    $lastkey = $unique_key;
  } else {
    # Continue an old entry
    $data{$lastkey} .= $line;
  }
}
print $data{$_}, "\n" for (sort { $a <=> $b } keys %data);

And a sample of the input/output I want to achieve

**input sample**
[2011-05-30 0xfff7ecf9=(bfn:4095,
[2011-05-30 0xfff80176=(bfn:4095,
[2011-05-30 0xfff8db3a=(bfn:4095,
[2011-05-30 0x00005686=(bfn:0,
[2011-05-30 0x00006b05=(bfn:0,
[2011-05-30 0xfff8c698=(bfn:4095,
[2011-05-30 0x00014692=(bfn:0,
[2011-05-30 0x00026537=(bfn:0,
[2011-05-30 0xfff80215=(bfn:4095,
[2011-05-30 0x00026f87=(bfn:0,
[2011-05-30 0x00027754=(bfn:0,

< thousands of lines from 0x000xxxxx to 0xfffxxxxx>

<next wrap zone>
[2011-05-30 0xfff709b4=(bfn:4095,
[2011-05-30 0xfff804f5=(bfn:4095,
[2011-05-30 0x00015af8=(bfn:0,
[2011-05-30 0x00016744=(bfn:0,
[2011-05-30 0xfff8e783=(bfn:4095,
[2011-05-30 0x00007744=(bfn:0,
[2011-05-30 0x0002368c=(bfn:0,
[2011-05-30 0x00024d0d=(bfn:0,
[2011-05-30 0x000326ae=(bfn:0,
[2011-05-30 0x00034ff3=(bfn:0,

< thousands of lines from 0x000xxxxx to 0xfffxxxxx>
< and so on >

 **desired output**

[2011-05-30 0xfff7ecf9=(bfn:4095,
[2011-05-30 0xfff80176=(bfn:4095,
[2011-05-30 0xfff80215=(bfn:4095,
[2011-05-30 0xfff8c698=(bfn:4095,
[2011-05-30 0xfff8db3a=(bfn:4095,
[2011-05-30 0x00005686=(bfn:0,
[2011-05-30 0x00006b05=(bfn:0,
[2011-05-30 0x00014692=(bfn:0,
[2011-05-30 0x00026537=(bfn:0,
[2011-05-30 0x00026f87=(bfn:0,
[2011-05-30 0x00027754=(bfn:0,

< thousands of sorted lines from 0x000xxxxx to 0xfffxxxxx>

[2011-05-30 0xfff709b4=(bfn:4095,
[2011-05-30 0xfff804f5=(bfn:4095,
[2011-05-30 0xfff8e783=(bfn:4095,
[2011-05-30 0x00007744=(bfn:0,
[2011-05-30 0x00015af8=(bfn:0,
[2011-05-30 0x00016744=(bfn:0,
[2011-05-30 0x0002368c=(bfn:0,
[2011-05-30 0x00024d0d=(bfn:0,
[2011-05-30 0x000326ae=(bfn:0,
[2011-05-30 0x00034ff3=(bfn:0,

and so on

Sample of log out of order

[2011-06-06 20:15:48.058200] 0xefe29556=(bfn:3838, sfn:766, sf:2.73, bf:85) / BIN_SEND :  (402) <=  UNKNOWN (sessionRef=0x2)
testSign {
  sigNo = 352785671
  transactionNo = 39027
  cellId = 0
  yrdty = 0
}

[2011-06-06 20:15:48.058468] 0xefe2d262=(bfn:3838, sfn:766, sf:3.00, bf:38) / BIN_REC : BB_SWU_INTERNAL_TIMEOUT2_IND (43) <=   (sessionRef=0x0)
0000 00 00 00 20 00 00 00 3e 00 08 00 05 00 01 00 01  '... ...>........'
0010 00 01 00 02 00 00 00 00 00 00 00 00 00 00 00 00  '................'
0020 00 0f 00 00 05 06 05 07  '........'
(Unknown signal BB_SWU_INTERNAL_TIMEOUT2_IND)



[2011-06-06 20:15:48.058669] 0xefe:30316=(bfn:3838, sfn:766, sf:3.20, bf:49) 1/ BIN_REC :  (525) <=  UNKNOWN (sessionRef=0xa67b0)
testSign {
  sigNo = 23070220
  header {
    cellId = 0
   sfn = 766
   subFrameNo = 2开发者_StackOverflow
  }

  reportList[0] {
 {
   = 1
  bbef = 32 (0x00000020)
  isDtx { isDtx = 0 }
   {  = 1,  = 0,  = 3, nrOfTb = 1, padding0 = 0 }
  rxPower { prbListStart = 0, prbListEnd = 0, rxPowerReport = -1146, sinr = 64 }
  timingAdvanceError { timingAdvanceError = 0 }
  cfrPucch { cfrInfo { ri = 0, cfrLength = 0, cfrFormat = 0, cfrValid = 0, cfrExpected = 0, cfrCrcFlag = 0 }, cfr[] = [0, 0] as hex: [00 00 00 00] }
  }
 } 
}

[2011-06-06 20:15:48.055118] 0xefd91f8b=(bfn:3837, sfn:765, sf:9.67, bf:248) 4/_hInd LEVEL3 .c:1035: <!68!> cellId=0 subframeNr=1 : Combinded pdcchInd DL: pdcch=0: rnti=62 cceIndex=0 nrOfCce=8 nrOfRbaBits=25 startRbaBit=1 rbaBits=4294967168 dciFormat=6 nrOfPayloadBit=26 dciMsg={0x2300 0x7a80} =6 swapFlag=0 mcs={0 29} rv={1 2} ndi={0 0} pucchTpc=1
[2011-06-06 20:15:48.057932] 0xefe2586b=(bfn:3838, sfn:766, sf:2.47, bf:134) 4/ LEVEL2 .c:320: <!.118!> cellId=0 =20 subframeNr=4 : Assigned SE PQ : rnti=62 PQ(lcid=3 pqWeight=16530951 assignableBits=2664560 minPduSize=56)
 [2011-06-06 20:15:48.057932] 0xefe25a28=(bfn:3838, sfn:766, sf:2.47, bf:162) 4/_hInd LEVEL2 gchind.c:81: <!.19!> cellId=0 Receive UL PdcchInd - msg cellNo=0 msg subframe=0 dl subframe=4 msg len=4
 [2011-06-06 20:15:48.058066] 0xefe271d9=(bfn:3838, sfn:766, sf:2.60, bf:29) 4/_hInd LEVEL2 ce_schedsession_ovll1transjobready.c:149: <!.112!> L1Trans is ready and Ul pdcchInd is available -launching combinePdcch FO
 [2011-06-06 20:15:48.058066] 0xefe273b0=(bfn:3838, sfn:766, sf:2.60, bf:59) 4/ LEVEL3 ce_l1transfo.c:829: <!.76!> cellId=0 gRef=20 : Selected SE and PQ: rnti=62 PQ=1 lcid=1 assignableBits=0 assignedBits={0 0} minPduSize=56
 [2011-06-06 20:15:48.058066] 0xefe2744b=(bfn:3838, sfn:766, sf:2.60, bf:68) 4/ LEVEL3 ce_l1transfo.c:829: <!.76!> cellId=0 gRef=20 : Selected SE and PQ: rnti=62 PQ=2 lcid=2 assignableBits=0 assignedBits={0 0} minPduSize=56
 [2011-06-06 20:15:48.058066] 0xefe276b1=(bfn:3838, sfn:766, sf:2.60, bf:107) 4/_hInd LEVEL2 .c:857: <!.xxb!> tempNBundle=0 [0]=0 tempStoredNbundled[1]=9385 tempStoredNbundled[2]=0 tempStoredNbundled[3]=10698 dai=3, dciFormat=6, gRef=20

Something like this. I tried to write a description, but the code is clearer than the description ended up being, I think.

#!/usr/bin/perl -w
use strict;
my $lastkey;
my %data;

# how much can the value can be less than the previous seen maximum
# before it's considered to have jumped forward (where "less" and
# "maximum" are cognizant of wrap-around)
my $max_overlap = 0x40000000;

# we will process the file in chunks, reading while the values
# are between min_value and max_value, and then processing what
# we've read that's not within the max overlap of the new value
my $min_value;
my $max_value;
my $first_line = <>;
if ( defined $first_line && $first_line =~ /\b(0x\p{AHex}{8})\b/ ) {
    my $unique_key = hex($1);
    ($min_value, $max_value) = range_around($unique_key, $max_overlap);

    $data{$unique_key} = $first_line;
    $lastkey = $unique_key;
}
else {
    die "horribly";
}

while(my $line = <>) {
    if ($line =~ /\b(0x\p{AHex}{8})\b/) {
        # Begin a new entry
        my $unique_key = hex($1);

        unless (in_range($unique_key, $min_value, $max_value)) {
            # we've reached the end of a hunk; sort and output as much as we safely can
            my ($new_min_value, $new_max_value) = range_around($unique_key, $max_overlap);

            my @sorted =
                sort { $b>=$min_value <=> $a>=$min_value || $a <=> $b }
                grep in_range($_, $min_value, $new_min_value), keys %data;
            print delete @data{@sorted};

            ($min_value, $max_value) = ($new_min_value, $new_max_value);
        }

        $data{$unique_key} .= $line;
        $lastkey = $unique_key;
    } else {
        # Continue an old entry
        $data{$lastkey} .= $line;
    }
}

print @data{ sort { $b>=$min_value <=> $a>=$min_value || $a <=> $b } keys %data };

# check if value is between min and max, where max may be wrapped
sub in_range {
    my ($value, $min, $max) = @_;
    if ($max > $min) {
         return $value >= $min && $value <= $max;
    }
    else {
         return $value >= $min || $value <= $max;
    }
}

# calculate a range a given increment around a value
sub range_around {
    my ($value, $increment) = @_;
    my $min_value = $value - $increment;
    if ($min_value < 0) { ($min_value += 0xffffffff) +=1 }
    my $max_value = $value + $increment;
    if ($max_value > 0xffffffff) { ($max_value -= 0xffffffff) -= 1 }
    return ($min_value, $max_value);
}

I got rid of your chomp and newlines; they seemed unnecessary (and even unwanted, in the case of a multi-line record).

A little explanation:

sort { $b>=$min_value <=> $a>=$min_value || $a <=> $b } grep in_range($_, $min_value, $new_min_value), keys %data;

This line goes through the keys of %data (which are the segment numbers) and filters them with grep, only accepting ones that are in the range we know is safe to output already (because based on the record we just read (which itself will not be output at this point), we know that the identifiers won't go out of order any lower than $new_min_value).

It then sorts them. sort can be given a comparison block to override the default string-comparison based sort; it is repeatedly called to compare two elements from the list, with $a and $b being aliased to the element, and is expected to return -1, 0, or 1 depending on which should be sorted first, like the cmp and <=> operators do.

Here, it sorts using two comparisons. The first checks to see if either or both numbers being compared have wrapped around (by comparing them directly with the minimum value of the range; ones that have wrapped around from 0xfff... back to 0x000... will be less than $min_value). If only $a has wrapped around, the first <=> will be 1 <=> 0, giving 1, indicating $a sorts later than $b. If only $b has wrapped around, the <=> will be 0 <=> 1, giving -1, indicating $b sorts later than $a. In either case, the || is short circuited, so the second comparison isn't performed. When neither or both $a and $b have wrapped around, the <=> will be 1 <=> 1 or 0 <=> 0, respectively, which returns 0, causing the || to try the second comparison that just determines whether $a or $b is numerically first.