Aligning data table created from perl hash_问答_开发者

I'm trying to write a scri开发者_运维知识库pt to process output from behavioral testing equipment. I need to have all data aligned by timestamp in the resulting CSV file. Here's the catch: the start time differs between test runs (it's close, but not exact - can be off by a few seconds to several minutes). I can get the output I want, and I think I have a good idea as to how I can align all variables, but don't know how to implement it.

All data is in a hash with two levels ( %hash{id}{vars} ) with all variables stored as a number to keep things simple (variable names are read from an array on printout). Once all data has been scraped from the input files, the script walks through the hash and prints out data as follows:

Variable 1
ID #1   data1   data2   data3...
ID #2   data1   data2   data3...
...
Variable 2
...

and so on.

These are 24 h recordings. The last datapoint (var=20) for all subjects is light: data reads either "ON" or "OFF" for day and night. The best method of alignment I can see is to use the light OFF marker to align data.

My thinking is as follows:

1. Find first position for each ID for which var '20' = 'OFF' and record position

2. Figure out which ID has the greatest position for OFF (ie, the one that started recording earliest)

3. Add empty value pairs to every other subject until OFF position is the same for all.

For example, if data is recorded once per minute and one subject has an OFF time that is 5 minutes later than all others, add 5 empty data points to all other subjects to align the data.

This would have to be done for all datapoints for each subject, not just the lights on/off measure.

Would this approach work? And if so, how could I implement this?

**Note that I need to be able to package this as a standalone script to run on multiple computers, so I can't count on perl modules that aren't installed by default.

--edit per request: example. Input data looks like this (it's a CSV file)

ID,     TIME,      DATA1,  DATA2,  DATA3, [...] ,  LIGHT  
Subj1,  10:00:00,  data1,  data2,  data3, [...] ,  ON  
Subj1,  10:00:30,  data1,  data2,  data3, [...] ,  ON  
Subj1,  10:01:00,  data1,  data2,  data3, [...] ,  OFF  
Subj1,  10:01:00,  data1,  data2,  data3, [...] ,  OFF

For another subject, data might look like this:

ID,     TIME,      DATA1,  DATA2,  DATA3, [...] ,  LIGHT  
Subj2,  09:59:27,  data1,  data2,  data3, [...] ,  ON  
Subj2,  09:59:57,  data1,  data2,  data3, [...] ,  ON  
Subj2,  10:00:27,  data1,  data2,  data3, [...] ,  ON  
Subj2,  10:00:57,  data1,  data2,  data3, [...] ,  OFF  
Subj2,  10:01:27,  data1,  data2,  data3, [...] ,  OFF

Script takes each line from all files and adds them to a hash keyed by ID, with one level for each data column keyed by column number. For these two files hash would look like this:

$VAR1 = {
         'Subj1' => {
                     '1' => [
                             data1
                             data1
                             ...
                             ]
                      '2' => [
                             data2
                             data2
                             ...
                             ]
                     ...
                     '20' => [
                             ON
                             ON
                             ...
                    }
         'Subj1' => {
                     '1' => [
                             data1
                             data1
                             ...
                             ]
                      '2' => [
                             data2
                             data2
                             ...
                             ]
                     ...
                     '20' => [
                             ON
                             ON
                             ...
                    }
        };

Data is output with a foreach loop:

foreach my $k (sort {$a cmp $b} keys %data) { 
    print OUT $k, "\,";
    foreach my $d ( @{ $data{$k}{$i} } ) { print OUT $d, "\,"; }
    print OUT "\n";
    }

Output looks like this:

TIME  
Subj1,  10:00:00,  10:00:30,  10:01:00,  10:01:30,
Subj2,  09:59:27,  09:59:57,  10:00:27,  10:00:57,  10:01:27,
DATA1
Subj1,  data1,  data1,  data1,  data1,  data1,  
Subj2,  data2,  data2,  data2,  data2,  data2,  data2,
[ ... all other data ... ]
LIGHT
Subj1,  ON,  ON,  OFF, OFF,
Subj2,  ON,  ON,  ON,  OFF, OFF,

What I need to do is align all data by the ON/OFF columns in LIGHT, by adding empty values like so:

TIME  
Subj1,          ,  10:00:00,  10:00:30,  10:01:00,  10:01:30,
Subj2,  09:59:27,  09:59:57,  10:00:27,  10:00:57,  10:01:27,
DATA1
Subj1,       ,  data1,  data1,  data1,  data1,  data1,  
Subj2,  data2,  data2,  data2,  data2,  data2,  data2,
[ ... all other data ... ]
LIGHT
Subj1,    ,  ON,  ON,  OFF, OFF,
Subj2,  ON,  ON,  ON,  OFF, OFF,

Trying to figure out how best to do this. Sorry this is long...

How does this suit you?

It does use List::Util::max, but that has been standard for a while, and is easy to write by yourself if you don't have it.

use List::Util qw(max);
use strict;
use warnings;

my $ALLDATA = {
         'Subj1' => {
                     '1' => [
                             'data1',
                             'data1',
                             ],
                      '2' => [
                             'data2',
                             'data2',
                             ],
                     '20' => [
                             'ON',
                             'ON',
                             'OFF',
                             ]
                    },
         'Subj2' => {
                     '1' => [
                             'data1',
                             'data1',
                             ],
                      '2' => [
                             'data2',
                             'data2',
                             ],
                     '20' => [
                             'ON',
                             'ON',
                             'ON',
                             'OFF',
                             'OFF',
                             ]
                    },
        };

sub num_ons_before_first_off
{
    my $n = 0;
    foreach(@_)
    {
        last if $_ eq 'OFF';
        $n++;
    }
    return $n;
}

# store a 'numons' piece of data for each subject, for use later
foreach my $subject(values(%$ALLDATA))
{
    $subject->{'numons'} = num_ons_before_first_off(@{$subject->{'20'}}); 
}

# calculate the max 'numons' across all subjects
my $max_ons = max(map { $_->{'numons'} } values(%$ALLDATA));

foreach my $k(keys(%$ALLDATA))
{
    my $subject = $ALLDATA->{$k};

    #output leading blank entries
    print ',' x ($max_ons - $subject->{'numons'});

    #output the real data
    foreach my $data(@{$subject->{'20'}})
    {
        print "$data,";
    }
    print "\n";
}

Hopefully it's obvious how you could extend this to the rest of your data output.

Do you need the spacing as you have it in your question? You say it's CSV, so I am guessing not. I can update if that is important though.

This isn't an answer, but won't fit in a comment:

When you say you want to 'align all data by the ON/OFF columns in LIGHT', do you just mean right-align everything?

For example, if you had this data:

Subj1,ON,ON,OFF,
Subj2,ON,ON,ON,OFF,OFF,

Would the ouput be this?

Subj1,   ,   , ON,  ON, OFF,
Subj2, ON, ON, ON, OFF, OFF,

Or do you want it to be this:

Subj1,   , ON, ON, OFF,    ,       <-- Note trailing blank entry
Subj2, ON, ON, ON, OFF, OFF,

That is, aligning the first 'OFF', as in your textual description?

Got it! "max" was the key there. jwd, not sure why but I couldn't adapt your script to work with the data (kept getting a "can't use string as hash ref while strict" error). But it gave me the inspiration I needed.

On processing the files I implemented a counter that increments until the first OFF is reached. This is passed to a hash along with the subject ID as a key-value pair ($offset{$id} = $count).

When done processing files, I pull the max value out of the hash, then redefine the values using original key values subtracted from the max value.

When outputting, I used your code as a base to add the extra commas to pad the data. Final relevant sections look like this:

my $max_off = max values %offset;

foreach my $k ( keys %offset ) {
    $offset{$k} = $max_off - $offset{$k};
    }

foreach my $k (sort {$a cmp $b} keys %data) { 
    print OUT $k, "\,";
    print OUT ',' x ($offset{$k});
    foreach my $d ( @{ $data{$k}{$i} } ) { print OUT $d, "\,"; }
        print OUT "\n";
        }
    }

Does exactly what I needed. Thanks for the suggestions!