Pre-populate associative array keys in awk?

I've written a munin plugin that uses slurm's sacct to monitor job states on an HPC cluster. I've written it in sh + awk (rather than my usual tool of choice, perl).

The script works, but it took me ages to figure out how to pre-populate the associative array of possible states (some or most may not be present in the sacct output, and I want them to default to zero). Google wasn't much help, and the best I could come up with was to use split on a string to produce a temporary array, which I then iterated over.

I came up with this:

BEGIN {
    num = split("cancelled completed completing failed nodefail pending running suspended timeout",statenames," ");
    for (i=1;i<=num;i++) {
        states[statenames[i]] = 0
    }
  }

This works, but seems clumsy compared to how I'd do it in perl, like this:

foreach (qw(cancelled completed completing failed nodefail pending running suspended timeout)) {
    $states{$_} = 0;
}

or this

%states = map {$_ => 0} qw(cancelled completed completing failed nodefail pending running suspended timeout);

My question is: is there a way of doing this in awk that is similar to either of the perl versions?

[ edited ]

To clarify, here's a sample of the sacct output I'm piping into awk. Note that the only states in this output are RUNNING, COMPLETED, and CANCELLED. The others don't exist (because they haven't occurred today), but I want them in my script's output anyway (in a form usable by munin as "statename.value 0").

# sacct -X -P -o 'state' -n
RUNNING
RUNNING
RUNNING
RUNNING
COMPLETED
RUNNING
COMPLETED
RUNNING
COMPLETED
COMPLETED
CANCELLED by 1000
COMPLETED

[ edited again ]

and here's sample output from my munin plugin:

# ./slurm-sacct
suspended.value 0
pending.value 0
nodefail.value 0
failed.value 0
running.value 6
completing.value 0
completed.value 5
timeout.value 0
cancelled.value 1

The script runs and does what I want; I just wanted to know if there was a better way to initialise the associative array.

You probably don't need to do it at all. Variables in awk are dynamic, which means they're automatically initialized when they are first used (either assigned to or accessed), and this applies to array elements as well.

An uninitialized variable evaluates to 0 in a numeric context and to the empty string in a string context. This is not specific to gawk: POSIX specifies the same behaviour for awk in general. So if you're doing something like counting the number of jobs that are in each state, the entire program is as simple as something like

{ states[$1]++ }
END {
     for (state in states) print state, states[state]
}

Each time the expression states[$1]++ is executed, it will check for the existence of states[$1] and initialize it to 0 if it doesn't already exist.
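To see this behaviour in isolation, here is a minimal sketch. The function name demo_uninit and the variable names are made up for illustration. It also shows a side effect worth knowing about: merely referencing an array element creates it, so the `in` operator is the way to test for a key without adding it.

```shell
#!/bin/sh
# demo_uninit is a hypothetical helper name, used only for this illustration.
demo_uninit() {
  awk 'BEGIN {
    # An unset variable compares equal to both 0 and the empty string.
    is_zero  = (x == 0)
    is_empty = (x == "")
    print is_zero, is_empty

    # "in" tests for a key without creating it ...
    before = ("pending" in states)
    print before

    # ... whereas merely referencing the element creates it (with an empty value).
    if (states["pending"] == 0) {
      after = ("pending" in states)
      print after
    }

    # ++ treats the missing element as 0, so the first increment yields 1.
    states["running"]++
    print states["running"]
  }'
}

demo_uninit
```

This should print "1 1", then 0, then 1, then 1, one per line.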

EDIT: From your comment I'm guessing you want to print out a line for each possible state, regardless of whether there are any jobs in that state or not. In that case, you need to include all the possible state names, and there is no shortcut notation for doing so as there is in Perl. As far as I know, what you've already found is about as clean as it gets. (Awk is not really designed with that usage in mind)

I'd suggest the following:

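# state names are compared as-is, so $1 must already be lower case
{ states[$1]++ }
END {
     split("cancelled completed completing failed nodefail pending running suspended timeout",statenames," ");
     # the indices of statenames are 1..9; the names themselves are the values
     for (i in statenames) print statenames[i], states[statenames[i]]+0
}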

Instead of:

print "Timeout states " states["timeout"] ".";

perhaps Craig could use:

print "Timeout states " int(states["timeout"]) ".";

In my case if there is no timeout state in awk input, the first print will give:

Timeout states .

While the second will give:

Timeout states 0.

I think a more natural approach in awk would be to have a separate file of keys. Consider a file keys.txt with one key per line. You could then do something like this:

printf "key1\nkey2\nkey2\nkey5" | 
  awk '
    FILENAME == "keys.txt" {
      counts[$0] = 0
      next
    }

    {
      counts[$0]++
    }

    END {
      for (key in counts) {
        print key, counts[key]
      }
    }' keys.txt -

With keys.txt containing the five keys key1 through key5, this produces:

key1 1
key2 2
key3 0
key4 0
key5 1

Although the keys are shown in order here, that's just incidental and shouldn't be relied upon.
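If a stable order matters, one simple option is to pipe the output through sort. A minimal sketch follows; the function name count_sorted is made up, and the keys are pre-populated inline here rather than read from keys.txt, purely to keep the example self-contained.

```shell
#!/bin/sh
# count_sorted is a hypothetical helper; it reads one key per line on stdin.
count_sorted() {
  awk '
    BEGIN {
      # Pre-populate the known keys so that absent ones still print as 0.
      n = split("key1 key2 key3 key4 key5", keys)
      for (i = 1; i <= n; i++) counts[keys[i]] = 0
    }
    { counts[$0]++ }
    END {
      # The for-in traversal order is unspecified, so sort imposes one below.
      for (key in counts) print key, counts[key]
    }' | sort
}

printf 'key1\nkey2\nkey2\nkey5\n' | count_sorted
```

The output is the same five lines as above, but now the ordering is guaranteed by sort instead of depending on the awk implementation.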

For the specific example, you could also skip the associative array altogether. Instead, you could minimally process the lines with awk and use sort | uniq -c to tabulate the counts. The presence of all keys could be ensured using join against a file of keys.
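A rough sketch of that pipeline, under some assumptions: the function name tally_states is made up, mktemp is assumed to be available for the two temporary files, and the list of states is the one from the question. Both inputs to join are sorted under LC_ALL=C so that they collate identically.

```shell
#!/bin/sh
# tally_states is a hypothetical helper; it reads sacct-style state lines on stdin.
tally_states() {
  keys=$(mktemp) && counts=$(mktemp) || return 1

  # The full list of states, sorted the same way as the counts (join needs that).
  printf '%s\n' cancelled completed completing failed nodefail \
      pending running suspended timeout | LC_ALL=C sort > "$keys"

  # Keep the first word in lower case, count duplicates, emit "state count".
  awk '{ print tolower($1) }' | LC_ALL=C sort | uniq -c |
      awk '{ print $2, $1 }' > "$counts"

  # -a 1 keeps states that have no match; -e 0 fills in their missing count.
  LC_ALL=C join -a 1 -e 0 -o 0,2.2 "$keys" "$counts"

  rm -f "$keys" "$counts"
}

printf '%s\n' RUNNING COMPLETED RUNNING 'CANCELLED by 1000' | tally_states
```

Note that with only -a 1, a state that appears in the input but not in the key list is silently dropped from the output.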

awk is somewhat clumsier (I would say "less terse") than Perl.

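You could write this (similar to @Michael's answer). Note that the <(...) process substitution needs bash, ksh or zsh rather than plain sh:

sacct -X -P -o 'state' -n |
awk '
  # first file (the list of state names): pre-populate the array
  NR == FNR {states[$1]=0; next}
  # second file (stdin): count the state named in the first field
  {states[tolower($1)]++}
  END {for (state in states) print state ".value", states[state]}
' <(printf "%s\n" cancelled completed completing failed nodefail pending running suspended timeout) -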

One tweak to @DavidZaslavsky's answer might be to print the states in the order you specified them on the split() line. That would be:

{ states[tolower($1)]++ }
END {
     n = split("cancelled completed completing failed nodefail pending running suspended timeout",statenames)
     for (i=1; i<=n; i++) {
         state = statenames[i]
         print state, states[state]+0
     }
}

I also converted the input to lower case so it matches your hard-coded values, and got rid of the unnecessary 3rd arg to split() and the subsequent null statement (the trailing semicolon).

In case you want to account for finding state names in your input that weren't in your hard-coded set, you could tweak it to:

{ states[tolower($1)]++ }
END {
     n = split("cancelled completed completing failed nodefail pending running suspended timeout",statenames)
     for (i=1; i<=n; i++) {
         state = statenames[i]
         print state, states[state]+0
         delete states[state]
     }
     for (state in states) {
         printf "WARNING: found new state name %s\n",state | "cat>&2"
         print state, states[state]+0
     }
}