Subsetting in R, joining and calculating multiple repetitions


Developer https://www.devze.com 2023-03-18 13:38 Source: web


Here is a sample:

> tmp
    label   value1  value2
1   aa_x_x  xx      xx
2   bc_x_x  xx      xx
3   aa_x_x  xx      xx
4   bc_x_x  xx      xx

How can I calculate the median over all rows whose labels repeat (or, more generally, some summary of the corresponding values in the other data frame columns), treating labels as identical when their first two letters match (i.e. "aa_1_1" and "aa_s_3" belong to the same group)? The list of labels is finite and available.

I have read about aggregate, %in%, subset and substr, but I have not been able to put together anything useful and simple.

Here is what I hope to get:

> tmp.result
    label   median1 some.calculation2
1   aa      xx      xx
2   bc      xx      xx
3   aa      xx      xx
4   bc      xx      xx

Thank you very much.

Have you tried making a new data frame (I'll call it tmp2) that is a copy of tmp with tmp2$label <- substr(tmp$label, 1, 2)? From there, you can, for example, use tapply(tmp2$value1, tmp2$label, mean) to get the mean of value1 for each two-letter label. Pass median instead of mean to get the medians.
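
As a rough illustration of the substr-based approach, here is a minimal base R sketch. The numeric values are made up (the question only shows "xx" placeholders), and the names prefix, group_median, per_group and mean2 are hypothetical; tmp.result follows the name used in the question.

```r
# Hypothetical sample data: the numbers are invented for illustration only
tmp <- data.frame(
  label  = c("aa_1_1", "bc_2_1", "aa_s_3", "bc_x_9"),
  value1 = c(1, 10, 3, 30),
  value2 = c(2, 4, 6, 8),
  stringsAsFactors = FALSE
)

# Group key: the first two characters of each label
prefix <- substr(tmp$label, 1, 2)

# One median per group, returned as a named array
group_median <- tapply(tmp$value1, prefix, median)

# One row per original row, each carrying the statistic of its group
# (ave() returns a vector as long as its input)
tmp.result <- data.frame(
  label   = prefix,
  median1 = ave(tmp$value1, prefix, FUN = median),
  mean2   = ave(tmp$value2, prefix, FUN = mean)
)

# One row per group instead, with the median of every value column
per_group <- aggregate(tmp[c("value1", "value2")],
                       by = list(label = prefix), FUN = median)
```

ave() is used for tmp.result because the desired output in the question has one row per original row; aggregate() or tapply() is the better fit if a single row per label is enough.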

An option using dplyr

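library(dplyr)
tmp %>%
   # group rows by the part of the label before the first underscore
   group_by(label = sub('_.*$', '', label)) %>%
   # transmute() keeps one row per input row, plus the grouping column
   transmute(median1 = median(value1), mean1 = mean(value2))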

Or data.table

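 library(data.table)
 # setDT() converts tmp to a data.table by reference; := adds the per-group
 # statistics, grouping on the part of the label before the first underscore
 setDT(tmp)[, c('median1', 'mean1') := list(median(value1), mean(value2)),
    by = .(prefix = sub('_.*$', '', label))][
    # := leaves the label column unchanged, so shorten it when selecting
    , .(label = sub('_.*$', '', label), median1, mean1)]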