How can I identify and summarize sets of data from matching groups in a dataframe?_问答_开发者

How can I identify and summarize sets of data from matching groups in a dataframe?

开发者 https://www.devze.com 2023-03-31 19:05 出处：网络

Here is an example dataframe: set.seed(0) x1 <- c(1, 1, 1, 1, 1, 2, 2, 2, 2) x2 <- c(1, 1, 0, 0, 0, 1, 1, 1, 1)

相关专题：dataframe

Here is an example dataframe:

set.seed(0)
x1 <- c(1, 1, 1, 1, 1, 2, 2, 2, 2)
x2 <- c(1, 1, 0, 0, 0, 1, 1, 1, 1)
x3 <- c(1, 1, 2, 2, 4, 1, 1, 2, 1)
n  <- c(1, 1, 1, 5, 5, 1, 1, 1, 1)
y <- rnorm(9)

mydf <- data.frame(x1, x2, x3, n, y)

What I would like to do is

identify rows with n=1 and which share identical values of (x1, x2, x3)
return a single row for each subset with y = mean(y) and n = length(y)
keep other 开发者_开发问答rows the same.

for example, the new dataframe would be

x1 <- c(1,            1,    1,    1,    2,                 2)
x2 <- c(1,            0,    0,    0,    1,                 1)
x3 <- c(1,            2,    2,    4,    1,                 2)
n  <- c(2,            1,    5,    5,    3,                 1)
y  <- c(mean(y[1:2]), y[3], y[4], y[5], mean(y[c(6:7,9)]), y[8])

newdf <- data.frame(x1, x2, x3, n, y)

I can figure this out with conditionals and loops, but I would prefer to learn more elegant way to do this.

By "identical values in other columns", I take it you mean that each subset is defined by the same value of x1 in each of the rows of the subset, not that x1 is equal to x2. Thanks for the example to see what you meant.

library("plyr")

To get parts one and two

ddply(mydf[mydf$n==1,], .(x1, x2, x3), summarise, n = length(y), y = mean(y))

This can be rbind-ed with the part of mydf where n!=1 to get what you said

rbind(
  ddply(mydf[mydf$n==1,], .(x1, x2, x3), summarise, n = length(y), y = mean(y)),
  mydf[mydf$n!=1,]
)

This doesn't have the same order as you listed. If that is really important, you can add some auxiliary sorting variables.

mydf$order = seq(length=nrow(mydf))
newdf <- rbind(
  ddply(mydf[mydf$n==1,], .(x1, x2, x3), summarise, 
    n = length(y), y = mean(y), order=min(order)),
  mydf[mydf$n!=1,]
)
newdf <- newdf[order(newdf$order),]
newdf$order <- NULL