
Unable to filter a data frame?

开发者 https://www.devze.com 2023-04-10 15:27 Source: web

I am using something like this to filter my data frame:

d1 = data.frame(data[data$ColA == "ColACat1" & data$ColB == "ColBCat2", ])

When I print d1, it works as expected. However, when I type d1$ColB, it still prints everything from the original data frame.

> print(d1)
ColA     ColB
-----------------
ColACat1 ColBCat2
ColACat1 ColBCat2

> print(d1$ColA)
Levels: ColACat1 ColACat2

Maybe this is expected, but when I pass d1 to ggplot, it messes up my graph and does not use the filter. Is there any way I can filter the data frame and get only the records that match the filter? I want d1 to not know about the existence of data.


As you allude to, the default behavior in R (in versions before 4.0.0) is to treat character columns in data frames as a special data type, called a factor. This is a feature, not a bug, but like any useful feature, if you're not expecting it and don't know how to use it properly, it can be quite confusing.

Factors are meant to represent categorical (rather than numerical, or quantitative) variables, which come up often in statistics.

The subsetting operations you used do in fact work normally. Namely, they will return the correct subset of your data frame. However, the levels attribute of that variable remains unchanged, and still has all the original levels in it.

This means that any method written in R that is designed to take advantage of factors will treat that column as a categorical variable with a bunch of levels, many of which just aren't present. In statistics, one often wants to track the presence of 'missing' levels of categorical variables.
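To make the point about leftover levels concrete, here is a minimal sketch. The three-row data frame is made up to mirror the column names in the question, and stringsAsFactors = TRUE is passed explicitly so the columns are factors regardless of the R version in use.

```r
# Hypothetical data mirroring the question's column names.
# stringsAsFactors = TRUE forces factor columns on any R version.
data <- data.frame(
  ColA = c("ColACat1", "ColACat1", "ColACat2"),
  ColB = c("ColBCat2", "ColBCat2", "ColBCat1"),
  stringsAsFactors = TRUE
)

# Same filter as in the question
d1 <- data[data$ColA == "ColACat1" & data$ColB == "ColBCat2", ]

nrow(d1)         # only the two matching rows remain
levels(d1$ColA)  # but both "ColACat1" and "ColACat2" are still listed
table(d1$ColA)   # the unused level shows up with a count of 0
```

The rows are filtered correctly; only the levels attribute still remembers the categories that were removed, which is what plotting code that relies on factor levels ends up seeing.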

I actually also prefer to work with stringsAsFactors = FALSE, but many people frown on that since it can reduce code portability. (TRUE was the default before R 4.0.0, so sharing your code with someone on an older version may be risky unless you preface every single script with a call to options.)

A potentially more convenient solution, particularly for data frames, is to combine the subset and droplevels functions:

subsetDrop <- function(...){
    droplevels(subset(...))
}

and use this function to extract subsets of your data frames in a way that is assured to remove any unused levels in the result.
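As a usage sketch, assuming the same made-up data frame and column names as in the question, the wrapper can be called just like subset. The data and the filter values here are illustrative only.

```r
# Wrapper from the answer: subset, then drop unused factor levels
subsetDrop <- function(...) {
  droplevels(subset(...))
}

# Hypothetical data; stringsAsFactors = TRUE forces factor columns
data <- data.frame(
  ColA = c("ColACat1", "ColACat1", "ColACat2"),
  ColB = c("ColBCat2", "ColBCat2", "ColBCat1"),
  stringsAsFactors = TRUE
)

d1 <- subsetDrop(data, ColA == "ColACat1" & ColB == "ColBCat2")

levels(d1$ColA)  # only "ColACat1" remains
levels(d1$ColB)  # only "ColBCat2" remains
```

If you have already built d1 with bracket indexing, calling droplevels(d1) on the result should have the same effect.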


This was such a pain! ggplot messes up if you don't do this right. Using this option at the beginning of my script solved it:

options(stringsAsFactors = FALSE)

Looks like it is the intended behavior but unfortunately I had turned this feature on for some other purpose and it started causing trouble for all my other scripts.
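If changing a global option feels too broad, one alternative is to pass stringsAsFactors = FALSE to the call that creates the data frame, so that only that object is affected. A minimal sketch with made-up data:

```r
# Hypothetical data; the per-call argument keeps the columns as plain
# character vectors without touching any global option.
data <- data.frame(
  ColA = c("ColACat1", "ColACat1", "ColACat2"),
  ColB = c("ColBCat2", "ColBCat2", "ColBCat1"),
  stringsAsFactors = FALSE
)

d1 <- data[data$ColA == "ColACat1" & data$ColB == "ColBCat2", ]

is.character(d1$ColA)  # TRUE: no factor, so no levels to carry over
unique(d1$ColA)        # only the values actually present in d1
```

The same argument is accepted by read.csv and read.table, so data read from a file can be handled the same way.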
