Avoiding a for loop in R in an attempt to evaluate percentage of true positives/negatives when using logistic regression_问答_开发者

Avoiding a for loop in R in an attempt to evaluate percentage of true positives/negatives when using logistic regression

开发者 https://www.devze.com 2023-03-22 06:36 出处：网络

What I got: A matrix where I got the predicted probability of an outcome (from a logistic regression model) and the known outcome. For those curious I actually got two regression models and an indepen

相关专题：logic

> head(matrixComb)
      probComb outComb
[1,] 0.9999902       1
[2,] 0.9921736       0
[3,] 0.9901175       1
[4,] 0.9815581       0
[5,] 0.7692992       0
[6,] 0.7369990       0

What I want: A graph where I can plot how often my prediction model yields correct outcomes (one line for positives and one line for negatives) as a function of the cut off value for the probability. My problem is that I am unable to figure out how to do this without switching to Perl and use to For-loop to iterate through the matrix.

In Perl I would just start at probability 0.1 and in reach run of the for-loop increase the value by 0.1. In the first iteration I would count all probabilities <0.1 and outcome = 0 as true negatives, probability < 0.1 and outcome 1 as false negatives probability > 0.1 and outcome = 0 as false positives and probability > 0.1 and outcome = 1 as true positives.

The process would then be repeated and the results of each iteration would be printed as [probability, true positives/total positives, true negatives/total negatives]. Thus make it easy for me to print it out in open office开发者_开发问答 calc.

The reason that I am asking this is that the operation is too complex for me to find a similar case here on stackoverflow or in a tutorial. But I would really like to learn a way to do this in an efficient manner in the R environment.

You can get R to draw the curves which are based on ROC analysis. This is a crude version using the ROCR package and could easily be made prettier

ss <- 1000   # sample size
mydf <- data.frame(probComb = runif(ss)) # predictions illustration
mydf$outComb <- 0 + (runif(ss) < mydf$probComb) # actuals illustration

library(ROCR)
pred <- prediction(mydf$probComb, mydf$outComb)
perfp <- performance(pred, "tpr")
perfn <- performance(pred, "tnr")
plot(perfp, col="green", ylab="True positive (green) and true negative (red) rates")
plot(perfn, col="red", ylab="True negative rate", add=TRUE)

to produce

Avoiding a for loop in R in an attempt to evaluate percentage of true positives/negatives when using logistic regression

If you must, you can find the data in perfp and perfn.

Here's a way to do this manually:

#Create some sample data
dat <- data.frame(x=runif(100),y=sample(0:1,100,replace=TRUE))

#Function to compute tp and tn
myFun <- function(x){
    tbl <- table(dat$x > x,dat$y)
    marg <- margin.table(tbl,2)
    tn <- tbl[1,1]/marg[1]
    tp <- tbl[2,2]/marg[2]
    rs <- c(tp,tn)
    names(rs) <- c('truePos','trueNeg')
    return(rs)
}


#Decision thresholds
thresh <- seq(0.1,0.9, by = 0.1)
#Loop using lapply
temp <- as.data.frame(do.call(rbind,lapply(thresh,myFun)))
temp$thresh <- thresh

#Melt and plot using ggplot
tempMelt <- melt(temp,id.vars="thresh")
ggplot(tempMelt,aes(x=thresh,y=value)) + 
    geom_line(aes(group=variable,colour=variable))

Avoiding a for loop in R in an attempt to evaluate percentage of true positives/negatives when using logistic regression

Alternatively, as mentioned above in the comments, there are a plethora or ROC functions in R which can be found using ??ROC. For example, using roc from the caret package:

temp <- as.data.frame(roc(dat$x,factor(dat$y)))
tempMelt <- melt(temp,id.vars="cutoff")
ggplot(tempMelt,aes(x=cutoff,y=value)) + 
    geom_line(aes(group=variable,colour=variable))

Avoiding a for loop in R in an attempt to evaluate percentage of true positives/negatives when using logistic regression

Maybe something like this:

# A function for counting outcomes for a certain probability
f <- function(d, p) {
  lp <- d$prob < p
  c(TNeg=sum(lp & d$out==0), TPos=sum(!lp & d$out==1))
}

# Make it accept a vector of probabilities
vf <- Vectorize(f, 'p')

# Sample data
n <- 100
d <- data.frame(prob=runif(n), out=round(runif(n)))
# Probabilities to plot
p <- seq(0,1, len=20)

res <- vf(d, p)
colnames(res) <- paste('p(', p, ')', sep='')
matplot(p, t(res), type='l', xlab='prob', ylab='count')