开发者

Assigning a specific number of values informed by a probability distribution (in R)

开发者 https://www.devze.com 2023-03-24 22:34 出处:网络
Hello and thanks in advance for the help! I am trying to generate a vector with a specific number of values that are assigned according to a probability distribution.For example, I want a vector of l

Hello and thanks in advance for the help!

I am trying to generate a vector with a specific number of values that are assigned according to a probability distribution. For example, I want a vector of length 31, contained 26 zeroes and 5 ones. (The total sum of the vector should always be five.) However, the location of the ones is important. And to identify which values should be one and which should be zero, I have a vector of probabilities (length 31), which looks like this:

probs<-c(0.01,0.02,0.01,0.02,0.01,0.01,0.01,0.04,0.01,0.01,0.12,0.01,0.02,0.01,
0.14,0.06,0.01,0.01,0.01,0.01,0.01,0.14,0.01,0.07,0.01,0.01,0.04,0.08,0.01,0.02,0.01)

I can select values according to this distribution and get a vector of length 31 using rbinom, but I can't select exactly five values.

Inv=rbinom(length(probs),1,probs)
Inv
[1] 0 0 0 0 0开发者_高级运维 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0

Any ideas?

Thanks again!


How about just using a weighted sample.int to select the locations?

Inv<-integer(31)
Inv[sample.int(31,5,prob=probs)]<-1
Inv
[1] 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0


Chase provides a great answer and mentions the problem of the run-away while() iteration. One of the problems with a run-away while() is that if you do this one trial at a time, and it takes many, say t, trials to find one that matches the target number of 1s, you incur the overhead of t calls to the main function, rbinom() in this case.

There is a way out, however, because rbinom(), like all of these (pseudo)random number generators in R, is vectorised, we can generate m trials at a time and check those m trials for conformance to the requirements of 5 1s. If none are found, we repeatedly draw m trials until we find one that does conform. This idea is implemented in the function foo() below. The chunkSize argument is m, the number of trials to draw at a time. I also took the opportunity to allow the function to find more than a single conformal trial; argument n controls how many conformal trials to return.

foo <- function(probs, target, n = 1, chunkSize = 100) {
    len <- length(probs)
    out <- matrix(ncol = len, nrow = 0) ## return object
    ## draw chunkSize trials
    trial <- matrix(rbinom(len * chunkSize, 1, probs),
                    ncol = len, byrow = TRUE)
    rs <- rowSums(trial)  ## How manys `1`s
    ok <- which(rs == 5L) ## which meet the `target`
    found <- length(ok)   ## how many meet the target
    if(found > 0)         ## if we found some, add them to out
        out <- rbind(out,
                     trial[ok, , drop = FALSE][seq_len(min(n,found)), , 
                                               drop = FALSE])
    ## if we haven't found enough, repeat the whole thing until we do
    while(found < n) {
        trial <- matrix(rbinom(len * chunkSize, 1, probs),
                            ncol = len, byrow = TRUE)
        rs <- rowSums(trial)
        ok <- which(rs == 5L)
        New <- length(ok)
        if(New > 0) {
            found <- found + New
            out <- rbind(out, trial[ok, , drop = FALSE][seq_len(min(n, New)), , 
                                                        drop = FALSE])
        }
    }
    if(n == 1L)           ## comment this, and
        out <- drop(out)  ## this if you don't want dimension dropping
    out
}

It works like this:

> set.seed(1)
> foo(probs, target = 5)
 [1] 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0
[31] 0
> foo(probs, target = 5, n = 2)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]    0    0    0    0    0    0    0    0    0     0     0
[2,]    0    0    0    0    0    0    0    0    0     0     1
     [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
[1,]     0     0     0     1     1     0     0     0     0     0
[2,]     0     1     0     0     1     0     0     0     0     0
     [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31]
[1,]     1     0     1     0     0     0     1     0     0     0
[2,]     1     0     1     0     0     0     0     0     0     0

Note that I drop the empty dimension in the case where n == 1. Comment the last if code chunk out if you don't want this feature.

You need to balance the size of chunkSize with the computational burden of checking that many trials at a time. If the requirement (here 5 1s) is very unlikely, then increase chunkSize so you incur fewer calls to rbinom(). If the requirement is likely, there is little point drawing trials and large chunkSize at a time if you only want one or two as you have to evaluate each trial draw.


I think you want to resample from the binomial distribution with a given set of probabilities until you hit your target value of 5, is that right? If so, then I think this does what you want. A while loop can be used to iterate until the condition is met. If you feed very unrealistic probabilites and target values, I guess it could turn into a run-away function, so consider yourself warned :)

FOO <- function(probs, target) {
  out <- rbinom(length(probs), 1, probs)

  while (sum(out) != target) {

    out <- rbinom(length(probs), 1, probs)
  }
  return(out)
}

FOO(probs, target = 5)

> FOO(probs, target = 5)  
 [1] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号