(1) is obvious when viewed from the perspective of a confidence interval. The most common case is the 95% confidence interval. For a binary outcome, where each observation can only be 0 or 1, the usual formula is:
95% Confidence Interval = x_bar +/- 1.96 * sqrt(x_bar * (1-x_bar) / n)
Please note that x_bar (the sample mean) is the MLE of the true probability.
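As a concrete illustration, here is a minimal sketch of this formula in R. The 0/1 values and the names x, n, and se_hat below are made up for the example, not taken from any real data:

x <- c(1, 0, 0, 1, 0, 1, 1, 0, 0, 1)   # hypothetical binary outcomes
n <- length(x)
x_bar <- mean(x)                        # sample mean = MLE of the true probability
se_hat <- sqrt(x_bar * (1 - x_bar) / n)
c(lower = x_bar - 1.96 * se_hat, upper = x_bar + 1.96 * se_hat)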
The problem here is that this interval is known to behave badly when the true (and estimated) mean is close to zero or one: the normal approximation breaks down and the plug-in estimate of the standard error is biased downward, so the interval tends to be too narrow. This means that in such cases, the chance that the 95% C.I. estimated from a sample covers the real (unknown) probability will be lower than 95%.
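To see one concrete failure mode (the numbers here are illustrative, not results from the simulation below): with a true rate of 0.01 and n = 100, the sample quite often contains no events at all. In that case x_bar = 0, the estimated standard error is 0, and the C.I. collapses to the single point 0, which cannot cover the true probability of 0.01.

# Probability of observing all zeros when the true rate is 0.01 and n = 100
dbinom(0, size = 100, prob = 0.01)   # = 0.99^100, roughly 0.37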
Let's do a simple simulation exercise to show this.
# This function simulates the construction of one C.I. and returns 1
# if the C.I. covers the true probability, and 0 if not.
# It uses sample_size and true_rate from the global environment.
my_func <- function(dummy){
  my_sample <- rbinom(sample_size, 1, prob = true_rate)
  my_mle <- mean(my_sample)
  my_sd <- sqrt(my_mle * (1 - my_mle) / sample_size)
  my_ci_upper <- my_mle + 1.96 * my_sd
  my_ci_lower <- my_mle - 1.96 * my_sd
  return(as.numeric(my_ci_lower <= true_rate &
                      true_rate <= my_ci_upper))
}
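Before making the plots, the helper can be sanity-checked at a single combination of sample size and true rate. The setting below (200 observations, true rate 0.05, 10,000 replications) is just one assumed example:

# Estimated actual coverage of the 95% C.I. at one assumed setting
sample_size <- 200
true_rate <- 0.05
mean(sapply(1:10000, my_func))   # should fall below the nominal 0.95 if the argument above holds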
# Make four plots to illustrate the cases for
# sample = 200, 500, 2000, 5000
par(mfrow = c(2, 2))

plot(NA, xlim = c(0, 1), ylim = c(0.85, 0.975),
     xlab = "True Rate",
     ylab = "Real Confidence Level",
     main = "Sample size = 200")
abline(h = 0.95, col = "red")
sample_size <- 200
result <- matrix(NA, 99)
for (j in 1:99){
  true_rate <- 0.01 * j
  result[j] <- mean(unlist(lapply(1:10, function(t)
    mean(unlist(lapply(1:1000, my_func))))))
}
lines(result ~ seq(from = 0.01, to = 0.99, by = 0.01))

plot(NA, xlim = c(0, 1), ylim = c(0.85, 0.975),
     xlab = "True Rate",
     ylab = "Real Confidence Level",
     main = "Sample size = 500")
abline(h = 0.95, col = "red")
sample_size <- 500
result <- matrix(NA, 99)
for (j in 1:99){
  true_rate <- 0.01 * j
  result[j] <- mean(unlist(lapply(1:10, function(t)
    mean(unlist(lapply(1:1000, my_func))))))
}
lines(result ~ seq(from = 0.01, to = 0.99, by = 0.01))

plot(NA, xlim = c(0, 1), ylim = c(0.85, 0.975),
     xlab = "True Rate",
     ylab = "Real Confidence Level",
     main = "Sample size = 2000")
abline(h = 0.95, col = "red")
sample_size <- 2000
result <- matrix(NA, 99)
for (j in 1:99){
  true_rate <- 0.01 * j
  result[j] <- mean(unlist(lapply(1:10, function(t)
    mean(unlist(lapply(1:1000, my_func))))))
}
lines(result ~ seq(from = 0.01, to = 0.99, by = 0.01))

plot(NA, xlim = c(0, 1), ylim = c(0.85, 0.975),
     xlab = "True Rate",
     ylab = "Real Confidence Level",
     main = "Sample size = 5000")
abline(h = 0.95, col = "red")
sample_size <- 5000
result <- matrix(NA, 99)
for (j in 1:99){
  true_rate <- 0.01 * j
  result[j] <- mean(unlist(lapply(1:10, function(t)
    mean(unlist(lapply(1:1000, my_func))))))
}
lines(result ~ seq(from = 0.01, to = 0.99, by = 0.01))
In these charts, the red line represents the 95% confidence level, and the black line represents the actual probability that a 95% C.I. estimated with the formula covers the true probability. Ideally, the black lines would overlap the red lines. But as these charts show, they are significantly off when the sample size is small and the true probability approaches zero or one.
