Friday, January 3, 2014

Estimating a proportion: Wilson score vs. normal approximation

In many fields, it is a common task to estimate a proportion and construct a confidence interval for it from a statistical sample. As Wikipedia notes, the most common way to do this is the normal approximation; many online sample size calculators are in fact based on it. Although simple, this method is known to fail when the proportion is very close to 0 or 1, especially when the sample size is small. Let's run a simple simulation to see how bad the situation can be.
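
For concreteness, here is a small sketch of the two interval formulas for a single sample of x successes out of n trials: the normal approximation (Wald) interval and the Wilson score interval. This is not from the simulation below; prop_ci is just a helper written for illustration.

# Sketch of the two 95% interval formulas for x successes out of n trials.
# prop_ci is a hypothetical helper, written for illustration only.
prop_ci <- function(x, n, conf = 0.95) {
  z    <- qnorm(1 - (1 - conf) / 2)
  phat <- x / n
  # Normal approximation (Wald): phat +/- z * sqrt(phat * (1 - phat) / n)
  wald <- phat + c(-1, 1) * z * sqrt(phat * (1 - phat) / n)
  # Wilson score: the center is pulled toward 1/2 and the bounds stay within [0, 1]
  center <- (phat + z^2 / (2 * n)) / (1 + z^2 / n)
  half   <- (z / (1 + z^2 / n)) * sqrt(phat * (1 - phat) / n + z^2 / (4 * n^2))
  wilson <- center + c(-1, 1) * half
  list(wald = wald, wilson = wilson)
}
prop_ci(1, 10)  # with 1 success out of 10, the Wald lower bound is already negative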

library(Hmisc)
par(mfrow = c(2,2))

# Small-sample case: n = 10 observations per sample
sample_size <- 10
result <- matrix(NA, ncol = 99, nrow = 4)
row.names(result) <- c("WilsonLower","NormalLower","WilsonUpper","NormalUpper")
for (i in 1:ncol(result)) {
  true_rate <- 0.01 * i
  # Average the 95% C.I. bounds over 100 simulated samples at this true p;
  # positions 5, 6, 8, 9 of binconf(..., method = "all") are the Wilson and
  # asymptotic (normal) lower/upper bounds (see the note below).
  result[, i] <- apply(sapply(1:100, function(t) {
    my_sample <- rbinom(sample_size, 1, prob = true_rate)
    binconf(sum(my_sample), length(my_sample), method = "all")[c(5, 6, 8, 9)]
  }), 1, mean)
}

plot(result[1,] ~ seq(from = 0.01, to = 0.99, by = 0.01),
     type = "l", col = "red", ylim = c(-0.05, 1.05), main = "Wilson, n = 10",
     xlab = "True p", ylab = "Upper and Lower 95% C.I. bounds")
lines(result[3,] ~ seq(from = 0.01, to = 0.99, by = 0.01), col = "blue")
abline(h=1); abline(h=0)
plot(result[2,] ~ seq(from = 0.01, to = 0.99, by = 0.01),
     type = "l", col = "red", ylim = c(-0.05, 1.05), main = "Normal, n = 10",
     xlab = "True p", ylab = "Upper and Lower 95% C.I. bounds")
lines(result[4,] ~ seq(from = 0.01, to = 0.99, by = 0.01), col = "blue")
abline(h=1); abline(h=0)
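
A quick note on the indexing in the loop above: in the Hmisc version I am using, binconf() with method = "all" returns a 3 x 3 matrix with rows Exact, Wilson, Asymptotic and columns PointEst, Lower, Upper, so in column-major order positions 5, 6, 8 and 9 are exactly the Wilson and asymptotic (normal) lower and upper bounds.

ci <- binconf(3, 10, method = "all")
ci                 # 3 x 3 matrix: Exact / Wilson / Asymptotic rows
ci[c(5, 6, 8, 9)]  # Wilson lower, normal lower, Wilson upper, normal upper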

# Large-sample case: n = 1000 observations per sample
sample_size <- 1000
result <- matrix(NA, ncol = 99, nrow = 4)
row.names(result) <- c("WilsonLower","NormalLower","WilsonUpper","NormalUpper")
for (i in 1:ncol(result)) {
  true_rate <- 0.01 * i
  result[, i] <- apply(sapply(1:100, function(t) {
    my_sample <- rbinom(sample_size, 1, prob = true_rate)
    binconf(sum(my_sample), length(my_sample), method = "all")[c(5,6,8,9)]
  }), 1, mean)
}

plot(result[1,] ~ seq(from = 0.01, to = 0.99, by = 0.01),
     type = "l", col = "red", ylim = c(-0.05, 1.05), main = "Wilson, n = 1000",
     xlab = "True p", ylab = "Upper and Lower 95% C.I. bounds")
lines(result[3,] ~ seq(from = 0.01, to = 0.99, by = 0.01), col = "blue")
abline(h=1); abline(h=0)
plot(result[2,] ~ seq(from = 0.01, to = 0.99, by = 0.01),
     type = "l", col = "red", ylim = c(-0.05, 1.05), main = "Normal, n = 1000",
     xlab = "True p", ylab = "Upper and Lower 95% C.I. bounds")
lines(result[4,] ~ seq(from = 0.01, to = 0.99, by = 0.01), col = "blue")
abline(h=1); abline(h=0)

It is very clear that the normal approximation yields C.I.s that are far too narrow when the true p is close to 0 or 1, especially when the sample size is small. Not only are the intervals too narrow, their bounds can also fall below 0 or above 1, which is impossible for a proportion. In comparison, the Wilson score C.I.s are much more reasonable, albeit asymmetric. When the sample size is large, the difference between the Wilson score C.I. and the normal approximation C.I. is almost negligible.
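
As a rough cross-check of the claim that the normal-approximation intervals are too narrow (this check is an addition, not part of the simulation above), one can also estimate the empirical coverage of the two 95% intervals, i.e. how often they actually contain the true p:

# Coverage check sketch: fraction of simulated samples whose 95% C.I.
# contains the true p, using the same binconf indexing as above.
set.seed(1)
coverage <- function(true_p, n, reps = 10000) {
  hits <- replicate(reps, {
    ci <- binconf(rbinom(1, n, true_p), n, method = "all")[c(5, 6, 8, 9)]
    c(wilson = ci[1] <= true_p && true_p <= ci[3],
      normal = ci[2] <= true_p && true_p <= ci[4])
  })
  rowMeans(hits)
}
coverage(0.05, 10)    # Wilson stays much closer to the nominal 95% than the normal interval
coverage(0.05, 1000)  # with a large sample, both are close to 95%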