Friday, January 3, 2014

Estimating proportion, Wilson score vs normal approximation

In many fields, it is a common task to estimate a proportion from a statistical sample and construct a confidence interval around it. As stated on Wikipedia, the most common way to do this is the normal approximation, and many online sample size calculators are based on it. Despite its simplicity, this method is known to fail when the proportion is very close to 0 or 1, especially when the sample size is small. Let's first do a quick back-of-the-envelope check, and then run a simple simulation to see how bad the situation can be.
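For example, here is the normal-approximation interval computed by hand for 1 success out of 10 trials (a minimal illustration using the standard formula); the lower bound already falls below zero, which is not a valid probability.

p_hat <- 1 / 10
se    <- sqrt(p_hat * (1 - p_hat) / 10)
c(lower = p_hat - 1.96 * se, upper = p_hat + 1.96 * se)
# the lower bound is about -0.086, i.e. outside [0, 1]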

library(Hmisc)
par(mfrow = c(2,2))

sample_size = 10
result <- matrix(NA, ncol = 99, nrow = 4)
row.names(result) <- c("WilsonLower","NormalLower","WilsonUpper","NormalUpper")
for (i in 1:dim(result)[2]){
  true_rate = 0.01 * i
  result[, i] <- apply(sapply(1:100, function(t) {
    my_sample <- rbinom(sample_size, 1, prob = true_rate)
    # method = "all" returns a 3 x 3 matrix (Exact, Wilson, Asymptotic rows;
    # PointEst, Lower, Upper columns); [c(5,6,8,9)] picks the Wilson and
    # asymptotic (normal) lower and upper bounds
    binconf(sum(my_sample), length(my_sample), method = "all")[c(5,6,8,9)]
  }), 1, mean)
}

plot(result[1,] ~ seq(from = 0.01, to = 0.99, by = 0.01),
     type = "l", col = "red", ylim = c(-0.05, 1.05), main = "Wilson, n = 10",
     xlab = "True p", ylab = "Upper and Lower 95% C.I. bounds")
lines(result[3,] ~ seq(from = 0.01, to = 0.99, by = 0.01), col = "blue")
abline(h=1); abline(h=0)
plot(result[2,] ~ seq(from = 0.01, to = 0.99, by = 0.01),
     type = "l", col = "red", ylim = c(-0.05, 1.05), main = "Normal, n = 10",
     xlab = "True p", ylab = "Upper and Lower 95% C.I. bounds")
lines(result[4,] ~ seq(from = 0.01, to = 0.99, by = 0.01), col = "blue")
abline(h=1); abline(h=0)

sample_size = 1000
result <- matrix(NA, ncol = 99, nrow = 4)
row.names(result) <- c("WilsonLower","NormalLower","WilsonUpper","NormalUpper")
for (i in 1:dim(result)[2]){
  true_rate = 0.01 * i
  result[, i] <- apply(sapply(1:100, function(t) {
    my_sample <- rbinom(sample_size, 1, prob = true_rate)
    binconf(sum(my_sample), length(my_sample), method = "all")[c(5,6,8,9)]
  }), 1, mean)
}

plot(result[1,] ~ seq(from = 0.01, to = 0.99, by = 0.01),
     type = "l", col = "red", ylim = c(-0.05, 1.05), main = "Wilson, n = 1000",
     xlab = "True p", ylab = "Upper and Lower 95% C.I. bounds")
lines(result[3,] ~ seq(from = 0.01, to = 0.99, by = 0.01), col = "blue")
abline(h=1); abline(h=0)
plot(result[2,] ~ seq(from = 0.01, to = 0.99, by = 0.01),
     type = "l", col = "red", ylim = c(-0.05, 1.05), main = "Normal, n = 1000",
     xlab = "True p", ylab = "Upper and Lower 95% C.I. bounds")
lines(result[4,] ~ seq(from = 0.01, to = 0.99, by = 0.01), col = "blue")
abline(h=1); abline(h=0)

It is very clear that the normal approximation yields C.I.s that are way too narrow when the true p is close to zero or one, especially when the sample size is small. Not only are the intervals too narrow, their bounds can also fall below 0 or above 1. In comparison, the Wilson score C.I.s are much more reasonable, although they are asymmetric. When the sample size is large, the difference between the Wilson score C.I. and the normal approximation C.I. is almost negligible.
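For a concrete look at the boundary behavior (a small sketch, assuming the Hmisc package used above is loaded), compare the two methods directly on an extreme sample:

library(Hmisc)
# 1 success out of 10 trials
binconf(1, 10, method = "wilson")      # bounds stay inside [0, 1]
binconf(1, 10, method = "asymptotic")  # the lower bound drops below 0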

Wednesday, December 18, 2013

Hadoop Hive vs ANSI SQL

There are probably a lot of semantic differences between Hive and standard (ANSI) SQL. I ran into one today. Here it is.

Assume I have two tables, t1 and t2.

SELECT * FROM t1;
+--------------+---------------+
| account_name | account_value |
+--------------+---------------+
| a            |            12 |
| b            |            18 |
| c            |            22 |
| d            |            28 |
+--------------+---------------+

SELECT * FROM t2;
+--------------+---------------+
| account_name | account_entry |
+--------------+---------------+
| b            |           233 |
| b            |           255 |
| d            |           987 |
| d            |           992 |
+--------------+---------------+

SELECT
    t1.account_name, t1.account_value, t2.account_entry
FROM
    t1
    LEFT OUTER JOIN t2
        ON t1.account_name = t2.account_name
        AND t1.account_value = 18;

I expect Hive to return a result like
+--------------+---------------+---------------+
| account_name | account_value | account_entry |
+--------------+---------------+---------------+
| a            | 12            |          NULL |
| b            | 18            |           233 |
| b            | 18            |           255 |
| c            | 22            |          NULL |
| d            | 28            |          NULL |
+--------------+---------------+---------------+

However, Hive gives me the following result.
+--------------+---------------+---------------+
| account_name | account_value | account_entry |
+--------------+---------------+---------------+
| b            | 18            |           233 |
| b            | 18            |           255 |
+--------------+---------------+---------------+
Initially, this didn't completely make sense to me. I am doing a left join, and I don't expect any row in my left table to be dropped.

After a series of discussions with other analysts in the company, I now understand that Hive treats a one-sided condition in the join's ON clause as a pre-filter. Therefore, only one row from the first table was pushed into the join execution. This is quite different from ANSI SQL, where the ON condition only specifies how the join should be matched. For an INNER JOIN this doesn't really matter, but pay extra attention when you do a LEFT OUTER JOIN and put a one-sided condition on the left table, or a RIGHT OUTER JOIN with a one-sided condition on the right table.

Saturday, November 30, 2013

Confidence interval for event probability

It is very common for people to ask, "What is the minimum sample size I need to estimate the probability of a particular event?" The answer is: it depends on (1) how confident you want to be and (2) the (estimated) true rate of the event.

(1) is obvious when thinking from the perspective of a confidence interval; the most common choice is the 95% confidence interval. For a binary outcome, where the event can only be 0 or 1, the common formula is:
95% Confidence Interval = x_bar +/- 1.96 * sqrt(x_bar * (1-x_bar) / n)
Note that x_bar (the sample mean) is the MLE of the true probability.

The problem here is that the normal approximation behind this interval breaks down when the true (and estimated) mean is close to zero or one, so the 95% C.I. becomes unreliable. This means that in such cases, the chance that the 95% C.I. estimated from the sample covers the real (unknown) probability will be lower than 95%.
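Before simulating, the coverage can also be checked exactly for a single case by summing the binomial probabilities of all counts whose interval covers the true probability. Here is a quick sketch for one combination of n and p:

n <- 100; p <- 0.01; z <- 1.96
x <- 0:n
p_hat <- x / n
se <- sqrt(p_hat * (1 - p_hat) / n)
covers <- (p_hat - z * se <= p) & (p <= p_hat + z * se)
sum(dbinom(x[covers], n, p))  # about 0.63, far below the nominal 0.95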

Let's do a simple simulation exercise to show this.

# This function simulates the construction of a C.I. and returns 1
# if the C.I. covers the true probability, and 0 if not.
my_func <- function(dummy){
  my_sample <- rbinom(sample_size, 1, prob = true_rate)
  my_mle <- mean(my_sample)
  my_sd <- sqrt(my_mle * (1 - my_mle) / sample_size)
  my_ci_upper <- my_mle + 1.96 * my_sd
  my_ci_lower <- my_mle - 1.96 * my_sd
  return(as.numeric(my_ci_lower <= true_rate &
                    true_rate <= my_ci_upper))
}
# Make four plots to illustrate the cases for
# sample = 200, 500, 2000, 5000
par(mfrow = c(2,2))
plot(NA, xlim = c(0,1), ylim = c(0.85, 0.975),
     xlab = "True Rate",
     ylab = "Real Confidence Level",
     main = "Sample size = 200")
abline(h = 0.95, col = "red")
sample_size = 200
result <- matrix(NA, 99)
for (j in 1:99){
  true_rate = 0.01 * j
  result[j] <- mean(unlist(lapply(1:10, function(t)
    mean(unlist(lapply(1:1000, my_func))))))
}
lines(result ~ seq(from = 0.01, to = 0.99, by = 0.01))

plot(NA, xlim = c(0,1), ylim = c(0.85, 0.975),
     xlab = "True Rate",
     ylab = "Real Confidence Level",
     main = "Sample size = 500")
abline(h = 0.95, col = "red")
sample_size = 500
result <- matrix(NA, 99)
for (j in 1:99){
  true_rate = 0.01 * j
  result[j] <- mean(unlist(lapply(1:10, function(t)
    mean(unlist(lapply(1:1000, my_func))))))
}
lines(result ~ seq(from = 0.01, to = 0.99, by = 0.01))

plot(NA, xlim = c(0,1), ylim = c(0.85, 0.975),
     xlab = "True Rate",
     ylab = "Real Confidence Level",
     main = "Sample size = 2000")
abline(h = 0.95, col = "red")
sample_size = 2000
result <- matrix(NA, 99)
for (j in 1:99){
  true_rate = 0.01 * j
  result[j] <- mean(unlist(lapply(1:10, function(t)
    mean(unlist(lapply(1:1000, my_func))))))  
}
lines(result ~ seq(from = 0.01, to = 0.99, by = 0.01))

plot(NA, xlim = c(0,1), ylim = c(0.85, 0.975),
     xlab = "True Rate",
     ylab = "Real Confidence Level",
     main = "Sample size = 5000")
abline(h = 0.95, col = "red")
sample_size = 5000
result <- matrix(NA, 99)
for (j in 1:99){
  true_rate = 0.01 * j
  result[j] <- mean(unlist(lapply(1:10, function(t)
    mean(unlist(lapply(1:1000, my_func))))))  
}
lines(result ~ seq(from = 0.01, to = 0.99, by = 0.01))


In these charts, the red line represents the 95% confidence level, and the black line represents the true probability that the 95% C.I. estimated using the formula covers the true probability. In the ideal case, the black lines should overlap with the red lines. But as these charts show, they are significantly off when the sample size is small and the true probability approaches zero or one.

What is a data frame in R

Data frames in R are simply lists of vectors (with some extra restrictions and attributes). Any operation on a list of vectors can be called "as-is" on a data frame. For example:
a = list(v1 = c(1, 2, 3), v2 = c(4, 5, 6))
b = data.frame(v1 = c(1, 2, 3), v2 = c(4, 5, 6))

Then, a[[1]], a[1], a["v1"], and a$v1 are all legal calls on a, and so are the corresponding calls on b. We can even check:
> is.list(b)
[1] TRUE

and they contain the same objects:
> objects(a)
[1] "v1" "v2"
> objects(b)
[1] "v1" "v2"

What attributes does a data frame have in addition? Row names and a class of "data.frame":
> attributes(a)
$names
[1] "v1" "v2"

> attributes(b)
$names
[1] "v1" "v2"

$row.names
[1] 1 2 3

$class
[1] "data.frame"

So, can we give such attributes to a list of vectors and fully convert it into a data frame? Yes!

> is.data.frame(a)
[1] FALSE
> class(a) <- "data.frame"
> row.names(a) <- c(1,2,3)
> is.data.frame(a)
[1] TRUE

Just one extra requirement (obviously): each vector in the list must have the same length.
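In practice, rather than setting the class and row names by hand, the same conversion (including a check that the vector lengths match) is handled by as.data.frame():

a <- list(v1 = c(1, 2, 3), v2 = c(4, 5, 6))
a <- as.data.frame(a)
str(a)  # 'data.frame': 3 obs. of 2 variables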

Monday, November 25, 2013

When we do object assignment in R

For most people who are new to R, the first thing they do (unless they do a "Hello, World!" exercise) is to assign some value to an object, such as
a <- 1; b <- a

It is obvious here that we assign the value 1 to a, and then assign b the same value as a, which is also 1. But it is interesting to think about how exactly R executes b <- a. R doesn't actually make a clone copy of a and assign it to b. Instead, R simply lets b point to the same object as a, without making a physical copy. In other words, both a and b point to the same physical allocation in memory. But what happens when we change the value of one object? Yes, as we would probably guess, at that point R makes a new copy of the original one.

To illustrate this, we make a (relatively) large object and check R's memory usage to confirm whether multiple copies of that large object are made.

> rm(list = ls()) # remove all existing objects to clean memory
> a <- rnorm(1e+08); gc() # this is a large object
            used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    282199  15.1     531268   28.4    467875   25.0
Vcells 100603178 767.6  210878317 1608.9 200764441 1531.8
> b1 <- a; gc()
            used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    282203  15.1     531268   28.4    467875   25.0
Vcells 100603179 767.6  210878317 1608.9 200764441 1531.8

We see that memory usage does not increase.

> b2 <- b1; gc()
            used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    282208  15.1     531268   28.4    467875   25.0
Vcells 100603180 767.6  210878317 1608.9 200764441 1531.8
> b3 <- b2; gc()
            used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    282213  15.1     531268   28.4    467875   25.0
Vcells 100603181 767.6  210878317 1608.9 200764441 1531.8

Now a, b1, b2, and b3 all point to the same object, with only one real memory allocation.

> a[1] <- 1; gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    282217   15.1     531268   28.4    467875   25.0
Vcells 200603182 1530.5  315878475 2410.0 300763181 2294.7

We see that as we changed a, a new copy of comparable size was created, but b1, b2, and b3 still share one physical allocation.

> b1[1] <- 1; gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    282218   15.1     531268   28.4    467875   25.0
Vcells 300603182 2293.5  331752398 2531.1 300923553 2295.9
> b2[1] <- 1; gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    282219   15.1     531268   28.4    467875   25.0
Vcells 400603182 3056.4  442002427 3372.3 400763207 3057.6

Finally, a, b1, b2, and b3 now point to four separate physical memory allocations for four different objects.
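Another way to watch this copy-on-modify behavior, without staring at gc() output, is base R's tracemem(), which prints a message whenever the traced object is duplicated (available when R is compiled with memory profiling support, as the CRAN binaries typically are). A small sketch:

a <- rnorm(10)
tracemem(a)    # prints the address of the object
b <- a         # nothing is printed: no copy is made yet
b[1] <- 1      # tracemem reports a duplication just before the modification
untracemem(a)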

Pass by reference in R

Sometimes people may want pass-by-reference in R, but unfortunately R doesn't really support this. However, there are various workarounds, and here is an example.

Let's say we want to make a function that changes all NA values in a vector to 0. The most common way to do this in R looks like this:

a <- c(1, NA, 2, NA, 3)
na2zero <- function(x){
  x[is.na(x)] <- 0
  return(x)
}
a <- na2zero(a); print(a)

However, a big problem with this approach is that an intermediate object with the same length as a is created and then assigned back to a. This can be a problem when a is very large. Moreover, in many cases where recursion is needed, this approach may not work. The question is: how can we write a function that modifies an external object from inside the function?

Will global assignment ( <<- ) work? Unfortunately no, at least not directly. Here is an example:

x <- c(1, NA, 2, NA, 3)
a <- c(1, NA, 2, NA, 3)
na2zero <- function(x){
  x[is.na(x)] <<- 0
}
na2zero(a); cat("x=", x, "\n"); cat("a=", a, "\n")

We notice that we call the function on a, but instead of changing the NA values in a, the function takes the name x literally and changes the external object x. How can we make the global assignment dynamic, instead of sticking to the name "x"? Here is how:

x <- c(1, NA, 2, NA, 3)
a <- c(1, NA, 2, NA, 3)
na2zero <- function(x){
  tmp <- substitute(expression(x[is.na(x)] <<- 0))
  eval(eval(tmp))
}
na2zero(a); cat("x=", x, "\n"); cat("a=", a, "\n")
na2zero(x); cat("x=", x, "\n"); cat("a=", a, "\n")

Why does this work? It relies on R's lazy evaluation of arguments (substitute() retrieves the expression that was passed in) and on lexical scoping (<<- searches the enclosing environments). More details can be found in R's manuals, or run help.search("scoping") in R.
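A more conventional alternative (just a sketch; na2zero_env is a made-up name) is to wrap the data in an environment. Environments are passed by reference in R, so changes made inside the function are visible to the caller without re-assigning the result (although modifying the vector itself may still create a temporary copy internally):

e <- new.env()
e$a <- c(1, NA, 2, NA, 3)
na2zero_env <- function(env) {
  # write the cleaned vector back into the caller's environment
  env$a[is.na(env$a)] <- 0
  invisible(NULL)
}
na2zero_env(e)
e$a  # 1 0 2 0 3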

Efficiency in R

One common criticism of R is its performance. Yes, compared with some other programming languages, R is not designed to optimize for performance. But there are still plenty of tricks (and pitfalls) we can pay attention to in order to improve it.

When an object is passed into a function, it is not automatically copied to a new object unless the function modifies it or implicitly creates a derivative object from it. Therefore, avoid unnecessarily modifying a large object, or constructing another large object based on it, since a seemingly minor operation may actually trigger a lot of copying. See the example:

a <- rnorm(1e+09) # make a large object
fun1 <- function(vec) {
  mean(vec) + 1
}

fun2 <- function(vec) {
  mean(vec + 1)
}

> system.time(fun1(a))
   user  system elapsed 
  9.594  32.137 123.823 
> system.time(fun2(a))
   user  system elapsed 
 12.039  44.128 203.475 

The first function takes vec and scans it to compute the mean. The second function first constructs a new object, vec + 1, and then passes it into mean(). When vec is large, materializing vec + 1 is expensive. Note that the two functions return the same value, since adding a constant commutes with taking the mean, so the cheaper form is the one to prefer.
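As a quick sanity check on a toy vector, the two forms really do agree, so nothing is lost by choosing the cheaper one:

v <- rnorm(10)
all.equal(mean(v) + 1, mean(v + 1))  # TRUE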