Signals & Noise: December 2013

There are probably a lot of semantic difference between Hive and standard SQL (ANSI SQL). I ran into one today. Here is it.

Assume I have two tables, t1 and t2.

SELECT * FROM t1;
+--------------+---------------+
| account_name | account_value |
+--------------+---------------+
| a | 12 |
| b | 18 |
| c | 22 |
| d | 28 |
+--------------+---------------+

SELECT * FROM t2;
+--------------+---------------+
| account_name | account_entry |
+--------------+---------------+
| b | 233 |
| b | 255 |
| d | 987 |
| d | 992 |
+--------------+---------------+

SELECT
t1.account_name, t1.account_value, t2.account_entry
FROM
t1
LEFT OUTER JOIN t2
ON t1.account_name = t2.account_name
AND t1.account_value = 18;

I expect Hive to return result like
+--------------+---------------+---------------+
| account_name | account_value | account_entry |
+--------------+---------------+---------------+
| a | 12 | NULL |
| b | 18 | 233 |
| b | 18 | 255 |
| c | 22 | NULL |
| d | 28 | NULL |
+--------------+---------------+---------------+

However, Hive gives me the following result.
+--------------+---------------+---------------+
| account_name | account_value | account_entry |
+--------------+---------------+---------------+
| b | 18 | 233 |
| b | 18 | 255 |
+--------------+---------------+---------------+
Initially, this doesn't make sense to me completely. I am doing a left join and I don't expect any row in my left table got dropped.

After a series of discussion with other analysts in the company, I now understand Hive treats one-side condition in join on clause as pre-filter. Therefore, only one row in the first table got pushed to the join execution. This is pretty different from ANSI SQL where the join on condition is only specifying how the join should be done. In an INNER JOIN case, this doesn't really matter, but pay extra attention when you do a LEFT OUTER JOIN and put a one-side condition on the left table, or do a RIGHT OUTER JOIN and put a one-side condition on the right table.

Signals & Noise

Wednesday, December 18, 2013

Hadoop Hive vs ANSI SQL

About Me