Introducing CORR() to Google BigQuery
Thursday, September 5, 2013
Editor's note: Join Felipe Hoffa and Michael Manoochehri tomorrow, September 6th at 11:00AM PST on Google Developers Live where they'll discuss correlation with BigQuery.
Google BigQuery is designed to make it easy to analyze large amounts of data quickly. We are always looking into how to make BigQuery even more powerful, so today we'll introduce a feature that we couldn't wait to share with you: Pearson correlation.
BigQuery is transforming the ways in which we work with massive amounts of data. Our partners have created amazing tools to make that process even more streamlined: Visualizing, slicing, and dicing. Working with your intuition and these tools, you can discover surprising new insights, analyzing terabytes of data in mere seconds. What's still a challenge is feeding this intuition, discovering where to look for insights.
The new CORR() function in BigQuery is a powerful tool for your intuition process: Which variables are similar, or have surprising behaviors? Can you rank these surprising behaviors? What are the best variables to predict the future?
Let's look at the data we collected at the Data Sensing Lab at the Moscone Center for Google I/O. We had multiple sensors in multiple rooms, collecting temperature, humidity, noise and other data during these 3 days.
To start, we can look at what rooms' temperature behaved in a similar way during the 2nd and the 3rd day:
SELECT CORR(a.data, b.data) corr, a.room room, count(*) c FROM ( SELECT TIME(USEC_TO_TIMESTAMP(INTEGER(Timestamp / 60000000) * 60000000)) time, AVG(DATA) data, room FROM [io_sensor_data.moscone_io13] WHERE DATE(USEC_TO_TIMESTAMP(Timestamp- 8*60*60000000)) = '2013-05-16' AND sensortype='temperature' GROUP EACH BY time, room) a JOIN EACH ( SELECT TIME(USEC_TO_TIMESTAMP(INTEGER(Timestamp / 60000000) * 60000000)) time, AVG(data) data, room FROM [io_sensor_data.moscone_io13] WHERE DATE(USEC_TO_TIMESTAMP(Timestamp- 8*60*60000000)) = '2013-05-17' AND sensortype='temperature' GROUP EACH BY time, room) b ON a.time=b.time AND a.room = b.room GROUP EACH BY room HAVING corr IS NOT NULL AND c > 800 ORDER EACH BY corr DESC
corr | room | c |
---|---|---|
0.9387693711 | sf desk 1st floor | 1331 |
0.8488553811 | chrome east | 1418 |
0.8423597116 | chrome hobbit | 1372 |
0.8162574011 | chrome west | 1401 |
0.7696065852 | chrome north | 1374 |
... | ... | ... |
-0.1048712561 | Room 1 | 1390 |
-0.1508345595 | keynote crowd | 1358 |
-0.5467798237 | android east | 1402 |
This table says that many rooms behaved in a similar way during both days. For example, a room inside the main exhibition area:
The room 'android east' shows a highly negative correlation. We can visualize why:
We can repeat this experiment with other dimensions, like humidity or noise. But we can go further, and find out what dimensions correlate with each other.
SELECT CORR(a.data, b.data) corr, a.sensortype a_sensortype, b.sensortype b_sensortype, a.room room, count(*) c FROM ( SELECT TIME(USEC_TO_TIMESTAMP(INTEGER(Timestamp / 60000000) * 60000000)) time, AVG(DATA) data, room, sensortype FROM [io_sensor_data.moscone_io13] WHERE DATE(USEC_TO_TIMESTAMP(Timestamp- 8*60*60000000)) = '2013-05-16' AND sensortype != 'pressure' AND sensortype != 'altitude' AND room != 'None' GROUP EACH BY time, room, sensortype) a JOIN EACH ( SELECT TIME(USEC_TO_TIMESTAMP(INTEGER(Timestamp / 60000000) * 60000000)) time, AVG(data) data, room, sensortype FROM [io_sensor_data.moscone_io13] WHERE DATE(USEC_TO_TIMESTAMP(Timestamp- 8*60*60000000)) = '2013-05-16' AND sensortype != 'pressure' AND sensortype != 'altitude' AND room != 'None' GROUP EACH BY time, room, sensortype) b ON a.time=b.time AND a.room = b.room WHERE a.sensortype > b.sensortype GROUP EACH BY room, a_sensortype, b_sensortype HAVING corr IS NOT NULL AND c > 800 ORDER EACH BY corr DESC
corr | a_sensortype | b_sensortype | room | c |
---|---|---|---|---|
0.9452553258 | temperature | humidity | android east | 1436 |
0.9230651809 | temperature | gas | sf desk 1st floor | 1425 |
... | ... | ... | ... | ... |
0.7949466754 | temperature | mic | maps | 1434 |
... | ... | ... | ... | ... |
0.0013869924 | temperature | light | Room 12 | 1436 |
... | ... | ... | ... | ... |
-0.9396195218 | temperature | humidity | accesability | 1341 |
-0.9564087934 | temperature | humidity | chrome lounge | 1135 |
-0.9925864599 | temperature | humidity | sf desk 1st floor | 1437 |
We can visualize how the air quality in the reception improved, once the doors were opened - as the drop in temperature shows:
At the end of the table, it's a surprise that while humidity had a high negative correlation to temperature in many sectors, it had a high positive correlation in 'android east'. First we can look at the normal behavior in the 'chrome lounge sector':
That's the nature of real data. It's a hard job to keep it accurate and clean. CORR() is a tool that can also call our attention to what is going wrong.
We can go much further with this: What's the relationship between rooms? What sensors in which rooms can help us predict what will happen an hour later on a different room? For this and much more tune in for an upcoming Google Developers Live session where we'll discuss how we built this queries, and how to go beyond. Stay tuned!
-Posted by Felipe Hoffa, Developer Programs Engineer