Two months ago, Kalev Leetaru of Georgetown University announced the availability of the entire quarter-billion-record GDELT Event Database in Google BigQuery. This dataset monitors the broadcast, print, and web news media from across the world in over 100 languages. It's a database of what’s happening throughout the globe - a continuously-updated, computable catalog of human society compiled from the world’s news media.

With the GDELT database publicly accessible through BigQuery, you can query and dig through a quarter-billion records in real-time. To explore what BigQuery can do, GDELT used its ability to compute correlations. Computing correlations allows us, for example, to look at a timeline of events in Egypt before the revolution of 2011 and then search 35 years of history for other countries around the world with similar patterns.

With a single SQL query, the GDELT team has been doing exactly that: using BigQuery to run more than 2.5 million correlations in a few minutes to trace the patterns of global society as captured by GDELT’s archive. Instead of only examining small slices of the data suggested by theory or domain expertise, this experiment showcases the use of GDELT's raw data to leverage the enormous power of BigQuery to exhaustively sift out every correlation from the entire quarter-billion record dataset, surfacing highly unexpected patterns and findings.

On their in-depth post, the GDELT team runs a query like this:
  STRFTIME_UTC_USEC(a.ending_at, "%Y-%m-%d") ending_at1,
  STRFTIME_UTC_USEC(b.ending_at-60*86400000000, "%Y-%m-%d") starting_at2,
  STRFTIME_UTC_USEC(b.ending_at, "%Y-%m-%d") ending_at2,,, CORR(a.c, b.c) corr, COUNT(*) c
  SELECT country, date+i*86400000000 ending_at, c, i
  FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a 
  CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 60) b
) b
  SELECT country, date+i*86400000000 ending_at, c, i
  FROM [gdelt-bq:sample_views.country_date_matconf_numarts] a 
  CROSS JOIN (SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i < 60) b
  WHERE country='Egypt'
  AND date+i*86400000000 = PARSE_UTC_USEC('2011-01-27')
) a
ON a.i=b.i
WHERE a.ending_at != b.ending_at
GROUP EACH BY ending_at1, ending_at2, starting_at2,,
HAVING (c = 60 AND ABS(corr) > 0.254)
This query has 2 subqueries: The smaller one finds the timeline of 30 days in Egypt before 2011-01-27, while the left side collects all sets of 30 days events for every country through GDELT's ever-growing dataset. With a cross join between the first set and all the sets on the left side, BigQuery is capable of sifting through this over a million combinations computed in real-time and calculate the Pearson correlation of each timeline pair. For a visual explanation, see the linked IPython notebook.

After running this query, the GDELT team obtained from BigQuery a list of all the worldwide periods from the last 35 years, as monitored by GDELT, that have been most similar to Egypt’s two months preceding the core of its revolution. Mathematically these periods present a statistically significant correlation with this specific time, and GDELT team proceeded to look into the details of why and its meaning. Read Kalev's post on the official GDELT blog.

You can run your own experiments with based on the GDELT database or other public datasets with your free monthly terabyte to query with Google BigQuery.

-Posted by Felipe Hoffa, Developer Advocate