Message Bus offers richer, faster data feeds on email open rates with Google BigQuery
Friday, May 9, 2014
Today’s guest blog comes from Ujjwal Sarin, software engineer at Message Bus, which provides cloud-based email delivery and messaging infrastructure that boosts email open rates and lowers costs for users. Message Bus helps its customers use messaging to build strong relationships with their own customers.
Message Bus customers thrive on data – real-time information about who’s opening their emails and who isn’t, and how campaigns are performing over time. We provide highly reliable email delivery that helps improve our customers’ message open rates by as much as 20%, meaning customers can do a better job of promoting their products and services. Google BigQuery frees us from the time-consuming work of managing data, which allows us to put brainpower and more money back into our business.
Feeding data to businesses when and how they want it helps us win business, but it also adds to the workload of storing and managing information. We accumulated about a terabyte of data a day on average, which took a good deal of time and money to keep track of. To manage the data associated with email delivery, we used a homegrown PostgreSQL-based data warehouse. Before our migration to BigQuery, the majority of our budget was spent on hardware, infrastructure, and personnel.
Building messaging systems is what we’re good at, and we wanted our engineers to spend more time coming up with new ways to tell customers how their campaigns were doing. That meant replacing our homegrown data warehouse with a hosted data backend. We looked at Amazon Redshift and Joyent Manta, but BigQuery won us over with its rich querying capabilities, practically zero performance administration, affordable data storage, and scalability. BigQuery is built on top of the battle-tested Dremel and Google File System, has a stable API, is inexpensive, and scales when we need it to without our having to provision additional capacity.
BigQuery has helped us ramp up the amount and type of data we can supply to our customers. It’s now faster and easier for our customers to access this information, since we built an easy-to-understand user-facing API, with BigQuery serving as the data warehouse.
Prior to BigQuery, we had implementation problems with our PostgreSQL cluster and couldn’t serve out large volumes of data from it; moving to BigQuery solved this for us. There’s no limit to the amount of data we can offer – we can even go several months back if customers want to do some historical analysis. Looking in depth at the history of their campaigns helps businesses make better decisions about how they’ll improve messages in the future. We also give customers access to data streams with real-time information about email delivery, as well as aggregate statistics every hour. For real-time data, we provide webhooks that are triggered from our system to our users’ endpoints. For aggregate stats, we run an hourly job that compiles results from BigQuery into a MySQL table.
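In rough terms, the real-time push looks something like this (a simplified sketch in Python; the endpoint URL, payload fields, and retry policy shown are illustrative, not our exact production format):

# Minimal sketch of pushing one delivery event to a customer's webhook endpoint.
import time
import requests

def push_delivery_event(endpoint_url, event, max_attempts=3):
    """POST a single delivery event to a customer's webhook endpoint, retrying on failure."""
    for attempt in range(max_attempts):
        try:
            resp = requests.post(endpoint_url, json=event, timeout=10)
            if resp.status_code < 300:
                return True
        except requests.RequestException:
            pass
        time.sleep(2 ** attempt)  # simple exponential backoff between attempts
    return False

# Example call (field names are hypothetical):
# push_delivery_event("https://customer.example.com/hooks/messagebus",
#                     {"eventType": "open", "messageGuid": "abc-123", "eventTime": 1399600000000})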
To service our feed of customer stats, we periodically run queries on Google BigQuery and store the aggregate counts in a relational database in the cloud. We’re using the BigQuery JSON API. For our feed on customers’ historical data, we proxy requests to Google BigQuery via an internal job manager that handles user quotas, retries and the number of concurrent queries.
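The stats feed boils down to running a query through the BigQuery JSON (REST) API and polling for results. Here is a simplified sketch; the project ID and token handling are placeholders, and in production the access token would come from a service account rather than being passed in directly:

# Run a query via the BigQuery JSON API (jobs.query) and poll until it completes.
import time
import requests

BQ_BASE = "https://www.googleapis.com/bigquery/v2/projects/{project}"

def run_query(project_id, access_token, sql, timeout_ms=10000):
    headers = {"Authorization": "Bearer " + access_token}
    # jobs.query starts the query and waits up to timeout_ms for it to finish
    resp = requests.post(BQ_BASE.format(project=project_id) + "/queries",
                         json={"query": sql, "timeoutMs": timeout_ms},
                         headers=headers).json()
    job_id = resp["jobReference"]["jobId"]
    # If the query didn't finish within timeout_ms, poll jobs.getQueryResults
    while not resp.get("jobComplete", False):
        time.sleep(2)
        resp = requests.get(BQ_BASE.format(project=project_id) + "/queries/" + job_id,
                            headers=headers).json()
    return resp.get("rows", [])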
When a user queries our system for stats or raw data, we issue an internal GUID (globally unique identifier), which acts as a proxy for the query that runs on BigQuery. The user of our API checks the status of the job, and upon successful completion, we proxy the result set back via our API.
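Conceptually, the job manager is a mapping from our public GUIDs to BigQuery job IDs. The following is a stripped-down sketch; the class and field names are illustrative, and the real manager persists its state and enforces per-user quotas and retries rather than the crude in-memory cap shown here:

# Map public GUIDs to BigQuery job IDs so API users never see BigQuery directly.
import uuid
import requests

BQ_BASE = "https://www.googleapis.com/bigquery/v2/projects/{project}"

class QueryJobManager(object):
    def __init__(self, project_id, access_token, max_concurrent=5):
        self.project_id = project_id
        self.headers = {"Authorization": "Bearer " + access_token}
        self.max_concurrent = max_concurrent
        self._jobs = {}  # our GUID -> BigQuery job ID

    def submit(self, sql):
        """Start a BigQuery query and hand the caller an opaque GUID instead of the job ID."""
        if len(self._jobs) >= self.max_concurrent:
            raise RuntimeError("concurrent query quota exceeded")  # crude cap for illustration
        resp = requests.post(BQ_BASE.format(project=self.project_id) + "/queries",
                             json={"query": sql, "timeoutMs": 1000},  # short wait; we poll later
                             headers=self.headers).json()
        guid = str(uuid.uuid4())
        self._jobs[guid] = resp["jobReference"]["jobId"]
        return guid

    def get_results(self, guid):
        """Return (done, rows) for the job behind this GUID; customers poll this via our API."""
        job_id = self._jobs[guid]
        resp = requests.get(BQ_BASE.format(project=self.project_id) + "/queries/" + job_id,
                            headers=self.headers).json()
        if not resp.get("jobComplete", False):
            return False, []
        return True, resp.get("rows", [])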
Here is an example of an hourly query, triggered by cron, that we use to extract data about open rates and populate our aggregates database:
SELECT
  a.send_hour send_hour,
  a.event_hour event_hour,
  a.account_id account_id,
  a.channel_id channel_id,
  a.session_id session_id,
  a.event_type event_type,
  d.channelKey channel_key,
  d.sessionKey session_key,
  a.event_count event_count
FROM (
  SELECT
    sendHour send_hour,
    INTEGER(UTC_USEC_TO_HOUR(minEventTime*1000)/1000) AS event_hour,
    accountId account_id,
    channelId channel_id,
    sessionId session_id,
    'uniqueopen' event_type,
    count(1) event_count
  FROM (
    SELECT
      min(eventTime) AS minEventTime,
      messageGuid, channelId, sessionId, accountId, sendHour
    FROM  -- (events table name omitted in the original post)
    WHERE eventType='open' AND eventTime-sendTime < 1209600000
    GROUP EACH BY messageGuid, channelId, sessionId, accountId, sendHour
    HAVING minEventTime >= TIMESTAMP_TO_MSEC(TIMESTAMP('20140416'))
  )
  GROUP EACH BY send_hour, event_hour, account_id, channel_id, session_id
) a
LEFT JOIN EACH [messagebus_com_prod_ACCOUNT_DATA_DS.ACCOUNT_DETAIL] d
  ON d.sessionId = a.session_id AND d.channelId = a.channel_id
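And here is a sketch of the hourly cron job that runs this query and folds the counts into MySQL. It reuses the run_query helper from the earlier BigQuery JSON API sketch, and the MySQL table and column names are illustrative assumptions, not our actual schema:

# Hourly cron: run the aggregate query above and upsert the counts into MySQL.
import MySQLdb

HOURLY_OPEN_QUERY = "..."  # the BigQuery SQL shown above

def compile_hourly_aggregates(project_id, access_token, db_config):
    rows = run_query(project_id, access_token, HOURLY_OPEN_QUERY)
    conn = MySQLdb.connect(**db_config)  # e.g. host, user, passwd, db
    try:
        cur = conn.cursor()
        for row in rows:
            # The BigQuery JSON API returns rows as {"f": [{"v": value}, ...]} in column order
            values = [cell["v"] for cell in row["f"]]
            cur.execute(
                "REPLACE INTO hourly_open_stats "
                "(send_hour, event_hour, account_id, channel_id, session_id, "
                " event_type, channel_key, session_key, event_count) "
                "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)", values)
        conn.commit()
    finally:
        conn.close()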
Not only is our data more detailed and easier to view, it also costs much less, saving us $20,000 a month compared to our homegrown database. We also no longer need an operations and database administrator to manage our database, or the costly database hardware we were running before. All told, we’re easily saving about $40,000 a month.
Ten months after we started using Google BigQuery, we have three rich data feeds that are available in just minutes when customers request them. We’re making good use of the time we’re no longer spending managing data – in fact, we have fresh ideas in the works for even more data feeds that can drive better decision-making for our customers.
-Contributed by Ujjwal Sarin, software engineer at Message Bus