Today, we are making it easier for you to run Hadoop jobs directly against your data in Google BigQuery and Google Cloud Datastore with the Preview release of the Google BigQuery connector and the Google Cloud Datastore connector for Hadoop. Both connectors implement Hadoop’s InputFormat and OutputFormat interfaces for accessing data. They complement the existing Google Cloud Storage connector for Hadoop, which implements the Hadoop Distributed File System interface for accessing data in Google Cloud Storage.

The connectors can be automatically installed and configured when deploying your Hadoop cluster using bdutil simply by including the extra “env” files:
  • ./bdutil deploy bigquery_env.sh
  • ./bdutil deploy datastore_env.sh
  • ./bdutil deploy bigquery_env.sh datastore_env.sh

Diagram of Hadoop on Google Cloud Platform

These three connectors let you directly access data stored in Google Cloud Platform’s storage services from Hadoop and other open source Big Data software that uses Hadoop’s IO abstractions. As a result, your valuable data is available simultaneously to multiple Big Data clusters and other services, without duplication. This should dramatically simplify the operational model for your Big Data processing on Google Cloud Platform.

Here are some word-count MapReduce code samples to get you started:
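
As a rough illustration of the logic a word-count job implements, here is a minimal plain-Java sketch of the map and reduce steps. It deliberately uses no Hadoop or connector classes: the class name `WordCountSketch` and its methods are hypothetical, chosen only for this example. In a real job, the records would arrive through a connector's InputFormat and the totals would be written out through an OutputFormat.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, self-contained sketch of word-count logic.
// It does NOT use the Hadoop or connector APIs; it only mirrors the map/reduce shape.
public class WordCountSketch {

    // "Map" step: turn one text record into a list of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : record.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // "Reduce" step: sum the counts for each distinct word.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Runs both steps over in-memory records (an assumption made for this sketch).
    static Map<String, Integer> wordCount(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) {
            pairs.addAll(map(record));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        // Prints {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(Arrays.asList("to be or not", "to be")));
    }
}
```

In an actual Hadoop job the framework, not an in-memory list, handles grouping the pairs by word between the two steps; the sketch collapses that shuffle into a single sorted map for brevity.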

As always, we would love to hear your feedback and ideas on improving these connectors and making Hadoop run better on Google Cloud Platform.

-Posted by Pratul Dublish, Product Manager