Posted:
In 2015, we're introducing a monthly webinar series to take an in-depth look at the diverse elements that help us solve complex business challenges in the cloud and nurture business growth. We’ll cover unique IT management and implementation strategies and the people, tools, and applications that increase impact. We're opening it up to a live, global online forum with the aim of fostering collaborative learning through use cases we can all relate to and real-time Q&A sessions. Our first webinar features zulily, a high-growth online retailer that leverages big data to provide a uniquely tailored product and customer experience to a mass market around the clock.

Zulily is one of the largest e-commerce companies in the United States. Its business is retail, but its DNA is in technology, using data and predictive analytics to drive decisions. As the company grows, so do the amount and complexity of its data. Zulily’s IT team realized that in order to keep up and scale properly, it had to redesign the way it processes, analyzes and uses big data.

Zulily transitioned to the Google Cloud Platform to meet these challenges and ultimately use the big data it collected to improve online customer experience. Join us as we take a technical deep dive into zulily’s new application infrastructure built on the Google Cloud Platform. The team will share key learnings and discuss how they plan to scale their efforts and impact.

Big data experts from Google Cloud Platform and zulily will share:

  • Best practices and implementation strategies to drive value from big data using products such as Google BigQuery and Hadoop
  • How zulily uses Google Cloud Platform to improve customer experience, increase sales, and increase relevance via marketing initiatives
  • Key leadership and technical benefits and risks to be aware of as you plan, execute and optimize your big data implementation strategy across one or multiple business units

Live Webinar: zulily turns big data into a big advantage with Google Cloud Platform

  • Wednesday, January 28, 2015
  • 10:30 - 11:00 a.m. PT
  • Speakers: William Vambenepe, Lead Product Manager for Google Cloud Big Data Services, and Sudhir Hasbe, Director of Software Engineering for Data Services, BI and Big Data Analytics at zulily

Register here

Posted:
Last week, we kicked off our series to introduce container technologies, which are changing the way that people deploy and manage applications. Docker has emerged as a popular technology for application containerization, revolutionizing how applications are built, deployed and managed. Google Cloud Platform offers rich support for Docker containers through the fully managed Google Container Engine service powered by Kubernetes, container optimized VMs on Google Compute Engine, and Managed VMs for Google App Engine.

Today we are announcing the beta release of a new service: Google Container Registry for the secure hosting, sharing, and management of private container repositories.

The registry service provides three key benefits to Google Cloud Platform customers:

  • Access control: The registry service hosts your private images in Google Cloud Storage under your Google Cloud Platform project. By default, this means your private images can only be accessed by members of your project, who can securely push and pull images through the Google Cloud SDK command line. Container host VMs can then access secured images without additional effort.
  • Server-side encryption: Your private images are automatically encrypted before they are written to disk.
  • Fast and reliable deployment: Your private images are stored in Google Cloud Storage and cached in our datacenters, ready to be deployed to Google Container Engine clusters or Google Compute Engine container optimized VMs over Google Cloud Platform’s Andromeda based network fabric.

zulily, an online retailer that offers thousands of new and unique products each day, was an early adopter of the registry service. “Docker registry availability, security, performance, and durability become more and more critical as more of our Compute Engine applications are containerized with Docker. Private registries help, but they need valid certificates, authentication and firewalls, backups, and monitoring. Google's container registry provides us with a complete Docker registry that we integrate into our development and deployment workflow with little effort,” said Steve Reed, Principal Engineer, Core Engineering at zulily.

During the Container Registry beta, there is no extra cost for using the registry service besides the Google Cloud Storage charges for storage and network resources consumed by your private images.

To get started, you will need a Google Cloud Platform project with billing enabled. If you don’t have one already, you can use the free trial to create one. You will also need to install Docker and Google Cloud SDK.
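Once those prerequisites are in place, the basic workflow is to tag a local image with the registry hostname and your project ID, then push and pull it with the Cloud SDK. The sketch below is illustrative: the project ID your-project-id and the image name my-app are placeholders, and the exact gcloud subcommand may vary with your SDK version.

# Tag the local image for your project's private registry
docker tag my-app gcr.io/your-project-id/my-app

# Push the image to Google Container Registry via the Cloud SDK
gcloud preview docker push gcr.io/your-project-id/my-app

# Pull it later from any machine or container VM authorized on the project
gcloud preview docker pull gcr.io/your-project-id/my-app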

Go ahead, take a look at our documentation and start using the registry for managing your private Docker images. The registry service team looks forward to receiving your direct feedback.

-Posted by Pratul Dublish, Technical Program Manager

Posted:
Google Cloud Platform provides a reliable and scalable compute, storage and network infrastructure for all your big data needs. We have worked extensively with the open source community to optimize the Hadoop ecosystem for Cloud Platform. In 2014, we helped simplify the deployment of Apache Hadoop and Apache Spark on Google Cloud Platform by introducing bdutil, a command-line toolset that accelerates deployment. To reduce cluster startup time, increase interoperability, and streamline storage of source data and results, we have also provided connectors to Google Cloud Storage and Google BigQuery.

Today, we’re happy to announce the availability of the Hortonworks Data Platform (HDP) 2.2 on Google Cloud Platform. HDP 2.2 has been certified by Hortonworks for use on Google Cloud Platform, along with the bdutil deployment toolset and the Google Cloud Storage connector. Google and Hortonworks believe in providing a seamless experience for starting and running your Hadoop workloads in the cloud. We want users to focus on developing and analyzing their data, rather than worrying about bringing up Hadoop clusters.

You can take advantage of the integrated and certified HDP plugin for bdutil and deploy a standard cluster in a matter of minutes with the following command line:

./bdutil deploy -e platforms/hdp/ambari_env.sh

By default, bdutil will deploy a cluster with 5 nodes, per HDP recommendations, along with the latest version of HDP and the recommended HDP components. Once deployed, the cluster is ready to run Pig scripts, MapReduce jobs, Hive queries, or additional Hadoop services supported by HDP. You’ll also have access to the Ambari GUI to perform additional configuration and setup.
For additional information, please visit our bdutil Hortonworks documentation. You can download the bdutil setup scripts in zip format or tar.gz format.
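As a quick sanity check after deployment, you can SSH into the cluster’s master node and run one of the stock Hadoop examples against Cloud Storage. This is a hedged sketch: the bucket name is a placeholder, the path to the examples jar varies by HDP version and layout, and bdutil may need the same -e environment flags you used at deploy time.

# Open an SSH session on the cluster's master node
# (pass the same -e flags used for deployment if needed)
./bdutil shell

# From the master, run a word-count job that reads from and writes to
# Google Cloud Storage through the installed connector
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
    wordcount gs://your-bucket/input gs://your-bucket/output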

To find out more about our joint collaboration, go here.

-Posted by Ram Ramanathan, Product Manager

Posted:
Last week, Miles Ward, Google Cloud Platform’s Global Head of Solutions, kicked off our Container series with a post about the overarching concepts around containers, Docker, and Kubernetes. If you have not yet had a chance to read his post, we suggest you start there to arm yourself with the knowledge you will need for this post!

This week, Joe Beda, Senior Staff Engineer and one of the founding members of the Kubernetes project, will go a level deeper and talk in depth about the core technical concepts that underpin Google’s use of containers. These have informed the creation of Kubernetes and provide a foundation for future posts in this series.

What makes a container cluster?

The recent rise of container systems like Docker has (rightly) created a lot of excitement. The ability to package, transfer and run application code across many different environments enables new levels of fluidity in how we manage applications. But, as users expand their use of containers into production, new problems crop up in terms of managing which containers run where, dealing with large numbers of containers and facilitating communication between containers across hosts. This is where Kubernetes comes in. Kubernetes is an open source toolkit from Google that helps to solve these problems.

As we discussed in last week’s post, we consider Kubernetes a "container cluster manager." Lots of folks call projects in this area "orchestration systems," but that has never rung true for me. Orchestral music is meticulously planned with the score decided and distributed to the musicians before the performance starts. Managing a Kubernetes cluster is more like an improv jazz performance. It is a dynamic system that reacts to conditions and inputs in real time.

So, what makes a container cluster? Is it a dynamic system that places and oversees sets of containers and the connections between them? Sure, that and a bunch of compute nodes (either raw physical servers or virtual machines). In the remainder of this post, we’ll explore three things: what makes up a container cluster, how to work with one, and how its interconnected elements work together. Additionally, based on our experience, a container cluster should include a management layer, and we will dig into the implications of this below.

Why run a container cluster?

Here at Google, we build container clusters around a common set of requirements: always be available, be able to be patched and updated, scale to meet demand, be easily instrumented and monitorable, and so on. While containers allow for applications to be easily and rapidly deployed and broken down into smaller pieces for more granular management, you still need a solution for managing your containers so that they meet these goals.

Over the past ten years at Google, we've found that having a container cluster manager addresses these requirements and provides a number of benefits:

  • Microservices in order to keep moving parts manageable. Having a cluster manager enables us to break down an application into smaller parts that are separately manageable and scalable. This lets us scale up our organization by having clear interfaces between smaller teams of engineers.
  • Self healing systems in the face of failures. The cluster manager automatically restarts work from failed machines on healthy machines.
  • Low friction horizontal scaling. A container cluster provides tools for horizontal scaling, such that adding more capacity can be as easy as changing a setting (replication count).
  • High utilization and efficiency rates. Google was able to dramatically increase resource utilization and efficiency after moving to containers.
  • Specialized roles for cluster and application operations teams. Developers are able to focus much more on the service they are building rather than on the underlying infrastructure that supports it. For example, the Gmail operations and development teams rarely have to talk directly to the cluster operations team. Having a separation of concerns here allows (but doesn't force) operation teams to be more widely leveraged.

Now, we understand that some of what we do is unique, so let's explore the ingredients of a great container cluster manager and what you should focus on to realize the benefits of running containers in clusters.

Ingredient 1: Dynamic container placement

To build a successful cluster, you need a little bit of that jazz improv. You should be able to package up your workload in a container image and declaratively specify your intents around how and where it is going to run. The cluster management system should decide where to actually run your workload. We call this "cluster scheduling."

This doesn't mean that things are placed arbitrarily. On the contrary, a whole set of constraints comes into play, making cluster scheduling a very interesting and hard problem1 from a computer science point of view. When scheduling, the scheduler makes sure to place your workload on a VM or physical machine with enough spare capacity (e.g. CPU, RAM, I/O, storage). But, in order to meet a reliability objective, the scheduler might also need to spread a set of jobs across machines or racks to reduce the risk from correlated failures. Or perhaps some machines have special hardware (e.g. GPUs, local SSD, etc.). The scheduler should also react to changing conditions and reschedule work to deal with failures, to grow or shrink the cluster, or to increase efficiency. To enable this, we encourage users to avoid pinning a container to a specific machine. Sometimes you have to fall back on "I want that container on that machine," but that should be a rare exception.

The next question is: what are we scheduling? The easy answer here is individual containers. But often you want a set of containers running as a team on the same host. Examples include a data loader paired with a data server, or a log compressor/saver process paired with a server. These containers usually need to be located together, and you want to ensure that they do not become separated during dynamic placement. To enable this, we introduced in Kubernetes a concept known as a pod. A pod is a set of containers that are placed and scheduled together as a unit on a worker machine (also known as a Kubernetes node). By placing and scheduling pods as units, Kubernetes can pack lots of work onto a node in a reliable way.

Ingredient 2: Thinking in sets

When working on a single physical node, tools generally don't operate on containers in bulk. But when moving to a container cluster you want to easily scale out across nodes. To do this, you need to work in terms of sets of things instead of singletons. And you want to keep those sets similarly configured. In Kubernetes, we manage sets of pods using two additional concepts: labels and replication controllers.

Every pod in Kubernetes has a set of key/value pairs associated with it that we call labels. You can select a set of pods by constructing a query based on these labels. Kubernetes has no opinion on the "correct" way to organize pods; it is up to you to organize them in a way that makes sense to you. You can organize by application tier, geographic location, development environment, etc. In fact, because labels are non-hierarchical, you can organize your pods in multiple ways simultaneously.

Example: let's say you have a simple service that has a frontend and a backend. But you also have different environments – test, staging and production. You can label your production frontend pods with env=prod tier=fe and your production backend pods with env=prod tier=be. You could similarly label your test and staging environments. Then, when operating on or inspecting your cluster, you could just restrict yourself to the pods where env=prod to see both the frontend and backend. Or you can look at all of your frontends across test, staging and production. You can imagine how this organization system can adapt as you add more tiers and environments.
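Label queries like these map directly onto the command line. Here is a hedged sketch using kubectl with the labels from the example above; flag spellings have varied a bit across early releases, so treat the exact syntax as illustrative.

# All production pods, frontend and backend alike
kubectl get pods -l env=prod

# Just the production frontends
kubectl get pods -l env=prod,tier=fe

# Every frontend, across test, staging and production
kubectl get pods -l tier=fe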

Figure 1 - Filtering pods using labels
Scaling
Now that we have a way of identifying and maintaining a pool of similarly configured pods, we can use this functionality for horizontal scaling (i.e. “scaling out”). To make this easy, we have a helper object in Kubernetes called the replication controller. It maintains a pool of pods based on a desired replication count, a pod template and a label selector/query. It is really pretty easy to wrap your head around. Here is some pseudo-code:

object replication_controller {
  property num_replicas    // desired number of pods
  property template        // pod template used to create new pods
  property label_selector  // label query that identifies the pods being managed

  runReplicationController(num_replicas, template, label_selector) {
    loop forever {
      // Count the pods that currently match the label selector
      num_pods = length(query(label_selector))
      if num_pods > num_replicas {
        // Too many pods: kill the surplus
        kill_pods(num_pods - num_replicas)
      } else if num_pods < num_replicas {
        // Too few pods: create replacements from the template
        create_pods(template, num_replicas - num_pods)
      }
    }
  }
}
So, for example, if you wanted to run a PHP frontend tier with 3 pods, you would create a replication controller with an appropriate pod template (pointing at your PHP container image) and a num_replicas count of 3. You would identify the set of pods that this replication controller is managing with a label query of env=prod tier=fe. The replication controller takes an easy-to-understand desired state and tirelessly works to make it true. And if you want to scale in or out, all you have to do is change the desired replication count; the replication controller will take care of the rest. By focusing on the desired state of the system, we end up with something that is easier to reason about.
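To make that concrete, here is a hedged sketch of the declarative workflow from the command line. The file name is a placeholder for a definition whose pod template points at your PHP image with a selector of env=prod,tier=fe and 3 replicas, and subcommand names have changed across releases (resize was later renamed scale).

# Create the replication controller from its definition file
kubectl create -f frontend-controller.yaml

# Scale out by changing only the desired count; the controller reconciles the rest
kubectl resize --replicas=5 replicationcontrollers frontend-controller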
Figure 2 - The Replication Controller enforces desired state

Ingredient 3: Connecting within a cluster

You can do a lot of interesting things with the ingredients listed so far. Any sort of highly parallelizable work distribution (continuous integration systems, video encoding, etc.) can work without a lot of communication between individual pods. However, most sophisticated applications are more of a network of smaller services (microservices) that communicate with each other. The tiers of traditional application architectures are really nodes in a graph.

A cluster management system needs a naming resolution system that works with the ingredients described above. Just like DNS provides the resolution of domain names to IP addresses, this naming service resolves service names to targets, with some additional requirements. Specifically, changes should be propagated almost immediately when things start or are moved, and a "service name" should resolve to a set of targets, possibly with extra metadata about those targets (e.g. shard assignment). For the Kubernetes API, this is done with a combination of label selectors and the watch API pattern.2 This provides a very lightweight form of service discovery.

Most clients aren't going to be rewritten immediately (or ever) to take advantage of a new naming API. Most programs want a single address and port to talk to in order to communicate with another tier. To bridge this gap, Kubernetes introduces the idea of a service proxy. This is a simple network load balancer/proxy that does the name query for you and exposes it as a single stable IP/port (with DNS) on the network. Currently, this proxy does simple round robin balancing across all backends identified by a label selector. Over time, Kubernetes plans to allow for custom proxies/ambassadors that can make smarter domain specific decisions (keep an eye on the Kubernetes roadmap for details as the community defines this). One example that I'd love to see is a MySQL aware ambassador that knows how to send write traffic to the master and read traffic to read slaves.

Voila!

Now you can see how the three key components of a cluster management system fit together: dynamic container placement, thinking in sets of containers, and connecting within a cluster.

We asked the question at the top of this post, "What makes a container cluster?" Hopefully from the details and information we’ve provided, you have an answer. Simply put, a container cluster is a dynamic system that places and manages containers, grouped together in pods, running on nodes, along with all the interconnections and communication channels.

When we started Kubernetes with the goal of externalizing Google's experiences with containers, we initially focused on just scheduling and dynamic placement. However, when thinking through the various systems that are absolutely necessary to build a real application, we immediately saw that it was necessary to add the other additional ingredients of pods, labels and the replication controller. To my mind, these are the bare minimum necessary to build a usable container cluster manager. 

Kubernetes is still baking in the oven, but is coming together nicely. We just released v0.8, which you can download here. We’re still adding features and refining those that we have. We’ve published our roadmap to v1.0. The project has quickly established a large and growing community of contributing partners (such as Red Hat, VMware, Microsoft, IBM, CoreOS, and others) and customers, who use Kubernetes in a variety of environments.

While we have a lot of experience in this space, Google doesn't have all the answers. There are requirements and considerations that we don't see internally. With that in mind, please check out what we are building and get involved! Try it out, file bug reports, ask for help or send a pull request (PR).

-Posted by Joe Beda, Senior Staff Engineer and Kubernetes Cofounder

1 This is the classic knapsack problem, which is NP-hard in the general case.
2 The "watch API pattern" is a way to deliver async events from a service. It is common in lock server systems (ZooKeeper, etc.) that are derived from the original Google Chubby paper. The client essentially reaches out and "hangs" a request until there are changes. This is usually coupled with version numbers so that the client stays current on any changes.

Posted:
Today’s post is by Sunil Sayyaparaju, Director of Product and Technology at Aerospike, the open source, flash-optimized, in-memory NoSQL database.

Aerospike, now available as a Click to Deploy solution on Google Compute Engine, is an open source NoSQL database built to push the limits of modern processors and storage technologies, including SSDs. Developers are increasingly choosing NoSQL databases to power cloud applications. In a few minutes, you can have an Aerospike cluster deployed to your specifications. Each node is configured with Aerospike Server Community Edition and the Aerospike Management Console. The available tuning parameters can be found in the Click to Deploy Aerospike documentation.

In addition to the rapid deployment provided by Click to Deploy, we are also excited by the results we are seeing in our performance testing on Google Cloud Platform. Back in 2009, the founders of Aerospike saw that SSDs would be the future of storage, offering data persistence with better read/write access times than rotational hard disks, greater capacity than RAM, and a price/performance ratio that would fuel the development of applications that were previously not economically viable to run. The current proliferation of SSDs, now available on Google Compute Engine, validates this vision, and this unprecedented level of price/performance will enable a new category of real-time, data-intensive applications.

In this post, we will showcase the performance characteristics of Local SSDs on Google Compute Engine and demonstrate RAM-like performance with a 15x storage cost advantage using Local SSDs. We repeated the recent tests published in “Aerospike Hits 1 Million Writes Per Second With Just 50 Nodes,” using Local SSDs instead of RAM.

Aerospike certifies Local SSDs on Google Compute Engine
When the first Aerospike customers deployed the Aerospike database in 2010, there was no good way to benchmark SSDs for database workloads. The standard fio (Flexible I/O) tool for benchmarking disks did not fit our needs, so Aerospike developed and open sourced the Aerospike Certification Tool (ACT) for SSDs. This tool simulates typical database workloads:

  • Reads small objects (default 1500 bytes) using multiple threads (default 16).
  • Writes large blocks (default 128KB) to simulate a buffered write mechanism in DBMS.
  • Reads large blocks (default 128KB) to simulate typical background processing.

ACT is used to test SSDs from different manufacturers, understand their characteristics and select configuration values that maximize the performance of each model. The test is run for 24-48 hours because the characteristics of an SSD change over time, especially in the initial few hours. In addition, different SSDs handle garbage collection differently, resulting in a wide variability in performance. To help customers select drives that pass our performance criteria, based on results of ACT, Aerospike certifies and publishes this list of recommended SSDs.
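For reference, running ACT is a two-step affair: run the load for the full test duration, then post-process the log to produce the latency breakdown. The sketch below follows the ACT README of that era; the config file and analysis script names are assumptions and may differ in current versions of the tool.

# Run the configured workload (actconfig.txt defines devices, load and duration)
./act actconfig.txt > act_output.txt

# Summarize the percentage of reads exceeding 1, 2, 4 and 8 ms
./latency_calc/act_latency.py -l act_output.txt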

Aerospike Certification Tool (ACT) for SSDs: Setup
The following server and storage configurations were used to run the ACT test (a sketch for provisioning a matching machine follows the list):

  • Machine: n1-standard-4 with 1 Local SSD provisioned (4 vCPU, 15 GB memory)
  • SSD size: 375GB
  • Read/Write size: 1500 bytes (all reads hit disk, but writes are buffered)
  • Large block read size: 128KB
  • Load: 6000 reads/s, 3000 writes/s, 71 large block reads per sec
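A machine matching this configuration can be provisioned with a single gcloud command. This is a hedged sketch: the instance name and zone are placeholders, and flag syntax may differ slightly by SDK version.

# Create an n1-standard-4 instance with one 375 GB Local SSD attached over SCSI
gcloud compute instances create act-test-node \
    --zone us-central1-b \
    --machine-type n1-standard-4 \
    --local-ssd interface=SCSI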

ACT results show that 95% of Local SSD reads complete in under 1 ms
The results are shown in the graph below. The y axis shows the percentage of database read transactions that take longer than 1, 2, 4, or 8 milliseconds to complete. The x axis shows how performance changes during the first few hours and how consistent performance is as the benchmark continues to run for 24 hours.
The graph shows that after the first few hours, 95% of reads complete in under 1 ms:

  • only 5% take > 1 ms
  • only 3% take > 2 ms
  • only 1% take > 4 ms
  • a negligible number take > 8 ms

(Note: the set of reads taking >1 ms is a superset of the set taking >2 ms, which is a superset of the set taking >4 ms, and so on.)
Similar to other SSDs that Aerospike has tested, the performance of Local SSDs on Google Compute Engine starts out very high and, as is typical of SSDs, decreases slightly over time. Performance stabilizes quickly, in about 10 hours, which, based on our experience benchmarking numerous SSDs, is very good.

Comparing Aerospike performance on Local SSDs vs. RAM
An earlier post showed how Aerospike hit 1 million writes per second with just 50 nodes on Google Compute Engine and 1 million reads per second with just 10 nodes running in RAM. Aerospike’s disk storage layer was designed to take advantage of SSDs, keeping in mind their unique characteristics. For this blog post, we repeated the performance test with 10 nodes, using Local SSDs instead of RAM, which yielded the following results:

  • 15x price advantage in storage costs with Local SSDs vs RAM
  • Achieved roughly the same write throughput using Local SSDs compared to RAM
  • Achieved half the read throughput using Local SSDs compared to RAM

Aerospike delivers 15x storage cost advantage with Local SSDs vs. RAM
The table below shows the hardware specifications of the machines used in our testing. Using Local SSDs instead of RAM, we got 25x more capacity (750GB/30GB) at 1.64x the cost ($417.50/$254), for a 15x per-GB price advantage ($8.46/$0.56). We used 20 clients of type n1-highcpu-8.


Aerospike demonstrates RAM-like Latencies for Local SSDs vs. RAM
The graph below shows the percentage of reads >1ms and writes >8ms, for a number of read-write workloads.
Write latencies for Local SSDs are similar to RAM because, in both cases, writes are first written to memory and then flushed to disk. Although read latencies are higher with Local SSDs, the difference is barely noticeable here because most reads from Local SSDs finish in under 1 ms, so the percentage of reads taking more than 1 ms is similar for both RAM and Local SSDs.

Aerospike demonstrates RAM-like Throughput for Writes on Local SSDs vs. RAM
The graph below compares throughput for different Read-Write workloads. The results show:
  • 1.0x write throughput (while doing 100% writes) using Local SSDs compared to RAM. Aerospike is able to achieve the same write throughput because of buffered writes, where writes are first written in memory and subsequently flushed to disk.
  • 0.5x read throughput (while doing 100% reads) using Local SSDs compared to RAM. Aerospike is able to achieve such high performance using Local SSDs because it stores indexes in RAM and they point to data on disk. The disk is accessed exactly once per read operation, resulting in highly predictable performance.

Surprisingly, when doing 100% reads with Local SSDs, over 55% complete in under 1 ms. Most reads from SSDs take roughly 0.5-1 ms, while reads from RAM take less than 0.5 ms. That may be why read throughput drops without a corresponding change in the percentage of reads taking more than 1 ms.

Summary
This post documented results of the Aerospike Certification Tool (ACT) for SSDs and demonstrated a 15x storage cost advantage and RAM-like performance with Local SSDs vs. RAM. This game-changing price/performance ratio will power a new category of applications that analyze behavior, anticipate the future, engage users and monetize real-time, big data driven opportunities across the Internet.

You can deploy an Aerospike cluster today by taking advantage of the Google Cloud Platform free trial, with support for Standard Persistent Disk and SSD Persistent Disk.

-Posted by Sunil Sayyaparaju, Director of Product and Technology at Aerospike

Aerospike is the registered trademark of Aerospike, Inc. All other trademarks cited here are the property of their respective owners.

Posted:
Local SSD has shown fantastic price-performance throughout beta, and today we are excited to announce that it's now generally available to Google Compute Engine customers in all regions around the world.

The Local SSD feature lets customers attach between 1 and 4 SSD partitions of 375 GB each to any full-core VM and have dedicated use of those partitions. It provides an extremely high number of IOPS (680k random 4K read IOPS). Unlike Persistent Disk, Local SSD doesn’t have redundancy, which makes it ideal for highly demanding applications that provide their own replication (such as many modern databases and Hadoop), as well as for scratch space for compute-intensive applications. Local SSD is also a great supplement to memory due to its high IOPS and low price per GB.
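Because Local SSD partitions are exposed to the guest as raw block devices, you format and mount them yourself. Here is a minimal sketch for the first attached partition on a Linux VM; the device path follows Compute Engine’s naming for SCSI Local SSDs, and the mount point is arbitrary.

# Format and mount the first Local SSD partition
sudo mkfs.ext4 -F /dev/disk/by-id/google-local-ssd-0
sudo mkdir -p /mnt/local-ssd-0
sudo mount /dev/disk/by-id/google-local-ssd-0 /mnt/local-ssd-0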

Local SSD is competitively priced at $0.218/GB/month. For those used to buying Local SSD attached to VMs, this comes to roughly $0.0003/GB/hour. Re-calculated as a price per IOPS, it comes to the extremely low price of $0.00048/read-IOPS/month: for example, four 375 GB partitions (1,500 GB) cost about $327 per month, which, spread across 680k random 4K read IOPS, works out to roughly $0.00048 per read-IOPS per month.

We’ve had a fantastic response to the beta of Local SSD. For example, CloudHarmony has shown that Google Compute Engine Local SSD achieves the highest number of 4K random IOPS across all of the VM-storage combinations tested. And it’s even better when normalized for price!

In addition, we’ve already received positive feedback from our customers. Aerospike achieved RAM-like performance and 15x cost advantage with the use of Local SSD as a supplement for RAM on their NoSQL database. And Gyazo.com is using Local SSD as a supplement for RAM for MongoDB, which works out perfectly due to low price and high performance of Local SSD. Isshu Rakusai from Gyazo.com said: “Local SSD on Google Compute Engine is very fast. It has allowed us to decrease our RAM usage, and made us more cost efficient.”

Learn more in the local SSD beta announcement or product documentation. And try our top IOPS with your VM.

- Posted by Kirill Tropin, Product Manager

Posted:
Big data processing can take place in many contexts. Sometimes you’re prototyping new pipelines, and at other times you’re deploying them to run at scale. Sometimes you’re working on-premises, and at other times you’re in the cloud. Sometimes you care most about speed of execution, and at other times you want to optimize for the lowest possible processing cost. The best deployment option often depends on this context. It also changes over time; new data processing engines become available, each optimized for specific needs — from the venerable Hadoop MapReduce to Storm, Spark, Tez or Flink, all in open source, as well as cloud-native services. Today’s optimal choice of big data runtime might not be tomorrow’s.

But in all these cases, what remains true is that you need an easy-to-use, powerful and flexible programming model that makes developers productive. And no one wants to have to rewrite their algorithm for a specific runtime.

We believe the Dataflow programming model, based on years of experience at Google, can provide maximum developer productivity and seamless portability. That's why in December we open sourced the Cloud Dataflow SDK, which offers a set of primitives for large-scale distributed computing, including rich semantics for stream processing. This allows the same program to execute either in stream or batch mode.

Today, we’re taking the next step in ensuring the portability of the Dataflow programming model by working with Cloudera to make Dataflow run on Spark. There are currently three runners available that allow Dataflow programs to execute in different environments (an example of switching between them follows the list):

  • Direct Pipeline: The “Direct Pipeline” runner executes the program on the local machine.
  • Google Cloud Dataflow: The Google Cloud Dataflow service is a hosted and fully managed execution environment for Dataflow programs on Google Cloud Platform. Programs can be deployed on it via a runner. This service is currently in alpha phase and available to a limited number of users; you can apply here.
  • Spark: Thanks to Cloudera, the Spark runner allows the same Dataflow program to execute on a Spark cluster, whether in the cloud or on-premises. The runner is part of the Cloudera Labs effort and is available in this GitHub repo. You can find out more about Dataflow and the Spark runner from Cloudera’s Josh Wills in this blog post.
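Switching runners is a matter of changing a pipeline option rather than the program itself. Here is a hedged sketch using the WordCount example that ships with the Dataflow SDK for Java; the project ID and bucket are placeholders, and the Spark runner additionally requires Cloudera’s spark-dataflow artifact on the classpath.

# Run locally with the Direct Pipeline runner
mvn compile exec:java \
    -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
    -Dexec.args="--runner=DirectPipelineRunner"

# Run the same program on the managed Cloud Dataflow service
mvn compile exec:java \
    -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
    -Dexec.args="--runner=BlockingDataflowPipelineRunner \
                 --project=your-project-id --stagingLocation=gs://your-bucket/staging"

# Or on a Spark cluster via Cloudera's runner
mvn compile exec:java \
    -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
    -Dexec.args="--runner=SparkPipelineRunner"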

We are delighted that Cloudera is joining us, and we look forward to the future growth of the Dataflow ecosystem. We’re confident that Dataflow programs will make data more useful in an ever-growing number of environments, in cloud or on-premises. Please join us – whether by using the Dataflow SDK (deploying via one of the three runners listed above) for your own data processing pipelines, or by creating a new Dataflow runner for your favorite big data runtime.

-Posted by William Vambenepe, Product Manager