The value of data lies in analysis and in the intelligence it generates. Turning data into intelligence becomes challenging as data sets grow large and spread across disparate storage systems. Add to that the increasing demand for real-time analytics, and extracting value from data becomes a serious hurdle for developers.

In June 2014, we unveiled Google Cloud Dataflow, a significant step toward a managed service model for data processing, aimed at relieving operational burden and letting developers focus on development. We created Cloud Dataflow, currently in alpha, as a platform to democratize large-scale data processing by enabling easier and more scalable access to data for data scientists, data analysts, and data-centric developers. Regardless of role or goal, users can discover meaningful results in their data through simple and intuitive programming concepts, without the noise of managing distributed systems.

Today, we are announcing the availability of the Cloud Dataflow SDK as open source. This makes it easier for developers to integrate with our managed service and forms the basis for porting Cloud Dataflow to other languages and execution environments.

We’ve learned a lot about turning data into intelligence as the original FlumeJava programming models (the basis for Cloud Dataflow) have continued to evolve internally at Google. Why share this via open source? So that the developer community can:
  • Spur future innovation in combining stream- and batch-based processing models: Reusable programming patterns are a key enabler of developer efficiency. The Cloud Dataflow SDK introduces a unified model for batch and stream data processing. Our approach to temporal aggregation provides a rich set of windowing primitives, allowing the same computations to be used with batch or stream data sources (see the first sketch after this list). We will continue to innovate on new programming primitives and welcome the community to participate in this process.
  • Adapt the Dataflow programming model to other languages: As data proliferates, so do programming languages and patterns. We are currently building a Python 3 version of the SDK to give developers even more choice and to make Dataflow accessible to more applications.
  • Execute Dataflow on other service environments: Modern development, especially in the cloud, is about heterogeneous service composition. Although we are building a massively scalable, highly reliable, strongly consistent managed service for Dataflow execution, we also embrace portability. As Storm, Spark, and the greater Hadoop family continue to mature, developers are challenged with bifurcated programming models. We hope to relieve developer fatigue and enable choice in deployment platforms by supporting execution and service portability (see the second sketch below).
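
To make the unified model concrete, here is a minimal sketch against the Dataflow SDK for Java. The bucket paths are placeholders, and it assumes input elements carry meaningful timestamps; the point is that the windowed aggregation is expressed the same way whether the source feeding it is bounded (batch) or unbounded (streaming):

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.Count;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
    import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
    import com.google.cloud.dataflow.sdk.values.KV;
    import org.joda.time.Duration;

    public class WindowedCounts {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(TextIO.Read.from("gs://my-bucket/input/*"))  // placeholder input
         // The same windowing statement applies to batch and streaming data:
         // elements are grouped into fixed one-minute windows by timestamp.
         .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
         .apply(Count.<String>perElement())  // counts computed per window
         .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
           @Override
           public void processElement(ProcessContext c) {
             c.output(c.element().getKey() + ": " + c.element().getValue());
           }
         }))
         .apply(TextIO.Write.to("gs://my-bucket/output/counts"));  // placeholder output

        p.run();
      }
    }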
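
Portability is reflected in the SDK's pipeline-runner abstraction: pipeline code is written once against PipelineOptions, and the execution environment is chosen at launch time rather than baked into the program. A brief sketch, assuming the SDK's built-in runners:

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.options.PipelineOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

    public class RunnerChoice {
      public static void main(String[] args) {
        // The runner is a command-line choice, not part of the pipeline code:
        //   --runner=DirectPipelineRunner    runs locally, in process
        //   --runner=DataflowPipelineRunner  runs on the managed Cloud Dataflow service
        // Runners for other execution engines can plug in through the same mechanism.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);
        // ... construct the pipeline as usual, then:
        p.run();
      }
    }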

We look forward to collaboratively building a system that enables distributed data processing for users from all backgrounds. We encourage developers to check out the Dataflow SDK for Java on GitHub and contribute to the community.

- Posted by Sam McVeety, Software Engineer