Today’s guest post comes from Martin Van Ryswyk, Vice President of Engineering at DataStax.

The cloud promises many things for database users: transparent elasticity and scalability, high availability, lower cost and much more. As customers evaluate their cloud options -- from porting a legacy RDBMS to the cloud to solutions born in the cloud -- we would like to share our experience from running more than 300 customers’ live systems in a cloud-native way.

At DataStax, we drive Apache Cassandra™. Cassandra is a massively scalable, open-source NoSQL database designed from the ground up for the cloud and built to excel at serving modern online applications. Cassandra easily manages the distribution of data across multiple data centers and cloud availability zones, can add capacity to live systems without impacting your application’s availability and provides extremely fast read/write operations.
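Cassandra’s cross-data-center distribution is configured per keyspace. As a minimal sketch (the keyspace name, data center names and replica counts here are hypothetical, and the data center names must match what your cluster’s snitch reports):

```shell
# Keep two replicas of each row in each of two data centers.
cqlsh -e "CREATE KEYSPACE app
          WITH replication = {'class': 'NetworkTopologyStrategy',
                              'dc-east': 2, 'dc-west': 2};"
```

With this in place, Cassandra replicates writes to both data centers automatically; no application-level changes are needed.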

One of the advantages of Google Compute Engine is its use of Persistent Disks. When an instance is terminated, its data persists on the disk, which can be re-attached to a new instance. This gives great flexibility to Cassandra users. For example, you can upgrade a node to a higher CPU/Memory limit without re-replicating the data or recover from the loss of a node without having to stream all of the data from other nodes in the cluster.
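As a concrete sketch of that recovery path using today’s `gcloud` CLI (the instance, disk and zone names below are hypothetical):

```shell
# Delete the failed instance but keep its persistent data disk.
gcloud compute instances delete cass-node-3 \
    --zone us-central1-a --keep-disks=data

# Attach the surviving disk to a freshly created replacement instance.
gcloud compute instances attach-disk cass-node-3b \
    --disk cass-data-3 --zone us-central1-a
```

Because the SSTables on the disk survive the instance, the replacement node can rejoin the ring with its data intact; only its IP address changes.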

DataStax and Google engineers recently collaborated on running DataStax Enterprise (DSE) 3.2 on Google Compute Engine. The goal was to understand the performance customers can expect on Google’s Persistent Disk, for which Google recently announced new performance and pricing tiers. DataStax Enterprise supports a purely cloud-native solution and can span on-premise and cloud instances for customers wanting a hybrid solution.

Tests and results of DataStax Enterprise on Google Compute Engine
We were very interested to see how consistent the latency would be on Persistent Disks, since they offer highly consistent storage with predictable and competitive pricing. Our tests started at the operational level and then moved into testing the robustness of our cluster (Cassandra ring) during failure and I/O under heavy load. All tests were run by DataStax, with Google providing configuration guidance. The resulting configuration file and methodology can be found here.

The key to consistent latency in Google Compute Engine is sizing one’s cluster so that each node stays within the throughput limits. Taking that guidance with our recommended configuration, we believe the results are readily replicable and applicable to your application. We tested three scenarios, all with positive outcomes:
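The sizing arithmetic is straightforward; using the figures from the longevity test below, the per-node load is simply the cluster-wide target divided by the node count:

```shell
# Sizing sketch using the figures from the 100-node longevity test.
TOTAL_RPS=6000   # cluster-wide target, records/sec
NODES=100        # instances spread across two zones
PER_NODE_RPS=$(( TOTAL_RPS / NODES ))
echo "per-node load: ${PER_NODE_RPS} records/sec"   # prints 60
```

The same division, run against your own target throughput and the per-disk limits Google publishes, tells you whether each node stays within its threshold.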
  1. Operational stability of 100 nodes spread across two physical zones.
    • Objective: longevity test at 6,000 records per second (60 records/sec/node) for 72 hours.
    • Results: we saw trouble-free operation; all data tests completed without issue, and replication streamed data across both zones without incident.
  2. Robustness during a reboot/failure through reconnecting Persistent Disks to an instance.
    • Objective: measure impact of terminating a node and re-connecting its disk to a new node.
    • Results: new nodes joined the Cassandra ring without having to be repaired and with no data loss (no streaming required). We did need to manage IP address changes for the new node.
  3. Disk performance limits for a three-node cluster.
    • Objective: measure response under load when approaching the disk throughput limit.
    • Results: Our tests showed a good distribution of latency, and 90% of the I/O write times were less than 8ms (see figures below depicting the median latency and latency distribution). These results held while our load remained within the published throughput (I/O) thresholds (see caps for thresholds).
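The latency results above are percentile summaries. As a small sketch of how such summaries are computed by the nearest-rank method (the latency samples below are made up for illustration):

```shell
# Hypothetical write-latency samples in milliseconds.
samples="3 2 4 1 3 2 8 3 4 2"
sorted=$(printf '%s\n' $samples | sort -n)
n=$(printf '%s\n' $samples | wc -l)

# Median (p50) and 90th percentile (p90) by nearest rank.
p50=$(echo "$sorted" | sed -n "$(( (n + 1) / 2 ))p")
p90=$(echo "$sorted" | sed -n "$(( (n * 9 + 9) / 10 ))p")
echo "p50=${p50}ms p90=${p90}ms"   # prints p50=3ms p90=4ms
```

A p90 well under the disk’s published limits, as in test 3, is the signal that the cluster is sized with headroom to spare.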
What’s next
We find Google Compute Engine and the implementation of Persistent Disks to be very promising as a platform for Cassandra. The next step in our partnership will be more extensive performance benchmarking of DataStax Enterprise. We look forward to publishing the results in a future blog post.

Figures for reference
The graph below shows median latency, a figure of merit indicating how much time it takes to satisfy a write request (in milliseconds):

The figure below depicts the distribution of write latencies (ms). As noted above, 90% of write latencies were below 8ms, indicating the consistency of performance. The tight distribution within 1-4ms speaks to the predictability of performance.

-Contributed by Martin Van Ryswyk, Vice President of Engineering, DataStax