Cloud Native: What It Means in the Data World

Prior to Rockset, I spent eight years at Facebook building out their big data infrastructure and online data infrastructure. All the software we wrote was deployed in Facebook‘s private data centers, so it was not till I started building on the public cloud that I fully appreciated its true potential.
Facebook may be the very definition of a web-scale company, but getting hardware still required huge lead times and extensive capacity planning. The public cloud, in contrast, provides hardware through the simplicity of API-based provisioning. It offers, for all intents and purposes, infinite compute and storage, requested on demand and relinquished when no longer needed.
I came to a simple realization about the power of cloud economics. In the cloud, the price of using 1 CPU for 100 minutes is the same as that of using 100 CPUs for 1 minute. If a data processing task that takes 100 minutes on a single CPU could be reconfigured to run in parallel on 100 CPUs in 1 minute, then the price of computing this task would remain the same, but the speedup would be tremendous!
Recent evolutions of data processing state of the art have each sought to exploit prevailing hardware trends. Hadoop and RocksDB are two examples I’ve had the privilege of working on personally. The falling price of SATA disks in the early 2000s was one major factor for the popularity of Hadoop, because it was the only software that could cobble together petabytes of these disks to provide a large-scale storage system. Similarly, RocksDB blossomed because it leveraged the price-performance sweet spot of SSD storage. Today, the hardware platform is in flux once more, with many applications moving to the cloud. This trend towards cloud will again herald a new breed of software solutions.
The next iteration of data processing software will exploit the fluid nature of hardware in the cloud. Data workloads will grab and release compute, memory, and storage resources, as needed and when needed, to meet performance and cost requirements. But data processing software has to be reimagined and rewritten for this to become a reality.
Cloud-native data platforms should scale dynamically to make use of available cloud resources. That means a data request needs to be parallelized and the hardware required to run it instantly acquired. Once the necessary tasks are scheduled and the results returned, the platform should promptly shed the hardware resources used for that request.
Simply processing in parallel does not make a system cloud friendly. Hadoop was a parallel-processing system, but its focus was on optimizing throughput of data processed within a fixed set of pre-acquired resources. Likewise, many other pre-cloud systems, including MongoDB and Elasticsearch, were designed for a world in which the underlying hardware, on which they run, was fixed.
The industry has recently made inroads designing data platforms for the cloud, however. Qubole morphed Hadoop to be cloud friendly, while Amazon Aurora and Snowflake built cloud-optimized relational databases. Here are some architectural patterns that are common in cloud-native data processing:
Use of shared storage rather than shared-nothing storage
The previous wave of distributed data processing frameworks was built for non-cloud infrastructure and utilized shared-nothing architectures. Dr. Stonebraker has written about the advantages of shared-nothing architectures since 1986 (The Case for Shared Nothing), and the advent of HDFS in 2005 made shared-nothing architectures a widespread reality. At about the same time, other distributed software, like Cassandra, HBase, and MongoDB, which used shared-nothing storage, appeared on the market. Storage was typically JBOD, locally attached to individual machines, resulting in tightly coupled compute and storage.
But in the cloud era, object stores have become the dominant storage.


