Hortonworks unveils roadmap to make Hadoop cloud-native
- by 7wData
It would be an understatement to say that the world has changed since Hadoop debuted just over a decade ago. Rewind the tape five to ten years: if you wanted to work with Big Data, Hadoop was pretty much the only game in town. Open source software was the icing on the cake of cheap compute and storage infrastructure, which together made processing and storing petabytes of data thinkable.
Since then, storage and compute have continued to get cheaper. So has bandwidth, as 10 GbE connections have supplanted the 1 GbE connections that were the norm a decade ago. The cloud, edge computing, smart devices, and the Internet of Things have changed the Big Data landscape, while dedicated Spark and AI services offer alternatives to firing up full Hadoop clusters. Capping it off, as we previously noted, cloud storage has become the de facto data lake.
Today you can run Hadoop in the cloud, but Hadoop does not yet fully exploit the capabilities of the cloud. Aside from slotting in S3 or other cloud storage in place of HDFS, it does not take full advantage of cloud architecture. Making Hadoop cloud-native is not a matter of buzzword compliance; it is about making the platform more fleet-footed.
The need for Hadoop to get there is attributable not simply to competition from bespoke big data cloud services, but to the inevitability of cloud deployment. In addition to cloud-based Hadoop services from the usual suspects, we estimate that about 25% of workloads from the Hadoop incumbents -- Cloudera, Hortonworks, and MapR -- are currently running in the cloud. More importantly, we predict that by next year half of all new big data workloads will be deployed in the cloud.
So what is it like to work with Hadoop in the cloud today? It can take 20 minutes or more to provision a cluster with all its components. That runs counter to the expectation of being able to fire up a Spark or machine learning service within minutes -- or less. That is where containerization and microservices come in: they can isolate workloads or entire clusters, making multi-tenancy real, and they can make launching Hadoop workloads far more efficient.
Another key concept for cloud operation is separating compute from storage. This actually flies in the face of Hadoop's original design pattern, where the idea was to bring compute to the data to minimize data movement. Today, the pipes have grown fat enough to make that almost a non-issue. As noted above, separating compute from storage is already standard practice in most managed cloud-based Hadoop services, although in EMR, Amazon does provide the option of running HDFS.
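The separation can be sketched in a few lines. The snippet below is purely illustrative (not a Hadoop or AWS API): a dict-backed `ObjectStore` class stands in for remote object storage such as S3, and a compute task pulls its input from the store rather than relying on data living on its local disks, so compute nodes can be added or torn down independently of where the data sits.

```python
# Illustrative sketch of compute/storage separation. "ObjectStore" is a
# stand-in for S3-style storage; it is not a real Hadoop or AWS class.

class ObjectStore:
    """Data lives here, independent of any compute node."""
    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


def word_count(store, key):
    """A 'compute' task that can run anywhere: it fetches its input from
    the store (over the network, in a real deployment) instead of
    assuming data-local disks, as classic HDFS-era Hadoop did."""
    counts = {}
    for word in store.get(key).split():
        counts[word] = counts.get(word, 0) + 1
    return counts


store = ObjectStore()
store.put("logs/part-0000", "error warn error info")
print(word_count(store, "logs/part-0000"))  # {'error': 2, 'warn': 1, 'info': 1}
```

Because the task only needs a reference to the store, an elastic cluster can spin up many such workers against the same data and discard them afterwards, which is exactly what the decoupled cloud model enables.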
We're still in the early days of making Hadoop container-friendly. MapR fired the first shot with its support of persistent containers in its platform, allowing you to isolate workloads to reduce contention for resources. Hadoop 3.1 in turn now lets you launch Docker containers from YARN. But while Kubernetes will inevitably be on Hadoop's roadmap, there is no timeline yet for when it will make it into the trunk.
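As a rough illustration of what the YARN-side Docker support involves, the Hadoop 3.1 documentation describes enabling the Docker runtime on NodeManagers via `yarn-site.xml`; a minimal fragment looks something like the following (exact values vary by deployment, and the container-executor must also be configured separately):

```xml
<!-- yarn-site.xml fragment: allow YARN's Linux container runtime to
     launch Docker containers alongside the default runtime. -->
<property>
  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
  <value>default,docker</value>
</property>
```

Applications then opt in at submission time through environment variables such as `YARN_CONTAINER_RUNTIME_TYPE=docker` and `YARN_CONTAINER_RUNTIME_DOCKER_IMAGE`, which tell YARN to run a given container inside the named Docker image rather than as a bare process.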