Hortonworks unveils roadmap to make Hadoop cloud-native

It would be an understatement to say that the world has changed since Hadoop debuted just over a decade ago. Rewind the tape 5 to 10 years, and if you wanted to work with Big Data, Hadoop was pretty much the only platform game in town. Open source software was the icing on the cake of cheap compute and storage infrastructure that made processing and storing petabytes of data thinkable.

Since then, storage and compute have continued to get cheaper. But so has bandwidth, as 10 GbE connections have supplanted the 1 GbE connections that were the norm a decade ago. The cloud, edge computing, smart devices, and the Internet of Things have changed the big data landscape, while dedicated Spark and AI services offer alternatives to firing up full Hadoop clusters. And capping it off, as we previously noted, cloud storage has become the de facto data lake.

Today you can run Hadoop in the cloud, but Hadoop is not yet a platform that fully exploits the cloud's capabilities. Aside from slotting in S3 or other cloud storage in place of HDFS, it does little to take advantage of cloud architecture. Making Hadoop cloud-native is not a matter of buzzword compliance, but of making it more fleet-footed.

The need for Hadoop to get there stems not only from competition from other bespoke big data cloud services, but also from the inevitability of cloud deployment. In addition to cloud-based Hadoop services from the usual suspects, we estimate that about 25% of workloads from Hadoop incumbents -- Cloudera, Hortonworks, and MapR -- are currently running in the cloud. But more importantly, we predict that by next year, half of all new big data workloads will be deployed in the cloud.

So what's it like to work with Hadoop in the cloud today? It can often take 20 minutes or more to provision a cluster with all the components. That runs counter to the expectation of being able to fire up a Spark or machine learning service within minutes -- or less. That is where containerization and microservices come in -- they can isolate workloads or entire clusters, making multi-tenancy real. And they can make it far more efficient to launch Hadoop workloads.

Another key concept for cloud operation is separating compute from storage. This actually flies in the face of Hadoop's original design pattern, where the idea was to bring compute to the data to minimize data movement. Today, the pipes have grown fat enough to make that almost a non-issue. As noted above, separating compute from storage is already standard practice with most managed cloud-based Hadoop services, although in EMR, Amazon does provide the option of running HDFS.
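
To put rough numbers on the bandwidth point, here is a minimal back-of-the-envelope sketch in Python. It compares the ideal time to move a dataset over a 1 GbE link versus a 10 GbE link. The 1 TB dataset size and the `transfer_seconds` helper are illustrative assumptions, not figures from any vendor, and the calculation uses theoretical line rate with no allowance for protocol overhead, contention, or storage throughput.

```python
def transfer_seconds(data_bytes: float, link_gbps: float) -> float:
    """Ideal time to move data_bytes over a link of link_gbps gigabits per second.

    Assumes the full line rate is available; real transfers will be slower.
    """
    bits = data_bytes * 8
    return bits / (link_gbps * 1e9)


ONE_TB = 1e12  # decimal terabyte; a hypothetical dataset size for illustration

for gbps in (1, 10):
    secs = transfer_seconds(ONE_TB, gbps)
    print(f"{gbps:>2} GbE: {secs / 60:.1f} minutes")
```

Under these assumptions, the same terabyte that would tie up a 1 GbE link for over two hours moves in under a quarter of an hour at 10 GbE, which helps explain why keeping compute next to the data matters less than it once did.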

We're still in the early days of making Hadoop container-friendly. MapR fired the first shot with its support of persistent containers in its platform, allowing you to isolate workloads to reduce contention for resources. Hadoop 3.1 in turn now lets you launch Docker containers from YARN. But while Kubernetes will inevitably be on Hadoop's roadmap, there is no timeline yet for when it will make it into the trunk.
