Why some Data Lakes are built to last

by 7wData
December 31, 2016

Opinion
Hadoop-based data lakes can be game changers, but too many are underperforming. Here's a checklist to make your data lake a wild success.
CIO | Jun 10, 2016 7:08 AM PT
Hadoop-based data lakes can be game changers: better, cheaper and faster integrated enterprise information. Knowledge workers can access data directly, project cycles are measured in days rather than months, and business users can leverage a shared data source rather than creating stand-alone sandboxes or warehouses.
Unfortunately, more than a few data lake projects are off track. Data is going in, but it’s not coming out, at least not at the pace envisioned. What’s the chokepoint? It tends to be some combination of poor manageability, data quality and security concerns, unpredictable performance, and a shortage of skilled data engineers.
What distinguishes data lakes that are “enterprise class”, that is, the ones that are built to last and attract hundreds of users and uses? First, let’s look at the features that are table stakes: what makes a data lake a data lake. Next, we will describe the capabilities that make a first-class data lake, one that is built to last.
Table stakes
Hadoop – the open source software framework for distributed storage and distributed processing of very large data sets on computer clusters. The base Apache Hadoop framework includes Hadoop Common, which contains libraries and utilities needed by other Hadoop modules; HDFS, a distributed file system that stores data on commodity machines; YARN, a resource-management platform for managing computing resources; and an implementation of the MapReduce programming model for large-scale data processing.
Commodity Compute Clusters – whether on premises or in the cloud, Hadoop runs on low-cost commodity servers that rack and stack and virtualize. Scaling is easy and inexpensive. The economics of open source, massively parallel software combined with low-cost hardware deliver the promise of intelligent applications on truly big data.
All Data / Raw Data – the data lake design philosophy is to land and store all data in raw format from source systems: structured enterprise data from operational systems, semi-structured machine-generated and web log data, social media data, and so on.
Schema-less writes – this point in particular is a breakthrough. Whereas traditional data warehouses are throttled by the time and complexity of data modeling, data lakes land data in source format. Instead of taking weeks (or worse), data can be gathered and offered up in short order. Schemas are applied on read, pushing that analytic or modeling work to analysts.
Open source tools – the evolving toolkit of programming, querying, and scripting languages and frameworks for ingesting and integrating data, building analytic apps, and accessing data (e.g., Spark, Pig, Hive, Python, Sqoop, Flume, MapReduce, R, Kafka, Impala, YARN, Kite, and many more).
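To make the schema-less write and schema-on-read idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not how Hadoop or Hive implement it: the in-memory list stands in for a raw landing zone, and the names `land`, `read_with_schema`, and `billing_view` are hypothetical.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator, List

# Hypothetical raw landing zone: lines are kept exactly as the source emitted them.
RAW_LANDING_ZONE: List[str] = []


def land(raw_line: str) -> None:
    """Schema-less write: append the record untouched, with no validation or modeling."""
    RAW_LANDING_ZONE.append(raw_line)


# A read-time schema maps a field name to a converter; each analyst can define their own.
Schema = Dict[str, Callable[[Any], Any]]


def read_with_schema(lines: Iterable[str], schema: Schema) -> Iterator[Dict[str, Any]]:
    """Schema on read: parse and type each record only when it is queried."""
    for line in lines:
        record = json.loads(line)
        # Fields absent from a record come back as None instead of failing the read.
        yield {
            field: (convert(record[field]) if field in record else None)
            for field, convert in schema.items()
        }


# Two source records with different shapes land without any up-front modeling.
land('{"user": "ana", "amount": "19.90", "ts": "2016-06-10T07:08:00"}')
land('{"user": "raj", "amount": "5", "channel": "web"}')

# The analyst decides at read time which fields matter and how to type them.
billing_view: Schema = {"user": str, "amount": float}
rows = list(read_with_schema(RAW_LANDING_ZONE, billing_view))
```

The trade-off is visible in the sketch: landing data is trivial, while every reader carries the burden of deciding which fields to trust and how to interpret them.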
Enterprise class
If the table stakes listed above define a data landing area, the following differentiate a data lake that is expandable, manageable, and industrial strength:
Defined Data and Refined Data – where basic data lakes contain only raw data, advanced lakes contain Defined and Refined data as well. Defined Data has a schema, and that schema is registered in Hadoop’s HCatalog. Since most data comes from source systems with structured schemas, it’s eminently practical to leverage those.
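As a rough illustration of what "Defined Data" means in practice, the sketch below registers a source system's schema once and then uses it to turn raw delimited rows into typed records. The dictionary is only an in-memory stand-in for a metadata catalog and is not the HCatalog API; the `orders` table, its columns, and the helper names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Column = Tuple[str, Callable[[str], object]]

# In-memory stand-in for a metadata catalog (an assumption; NOT the HCatalog API).
CATALOG: Dict[str, List[Column]] = {}


def register_schema(table: str, columns: List[Column]) -> None:
    """Record the column names and types that the source system already defines."""
    CATALOG[table] = list(columns)


def to_defined(table: str, raw_row: str, delimiter: str = "|") -> Dict[str, object]:
    """Promote one raw delimited row to a typed record using the registered schema."""
    columns = CATALOG[table]
    values = raw_row.split(delimiter)
    if len(values) != len(columns):
        raise ValueError(
            f"{table}: expected {len(columns)} columns, got {len(values)}"
        )
    return {name: cast(value) for (name, cast), value in zip(columns, values)}


# Hypothetical schema carried over from an operational order system.
register_schema("orders", [("order_id", int), ("customer", str), ("total", float)])
defined_row = to_defined("orders", "1001|acme|250.75")
```

Because the schema lives in one shared place, every consumer reads the same column names and types instead of re-deriving them from the raw files.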
