5 Factors to Consider before Pouring Data in your Data Lake

As organizations move into a Big Data world, many projects will include a Data Lake component. What is a data lake, and how do we get our data into the Data Lake?

There are many different interpretations of a Data Lake, but the one given in the Data Lake Adoption and Maturity Survey Findings Report, jointly produced by Attunity and Hortonworks, is a good explanation because it treats the Data Lake as a strategy as well as an architecture.

The report defines the data lake as both an architectural strategy and an architectural destination. The definition therefore covers the end-state architecture and also establishes an adoption and transformation strategy for data architecture decisions on the journey to the data lake.

To realize the investment in a Data Lake, the data must get into it somehow. If you believe the hype, organizations should simply be able to pour data into the Lake without worrying about how it joins together. Things are not that simple, however. Data in the lake is stored as raw flat files in the Hadoop Distributed File System (HDFS), and one thing analysts find difficult to grasp is that this storage model is different from what they know, which affects them at the point of query. That is before the added complexity of combining Data Lake data with data from in-memory databases, for example. Moving from rectangular data to the non-rectangular data held in Hadoop is a real mind-shift for business users. People are going from a relational world to a batch processing world and have to mix both. This will not be straightforward for many organizations, which already struggle with rectangular data.
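To make the rectangular versus non-rectangular distinction concrete, here is a minimal Python sketch. It assumes raw data arrives as one nested JSON document per line; the order records and the field names (`customer`, `items`, `sku`, `qty`) are hypothetical and chosen only for illustration. The point is that structure has to be imposed at read time before the data looks like the rows and columns analysts expect.

```python
import json

# Hypothetical raw records, one JSON document per line, as they might land in a lake.
RAW_LINES = [
    '{"id": 1, "customer": {"name": "Ann", "country": "UK"}, '
    '"items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}',
    '{"id": 2, "customer": {"name": "Bob"}, "items": []}',
]


def flatten_order(record):
    """Turn one nested order into zero or more rectangular rows (one per item)."""
    customer = record.get("customer", {})
    rows = []
    for item in record.get("items", []):
        rows.append({
            "order_id": record["id"],
            "customer_name": customer.get("name"),
            # Raw data is not guaranteed to carry every field.
            "customer_country": customer.get("country"),
            "sku": item["sku"],
            "qty": item["qty"],
        })
    return rows


def read_rows(lines):
    """Parse each raw line and collect the flattened rows."""
    rows = []
    for line in lines:
        rows.extend(flatten_order(json.loads(line)))
    return rows


rows = read_rows(RAW_LINES)
for row in rows:
    print(row)
```

Note that the second order has no items and so produces no rows at all. Decisions like this one (drop it, or keep it with empty item columns?) are exactly what analysts face at the point of query when the data was poured in raw.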

The complexity lies in getting the data into the Data Lake in the first place. A data lake is primarily a collection of data services, but the raw data it holds takes up more space and is much more difficult for analysts to navigate. That makes the data harder to query, and querying will not happen at the speed of the business. It also becomes harder to take machine-scale data and make it human-scale, so that it can be summarized, sliced and diced, and compressed for business decision making. To address these issues, the cloud offers room for experimentation and exploration and is well suited to Data Lake implementations. Locating a Data Lake in the cloud can, however, add another point of complexity when getting the data in.
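As a rough illustration of turning machine-scale data into something human-scale, the sketch below averages event-level readings per sensor and hour. The event tuples, sensor names and the `summarize` helper are all hypothetical; a real lake would run this kind of aggregation with a batch or query engine over far more data.

```python
from collections import defaultdict

# Hypothetical machine-scale events: (sensor_id, hour_of_day, reading).
EVENTS = [
    ("s1", 9, 20.0),
    ("s1", 9, 22.0),
    ("s1", 10, 30.0),
    ("s2", 9, 10.0),
]


def summarize(events):
    """Average the readings for each (sensor, hour) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of readings, count]
    for sensor, hour, reading in events:
        bucket = totals[(sensor, hour)]
        bucket[0] += reading
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sorted(totals.items())}


summary = summarize(EVENTS)
for (sensor, hour), average in summary.items():
    print(sensor, hour, average)
```

Four raw events collapse into three summary rows here. The same idea, applied to millions of events, is what makes the data small enough to slice and dice for decision making.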

Who actually owns the data going into the Data Lake? IT retains overall responsibility for the guardianship of the data in the Data Lake, and for cataloguing the data for retrieval. The business analysts and data scientists are responsible for the usage of the Data Lake, surfing this unstructured data repository to answer business questions by progressively adding structure to the data. This data lifecycle alters how the data pipeline must work to get data into the Data Lake as a first step.

One key opportunity offered by Data Lakes is the ability to break down silos in the organization. Information can be amassed from different departments and from external sources, and co-located to create a wider organizational Big Data program with a focus on Analytics. Many organizations have a series of dirty data puddles rather than a data lake, and this isn’t conducive to business insights driven by data.

Any strategy will need to reflect that the landscape of data is fluid and changing. These data puddles will need to be incorporated into the Data Lake as part of the data lake creation process, and organizations will need to review their in-house legacy systems. For larger organizations, the in-house data warehouse is a key concern with more immediate short-term relevance. All of this happens before the business has the opportunity to create queries on its data lake.
