5 Questions to Ask When Building a Cloud Data Lake Strategy

5 Questions to Ask When Building a Cloud Data Lake Strategy

Shiyi Gu is the Product Marketing Manager for Big Data at Talend. Shiyi brings her expertise in Data Integration, Big Data and NoSQL, and is passionate about open source technologies. She loves helping customers connect the dots between technology and business value.

In my last blog post, I shared some thoughts on the common pitfalls when building a Data Lake. As the movement to the cloud gets more and more common, I’d like to further discuss some of the best practices when building a cloud Data Lake strategy. When going beyond the scope of integration tools or platforms for your cloud data lake, here are 5 questions to ask, that can be used as a checklist:

As many differences as there are between the two, people often times compare the two types of technology approaches. Data warehouses being the centralization of structured data, and Data Lakes often times being the holy grail of all types of data. (You can read more about the two approaches here.)

Not to confuse the two, as these technology approaches should actually be brought together. You will need a data lake to accommodate all types of data that your business deal with today, make it structured, semi-structured or unstructured, on-premise or in the cloud, or those newer types of data such as IoT data. The data lake often time has a landing zone and staging zone for raw data – data at this stage are not yet consumable, but you may want to keep them for future discovery or data science projects. On the other hand, a cloud Data warehouse will be in the picture after data is cleansed, mapped and transformed, so that it is more consumable for business analysts to access and make the use of data for reporting or other analytical use. Data at this stage is often time highly processed to adjust to the Data warehouse.

If your approach currently only works with a cloud data warehouse, then often time you are losing raw and some formats of data already, it is not so helpful for any prescriptive or advanced analytics projects, or Machine Learning and AI initiatives as some meanings within the data is already lost. Vice versa, if you don’t have a data warehouse alongside with your data lake strategy, you will end up with a data swamp where all data is kept with no structure, and not consumable by analysts.

From the integration perspective, make sure your integration tool work with both data lake and data warehouse technologies, which will lead us to the next question. 

As much as you may know about ETL in your current on-premises data warehouse, moving it to the cloud is a different story, not to mention in a cloud data lake context. Where and how data is processed really depends on what you need for your business.

Similar to what we described in the first question, sometimes you need to keep more of the raw nature of the data, and other times you need more processing. This would require your integration tool to cope with both ETL and ELT capabilities, where the data transformation can be handled either before the data is loaded to your final target, e.g. a cloud data warehouse, or after data is landed there. ELT is more often leveraged when the speed of data ingestion is key to your project, or when you want to keep more intel about your data. Typically, cloud data lakes have a raw data store, then a refined (or transformed) data store. Data scientists, for example, prefer to access the raw data, whereas business users would like the normalized data for business intelligence.

Another use of ELT refers to the massive parallel processing capabilities coming with big data technologies such as Spark and Flink. If your use case requires such a strong processing power, then ELT is a better choice where the processing has more scalability.

Share it:
Share it:

[Social9_Share class=”s9-widget-wrapper”]

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

You Might Be Interested In

5 Industries That Artificial Intelligence and Machine Learning Are Transforming

25 Dec, 2017

Ever since the field of artificial intelligence (AI) research was first established as an academic discipline in the 1950s, scientists …

Read more

How restaurants will use artificial intelligence to boost sales

11 Nov, 2018

Tracking technology can help restaurants plan efficiencies around staffing and serving, as well as help them market their restaurant effectively. …

Read more

Marketing owns the Customer Data! Does it?

26 Feb, 2017

The customer, the elusive entity that every business is about – or at least should be about. The customer gets …

Read more

Recent Jobs

Senior Cloud Engineer (AWS, Snowflake)

Remote (United States (Nationwide))

9 May, 2024

Read More

IT Engineer

Washington D.C., DC, USA

1 May, 2024

Read More

Data Engineer

Washington D.C., DC, USA

1 May, 2024

Read More

Applications Developer

Washington D.C., DC, USA

1 May, 2024

Read More

Do You Want to Share Your Story?

Bring your insights on Data, Visualization, Innovation or Business Agility to our community. Let them learn from your experience.

Get the 3 STEPS

To Drive Analytics Adoption
And manage change

3-steps-to-drive-analytics-adoption

Get Access to Event Discounts

Switch your 7wData account from Subscriber to Event Discount Member by clicking the button below and get access to event discounts. Learn & Grow together with us in a more profitable way!

Get Access to Event Discounts

Create a 7wData account and get access to event discounts. Learn & Grow together with us in a more profitable way!

Don't miss Out!

Stay in touch and receive in depth articles, guides, news & commentary of all things data.