Not up for a data lake? Analyze in place

On the surface, data lakes seem like a great idea. Instead of planning and building complex integrations and carefully constructed data models for analytics, you simply copy everything by default to commodity storage running HDFS (or Gluster or the like) -- and worry about schemas and such when you decide what job you want to run.

Yet making a full copy of your data simply to analyze it still gives people -- particularly those in authority -- pause. The infrastructure for a data lake may cost pennies on the dollar compared to a SAN, but it’s a chunk of change nonetheless, and there are many security, synchronization, and regulatory issues to consider.

That’s why I think this year’s buzzphrase will be “analyze in place” -- that is, using distributed computing technologies to do the analysis without copying all of your data onto another storage medium. To be truthful, “analyze in place” is not a 100 percent accurate description. The idea is to analyze the data “in memory” somewhere else, but in collaboration with the existing system rather than replacing it entirely.

The chief benefit of analyzing in place is that it avoids an incredible feat of social engineering. (My department, my data, my system, my zone of control, and my accountability -- so no, you can't have a full copy, because I said so, and I have the authority to say no, and if I don't, I'll slow-walk you to oblivion because I've outlasted kids like you for decades.) You also get security in context, simpler operations (there's no second storage system to administer), and more.

There are plenty of good reasons to use a distributed file system -- and, frankly, SANs were a big fat farce foisted upon us all. However, "I just want to analyze the data I already store elsewhere" may not always be one of them.

The core challenges for analyzing in place will be the following:

There's no way around the load and latency costs. If I can analyze terabytes of data in seconds but can move only a gigabyte at a time to my Spark cluster, the total operation time is those seconds plus the copy time. And you still have to pick the data out of the source system in the first place.
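To make the point concrete, here's a back-of-envelope sketch. The throughput and timing figures are invented for illustration, not measurements:

```python
# Back-of-envelope: copy time dominates when the pipe into the cluster is narrow.
# All numbers below are made up for illustration.
data_tb = 2           # terabytes we want to analyze
copy_gb_per_s = 1.0   # effective rate at which we can move data into the cluster
analyze_s = 10        # in-memory analysis time once the data has arrived

copy_s = data_tb * 1024 / copy_gb_per_s
print(f"copy: {copy_s:,.0f}s, analyze: {analyze_s}s, total: {copy_s + analyze_s:,.0f}s")
# copy: 2,048s, analyze: 10s, total: 2,058s -- the "fast" analysis is a rounding error
```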

We use the phrase "predicate pushdown" to mean that your analytics system passes the "where" clause along to the source system, so the source filters the data before shipping it over the network. Of course, if you don't have an index on your RDBMS, the time to get even that filtered data to your analytics system will be roughly the time it takes for your hair to fall out.
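Here's a minimal sketch of what that looks like in Spark. The connection URL, table, and credentials are hypothetical -- substitute your own -- but the JDBC options and the pushdown behavior are standard Spark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

# Hypothetical source system -- substitute your own URL, table, and credentials.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db.example.com:5432/sales")
          .option("dbtable", "orders")
          .option("user", "analyst")
          .option("password", "changeme")  # keep real secrets out of code
          .load())

# Spark translates this filter into a WHERE clause in the SQL it sends to the
# database, so only matching rows cross the network.
recent = orders.filter("order_date >= '2016-01-01'")

# Check the physical plan for "PushedFilters" to confirm the pushdown happened.
recent.explain()
```

If the plan shows no pushed filters, Spark is copying the whole table and filtering it in its own memory -- exactly the cost you were trying to avoid.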

Right now, if you're doing this in Spark, you need to balance predicate pushdown (that is, network optimization) against bowling your source system over (execution costs). There is no magic: you're doing a giant query against the source system and copying the results into another cluster's memory for further analysis.
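One knob for striking that balance is how many parallel connections Spark opens against the source. A sketch, again with a hypothetical table and bounds; the partitioning options themselves are standard Spark JDBC settings:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-read-demo").getOrCreate()

# Hypothetical table and bounds; the partitioning options are standard Spark JDBC.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db.example.com:5432/sales")
          .option("dbtable", "orders")
          .option("user", "analyst")
          .option("password", "changeme")
          .option("partitionColumn", "order_id")  # a numeric or date column in the table
          .option("lowerBound", "1")
          .option("upperBound", "10000000")
          .option("numPartitions", "8")  # 8 parallel connections: faster copy, heavier load on the source
          .load())
```

Turn numPartitions up and the copy finishes sooner -- and the Oracle DBA shows up at your desk sooner, too.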

Sometimes to make this work you may have to shore up the source system. That takes time -- and, by the way, it's not a sexy big data project; it may just be a fatter server for the Oracle team. As a guy with a sales role for a consultancy, this gives me a headache because I hear "long sales cycle." It may also be costlier than the lake approach, and it will definitely be costlier in terms of labor. But the thing is, in many cases analyze in place is the only approach that can work.

Handling security when you have permission to use the analytics system and permission to use a source system -- potentially multiple source systems -- is complicated. I can see the FSM-awful Kerberos rules now! Until now, big data projects have tended to skirt this by grabbing a copy and designing a new security system around it, or by simply saying, "Pay no attention to the flat security model we got an exception for."

In the brave new world of analyze in place, we'll use terms like "federated" and "multitiered" to cover up "painful" and "complex," but there is another word: "necessary." Our organizational zones of control exist for a reason.

 
