Not up for a data lake? Analyze in place Blog

Not up for a data lake? Analyze in place

by 7wData
February 20, 2016

On the surface, data lakes seem like a smooth idea. Instead of planning and building complex integrations and carefully constructed data models for analytics, you simply copy everything by default to commodity storage running HDFS (or Gluster or the likes) -- and worry about schemas and so on when you decide what job you want to run.

Yet making a full copy of your data simply to analyze it still gives people -- particularly those in authority -- pause. The infrastructure for a data lake may cost pennies on the dollar compared to a SAN, but it’s a chunk of change nonetheless, and there are many security, synchronization, and regulatory issues to consider.

That’s why I think this year’s buzzphrase will be “analyze in place” -- that is, using distributed computing technologies to do the analysis without copying all of your data onto another storage medium. To be truthful, “analyze in place” is not a 100 percent accurate description. The idea is to analyze the data “in memory” somewhere else, but in collaboration with the existing system rather than replacing it entirely.

The core challenges for analyzing in place will be the following:

The chief benefit of analyzing in place is to avoid an incredible feat of social engineering. (My department, my data, my system, my zone of control, and my accountability, so no, you can’t have a full copy because I said so and I have the authority to say no and if I don’t I’ll slow-walk you to oblivion because I’ve outlasted kids like you for decades.) Also, you get security in context, simpler operations (not having another storage system to administer), and more.

There are plenty of good reasons to use a distributed file system -- and, frankly, SANs were a big fat farce upon us all. However, "I just want to analyze the data I already store elsewhere" may not always be one of those reasons.

There’s no way around load and latency cost. If I can analyze terabytes of data in seconds, but can move only a gigabyte at a time to my Spark cluster, then the total operation time is those seconds plus the copy time. Also, you still have to pick the data out of the source system.

We use the phrase "predicate pushdown" to mean if you don’t have an index on your RDBMS, then the time to get the data to your analytics system will be equal to the time it takes for your hair to fall out. Essentially your analytics system is passing the "where" clause along to the source system.

Right now, if you’re doing this in Spark, you need to balance the predicate pushdown (that is, network optimization) against pulling your source system over (execution costs). It’s exactly as if there is no magic and we're doing a giant query against the source system and copying it into another cluster’s memory for further analysis.

Sometimes to make this work you may have to shore up the source system. That may take time -- and by the way, it's not a sexy big data project. It may be a fatter server for the Oracle team. As a guy with a sales role for a consultancy, I get a headache from this because I hear “long sales cycle.” It may also be costlier than the lake approach; it will definitely be costlier in terms of labor. The deal is that in many cases, analyze in place is the only approach that can work.

Handling security when you have permission to use the analytics system, permission to use a source system -- and potentially multiple source systems -- is complicated. I can see the FSM-awful Kerberos rules now! Up until now, big data projects have tended to skirt around this by getting a copy and designing a new security system around it or simply “pay no attention to the flat security model we got an exception for.”

In the brave new world of analyze in place, we’ll use terms like “federated” and “multitiered” to cover up “painful” and “complex,” but there is another word: “necessary.” Our organizational zones of control exist for a reason.;

Do You Want to Share Your Story?

Bring your insights on Data, Visualization, Innovation or Business Agility to our community. Let them learn from your experience.

Not up for a data lake? Analyze in place

Leave a Reply Cancel reply

Upcoming Events

MarkLogic World | Amsterdam

Knowledge Graph — The Ultimate Center of Excellence

From Text to Value: Pairing Text Analytics and Generative AI

Bringing Data Closer to Decision Makers with Data Fabric

Categories

Tags

You Might Be Interested In

How To Build a Data Science Team Now

8 Business Process Analytics Every Manager Should Know

Putting customer data at the heart of your digital business

Recent Jobs

Senior Cloud Engineer (AWS, Snowflake)

IT Engineer

Data Engineer

Applications Developer

Do You Want to Share Your Story?

Join our community

Our Services

Company

Work With Us

Follow Us

Get the 3 STEPS

To Drive Analytics Adoption
And manage change

Get Access to Event Discounts

Switch your 7wData account from Subscriber to Event Discount Member by clicking the button below and get access to event discounts. Learn & Grow together with us in a more profitable way!

Get Access to Event Discounts

Create a 7wData account and get access to event discounts. Learn & Grow together with us in a more profitable way!

Don't miss Out!

Stay in touch and receive in depth articles, guides, news & commentary of all things data.

Not up for a data lake? Analyze in place

Leave a Reply Cancel reply

Upcoming Events

Categories

Tags

You Might Be Interested In

Recent Jobs

Do You Want to Share Your Story?

Join our community

Our Services

Company

Work With Us

Follow Us

Get the 3 STEPS

To Drive Analytics Adoption And manage change

Get Access to Event Discounts

Switch your 7wData account from Subscriber to Event Discount Member by clicking the button below and get access to event discounts. Learn & Grow together with us in a more profitable way!

Get Access to Event Discounts

Create a 7wData account and get access to event discounts. Learn & Grow together with us in a more profitable way!

Don't miss Out!

Stay in touch and receive in depth articles, guides, news & commentary of all things data.

To Drive Analytics Adoption
And manage change