Going with the stream: unbounded data processing with Apache Flink

Previously, we introduced streaming, saw some of the benefits it can bring and discussed some of the architectural options and vendors / engines that can support streaming-oriented solutions. We now focus on one of the key players in this space, Apache Flink, and the commercial entity that employs many of the Flink committers and provides Flink-related services, data Artisans (dA).

We talked with dA CEO and Flink PMC member, Kostas Tzoumas. Tzoumas, who has a solid engineering background and was one of the co-creators of Flink, was keen to elaborate on an array of topics: from the streaming paradigm itself and its significance for applications, to Flink's latest release and roadmap and dA's commercial offering and plans.

A few people, including Tzoumas, have made the case for seeing traditional, bounded data and processing as a special case of its unbounded counterparts. While this may seem like a theoretical construct, its implications can be far-reaching.

As Dean Wampler, author of Fast Data Architectures for Streaming Applications, argues, "if everything is considered a 'stream' -- either finite (as in batch processing) or unbounded -- then the same infrastructure doesn't just unify the batch and speed layers, but batch processing becomes a subset of stream processing."

This is a paradigm shift, argues Tzoumas, as it means that the database is no longer the keeper of the global truth.

The global truth is in the stream: an always-on, immutable flow of data that is processed by an unbounded processing engine. State becomes a view on that unbounded data, specific to each application and kept locally utilizing whatever storage makes sense for the application.
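As a rough illustration of state as an application-specific view on the stream, the sketch below folds a sequence of immutable events into local state. It is plain Python, not Flink code, and every name in it (`Event`, `materialize_view`, `unbounded_events`, the sample amounts) is a hypothetical chosen for the example. The same function accepts a finite list, which is the batch case, or an endless generator.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Dict, Iterable, Iterator


@dataclass(frozen=True)
class Event:
    """An immutable record in the stream (hypothetical shape)."""
    key: str
    amount: int


def materialize_view(events: Iterable[Event]) -> Iterator[Dict[str, int]]:
    """Fold events into local state, yielding a snapshot after each event."""
    balances: Dict[str, int] = {}
    for event in events:
        balances[event.key] = balances.get(event.key, 0) + event.amount
        yield dict(balances)  # copy, so earlier snapshots are not mutated


def unbounded_events() -> Iterator[Event]:
    """An endless stream that alternates between two keys."""
    n = 0
    while True:
        yield Event(key="a" if n % 2 == 0 else "b", amount=1)
        n += 1


# Bounded input (the "batch" case): consume the stream to its end.
bounded = [Event("a", 10), Event("b", 5), Event("a", -3)]
final_view = list(materialize_view(bounded))[-1]

# Unbounded input: same function, but only a prefix is ever observed.
view_after_five = list(islice(materialize_view(unbounded_events()), 5))[-1]
```

In this sketch the view is just a dictionary held in memory; the point is only that it can be rebuilt at any time by replaying the stream, which is what lets each application keep its own state in whatever storage suits it.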

In this architecture, applications consume streams of unbounded data, but they also use streams to publish their own data that may in turn be consumed by other applications. So the streaming engine becomes the hub of the entire data ecosystem. According to Tzoumas:

"What most people think of when it comes to streaming is applications like real-time analytics or IoT. We do believe that these are super-important and we fully support them, however what unbounded processing has the potential to do is offer a new pathway for all applications whose nature fits the streaming model, and those go way beyond the typical examples one would think of. Basically, these are all applications that periodically update their data. They may not necessarily be real time -- they may have latency that goes into the hours range, but that's not the point. As long as there is an inflow of data, we see them as streaming data applications. These are operational, not analytics applications. So for me streaming is not in any way confined to the analytics world, or the Hadoop world for that matter."

Tzoumas further says that:

Tzoumas offers two reasons for this:

1) Time management. You can manage time correctly (by using event time and watermarks), so you can group records correctly, based on when an event occurred, and not just artificially, based on when an event was ingested or processed (which is very often wrong).

2) State management. By modeling your problem as a streaming problem, you can keep state across boundaries. So two events that arrive in different time intervals but still belong to the same logical group can still be correlated with each other.
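The two points above can be sketched together in a few lines. This is a simplified illustration in plain Python, not Flink's API: the window size, the out-of-order bound, and the names `count_by_event_time` and `arrivals` are all assumptions made for the example. Records are grouped by the time the event occurred, a watermark trailing the highest timestamp seen decides when a window is complete, and per-key state is kept across arrivals so that an out-of-order record still lands in the right group.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WINDOW_SIZE = 10       # event-time window length (hypothetical, in seconds)
MAX_OUT_OF_ORDER = 5   # how far the watermark trails the highest timestamp seen

Record = Tuple[str, int]        # (key, event_time), listed in arrival order
Result = Tuple[str, int, int]   # (key, window_start, count)


def count_by_event_time(records: Iterable[Record]) -> List[Result]:
    """Count records per key and event-time window, emitting a window
    only once the watermark has reached its end."""
    state: Dict[Tuple[str, int], int] = defaultdict(int)  # keyed window state
    results: List[Result] = []
    watermark = float("-inf")

    for key, event_time in records:
        window_start = event_time - event_time % WINDOW_SIZE
        if window_start + WINDOW_SIZE <= watermark:
            continue  # window already emitted; this sketch drops late records
        state[(key, window_start)] += 1
        watermark = max(watermark, event_time - MAX_OUT_OF_ORDER)
        # Emit every window whose end is at or before the watermark.
        for k, start in sorted(state):
            if start + WINDOW_SIZE <= watermark:
                results.append((k, start, state.pop((k, start))))

    # End of a bounded input: flush the windows that are still open.
    for k, start in sorted(state):
        results.append((k, start, state[(k, start)]))
    return results


# The event stamped 8 arrives after the one stamped 12, yet it is still
# counted in the window starting at 0, where it logically belongs.
arrivals = [("user1", 1), ("user1", 12), ("user1", 8), ("user1", 17), ("user1", 25)]
results = count_by_event_time(arrivals)
```

Grouping the same records by arrival order instead would put the event stamped 8 together with later events, which is the kind of artificial grouping the first point warns about. A production engine would also offer ways to handle late records rather than silently dropping them, as this sketch does.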

This may sound compelling, but why go for Flink when there are so many alternatives out there?

Why would anyone choose Flink over Spark in particular, which is enjoying wide popularity and vendor support? After all, if it's latency you're worried about, as Tom Reilly, Cloudera's CEO, put it, "we're able to offer sub-second responses and we don't hear any complaints from customers."

Spark is a really good community, but Spark as a platform has some fundamental problems when it comes to streaming, argues Tzoumas.

"It's not about sub-second responses, it's about how to approach a continuous application that needs to keep state.
