Distributed Stream Processing Frameworks for Fast & Big Data

26.3.2017 | 10 minutes of reading time

Spark Streaming, Flink, Storm, Kafka Streams – these are just the most popular candidates in an ever-growing range of frameworks for processing streaming data at high scale. This article covers the main concepts behind these frameworks and briefly classifies the three Apache projects Spark Streaming, Flink, and Kafka Streams.

Why Stream Processing?

The processing of streaming data is gaining importance due to the steadily growing number of data sources that continuously produce and offer data. In addition to the omnipresent Internet of Things, these include, for example, click streams, data from the advertising business, and device and server logs.

Infinite and continuous data is not a new phenomenon; much of today's data already fits this scheme. Changes to master data, for example, occur continuously, but only at low frequency, and are processed according to the classic request/response pattern. In the case of non-time-critical changes or larger volumes, the data is often collected and then processed regularly by batch jobs, which run, for example, every night or at shorter intervals.

However, daily intervals are often not sufficient. Speed is needed: analyses and evaluations are expected promptly, not minutes or even hours later. This is where stream processing comes into play: data is processed as soon as it becomes known to the system. This began with the lambda architecture (cf. [1]), in which stream and batch processing run in parallel, since stream processing alone could not guarantee consistent results. With today's systems it is also possible to achieve consistent results in near real-time with stream processing only (cf. [2]).

Time Matters

An important aspect of streaming is time. Essentially, three different times can be distinguished:

  • Event time: Time at which the event actually occurred
  • Ingestion time: Time at which the event was observed in the system
  • Processing time: Time at which the event was processed by the system

Fig. 1: Exemplary representation of event time and processing time, with late events (yellow, green, red) and out-of-order events (blue)

In practice, the event time is particularly interesting, compared to the ingestion and processing time. The difference between event time and processing time can vary greatly, and the reasons are numerous: network latencies, distributed systems, hardware failures, or simply irregular data delivery. When processing by processing time, this does not matter: the data is analyzed based on the system time of the processor, so if an event arrives at 12 o'clock, it is irrelevant that it actually occurred at 11 o'clock.

But this is usually not what we want: if an event occurs at 11 o'clock, we want to treat it according to the time it occurred. This raises two questions: When do I know that I have received all events up to 11 o'clock? How long do I wait for late events? There are several strategies and concepts to solve these problems. One of them is the Dataflow/Beam model, in which concepts such as watermarks, triggers, and accumulation help:

  • Watermarks: When have I collected all the data up to a given event time?
  • Triggers: When should the calculation be triggered and its result emitted?
  • Accumulation: How do I merge individual results, for example when data arrives late?

Each of these three concepts could easily fill an article of its own. Tyler Akidau, the mind behind streaming at Google, has already summed them up, so his article is recommended for the details [3]. To make the concepts a little more tangible, the sketch below shows how they surface in the Beam API.
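
A minimal sketch in the Beam Java SDK, assuming a stream of string events that already carry an event-time timestamp; the hourly window, the 30-minute lateness bound, and the class name are made up for illustration:

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class BeamConceptsSketch {

  // Count events per hourly event-time window.
  static PCollection<KV<String, Long>> hourlyCounts(PCollection<String> events) {
    return events
        .apply(Window.<String>into(FixedWindows.of(Duration.standardHours(1)))
            // Trigger: fire once the watermark says "all data up to 11 o'clock is in" ...
            .triggering(AfterWatermark.pastEndOfWindow()
                // ... and once more for every event that still arrives late.
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            // Stop waiting for stragglers 30 minutes after the watermark.
            .withAllowedLateness(Duration.standardMinutes(30))
            // Accumulation: each late firing refines and re-emits the earlier count.
            .accumulatingFiredPanes())
        .apply(Count.perElement());
  }
}
```

The watermark decides when a window is considered complete, the trigger decides when results are emitted, and the accumulation mode decides whether a late firing replaces or refines the earlier result.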

State & Window

Any non-trivial application will correlate incoming events with each other. This requires state, in which previous events are stored temporarily. Such state can be kept indefinitely or explicitly limited in time. An example of indefinitely kept state is a lookup table with metadata; a time-limited state is, for example, a window. A sketch of per-key state follows below.
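
As an illustration of such state, here is a sketch of a Flink rich function that keeps one value per key and correlates each event with its predecessor; the delta computation is a made-up example:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Emits the difference between each event and the previous one for the same key.
public class DeltaFunction extends RichFlatMapFunction<Double, Double> {

  private transient ValueState<Double> lastValue;

  @Override
  public void open(Configuration parameters) {
    // One state entry per key, kept until explicitly cleared.
    lastValue = getRuntimeContext().getState(
        new ValueStateDescriptor<>("last-value", Double.class));
  }

  @Override
  public void flatMap(Double current, Collector<Double> out) throws Exception {
    Double previous = lastValue.value(); // the state survives across events
    if (previous != null) {
      out.collect(current - previous);   // correlate with the previous event
    }
    lastValue.update(current);
  }
}
```

Applied to a keyed stream (`events.keyBy(...).flatMap(new DeltaFunction())`), the framework takes care of distributing and checkpointing this state.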

A window is used to aggregate and analyze data for a specific period of time. This is necessary in almost every application, since the data stream never ends. There are different types of windows:

  • Tumbling Window: Non-overlapping, fixed time segments
  • Sliding Window: Overlapping, fixed time segments
  • Session Window: Non-overlapping time segments of varying length, defined by certain events or by exceeding a certain time between two events

Fig. 2: Tumbling and sliding windows with a window size of 4 seconds and, for the sliding window, a sliding interval of 2 seconds. Within each window the values are summed.

Fig. 3: Session windows with at least two minutes of inactivity between two events for a key.

For the definition of windows, the distinction between event and processing time is important: windows based on processing time are very simple to implement, while windows based on event time need the event-time strategies described above in order not to grow indefinitely. The sketch below shows what the three window types might look like in code.
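
As a sketch, this is how the three window types from Figs. 2 and 3 might look in Flink's DataStream API; the keyed stream of (key, value) tuples, the event-time semantics, and the concrete window sizes are our assumptions:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowSketch {

  // Fig. 2: non-overlapping 4-second windows, values summed per key.
  static DataStream<Tuple2<String, Integer>> tumbling(DataStream<Tuple2<String, Integer>> events) {
    return events.keyBy(e -> e.f0)
        .window(TumblingEventTimeWindows.of(Time.seconds(4)))
        .sum(1);
  }

  // Fig. 2: 4-second windows that slide every 2 seconds, so they overlap.
  static DataStream<Tuple2<String, Integer>> sliding(DataStream<Tuple2<String, Integer>> events) {
    return events.keyBy(e -> e.f0)
        .window(SlidingEventTimeWindows.of(Time.seconds(4), Time.seconds(2)))
        .sum(1);
  }

  // Fig. 3: a session closes after at least two minutes without an event for the key.
  static DataStream<Tuple2<String, Integer>> sessions(DataStream<Tuple2<String, Integer>> events) {
    return events.keyBy(e -> e.f0)
        .window(EventTimeSessionWindows.withGap(Time.minutes(2)))
        .sum(1);
  }
}
```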

API & Runtime Environment

The first differences between the frameworks can be found in the API and the general processing model, where a native streaming approach is distinguished from microbatching. In native streaming, incoming data is processed directly, while microbatching collects the incoming data for a certain time (typically 1–30 s) and then processes it together. The next microbatch can be started either directly after the completion of the previous batch or only after the fixed interval has elapsed. In both cases, microbatching increases latency, but makes error handling somewhat easier. The frequently cited advantage of very high throughput can now also be achieved by native streaming frameworks, which additionally offer more flexibility for windows and state.

What the developer mainly sees is the API. Here, too, two variants can be distinguished: a component-based and a declarative, high-level API. With the former, the flow is described by wiring several components (source -> processing 1 -> processing 2 -> sink); the latter describes the operations on the data (map, filter, reduce), similar to Scala collections or Java 8 streams. The component style provides more flexibility in the distribution of data streams, while the declarative API often already provides higher-order functions and automatic optimization. The sketch below contrasts the two styles.
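
A sketch contrasting the two styles; the Storm spout and bolt classes (EventSpout, EnrichBolt, StoreBolt) are hypothetical and only stand in for real components:

```java
import org.apache.storm.topology.TopologyBuilder;

public class ApiStylesSketch {

  // Component-based (here: Storm): the flow is wired explicitly as
  // source -> processing -> sink. EventSpout, EnrichBolt and StoreBolt
  // are hypothetical classes that only illustrate the wiring.
  static TopologyBuilder componentStyle() {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("events", new EventSpout());
    builder.setBolt("enrich", new EnrichBolt()).shuffleGrouping("events");
    builder.setBolt("store", new StoreBolt()).shuffleGrouping("enrich");
    return builder;
  }

  // The declarative style describes operations on the data instead,
  // similar to Java 8 streams, e.g. in Flink:
  //
  //   events.map(this::enrich)
  //         .filter(event -> event.isValid())
  //         .addSink(sink);
}
```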

Finally, the question is: where do the applications run? One can distinguish between two – surprise 🙂 – basic alternatives. Some frameworks need a dedicated cluster consisting of master nodes and worker nodes. These clusters also take care of resource management and error handling, but can delegate this to other tools (for example, YARN or Mesos). Other frameworks come as a simple library that is integrated into your own application; running and scaling the application must then be handled by other means. Here you have full flexibility, from running a plain JAR file via Docker up to Mesos or YARN.

Distributed systems are unreliable!

All three frameworks are specialized in processing large amounts of data and solve this by horizontal scaling. Such distributed systems are inherently unreliable: single nodes can fail, the network is inconsistent, or the database to which the results are to be written is unavailable.

For this reason, each framework offers different mechanisms to achieve certain guarantees. These range from microbatching, in which small batches are simply repeated, via acknowledgments for individual records, to transactional updates on source and sink. The guarantees achieved are usually at-least-once or exactly-once. Since exactly-once is often difficult and costly to achieve, at-least-once guarantees combined with idempotent operations are often sufficient in terms of both speed and fault tolerance. The sketch below shows what such an idempotent operation can look like.
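
A sketch of an idempotent sink, assuming each event carries a unique id from its source; the table name and the PostgreSQL upsert are made up for illustration:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class IdempotentSink {

  // A redelivered event hits the unique event_id and is ignored, so writing
  // the same event twice has the same effect as writing it once.
  private static final String INSERT_ONCE =
      "INSERT INTO processed_events (event_id, payload) VALUES (?, ?) "
          + "ON CONFLICT (event_id) DO NOTHING";

  public static void write(Connection conn, String eventId, String payload)
      throws SQLException {
    try (PreparedStatement stmt = conn.prepareStatement(INSERT_ONCE)) {
      stmt.setString(1, eventId);
      stmt.setString(2, payload);
      stmt.executeUpdate();
    }
  }
}
```

With an idempotent write like this, an at-least-once redelivery simply has no additional effect, which is usually much cheaper than full exactly-once processing.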

Isn’t there something that can help us?

Time handling, state and windows, a runtime environment, all in a distributed fashion: streaming applications are complex. Fortunately, there are a number of projects to help with these problems. Three of them are briefly presented here.

Apache Spark (Streaming)
Apache Spark is currently one of the most popular projects in the streaming field. It started as a better MapReduce; support for streaming data was added later. Spark Streaming relies on microbatching with a declarative API. At the moment, only processing time is fully supported, but with the new Structured Streaming API, support for event-time processing has been gradually expanded since version 2.0. The same is true for the window support. The state is stored locally in memory or on disk and is regularly backed up by checkpointing. Since Spark now ships with every major Hadoop distribution, its overall adoption is very high. There is also a large ecosystem with many tools and connectors.
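
A minimal microbatching sketch with Spark's DStream API; the socket source on localhost:9999 and the 10-second batch interval are made up for illustration:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class SparkMicrobatchSketch {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("microbatch-sketch").setMaster("local[2]");
    // Every 10 seconds one microbatch is formed and processed as a whole.
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

    JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);
    JavaPairDStream<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum); // word counts per 10-second batch
    counts.print();

    ssc.start();
    ssc.awaitTermination();
  }
}
```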

Apache Flink
When it comes to event-time processing, Apache Flink is currently the first choice: watermarks and triggers are supported, as are various window operations. Flink pursues a native streaming approach and thus achieves low latencies. Like Spark Streaming, it offers a declarative API, with the option to use so-called rich functions in which, for example, state can be managed. Unlike Spark, the state backend can be chosen from several implementations: in-memory, hard disk, or RocksDB. Flink is somewhat younger than Spark but is gaining popularity. Its community and ecosystem are growing steadily as well, but are not yet as big as Spark's. The sketch below shows how event time and watermarks are declared.
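
A sketch of declaring event time in Flink, using the watermark API of the Flink versions current at the time of writing; the sensor event type and the 5-second out-of-orderness bound are made up:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeSketch {

  // Hypothetical event type carrying its own event-time timestamp.
  public static class SensorEvent {
    public long timestamp;
    public double value;
  }

  // Tell Flink where the event time lives and emit watermarks that tolerate
  // events arriving up to 5 seconds out of order.
  static DataStream<SensorEvent> withEventTime(DataStream<SensorEvent> events) {
    return events.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<SensorEvent>(Time.seconds(5)) {
          @Override
          public long extractTimestamp(SensorEvent event) {
            return event.timestamp; // event time, not processing time
          }
        });
  }
}
```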

Apache Kafka Streams
The streaming framework from the Kafka ecosystem is the youngest representative in this overview. It builds on many concepts already contained in Kafka, such as scaling by partitioning the topics. For this reason among others, it comes as a lightweight library that can be integrated into an application. The application can then be operated as desired: standalone, in an application server, as a Docker container, or via a resource manager such as Mesos. Flink and Spark, on the other hand, always need a cluster, either built with the frameworks' own tooling or with YARN/Mesos. Kafka Streams, however, is limited to Kafka as source and sink, although a Kafka topic can be connected to other systems through Kafka Connect, with over 60 connectors available. Besides a declarative API, Kafka Streams also offers a component-oriented API, rudimentary support for event time, and RocksDB as the state implementation. While Kafka itself is already very mature and often used in connection with Flink and Spark, the streaming component is still quite young, so its community and adoption are rather small. However, both can be expected to grow rapidly. A minimal application is sketched below.
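
A sketch of a minimal Kafka Streams application, using the StreamsBuilder API of more recent Kafka versions; the topic names are made up:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-sketch");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    // Declarative pipeline: Kafka topic in, Kafka topic out.
    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> clicks = builder.stream("clicks");
    clicks.filter((key, value) -> value != null)
          .mapValues(value -> value.toUpperCase())
          .to("clicks-uppercased");

    // No dedicated cluster: this plain JVM process is the streaming application.
    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }
}
```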

Update:

It should be noted that Kafka Streams does not use the concepts of the Beam model to tackle the challenges of event-time processing. Instead, Streams is built on the concepts of KTables and KStreams, which enable it to provide event-time processing.

And what suits me?

Finally, the question is: which framework suits me? If event-time processing is required and you do not mind working with the concepts of the Beam model, you could go with Apache Flink. Another advantage is the low latency, and the most important systems (Kafka, Cassandra, Elasticsearch, SQL databases) can be integrated relatively easily.

The low latency and the easy-to-use event-time support also apply to Kafka Streams. So if Kafka is already in use and the processing is rather simple, without complex requirements for event processing (although Streams can also be used for more complex stream processing), Kafka Streams is a good alternative. In return, you have to connect the other systems, such as databases, via Kafka Connect and take care of the runtime environment yourself. The latter can also be an advantage if existing tools, for example from the Docker ecosystem, can be reused.

And Spark? If event time is not relevant and latencies in the seconds range are acceptable, Spark is the first choice. It is stable, and almost any type of system can be integrated easily. In addition, it comes with every Hadoop distribution. Furthermore, the code used for batch applications can also be used for streaming applications, as the API is the same.

Only with very large states can Spark cause problems. The support for event time is being expanded with Spark 2.1.

Conclusion

Stream processing frameworks significantly simplify the processing of large amounts of data. The presented frameworks primarily solve problems in the area of distributed processing, enabling easy-to-scale solutions. Equally important are the different aspects of time handling, which all frameworks support in some form.

That is what distinguishes these systems from libraries such as Akka Streams, RxJava, or Vert.x. The presented frameworks are mainly located in the Big and Fast Data area, while such libraries can also be used to build smaller, more reactive applications, though usually without native support for event time and clustering.

It remains to be noted that the presented frameworks can all help with current challenges in the fast data area and also support new architectures beyond the well-known lambda architecture. However, the complexity of these distributed systems should by no means be underestimated. Nevertheless, it can be assumed that both the adoption and the functionality of these systems will continue to grow.

Links

[1] http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

[2] https://www.oreilly.com/ideas/questioning-the-lambda-architecture

[3] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
