Event time processing in Apache Spark and Apache Flink

19.4.2017 | 9 minutes reading time

With the release of Spark 2.1, the event-time capabilities of Spark Structured Streaming have been expanded. It is time to take a closer look at the state of support and compare it with Apache Flink, which comes with broad support for event-time processing. In this article, I describe how three basic solutions for event-time processing (watermarks, triggers and accumulation) work and then compare their implementation in Spark and Flink. The comparison is based on Spark 2.1 and Flink 1.2.

For a broader comparison of the frameworks, refer to Distributed Stream Processing Frameworks for Fast & Big Data. The linked article also describes further basics that this comparison presupposes, for example stateful and windowed stream processing as well as the difference between event time and processing time.

The central problems of event-time processing are out-of-order and delayed events. Events are not observed by the system exactly at the time they occur, and the difference between event time and processing time is not constant. Especially for time-based evaluations, this increases the complexity of the system and thus also of the code. If, for example, all events between 12:00 and 13:00 are to be processed (a typical window operation), several questions arise: How should late events be handled? How long should the system wait for delayed entries? How should the result be updated when delayed entries arrive? Google has presented various solutions for these problems with the Dataflow API, which has since been released as Apache Beam. Neither Spark nor Flink explicitly follows the Dataflow API, but the concepts are similar and can therefore be compared.

The concepts used in the Dataflow API are one way to tackle event-time processing problems. Kafka Streams, for example, pursues other strategies for event-time support. More on this hopefully in a following blog post.

Watermarks

Watermarks determine up to which point in time events have already been processed. Like a water level, they show the “time level” reached so far. When a watermark is reached, the result of the calculation is materialized. For example, if the watermark is defined as the time of the latest event minus a fixed buffer of 30 seconds, it states: “It is assumed that all events up to time x have now arrived.” Here x is the watermark, defined as time(newest event) - 30 seconds.
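The fixed-buffer watermark described above can be sketched in a few lines of plain Java. This is only an illustration of the idea, not code from Spark or Flink; the class name `FixedLagWatermark` and its methods are hypothetical, and timestamps are assumed to be epoch milliseconds.

```java
/** Hypothetical sketch: watermark = time(newest event) - fixed buffer. */
class FixedLagWatermark {
    private final long bufferMillis;
    // Long.MIN_VALUE marks "no event seen yet".
    private long newestEventMillis = Long.MIN_VALUE;

    FixedLagWatermark(long bufferMillis) {
        this.bufferMillis = bufferMillis;
    }

    /** Records an event timestamp; out-of-order events never move the watermark backwards. */
    void observe(long eventMillis) {
        newestEventMillis = Math.max(newestEventMillis, eventMillis);
    }

    /** Returns the current watermark, or Long.MIN_VALUE before the first event. */
    long current() {
        return newestEventMillis == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : newestEventMillis - bufferMillis;
    }
}
```

With a buffer of 30 seconds, observing an event at second 100 puts the watermark at second 70. A delayed event at second 90 that arrives afterwards leaves the watermark unchanged.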

There are also heuristic watermarks where the buffer is dynamically adapted, for example based on empirical measurements. Thus the system could observe that events are always received at night with a clear delay. The buffer could be increased on this basis. Likewise, the expected delay could be extrapolated from the delay of the last hour. In addition, there is also the – mostly unrealistic – perfect watermark, where the event is processed directly when it occurs.

In the context of watermarks, the “allowed lateness” is often mentioned. For Spark, for example, this is the watermark buffer described above. In the Dataflow Model, however, allowed lateness is an additional period after the watermark, in which events are not ignored, but can subsequently influence the result. Therefore rules are defined – for example by means of triggers – how and how often events are to be processed that occur within allowed lateness.

Regardless of the interpretation of allowed lateness, all data for the calculation of the result must remain stored until the allowed lateness has elapsed in order to update the results. The data is then automatically deleted.

Trigger

While watermarks indicate how far the received data has progressed, triggers materialize the calculation. The watermark trigger is the easiest way to do this: it fires as soon as the watermark reaches the end of the window, for example 1:00 pm for a window from 12 pm to 1 pm. With the watermark defined above (a fixed buffer of 30 seconds), this is the case as soon as an event with a timestamp of 13:00:30 or later has been observed.
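As a minimal sketch of this firing condition, the following Java snippet checks whether the watermark trigger fires for the 12:00 to 13:00 window, assuming the 30-second buffer from the watermark example. The class and method names are hypothetical, and times are simplified to seconds within a single day.

```java
import java.time.LocalTime;

/** Hypothetical sketch of a watermark trigger with a fixed 30-second buffer. */
class WatermarkTriggerDemo {
    static final int BUFFER_SECONDS = 30;

    /** Fires once the watermark (newest event time minus buffer) has reached the window end. */
    static boolean fires(LocalTime newestEvent, LocalTime windowEnd) {
        int watermarkSecondOfDay = newestEvent.toSecondOfDay() - BUFFER_SECONDS;
        return watermarkSecondOfDay >= windowEnd.toSecondOfDay();
    }
}
```

For a window ending at 13:00, `fires` returns false while the newest event is at 13:00:29 and true once an event at 13:00:30 has been seen.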

Other triggers are possible:

Based on processing time, for example every 2 minutes
Based on the number of events, for example every 100 events
On special events, for example the end of a file, a technical flush event, or based on the content of the event
A combination of the three above

Triggers are typically used to determine intermediate results before the watermark is reached, or to update the results for delayed events. It would be possible, for example, with a window of 10 minutes to trigger every minute, then again at the watermark and finally at each delayed event until the allowed lateness is reached.
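To make one of the trigger types listed above concrete, here is a small Java sketch of a count-based trigger. It is not the trigger implementation of any framework; the class `CountTrigger` and its method `onElement` are hypothetical names chosen for illustration.

```java
/** Hypothetical sketch: fires after every N elements, then starts counting again. */
class CountTrigger {
    private final int threshold;
    private int seen = 0;

    CountTrigger(int threshold) {
        if (threshold <= 0) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        this.threshold = threshold;
    }

    /** Registers one element; returns true when the (intermediate) result should be emitted. */
    boolean onElement() {
        seen++;
        if (seen >= threshold) {
            seen = 0; // reset so the trigger can fire again
            return true;
        }
        return false;
    }
}
```

A `new CountTrigger(100)` would produce an intermediate result after every 100th event, independent of event time or processing time.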

Accumulation

If the result is calculated more than once by the use of triggers, the developer must define how the individual partial results are handled. There are three variants:

Discarding: With each trigger, only the new partial results are passed on and the results obtained up to this point are subsequently deleted.
Accumulating: The current results are updated and passed on each trigger. The results are not deleted.
Accumulating & Retracting: Like accumulating, but with additional information about the previous result so that subsequent operations can easily correct their results. This mode is provided in the Dataflow Model, but is not yet implemented by any framework.

The choice of the appropriate accumulation strategy depends strongly on the subsequent processing and the final sink. If this is able to update results, the accumulating mode can be used. If, however, every partial result must only serve once as input for the next step, the discarding mode must be used.
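The difference between the three variants is easiest to see with numbers. The following Java sketch assumes a window whose sum is emitted on three trigger firings, where each array entry is the sum of the events that arrived since the previous firing. The class name, method names and the "+"/"-" output notation are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the three accumulation modes for a windowed sum. */
class AccumulationModes {

    /** Discarding: each firing emits only what arrived since the previous firing. */
    static List<Long> discarding(long[] paneSums) {
        List<Long> out = new ArrayList<>();
        for (long pane : paneSums) {
            out.add(pane);
        }
        return out;
    }

    /** Accumulating: each firing emits the running total of the window. */
    static List<Long> accumulating(long[] paneSums) {
        List<Long> out = new ArrayList<>();
        long total = 0;
        for (long pane : paneSums) {
            total += pane;
            out.add(total);
        }
        return out;
    }

    /** Accumulating & retracting: the new total plus a retraction of the previous total. */
    static List<String> accumulatingAndRetracting(long[] paneSums) {
        List<String> out = new ArrayList<>();
        long total = 0;
        boolean first = true;
        for (long pane : paneSums) {
            long previous = total;
            total += pane;
            out.add(first ? "+" + total : "-" + previous + ",+" + total);
            first = false;
        }
        return out;
    }
}
```

For the partial sums 5, 3 and 2, discarding emits 5, 3, 2, accumulating emits 5, 8, 10, and accumulating & retracting emits "+5", "-5,+8", "-8,+10". A sink that adds up whatever it receives only stays correct with the first or the third variant.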

Support in Apache Spark Streaming

Spark does not need a special mode for event-time processing. Internally, nothing differs from processing-time handling. To define an event-time window in Spark, you group the data by the window:

val words = ... // streaming DataFrame with schema { timestamp: Timestamp, word: String }

// Group by window and word, calculate the count for each group
val windowedCounts = words.groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word"
).count()

This is not significantly different from a groupBy on a key, except that a time window serves as the key. In this case it is a window of 10 minutes length with a sliding interval of 5 minutes. In addition, the entries can be grouped by a business key, in this case the “word”. The count at the end yields a word count per 10-minute window.
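Because the sliding interval is shorter than the window length, every event falls into more than one window. The following Java sketch shows one way to compute the matching windows for an event. It is an assumption about how such an assignment can be done, not Spark's internal implementation; timestamps, size and slide share one unit (minutes in the example), windows are aligned to multiples of the slide, and `SlidingWindows.windowStarts` is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: which sliding windows [start, start + size) contain a timestamp? */
class SlidingWindows {

    /** Returns the start of every window containing the timestamp, newest window first. */
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        // Start of the most recent window that begins at or before the timestamp.
        long lastStart = timestamp - Math.floorMod(timestamp, slide);
        // Walk backwards while the window still covers the timestamp.
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }
}
```

An event at minute 727 of the day (12:07) with a size of 10 and a slide of 5 yields the starts 725 and 720, i.e. the windows 12:05 to 12:15 and 12:00 to 12:10. The word is therefore counted in both windows.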

In the example above, all data is stored indefinitely so that the result can be updated even in the case of delayed events. With watermarks this time can be limited:

val windowedCounts = words
   .withWatermark("timestamp", "10 minutes")
   .groupBy(
       window($"timestamp", "10 minutes", "5 minutes"),
       $"word")
   .count()

Now Spark only waits 10 minutes for delayed data. The data for this window is then deleted. Currently only a watermark with a fixed allowed delay can be used. In Spark, the watermark is currently equal to the allowed lateness.

Spark currently implements two different modes to output the result:

In append mode, the result of the window is output after the watermark has passed it, i.e. 10 minutes after the end of the time window. Early triggers, i.e. at the end of the window, are not possible. The watermark adds additional latency.
In complete mode, the previously calculated result is output with every trigger. However, Spark currently only supports triggers based on processing time. Time-independent or even composite triggers are currently not supported.

In conjunction with watermarks, only the append mode can be used, since all existing data must remain available for the complete mode and cannot be deleted after reaching the watermark.

Spark thus offers rudimentary support for watermarks (only fixed delay) and triggers (only on processing time). As for accumulation, the implemented output modes of Spark most closely correspond to the accumulating mode of the Dataflow API, since the complete result for a time window is always determined after the allowed delay has expired.

Support in Apache Flink

Flink must first be told that processing is to take place on event time. This is done as follows:

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

Depending on the time characteristic, for example, the different window implementations behave differently. In addition, the event time of the stream source is taken into account. If the source does not provide an event time, the event time must be manually extracted from the event using timestamp assigners. The watermark must also be defined for these events.

stream.assignTimestampsAndWatermarks(new TimestampAndWatermarkAssigner());

The developer can implement the assigner completely from scratch or extend predefined implementations. In particular, different implementations are possible for watermarks:

With a fixed distance
With a dynamic but limited distance
Or based on specific events.

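import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

class FixedWatermarkGenerator extends AssignerWithPeriodicWatermarks[SomeEvent] {

   // Use the timestamp carried by the event itself as event time
   override def extractTimestamp(element: SomeEvent, previousElementTimestamp: Long): Long = {
       element.getEventTimestamp
   }

   override def getCurrentWatermark(): Watermark = {
       // the watermark is 10s behind the current processing time
       new Watermark(System.currentTimeMillis() - 10000)
   }
}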

The watermark is used to determine the time at which most of the data for a window has been processed. At this time, the calculation is executed. In addition to the watermark, an allowed lateness can be specified. This is the period of time by which an event may be delayed beyond the watermark. The allowed lateness is always defined in conjunction with a window operation.

stream
   .assignTimestampsAndWatermarks(new TimestampAndWatermarkAssigner())
   .keyBy(event -> event.someKey)
   .window(SlidingEventTimeWindows.of(Time.minutes(15), Time.minutes(5)))
   .allowedLateness(Time.minutes(2))
   .apply(/* window function */);

This defines a sliding window with a length of 15 minutes, a sliding interval of 5 minutes and an allowed lateness of 2 minutes.
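How watermark and allowed lateness interact for a single window can be sketched as follows. The rule used here (a window stops accepting elements once the watermark has passed its last timestamp plus the allowed lateness) reflects my understanding of Flink's behaviour and should be read as an assumption; the class `LatenessCheck`, the enum values and the method `classify` are hypothetical. All values are in milliseconds.

```java
/** Hypothetical sketch: what happens to an element of a window, given the current watermark? */
class LatenessCheck {

    enum Outcome { ON_TIME, LATE_BUT_ACCEPTED, DROPPED }

    static Outcome classify(long windowEnd, long allowedLateness, long watermark) {
        // The window covers [start, end), so its last valid timestamp is end - 1.
        long maxTimestamp = windowEnd - 1;
        if (watermark < maxTimestamp) {
            return Outcome.ON_TIME; // window has not fired on the watermark yet
        }
        if (watermark < maxTimestamp + allowedLateness) {
            return Outcome.LATE_BUT_ACCEPTED; // result is updated again
        }
        return Outcome.DROPPED; // window state has already been cleaned up
    }
}
```

For a window ending at minute 15 (900,000 ms) with an allowed lateness of 2 minutes, an element arriving while the watermark is at 1,000,000 ms still updates the result, whereas an element arriving with the watermark at 1,020,000 ms is discarded.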

When it comes to calculating a window, Flink offers additional triggers besides the watermark. These triggers can react to a certain count (for example every 100 events), to time (either event time or processing time) or to a mixture of both. It is also possible to dynamically register triggers that are executed in the future, for example a certain time after an event.

Last but not least, the question of accumulation: by default, the complete updated data is passed on to subsequent operations within the streaming application. Alternatively, a fire-and-purge trigger can be used, in which all current data is deleted after the trigger fires, so that all subsequent firings only pass on the new data. This is then effectively a discarding accumulation.

As a result, Flink offers extensive support for event-time processing with its various watermarks and flexible triggers as well as windows. Despite the flexibility, it is possible to implement standard cases without major effort.

Summary and recommendation

It is obvious that Flink has been working much longer on support for event-time processing. This explains why it already supports significantly more concepts. In addition, Flink continuously works on supporting the Dataflow concepts, for example with the implementation of the retractable accumulator.

Spark, on the other hand, has only started to support event-time processing with Structured Streaming. So far the basic principles have been created and the first concepts have been implemented with version 2.1. It is to be expected that Spark will implement the essential functions in the course of the year. But if you need stream processing with event-time support now, you should start with Flink.

Blog author

Matthias Niehoff

Head of Data
