Crossing the Streams – Joins in Apache Kafka

15.2.2017 | 14 minutes reading time

Version 0.10.0 of the popular distributed streaming platform Apache Kafka saw the introduction of Kafka Streams. In its initial release, the Streams-API enabled stateful and stateless Kafka-to-Kafka message processing using concepts such as map, flatMap, filter or groupBy that many developers are familiar with these days. In Kafka 0.10.1, it started to support “Interactive Queries”, an API that allows querying stateful stream transformations without going through another Kafka topic.

In this article, we will talk about a specific kind of streaming operation – the joining of streams. We will begin with a brief walkthrough of some core concepts. Then we will take a look at the kinds of joins that the Streams API permits. Following that, we’ll walk through each possible join by looking at the output of an established example. At the end, you should be aware of what kinds of joins are possible in Kafka Streams and where the caveats lie.

A brief introduction to some core concepts

The central component of Kafka is a distributed pub-sub message broker where producers send messages – key-value pairs – to topics which in turn are polled and read by consumers. Each topic is partitioned and the partitions are distributed among brokers. The excellent Kafka documentation explains it best.

There are two main abstractions in the Streams API. A KStream is a stream of key-value pairs – basically a close model of a Kafka topic. The records in a KStream either come directly from a topic or have gone through some kind of transformation – it has for example a filter-method that takes a predicate and returns another KStream that only contains those elements that satisfy the predicate. KStreams are stateless, but they allow for aggregation by turning them into the other core abstraction – KTable – which is often describe as “changelog stream”.

A KTable statefully holds the latest value for a given message key and reacts automatically to newly incoming messages.
A nice example is perhaps counting visits to a website by unique IP addresses. Let’s assume we have a Kafka topic containing messages of the following type: (key=IP, value=timestamp). A KStream contains all visits by all IPs, even if the IP is recurring. A count on such a KStream sums up all visits to a site including duplicates. A KTable on the other hand only contains the latest message and a count on the KTable represents the number of distinct IP addresses that visited the site.

KTables and KStreams can also be windowed. Regarding the example, this means we could add a time dimensions to our stateful operations. To enable windowing, Kafka 0.10 changed the Kafka message format to include a timestamp. This timestamp can either be CreateTime or AppendTime. CreateTime is set by the producer and can be be set manually or automatically. AppendTime is the time a message is appended to the log by the broker. Next up: joins.

Joins

Taking a leaf out of SQLs book, Kafka Streams supports three kinds of joins:

Inner Joins
- Emits an output when both input sources have records with the same key.
Outer Joins
- Emits an output for each record in either input source. If only one source contains a key, the other is null
Left Joins
- Emits an output for each record in the left or primary input source. If the other source does not have a value for a given key, it is set to null

Another important aspect to consider are the input types. The following table shows which operations are permitted between KStreams and KTables:

Primary Type	Secondary Type	Inner Join	Outer Join	Left Join
KStream	KStream	Allowed	Allowed	Allowed
KTable	KTable	Allowed	Allowed	Allowed
KStream	KTable	Not Allowed (Coming 0.10.2)	Allowed	Not Allowed

As the table shows, all joins are permitted between equal types. The only permitted inter-type join is a left join between a KStream and a KTable. A possible use case for this join is the matching of incoming streaming data to a KTable that contains some kind of master data, like matching an incoming user id to a more detailed profile.

In total there are seven possible join types. Let’s look at them in detail. Disclaimer: the table of possible joins and the join semantics are only valid for Kafka 0.10.0 and 0.10.1. They will be improved in Kafka 0.10.2 (see KIP-77 ).

Example

We are going to use a consistent example to demonstrate the differences in the joins. It is based on the online advertising domain. There is one Kafka topic that contains view events of particular ads and another one that contains click events based on those ads. Views and click share an ID that serves as the key in both topics.

In the examples, custom set event times provide a convenient way to simulate the timing within the streams. We will look at the following 7 scenarios with ids 0 to 6:

a click event arrives 1000 ms after the view
a click event arrives 10,000 ms after the view
a view event arrives 1000 ms after the view
there is a view event but no click
there is click event but no view event
there are two view events at distinct times, and a click event 1000 ms after the first view. The view events are denoted as 5.1 and 5.2
there is a view event followed by a click event after 500 ms and another one after 1200 ms. The clicks are denoted as 6.1 and 6.2 in the following.

This visualization shows these streams:

Inner Join KStream-KStream

All KStream-KStream joins are windowed, so the developer has to specify how long that window should be and if the relative order of the elements of both streams matters (ie, happens before/after semantics). The rationale behind that forced windowing is this: a KStream is stateless. To execute a join with acceptable performance, some internal state needs to be kept – otherwise the whole streams would need to be scanned each time a new element arrives. That state contains all elements of the stream within the time window (possible duplicates included). We will use a window of 5000 milliseconds in the following examples.

An inner join on two streams yields a result if a key appears in the both streams within the window. Applied to the example, this produces the following results:

Scenarios 0 and 2 appear as expected as the key appears in both streams within 5000 ms, even though they come in different order. Scenario 1 is missing as view and click did not appear within the window. Scenarios 3 and 4 are missing because they do not appear in both streams. Scenarios 5 and 6 appear duplicated as the keys appear twice in the view stream for scenario 5 and in the click stream for scenario 6.

A variation of this the enforcement of ordering. The developer can specify that a click event can only be joined if it occurs after a view event. This setting would lead to the elimination of scenario 2 in the example.

Outer Join KStream-KStream

The following will assume the idealistic ordering that each event is processed in the order of its timestamp. In practice, this is not always guaranteed.

An outer join will emit an output each time an event is processed in either stream. If the window state already contains an element with the same key in the other stream, it will apply the join method to both elements. If not, it will only apply the incoming element.

The following shows the idealistic result:

For scenario 0, an event is emitted once the view is processed. There is not click yet. When the click arrives, the joined event on view and click is emitted. For case 1, we also get two output events. However, since the events do not occur within the window, neither of these events contains both view and click. Scenario 3 appears in the output without a click, and the equivalent output is emitted for scenario 4. Scenario 5 produces 4 output events as there are two views that are emitted immediately and once again when they are joined against a click. Case 6 only produces 3 events as both clicks can be immediately joined against a view that arrived earlier.

What happens in real life? Hard to say because the timing is important. In tests, the clicks for scenario 0 were always processed earlier than the views by the streaming application, despite using CreateTime for the message timestamps. This is probably what you should take away from this paragraph: as of Kafka 0.10.1.x, outer joins work on processing time, not message time.

EDIT: The struck sentence is incorrect. Kafka Streams definitely do not work on processing time but event time. However, there are runtime dependencies and processing order matters. Thanks to Matthias Sax from Confluent for pointing that out.

Left Join KStream-KStream

This type of join is processing event-time dependent as well and has some behaviour that you might not expect due to a runtime dependency on processing order.
The left join emits an output event every time an event arrives in the left stream. If an event with the same key has previously arrived in the right stream, it is joined with the one in the primary stream. Otherwise it is set to null. With our examples, that is going to result in a bit of a sad picture as we only have one event where the click arrives before the view and thus has already been processed. This leads to the following result:

Only scenario 2 yields a complete result in the idealistic version, but as expected from a left join, all elements from the left stream show up. This is one of the semantics that will change with Kafka 0.10.2 where “right” messages arriving within the window will cause the emission of a joined event.

Inner Join KTable-KTable

Now we’re switching from KStreams to KTables. KTables are represented by materializing the incoming data into a local state store, building a table that always contains the latest update for a key. When a new record arrives, it is joined with the other table’s state store. Joins on KTables are not windowed as KTables describe a state, not a stream. The emitted events show state changes within the table.

It is not easy to figure out what happens here exactly. The following chart represents observations – please correct me if I’m wrong.
EDIT: I have gained a better insight in KTables. The behaviour that tripped me up is caused by caching. It works like this: by default, KTables use a cache for more efficient data processing. Results are only emitted once either the cache is full or a commit interval (default: 30 seconds) has been reached. For a better understanding of the join semantics, disabling caching is very helpful. The following chart shows results for both scenarios:

All the inner join pairs are emitted in both scenarios. Since we’re no longer windowed, even scenario 1 is represented. Empty brackets represent a notification that the state for the keys those scenarios has been updated, but as they could not be joined, the new state is empty/null. With caching enabled all data has been processed before a cache flush happens, so we’re missing events 5.1 and 6.1 from the join output. Another curious thing is that events are duplicated. According to KAFKA-4609 this is caused by the fact that both source tables are flushed individually and therefore both trigger the join. This seems to fit as the first set of emissions contains the empty update for scenario 3 whereas the second set contains scenario 4.

With caching disabled, cache semantics become a lot clearer. Very interesting are scenarios 5 and 6 where the stateful nature of KTables can be observed – view 5.1 is missing as it is updated in the source table by 5.2 wheres we get two joined output for scenario 6 because while click 6.2 updates 6.1, the runtime was able to match 6.1 against view 6 before it was update.

Outer Join KTable-KTable

Apart from the duplication that occurs in the outer join as well when enabling caching, it behaves pretty much as expected:

For the version with caching, the results are the same as with the inner join with the addition of proper data for scenarios 3 and 4 which for scenario 3 contains the lone view and for scenario 4 the lone click.

The none-caching variant emits an event every time a new message arrives and executes the join if a matching entry is held on the other table.

Left Join KTable-KTable

Left joins exhibit the same duplication behaviour. Apart from that, they also work the way you’d expect by now:

Using Caching we get a click element for case 3 and an empty/null event for scenario 4. The rest of the data is joined completely.
Disabling caching, we can see the semantics much more clearly. They behave as you’d expect a left join to behave.

This concludes the KTable-KTable section. The big lesson learned here is that the settings for caching play a very big role in the emission of events from a joined KTable. Yet the end result is the same – each joined KTable has the same content after completely processing the sample data, cached or non-cached. But the way we’re getting there is different.

Left Join KStream-KTable

As of Kafka <= 0.10.1.x,="" this="" is="" the="" only="" type="" of="" stream="" table="" join.="" Its="" semantics="" imply="" that="" an="" incoming="" event="" in="" a="" can="" be="" joined="" against="" table.="" The="" output="" operation="" another="" (and="" not="" table!).="" highly="" timing="" dependent="" –="" example="" with="" views="" and="" clicks="" does="" really="" work="" well="" as="" you’d="" probably="" use="" stream-table="" join="" for="" click="" adserving="" usually="" arrives="" after="" view.="" But="" any="" case,="" please="" consider="" contrieved="" example:=""

We’re simulating the output of an incoming stream joined against a table of clicks that initially is populated with events 0, 2, 4 and 6.1. Each incoming stream event is joined against that table and the result either contains both view and click or only the view if there is no click to join against. Between views 5.1 and 5.2, click event 5 is registered. While view 5.1 could not be joined against a click, this update means that view 5.2 can be joined. There is no issue with duplication as the result is a stream and not a table.

Partitioning and Parallelization

If you are familiar with Kafka consumers, you are probably aware of the concept of the consumer groups – Kafka consumption is parallelized by assigning partitions to exactly one consumer in a group of consumers that share the same group id. If you’re using defaults, Kafka itself will handle the distribution and assign the partitions to consumers. This happens in a way that the consumers have little influence over.
With simple consumers, this is quite straightforward. However, what does it mean for a Kafka stream with state and joins? Can partitions still be randomly distributed? No, they cannot. While you can run multiple instances of your streaming application and partitions will be distributed among them, there are requirements that will be checked at startup. Topics that are joined need to be copartitioned. That means that they need to have the same number of partitions. The streaming application will fail if this is not the case. Producers to the topic also need to use the same partitioner although that is something that cannot be verified by the streaming application as the partitioner is a property of the producer. For example, you will not get any join results if you send view event 0 to partition 0 and the corresponding click event to partition 1 even if both partitions are handled by the same instance of the streaming application.

Summary and Outlook

Kafka Streams is a very interesting API that can handle quite a few use cases in a scalable way. However, some join semantics are a bit weird and might be surprising to developers. An example of this is left and outer join on streams depending on the processing time of the events instead of the event time. Some of those issues are addressed by KIP-77 and are scheduled to be released with Kafka 0.10.2.
The duplicates produced by KTable joins look weird, but they can provide a lot of benefit when used with the “Interactive Query” feature introduced with Kafka 0.10.1 where state stores can be directly queried.

Kafka’s journey from Pub/Sub broker to distributed streaming platform is well underway and times are very exciting.

References

The official Streams documentation
A GitHub repository with examples for all joins
Kafka Wiki describing current and future join semantics
Confluent’s general streams documentation
First-hand experience with Kafka as part of the SMACK stack

Was this post helpful?

Blog author

Florian Troßbach

Do you still have questions? Just send me a message.

fromFlorian Troßbach

Living on the edge: building serverless applications with Cloudflare Workers

Cloudflare is best known for its CDN, DNS server (1.1.1.1) or WAF/DDos mitigation services. These services are highly predicated on “Edge Computing”, bringing data closer to the user interested in those services – a user in Australia will be happier ...

Cloud native
Cloud
Serverless

28.11.2024 | 12 minutes reading time

Florian Troßbach

Validating Topic Configurations in Apache Kafka

Messages in Apache Kafka are appended to (partitions of) a topic. Topics have a partition count, a replication factor and various other configuration values. Why do those matter and what could possibly go wrong? Why does Kafka topic configuration matter...

Messaging
Big Data

7.12.2017 | 8 minutes reading time

Florian Troßbach

Building a distributed Runtime for Interactive Queries in Apache Kafka...

Interactive Queries are a fairly new feature of Apache Kafka Streams that provides programmatic access to the internal state held by a streaming application. However, the Kafka API only provides access to the state that is held locally by an instance...

Messaging
Java

20.3.2017 | 9 minutes reading time

Florian Troßbach

Interactive Queries in Apache Kafka Streams

"Databases? Where we're going we don't need databases" – Doc Brown, 1985 Well, we’re certainly not there yet, but this article is going to introduce you to a new feature of the popular streaming platform Apache Kafka that can make a dedicated external...

Messaging
Streaming

13.3.2017 | 10 minutes reading time

Florian Troßbach

Realtime Fast Data Analytics with Druid

I have been working with the SMACK stack for a while now and it is great fun from a developer’s point of view. Kafka is a very robust data buffer, Spark is great at streaming all that buffered data and Cassandra is really fast at writing and retrieving...

18.8.2016 | 13 minutes reading time

Florian Troßbach

The SMACK stack – hands on!

The SMACK stack is all the rage these days. Instead of just talking about it, this post is going to guide you through the steps for setting up a simple SMACK stack that will enable you to get a hands on experience with the tools. In the first step,...

1.5.2016 | 9 minutes reading time

Florian Troßbach

First steps with Java 9 and Project Jigsaw – Part 2

This is part 2 of a series that aims to get you started with project Jigsaw. In part 1 , we briefly talked about the definition of a module and how the Java Runtime was modularized. We then proceeded to a simple example that demonstrated how to (and ...

Java

1.12.2015 | 12 minutes reading time

Florian Troßbach

First steps with Java 9 and Project Jigsaw – Part 1

Eight years after its inception, Project Jigsaw – the modularization of the Java platform and introduction of a general module system – is on track to be included in Java 9. The target release has changed over the years from Java 7 via Java 8 to Java...

Java

24.11.2015 | 11 minutes reading time

Florian Troßbach

Your First Data Analysis with MotherDuck and DuckDB: From CSV to Insights...

In this post, we'll explore how MotherDuck, powered by DuckDB, revolutionizes the way you interact with your data, particularly when dealing with CSV files. You'll learn how to quickly parse and filter even large datasets directly from your local machine...

Data
Database
MotherDuck
Big Data

30.9.2025 | 8 minutes reading time

5 Reasons Why We’re Excited About MotherDuck Launch in AWS Frankfurt

5 Reasons We’re Excited About MotherDuck’s Launch in AWS Frankfurt For some time, a key challenge for European data teams has been balancing innovation with strict regulation. We’ve often seen powerful tools launch first in the US, while our need for...

Data
Big Data
Database
News
MotherDuck

24.9.2025 | 6 minutes reading time

Marcel Mikl

Becoming a Data-Driven Company with Applied Data Products

In recent years, the hype surrounding the value of data has grown continuously, and a multitude of concepts and methods have emerged on how companies can become 'data-driven'. From strategic top management to detail-oriented data analysts attempts are...

Agile
Big Data
Data
Product management
Digitalization
Data Science
Business Intelligence

18.5.2024 | 9 minutes reading time

Dr. Florian Rademacher

An introduction to federated learning in an industrial context: Advanced

In the Machine Learning space, it was long believed that sharing learnings or weights was safe in the sense that the input data couldn't be extracted. However, this belief has been challenged by researchers coming out over the years. Nowadays, numerous...

Machine Learning
Big Data
Data Science
Data

18.9.2023 | 9 minutes reading time

An introduction to federated learning in an industrial context: Fundamentals

With the help of data, companies are able to make more informed decisions, optimize their workflows and gain an edge in the competitive world of business using the power of Machine Learning (ML). However, handling data has become increasingly difficult...

Machine Learning
Data Science
Data
Big Data

25.8.2023 | 8 minutes reading time

Money, Money, Money - Monetization of APIs needs more than just a business...

Welcome to my blog series on the topic of my bachelor's thesis, "Real-time dashboard with distributed streaming". To summarize, it discusses the visualization of API-related data that is essential for business owners. How is this series structured? This...

API
Streaming
Data

27.10.2022 | 5 minutes reading time

Thinking AI means re-thinking data

While doing AI is sexy and cool, data infrastructure is typically not considered any of this. However, production-grade machine learning applications heavily rely on proper data infrastructure. Hence, in order to generate actual business value, solid...

AI
Big Data
Data
Machine Learning

27.5.2020 | 7 minutes reading time

Marcel Mikl

From PDF data sheets to shared understanding with serverless SHACL

Knowledge contained in PDF filesWhen crawling the web for information about products of a specific category, may it be instances of industrial machine parts, chemical components, or even household goods, manufacturers of such goods often provide the ...

NoSQL
AWS
Big Data
Data
API
Microservices
Python
Serverless
Webdevelopment

1.4.2020 | 12 minutes reading time

Decrypt TLS traffic to Kafka using Wireshark

Usually, debugging issues related to TLS in a Java application involves setting the debug flag -Djavax.net.debug=ALL. Unfortunately, the logs will then be flooded with debug entries, which makes troubleshooting the TLS communication difficult. Moreover...

Software development
Messaging
IT-Security

30.1.2020 | 4 minutes reading time

Thoughts after completing the Coursera “Data Engineering, Big Data, and...

Having worked with Google Cloud Platform’s Big Data Services for almost a year, I wanted to have a broader view on GCP’s capabilities. In this post, I will give you an overview of the services touched by the Coursera specialization . I have been a GCP...

Cloud
Computer Vision
Data
Machine Learning
Big Data
Google Cloud
Serverless

8.9.2019 | 6 minutes reading time

Niklas Haas

Hands-on Spark intro: Cross Join customers and products with business ...

In this blog post, I want to share my aha moments with you I had during the development of my first (Py) Spark application.We do this by an example application:Read customers, products, and transactions.Create the cross join of customers and products...

Python
Big Data
Data

5.8.2019 | 20 minutes reading time

Niklas Haas

Move n-gram extraction into your Keras model!

Move n-gram extraction into your Keras model!In a project on large-scale text classification, a colleague of mine significantly raised the accuracy of our Keras model by feeding it with bigrams and trigrams instead of single characters. For his experiments...

AI
NLP
Big Data
Python
Data

18.7.2019 | 7 minutes reading time

Window Functions in Stream Analytics

Introduction to Stream AnalyticsWhy should we talk about stream analytics? In the past decades data analytics was dominated by batch processing. Records from transactional databases were copied into analytical databases by regular extract-transform-load...

Big Data
Data
Streaming

11.10.2018 | 11 minutes reading time

How to write a Kotlin DSL – e.g. for Apache Kafka

The Kotlin language is gaining more and more attention and is being used in an increasing number of projects. One thing that Kotlin can be used for is implementing special domain-specific-languages (DSLs). The Wikipedia entry on DSL states:A domain...

Messaging
DSL
Kotlin

23.6.2018 | 10 minutes reading time

ETL with Kafka

“ETL with Kafka” is a catchy phrase that I purposely chose for this post instead of a more precise title like “Building a data pipeline with Kafka Connect”.TLDRYou don’t need to write any code for pushing data into Kafka, instead just choose your connector...

Messaging
Big Data

2.3.2018 | 4 minutes reading time

Deep Learning Workshop at codecentric AG in Solingen

Big Data – a buzz word you can find everywhere these days, from nerdy blogs to scientific research papers and even in the news. But how does Big Data Analysis work, exactly? In order to find that out, I attended the workshop on “Deep Learning with Keras...

Big Data
Data
AI
Machine Learning

6.2.2018 | 6 minutes reading time

Shirin Elsinghorst

BigchainDB – The lightweight blockchain framework [blockcentric #5]

With BigchainDB we see one of the first complete but simple blockchain frameworks. The project strives to make blockchain usable for a large number of developers and use cases without requiring special knowledge in cryptography and distributed systems...

Big Data
Blockchain

3.1.2018 | 5 minutes reading time

Validating Topic Configurations in Apache Kafka

Messaging
Big Data

7.12.2017 | 8 minutes reading time

Explore Predictive Maintenance with flexdashboard

Predictive MaintenancePredictive Maintenance is an increasingly popular strategy associated with Industry 4.0; it uses advanced analytics and machine learning to optimize machine costs and output (see Google Trends plot below).A common use case for Predictive...

Big Data
Data
Machine Learning

2.11.2017 | 3 minutes reading time

Shirin Elsinghorst

Data Science for Fraud Detection

What is fraud and why is it interesting for Data Science?Fraud can be defined as “the crime of getting money by deceiving people” (Cambridge Dictionary); it is as old as humanity: whenever two parties exchange goods or conduct business, there is the ...

Big Data
Data
Machine Learning

5.9.2017 | 10 minutes reading time

Shirin Elsinghorst

Crossing the Streams – Joins in Apache Kafka

A brief introduction to some core concepts

Joins

Example

Inner Join KStream-KStream

Outer Join KStream-KStream

Left Join KStream-KStream

Inner Join KTable-KTable

Outer Join KTable-KTable

Left Join KTable-KTable

Left Join KStream-KTable

Partitioning and Parallelization

Summary and Outlook

References

Was this post helpful?

Blog author

More articles

Living on the edge: building serverless applications with Cloudflare Workers

Validating Topic Configurations in Apache Kafka

Building a distributed Runtime for Interactive Queries in Apache Kafka...

Interactive Queries in Apache Kafka Streams

Realtime Fast Data Analytics with Druid

The SMACK stack – hands on!

First steps with Java 9 and Project Jigsaw – Part 2

First steps with Java 9 and Project Jigsaw – Part 1

More articles in this subject area

Your First Data Analysis with MotherDuck and DuckDB: From CSV to Insights...

5 Reasons Why We’re Excited About MotherDuck Launch in AWS Frankfurt

Becoming a Data-Driven Company with Applied Data Products

An introduction to federated learning in an industrial context: Advanced

An introduction to federated learning in an industrial context: Fundamentals

Money, Money, Money - Monetization of APIs needs more than just a business...

Thinking AI means re-thinking data

From PDF data sheets to shared understanding with serverless SHACL

Decrypt TLS traffic to Kafka using Wireshark

Thoughts after completing the Coursera “Data Engineering, Big Data, and...

Hands-on Spark intro: Cross Join customers and products with business ...

Move n-gram extraction into your Keras model!

Window Functions in Stream Analytics

How to write a Kotlin DSL – e.g. for Apache Kafka

ETL with Kafka

Deep Learning Workshop at codecentric AG in Solingen

BigchainDB – The lightweight blockchain framework [blockcentric #5]

Validating Topic Configurations in Apache Kafka

Explore Predictive Maintenance with flexdashboard

Data Science for Fraud Detection