Validating Topic Configurations in Apache Kafka

7.12.2017 | 8 minutes reading time

Messages in Apache Kafka are appended to (partitions of) a topic. Topics have a partition count, a replication factor and various other configuration values. Why do those matter and what could possibly go wrong?

Why does Kafka topic configuration matter?

There are three main parts that define the configuration of a Kafka topic:

Partition count
Replication factor
Technical configuration

The partition count defines the level of parallelism of the topic. For example, a partition count of 50 means that up to 50 consumer instances in a consumer group can process messages in parallel. The replication factor specifes how many copies of a partition are held in the cluster to enable failover in case of broker failure. And in the technical configuration, one can define the cleanup policy (deletion or log compaction), flushing of data to disk, maximum message size, permitting unclean leader elections and so on. For a complete list, see https://kafka.apache.org/documentation/#topicconfigs . Some of these properties are quite easy to change at runtime. For others this is a lot harder, though.

Let’s take the partition count. Increasing it upwards is easy – just run

bin/kafka-topics.sh --alter --zookeeper zk:2181 --topic mytopic --partitions 42

This might be sufficient for you. Or it might open the fiery gates of hell and break your application. The latter is the case if you depend on all messages for a given key landing on the same partition (to be handled by the same consumer in a group) or for example if you run a Kafka Streams application. If that application uses joins , the involved topics need to be copartitioned, meaning that they need to have the same partition count (and producers using the same partitioner, but that is hard to enforce). Even without joins, you don’t want messages with the same key end up in different KTables .

Changing the replication factor is serious business. It is not a case of simply saying “please increase the replication factor to x” as it is with the partition count. You need to completely reassign partitions to brokers, specifying the preferred leader and n replicas for each partition. It is your task to distribute those well across your cluster. This is no fun for anyone involved. Practical experience with this has actually led to this blog post.
The technical configuration has an impact as well. It could be for example quite essential that a topic is using compaction instead of deletion if an application depends on that. You also might find the retention time too small or too big.

The Evils of Automatic Topic Creation

In a recent project, a central team managed the Kafka cluster. This team kept a lot of default values in the broker configuration. This is mostly sensible as Kafka comes with pretty good defaults. However, one thing they kept was auto.create.topics.enable=true. This property means that whenever a client tries to write to or read from a non-existing topic, Kafka will automatically create it. Defaults for partition count and replication factor were kept at 1.

This led to the situation where the team forgot to set up a new topic manually before running producers and consumers. Kafka created that topic with default configuration. Once this was noticed, all applications were stopped and the topic deleted – only to be created again automatically seconds later, presumably because the team didn’t find all clients. “Ok”, they thought, “let’s fix it manually”. They increased the partition count to 32, only to realize that they had to provide the complete partition assignment map to fix the replication factor. Even with tool support from Kafka Manager, this didn’t give the team members a great feeling. Luckily, this was only a development cluster, so nothing really bad happened. But it was easy to conceive that this could also happen in production as there are no safeguards.

Another danger of automatic topic creation is the sensitivity to typos. Let’s face it – sometimes we all suffer from butterfingers. Even if you took all necessary care to correctly create a topic called “parameters”, you might end up with something like

Automatic topic creating means that your producer thinks everything is fine, and you’ll scratch your head as to why your consumers don’t receive any data.

Another conceivable issue is that a developer that maybe is not yet that familiar with the Producer API might confuse the String parameters in the send method

So while our developer meant to assign a random value to the message key, he accidentally set a random topic name. Every time a message is produced, Kafka creates a new topic.

So why don’t we just switch automatic topic creation off? Well, if you can: do it. Do it now! Sadly, the team didn’t have that option. But an idea was born – what would be the easiest way to at least fail fast at application startup when something is different than expected?

How to automatically check your topic configuration

In older versions of Kafka, we basically used the code called by the kafka-topics.sh script to programmatically work with topics. To create a topic for example we looked at how to use kafka.admin.CreateTopicCommand. This was definitely better than writing straight to Zookeeper because there is no need to replicate the logic of “which ZNode goes where”, but it always felt like a hack. And of course we got a dependency on the Kafka broker in our code – definitely not great.

Kafka 0.11 implemented KIP-117 , thus providing a new type of Kafka client – org.apache.kafka.clients.admin.AdminClient . This client enables users to programmatically execute admin tasks without relying on those old internal classes or even Zookeeper – all Zookeeper tasks are executed by brokers.

With AdminClient, it’s fairly easy to query the cluster for the current configuration of a topic. For example, this is the code to find out if a topic exists and what its partition count and replication factor is:

The DescribeTopicsResult contains all the info required to find out if the topic exists and how partition count and replication factor are set. It’s asynchronous, so be prepared to work with Futures to get your info.

Getting configs like cleanup.policy works similarly, but uses a different method:

Under the hood there is the same Future-based mechanism.

A first implementation attempt

If you are in a situation where your application depends on a certain configuration for the Kafka topics you use, it might make sense to fail early when something is not right. You get instant feedback and have a chance to fix the problem. Or you might at least want to emit a warning in your log. In any case, as nice as the AdminClient is, this check is not something you should have to implement yourself in every project.

Thus, the idea for a small library was born. And since naming things is hard, it’s called “Club Topicana”.

With Club Topicana, you can check your topic configuration every time you create a Kafka Producer, Consumer or Streams client.

Expectations can be expressed programmatically or configuratively. Programmatically, it uses a builder:

This basically says “I expect the topic test_topic to exist. It should also have 32 partitions and a replication factor of 3. I also expect the cleanup policy to be delete. Kafka should retain messages for at least 30 seconds.”

Another option to specify an expected configuration is YAML (parser is included):

What do you do with those expectations? The library provides factories for all Kafka clients that mirror their public constructors and additionally expects a collection of expected topic configurations. For example, creating a producer can look like this:

The last line throws a MismatchedTopicConfigException if the actual configuration does not meet expectations. The message of that exception lists the differences. It also provides access to the computed result so users can react to it in any way they want.

The code for consumers and streams clients looks similar. Examples are available on GitHub . If all standard clients are created using Club Topicana, an exception will prevent creation of a client and thus auto creation of a topic. Even if auto creation is disabled, it might be valuable to ensure that topics have the correct configuration.

There is also a Spring client. The @EnableClubTopicana annotation triggers Club Topicana to read YAML configuration and execute the checks. You can configure if you want to just log any mismatches or if you want to let the creation of the application context fail.

This is all on GitHub and available on Maven Central.

Caveats

Club Topicana will not notice when someone changes the configuration of a topic after your application has successfully started. It also of course cannot guard against other clients doing whatever on Kafka.

Summary

The configuration of your Kafka topics is an essential part of running your Kafka applications. Wrong partition count? You might not get the parallelism you need or your streams application might not even start. Wrong replication factor? Data loss is a real possibility. Wrong cleanup policy? You might lose messages that you depend on later. Sometimes, your topics might be auto-generated and come with bad defaults that you have to fix manually. With the AdminClient introduced in Kafka 0.11, it’s simple to write a library that compares actual and desired topic configurations at application startup.

Was this post helpful?

Blog author

Florian Troßbach

Do you still have questions? Just send me a message.

fromFlorian Troßbach

Living on the edge: building serverless applications with Cloudflare Workers

Cloudflare is best known for its CDN, DNS server (1.1.1.1) or WAF/DDos mitigation services. These services are highly predicated on “Edge Computing”, bringing data closer to the user interested in those services – a user in Australia will be happier ...

Cloud native
Cloud
Serverless

28.11.2024 | 12 minutes reading time

Florian Troßbach

Building a distributed Runtime for Interactive Queries in Apache Kafka...

Interactive Queries are a fairly new feature of Apache Kafka Streams that provides programmatic access to the internal state held by a streaming application. However, the Kafka API only provides access to the state that is held locally by an instance...

Messaging
Java

20.3.2017 | 9 minutes reading time

Florian Troßbach

Interactive Queries in Apache Kafka Streams

"Databases? Where we're going we don't need databases" – Doc Brown, 1985 Well, we’re certainly not there yet, but this article is going to introduce you to a new feature of the popular streaming platform Apache Kafka that can make a dedicated external...

Messaging
Streaming

13.3.2017 | 10 minutes reading time

Florian Troßbach

Crossing the Streams – Joins in Apache Kafka

Version 0.10.0 of the popular distributed streaming platform Apache Kafka saw the introduction of Kafka Streams. In its initial release, the Streams-API enabled stateful and stateless Kafka-to-Kafka message processing using concepts such as map, flatMap...

Messaging
Big Data
Streaming

15.2.2017 | 14 minutes reading time

Florian Troßbach

Realtime Fast Data Analytics with Druid

I have been working with the SMACK stack for a while now and it is great fun from a developer’s point of view. Kafka is a very robust data buffer, Spark is great at streaming all that buffered data and Cassandra is really fast at writing and retrieving...

18.8.2016 | 13 minutes reading time

Florian Troßbach

The SMACK stack – hands on!

The SMACK stack is all the rage these days. Instead of just talking about it, this post is going to guide you through the steps for setting up a simple SMACK stack that will enable you to get a hands on experience with the tools. In the first step,...

1.5.2016 | 9 minutes reading time

Florian Troßbach

First steps with Java 9 and Project Jigsaw – Part 2

This is part 2 of a series that aims to get you started with project Jigsaw. In part 1 , we briefly talked about the definition of a module and how the Java Runtime was modularized. We then proceeded to a simple example that demonstrated how to (and ...

Java

1.12.2015 | 12 minutes reading time

Florian Troßbach

First steps with Java 9 and Project Jigsaw – Part 1

Eight years after its inception, Project Jigsaw – the modularization of the Java platform and introduction of a general module system – is on track to be included in Java 9. The target release has changed over the years from Java 7 via Java 8 to Java...

Java

24.11.2015 | 11 minutes reading time

Florian Troßbach

Your job at codecentric?

Jobs

Agile Developer und Consultant (w/d/m)

Alle Standorte

Becoming a Data-Driven Company with Applied Data Products

In recent years, the hype surrounding the value of data has grown continuously, and a multitude of concepts and methods have emerged on how companies can become 'data-driven'. From strategic top management to detail-oriented data analysts attempts are...

Agile
Big Data
Data
Product management
Digitalization
Data Science
Business Intelligence

18.5.2024 | 9 [Missing String "readingTime"]

Dr. Florian Rademacher

An introduction to federated learning in an industrial context: Advanced

In the Machine Learning space, it was long believed that sharing learnings or weights was safe in the sense that the input data couldn't be extracted. However, this belief has been challenged by researchers coming out over the years. Nowadays, numerous...

Machine Learning
Big Data
Data Science
Data

18.9.2023 | 9 [Missing String "readingTime"]

An introduction to federated learning in an industrial context: Fundamentals

With the help of data, companies are able to make more informed decisions, optimize their workflows and gain an edge in the competitive world of business using the power of Machine Learning (ML). However, handling data has become increasingly difficult...

Machine Learning
Data Science
Data
Big Data

25.8.2023 | 8 [Missing String "readingTime"]

Thinking AI means re-thinking data

While doing AI is sexy and cool, data infrastructure is typically not considered any of this. However, production-grade machine learning applications heavily rely on proper data infrastructure. Hence, in order to generate actual business value, solid...

AI
Big Data
Data
Machine Learning

27.5.2020 | 7 [Missing String "readingTime"]

Marcel Mikl

From PDF data sheets to shared understanding with serverless SHACL

Knowledge contained in PDF filesWhen crawling the web for information about products of a specific category, may it be instances of industrial machine parts, chemical components, or even household goods, manufacturers of such goods often provide the ...

NoSQL
AWS
Big Data
Data
API
Microservices
Python
Serverless
Webdevelopment

1.4.2020 | 12 [Missing String "readingTime"]

Decrypt TLS traffic to Kafka using Wireshark

Usually, debugging issues related to TLS in a Java application involves setting the debug flag -Djavax.net.debug=ALL. Unfortunately, the logs will then be flooded with debug entries, which makes troubleshooting the TLS communication difficult. Moreover...

Software development
Messaging
IT-Security

30.1.2020 | 4 [Missing String "readingTime"]

Thoughts after completing the Coursera “Data Engineering, Big Data, and...

Having worked with Google Cloud Platform’s Big Data Services for almost a year, I wanted to have a broader view on GCP’s capabilities. In this post, I will give you an overview of the services touched by the Coursera specialization . I have been a GCP...

Cloud
Computer Vision
Data
Machine Learning
Big Data
Google Cloud
Serverless

8.9.2019 | 6 [Missing String "readingTime"]

Niklas Haas

Hands-on Spark intro: Cross Join customers and products with business ...

In this blog post, I want to share my aha moments with you I had during the development of my first (Py) Spark application.We do this by an example application:Read customers, products, and transactions.Create the cross join of customers and products...

Python
Big Data
Data

5.8.2019 | 20 [Missing String "readingTime"]

Niklas Haas

Move n-gram extraction into your Keras model!

Move n-gram extraction into your Keras model!In a project on large-scale text classification, a colleague of mine significantly raised the accuracy of our Keras model by feeding it with bigrams and trigrams instead of single characters. For his experiments...

AI
NLP
Big Data
Python
Data

18.7.2019 | 7 [Missing String "readingTime"]

Window Functions in Stream Analytics

Introduction to Stream AnalyticsWhy should we talk about stream analytics? In the past decades data analytics was dominated by batch processing. Records from transactional databases were copied into analytical databases by regular extract-transform-load...

Big Data
Data
Streaming

11.10.2018 | 11 [Missing String "readingTime"]

How to write a Kotlin DSL – e.g. for Apache Kafka

The Kotlin language is gaining more and more attention and is being used in an increasing number of projects. One thing that Kotlin can be used for is implementing special domain-specific-languages (DSLs). The Wikipedia entry on DSL states:A domain...

Messaging
DSL
Kotlin

23.6.2018 | 10 [Missing String "readingTime"]

ETL with Kafka

“ETL with Kafka” is a catchy phrase that I purposely chose for this post instead of a more precise title like “Building a data pipeline with Kafka Connect”.TLDRYou don’t need to write any code for pushing data into Kafka, instead just choose your connector...

Messaging
Big Data

2.3.2018 | 4 [Missing String "readingTime"]

Deep Learning Workshop at codecentric AG in Solingen

Big Data – a buzz word you can find everywhere these days, from nerdy blogs to scientific research papers and even in the news. But how does Big Data Analysis work, exactly? In order to find that out, I attended the workshop on “Deep Learning with Keras...

Big Data
Data
AI
Machine Learning

6.2.2018 | 6 [Missing String "readingTime"]

Shirin Elsinghorst

BigchainDB – The lightweight blockchain framework [blockcentric #5]

With BigchainDB we see one of the first complete but simple blockchain frameworks. The project strives to make blockchain usable for a large number of developers and use cases without requiring special knowledge in cryptography and distributed systems...

Big Data
Blockchain

3.1.2018 | 5 [Missing String "readingTime"]

Explore Predictive Maintenance with flexdashboard

Predictive MaintenancePredictive Maintenance is an increasingly popular strategy associated with Industry 4.0; it uses advanced analytics and machine learning to optimize machine costs and output (see Google Trends plot below).A common use case for Predictive...

Big Data
Data
Machine Learning

2.11.2017 | 3 [Missing String "readingTime"]

Shirin Elsinghorst

Data Science for Fraud Detection

What is fraud and why is it interesting for Data Science?Fraud can be defined as “the crime of getting money by deceiving people” (Cambridge Dictionary); it is as old as humanity: whenever two parties exchange goods or conduct business, there is the ...

Big Data
Data
Machine Learning

5.9.2017 | 10 [Missing String "readingTime"]

Shirin Elsinghorst

Lookup additional data in Spark Streaming

When processing streaming data, the raw data from the events are often not sufficient. Additional data must be added in most cases, for example metadata for a sensor, of which only the ID is sent in the event.In this blog post I would like to discuss...

Software architecture
Scala
Big Data
Data
Streaming

1.6.2017 | 8 [Missing String "readingTime"]

Matthias Niehoff

Event time processing in Apache Spark and Apache Flink

With the new release of Spark 2.1, the event-time capabilities of Spark Structured Streaming have been expanded. It is time to take a closer look at the state of support and compare it with Apache Flink – which comes with a broad support for event time...

Big Data
Data
Machine Learning
Streaming

19.4.2017 | 9 [Missing String "readingTime"]

Matthias Niehoff

Distributed Stream Processing Frameworks for Fast & Big Data

Spark Streaming, Flink, Storm, Kafka Streams – that are only the most popular candidates of an ever growing range of frameworks for processing streaming data at high scale. This article is about the main concepts behind these frameworks. Furthermore...

Big Data
Data
Open Source
Messaging
Machine Learning
Streaming

26.3.2017 | 10 [Missing String "readingTime"]

Matthias Niehoff

Building a distributed Runtime for Interactive Queries in Apache Kafka...

Messaging
Java

20.3.2017 | 9 [Missing String "readingTime"]

Validating Topic Configurations in Apache Kafka

Why does Kafka topic configuration matter?

The Evils of Automatic Topic Creation

How to automatically check your topic configuration

A first implementation attempt

Caveats

Summary

Was this post helpful?

Blog author

More articles

Living on the edge: building serverless applications with Cloudflare Workers

Building a distributed Runtime for Interactive Queries in Apache Kafka...

Interactive Queries in Apache Kafka Streams

Crossing the Streams – Joins in Apache Kafka

Realtime Fast Data Analytics with Druid

The SMACK stack – hands on!

First steps with Java 9 and Project Jigsaw – Part 2

First steps with Java 9 and Project Jigsaw – Part 1

Your job at codecentric?

Agile Developer und Consultant (w/d/m)

More articles in this subject area

Becoming a Data-Driven Company with Applied Data Products

An introduction to federated learning in an industrial context: Advanced

An introduction to federated learning in an industrial context: Fundamentals

Thinking AI means re-thinking data

From PDF data sheets to shared understanding with serverless SHACL

Decrypt TLS traffic to Kafka using Wireshark

Thoughts after completing the Coursera “Data Engineering, Big Data, and...

Hands-on Spark intro: Cross Join customers and products with business ...

Move n-gram extraction into your Keras model!

Window Functions in Stream Analytics

How to write a Kotlin DSL – e.g. for Apache Kafka

ETL with Kafka

Deep Learning Workshop at codecentric AG in Solingen

BigchainDB – The lightweight blockchain framework [blockcentric #5]

Explore Predictive Maintenance with flexdashboard

Data Science for Fraud Detection

Lookup additional data in Spark Streaming

Event time processing in Apache Spark and Apache Flink

Distributed Stream Processing Frameworks for Fast & Big Data

Building a distributed Runtime for Interactive Queries in Apache Kafka...