
Validating Topic Configurations in Apache Kafka

7.12.2017 | 8 minutes of reading time

Messages in Apache Kafka are appended to (partitions of) a topic. Topics have a partition count, a replication factor and various other configuration values. Why do those matter and what could possibly go wrong?

Why does Kafka topic configuration matter?

There are three main parts that define the configuration of a Kafka topic:

  • Partition count
  • Replication factor
  • Technical configuration

The partition count defines the level of parallelism of the topic. For example, a partition count of 50 means that up to 50 consumer instances in a consumer group can process messages in parallel. The replication factor specifies how many copies of each partition are held in the cluster to enable failover in case of broker failure. And in the technical configuration, one can define the cleanup policy (deletion or log compaction), flushing of data to disk, maximum message size, permitting unclean leader elections and so on. For a complete list, see https://kafka.apache.org/documentation/#topicconfigs. Some of these properties are quite easy to change at runtime. For others, this is a lot harder.
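
For illustration, all three parts can be set when a topic is created manually – partition count, replication factor and any technical configuration. Topic name, values and the ZooKeeper address here are just examples:

bin/kafka-topics.sh --create --zookeeper zk:2181 --topic mytopic --partitions 32 --replication-factor 3 --config cleanup.policy=compact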

Let’s take the partition count. Adjusting it upwards is easy – just run

bin/kafka-topics.sh --alter --zookeeper zk:2181 --topic mytopic --partitions 42

This might be sufficient for you. Or it might open the fiery gates of hell and break your application. The latter is the case if you depend on all messages for a given key landing on the same partition (to be handled by the same consumer in a group) or, for example, if you run a Kafka Streams application. If that application uses joins, the involved topics need to be co-partitioned, meaning that they need to have the same partition count (and producers using the same partitioner, but that is hard to enforce). Even without joins, you don’t want messages with the same key ending up in different KTables.

Changing the replication factor is serious business. It is not a case of simply saying “please increase the replication factor to x” as it is with the partition count. You need to completely reassign partitions to brokers, specifying the preferred leader and n replicas for each partition. It is your task to distribute those well across your cluster. This is no fun for anyone involved. Practical experience with this has actually led to this blog post.
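
For illustration, a reassignment is driven by a JSON file that lists the desired replica set for every single partition (the broker IDs below are made up; the first replica in each list becomes the preferred leader), which is then handed to the reassignment tool:

{"version":1,"partitions":[
  {"topic":"mytopic","partition":0,"replicas":[1,2,3]},
  {"topic":"mytopic","partition":1,"replicas":[2,3,1]}
]}

bin/kafka-reassign-partitions.sh --zookeeper zk:2181 --reassignment-json-file increase-replication.json --execute
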
The technical configuration has an impact as well. For example, it can be essential that a topic uses compaction instead of deletion if an application depends on it. You might also find the retention time too short or too long.

The Evils of Automatic Topic Creation

In a recent project, a central team managed the Kafka cluster. This team kept a lot of default values in the broker configuration. This is mostly sensible as Kafka comes with pretty good defaults. However, one thing they kept was auto.create.topics.enable=true. This property means that whenever a client tries to write to or read from a non-existing topic, Kafka will automatically create it. Defaults for partition count and replication factor were kept at 1.

Sooner or later, this led to the situation where the team forgot to set up a new topic manually before running producers and consumers, and Kafka created that topic with the default configuration. Once this was noticed, all applications were stopped and the topic deleted – only to be created again automatically seconds later, presumably because the team hadn’t found all clients. “Ok”, they thought, “let’s fix it manually”. They increased the partition count to 32, only to realize that they had to provide the complete partition assignment map to fix the replication factor. Even with tool support from Kafka Manager, this didn’t give the team members a great feeling. Luckily, this was only a development cluster, so nothing really bad happened. But it was easy to see that the same thing could happen in production, as there are no safeguards.

Another danger of automatic topic creation is its sensitivity to typos. Let’s face it – sometimes we all suffer from butterfingers. Even if you took all necessary care to correctly create a topic called “parameters”, you might end up with something like this:
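
To illustrate, a topic listing via bin/kafka-topics.sh --list --zookeeper zk:2181 might suddenly contain the real topic plus a few misspelled siblings (the variants below are of course made up):

parameters
paramaters
parameteres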

Automatic topic creation means that your producer thinks everything is fine, and you’ll scratch your head as to why your consumers don’t receive any data.

Another conceivable issue is that a developer who is perhaps not yet that familiar with the Producer API might confuse the String parameters in the send method:
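
(A hedged illustration – producer setup and the payload variable are omitted, and the intended topic is again “parameters”.)

// intended: new ProducerRecord<>("parameters", UUID.randomUUID().toString(), payload)
// but topic and key got mixed up – the first parameter of ProducerRecord is the topic:
producer.send(new ProducerRecord<>(UUID.randomUUID().toString(), "parameters", payload));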

So while our developer meant to assign a random value to the message key, he accidentally set a random topic name. Every time a message is produced, Kafka creates a new topic.

So why don’t we just switch automatic topic creation off? Well, if you can: do it. Do it now! Sadly, the team didn’t have that option. But an idea was born – what would be the easiest way to at least fail fast at application startup when something is different than expected?

How to automatically check your topic configuration

In older versions of Kafka, we basically used the code called by the kafka-topics.sh script to work with topics programmatically. To create a topic, for example, we looked at how to use kafka.admin.CreateTopicCommand. This was definitely better than writing straight to Zookeeper because there was no need to replicate the logic of “which ZNode goes where”, but it always felt like a hack. And of course it gave our code a dependency on the Kafka broker – definitely not great.

Kafka 0.11 implemented KIP-117, thus providing a new type of Kafka client – org.apache.kafka.clients.admin.AdminClient. This client enables users to programmatically execute admin tasks without relying on those old internal classes or even Zookeeper – all Zookeeper tasks are executed by the brokers.

With AdminClient, it’s fairly easy to query the cluster for the current configuration of a topic. For example, this is the code to find out whether a topic exists and what its partition count and replication factor are:
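
(A sketch – broker address and topic name are placeholders, the classes live in org.apache.kafka.clients.admin, and the get() calls on the returned futures throw checked exceptions that are not handled here.)

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
try (AdminClient adminClient = AdminClient.create(props)) {
    DescribeTopicsResult result = adminClient.describeTopics(Collections.singleton("test_topic"));
    // the future fails with an UnknownTopicOrPartitionException if the topic does not exist
    TopicDescription description = result.values().get("test_topic").get();
    int partitionCount = description.partitions().size();
    int replicationFactor = description.partitions().get(0).replicas().size();
}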

The DescribeTopicsResult contains all the info required to find out if the topic exists and how partition count and replication factor are set. It’s asynchronous, so be prepared to work with Futures to get your info.

Getting configs like cleanup.policy works similarly, but uses a different method:
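
(A sketch, reusing the AdminClient from above; ConfigResource comes from org.apache.kafka.common.config.)

ConfigResource topicResource = new ConfigResource(ConfigResource.Type.TOPIC, "test_topic");
DescribeConfigsResult configsResult = adminClient.describeConfigs(Collections.singleton(topicResource));
Config topicConfig = configsResult.values().get(topicResource).get();
String cleanupPolicy = topicConfig.get("cleanup.policy").value();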

Under the hood there is the same Future-based mechanism.

A first implementation attempt

If you are in a situation where your application depends on a certain configuration for the Kafka topics you use, it might make sense to fail early when something is not right. You get instant feedback and have a chance to fix the problem. Or you might at least want to emit a warning in your log. In any case, as nice as the AdminClient is, this check is not something you should have to implement yourself in every project.

Thus, the idea for a small library was born. And since naming things is hard, it’s called “Club Topicana”.

With Club Topicana, you can check your topic configuration every time you create a Kafka Producer, Consumer or Streams client.

Expectations can be expressed programmatically or via configuration. Programmatically, Club Topicana uses a builder:
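
(A sketch of what that can look like – the class and method names here are illustrative rather than the library’s exact API; see the GitHub repository for the real thing.)

ExpectedTopicConfiguration expected = new ExpectedTopicConfigurationBuilder("test_topic")
        .withPartitionCount(32)
        .withReplicationFactor(3)
        .with("cleanup.policy", "delete")
        .with("retention.ms", "30000")
        .build();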

This basically says “I expect the topic test_topic to exist. It should also have 32 partitions and a replication factor of 3. I also expect the cleanup policy to be delete. Kafka should retain messages for at least 30 seconds.”

Another option to specify an expected configuration is YAML (a parser is included):
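
(A sketch of such a file – the exact keys are illustrative; the repository contains the authoritative format.)

- name: test_topic
  partitionCount: 32
  replicationFactor: 3
  config:
    cleanup.policy: delete
    retention.ms: 30000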

What do you do with those expectations? The library provides factories for all Kafka clients that mirror their public constructors and additionally expects a collection of expected topic configurations. For example, creating a producer can look like this:
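
(A sketch – the factory class and method names are illustrative; what matters is that the usual producer Properties are passed along together with the expectations defined above.)

Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "broker:9092");
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer =
        ClubTopicana.createProducer(producerProps, Collections.singleton(expected));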

The last line throws a MismatchedTopicConfigException if the actual configuration does not meet expectations. The message of that exception lists the differences. It also provides access to the computed result so users can react to it in any way they want.

The code for consumers and streams clients looks similar. Examples are available on GitHub. If all standard clients are created via Club Topicana, an exception will prevent the creation of a client and thus the automatic creation of a topic. Even if automatic topic creation is disabled, it might still be valuable to ensure that topics have the correct configuration.

There is also a Spring client. The @EnableClubTopicana annotation triggers Club Topicana to read the YAML configuration and execute the checks. You can configure whether mismatches are merely logged or whether they should fail the creation of the application context.
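
(A minimal sketch of the Spring setup – only the annotation itself comes from the library; the surrounding Spring Boot application class and any properties for the YAML location and failure behaviour are assumptions, see the GitHub examples.)

@SpringBootApplication
@EnableClubTopicana
public class MyApplication {
    public static void main(String[] args) {
        SpringApplication.run(MyApplication.class, args);
    }
}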

This is all on GitHub and available on Maven Central.

Caveats

Club Topicana will not notice when someone changes the configuration of a topic after your application has successfully started. And it of course cannot guard against other clients doing whatever they like on Kafka.

Summary

The configuration of your Kafka topics is an essential part of running your Kafka applications. Wrong partition count? You might not get the parallelism you need or your streams application might not even start. Wrong replication factor? Data loss is a real possibility. Wrong cleanup policy? You might lose messages that you depend on later. Sometimes, your topics might be auto-generated and come with bad defaults that you have to fix manually. With the AdminClient introduced in Kafka 0.11, it’s simple to write a library that compares actual and desired topic configurations at application startup.
