ETL with Kafka

2.3.2018 | 4 minutes reading time

“ETL with Kafka” is a catchy phrase that I purposely chose for this post instead of a more precise title like “Building a data pipeline with Kafka Connect”.

TLDR

You don’t need to write any code to push data into Kafka. Just choose your connector and start the job with the configuration you need. And it’s all Open Source!

Kafka Connect

Kafka

Before getting into the Kafka Connect framework, let us briefly sum up what Apache Kafka is in a couple of lines. Apache Kafka was built at LinkedIn to meet requirements that the message brokers already on the market did not meet: it had to be scalable, distributed and resilient, with low latency and high throughput. Currently, i.e. 2018, LinkedIn is processing about 1.8 petabytes of data per day through Kafka. Kafka offers a programmable interface (API) for a lot of languages to produce and consume data.

Kafka Connect

Kafka Connect has been built into Apache Kafka since version 0.9 (11/2015), although the idea existed before this release as a project named Copycat. Kafka Connect is basically a framework around Kafka to get data from different sources into Kafka and out of Kafka into other systems (sinks), e.g. Cassandra. It comes with automatic offset management: as a user of a connector you don’t need to worry about offsets, but rely on the developer of the connector.

Besides that, in discussions I have often come across people who thought that Kafka Connect was part of Confluent Enterprise and not a part of Open Source Kafka. To my surprise, I have even heard it from a long-term Kafka developer. That confusion might be due to the fact that if you google the term Kafka Connect, the first few results are pages by Confluent and its list of certified connectors.

Kafka Connect basically has three main components that you need to know for a deeper understanding of the framework.

Connectors are, in a way, the “brain” that determines how many tasks will run with the given configuration and how the work is divided between these tasks. For example, the JDBC connector can decide to parallelize the process to consume data from a database (see figure 2).
Tasks contain the main logic of getting the data into Kafka from external systems by connecting e.g. to a database (Source Task) or consuming data from Kafka and pushing it to external systems (Sink Task).
Workers are the part that abstracts away from the connectors and tasks in order to provide a REST API (main interaction), reliability, high availability, scaling, and load balancing.
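
How a connector might divide work between its tasks can be sketched in a few lines. The following is a simplified illustration in Python, not the actual Kafka Connect API (which is Java). The function name task_configs, the table names and the round-robin grouping are assumptions I made up for illustration; a real JDBC-like connector may split its tables differently.

```python
from typing import Dict, List


def task_configs(tables: List[str], max_tasks: int) -> List[Dict[str, str]]:
    """Split the tables to copy into at most max_tasks task configurations."""
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    # Never create more tasks than there is work to hand out.
    num_tasks = min(max_tasks, len(tables))
    groups: List[List[str]] = [[] for _ in range(num_tasks)]
    # Round-robin assignment: table i goes to task i % num_tasks.
    for i, table in enumerate(tables):
        groups[i % num_tasks].append(table)
    # Each task gets its own config, here just a comma-separated table list.
    return [{"tables": ",".join(group)} for group in groups]


if __name__ == "__main__":
    # Hypothetical tables, split across at most two tasks.
    for config in task_configs(["customers", "orders", "payments"], max_tasks=2):
        print(config)
```

The point of the sketch is only the split of responsibilities: the connector decides on the number and configuration of tasks, and each task then does the actual copying for its share of the work.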

Standalone

Kafka Connect can be started in two different modes. The first mode is called standalone and should be used only in development because offsets are maintained on the file system. This would be really bad if you were running this mode in production and your machine became unavailable: the state could be lost, which means the offsets are gone and you as a developer don’t know how much data has been processed.

# connect-standalone.properties
offset.storage.file.filename=/tmp/connect.offsets
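
To illustrate why a lost offset file hurts, here is a toy sketch of a source that resumes from an offset stored in a local file. It is not how Kafka Connect is implemented: the JSON file format, the function names and the "tweet-n" records are assumptions made up for this example.

```python
import json
import os
import tempfile
from typing import List, Optional


def load_offset(path: str) -> Optional[int]:
    """Return the last committed position, or None if no offset file exists."""
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["position"]


def store_offset(path: str, position: int) -> None:
    """Persist the position of the last processed record."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"position": position}, fh)


def poll(source: List[str], path: str) -> List[str]:
    """Return the records not processed yet and commit the new position."""
    last = load_offset(path)
    start = 0 if last is None else last + 1
    batch = source[start:]
    if batch:
        store_offset(path, len(source) - 1)
    return batch


if __name__ == "__main__":
    offsets_file = os.path.join(tempfile.mkdtemp(), "connect.offsets")
    source = ["tweet-1", "tweet-2"]
    print(poll(source, offsets_file))  # first run: everything is new
    source.append("tweet-3")
    print(poll(source, offsets_file))  # resumes after the stored offset
    os.remove(offsets_file)            # simulate losing the machine's disk
    print(poll(source, offsets_file))  # state is gone: everything is read again
```

Once the file is gone, the sketch has no way of knowing what was already processed and starts from the beginning, which is exactly the situation you want to avoid in production.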

Distributed

The second mode is called distributed. Here, the configuration, state and status are stored in Kafka itself in different topics, which benefit from all Kafka characteristics such as resilience and scalability. Workers can be started on different machines; all workers that share the same group.id in the .properties file form one Kafka Connect cluster, which can be scaled up or down.

# connect-distributed.properties
group.id=connect-cluster
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status
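
If you want to sanity-check such a file before starting a worker, a small helper along the following lines could do. This is only a sketch: the helper names are made up, the parser handles just simple key=value lines, and the check is limited to the four settings shown above, although a real worker configuration needs more than these.

```python
from typing import Dict, List

# Only the settings discussed in this post; a real worker needs further ones.
REQUIRED_KEYS = (
    "group.id",
    "config.storage.topic",
    "offset.storage.topic",
    "status.storage.topic",
)


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, ignoring blank lines and # comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props: Dict[str, str]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not props.get(key)]


if __name__ == "__main__":
    sample = """
    # connect-distributed.properties
    group.id=connect-cluster
    config.storage.topic=connect-configs
    """
    print(missing_keys(parse_properties(sample)))
```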

So let’s look at the content of the pretty self-explanatory topics used in the configuration file:

// TOPIC => connect-configs
{"properties": 
 {"connector.class":"c.e.t.k.c.twitter.TwitterSourceConnector",
  "twitter.token":"XXXX","tasks.max":"1","track.terms":"frankfurt",
  "task.class":"c.e.t.k.c.twitter.TwitterSourceTask",
  "twitter.secret":"XXX","name":"twitter-source","topic":
 "twitter", "twitter.consumersecret":"XXXXXX", 
 "twitter.consumerkey":"XXXXX"}}
{"tasks":1}
{"state":"STARTED"}

// TOPIC => connect-offsets
{"tweetId":968476484095610880}
{"tweetId":968476527108263936}

// TOPIC => connect-status
{"state":"RUNNING","trace":null,"worker_id":"connect:8083",
 "generation":2}
{"state":"UNASSIGNED","trace":null,"worker_id":"connect:8083",
 "generation":2}
{"state":"RUNNING","trace":null,"worker_id":"connect:8083",
 "generation":3}

The output shown here contains only the values of the messages; the key of each message is used to identify the different connectors.
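
Because the key identifies the connector, the current state can be derived by keeping only the latest value per key. The following sketch shows the idea on plain (key, value) pairs, without a Kafka client. The key strings are hypothetical placeholders, not the exact keys Kafka Connect writes; the values follow the connect-status output above.

```python
import json
from typing import Dict, Iterable, Tuple


def latest_state(messages: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Keep the state of the last message seen for every key."""
    states: Dict[str, str] = {}
    for key, value in messages:
        # Later messages for the same key overwrite earlier ones.
        states[key] = json.loads(value)["state"]
    return states


if __name__ == "__main__":
    # Hypothetical key, values shaped like the connect-status messages.
    messages = [
        ("twitter-source", '{"state":"RUNNING","trace":null,"worker_id":"connect:8083","generation":2}'),
        ("twitter-source", '{"state":"UNASSIGNED","trace":null,"worker_id":"connect:8083","generation":2}'),
        ("twitter-source", '{"state":"RUNNING","trace":null,"worker_id":"connect:8083","generation":3}'),
    ]
    print(latest_state(messages))
```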

Interaction pattern

The interaction pattern also normally differs between standalone and distributed mode. Standalone mode suits a non-production environment where you just want to test out a connector, for example, and want to manually set the offset of your choice. You start standalone mode by passing in the configurations of the sink or source connectors that you want to use, e.g. bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/other-connector.properties.

On the other hand, you can start the Kafka Connect worker in distributed mode with the following command: bin/connect-distributed.sh config/connect-distributed.properties. After that, you can list all available connectors and start, reconfigure on the fly, restart, pause and remove connectors via the REST API exposed by the framework. A full list of supported endpoints can be found in the official Kafka Connect documentation.
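
To give an idea of what talking to the REST API looks like, here is a small sketch that only builds the HTTP requests with Python’s standard library and does not send them. The base URL is an assumption (a worker reachable on localhost, port 8083), the helper names are my own, and the connector name and config are taken from the Twitter example for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Assumption: a worker's REST API is reachable on localhost, port 8083.
BASE_URL = "http://localhost:8083"


def connect_request(method: str, path: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a request for the Kafka Connect REST API."""
    data = None
    headers = {"Accept": "application/json"}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def _connector_path(name: str, suffix: str = "") -> str:
    # Quote the connector name so it is safe to use inside a URL path.
    return "/connectors/" + urllib.parse.quote(name, safe="") + suffix


def list_connectors() -> urllib.request.Request:
    return connect_request("GET", "/connectors")


def create_connector(name: str, config: dict) -> urllib.request.Request:
    return connect_request("POST", "/connectors", {"name": name, "config": config})


def update_config(name: str, config: dict) -> urllib.request.Request:
    return connect_request("PUT", _connector_path(name, "/config"), config)


def pause_connector(name: str) -> urllib.request.Request:
    return connect_request("PUT", _connector_path(name, "/pause"))


def restart_connector(name: str) -> urllib.request.Request:
    return connect_request("POST", _connector_path(name, "/restart"))


def delete_connector(name: str) -> urllib.request.Request:
    return connect_request("DELETE", _connector_path(name))


if __name__ == "__main__":
    requests = [
        list_connectors(),
        create_connector("twitter-source", {"tasks.max": "1", "topic": "twitter"}),
        pause_connector("twitter-source"),
        delete_connector("twitter-source"),
    ]
    for req in requests:
        print(req.get_method(), req.full_url)
```

Against a running worker, each of these requests could be sent with urllib.request.urlopen(req); the same calls work just as well with curl or any other HTTP client.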

Example

So let’s have a closer look at an example of a running data pipeline where we are getting some real time data from Twitter and using the kafka-console-consumer to consume and inspect the data.

The complete example shown in the terminal recording is available in the GitHub repository. You can download it and play around with the example project.

Conclusion

In this blog post, we covered the high-level components that are the building blocks of the Kafka Connect framework. Kafka Connect is part of the Open Source version of Apache Kafka and allows data engineers or business departments to move data from one system to another without writing any code, while building on Apache Kafka’s great characteristics, of which we barely scratched the surface in this post. So happy connecting…

Blog author

Akhlaq Malik
