Lookup additional data in Spark Streaming

1.6.2017 | 7 minutes reading time

When processing streaming data, the raw data in the events is often not sufficient. In most cases additional data has to be added, for example the metadata of a sensor for which the event contains only the ID.

In this blog post I would like to discuss various ways to solve this problem in Spark Streaming. The examples assume that the additional data initially lives outside the streaming application and can be read over the network, for example from a database. All samples and techniques refer to Spark Streaming and not to Spark Structured Streaming. The main techniques are:

broadcast: for static data
mapPartitions: for volatile data
mapPartitions + connection broadcast: efficient connection handling
mapWithState: speed-up through local state

Broadcast

Spark has an integrated broadcasting mechanism that can be used to transfer data to all worker nodes when the application is started. This has the advantage, in particular with large amounts of data, that the transfer takes place only once per worker node and not with each task.

However, because the data cannot be updated later, this is only an option if the metadata is static: no additional data, for example information about new sensors, may be added, and no existing data may change. In addition, the transferred objects must be serializable.

In this example, each sensor type, stored as a numerical ID (1, 2, ...), is to be replaced in the stream processing by a plain-text name (tire temperature, tire pressure, ...). It is assumed that the mapping type ID -> name is fixed.

// Long literals (1L, 2L) are needed so that the keys match Map[Long, String]
val namesForId: Map[Long, String] = Map(1L -> "Wheel-Temperature", 2L -> "Wheel-Pressure")
stream.map(typeId => (typeId, namesForId(typeId)))

A lookup without broadcast. The map is serialized for each task and transferred to the worker nodes, even if tasks were previously executed on the worker.

val namesForId: Map[Long, String] = Map(1L -> "Wheel-Temperature", 2L -> "Wheel-Pressure")
// ship the map to the workers once instead of with every task
val namesForIdBroadcast = sc.broadcast(namesForId)
stream.map(typeId => (typeId, namesForIdBroadcast.value(typeId)))

The map is distributed to the workers via a broadcast and no longer has to be transferred for each task.
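Because the broadcast data cannot change while the application runs, an ID that is missing from the map makes the plain `namesForId(...)` lookup fail with a NoSuchElementException. The following minimal sketch shows one way to guard against that. The helper `nameFor` and the placeholder text are assumptions made for illustration and are not part of the original example.

```scala
object SensorNames {
  // Static mapping sensor type ID -> plain-text name (values from the example above).
  val namesForId: Map[Long, String] =
    Map(1L -> "Wheel-Temperature", 2L -> "Wheel-Pressure")

  // Hypothetical helper: resolves an ID and falls back to a placeholder for unknown IDs
  // instead of throwing.
  def nameFor(names: Map[Long, String], typeId: Long): String =
    names.getOrElse(typeId, s"unknown-sensor-type-$typeId")
}
```

Inside the stream it could be used as `stream.map(typeId => (typeId, SensorNames.nameFor(namesForIdBroadcast.value, typeId)))`.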

MapPartitions

The first way to read non-static data is inside a map() operation. However, mapPartitions() should be used instead of map(). mapPartitions() is not called for every single element but once per partition, which contains several elements. This makes it possible to connect to the database only once per partition and to reuse the connection for all of its elements.

There are two different ways to query the data: either use a bulk API to process all elements of the partition together, or use an asynchronous variant in which a non-blocking query is issued for each entry and the results are collected afterwards.
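The bulk variant can be sketched without any Spark dependency as a function from a partition iterator to a result iterator, which is the shape mapPartitions() expects. The `BulkQuery` type, the name `enrichPartition` and the batch size are assumptions for illustration. A real bulk API will look different depending on the database.

```scala
object BulkLookup {
  // Hypothetical shape of a bulk API: one round trip resolves many keys at once.
  type BulkQuery = Seq[Long] => Map[Long, String]

  // Enriches all elements of one partition, batch by batch.
  def enrichPartition(
      elements: Iterator[Long],
      bulkQuery: BulkQuery,
      batchSize: Int = 500
  ): Iterator[(Long, Option[String])] =
    elements.grouped(batchSize).flatMap { batch =>
      // One request per batch instead of one request per element.
      val found = bulkQuery(batch.distinct)
      // Keep the original order; keys without a result map to None.
      batch.map(id => (id, found.get(id)))
    }
}
```

In a streaming job this could be called as `stream.mapPartitions(elements => BulkLookup.enrichPartition(elements, query))`, where `query` wraps the database client. Batching keeps memory bounded even for large partitions.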

wikiChanges.mapPartitions { elements =>
  val session: Session = ??? // create database connection and session
  val preparedStatement: PreparedStatement = ??? // prepare statement, if supported by database
  elements
    .map { element =>
      // extract key from element and bind to prepared statement
      val boundStatement: BoundStatement = preparedStatement.bind(???)
      session.executeAsync(boundStatement) // DataStax Java driver call, returns a future
    }
    // note: iterators are lazy, so buffer several futures here if queries should overlap
    .map(future => ???) // retrieve value from future
}

An example of a lookup on data stored in Cassandra using mapPartitions and asynchronous queries

The example above shows a lookup using mapPartitions: expensive operations such as opening the connection are performed only once per partition. An asynchronous, non-blocking query is issued for each element, and the values are then read from the futures. Some libraries for reading from databases rely mainly on this pattern, for example joinWithCassandraTable from the Spark Cassandra Connector.
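One detail matters when implementing the asynchronous variant: Scala iterators are lazy. If issuing the query and waiting for the result are two chained map() calls on the partition iterator, each element is queried and awaited one at a time. The following sketch therefore issues a bounded batch of queries first and only then waits for them. It uses scala.concurrent.Future as a stand-in for a driver-specific future. `AsyncQuery`, `maxInFlight` and `timeout` are names assumed for illustration.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object AsyncLookup {
  // Hypothetical asynchronous client call: returns immediately with a Future.
  type AsyncQuery = Long => Future[Option[String]]

  def enrichPartition(
      elements: Iterator[Long],
      asyncQuery: AsyncQuery,
      maxInFlight: Int = 100,
      timeout: FiniteDuration = 10.seconds
  )(implicit ec: ExecutionContext): Iterator[(Long, Option[String])] =
    elements.grouped(maxInFlight).flatMap { batch =>
      // Issue all queries of the batch first ...
      val pending = batch.map(id => asyncQuery(id).map(value => (id, value)))
      // ... and only then block until the whole batch has completed.
      Await.result(Future.sequence(pending), timeout)
    }
}
```

The batch size limits how many queries are in flight at once, so a large partition does not overwhelm the database.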

Why is the connection not created once at the beginning of the job and then used for every partition? To do so, the connection would have to be serialized and transferred to the workers with each task. The amount of data would not be too large, but most connection objects are not serializable.
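The serialization problem can be reproduced without Spark. Spark ships task closures using Java serialization, so any object captured by a closure has to pass through an ObjectOutputStream. The sketch below uses a made-up `FakeConnection` class as a stand-in for a real connection class and checks whether Java serialization accepts it.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerializationCheck {
  // Stand-in for a typical connection class: it does not implement Serializable.
  class FakeConnection(val host: String)

  // Returns true if Java serialization accepts the object, false otherwise.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      try out.writeObject(obj)
      finally out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

A connection created on the driver and referenced inside map() or mapPartitions() fails in the same way when Spark tries to serialize the task.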

Broadcast Connection + MapPartitions

However, it is a good idea to build the connection not once per partition but only once per worker node. Because the connection is not serializable (see above), it cannot be broadcast itself. Instead, a factory is broadcast that builds the connection on the first call and returns the same connection on all subsequent calls. This function is then called in mapPartitions() to obtain the connection to the database.

In Scala it is not necessary to use a function for this; a lazy val can be used instead. The lazy val is defined within a wrapper class that can be serialized and broadcast. On the first access, an instance of the non-serializable connection class is created on the worker node and is then returned for every subsequent access.

class DatabaseConnection extends Serializable {
  // created on first access on the worker; @transient keeps it out of the serialized form
  @transient lazy val connection: AConnection = {
    // all the stuff to create the connection
    new AConnection(???)
  }
}
val connectionBroadcast = sc.broadcast(new DatabaseConnection)
incomingStream.mapPartitions(elements => {
  val connection = connectionBroadcast.value.connection
  elements.map(element => ???) // look up each element with the connection, see above
})

A connection factory object is broadcast and then used to obtain the actual connection on the worker node.
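The behaviour of such a wrapper can be checked locally by simulating the transfer to a worker with a Java serialization round trip. In this sketch `FakeConnection`, `ConnectionHolder` and `roundTrip` are hypothetical names, and the counter only exists to make the number of created connections visible. The lazy val is marked @transient so that the wrapper stays serializable even if the connection has already been initialized on the sending side.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.concurrent.atomic.AtomicInteger

object LazyConnectionDemo {
  // Counts how often a connection is actually opened in this JVM.
  val opened = new AtomicInteger(0)

  // Stand-in for a non-serializable connection class.
  class FakeConnection { opened.incrementAndGet() }

  // Serializable wrapper; the connection itself is never part of the serialized form.
  class ConnectionHolder extends Serializable {
    @transient lazy val connection: FakeConnection = new FakeConnection
  }

  // Simulates shipping the wrapper to a worker: serialize, then deserialize a copy.
  def roundTrip(holder: ConnectionHolder): ConnectionHolder = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(holder)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[ConnectionHolder]
    finally in.close()
  }
}
```

The deserialized copy opens its connection on first access and returns the same instance afterwards, which is the behaviour the broadcast wrapper relies on.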

MapWithState()

All approaches shown so far retrieve the data from a database when needed. This usually means a network call for each entry or at least for each partition. It would be more efficient to have the data available directly in memory.

With mapWithState(), Spark itself offers a way to transform data with the help of a state and, in turn, to update that state. The state is managed per key. This key is used to distribute the data across the cluster, so that not all data has to be kept on every worker node. An incoming stream must therefore be structured as key-value pairs.

This keyed state can also be used for a lookup. By means of initialState(), an RDD can be passed as an initial state. However, any updates can only be performed based on a key. This also applies to deleting entries. It is not possible to completely delete or reload the state.

To update the state, additional notification events must be present in the stream. These can, for example, come from a separate Kafka topic and must be merged with the actual data stream (union()). The amount of data sent can range from a simple notification with an ID, which is then used to read the new data, to the complete new data set.

Messages are published to the Kafka topic, for example, if metadata is updated or newly created. In addition, timed events can be published to the Kafka topic or can be generated by a custom receiver in Spark itself.

A simple implementation can look like this. First, the Kafka topics are read and each value is supplemented with a marker for the event type (data or notification). The key itself stays unchanged: Spark manages one state entry per key, so both event types have to arrive under the same key in order to share a state entry. Then both streams are merged into a common stream and processed in mapWithState(). The state is specified beforehand by passing the state function to StateSpec.

import org.apache.spark.streaming.{State, StateSpec, Time}
import scala.util.Random

val kafkaParams = Map("metadata.broker.list" -> brokers)
// tag each value with its event type; the key stays the original key
val notifications = notificationsFromKafka
  .map(entry => (entry._1, ("notification", entry._2)))
val data = dataFromKafka
  .map(entry => (entry._1, ("data", entry._2)))
val lookupState = StateSpec.function(lookupWithState _)
notifications
  .union(data)
  .mapWithState(lookupState)

The lookupWithState function describes the processing in the state. The following parameters are passed:

batchTime: the start time of the current microbatch
key: the original key from the stream
valueOpt: the value for the key in the stream, here a pair of type marker (data or notification) and original value
state: the value stored in the state for the key

A tuple consisting of the original key and the original value as well as a number is returned. The number is taken from the state or, if not yet present in the state, chosen randomly.

def lookupWithState(batchTime: Time, key: String, valueOpt: Option[(String, String)], state: State[Long]): Option[((String, String), Long)] = {
  valueOpt.flatMap {
    case ("notification", _) =>
      // retrieve new value from notification or external system
      val newValue = Random.nextLong()
      state.update(newValue)
      None // no downstream processing for notifications
    case ("data", value) =>
      // check if there is a state for the key
      val stateVal = state.getOption() match {
        case Some(stateValue) => stateValue
        case None =>
          val newValue = Random.nextLong()
          state.update(newValue)
          newValue
      }
      Some(((key, value), stateVal))
    case _ => None // ignore unknown event types
  }
}
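To see how notification and data events interact, the following plain-Scala sketch replays a list of events against an in-memory map that stands in for Spark's per-key state. It keys the state by the original key only, since both event types have to address the same state entry. The `Event` types and the `replay` function are assumptions made for illustration. The sketch models the semantics and is not how Spark implements its state.

```scala
object StateReplay {
  sealed trait Event
  final case class Notification(key: String, newValue: Long) extends Event
  final case class Data(key: String, value: String) extends Event

  // Processes the events in order; `default` supplies a value for keys without state.
  def replay(events: Seq[Event], default: String => Long): Seq[((String, String), Long)] = {
    val state = scala.collection.mutable.Map.empty[String, Long]
    events.toList.flatMap {
      case Notification(key, newValue) =>
        state.update(key, newValue) // overwrite the state, emit nothing downstream
        None
      case Data(key, value) =>
        // read the state, initializing it on first use
        val stateVal = state.getOrElseUpdate(key, default(key))
        Some(((key, value), stateVal))
    }
  }
}
```

A data event that arrives before any notification gets the default value, and every data event after a notification for the same key sees the updated value.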

In addition, the timeout mechanism of mapWithState() can be used to remove entries from the state that have not been updated for a certain time.

Conclusion

Loading additional information is a common problem in streaming applications. With Spark Streaming, there are a number of ways to accomplish this.

The easiest way is to broadcast static data at the start of the application. For volatile data, reading per partition is easy to implement and provides solid performance. Using Spark state can increase the speed further, but it is more complex to develop.

Ideally, the data is always directly available on the worker node on which it is processed. This is the case, for example, when Spark state is used. Kafka Streams pursues this approach even more consistently: a table is treated as a stream and, provided the streams are identically partitioned, distributed in the same way as the original stream. This makes local lookups possible.

Apache Flink is also working on efficient lookups, under the title Side Inputs.

Blog author

Matthias Niehoff

Head of Data
