When processing streaming data, the raw data from the events is often not sufficient. In most cases additional data must be added, for example metadata for a sensor of which only the ID is sent in the event.
In this blog post I would like to discuss various ways to solve this problem in Spark Streaming. The examples assume that the additional data initially lives outside the streaming application and can be read over the network, for example from a database. All samples and techniques refer to Spark Streaming and not to Spark Structured Streaming. The main techniques are:
- broadcast: for static data
- mapPartitions: for volatile data
- mapPartitions + connection broadcast: efficient connection handling
- mapWithState: speed-up through a local state
Broadcast
Spark has an integrated broadcasting mechanism that can be used to transfer data to all worker nodes when the application is started. This has the advantage, in particular with large amounts of data, that the transfer takes place only once per worker node and not with each task.
However, because the data cannot be updated later, this is only an option if the metadata is static. This means that no additional data, for example information about new sensors, may be added, and no data may be changed. In addition, the transferred objects must be serializable.
In this example, each sensor type, stored as a numerical ID (1, 2, ...), is to be replaced by a plain-text name in the stream processing (tire temperature, tire pressure, ...). It is assumed that the assignment type ID -> name is fixed.
// Long literals (1L, 2L) so that the pairs match the declared Map[Long, String]
val namesForId: Map[Long, String] = Map(1L -> "Wheel-Temperature", 2L -> "Wheel-Pressure")
stream.map(typId => (typId, namesForId(typId)))
A lookup without broadcast. The map is serialized for each task and transferred to the worker nodes, even if tasks were previously executed on the worker.
// Long literals (1L, 2L) so that the pairs match the declared Map[Long, String]
val namesForId: Map[Long, String] = Map(1L -> "Wheel-Temperature", 2L -> "Wheel-Pressure")
val namesForIdBroadcast = sc.broadcast(namesForId)
stream.map(typId => (typId, namesForIdBroadcast.value(typId)))
The map is distributed to the workers via a broadcast and no longer has to be transferred for each task.
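Because the broadcast map cannot be updated, an event may carry a type ID that the map does not contain. Calling the map directly, as in `namesForIdBroadcast.value(typId)`, then throws a NoSuchElementException. A minimal sketch of a more forgiving lookup follows; the helper name `nameFor` and the placeholder format are assumptions for illustration, not part of the original example.

```scala
object SensorNameLookup {
  // Hypothetical helper: resolves a type ID against the given map and
  // falls back to a placeholder instead of throwing for unknown IDs.
  def nameFor(names: Map[Long, String], typId: Long): String =
    names.getOrElse(typId, s"unknown-$typId")
}
```

In the stream it could be used as `stream.map(typId => (typId, SensorNameLookup.nameFor(namesForIdBroadcast.value, typId)))`.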
MapPartitions
The first way to read non-static data is in a map() operation. However, mapPartitions() should be used instead of map(). mapPartitions() is not called for every single element, but once per partition, which then contains several elements. This makes it possible to connect to the database only once per partition and to reuse the connection for all elements.
There are two different ways to query the data: use a bulk API to process all elements of the partition together, or use an asynchronous variant in which a non-blocking query is issued for each entry and the results are collected afterwards.
wikiChanges.mapPartitions(elements => {
  val session: Session = ??? // create database connection and session
  val preparedStatement: PreparedStatement = ??? // prepare statement, if supported by database
  elements.map(element => {
    // extract key from element and bind to prepared statement
    val boundStatement = preparedStatement.bind(???)
    session.executeAsync(boundStatement) // returns a future
  })
  .map(...) // retrieve value from future
})
An example of a lookup on data stored in Cassandra using mapPartitions and asynchronous queries
The example above shows a lookup using mapPartitions(): expensive operations such as opening the connection are done only once per partition. An asynchronous, non-blocking query is issued for each element, and the values are then read from the futures. Some libraries for reading from databases rely mainly on this pattern, for example joinWithCassandraTable from the Spark Cassandra Connector.
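One detail is easy to miss with this pattern: Scala iterators are lazy, so two chained map() calls on the partition iterator issue a query and wait for its result one element at a time. To actually keep several queries in flight, the elements can be processed in groups. The following is a minimal sketch in plain Scala, without Spark or a database; `queryAsync`, `enrich` and the `inFlight` parameter are hypothetical names, and `queryAsync` merely stands in for a real asynchronous driver call.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object AsyncLookup {
  // Hypothetical stand-in for an asynchronous database query.
  def queryAsync(key: Int)(implicit ec: ExecutionContext): Future[String] =
    Future(s"value-$key")

  // Enriches the elements of one partition: issues up to `inFlight`
  // queries at once, then waits for the results of that group.
  def enrich(elements: Iterator[Int], inFlight: Int = 100)
            (implicit ec: ExecutionContext): Iterator[(Int, String)] =
    elements.grouped(inFlight).flatMap { group =>
      // `group` is a strict Seq, so every query of the group is issued here
      val pending = group.map(key => key -> queryAsync(key))
      // only now block on the futures, in the original order
      pending.map { case (key, future) => key -> Await.result(future, 10.seconds) }
    }
}
```

Inside mapPartitions() the body would then be something like `elements => AsyncLookup.enrich(elements)`. The group size bounds both the memory used per partition and the number of concurrent queries against the database.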
Why is the connection not created at the beginning of the job and then used for each partition? For this purpose, the connection would have to be serialized and then transferred to the workers for each task. The amount of data would not be too large, but most connection objects are not serializable.
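The bulk variant mentioned earlier in this section can be sketched in a similar way. This is plain Scala without Spark; the `BulkQuery` type, the `enrich` function and the batch size are assumptions for illustration and stand in for whatever bulk API the database offers.

```scala
object BulkLookup {
  // Hypothetical bulk API: resolves many IDs in a single round trip.
  type BulkQuery = Seq[Long] => Map[Long, String]

  // Enriches the elements of one partition with one bulk call per batch.
  // IDs the database does not know are returned with None.
  def enrich(elements: Iterator[Long],
             bulkQuery: BulkQuery,
             batchSize: Int = 500): Iterator[(Long, Option[String])] =
    elements.grouped(batchSize).flatMap { batch =>
      val found = bulkQuery(batch.distinct) // one call per batch
      batch.map(id => id -> found.get(id))
    }
}
```

Compared with one query per element, this trades a little latency (a batch has to fill up or the partition has to end) for far fewer round trips.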
Broadcast Connection + MapPartitions
However, it is a good idea not to rebuild the connection for each partition, but only once per worker node. Because the connection is not serializable (see above), it cannot be broadcast itself. Instead, a factory is broadcast that builds the connection on the first call and returns the same connection on all subsequent calls. This function is then called in mapPartitions() to get the connection to the database.
In Scala it is not necessary to use a function for this; a lazy val can be used instead. The lazy val is defined within a wrapper class, which can be serialized and broadcast. On the first access, an instance of the non-serializable connection class is created on the worker node and then returned for every subsequent access.
class DatabaseConnection extends Serializable {
  // @transient: only the wrapper is serialized, never the connection itself
  @transient lazy val connection: AConnection = {
    // all the stuff to create the connection
    new AConnection(???)
  }
}
val connectionBroadcast = sc.broadcast(new DatabaseConnection)
incomingStream.mapPartitions(elements => {
  val connection = connectionBroadcast.value.connection
  // see above
})
A wrapper object that creates the connection is broadcast and then used to obtain the actual connection on the worker node.
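The behaviour of such a wrapper can be tried out without Spark. The sketch below sends a wrapper through a Java serialization round trip, roughly what happens to a broadcast value with Spark's default Java serializer, and counts how often the connection is built. `FakeConnection`, `ConnectionCounter`, `ConnectionHolder` and `RoundTrip` are hypothetical names. The lazy val is marked `@transient` here, which is an assumption on my part about how the wrapper should be written: it keeps the non-serializable connection out of the serialized form even if the lazy val has already been evaluated before shipping.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for a connection class that is NOT serializable.
class FakeConnection

object ConnectionCounter {
  // counts how often a connection is built in this JVM
  val created = new AtomicInteger(0)
}

class ConnectionHolder extends Serializable {
  // built on first access, then reused; never part of the serialized form
  @transient lazy val connection: FakeConnection = {
    ConnectionCounter.created.incrementAndGet()
    new FakeConnection
  }
}

object RoundTrip {
  // Serializes and deserializes a value with plain Java serialization.
  def copy[T <: java.io.Serializable](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    in.readObject().asInstanceOf[T]
  }
}
```

Calling `RoundTrip.copy(new ConnectionHolder)` and then reading `connection` several times on the copy builds the connection once, on the receiving side.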
MapWithState()
All approaches shown so far retrieve the data from a database when it is needed. This usually means a network call for each entry or at least for each partition. It would be more efficient to have the data available directly in memory.
With mapWithState() Spark itself offers a way to transform data with the help of a state and, in turn, to adjust that state. The state is managed by key. This key is used to distribute the data in the cluster, so that not all data has to be kept on every worker node. An incoming stream must therefore also be structured as key-value pairs.
This keyed state can also be used for a lookup. By means of initialState(), an RDD can be passed as an initial state. However, any updates can only be performed based on a key. This also applies to deleting entries. It is not possible to completely delete or reload the state.
To update the state, additional notification events must be present in the stream. These can, for example, come from a separate Kafka topic and must be merged with the actual data stream (union()). The amount of data sent can range from a simple notification with an ID, which is then used to read the new data, to the complete new data set.
Messages are published to the Kafka topic, for example, if metadata is updated or newly created. In addition, timed events can be published to the Kafka topic or can be generated by a custom receiver in Spark itself.
A simple implementation can look like this. First, the Kafka topics are read and each value is supplemented with a marker for the event type (data or notification). The marker goes into the value rather than the key because mapWithState() keeps one state entry per key: notification and data events for the same ID must share a key, otherwise a notification would update a state entry that the data events never read. Then both streams are merged into a common stream and processed in mapWithState(). The state specification is created beforehand by passing the state function to StateSpec.function().
val kafkaParams = Map("metadata.broker.list" -> brokers)
val notifications = notificationsFromKafka
  .map(entry => (entry._1, ("notification", entry._2)))
val data = dataFromKafka
  .map(entry => (entry._1, ("data", entry._2)))
val lookupState = StateSpec.function(lookupWithState _)
notifications
  .union(data)
  .mapWithState(lookupState)
The lookupWithState function describes the processing in the state. The following parameters are passed:
- batchTime: the start time of the current microbatch
- key: the original key from the stream
- valueOpt: the value for the key in the stream, together with the type marker (data or notification)
- state: the value stored in the state for the key
For data events, a tuple consisting of the original key and the original value as well as a number is returned. The number is taken from the state or, if not already present in the state, is chosen randomly.
import org.apache.spark.streaming.{State, StateSpec, Time}
import scala.util.Random

def lookupWithState(batchTime: Time, key: String, valueOpt: Option[(String, String)], state: State[Long]): Option[((String, String), Long)] = {
  valueOpt match {
    case Some(("notification", _)) =>
      // retrieve new value from notification or external system
      val newValue = Random.nextLong()
      state.update(newValue)
      None // no downstream processing for notifications
    case Some(("data", value)) =>
      val stateVal = state.getOption() match {
        // check if there is a state for the key
        case Some(stateValue) => stateValue
        case None =>
          val newValue = Random.nextLong()
          state.update(newValue)
          newValue
      }
      Some(((key, value), stateVal))
    case _ =>
      None // unknown marker or no value for this key
  }
}
In addition, the timeout mechanism of mapWithState() can be used to remove entries from the state that have not been updated for a certain time.
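How an initial state and a timeout are configured can be sketched as follows. This is only an outline under stated assumptions: the object and function names are hypothetical, the state function is reduced to a plain lookup with simple String keys and values, the initial data is assumed to be loaded into an RDD by the caller, and the 30 minutes are an arbitrary value. When an entry times out, Spark calls the state function once more for that key without a value; the state must not be updated in that call.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Minutes, State, StateSpec, Time}

object LookupStateSpec {
  // Hypothetical state function: emits the number stored for a key, or
  // nothing when the entry is currently being removed by the timeout.
  def lookup(batchTime: Time, key: String, valueOpt: Option[String],
             state: State[Long]): Option[(String, Long)] =
    if (state.isTimingOut()) {
      // last call for this key; valueOpt is None, updating is not allowed
      None
    } else {
      state.getOption().map(number => (key, number))
    }

  // initialData is assumed to be loaded by the caller, e.g. from the database.
  def spec(initialData: RDD[(String, Long)]): StateSpec[String, String, Long, (String, Long)] =
    StateSpec
      .function(lookup _)
      .initialState(initialData) // state that exists before the first event
      .timeout(Minutes(30))      // drop entries idle for 30 minutes (arbitrary)
}
```

The resulting specification would be passed to mapWithState() on a stream of (String, String) pairs.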
Conclusion
Loading additional information is a common problem in streaming applications. With Spark Streaming, there are a number of ways to accomplish this.
The easiest way is to broadcast static data at the start of the application. For volatile data, reading per partition is easy to implement and provides solid performance. Using Spark state can increase the speed further, but it is more complex to develop.
Ideally, the data is always present directly on the worker node on which it is processed. This is the case, for example, with the use of Spark state. Kafka Streams pursues this approach even more consistently. Here, a table is treated as a stream and, provided the streams are identically partitioned, distributed in the same way as the original stream. This makes local lookups possible.
Apache Flink is also working on efficient lookups, under the title Side Inputs.
Blog author
Matthias Niehoff
Head of Data