Spark 2.0 – Datasets and case classes

27.7.2016 | 7 minutes reading time

The brand new major 2.0 release of Apache Spark was given out two days ago. One of its features is the unification of the DataFrame and Dataset APIs. While the DataFrame API has been part of Spark since the advent of Spark SQL (they replaced SchemaRDDs), the Dataset API was included as a preview in version 1.6 and aims at overcoming some of the shortcomings of DataFrames in regard to type safety.

This post has five sections:

The problem (roughly): States the problem in a rough fashion.
DataFrames versus Datasets: Quick recall on DataFrames and Datasets.
The problem (detailed): Detailed statement of the problem.
The solution: Proposes a solution to the problem.
It concludes with a Summary.

The problem (roughly)

The question this blog post addresses is roughly (for details see below): Given a Dataset, how can one append a column to it containing values derived from its columns without passing strings as arguments or doing anything else that would spoil the type safety the Dataset API can provide?

DataFrames versus Datasets

DataFrames have their origin in R and Python (Pandas), where they have proven to give a concise and practical programming interface for working with tabular data with a fixed schema. Due to the popularity of R and Python among Data Scientists, the DataFrame concept already has a certain degree of familiarity within these circles. Something that certainly allowed Spark to gain more users coming from this side. But the advantages of DataFrames do not only exist on the API side. There are also significant performance improvements as opposed to plain RDDs due to the additional structure information available which can be used by Spark SQL and Spark’s own Catalyst Optimizer.

Within the DataFrame API a tabular data set used to be described as an RDD consisting of rows with a row being an instance of type Array[Any]. Thus DataFrames basically do not take the data types of the column values into account. In contrast to this, the new Dataset API allows modelling rows of tabular data using Scala’s case classes.

While DataFrames are more dynamic in their typing, Datasets combine some of the benefits of Scala’s type checking with those of DataFrames. This can help to spot errors at an early stage but certain operations (see next section for an example) on Datasets still rely on passing column names in as String arguments rather than working with fields of an object.

This raises the question whether some of these operations can also be expressed within the type safe parts of the Datasets API alone, thus keeping the newly gained benefits of using the type system. As we will see in a particular example this requires some discipline and working with traits to circumvent a problem with inheritance that arises with case classes.

The problem (detailed)

The first lines of our exemplary CSV file bodies.csv look as follows:

id	width	height	depth	material	color
1	1.0	1.0	1.0	wood	brown
2	2.0	2.0	2.0	glass	green
3	3.0	3.0	3.0	metal	blue

Reading CSV files like this becomes much easier beginning with Spark 2.0. A SparkSession provides a fluent API for reading and writing. We can do as follows:

1val df: DataFrame = spark.read
2                         .schema(schema)
3                         .option("header", true)
4                         .csv("/path/to/bodies.csv")

Spark is able to infer the schema automatically in most cases by passing two times over the input file. In our case it would infer all columns as of type String. To help with that, we programmatically declare the schema as follows before the above code:

1val id       = StructField("id",       DataTypes.IntegerType)
2val width    = StructField("width",    DataTypes.DoubleType)
3val height   = StructField("height",   DataTypes.DoubleType)
4val depth    = StructField("depth",    DataTypes.DoubleType)
5val material = StructField("material", DataTypes.StringType)
6val color    = StructField("color",    DataTypes.StringType)
7 
8val fields = Array(id, width, height, depth, material, color)
9val schema = StructType(fields)

DataFrames outperform plain RDDs across all languages supported by Spark and provide a comfortable API when it comes to working with structured data and relational algebra. But they provide weak support when it comes to types. There are mainly two reasons:

For one thing, many operations on DataFrames involve passing in a String. Either as column name or as query. This is prone to error. For example df.select(“colour”) would pass at compile time and would only blow a likely long running job at run time.
A DataFrame is basically a RDD[Row] where a Row is just an Array[Any].

Spark 2.0 introduces Datasets to better address these points. The take away message is that instead of using type agnostic Rows, one can use Scala’s case classes or tuples to describe the contents of the rows. The (not so) magic gluing is done by using as on a Dataframe. (Tupels would match by position and also lack the possibility to customize naming.)

1final case class Body(id: Int, 
2                      width: Double, 
3                      height: Double, 
4                      depth: Double, 
5                      material: String, 
6                      color: String)
7 
8val ds = df.as[Body]

The matching between the DataFrames columns and the fields of the case class is done by name and the types should match. In summary, this introduces a contract and narrows down possible sources of error. For example, one immediate benefit is that we can access fields via the dot operator and get additional IDE support:

1val colors = ds.map(_.color) // Compiles!
2ds.map(_.colour)             // Typo - WON'T compile!

Further, we can use this feature and the newly added type-safe aggregation functions to write queries with compile time safety:

1import org.apache.spark.sql.expressions.scalalang.typed.{
2  count => typedCount, 
3  sum => typedSum}
4 
5ds.groupByKey(body => body.color)
6  .agg(typedCount[Body](_.id).name("count(id)"),
7       typedSum[Body](_.width).name("sum(width)"),
8       typedSum[Body](_.height).name("sum(height)"),
9       typedSum[Body](_.depth).name("sum(depth)"))
10  .withColumnRenamed("value", "group")
11  .alias("Summary by color level")
12  .show()

If we wanted to compute the volume of all bodies, this would be quite straightforward in the DataFrame API. Two solutions come to mind:

1// 1. Solution: Using a user-defined function and appending the results as column
2val volumeUDF = udf {
3 (width: Double, height: Double, depth: Double) => width * height * depth
4}
5 
6ds.withColumn("volume", volumeUDF($"width", $"height", $"depth"))
7 
8// 2. Solution: Using a SQL query
9spark.sql(s"""
10           |SELECT *, width * height * depth
11           |AS volume
12           |FROM bodies
13           |""".stripMargin)

But this would throw us back again to working with strings again. How could a solution with case classes look like? Of course, more work might be involved here but keeping type support could be a rewarding benefit in crucial operations.

While case classes are convenient in many regards they do not support inheritance (Link ). So we cannot declare a case class BodyWithVolume that extends Body with an additional volume field. Assuming we had such a class, we could do this:

1ds.map { 
2 body => 
3  val volume = body.width * body.height * body.depth
4  BodyWithVolume(body.id, body.width, body.height, body.depth, body.material, body.color, volume)
5}

This would of course solve our problem of adding the volume as new field and mapping a Dataset onto a new Dataset but as said, case classes do not support inheritance. Of course, no one could prevent us from declaring the classes Body and BodyWithVolume independently without the latter extending the former. But this certainly feels awkward given their close relationship.

The solution

Are we out of luck? Not quite. We can extend both classes starting from some common traits:

1trait IsIdentifiable {
2 def id: Int
3}
4 
5trait HasThreeDimensions {
6 def width: Double
7 def height: Double
8 def depth: Double
9}
10 
11trait ConsistsOfMaterial {
12 def material: String
13 def color: String
14}
15 
16trait HasVolume extends HasThreeDimensions {
17 def volume = width * height * depth
18}
19 
20final case class Body(id: Int, 
21                      width: Double, 
22                      height: Double, 
23                      depth: Double, 
24                      material: String, 
25                      color: String) extends 
26                      IsIdentifiable with 
27                      HasThreeDimensions with 
28                      ConsistsOfMaterial
29 
30final case class BodyWithVolume(id: Int, 
31                                width: Double, 
32                                height: Double, 
33                                depth: Double, 
34                                material: String, 
35                                color: String) extends 
36                                IsIdentifiable with 
37                                HasVolume with 
38                                ConsistsOfMaterial

This indisputably introduces more code. Still, if one takes into account that at a later stage there might be a need to compute densities for bodies, etc., this might be a good foundation, especially if type safe queries might be a concern.

Another limitation one certainly has to face, when working with Datasets is that currently a case class can only have 22 parameters, making it hard to work with say CSV files having 23 columns. The same holds for tuples instead of case classes.

Inheritance vs. composition

Of course, another solution could consist in using composition instead of inheritance. That is, we could work with Body as above and in addition final case class BodyWithVolume(body: Body, volume: Double). This would nest things.

Summary

In this blog post, we had a quick glimpse at the new Dataset API, saw how one can create a Dataset from a CSV file and then perform basic operations on it using the dot operator and transformations like map. We also saw how the Dataset API allows to write type safe aggregations. Finally, we discussed how traits can help to model relations between case classes, which in turn can be used to add new derived columns to a given Dataset.

Links:

Was this post helpful?

Blog author

Daniel Pape

Do you still have questions? Just send me a message.

fromDaniel Pape

Matrix Factorization for Ad Recommendation

This blog post describes how matrix factorization can be applied to the problem of ad targeting. It draws from my experience of developing a machine-learning-based solution for this task for the real-time performance marketing company twiago together...

AWS
Data

14.3.2018 | 7 minutes reading time

Daniel Pape

Spam classification using Spark’s DataFrames, ML and Zeppelin (Part 1)

This is the first entry in a series of blog posts about building and validating machine learning pipelines with Apache Spark . Its main concern is to show how to explore data with Spark and Apache Zeppelin notebooks in order to build machine learning...

Scala
Big Data
Data
Machine Learning

22.6.2016 | 15 minutes reading time

Daniel Pape

Calculating Pi with Apache Spark

Apache Spark is a system for cluster computing and part of the increasingly popular SMACK stack . The aim of this blog post is to provide a beginners introduction on how to set up a mini Spark cluster of virtual machines (VMs) using Vagrant and to run...

Big Data
Machine Learning

16.4.2016 | 9 minutes reading time

Daniel Pape

Your job at codecentric?

Jobs

Agile Developer und Consultant (w/d/m)

Alle Standorte

Spark 2.0 – Datasets and case classes

The problem (roughly)

DataFrames versus Datasets

The problem (detailed)

The solution

Summary

Was this post helpful?

Blog author

More articles

Matrix Factorization for Ad Recommendation

Spam classification using Spark’s DataFrames, ML and Zeppelin (Part 1)

Calculating Pi with Apache Spark

Your job at codecentric?

Agile Developer und Consultant (w/d/m)