Realtime Fast Data Analytics with Druid

18.8.2016 | 13 minutes reading time

I have been working with the SMACK stack for a while now and it is great fun from a developer’s point of view. Kafka is a very robust data buffer, Spark is great at streaming all that buffered data and Cassandra is really fast at writing and retrieving it. Unless, of course, your data analysts come up with new queries for which there are no optimized Cassandra tables. In this case, you have to make a choice. You could use Spark itself to do those computations dynamically, for example with a notebook tool like Zeppelin. Depending on your data, this might take a while. Or you can store the data in a suitable form in Cassandra during ingestion, accepting the penalty of duplicated data. And with the next new query, the cycle starts again…

Druid promises Online Analytical Processing (OLAP) capability in fast data contexts with realtime stream ingestion and sub-second queries. In this post, I am going to introduce the basic concepts behind Druid and show the tools in action.

Druid Segments

The equivalent of a table in Druid is called a “data source”. The data source defines how data is stored and sharded.

Druid relies on the time dimension for every query, so data sets require a timestamp column. Other than the primary timestamp, data sets can contain a set of dimensions and metrics (measures in OLAP). Let’s use an example. We’re tracking registrations for events – the time the registration occurred, the name of the event, the country the event takes place in and the number of guests covered by that registration:

timestamp	event name	country	guests
2016-08-14T14:01:35Z	XP Cologne 2017	Germany	12
2016-08-14T14:01:45Z	XP Cologne 2017	Germany	2
2016-08-14T14:02:36Z	data2day 2016	Germany	1
2016-08-14T14:02:55Z	JavaOne	USA	9

“Event name” and “country” are dimensions – these are data fields that analysts might want to filter on. “guests” is a metric – we can do calculations on that field, such as “How many guests have registered for the event ‘XP Cologne 2017’?” or “How many guests are attending conferences in Germany?”
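The two questions above boil down to filtering on a dimension and summing a metric. A minimal Python sketch over the example rows (plain Python, not a Druid API; the name `sum_guests` is made up for illustration):

```python
# Registrations from the example table: two dimensions and one metric per row.
registrations = [
    {"timestamp": "2016-08-14T14:01:35Z", "event_name": "XP Cologne 2017", "country": "Germany", "guests": 12},
    {"timestamp": "2016-08-14T14:01:45Z", "event_name": "XP Cologne 2017", "country": "Germany", "guests": 2},
    {"timestamp": "2016-08-14T14:02:36Z", "event_name": "data2day 2016", "country": "Germany", "guests": 1},
    {"timestamp": "2016-08-14T14:02:55Z", "event_name": "JavaOne", "country": "USA", "guests": 9},
]


def sum_guests(rows, dimension, value):
    """Filter on one dimension value and sum the 'guests' metric."""
    return sum(row["guests"] for row in rows if row[dimension] == value)


print(sum_guests(registrations, "event_name", "XP Cologne 2017"))  # 14
print(sum_guests(registrations, "country", "Germany"))  # 15
```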

Druid stores the data in immutable “segments”. A segment contains all data for a configured period of time. By default, a segment contains data for a day, but this can be set higher or lower. So depending on the configuration, the data above could be in a single segment if the segment granularity is “day”, or in two different segments if the granularity is “minute”. In addition to segment granularity, there is also “query granularity”. This defines how data is rolled up in Druid. With the above example, an analyst might not care about single registration events in a segment but about the aggregation. So going by the query granularity of minute, we would get the following result for the data above:

timestamp	event name	country	guests
2016-08-14T14:01:00Z	XP Cologne 2017	Germany	14
2016-08-14T14:02:00Z	data2day 2016	Germany	1
2016-08-14T14:02:00Z	JavaOne	USA	9
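To make the two granularities concrete, here is a small Python sketch that mimics the idea outside of Druid: timestamps are truncated to the chosen granularity, rows with identical truncated timestamp and dimensions are summed, and the distinct truncated timestamps stand in for segment buckets. The function names are made up for illustration and this is not how Druid implements it internally.

```python
from datetime import datetime

ROWS = [
    ("2016-08-14T14:01:35Z", "XP Cologne 2017", "Germany", 12),
    ("2016-08-14T14:01:45Z", "XP Cologne 2017", "Germany", 2),
    ("2016-08-14T14:02:36Z", "data2day 2016", "Germany", 1),
    ("2016-08-14T14:02:55Z", "JavaOne", "USA", 9),
]

# Output patterns that zero out everything below the chosen granularity.
PATTERNS = {
    "minute": "%Y-%m-%dT%H:%M:00Z",
    "hour": "%Y-%m-%dT%H:00:00Z",
    "day": "%Y-%m-%dT00:00:00Z",
}


def truncate(timestamp, granularity):
    """Truncate an ISO timestamp (UTC, second precision) to a granularity."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.strftime(PATTERNS[granularity])


def rollup(rows, query_granularity):
    """Sum the guests metric per (truncated timestamp, event name, country)."""
    totals = {}  # dicts keep insertion order, so output follows first occurrence
    for timestamp, event_name, country, guests in rows:
        key = (truncate(timestamp, query_granularity), event_name, country)
        totals[key] = totals.get(key, 0) + guests
    return [key + (guests,) for key, guests in totals.items()]


def segment_buckets(rows, segment_granularity):
    """Distinct time buckets the rows would fall into."""
    return sorted({truncate(row[0], segment_granularity) for row in rows})


for row in rollup(ROWS, "minute"):
    print(row)

print(len(segment_buckets(ROWS, "day")))  # 1
print(len(segment_buckets(ROWS, "minute")))  # 2
```

The rolled-up rows match the table above: the two “XP Cologne 2017” registrations collapse into a single row with 14 guests.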

Data rollup can save you quite a bit in storage capacity, but of course you lose the ability to query individual events. Now that we know about basic data structures in Druid, let’s look at the components that make up a Druid cluster.

Cluster components

Well, I never promised that this was going to be easy. Druid requires quite a few components and external dependencies that need to work together. Let’s step through them:

Historical Nodes

These are nodes that do three things really well: loading immutable data segments from deep storage, dropping those segments and serving queries about loaded segments. In a production environment, deep storage will typically be Amazon S3 or HDFS. A historical node only cares about its own loaded segments. It does not know anything about any other segment. How does it know what segments to load? For that, we have …

Coordinator nodes

Coordinator nodes are the housekeepers in a Druid cluster. They make sure that all serviceable segments are loaded by at least one Historical Node – depending on configured replication factors. They also make sure Historicals drop segments that are no longer needed and rebalance segments for a somewhat even distribution within the cluster. So that’s historical data covered. Historicals and Coordinators communicate indirectly through another external dependency: Apache Zookeeper, which you may already know from Kafka or Mesos. You can run multiple Coordinators for high availability; a leader will be elected.
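As a toy illustration of what “replication factor” and “somewhat even distribution” mean, the following Python sketch places every segment on the least-loaded nodes. This is deliberately simplistic and is not Druid’s actual balancing algorithm; the node names, segment names and the function `assign_segments` are invented for the example.

```python
def assign_segments(segments, historicals, replicas):
    """Toy placement: put each segment on the `replicas` least-loaded nodes."""
    if replicas > len(historicals):
        raise ValueError("not enough Historical nodes for the replication factor")
    assignment = {node: [] for node in historicals}
    for segment in segments:
        # Nodes holding the fewest segments come first (sort is stable for ties).
        targets = sorted(historicals, key=lambda node: len(assignment[node]))[:replicas]
        for node in targets:
            assignment[node].append(segment)
    return assignment


segments = ["meetup-14h", "meetup-15h", "meetup-16h"]
placement = assign_segments(
    segments, ["historical-1", "historical-2", "historical-3"], replicas=2
)
for node, loaded in placement.items():
    print(node, loaded)
```

With three segments, three nodes and a replication factor of two, every node ends up serving two segments and every segment is available on two nodes.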

So what about the promised realtime ingestion?

Realtime Ingestion

While Druid can be used to ingest batches of data, realtime data ingestion is one of its strong points. At the moment, there are two options available for realtime ingestion. The first option is Realtime Nodes, but I will not go into detail about them because they seem to be on their way out. Superseding Realtime Nodes is the Indexing Service. Now, the Indexing Service is a bit more involved in itself. It is centered around the notion of tasks. The component that accepts tasks from the outside is the “Overlord”. It then forwards those tasks to so-called “Middle Managers”. These nodes in turn spawn so-called “Peons”, each of which runs in a separate JVM and can only run one task at a time. A Middle Manager can run a specified number of Peons. If all Middle Managers are occupied and the Overlord is not able to assign a new task anywhere, there is some built-in autoscaling capability to create new Middle Managers in suitable environments (e.g. AWS).

So for realtime ingestion, a task is created for every new segment that is to be created. Data becomes queryable immediately after ingestion. The segments are announced in the third external dependency – a *gasp* relational database. The database contains the metadata about the segment and some rules that decide when this data should be loaded or dropped by Historicals. Once a segment is complete, it is handed off to deep storage, metadata is written and the task is completed.
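The relationship between Overlord, Middle Managers and Peons can be pictured with a toy model. The sketch below is plain Python with invented class and task names, not Druid code: each Middle Manager has a fixed number of Peon slots, and the Overlord hands a task to the first Middle Manager with a free slot.

```python
class MiddleManager:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity  # maximum number of Peons
        self.running = []  # one entry per Peon, each running exactly one task

    def has_free_slot(self):
        return len(self.running) < self.capacity


class Overlord:
    def __init__(self, middle_managers):
        self.middle_managers = middle_managers
        self.pending = []  # tasks that could not be assigned anywhere

    def submit(self, task):
        """Assign the task to the first Middle Manager with a free Peon slot."""
        for manager in self.middle_managers:
            if manager.has_free_slot():
                manager.running.append(task)  # stands in for spawning a Peon
                return manager.name
        # All Middle Managers are occupied: this is the situation in which
        # autoscaling would have to provide a new Middle Manager.
        self.pending.append(task)
        return None


overlord = Overlord([MiddleManager("mm-1", capacity=1), MiddleManager("mm-2", capacity=2)])
results = [overlord.submit(f"task-{n}") for n in range(1, 5)]
print(results)  # ['mm-1', 'mm-2', 'mm-2', None]
```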

The Indexing Service API is a bit low level. To make working with it easier, the “Tranquility” project seems to have established itself as the entrypoint for realtime ingestion. Tranquility servers can provide HTTP endpoints for data ingestion or read data from Kafka.

OK then, we got the data ingested and stored. How can you query it?

Broker Nodes

Broker Nodes are your gateway to the data stored in Druid. They know which segments are available on which nodes (from Zookeeper), query the responsible nodes and merge the results together. The primary query “language” is a REST API, but tool suites like the Imply Analytics Platform extend that support with tools like Pivot and PlyQL. We will see those later.
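The scatter and merge behaviour of a Broker can be sketched in a few lines of Python. This is a toy model under simplifying assumptions (every segment lives on exactly one node, results are simple sums); the data, node names and function names are invented for illustration.

```python
# Which node serves which segment (a Broker learns this from Zookeeper).
SEGMENT_LOCATIONS = {"seg-a": "historical-1", "seg-b": "historical-2"}

# Fake per-node data: guests per country for each loaded segment.
NODE_DATA = {
    "historical-1": {"seg-a": {"Germany": 14}},
    "historical-2": {"seg-b": {"Germany": 1, "USA": 9}},
}


def query_node(node, segments):
    """Stand-in for a node answering a query about its own segments."""
    partial = {}
    for segment in segments:
        for country, guests in NODE_DATA[node][segment].items():
            partial[country] = partial.get(country, 0) + guests
    return partial


def broker_query(segment_locations, wanted_segments):
    """Ask each responsible node once, then merge the partial results."""
    per_node = {}
    for segment in wanted_segments:
        per_node.setdefault(segment_locations[segment], []).append(segment)
    merged = {}
    for node, segments in per_node.items():
        for country, guests in query_node(node, segments).items():
            merged[country] = merged.get(country, 0) + guests
    return merged


print(broker_query(SEGMENT_LOCATIONS, ["seg-a", "seg-b"]))  # {'Germany': 15, 'USA': 9}
```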

As you can hardly be expected to remember all this prose, here is a diagram that shows the components and their relations:

Example

In this example, we’ll set up a Druid cluster that ingests RSVP events from Meetup.com – these are published via a very accessible WebSocket API.

Instead of starting each Druid component on its own, we’ll take a shortcut to set up the Imply Analytics Platform mentioned above. That way we won’t gain too much insight about the inner workings, but we’ll get our hands on a working “cluster”. I put cluster in quotes because we will run all processes on a single machine. Deep storage will be simulated by the local file system and the RDBMS is an in-memory Derby.

The data

Let’s take a look at the data first. We receive events of the following type from Meetup:

{
   "venue":{
      "venue_name":"Rosslyn Park FC",
      "lon":-0.226583,
      "lat":51.462914,
      "venue_id":24685117
   },
   "visibility":"public",
   "response":"no",
   "guests":0,
   "member":{
      "member_id":4711,
      "member_name":"Paula"
   },
   "rsvp_id":1624399275,
   "mtime":1471357877547,
   "event":{
      "event_name":"Pre-season training",
      "event_id":"233241355",
      "time":1471457700000,
      "event_url":"http:\/\/www.meetup.com\/Rosslyn-Park-Womens-Rugby\/events\/233241355\/"
   },
   "group":{
      "group_topics":[
         {
            "urlkey":"rugby",
            "topic_name":"Rugby"
         },
         {
            "urlkey":"playing-rugby",
            "topic_name":"Playing Rugby"
         }
      ],
      "group_city":"London",
      "group_country":"gb",
      "group_id":20231882,
      "group_name":"Rosslyn Park Womens Rugby",
      "group_lon":-0.1,
      "group_urlname":"Rosslyn-Park-Womens-Rugby",
      "group_lat":51.52
   }
}

For this example, I decided that the following fields could be of interest:

mtime
- The time of the RSVP in milliseconds since the epoch
response
- The response of the RSVP – “yes” or “no”
guests
- The number of additional guests covered by this RSVP
event_name
- The name of the event
group_name
- The name of the Meetup group
group_city
- The city of the Meetup group
group_country
- The country of the Meetup group
member_name
- The name of the member issuing the RSVP
member_id
- The ID of the member issuing the RSVP
member.other_services.twitter.identifier
- The Twitter handle of the member if available
venue.lat
- Latitude of the venue
venue.lon
- Longitude of the venue

A simple Akka HTTP client connects to the WebSocket (see this Gist) and transforms the data into the flat structure expected by Druid. A single data row looks like this:

{
   "rsvpTime":1471549945000,
   "response":"yes",
   "guests":0,
   "eventName":"Missouri Patriot Paws",
   "eventTime":1472601600000,
   "groupName":"Meet me at the library. No library card needed!",
   "groupCity":"O Fallon",
   "groupCountry":"us",
   "memberName":"Hans Dampf",
   "memberId":4711,
   "twitterName":null,
   "venueName":"St. Charles County Library Middendorf-Kredell Branch",
   "venueLat":38.767715,
   "venueLong":-90.69902
}
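The flattening step itself is straightforward. The client used for this post is written with Akka HTTP; the following is only a rough Python equivalent of the transformation, and the mapping from nested source fields to flat field names is inferred from the two JSON examples above rather than taken from the Gist. The function name `flatten_rsvp` is made up for illustration.

```python
def flatten_rsvp(rsvp):
    """Flatten a nested Meetup RSVP event into a single-level row."""
    venue = rsvp.get("venue") or {}
    member = rsvp.get("member") or {}
    event = rsvp.get("event") or {}
    group = rsvp.get("group") or {}
    # The Twitter handle is optional and nested three levels deep.
    twitter = (member.get("other_services") or {}).get("twitter") or {}
    return {
        "rsvpTime": rsvp.get("mtime"),
        "response": rsvp.get("response"),
        "guests": rsvp.get("guests"),
        "eventName": event.get("event_name"),
        "eventTime": event.get("time"),
        "groupName": group.get("group_name"),
        "groupCity": group.get("group_city"),
        "groupCountry": group.get("group_country"),
        "memberName": member.get("member_name"),
        "memberId": member.get("member_id"),
        "twitterName": twitter.get("identifier"),
        "venueName": venue.get("venue_name"),
        "venueLat": venue.get("lat"),
        "venueLong": venue.get("lon"),
    }


# Abbreviated version of the Meetup event shown earlier.
sample = {
    "venue": {"venue_name": "Rosslyn Park FC", "lon": -0.226583, "lat": 51.462914},
    "response": "no",
    "guests": 0,
    "member": {"member_id": 4711, "member_name": "Paula"},
    "mtime": 1471357877547,
    "event": {"event_name": "Pre-season training", "time": 1471457700000},
    "group": {
        "group_city": "London",
        "group_country": "gb",
        "group_name": "Rosslyn Park Womens Rugby",
    },
}

flat = flatten_rsvp(sample)
print(flat["eventName"], flat["groupCountry"], flat["twitterName"])
```

Missing optional parts such as the venue or the Twitter handle simply end up as null values in the flat row.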

Simple local setup

Before we can send this data to Druid, we need to start it up. The guys at Imply make this really easy for us. Following their quickstart guide for a local setup, we just need to execute these commands:

curl -O https://static.imply.io/release/imply-1.3.0.tar.gz
tar -xzf imply-1.3.0.tar.gz
cd imply-1.3.0
bin/supervise -c conf/supervise/quickstart.conf

Yet before we do that, we need to tell Tranquility about our Meetup datasource. To do this, we edit conf-quickstart/tranquility/server.json to add the following datasource:

{
  "spec": {
    "dataSchema": {
      "dataSource": "meetup",
      "parser": {
        "type": "string",
        "parseSpec": {
          "timestampSpec": {
            "column": "rsvpTime",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "response",
              "eventName",
              "eventTime",
              "groupName",
              "groupCity",
              "groupCountry",
              "memberName",
              "memberId",
              "twitterName",
              "venueName"
            ],
            "dimensionExclusions": [
              "guests",
              "rsvpTime"
            ],
            "spatialDimensions": [
              {
                "dimName": "venueCoordinates",
                "dims": [
                  "venueLat",
                  "venueLong"
                ]
              }
            ]
          },
          "format": "json"
        }
      },
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "hour",
        "queryGranularity": "none"
      },
      "metricsSpec": [
        {
          "type": "count",
          "name": "count"
        },
        {
          "name": "guestsSum",
          "type": "doubleSum",
          "fieldName": "guests"
        },
        {
          "fieldName": "guests",
          "name": "guestsMin",
          "type": "doubleMin"
        },
        {
          "type": "doubleMax",
          "name": "guestsMax",
          "fieldName": "guests"
        }
      ]
    },
    "ioConfig": {
      "type": "realtime"
    },
    "tuningConfig": {
      "type": "realtime",
      "maxRowsInMemory": "100000",
      "intermediatePersistPeriod": "PT10M",
      "windowPeriod": "PT10M"
    }
  },
  "properties": {
    "task.partitions": "1",
    "task.replicants": "1"
  }
}

This spec basically tells Tranquility and Druid which fields in our dataset are event timestamps, dimensions and metrics. We can also combine latitude and longitude into a spatial dimension. In the “granularitySpec” section, we specify that we want our segments to cover an hour and query granularity to be none – this preserves single records and is nice for our example, but probably not what you would do in production. After editing the file, we can start Druid with the steps shown above. Once running, we can ingest data by posting HTTP requests to http://localhost:8200/v1/post/meetup. To check if the meetup ingestion is running, we can open the Overlord console.
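Posting a single flat row to the Tranquility endpoint could look roughly like the following Python sketch, using only the standard library. The URL is the one from the text; the JSON content type and the helper names are assumptions on my part, so check the Tranquility Server documentation before relying on them. The request is only built here and sent when `send` is called.

```python
import json
import urllib.request

TRANQUILITY_URL = "http://localhost:8200/v1/post/meetup"


def build_ingest_request(row, url=TRANQUILITY_URL):
    """Build (but do not send) an HTTP POST carrying one flat row as JSON."""
    body = json.dumps(row).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )


def send(request):
    """Send the request; needs the local quickstart setup to be running."""
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")


row = {
    "rsvpTime": 1471549945000,
    "response": "yes",
    "guests": 0,
    "eventName": "Missouri Patriot Paws",
    "groupCountry": "us",
}
request = build_ingest_request(row)
print(request.get_method(), request.full_url)
# send(request)  # uncomment once Tranquility is listening on port 8200
```

Keep the `windowPeriod` of ten minutes from the spec in mind when posting test data: the `rsvpTime` of a row should be close to the current time, otherwise the row is likely to be dropped.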

Accessing the data

So now we’re ingesting the data, how can we get to it? I am going to show three different ways.

Druid queries

The basic way to get data out of Druid is to run a plain Druid query. This means posting the query in JSON format against the broker node. If for example we would like to get the count of all RSVPs from Germany or the US per hour in a specified period, we’d post the following data to http://localhost:8082/druid/v2/?pretty:

{
  "queryType": "timeseries",
  "dataSource": "meetup",
  "granularity": "hour",
  "descending": "true",
  "filter": {
    "type": "or",
    "fields": [
      { "type": "selector", "dimension": "groupCountry", "value": "de" },
      { "type": "selector", "dimension": "groupCountry", "value": "us" }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "rsvpSum", "fieldName": "count" }
  ],
  "postAggregations": [],
  "intervals": [ "2016-08-14T00:00:00.000/2016-08-20T00:00:00.000" ]
}

This yields the following response (extract):

[
  {
    "timestamp": "2016-08-18T20:00:00.000Z",
    "result": {
      "rsvpSum": 81
    }
  },
  {
    "timestamp": "2016-08-18T19:00:00.000Z",
    "result": {
      "rsvpSum": 249
    }
  },
  {
    "timestamp": "2016-08-17T11:00:00.000Z",
    "result": {
      "rsvpSum": 316
    }
  }
]

The queries that Druid performs best are time series and TopN queries. The documentation gives you an insight about what is possible.
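If you prefer a script over a REST client, the timeseries query shown above can be assembled and posted from Python. This is a sketch using only the standard library; the helper names are invented, and the request is only sent when `run_query` is called against a running broker. The response handling is based on the extract shown above.

```python
import json
import urllib.request

BROKER_URL = "http://localhost:8082/druid/v2/?pretty"


def timeseries_query(countries, interval, granularity="hour"):
    """Build the timeseries query from the text for a list of country codes."""
    return {
        "queryType": "timeseries",
        "dataSource": "meetup",
        "granularity": granularity,
        "descending": "true",
        "filter": {
            "type": "or",
            "fields": [
                {"type": "selector", "dimension": "groupCountry", "value": country}
                for country in countries
            ],
        },
        "aggregations": [{"type": "longSum", "name": "rsvpSum", "fieldName": "count"}],
        "postAggregations": [],
        "intervals": [interval],
    }


def build_query_request(query, url=BROKER_URL):
    """Build (but do not send) the POST request for the broker."""
    return urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(request):
    """Send the request and decode the JSON response; needs a running broker."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def rsvps_per_hour(response_rows):
    """Map each hourly timestamp to its RSVP count."""
    return {row["timestamp"]: row["result"]["rsvpSum"] for row in response_rows}


query = timeseries_query(["de", "us"], "2016-08-14T00:00:00.000/2016-08-20T00:00:00.000")
request = build_query_request(query)

# Two rows from the response extract above, used instead of a live call.
sample_response = [
    {"timestamp": "2016-08-18T20:00:00.000Z", "result": {"rsvpSum": 81}},
    {"timestamp": "2016-08-18T19:00:00.000Z", "result": {"rsvpSum": 249}},
]
print(rsvps_per_hour(sample_response))
```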

Pivot

Imply provides “Pivot” at http://localhost:9095. Pivot is a GUI for analyzing a Druid datasource and is very accessible. You are greeted by something like this:

If we want to see the events with the biggest number of RSVPs in the last week including the split between “yes” and “no”, we certainly can do that:

We can also look at the raw data:

Playing around with Pivot to get a feeling for the tool and your data is certainly fun, and it works like this out of the box. Pivot is based on “Plywood”, a JavaScript library that is also part of Imply and acts as an integration layer between Druid data and visualization frontends.

PlyQL

Another part of Imply is “PlyQL”. As you can imagine from the name, it aims to provide SQL-like access to the data. For our Meetup data source, we start by looking at the set of tables that we can query:

bin/plyql --host localhost:8082 -q 'SHOW TABLES'
┌────────────────────┐
│ Tables_in_database │
├────────────────────┤
│ COLUMNS            │
│ SCHEMATA           │
│ TABLES             │
│ meetup             │
└────────────────────┘

Describing the table “meetup” gives the following overview:

bin/plyql --host localhost:8082 -q 'DESCRIBE meetup'
┌──────────────────┬────────┬──────┬─────┬─────────┬───────┐
│ Field            │ Type   │ Null │ Key │ Default │ Extra │
├──────────────────┼────────┼──────┼─────┼─────────┼───────┤
│ __time           │ TIME   │ YES  │     │         │       │
│ count            │ NUMBER │ YES  │     │         │       │
│ eventName        │ STRING │ YES  │     │         │       │
│ eventTime        │ STRING │ YES  │     │         │       │
│ groupCity        │ STRING │ YES  │     │         │       │
│ groupCountry     │ STRING │ YES  │     │         │       │
│ groupName        │ STRING │ YES  │     │         │       │
│ guestsMax        │ NUMBER │ YES  │     │         │       │
│ guestsMin        │ NUMBER │ YES  │     │         │       │
│ guestsSum        │ NUMBER │ YES  │     │         │       │
│ memberId         │ STRING │ YES  │     │         │       │
│ memberName       │ STRING │ YES  │     │         │       │
│ response         │ STRING │ YES  │     │         │       │
│ twitterName      │ STRING │ YES  │     │         │       │
│ venueCoordinates │ STRING │ YES  │     │         │       │
│ venueName        │ STRING │ YES  │     │         │       │
└──────────────────┴────────┴──────┴─────┴─────────┴───────┘

Finding the five events where the positive RSVPs have the highest average of additional guests is possible using this query:

bin/plyql --host localhost:8082 -q \
'SELECT eventName,  avg(guestsSum) as avgGuests \
 FROM meetup \
 WHERE "2015-09-12T00:00:00" <= __time \
   AND __time < "2016-09-13T00:00:00" \
   AND response = "yes" \
 GROUP BY eventName \
 ORDER BY avgGuests DESC \
 LIMIT 5'
┌────────────────────────────────────────────────────────────────────────────────────────────────┬────────────────────┐
│ eventName                                                                                      │ guests             │
├────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────┤
│ 8/19 NEC Business & Social Networking Meetup at the JW Marriott Hotel                          │ 69                 │
│ Southwest Suburbs Sports, Outdoors, & Social Group Fall Picnic                                 │ 63                 │
│ Chicago Caming, Canoeing, and Outdoors Adventure Group Fall Picnic                             │ 60                 │
│ Saturday, September 10th 2016 Dance @ Dance New York Studio!                                   │ 49.5               │
│ Mingle & 90s/00s Piccadilly Party with 1 x FREE FOOD!! & Happy Hour until 9pm                  │ 39                 │
│ Calpe Beach and Guadalest Castle (Option to climb Penyon de Ifach)                             │ 30                 │
└────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────┘

We could also use the -o switch to get the output of our queries as JSON.

Summary

This concludes our quick walkthrough of Druid. We talked about the basic concepts of Druid, ran an ingestion of realtime Meetup data and looked at ways to access the data with plain Druid and the very interesting Imply Analytics Platform. Druid is a very promising piece of technology that warrants an evaluation if you’re trying to run OLAP queries on realtime fast data.

For further reading, I suggest:

Blog author

Florian Troßbach