The SMACK stack – hands on!

1.5.2016 | 9 minutes reading time

The SMACK stack is all the rage these days. Instead of just talking about it, this post is going to guide you through the steps for setting up a simple SMACK stack that will enable you to get a hands on experience with the tools.
In the first step, we will setup a DC/OS-managed Mesos datacenter on AWS. This is going to provide the basic infrastructure on which we will then install Kafka, Cassandra and Spark. Then we are going to deploy simple ingestion and digestion jobs that pump data right across the toolchain.

Step 0: Prerequisites

If you want to try the examples yourself, you’ll need the following:

Vagrant and Virtualbox
Ansible
An AWS Account. The AWS resources used in this example exceed the free tier, regular AWS fees apply. You have been warned!
An EC2 key pair called dcos-intro
A Twitter access key

The source code is available at https://github.com/ftrossbach/intro-to-dcos .

Step 1: sMack stands for Mesos

DC/OS is Mesosphere’s distributed operating system on top of Apache Mesos. Version 1.7 has just been released as open source . DC/OS provides AWS CloudFormation templates to set up the datacenter. The following diagram gives a high level overview of the resources that are created for the example in this blog post:

The CloudFormation template creates these AWS resources

In the public subnet that is reachable from the internet, we can see a master node and a public Mesos slave. One master is sufficient as there is no need for high availability (HA). DC/OS also provides templates that provision three masters and thus provide HA capability. The private subnet is only accessible from the public subnet. It contains 5 mesos slaves that will run any application that does not need exposure to the outside world. If you want to know more, you can have a look at my colleague Bernd Zuther’s Github repository . Bernd gives a more detailed overview of DC/OS. He also transformed the CloudFormation template into a Terraform template that is more easily readable.

To get the stack running, we need to check out the project at Github . We have also to provide AWS credentials for AWS in form of the environment variables:

DCOS_INTRO_AWS_REGION
DCOS_INTRO_AWS_ACCESS_KEY
DCOS_INTRO_AWS_SECRET_KEY

Calling vagrant up then sets up a virtual Ubuntu that will serve as our gateway to DC/OS. After the machine is created, the Ansible CloudFormation task takes the template and creates the Stack in AWS. This is happening synchronously, so it will take a couple of minutes. After Ansible is done, we can take a look at our brand new instances:

The instances that have been created in AWS as shown in the AWS console

Ansible also prints the URL of the master load balancer:

TASK [dcos-cli : This is the URL of your master ELB] ****************
ok: [default] => {
    "msg": "http://dcos-test-ElasticL-***.elb.amazonaws.com"
}

The most important URLs for accessing your datacenter are:

DC/OS dashboard: http:///
The task scheduling framework Marathon: http:///marathon
Plain Mesos: http:///mesos

Let’s take a look at our brand new DC/OS dashboard:

The DC/OS dashboard that is available at http://

As we have not yet deployed any applications, all resources are still up for grabs.

Now that we have Mesos running in form of DC/OS, we are going to install the remaining SMACK infrastructure components.

Step 2: smaCk stands for Cassandra

Ansible already installed the DC/OS CLI in our Vagrant box. DC/OS CLI is a command line interface to manage applications in the datacenter. From within the Vagrant box (use vagrant ssh), the Cassandra Mesos framework can be installed by issuing one simple shell command:

dcos package install cassandra

This will start three Cassandra nodes with some defaults regarding memory and disk requirements. Configuration options are described at https://dcos.io/docs/1.7/usage/tutorials/cassandra/ .

If you look at the DC/OS dashboard, you might see that Cassandra reports as unhealthy. It should turn to healthy once all Cassandra nodes are up and running. You can also now see that some resources have been allocated in DC/OS.

Setting up a Cassandra keyspace is unfortunately not yet possible via the DC/OS CLI. We need to log onto one of our master nodes using SSH and the private key of key pair “dcos-intro”:

ssh core@<mesos_master_ip> -i dcos-intro.pem

We use docker to start a cqlsh Cassandra command line:

docker run -ti cassandra:2.2.5 cqlsh node-0.cassandra.mesos

Once in cqlsh, we can create the required keyspace and table:

CREATE KEYSPACE IF NOT EXISTS dcos WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '3'};
CREATE TABLE IF NOT EXISTS dcos.tweets(date timestamp, text text, PRIMARY KEY (date, text));

Yes, this is going to be yet another Twitter-based example.

Step 3: smacK stands for Kafka

Similarly to Cassandra, we are also going to use command line tools to set up Kafka.
Issuing the command

dcos package install --package-version=0.9.4.0 kafka

installs the DC/OS Kafka service. We are using version 0.9.4.0 of this service because the most recent version that was released with DC/OS 1.7 caused some incompatibilities with the frameworks used in the example.

Compared to Cassandra, this the installation of the framework will not lead to an automatic startup of Kafka brokers. We are going to use the DC/OS Kafka command line to add and start Kafka brokers:

dcos kafka broker add 0..2 --port 9092

This command adds three brokers – broker-0, broker-1 and broker-2 – that are supposed to run on port 9092. Further configuration options are available at https://docs.mesosphere.com/usage/services/kafka/.
Adding brokers is not the same as starting them, issuing

dcos kafka broker start 0..2

does that job.

We also use the CLI to create a Kafka topic very creatively called “topic” with ten partitions and a replication factor of three:

dcos kafka topic add topic --partitions 10 --replicas 3

Step 4: Smack stands for Spark

You can probably guess by now how to install Spark, but here it is anyway:

dcos package install spark

This sets up Spark on Mesos in so called cluster mode . Spark jobs will be submitted to a MesosClusterDispatcher that will distribute Spark jobs on demand to available cluster resources. There are no dedicated Spark executors!

This completes the installation of the SMACK infrastructure. A look at the Mesos Web UI reveals all running tasks:

The Mesos Web UI at http:///mesos shows the running tasks

The example application

By now we have all the infrastructure available to run a SMACK application, so let’s just do that.
The example application streams tweets via Kafka and Spark into Cassandra. It is logically divided into two parts:

data ingestion
data digestion

We’ll see more about how these parts are set up in the following sections.

Step 5: smAck stands for Akka – data ingestion

The application reads tweets using the twitter4j library and pushes them onto Kafka using Akka reactive streams and the reactive kafka extensions. The code is available in the Github repository and we will not go into detail on this as it is a straightforward Akka stream implementation. Much more interesting is the deployment aspect. The application is packaged as a docker container and deployed to a public docker registry at https://hub.docker.com/r/ftrossbach/dcos-intro/ . To deploy the app in our datacenter, we use DC/OS’ scheduler Marathon and the DC/OS CLI. A new deployment in Marathon is defined in JSON. For the ingestion, we use the following configuration:

{
 "id": "ingestion",
 "container":{
   "docker":{
     "forcePullImage":true,
     "image":"ftrossbach/dcos-intro",
     "network":"BRIDGE"
     "privileged":false
   },
   "type":"DOCKER"
 },
 "cpus":1,
 "mem":2048,
 "env":{
   "KAFKA_HOSTS":"broker-0.kafka.mesos",
   "KAFKA_PORT":"9092",
   "twitter4j.debug":"true",
   "twitter4j.oauth.accessToken":"****",
   "twitter4j.oauth.accessTokenSecret":"****",
   "twitter4j.oauth.consumerKey":"****",
   "twitter4j.oauth.consumerSecret":"****"
 }
}

In this file, we specify the Docker container we want to deploy, some variables like the number of CPUs or memory, and most importantly some environment variables for the container.
We trigger the deployment by issuing

dcos marathon app add /vagrant/provisioning/ingestion.json

– you will need to include valid Twitter credentials in that file.

If you want to check the status of the application, you can take a look into Marathon at http:///marathon. The plain Mesos UI at http:///mesos also has this information. You have access to log files in both UIs.

The Marathon page for the ingestion application is reachable at http:///marathon/ui/#/apps/ingestion

If the application reports as “running”, you’re fine. If it is stuck in a deploy-wait-cycle, check stderr.

Step 6: Smack stands for Spark (yet again…) – data digestion

The final step is running the data digestion. In Kafka, we have a distributed high performance message broker, but we need to do something with the data as Kafka itself does not offer any analytic capabilities whatsoever. We use a Spark streaming job that is also included in the example (de.codecentric.dcos_intro.spark.SparkJob). To keep things uncomplicated, this example does not really make use of the true power of Spark but plainly writes the tweets it reads from Kafka into a Cassandra table.

To deploy a Spark job using the DC/OS CLI, the jar must be available in a place where DC/OS can access it (https://dcos.io/docs/1.7/usage/tutorials/spark/ ). A public file on AWS S3 will do the trick for us. Once uploaded, the job can be triggered by issuing

dcos spark run --submit-args=' \ 
--class de.codecentric.dcos_intro.spark.SparkJob \
https://s3.eu-central-1.amazonaws.com/dcos-intro/dcos-intro-assembly-1.0.jar \ 
topic node-0.cassandra.mesos 9042 \ 
broker-0.kafka.mesos:9092'

The Spark Mesos dispatcher has a WebUI that lets us monitor the job status. It is accessible at
http:///service/spark.

Once the job is running, we can log into Cassandra using the same docker command as above when we created the keyspace and query the tweets table (extract):

This shows the results of a Cassandra query

Conclusion and outlook

Congratulations, you got the SMACK stack up and running and deployed an application.
This post gave you basic steps to setup the stack on AWS and run simple ingestion and digestion jobs. This is great to explore the features of DC/OS and how it enables you to run a fast data application, but of course there are many things left out – our example application is neither fast data nor big data.

If you want to set up the stack productively, you might also want to:

check out the brand new DC/OS homepage
read up on Marathon
explore service discovery via Mesos DNS
think about the number of Kafka and Cassandra nodes as well as their memory requirements
restrict access to DC/OS – with this template, the master nodes are accessible to everyone on the internet
explore the options of the DC/OS CLI (dcos --help from within the Vagrant box)
have a look of other official DC/OS packages in the Mesosphere universe

In order to clean up, you can delete the Stack on AWS in the CloudFormation console. This will destroy all created resources.

Was this post helpful?

Blog author

Florian Troßbach

Do you still have questions? Just send me a message.

fromFlorian Troßbach

Living on the edge: building serverless applications with Cloudflare Workers

Cloudflare is best known for its CDN, DNS server (1.1.1.1) or WAF/DDos mitigation services. These services are highly predicated on “Edge Computing”, bringing data closer to the user interested in those services – a user in Australia will be happier ...

Cloud native
Cloud
Serverless

28.11.2024 | 12 minutes reading time

Florian Troßbach

Validating Topic Configurations in Apache Kafka

Messages in Apache Kafka are appended to (partitions of) a topic. Topics have a partition count, a replication factor and various other configuration values. Why do those matter and what could possibly go wrong? Why does Kafka topic configuration matter...

Messaging
Big Data

7.12.2017 | 8 minutes reading time

Florian Troßbach

Building a distributed Runtime for Interactive Queries in Apache Kafka...

Interactive Queries are a fairly new feature of Apache Kafka Streams that provides programmatic access to the internal state held by a streaming application. However, the Kafka API only provides access to the state that is held locally by an instance...

Messaging
Java

20.3.2017 | 9 minutes reading time

Florian Troßbach

Interactive Queries in Apache Kafka Streams

"Databases? Where we're going we don't need databases" – Doc Brown, 1985 Well, we’re certainly not there yet, but this article is going to introduce you to a new feature of the popular streaming platform Apache Kafka that can make a dedicated external...

Messaging
Streaming

13.3.2017 | 10 minutes reading time

Florian Troßbach

Crossing the Streams – Joins in Apache Kafka

Version 0.10.0 of the popular distributed streaming platform Apache Kafka saw the introduction of Kafka Streams. In its initial release, the Streams-API enabled stateful and stateless Kafka-to-Kafka message processing using concepts such as map, flatMap...

Messaging
Big Data
Streaming

15.2.2017 | 14 minutes reading time

Florian Troßbach

Realtime Fast Data Analytics with Druid

I have been working with the SMACK stack for a while now and it is great fun from a developer’s point of view. Kafka is a very robust data buffer, Spark is great at streaming all that buffered data and Cassandra is really fast at writing and retrieving...

18.8.2016 | 13 minutes reading time

Florian Troßbach

First steps with Java 9 and Project Jigsaw – Part 2

This is part 2 of a series that aims to get you started with project Jigsaw. In part 1 , we briefly talked about the definition of a module and how the Java Runtime was modularized. We then proceeded to a simple example that demonstrated how to (and ...

Java

1.12.2015 | 12 minutes reading time

Florian Troßbach

First steps with Java 9 and Project Jigsaw – Part 1

Eight years after its inception, Project Jigsaw – the modularization of the Java platform and introduction of a general module system – is on track to be included in Java 9. The target release has changed over the years from Java 7 via Java 8 to Java...

Java

24.11.2015 | 11 minutes reading time

Florian Troßbach

Your job at codecentric?

Jobs

Agile Developer und Consultant (w/d/m)

Alle Standorte

The SMACK stack – hands on!

Step 0: Prerequisites

Step 1: sMack stands for Mesos

Step 2: smaCk stands for Cassandra

Step 3: smacK stands for Kafka

Step 4: Smack stands for Spark

The example application

Step 5: smAck stands for Akka – data ingestion

Step 6: Smack stands for Spark (yet again…) – data digestion

Conclusion and outlook

Was this post helpful?

Blog author

More articles

Living on the edge: building serverless applications with Cloudflare Workers

Validating Topic Configurations in Apache Kafka

Building a distributed Runtime for Interactive Queries in Apache Kafka...

Interactive Queries in Apache Kafka Streams

Crossing the Streams – Joins in Apache Kafka

Realtime Fast Data Analytics with Druid

First steps with Java 9 and Project Jigsaw – Part 2

First steps with Java 9 and Project Jigsaw – Part 1

Your job at codecentric?

Agile Developer und Consultant (w/d/m)