SMACK stack from the trenches

19.1.2017 | 12 minutes reading time

This is going to be a sum-up of the experience gathered on various projects done with the SMACK stack. For details about the SMACK stack you might want to take a look at the following blog – The SMACK Stack – Hands on .
Apache Spark – the S in SMACK – is used for analysis of data – real time data streaming into the system or already stored data in batches.
Apache Mesos – the M in SMACK – is the foundation of the stack. All of the applications do run on it. In our cases we’ve been using Mesospheres DC/OS on top of Apache Mesos for the installation and administration of the stack and our own applications.
Lightbend’s Akka – the A in SMACK – is used for fast data stream processing. In most use cases it’s been either used for fast ingestion of data or for fast extraction through in-stream-processing.
Apache Cassandra – the C in SMACK – is a fast write and read storage for the fast-data-processing platform.
Apache Kafka – the K in SMACK – is the intermediate storage for streaming data. It helps to decouple in and out application logic while still being fast enough to add no overhead of time on the stream of data.

In those projects the architecture has looked roughly like this

The ingestion, implemented with Akka, is kind of like Enterprise Integration on steroids. Instead of having a lot of different connectors you’ll end up with just a few entry points, but doing so in a very, very fast way. For each of the projects it had been a requirement to have fast input and storage of the data as well as having that data visible in near real-time. That’s where Akka comes into play again – either being connected to Kafka for real-time streaming of data via Websockets or as connector for Cassandra.
All of the scenarios running this stack have been build on AWS as cloud provider. This made it especially easy to setup and tear down the stack for development.

DC/OS – Apache Mesos

Apache Mesos in combination with Mesospheres DC/OS is the foundation for working with the SMACK stack. The Mesos
kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with API’s for resource management and scheduling across entire datacenter and cloud environments.
On top of Mesos, Mesosphere’s DC/OS will give you a command line interface and a nice integration with Mesosphere’s Marathon. Especially the first one in combination with Ansible can be used to automate the setup of a whole cluster.
The following shell command installs the Cassandra framework used by DC/OS.

1dcos package install cassandra

In combination with the dcos command dcos package list it’s possible to verify the success of the installation.

While Apache Mesos is used as “Kernel”, Marathon is used as “init.d” on top of it. Marathon makes sure deployed applications are running successfully. In case of a failure those applications are restarted by Marathon. While Marathon takes care of long running tasks, Mesosphere’s Metronome is in charge of short running cron-like tasks.
Marathon and Metronome are so called Mesos frameworks. A framework takes care of reserving resources from Mesos, schedules the application to be launched and sometimes monitors those applications.

Apache Cassandra

The nice thing about DC/OS is that it is very easy to run Apache Cassandra via a specialized Mesos framework. This framework not only helps in installing a Cassandra cluster on top of Mesos, it also makes sure of handling recovery of failed instances. The configuration and sizing of the Cassandra is crucial for having a high performance fast data platform based on this SMACK stack.

Storage

For performance reasons it’s best to start with a SSD hard drive, but also EBS volumes can be used. Using EBS we never experienced any shortcomings, though it seemed we had an increase in write queues. This usually happens if commit log and SSTables are written to the same storage. In cases like these it’s crucial to have a fast connected EBS at hand.

CPU

In short, more CPU will give you more throughput. As Cassandra uses different thread pools for write and read paths, an increase in number of CPUs helps tremendously, as those thread-pools have more dedicated CPUs.

Memory

As Cassandra had commodity hardware in mind when being designed, a lot of heap is of no use for a cassandra instance. A lot of RAM is rather useful for the underlying system itself, as the page cache will consume all the memory not used by other applications. Therefore configuring Cassandra with 8GB of RAM as heap plus 24GB for the underlying system for page caches is sufficient. It’s rather crucial to make sure the new generation heap is configured properly. As recommended in the Cassandra documentation the new generation of the heap should be set to 1400MB. Which is equivalent to 100MB times the number of CPUs. The rule of thumb is to either use 100MB times number of CPUs or a quarter of the maximum Heap, where the lesser number needs to be used.
As the system is run with a Java 8 runtime, garbage collection can be set to the GC1 garbage collector.

Apache Kafka

Like Apache Cassandra and other Big-Data systems, Kafka has also been designed with commodity hardware in mind. Therefore around 5GB of Heap for the Kafka process is enough. Again here it’s more important to have enough RAM for the hard drive caches as for Kafka heap usage. Regarding storage for Kafka the same principle applies. SSD should be favored but a fast connected EBS storage is sufficient. Kafka and its thread pools also profit greatly by a higher number of CPUs.
As Cassandra and Kafka in a production use-case are consuming rather lot of cpu and harddrive, one should consider to run those frameworks on dedicated machines. This does break the rule of “every framework is treated equal”, but especially the page caching mechanism only works if those applications are the only user of all resources.

Apache Spark

Apache Spark is already optimized to run on Apache Mesos. So after the easiness of installing it via a dcos package install spark, spark is ready to be used. Spark comes with a default Mesos scheduler, the MesosClusterDispatcher also known as Spark Master. All spark jobs will register themselves with the master and in turn will also be a Mesos framework. This driver is negotiating with Mesos about the required resources. From there on it takes care about the executors for Spark. Within this scenario of using DC/OS the executors are docker images with a DC/OS optimized configuration. They already contain configurations for access of a HDFS inside the DC/OS cluster.

Metrics

Spark metrics are nice while being watched in real time, but it would be nice to have the metrics available all the time, especially since after the death of the driver this data is gone. One way is to use the Spark history server. The history server is nice, but requires an HDFS to be available. This alone isn’t much of a downside but the requirements on running HDFS on DC/OS is rather high. At least 5 instances are required with lots of hard-drive. This just for taking a look at the Spark metrics is rather expensive. Therefore a good possibility is to use ELK (ElasticSearch, Logstash and Kibana) for monitoring. But how do we get to the logs of Spark. In a “regular” environment, you’ll usually just add some logging details to the executors, but as our executors are started by mesos and managed by the Spark Driver this needs some extra tweaking in the DC/OS world.

To enable the spark metrics a configuration is needed as the following:

1# Enable Slf4jSink for all instances by class name
2*.sink.slf4j.class=org.apache.spark.metrics.sink.Slf4jSink
3 
4# Polling period for Slf4JSink
5*.sink.slf4j.period=1
6 
7*.sink.slf4j.unit=seconds

With these settings Spark logs all metrics available to the std logging mechanism. The tricky part is to actually have those settings enabled inside the docker image provided by DC/OS. Actually this isn’t possible, so we built a custom Docker image already containing these settings.
The Dockerfile for such a preconfigured Docker image can be seen below, it’s not much magic.

1FROM  mesosphere/spark:1.0.2-2.0.0
2 
3ADD ./conf/metrics.properties /opt/spark/dist/conf/

External Storage

As you’ve seen in the previous section, enabling a HDFS system can be quite costly in terms of storage. In this scenario an extra HDFS system isn’t required, therefore storing data in S3 is a quick win. As with the metrics, DC/OS’ own spark image is optimized for HDFS and therefore needs some tweaking for accessing S3.
First, it is crucial to have those S3 accessing libraries available in your docker image. Second, you’ll need to make sure those newly available libraries are present in the configuration of your executor.

spark-defaults.conf content:

1spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
2spark.executor.extraClassPath /opt/spark/dist/extralibs/aws-java-sdk-1.7.4.jar:/opt/spark/dist/extralibs/hadoop-aws-2.7.2.jar:/opt/spark/dist/extralibs/joda-time-2.9.jar
3spark.driver.extraClassPath /opt/spark/dist/extralibs/aws-java-sdk-1.7.4.jar:/opt/spark/dist/extralibs/hadoop-aws-2.7.2.jar:/opt/spark/dist/extralibs/joda-time-2.9.jar

In the configuration above, there is an additional section setting the filesystem type to S3A which is needed for a faster access to it. The Apache Hadoop driver supports two different ways of accessing S3, one the default S3 is a block based access on the S3, while the S3N or S3A use object based access on S3. Hadoop now only supports S3A as it’s the successor of S3N (native). The extraClassPath entries are needed for the driver and also for the executor. An additional step of creating your own Docker image is to make sure those libraries are available.

Using a custom Docker spark image

When submitting Spark jobs via DC/OS usually you’ll issue a command as the following:

1dcos spark run --submit-args='--driver-cores 0.1 --driver-memory 1024M --class org.apache.spark.examples.SparkPi https://downloads.mesosphere.com/spark/assets/spark-examples_2.10-1.4.0-SNAPSHOT.jar 10000000'

Now if you want to use your own custom docker image you need to adapt the command to look like the following.

dcos spark run –submit-args =’…’ –docker-image=my-own-docker/spark-driver

In that case you need to make sure this docker image is accessible to your Mesos agents.
Because installing those Spark job can be cumbersome, we used Metronome for an easy installation of those Spark jobs.

Throughput

After all those technical implications let’s take a look at what can be achieved with such a SMACK platform. Our requirements contained the ability to process approximately 130 thousand messages per second. Those messages needed to be stored in a Cassandra and also be accessible via a frontend for real time visualization.
This scenario is build on the basic architecture. Akka Streams are used for ingestion and also for the real-time visualization. The ingestion does some minor protocol transformation and publishes the data to a Kafka sink. While 4 nodes are capable of handling these 130k msg/s, 8 ingestion instances are needed to keep up with 520k msg/s.
A high throughput in the ingestion is nice, but how much delay does it produce? For example: are those messages queuing somewhere internally Is the back-pressure to high? While being able to handle the throughput of 520k msg/s the delay has an average of 250ms of latency between message creation and storage and Kafka. If the network and ntpd fuzziness is taken into account, the delay produced by the ingest itself is comparable to be nonexistent.
So the ingestion and therefore the visualisation consumer are capable of handling the data in real-time. How does storing the data in Cassandra compare to this?
The easiest use-case for Apache Spark in this scenario is to stream the incoming data into the Apache Cassandra database. While Kafka topics are perfectly suited to be consumed in a streaming way, Cassandra isn’t capable of doing streamed inserts. That’s one of the reasons the largest amount of time is consumed by connecting and communicating with Cassandra. Therefore it’s better to have a batch interval of about 10 seconds. The base-line for connection and minimum time to just get going is about 1 to 2 seconds depending on the underlying hardware and amount of CPUs allocated for the executors. The 10 second window frame also helps handling possible but not envied downtimes of the Spark Job. With this time-buffer it’s possible to handle a longer downtime. Another reason to keep the 10 second window, data might happen to be hold back in the pipeline before consuming in Spark. This will also address this issue of a peak-load.
So let’s talk numbers: a Spark job storing data into Cassandra batched for 10 seconds takes about 4 seconds for 1.3 Million events within the batch window. To store the 5.2 Million events inside the batch window, Spark takes 8 seconds to store those events in Cassandra. As Cassandra is a rather large stakeholder when streaming data from Kafka to Cassandra via Spark, it also needs to scale horizontally with the amount of data. The following table gives a brief overview of amount of messages, average latency per batch and amount of Cassandra nodes needed to handle the amount of messages per second.

msg/s	Time needed Batched 10s	Cassandra Nodes
130k msg/s	4s	5
260k msg/s	6s	10
520k msg/s	8s	15

When measuring the throughput of simple events of approximately 60 byte size, the initially upper limit was reached by around 500k msg/s. It turned out to be a network capacity limit of the selected AWS box. This limit of 20MB/s correlated nicely with the approximately 500k msg/s and their corresponding byte size. When selecting another type of box with better network settings as entrypoint for the incoming data, processing 520k msg/s was easily achieved.

Conclusion

Sometimes working with bleeding-edge software is more like grabbing into falling daggers, on the other hand it’s great to see it turn into an extremely powerful and capable stack. With time progressing, the Mesosphere DC/OS provided functionality turned more stable and more sophisticated. The out-of-the-Box frameworks for the SMACK stack play nicely, and due to those frameworks the stack is resilient concerning failures. A “misbehaving” app will be stopped by the framework while being restarted in the same minute. This failsafe and failure-tolerant behavior gives great confidence in the running cluster as it rarely needs operation to be in charge.
It turned out to be obvious, but this stack scales linearly regarding performance, throughput and used hardware. But not only does it scale up, it also helps to start with a smaller stack. Especially using Apache Mesos as foundation helps to get the most out of your provided resources. This is especially useful when still developing the stack. In this case Apache Cassandra and Apache Kafka may be on the same node, as throughput isn’t the main goal, yet.
While Monitoring is needed it’s still not provided out of the box, a custom solution is needed. For monitoring purposes it’s still required to have your custom solution, with DC/OS 1.10 you’ll most likely will have a solution for metrics . DC/OS provided Frameworks place a collecting module next to the application, collecting the metrics of that process. Those metrics are locally collected and sent to a central server based on Apache Kafka. This will be an exciting new functionality for measuring metrics of your cluster in future.

Was this post helpful?

Blog author

Achim Nierbeck

Do you still have questions? Just send me a message.

fromAchim Nierbeck

Solution Factory – How to get from idea to product in 9 weeks

Digitization has been revolutionizing each and every business out there for the past few decades. It has surely a lot to offer in your business domain as well: a new customer portal to improve users’ satisfaction and help you reach out to a whole new...

Startup
Agile
AWS
Cloud
CI/CD
Software development
Agile methods

21.7.2019 | 9 minutes reading time

Mahdi Ebrahimi

Achim Nierbeck

SMACK Stack DC/OS Style

In the world of Internet of things (IoT) you work with a continuous flow of data. For this you have two options at hand, the first is to do batch processing long after the data is collected. The other option is to analyse the data while it is being collected...

31.7.2016 | 6 minutes reading time

Achim Nierbeck

IoT Analytics Platform

The Internet of Things a.k.a. the next industrial revolution is the current hype, but what kinds of challenges do we face with the consumption of big amounts of data? One variant is to collect all the data and do post processing in batches. However, ...

Cloud
IoT
NoSQL
Scala
Big Data

13.7.2016 | 15 minutes reading time

Achim Nierbeck

Combining Apache Cassandra with Apache Karaf

Getting the best of Apache Cassandra inside Apache Karaf: this blog post will describe how easy it was to embed the NoSQL database inside the runtime. This can be helpful while developing OSGi-related applications with Karaf that work together with Cassandra...

NoSQL
Container

19.12.2014 | 9 minutes reading time

Achim Nierbeck

Your job at codecentric?

Jobs

Agile Developer und Consultant (w/d/m)

Alle Standorte

Becoming a Data-Driven Company with Applied Data Products

In recent years, the hype surrounding the value of data has grown continuously, and a multitude of concepts and methods have emerged on how companies can become 'data-driven'. From strategic top management to detail-oriented data analysts attempts are...

Agile
Big Data
Data
Product management
Digitalization
Data Science
Business Intelligence

18.5.2024 | 9 minutes reading time

Dr. Florian Rademacher

An introduction to federated learning in an industrial context: Advanced

In the Machine Learning space, it was long believed that sharing learnings or weights was safe in the sense that the input data couldn't be extracted. However, this belief has been challenged by researchers coming out over the years. Nowadays, numerous...

Machine Learning
Big Data
Data Science
Data

18.9.2023 | 9 minutes reading time

An introduction to federated learning in an industrial context: Fundamentals

With the help of data, companies are able to make more informed decisions, optimize their workflows and gain an edge in the competitive world of business using the power of Machine Learning (ML). However, handling data has become increasingly difficult...

Machine Learning
Data Science
Data
Big Data

25.8.2023 | 8 minutes reading time

The universal recommender in Action(ML)

IntroductionRecommender systems have become crucial for many different businesses. E-commerce uses recommenders to guide their customers in finding the right products and to assure they stay on the site. Newspapers or entertainment websites want to keep...

AI
NoSQL
Data
Machine Learning
Python

18.4.2021 | 11 minutes reading time

Francesca Diana

Thinking AI means re-thinking data

While doing AI is sexy and cool, data infrastructure is typically not considered any of this. However, production-grade machine learning applications heavily rely on proper data infrastructure. Hence, in order to generate actual business value, solid...

AI
Big Data
Data
Machine Learning

27.5.2020 | 7 minutes reading time

Marcel Mikl

Kick-start your microservice project with JHipster

I recently looked for a solution on how to prototype a customer project in a short time and came across JHipster. The target architecture used Spring Boot in the backend and an Angular frontend. JHipster can scaffold this in its simplest variant as...

Node.js
Angular
Software development
Container
NoSQL
Cloud
JavaScript
Java
Keycloak
Kubernetes
Microservices
IT-Security
Open Source
React
Spring

12.5.2020 | 13 minutes reading time

Jörg Riegel

Golang, Gin & MongoDB – Building microservices easily

Golang, a.k.a. Go, has been around in the industry for quite some time now, but people are still reluctant to just go ahead and use it. To help you get started, follow me on this journey and create your first microservice using Golang, Gin and Docker...

Cloud
Container
Go
Microservices
NoSQL

21.4.2020 | 10 minutes reading time

From PDF data sheets to shared understanding with serverless SHACL

Knowledge contained in PDF filesWhen crawling the web for information about products of a specific category, may it be instances of industrial machine parts, chemical components, or even household goods, manufacturers of such goods often provide the ...

NoSQL
AWS
Big Data
Data
API
Microservices
Python
Serverless
Webdevelopment

1.4.2020 | 12 minutes reading time

Decrypt TLS traffic to Kafka using Wireshark

Usually, debugging issues related to TLS in a Java application involves setting the debug flag -Djavax.net.debug=ALL. Unfortunately, the logs will then be flooded with debug entries, which makes troubleshooting the TLS communication difficult. Moreover...

Software development
Messaging
IT-Security

30.1.2020 | 4 minutes reading time

Using Apache PLC4X and ElasticSearch for IIoT monitoring and anomaly detection

Industrial IoT (IIoT) as a buzzword gained traction within recent years. However, implementing common use cases like real-time monitoring of PLCs may involve a huge amount of money and effort. For example, current approaches implementing such a monitoring...

NoSQL
IoT
IIoT

7.10.2019 | 6 minutes reading time

Thoughts after completing the Coursera “Data Engineering, Big Data, and...

Having worked with Google Cloud Platform’s Big Data Services for almost a year, I wanted to have a broader view on GCP’s capabilities. In this post, I will give you an overview of the services touched by the Coursera specialization . I have been a GCP...

Cloud
Computer Vision
Data
Machine Learning
Big Data
Google Cloud
Serverless

8.9.2019 | 6 minutes reading time

Niklas Haas

Hands-on Spark intro: Cross Join customers and products with business ...

In this blog post, I want to share my aha moments with you I had during the development of my first (Py) Spark application.We do this by an example application:Read customers, products, and transactions.Create the cross join of customers and products...

Python
Big Data
Data

5.8.2019 | 20 minutes reading time

Niklas Haas

Move n-gram extraction into your Keras model!

Move n-gram extraction into your Keras model!In a project on large-scale text classification, a colleague of mine significantly raised the accuracy of our Keras model by feeding it with bigrams and trigrams instead of single characters. For his experiments...

AI
NLP
Big Data
Python
Data

18.7.2019 | 7 minutes reading time

Window Functions in Stream Analytics

Introduction to Stream AnalyticsWhy should we talk about stream analytics? In the past decades data analytics was dominated by batch processing. Records from transactional databases were copied into analytical databases by regular extract-transform-load...

Big Data
Data
Streaming

11.10.2018 | 11 minutes reading time

How to write a Kotlin DSL – e.g. for Apache Kafka

The Kotlin language is gaining more and more attention and is being used in an increasing number of projects. One thing that Kotlin can be used for is implementing special domain-specific-languages (DSLs). The Wikipedia entry on DSL states:A domain...

Messaging
DSL
Kotlin

23.6.2018 | 10 minutes reading time

Cloud Launcher for MongoDB in the Google Compute Engine

In this post you will learn how to use Google’s Cloud Launcher to set up instances for a MongoDB replica set in the Google Compute Engine.Replication in MongoDBA minimal MongoDB replica set consists of two data bearing nodes and one so-called arbiter...

Cloud
Infrastructure as Code
Google
NoSQL

5.3.2018 | 3 minutes reading time

Tobias Trelle

ETL with Kafka

“ETL with Kafka” is a catchy phrase that I purposely chose for this post instead of a more precise title like “Building a data pipeline with Kafka Connect”.TLDRYou don’t need to write any code for pushing data into Kafka, instead just choose your connector...

Messaging
Big Data

2.3.2018 | 4 minutes reading time

Deep Learning Workshop at codecentric AG in Solingen

Big Data – a buzz word you can find everywhere these days, from nerdy blogs to scientific research papers and even in the news. But how does Big Data Analysis work, exactly? In order to find that out, I attended the workshop on “Deep Learning with Keras...

Big Data
Data
AI
Machine Learning

6.2.2018 | 6 minutes reading time

Shirin Elsinghorst

Change Streams in MongoDB 3.6

MongoDB 3.6 introduces an interesting API enhancement called change streams. With change streams you can watch for changes to certain collections by means of the driver API. This feature replaces all the custom oplog watcher implementations out there...

Change Management
NoSQL

15.1.2018 | 2 minutes reading time

Tobias Trelle

BigchainDB – The lightweight blockchain framework [blockcentric #5]

With BigchainDB we see one of the first complete but simple blockchain frameworks. The project strives to make blockchain usable for a large number of developers and use cases without requiring special knowledge in cryptography and distributed systems...

Big Data
Blockchain

3.1.2018 | 5 minutes reading time

SMACK stack from the trenches

DC/OS – Apache Mesos

Apache Cassandra

Storage

CPU

Memory

Apache Kafka

Apache Spark

Metrics

External Storage

spark-defaults.conf content:

Using a custom Docker spark image

Throughput

Conclusion

Was this post helpful?

Blog author

More articles

Solution Factory – How to get from idea to product in 9 weeks

SMACK Stack DC/OS Style

IoT Analytics Platform

Combining Apache Cassandra with Apache Karaf

Your job at codecentric?

Agile Developer und Consultant (w/d/m)

More articles in this subject area

Becoming a Data-Driven Company with Applied Data Products

An introduction to federated learning in an industrial context: Advanced

An introduction to federated learning in an industrial context: Fundamentals

The universal recommender in Action(ML)

Thinking AI means re-thinking data

Kick-start your microservice project with JHipster

Golang, Gin & MongoDB – Building microservices easily

From PDF data sheets to shared understanding with serverless SHACL

Decrypt TLS traffic to Kafka using Wireshark

Using Apache PLC4X and ElasticSearch for IIoT monitoring and anomaly detection

Thoughts after completing the Coursera “Data Engineering, Big Data, and...

Hands-on Spark intro: Cross Join customers and products with business ...

Move n-gram extraction into your Keras model!

Window Functions in Stream Analytics

How to write a Kotlin DSL – e.g. for Apache Kafka

Cloud Launcher for MongoDB in the Google Compute Engine

ETL with Kafka

Deep Learning Workshop at codecentric AG in Solingen

Change Streams in MongoDB 3.6

BigchainDB – The lightweight blockchain framework [blockcentric #5]