Calculating Pi with Apache Spark

16.4.2016 | 9 minutes reading time

Apache Spark is a system for cluster computing and part of the increasingly popular SMACK stack. The aim of this blog post is to provide a beginner's introduction to setting up a mini Spark cluster of virtual machines (VMs) using Vagrant and to running a small example application on it that approximates \(\pi\).

The cluster

To set up the Vagrant cluster on your local machine, you first need to install Oracle VirtualBox (and Vagrant itself) on your system. After this, it suffices to clone the Git repository from here to a working directory of your choice.

Once in the working directory, we can spin up the cluster using the console command vagrant up. The cluster is deployed in standalone mode and will consist of a designated master node named sparkmaster and a configurable number of worker nodes. The nodes are assigned consecutive static IP addresses and the workers are accessible via password-less SSH from the master node.

The following table summarizes the hostnames and IP addresses of the nodes for later reference. It also includes the URLs of the web UIs provided by Spark on the nodes once the cluster is running:

Nodename	IP address	Web UI
sparkmaster	192.168.33.100	http://192.168.33.100:8080
sparkworker-01	192.168.33.101	http://192.168.33.101:8081
sparkworker-02	192.168.33.102	http://192.168.33.102:8082
etc.	etc.	etc.

After the cluster is up, we can use the command vagrant ssh nodename to connect to the node with name nodename. For example, let us connect to the master node via vagrant ssh sparkmaster and have a look at its Spark installation directory:

vagrant@sparkmaster:~$ ls -F $SPARK_HOME
CHANGES.txt  NOTICE  README.md  bin/   data/  examples/  licenses/  python/
LICENSE      R/      RELEASE    conf/  ec2/   lib/       logs/      sbin/

Spark comes with a couple of important directories containing executables and configuration files:

First of all, the directory SPARK_HOME/bin contains the spark-shell script for running Spark’s REPL (read-evaluate-print-loop), which allows interactive data exploration. But our main character here is the spark-submit script: it can be used to submit Spark applications as a JAR to the cluster.
Next, SPARK_HOME/conf contains the configuration files slaves and spark-env.sh. The first lists the hostnames of all VMs to be used as slaves while the second lists options used by Spark.
Finally, the directory SPARK_HOME/sbin will be important as it contains the shell scripts for starting and stopping the master as well as worker instances on the designated machines, either individually or in one go via the start-all.sh and stop-all.sh scripts.

We will start the master on the VM named sparkmaster while all the other VMs will be used as slaves. This can be achieved by running the start-all.sh script on sparkmaster:

vagrant@sparkmaster:~$ $SPARK_HOME/sbin/start-all.sh

We can check that everything went smoothly by inspecting the log files in the SPARK_HOME/logs directory on each individual machine. As mentioned above, the master and slave instances can be stopped by running the stop-all.sh script on sparkmaster.

Inspecting the web UI

More information is available from the web UI of the Spark master:

Here we find the following information:

A list of all workers in the cluster under the section heading Workers.
Information on Running Applications and Completed Applications.

The UI is reachable as long as we do not deliberately stop the master by invoking one of the scripts for stopping it.

Submitting an application to the cluster

To actually submit an application to our cluster, we make use of the SPARK_HOME/bin/spark-submit script. To test this, and to check that our cluster is set up properly, we will use the example application for computing an approximation to \(\pi\) via Monte Carlo that ships with the Spark installation (Code: GitHub).

For convenience the shared vagrant folder contains a shell script for submitting the example application to the cluster:

spark-submit \
--class de.codecentric.SparkPi \
--master spark://192.168.33.100:7077  \
--conf spark.eventLog.enabled=true \
/vagrant/jars/spark-pi-example-1.0.jar 10

Besides a reference to the main class in the JAR and the path to the latter, we pass the IP address and port of the Spark master instance and enable event logging. The latter will allow us to look at specific information in the web UI even after the application has finished. The argument 10 determines the size of the random sample used and also the degree of parallelism; see below.

If we invoke this script we get the result of the computation printed to the console. Also note the corresponding finished application after switching to the master Web UI in our browser:

vagrant@sparkmaster:~$ /vagrant/scripts/submit-script-pi.sh
Pi is roughly 3.13918

How is \(\pi\) approximated here?

This computation is based on the following heuristic: By definition \(\pi\) is the area \(A_{\mathrm{Circle}}\) of a circle with radius \(r=1\) (generally, \(\pi\cdot r^2\) is the area of a circle of radius \(r\)).

One then circumscribes this unit circle with a square whose area equals \(A_{\mathrm{Square}}=4\). The ratio of these two areas thus equals \(\frac{A_{\mathrm{Circle}}}{A_{\mathrm{Square}}}=\frac{\pi}{4}\) and gives the geometric probability that a point chosen uniformly at random inside the square lies inside the circle.
Now let us assume that we pick a huge number \(n\) of points randomly inside the circumscribed square, for example, by throwing darts or dropping rain drops onto it. A certain number \(n_{\mathrm{in}}\) of these points will end up inside the area described by the circle while the remaining number \(n_{\mathrm{out}}\) of these points will lie outside of it (but inside the square). Thus \(n_{\mathrm{in}}+n_{\mathrm{out}}=n\) and the observed fraction of points lying inside the circle is \(\frac{n_{\mathrm{in}}}{n}\).
Heuristically, one has \(\frac{A_{\mathrm{Circle}}}{A_{\mathrm{Square}}}\approx\frac{n_{\mathrm{in}}}{n}\) and hence \(\pi\approx\frac{4\cdot n_{\mathrm{in}}}{n}\).

It goes without saying that this algorithm is non-deterministic and results will likely change with each run.
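To put a rough number on this run-to-run variation, here is a standard binomial argument (an addition for illustration, assuming the points are independent and uniformly distributed). Each point lands inside the circle with probability \(p=\frac{\pi}{4}\), so \(n_{\mathrm{in}}\) is binomially distributed and the estimate \(\frac{4\cdot n_{\mathrm{in}}}{n}\) has standard deviation

\[
\sigma = 4\sqrt{\frac{p\,(1-p)}{n}} \approx \frac{1.64}{\sqrt{n}}.
\]

For \(n=10^6\) sample points this is roughly \(0.0016\), so typically only the first two decimal places of the estimate are stable, and each further decimal place costs about a hundred times more points.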

To wrap things up: the beauty of this approach is that it lets us approximate \(\pi\) by simply counting the fraction of points that end up inside the circle out of a total population of points randomly thrown at the circumscribed square. This is something that can be distributed in a trivial fashion, and it is exactly what the mentioned Spark application does! An interactive visualization of the above may be found here: Link.
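Before bringing Spark into the picture, the counting heuristic can be tried out on a single machine. The following is a minimal sketch in plain Scala; the names MonteCarloPi and estimatePi as well as the fixed seed are assumptions made for illustration and do not appear in the SparkPi source.

```scala
import scala.util.Random

object MonteCarloPi {
  /** Estimates pi from n random points in the square [-1, 1] x [-1, 1]. */
  def estimatePi(n: Int, rng: Random): Double = {
    require(n > 0, "n must be positive")
    // Draw n points lazily and count those strictly inside the unit circle.
    val nIn = Iterator.fill(n) {
      val x = rng.nextDouble() * 2 - 1
      val y = rng.nextDouble() * 2 - 1
      x * x + y * y < 1
    }.count(inside => inside)
    4.0 * nIn / n
  }

  def main(args: Array[String]): Unit = {
    // A fixed seed (arbitrary choice) makes the run reproducible.
    val rng = new Random(42L)
    println(s"Pi is roughly ${estimatePi(1000000, rng)}")
  }
}
```

Since every point is drawn independently of all others, the loop can be split into chunks that are processed on different machines, with only the partial counts being added up at the end.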

In the following, we drill down into some of the basic concepts of Spark by looking at the code of SparkPi. This includes the concept of an RDD, Spark’s abstract data type for handling data distributed on a cluster.

Resilient Distributed Datasets (RDDs)

Within the Spark world the core abstraction is that of a Resilient Distributed Dataset. The rationale is that we want to create, distribute and process data within a cluster that is created from various input data, e.g. text files or plain Java/Scala collections. These input data are structured by Spark into RDDs, which one can basically think of as Java/Scala collections that are distributed over the cluster in partitions. Spark provides a functional-programming style API for Java/Scala that allows us to either

create new RDDs from various input sources, like files residing in HDFS, etc.
create new RDDs from already existing ones by so-called transformations, or
to create final Java/Scala values from existing RDDs by so-called actions.

To make these distributed data sets resilient or fault-tolerant, Spark keeps track of the dependencies between the input data and the intermediate RDDs created from it through the RDD dependency graph. In case of failure, this graph allows Spark to replay the parts of the computation that were necessary to create the RDD at hand. It is important to note that RDDs are computed in a lazy fashion: only creating a final Java/Scala value via an action triggers the actual execution of a computation. Since the dependency graph in Spark is an example of a directed acyclic graph (DAG), this name is frequently used to refer to it, for example in the web UI.
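The distinction between lazy transformations and actions can be tried out without a cluster. The sketch below assumes Spark is on the classpath and uses the local master URL "local[2]" instead of the Vagrant cluster; the object name LazyRddSketch and the helper sumOfEvenSquares are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyRddSketch {
  /** Sums the even squares of 1..10 using two transformations and one action. */
  def sumOfEvenSquares(sc: SparkContext): Int = {
    val numbers = sc.parallelize(1 to 10, 2)  // RDD created from a Scala collection
    val squares = numbers.map(i => i * i)     // transformation: nothing is computed yet
    val evens   = squares.filter(_ % 2 == 0)  // transformation: still nothing computed

    // The lineage (dependency graph) is already known at this point.
    println(evens.toDebugString)

    evens.reduce(_ + _)                       // action: triggers the actual computation
  }

  def main(args: Array[String]): Unit = {
    // "local[2]" runs Spark inside this JVM with two threads (assumption: no cluster).
    val conf = new SparkConf().setAppName("Lazy RDD sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(s"Sum of even squares: ${sumOfEvenSquares(sc)}")
    finally sc.stop()
  }
}
```

The call to toDebugString prints a textual form of the lineage before any data has been processed, which is a quick way to see the dependency graph without opening the web UI.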

Writing a simple Spark application

To illustrate the ideas outlined in the previous section, let us rewrite the application SparkPi step by step. We will follow the original source but allow ourselves to deviate a little from it in order to stress where and how RDDs are created and transformed. To begin with, the basic skeleton for the main application looks as follows:

import scala.math.random
import org.apache.spark._

object SparkApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark Pi")
    val sc = new SparkContext(conf)

    // Application code goes here...

    sc.stop()
  }
}

The main entry point to every Spark application is creating a SparkContext object. It provides a connection to the Spark cluster and context information about the cluster as well as the application itself and is used to create RDDs from input data. For example we are able to set the name of the application that will also appear in the Spark web UIs to be "Spark Pi". Further parameters might be passed to the Spark context at runtime as has already happened in the above usage of the submit script; there the IP address of the master node is passed to the Spark context.
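The skeleton does not set a master URL itself, so it relies on spark-submit to provide one. As far as I understand Spark's configuration handling, a SparkConf created with the default constructor picks up spark.* properties set at launch, and values set explicitly in code take precedence over them. The following sketch therefore uses setIfMissing to fall back to a local master only when none was passed; the object name ConfSketch and the fallback "local[2]" are assumptions for illustration.

```scala
import org.apache.spark.SparkConf

object ConfSketch {
  /** Builds the configuration without overriding a master passed at submit time. */
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("Spark Pi")                    // stored under the key spark.app.name
      .setIfMissing("spark.master", "local[2]")  // only used if no master was provided

  def main(args: Array[String]): Unit = {
    val conf = buildConf()
    println(s"master   = ${conf.get("spark.master")}")
    println(s"app name = ${conf.get("spark.app.name")}")
  }
}
```

With this in place the same JAR can be started directly from an IDE for quick experiments and still be submitted to the cluster with the script shown earlier.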

The main step in the application code is to create a huge number n of random sample points. We start with the parallelize method provided by the Spark context sc, which creates an initial RDD from any Scala collection. In our case this collection, xs, consists of the first n consecutive numbers. The resulting RDD is divided into slices partitions. Next, this RDD is transformed via map into the RDD sample, which contains n random points \((x,y)\) inside the square \([-1,1]\times [-1,1]\). Finally, we use filter to keep only those points of the sample that lie in the interior of the unit disc and count them in order to obtain an approximate value for \(\pi\). Here counting is the final action that triggers the evaluation of all previous RDDs along the dependency graph.

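// Number of partitions; taken from the first program argument, default 2
val slices = if (args.length > 0) args(0).toInt else 2
// Sample size; multiplying as Long and clamping avoids an Int overflow
val n = math.min(100000L * slices, Int.MaxValue).toInt
// The first n consecutive numbers 1, ..., n
val xs = 1 to n
val rdd = sc.parallelize(xs, slices)
            .setName("'Initial rdd'")
// Map every number to a random point (x, y) in [-1, 1] x [-1, 1]
val sample = rdd.map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  (x, y)
}.setName("'Random points sample'")

// Keep only the points that lie inside the unit circle
val inside = sample.filter { case (x, y) => (x * x + y * y < 1) }.setName("'Random points inside circle'")

// count() is the action that triggers the whole computation
val count = inside.count()

println("Pi is roughly " + 4.0 * count / n)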
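The expression used for the sample size deserves a closer look: the factor 100000 is written as a Long so that the product is computed in 64 bits before it is clamped to the largest Int. The following small sketch isolates this step; the names SampleSize, sampleSize and naiveSampleSize are hypothetical helpers and not part of the application.

```scala
object SampleSize {
  /** Sample size for a given number of slices, clamped to Int.MaxValue. */
  def sampleSize(slices: Int): Int =
    math.min(100000L * slices, Int.MaxValue).toInt

  /** Same product in 32-bit arithmetic; silently overflows for large inputs. */
  def naiveSampleSize(slices: Int): Int = 100000 * slices

  def main(args: Array[String]): Unit = {
    println(sampleSize(10))          // 1000000
    println(sampleSize(30000))       // 2147483647 (clamped)
    println(naiveSampleSize(30000))  // -1294967296 (overflow)
  }
}
```

With the argument 10 from the submit script, the application therefore works with one million sample points spread over ten partitions.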

After running the application, we can find a visual representation of the dependency graph of the final RDD inside by clicking either the corresponding application ID or its name (here “SparkPi”) in the master web UI under the section “Completed Applications”. There one finds a link labeled “Application Detail UI”, which leads to more detailed information about the jobs and stages involved in the application. Our application includes only one job consisting of a single stage, and by clicking on the corresponding link in the “Application Detail UI”, we finally find a representation of the dependency graph:

Notice that we were able to set names for debugging/monitoring purposes in the application code by using the setName method provided by the RDD class, and that these names also appear in the visual representation of the dependency graph. This is helpful, for example, when it comes to identifying performance bottlenecks in larger applications that involve more intricate ways of creating and transforming RDDs.

That’s all! If you want, you can stop the cluster using vagrant halt or can completely get rid of it with vagrant destroy -f after exiting from the master’s shell.

Summary

In conclusion, we described how to set up a small Spark cluster using Vagrant, and how to write and submit a simple application to the cluster. Finally, we saw how to make basic use of the web UI for monitoring purposes.

Blog author

Daniel Pape
