Hadoop provisioning automation following the “Infrastructure as Code” paradigm
What is the quickest and best way to get a virtual Hadoop cluster running on your development machine?
One option is to use golden images, like those prepared by Hortonworks or Cloudera. These are virtual machines that come completely configured with tutorials and everything included. However, it's just one virtual machine, and these images are really geared towards initial learning, without much room for configuration.
The other option is to use Ambari (the graphical monitoring and management environment for Hadoop) to configure Hadoop on your virtual machines. Ambari is getting better almost daily and has reached a level of maturity that makes an Ambari-managed mini-cluster easily superior to the above-mentioned golden images for everything except maybe running your very first tutorial.
Recently Hortonworks described the steps to use Ambari to configure virtual machines here – but this manual still requires something like 20 commands just to set up Ambari, and it's still just one node (not much of a cluster…).
Adding Puppet to the mix, we can do better: here we'll give you the tools and show you how to set up a 3-node (virtual) Hadoop cluster managed by Ambari with just three commands.
You need to have a few things to get started.
- A decent machine – we will run three virtual machines with 2GB RAM each. We tried this only on machines with 16GB RAM; you should have at least 8GB.
- Vagrant – a tool that helps manage virtual development environments. You need to have it installed, together with a provider for virtual machines such as VirtualBox. Please make sure that your versions are current (Vagrant does not(!) automatically alert you to the availability of new versions).
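Before starting, it can save time to check the prerequisites from the command line. A minimal preflight sketch (`vagrant` and `VBoxManage` are the standard binaries shipped by Vagrant and VirtualBox):

```shell
# Preflight check: report whether Vagrant and VirtualBox are installed.
for tool in vagrant VBoxManage; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool found: $("$tool" --version 2>/dev/null | head -n1)"
  else
    echo "$tool NOT found - install it before continuing"
  fi
done
```

If either tool is reported missing, install it first; the rest of this walkthrough assumes both are available.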
Setup Virtual Machines and Install Ambari
Open a terminal window, download and unzip the vagrant and puppet files that we created:
curl "http://vzach.de/data/ambari-provisioning.zip" -o "ambari-provisioning.zip"
unzip ambari-provisioning.zip
These files contain the Puppet code (Puppet is a tool for automating configuration management and is supported by Vagrant) to set up the virtual machines to run Ambari (which in turn will set up Hadoop on your virtual cluster). For convenience, these files also include the Puppet standard library, stdlib.
Now change into the newly created ambari-provisioning directory and start everything by typing

vagrant up
Then grab a coffee and find something nice to read – it will take a while (expect around 15 minutes, though it very much depends on your machine and your internet connection).
What happens is this: first, a CentOS virtual machine image is downloaded; then three virtual machines (named one, two and three) are created from this image and configured to run Ambari. Concretely, firewall services are stopped, ntp is installed and started, the /etc/hosts files are changed to enable communication between the virtual machines, the Ambari agents are installed and started, and finally the agents are told where to find the server machine. Machine “one” will run the Ambari server; all three machines will run Ambari agents. These files only change the configuration of the virtual machines (which are not accessible from the public internet) – nothing is installed directly on your machine. You can see all of this by looking at the Puppet modules in the downloaded folder (all in all it's just around 250 LOC, not including the Puppet standard library stdlib that we included for convenience). You can find an explanation of the structure and content of such files in this (German) introduction to Vagrant and Puppet.
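To illustrate one of these steps: the /etc/hosts entries that let the nodes find each other look roughly like the sketch below. The IP addresses and the `.localdomain` suffix are our assumptions, derived from the 192.168.0.101 address used to reach the Ambari server on machine one – check the downloaded Puppet modules for the actual values. Here we write them to a temporary file instead of the real /etc/hosts:

```shell
# Sketch: generate /etc/hosts entries for the three nodes (assumed IPs).
HOSTS_FILE="$(mktemp)"
i=0
for name in one two three; do
  i=$((i + 1))
  echo "192.168.0.10$i  $name.localdomain  $name" >> "$HOSTS_FILE"
done
cat "$HOSTS_FILE"
```

In the real setup, Puppet distributes equivalent entries to all three machines so that hostname resolution works in both directions.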
Configure your Hadoop Cluster
Now you can use the graphical interface of Ambari to set up and configure your cluster: open 192.168.0.101:8080 in your browser, log in with the default Ambari user and password (admin/admin), name your cluster, and choose a service stack such as the default HDP 2.0. Then enter the hostnames and choose manual registration as shown below (the system will warn you twice that you need to manually install Ambari agents on all machines – but don't worry, we already did this for you):
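Besides the web UI, Ambari also exposes a REST API on the same port, which is handy for scripting checks. A hedged sketch using the address and default credentials from this setup (the curl call is commented out because it only works once the server is actually running):

```shell
# Assumed Ambari server address and default credentials from this setup.
AMBARI_USER="admin"
AMBARI_PASS="admin"
AMBARI_URL="http://192.168.0.101:8080/api/v1/clusters"
# List the configured clusters once Ambari is reachable:
# curl -s -u "$AMBARI_USER:$AMBARI_PASS" "$AMBARI_URL"
echo "Ambari UI: http://192.168.0.101:8080 (login: $AMBARI_USER/$AMBARI_PASS)"
```

The same API can later be used to automate the cluster configuration itself, which is exactly the direction hinted at towards the end of this post.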
Choose the services that you want to run and the machines they should run on.
Fill in missing configuration info.
And deploy. This again will take quite a while (30 minutes or more), but it runs completely unattended, and – on a decent developer machine – you should be able to continue working in the meantime.
That’s it – you should now have a complete Hadoop cluster with all the services you configured.
This is just a fun technological demonstration, but there is a serious motivation behind it: the techniques used here can be used to manage standardised test and development environments together with the code (the “Infrastructure as Code” vision), ensuring that all developers have immediate and easy access to such environments and that these environments can be versioned together with the rest of the codebase. And you can go even further – the code we created can be used to provision real (not virtual) machines (see here), and even the manual configuration with Ambari can be automated – but we will show this in a later blog post.
We've now also made the Puppet module available on Puppet Forge here. Note, however, that this is not everything we used in this blog post: the Vagrantfile is not included, and the etchosts module (which ensures that the virtual nodes can find each other) is also not included, as it is not generally needed.
Valentin Zacharias and Malte Nottmeyer.