In this blog post we want to give a tutorial on the brand new Ambari Blueprints. Blueprints make it possible to automate the configuration of Hadoop clusters – together with Vagrant, Foreman and Puppet they are the last missing component needed to completely describe a Hadoop cluster in code and have it run automatically, both on virtual machines and on bare metal. This makes it possible to quickly create development and test clusters that are (possibly with the exception of size) identical to the production environment.
As an example for this tutorial we use a realization of the Lambda Architecture. You can read about the Lambda Architecture here – but in the end you need to know nothing more than that we want a Hadoop 2 cluster with HDFS (to store files), HBase (to store precomputed views – both views created in batch processes and realtime views), Storm (to process data in realtime), MapReduce (to process data in batches) and finally Pig to make it easier to create views. Of course we want to use Tez to speed up our processing.
Prerequisites
In previous blog posts we described how to provision virtual or bare metal machines automatically to build your own Hadoop cluster. In both cases we provided and configured Ambari for you, so if you followed us there you already meet these requirements. Otherwise, check the following points. You will need:
- an installed Ambari server on one node
- an installed Ambari agent on all nodes
- a working DNS environment including a domain name for every node
- disabled or properly configured services that could interfere with or block Ambari (see the sketch after this list)
- the NTP service installed on all nodes
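On a typical CentOS 6 node, the last two points can be covered as in the following sketch. It assumes that iptables and SELinux are the services interfering on your machines and that your security policy allows switching them off – adapt it to your distribution and policy:

# stop the firewall (or open the ports Ambari needs instead)
service iptables stop && chkconfig iptables off
# set SELinux to permissive, now and after reboots
setenforce 0
sed -i 's/^SELINUX=.*/SELINUX=permissive/' /etc/selinux/config
# install and start the NTP service
yum install -y ntp
service ntpd start && chkconfig ntpd on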
Ambari
When installing a Hadoop cluster, Ambari provides an easy way to install a customizable stack of Hadoop services without needing to worry about the details of the installation. You simply install Ambari and click your way through the user interface.
With Ambari 1.5.0 the new blueprint feature was introduced (but not widely publicized). It allows you to programmatically set most of the configurations supported by the Ambari UI: you can define the Hadoop service combination for each host plus a set of cluster-scoped configurations. Host-scoped configurations are not yet available.
The feature consists of two main components, the blueprint itself and a host mapping. Both are JSON objects and can be exchanged with the Ambari REST API.
Blueprints
A blueprint defines the logical structure of a cluster without needing information about the actual infrastructure. You can therefore use the same blueprint for different numbers of nodes, different IPs and different domain names.
The base structure is defined by the top-level JSON elements “configurations”, “host_groups” and “Blueprints” (the “configurations” element is optional):
1{ 2 "configurations" : [{ ... }, { ... }, ... ], 3 "host_groups" : [{ ... }, { ... }, ... ], 4 "Blueprints" : { ... } 5}
Configurations
This element allows you to set the configuration options for the services running within the Hadoop cluster. It is structured as an array of configuration types. Every type is identified by a unique name and contains key-value pairs of specific settings. For example, “global” is a configuration type which among others contains the setting “namenode_heapsize”:
1"configurations" : [ { "global" : { "namenode_heapsize" : "1536m", ... } }, { ... } ]
You can read the internal types and keys for a setting from the user interface. We also collected the most common settings here (look for the “type” and “properties” attributes), but keep in mind that some values are specific to the cluster we used. Also, some dynamically generated property names (e.g. ones that contain a user name) might not work yet. We recommend specifying only the values that you really want to differ from the defaults.
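For illustration, a complete “configurations” array that overrides one value in each of two types could look like the following sketch. The “global”/“namenode_heapsize” pair is taken from the example above; “fs.trash.interval” in “core-site” is a standard Hadoop property we use here only as an assumed second example – verify both against your own UI:

"configurations" : [
  { "global" : { "namenode_heapsize" : "1536m" } },
  { "core-site" : { "fs.trash.interval" : "360" } } ]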
If you want to read out the existing configuration of a running Ambari cluster, you can do that with the following HTTP calls (when using a browser, make sure to log into Ambari first; all HTTP calls here are relative to the Ambari server’s base HTTP address):
GET /api/v1/clusters/c1/configurations
  - shows you which configuration types and tags you are using

GET /api/v1/clusters/c1/configurations?type=INSERT_TYPE&tag=INSERT_TAG
  - then shows you the setting values (the tag normally equals 1)
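For example, with curl this could look like the following (assuming the default admin:admin credentials and the Ambari server address we use later in this post; note the quotes around the URL containing the “&”):

curl --user admin:admin http://192.168.0.101:8080/api/v1/clusters/c1/configurations

curl --user admin:admin "http://192.168.0.101:8080/api/v1/clusters/c1/configurations?type=global&tag=1"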
Host Groups
A host group has a name (unique within the same blueprint) plus a cardinality and contains a combination of Hadoop service components. A host group thus defines a server type in a cluster: every server in one host group will get the same service components installed. The cardinality is the number of servers that should be in a specific host group. This attribute does not seem to be restrictive: you can set it to a higher value or even to “*”.
For a one-node HDFS setup, this would for example look like this:
1"host_groups":[ 2 { "name":"host_group_1", 3 "components":[ 4 { "name":"ZOOKEEPER_SERVER" }, 5 { "name":"ZOOKEEPER_CLIENT" }, 6 { "name":"AMBARI_SERVER" }, 7 { "name":"NAMENODE" }, 8 { "name":"HDFS_CLIENT" }, 9 { "name":"SECONDARY_NAMENODE" }, 10 { "name":"DATANODE" }, ... ], 11 "cardinality":"1" }, ... ]
The component names are Ambari-specific; for convenience you can find the HDP-2.1 services with their components below.
HDFS        DATANODE, HDFS_CLIENT, JOURNALNODE, NAMENODE, SECONDARY_NAMENODE, ZKFC
YARN        APP_TIMELINE_SERVER, NODEMANAGER, RESOURCEMANAGER, YARN_CLIENT
MAPREDUCE2  HISTORYSERVER, MAPREDUCE2_CLIENT
GANGLIA     GANGLIA_MONITOR, GANGLIA_SERVER
HBASE       HBASE_CLIENT, HBASE_MASTER, HBASE_REGIONSERVER
HIVE        HIVE_CLIENT, HIVE_METASTORE, HIVE_SERVER, MYSQL_SERVER
HCATALOG    HCAT
WEBHCAT     WEBHCAT_SERVER
NAGIOS      NAGIOS_SERVER
OOZIE       OOZIE_CLIENT, OOZIE_SERVER
PIG         PIG
SQOOP       SQOOP
STORM       DRPC_SERVER, NIMBUS, STORM_REST_API, STORM_UI_SERVER, SUPERVISOR
TEZ         TEZ_CLIENT
FALCON      FALCON_CLIENT, FALCON_SERVER
ZOOKEEPER   ZOOKEEPER_CLIENT, ZOOKEEPER_SERVER
Now you could craft yourself a few host groups and try to provision them. But keep in mind that there is no validation of your component combinations when using a blueprint, so you should take the requirements of each component into account.
To be safe, it is possible to retrieve the blueprint of an existing cluster with the following HTTP call:
GET /api/v1/clusters/YOUR_CLUSTER_NAME?format=blueprint
  - gives you the exact component combination of a cluster as a raw blueprint
    (without the cluster configuration!)
Thus you could also configure the components in the user interface (where you are supported by a bit of validation logic). You can retrieve a cluster’s blueprint from the moment the installation begins.
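With curl, exporting such a raw blueprint could look like this (again assuming admin:admin and the server address from our example; the output file name is arbitrary):

curl --user admin:admin http://192.168.0.101:8080/api/v1/clusters/c1?format=blueprint -o exported-blueprint.json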
Caution: In version 1.5.1 of Ambari there is an issue with using any HBase component in blueprints! However, you can still install HBase manually afterwards and even automate this by intercepting and reusing the HTTP calls of the user interface.
Other
The final missing JSON element, “Blueprints”, only contains the blueprint name, the stack (HDP) and the stack version. The name will be important later, when mapping a blueprint to an actual cluster:
1"Blueprints" : { 2 "blueprint_name" : "blueprint-c1", 3 "stack_name" : "HDP", 4 "stack_version" : "2.1" }
Host Mapping
For the actual cluster creation you also need a second JSON file. Basically, the work left is to tell Ambari which blueprint it should use and which host should be in which host group. With the attribute “blueprint” you define the name of the blueprint; then you define the hosts of each host group. For example, here we define the host “one.cluster” to be in “host_group_1” of “blueprint-c1” (the ip attribute is optional):
1{ "blueprint":"blueprint-c1", 2 "host-groups":[ 3 { "name":"host_group_1", 4 "hosts":[ 5 { "fqdn":"one.cluster", 6 "ip":"192.168.0.101" }, ... ] }, ... ] }
Now there is only one question left: how do you create the cluster? It’s as simple as two REST calls. (Ambari requires you to include the header ‘X-Requested-By:MY_COMPANY’. See our example for how to trigger the requests.)
POST /api/v1/blueprints/BLUEPRINT_NAME   (body: blueprint.json)
  - makes the blueprint available to Ambari

POST /api/v1/clusters/CLUSTER_NAME   (body: hostmapping.json)
  - merges the blueprint with the host mapping into a cluster
After these calls, your cluster begins to install. You can log into Ambari and watch the installation or do other stuff.
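If you would rather watch the progress from the command line, you can poll the request resource that the cluster-creation call starts – a sketch assuming the cluster is named c1 and the installation is its first request:

curl --user admin:admin http://192.168.0.101:8080/api/v1/clusters/c1/requests
  - lists the requests Ambari is processing for this cluster

curl --user admin:admin http://192.168.0.101:8080/api/v1/clusters/c1/requests/1
  - shows the state of the installation request, including a progress_percent field

To see all of this in action, follow us through our example: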
Target Cluster
For demonstration purposes we continue the three-virtual-machines example from our first blog post. We will provide you with a Lambda Architecture blueprint and a host mapping fitting these virtual machines. You can also follow our example with your own infrastructure – simply adapt it where needed.
Our target cluster will therefore consist of three (virtual) machines. With such a small number of machines we need a host group for each one and distribute the “heavy” services equally among them. In addition, every machine gets the standard and client services.
To recall, we want our cluster to fulfill the Lambda Architecture functionalities. In general this means combining realtime and batch computation into one consistent realtime big data context. The realtime results (from e.g. Storm) and the batch results (from e.g. MapReduce 2 + Pig) can be combined and stored in HBase. Every other service that we specify in the blueprint provides the base for Storm, MapReduce 2, Pig and HBase: distributed file storage (HDFS), resource management (YARN), an execution engine (Tez), a coordination service (ZooKeeper) and monitoring + metrics (Nagios + Ganglia).
The first VM is the monitoring and resource management node, the second node contains the Storm service components and the third node would handle the HBase master component. (The HBase components are omitted from the example blueprint because they lead to a failure during installation in Ambari version 1.5.1.)
Example
The result of these considerations is the following blueprint:
1{ "host_groups" : [ 2 { "name" : "host_group_1", 3 "components" : [ 4 { "name" : "ZOOKEEPER_SERVER" }, 5 { "name" : "ZOOKEEPER_CLIENT" }, 6 { "name" : "PIG" }, 7 { "name" : "HISTORYSERVER" }, 8 { "name" : "SUPERVISOR" }, 9 { "name" : "NAGIOS_SERVER" }, 10 { "name" : "TEZ_CLIENT" }, 11 { "name" : "AMBARI_SERVER" }, 12 { "name" : "APP_TIMELINE_SERVER" }, 13 { "name" : "GANGLIA_SERVER" }, 14 { "name" : "HDFS_CLIENT" }, 15 { "name" : "NODEMANAGER" }, 16 { "name" : "YARN_CLIENT" }, 17 { "name" : "MAPREDUCE2_CLIENT" }, 18 { "name" : "DATANODE" }, 19 { "name" : "GANGLIA_MONITOR" }, 20 { "name" : "RESOURCEMANAGER" } ], 21 "cardinality" : "1" }, 22 { "name" : "host_group_2", 23 "components" : [ 24 { "name" : "ZOOKEEPER_SERVER" }, 25 { "name" : "ZOOKEEPER_CLIENT" }, 26 { "name" : "PIG" }, 27 { "name" : "STORM_REST_API" }, 28 { "name" : "STORM_UI_SERVER" }, 29 { "name" : "SUPERVISOR" }, 30 { "name" : "SECONDARY_NAMENODE" }, 31 { "name" : "TEZ_CLIENT" }, 32 { "name" : "HDFS_CLIENT" }, 33 { "name" : "NODEMANAGER" }, 34 { "name" : "YARN_CLIENT" }, 35 { "name" : "MAPREDUCE2_CLIENT" }, 36 { "name" : "DATANODE" }, 37 { "name" : "GANGLIA_MONITOR" }, 38 { "name" : "DRPC_SERVER" }, 39 { "name" : "NIMBUS" } ], 40 "cardinality" : "1" }, 41 { "name" : "host_group_3", 42 "components" : [ 43 { "name" : "ZOOKEEPER_SERVER" }, 44 { "name" : "ZOOKEEPER_CLIENT" }, 45 { "name" : "PIG" }, 46 { "name" : "NAMENODE" }, 47 { "name" : "SUPERVISOR" }, 48 { "name" : "TEZ_CLIENT" }, 49 { "name" : "HDFS_CLIENT" }, 50 { "name" : "NODEMANAGER" }, 51 { "name" : "YARN_CLIENT" }, 52 { "name" : "MAPREDUCE2_CLIENT" }, 53 { "name" : "DATANODE" }, 54 { "name" : "GANGLIA_MONITOR" } ], 55 "cardinality" : "1" } ], 56 "Blueprints" : { 57 "blueprint_name" : "blueprint-c1", 58 "stack_name" : "HDP", 59 "stack_version" : "2.1" } }
Mapping this to the three virtual machines is then quite easy to describe:
1{ "blueprint":"blueprint-c1", 2 "host-groups":[ 3 { "name":"host_group_1", 4 "hosts":[ { "fqdn":"one.cluster" } ] }, 5 { "name":"host_group_2", 6 "hosts":[ { "fqdn":"two.cluster" } ] }, 7 { "name":"host_group_3", 8 "hosts":[ { "fqdn":"three.cluster" } ] } ] }
Procedure
If you want to see this in action:
- Start three virtual machines managed by Ambari and wait for them to come up (to do this you can use the resources provided here).
- Choose your favorite way to trigger the POST requests described above (including the given JSON objects and the needed header) or simply execute the following commands (you can also run them in one of the virtual machines):
curl http://vzach.de/data/lamba-blueprint.json -o lamba-blueprint.json

curl --user admin:admin -H 'X-Requested-By:mycompany' -X POST http://192.168.0.101:8080/api/v1/blueprints/blueprint-c1 -d @lamba-blueprint.json

curl http://vzach.de/data/lamda-hostmapping.json -o lamda-hostmapping.json

curl --user admin:admin -H 'X-Requested-By:mycompany' -X POST http://192.168.0.101:8080/api/v1/clusters/c1 -d @lamda-hostmapping.json
To further automate this, you could wrap the commands in a script and execute it once every needed machine is provisioned. You can query whether a machine is ready to be used by Ambari with: GET /api/v1/hosts
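A minimal sketch of such a wrapper, assuming three expected hosts, the default admin:admin credentials and an available Python to count the returned JSON items:

#!/bin/sh
# wait until all expected machines have registered with the Ambari server
EXPECTED=3
while true; do
  COUNT=$(curl -s --user admin:admin http://192.168.0.101:8080/api/v1/hosts |
          python -c 'import json,sys; print(len(json.load(sys.stdin)["items"]))')
  [ "$COUNT" -ge "$EXPECTED" ] 2>/dev/null && break
  echo "only ${COUNT:-0} of $EXPECTED hosts registered, waiting ..."
  sleep 10
done
# now trigger the two POST calls from above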
Conclusion
We have seen how the configuration of a Hadoop cluster can be described in blueprints and how this makes it possible to manage this configuration together with the rest of the codebase. Together with Foreman and Puppet it is possible to go from bare metal to an installed cluster without the need for manual actions. However, we have to admit that while you can try and experiment with it right now, you had best wait for Ambari version 1.6 before you fully integrate it into your work environment (it’s still experimental and some features – like the provisioning of HBase – do not work yet).
The presented solution is particularly well suited for cases where you already have a cluster and only need to represent changes to its configuration. In such cases it is easy to generate an initial blueprint from the existing configuration and to apply changes.
Authors
Valentin Zacharias and Malte Nottmeyer