After we went through a crash course on Docker in the first part of this series, it’s time to start installing our first services. In this second part we’ll cover Mesos and Hadoop. But before that, we’ll warm up by creating our first Docker image, which will serve as our base for all the others.

2. The Ubuntu base image

Here is the Dockerfile for our base image:

FROM ubuntu:14.04
# Creates /etc/bootstrap.sh and runs this script as entry point
RUN echo "#!/bin/bash" > /etc/bootstrap.sh \
&& chmod 777 /etc/bootstrap.sh
# This entry point will be inherited by all derived images
ENTRYPOINT ["/etc/bootstrap.sh"]

All we’re doing here is specifying that we want to start with a base Ubuntu 14.04 image (which will be downloaded from the repository if necessary) and then creating an empty shell script called /etc/bootstrap.sh that is defined as the entrypoint for this image.

In order to build the image, we have to run the docker build command:

docker build -t acme/base_ubuntu_14_04 bases/ubuntu_14_04

As described in the first part of the tutorial, the -t argument defines the tag for the newly created image and the last argument is the path to the Dockerfile to be used. In our case, the tag has a namespace – acme – representing a fictitious company, Acme Inc. This tag is what we’ll use in the FROM directive for the images that will be derived from the one we just built.

3. Mesos

3.1 Overview

Mesos is a cluster manager, i.e. a service that coordinates the distribution of a cluster’s computing resources among tasks spawned by one or more applications. It is one of several managers that Spark can currently use, the others being Spark’s own manager – called standalone – and YARN. Without going into the details here, the reason I chose Mesos over the others is that it seemed the most flexible and future-proof choice.

3.2 The Docker images

3.2.1 The Mesos base image

This section will walk you through the creation of the Mesos base image and give you some background on why the Dockerfile looks a bit intimidating.

Currently, if you want to install Mesos from the project’s site, the only possibility is to download the source code and compile it. I did that initially, but then ran into some issues when running the master. Since I didn’t have time to figure out what the problem was, I was happy to discover in the Spark documentation that Mesos can also be installed from binaries provided by Mesosphere.

Now that you know the story behind it, here is the Dockerfile:

FROM acme/base_ubuntu_14_04
### Install mesos from mesosphere
# Set up key
RUN apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv E56151BF
# Add the repository
RUN DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]') \
&& CODENAME=$(lsb_release -cs) \
&& echo "deb http://repos.mesosphere.com/${DISTRO} ${CODENAME} main" | \
tee /etc/apt/sources.list.d/mesosphere.list
RUN apt-get -y update
#Install mesos
RUN apt-get -y install mesos
#Set mesos env
ENV MESOS_WORK_DIR=/var/lib/mesos
ENV MESOS_LOG_DIR=/var/log/mesos

All we’re really doing is adding Mesosphere’s repository and then installing Mesos using apt-get.

3.2.2 The Mesos master image

Here is the Dockerfile for the Mesos master image:

FROM acme/mesos_base
COPY bootstrap.sh /opt/bootstrap.sh
#Add this layer's bootstrap to /etc/bootstrap.sh
RUN cat /opt/bootstrap.sh >> /etc/bootstrap.sh
RUN rm /opt/bootstrap.sh

The bootstrap.sh file is a one-liner which simply launches the Mesos master:
/usr/bin/mesos master
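The cat-append in the Dockerfile above is what wires the layers together: the base image creates an empty entry-point script, and each derived image appends its own commands to it. Here is a minimal local simulation of that pattern (paths are illustrative; no Docker involved):

```shell
# Simulate the layered bootstrap pattern locally
echo '#!/bin/bash' > /tmp/bootstrap.sh                    # base image: empty entry-point script
chmod 777 /tmp/bootstrap.sh
echo 'echo "starting mesos master"' >> /tmp/bootstrap.sh  # derived layer appends its command
/tmp/bootstrap.sh                                         # running the entry point runs every layer's step
```

Because each image only ever appends, a grandchild image inherits the bootstrap steps of all its ancestors in order.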

And finally, here is the script I use to run the container:

docker run -d -e "MESOS_IP=${_MESOS_MASTER_IP}" -e "MESOS_HOSTNAME=${_MESOS_HOSTNAME}" --name "mesos_master" --net="host" acme/mesos_master

Through trial and error, I discovered that only two of the many environment variables were mandatory to run the Mesos master: MESOS_IP and MESOS_HOSTNAME. This post pointed me in the right direction.

The syntax of the docker run command was presented in the first part of the tutorial. Also in the first part, I promised that I would provide more details about how we can configure a container to reuse the Docker host’s IP address instead of a randomly allocated one. That goal is achieved by supplying the --net="host" argument to the docker run command.

As alluded to in the first post, in my initial iteration I used $_DOCKER_IP instead of $_MESOS_MASTER_IP (both are set to the IP address of the Docker host) and the literal string boot2docker instead of $_MESOS_HOSTNAME. Both approaches work, but the second allows a smoother transition to a multi-host cluster.
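For reference, these variables might be initialized in a small setup script along these lines (this helper is my assumption, not from the original post; the fallback to 127.0.0.1 is only there for machines without boot2docker):

```shell
# Assumed setup script: resolve the Docker host's IP once and reuse it everywhere
_DOCKER_IP=$(boot2docker ip 2>/dev/null || echo "127.0.0.1")  # fall back when boot2docker is absent
_MESOS_MASTER_IP=$_DOCKER_IP
_MESOS_HOSTNAME=$(hostname)
echo "master ip: $_MESOS_MASTER_IP, hostname: $_MESOS_HOSTNAME"
```

Sourcing one such file from every run script keeps the addresses consistent across all containers.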

3.2.3 The Mesos slave image

The slave’s Dockerfile is identical to the master’s so I won’t include it again.

The bootstrap.sh file is as follows:

/usr/bin/mesos slave $1

This script launches the Mesos slave and expects its first argument to define the master node.

Let’s turn to the run script to see how this argument is passed in:

docker run -d --name "mesos_slave1" acme/mesos_slave --master="${_MESOS_MASTER_IP}":5050

If you recall from the first part, any arguments that follow the name of the image are supplied as arguments to the entrypoint, which in our case is the bootstrap script we just saw.
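The mechanics can be demonstrated without Docker: the entry-point script simply receives the extra docker run arguments as its positional parameters. A minimal simulation (file path and addresses assumed):

```shell
# Simulate how docker run passes image arguments to the entry point
cat > /tmp/entrypoint_demo.sh <<'EOF'
#!/bin/bash
# $1 carries whatever followed the image name, e.g. --master=<ip>:5050
echo "received: $1"
EOF
chmod +x /tmp/entrypoint_demo.sh
/tmp/entrypoint_demo.sh --master=192.168.59.103:5050
# prints: received: --master=192.168.59.103:5050
```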

3.3 Testing

Now that you have master and slave containers running, you can open the master’s web UI (on your Docker host, port 5050) and check that the master is up and that the left-hand column shows one active slave.

4. Hadoop

4.1 Relation to Spark

I have to admit that the relation between Spark and Hadoop wasn’t initially clear to me. On one side, I was learning that Spark didn’t require Hadoop to run, but on the other, when downloading Spark, I had to choose a Hadoop version. As it turns out, the reason for this dependency is Spark’s support for many input sources, one of the most prominent being HDFS, due to its ubiquity in big data environments.

Another use for HDFS is to host the archive that contains the Spark executor, the framework that runs on the Spark nodes and executes the tasks. This archive will be downloaded by each Mesos slave in order to assume the role of a Spark node. We’ll see the details of how this is configured when we get to the configuration of Spark.

4.2 The Docker images

4.2.1 The Hadoop base image

Let’s start with the Dockerfile:

FROM acme/base_ubuntu_14_04
RUN apt-get update -y
RUN apt-get install -y wget ssh rsync openjdk-7-jre
#current version: hadoop-2.7.1
COPY hadoop.tar.gz /opt/hadoop.tar.gz
RUN cd /opt \
&& tar -xzf hadoop.tar.gz \
&& rm hadoop.tar.gz \
&& mv hadoop-* hadoop
ENV JAVA_HOME /usr/lib/jvm/java-1.7.0-openjdk-amd64
ENV HADOOP_HOME /opt/hadoop
RUN useradd hadoop
RUN echo 'hadoop:password' | chpasswd
RUN mkdir /home/hadoop
RUN chown -R hadoop:hadoop /opt/hadoop
RUN chown -R hadoop:hadoop /home/hadoop
COPY core-site.xml /opt/hadoop/etc/hadoop/core-site.xml
COPY hdfs-site.xml /opt/hadoop/etc/hadoop/hdfs-site.xml
COPY bootstrap.sh /opt/bootstrap.sh
#Add this layer's bootstrap to /etc/bootstrap.sh
RUN cat /opt/bootstrap.sh >> /etc/bootstrap.sh
RUN rm /opt/bootstrap.sh

We need to install ssh, rsync, the JVM (OpenJDK 7), and finally Hadoop 2.7.1. We also need to set the JAVA_HOME and HADOOP_HOME variables, create an unprivileged user, and make it the owner of the Hadoop installation directory, because we can’t run Hadoop as the root user (which is the default for Docker containers).

As can be seen in the Dockerfile, we need to supply a couple of configuration files in order to run Hadoop. The first one is hdfs-site.xml (I skipped the initial boilerplate):

  <!-- Don't do reverse DNS checks (for the test cluster) -->

What’s important to note in this file is that:

  • we’re storing the Hadoop data files in /home/hadoop so we had to create this folder in the Dockerfile
  • I had to set ip-hostname-check to false, otherwise Hadoop would attempt to resolve the datanodes’ addresses to hostnames (and fail since we didn’t bother with DNS)
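Since only a fragment of the original listing survives here, the two points above might translate into an hdfs-site.xml roughly like the following (the property values and the exact data path are my assumptions):

```xml
<configuration>
  <!-- Don't do reverse DNS checks (for the test cluster) -->
  <property>
    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
    <value>false</value>
  </property>
  <!-- Keep the HDFS data under /home/hadoop (assumed path) -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/data</value>
  </property>
</configuration>
```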

The second configuration file is core-site.xml:


In this file, the string _HADOOP_NAMENODE_IP is a placeholder, which will be replaced with the IP address of the namenode in the bootstrap script:

sed -i "s/_HADOOP_NAMENODE_IP/$_HADOOP_NAMENODE_IP/g" /opt/hadoop/etc/hadoop/core-site.xml

The _HADOOP_NAMENODE_IP env var must be passed in when calling docker run on the namenode or datanode images; in this scenario it will be the IP of the Docker host.
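To make the substitution concrete, here is a small stand-alone demo of the placeholder replacement (the fs.defaultFS value and the port 9000 are my assumptions about the trimmed core-site.xml listing):

```shell
# Demo: replace the _HADOOP_NAMENODE_IP placeholder the way the bootstrap script does
cat > /tmp/core-site.xml <<'EOF'
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://_HADOOP_NAMENODE_IP:9000</value>
</property>
EOF
_HADOOP_NAMENODE_IP=192.168.59.103
sed -i "s/_HADOOP_NAMENODE_IP/$_HADOOP_NAMENODE_IP/g" /tmp/core-site.xml
cat /tmp/core-site.xml   # the value now reads hdfs://192.168.59.103:9000
```

Note that the first occurrence of _HADOOP_NAMENODE_IP in the sed expression is a literal pattern, while the second, prefixed with $, is expanded by the shell before sed runs.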

4.2.2 The Hadoop namenode image

The Dockerfile for the namenode should look familiar by now:

FROM acme/hadoop_base
COPY bootstrap.sh /opt/bootstrap.sh
#Add this layer's bootstrap to /etc/bootstrap.sh
RUN cat /opt/bootstrap.sh >> /etc/bootstrap.sh
RUN rm /opt/bootstrap.sh

In bootstrap.sh we provide the commands to format HDFS and start the namenode:

$HADOOP_HOME/bin/hdfs namenode -format spark_test_cluster
$HADOOP_HOME/sbin/hadoop-daemon.sh start namenode
while true; do sleep 1000; done

The reason for the last line in the script was given in the first part of the tutorial.

Lastly, here are the contents of the docker run file:

docker run -d -u hadoop -e "USER=hadoop" --name "hadoop_namenode" \
--net=host -e "_HADOOP_NAMENODE_IP=${_HADOOP_NAMENODE_IP}" acme/hadoop_namenode

Specific to this command are the options that set the user to “hadoop” and also the USER environment variable (which doesn’t happen automatically, as I had assumed). We also need to set the _HADOOP_NAMENODE_IP variable.

4.2.3 The Hadoop datanode image

The datanode’s Dockerfile is identical to the namenode’s.

In bootstrap.sh we start the datanode and copy the Spark archive to HDFS. This is the Spark tarball that you can download from the Spark project’s site; I just removed the version from the file name since I’m experimenting with different versions.

$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
/opt/hadoop/bin/hadoop fs -put /volume/spark.tgz /spark.tgz
while true; do sleep 1000; done

Finally, here is the docker run command:

docker run -d -u hadoop -e USER=hadoop --name "hadoop_datanode1" \
-v "$_DOCKER_SHARED_DIR/spark:/volume" --net="host" \
-e "_HADOOP_NAMENODE_IP=${_HADOOP_NAMENODE_IP}" acme/hadoop_datanode

What’s notable about this command is that we’re mounting a folder into the container (as /volume) and that we’re again using the host’s network stack. In my initial version of the run script I went with the default network configuration; that worked fine on a single-machine pseudo-cluster, but it didn’t work when I switched to a physical cluster.

4.3 Testing

If you’ve launched the Hadoop namenode and datanode containers, you can open the Hadoop web UI on the Docker host on port 50070 to verify that the nodes are up and that the Spark tarball was copied to HDFS.

With this second part of the tutorial completed, we have everything in place to install and run Spark on our virtualized cluster. That will be the subject of the third part of this series.