This is the first in a series of posts in which I describe the steps I had to take in order to run a prototype Spark cluster on my development machine (which happens to be a MacBook). I will cover first how to use Docker to create a virtualized cluster infrastructure, then how to install Mesos, Spark, Hadoop and Cassandra and how to test that the ensemble works as expected.

1. Docker

1.1 Overview

Since we’ll be using Docker to create our virtualized infrastructure, this section will give a brief overview of this technology for those who aren’t yet familiar with it.

The core concept of Docker is the container, which can be viewed as a lightweight virtual machine. A container gives us the experience of a virtualized Linux machine where we can install and run software, but without virtualizing an entire operating system: containers share the host's kernel and other resources, and only certain OS capabilities, such as the file system or the network interfaces, are actually virtualized. The major benefit of this reuse is that, on a single host, we can run significantly more Docker containers in parallel than VMs, which in turn means that those containers can be made more granular and composable.

Figure 1: Implementation of virtual machines vs Docker containers

In Docker terminology, a container is in fact a running instance of a Docker image, which corresponds to what we usually call a virtual machine image. An image is usually built from a base image, either one we created previously or one downloaded from a repository, which we then customize by adding software and altering configuration options. The best practice when creating Docker images is to use a Dockerfile to describe the steps that need to be performed to create the image. To those familiar with Vagrant, a Dockerfile is a very similar concept to a Vagrantfile.

1.2 Specificities

Docker has a few specificities and limitations that we must be aware of:

  • there is a single entry point (i.e. initial running process) in a container, although this can be a script that launches multiple other processes. 
  • containers are allocated a random IP address on every run. The quick workaround, if we require static IP addresses, is to configure the container to use the host’s network stack, which ensures that the container has the same IP address as the host.
  • Docker doesn’t run natively on Mac OS or Windows since it’s based on Linux containers. To get around this limitation we can use boot2docker, which runs Docker containers inside a lightweight VirtualBox Linux VM. After launching boot2docker you can use the docker commands seamlessly from your host’s command line terminal (a short setup sketch follows this list).
  • with release 1.8, boot2docker was deprecated in favor of docker-machine. I recently tried docker-machine and, although I didn’t have any problem initially, when I attempted to test that the Spark cluster still worked the test failed. I will explain the reason why this happened in the appropriate section (and I think it’s just a configuration issue), but I do want to make you aware that it happened and I reverted to using boot2docker.
  • one issue I ran into was that my boot2docker VM ran out of memory when running too many containers in parallel. I followed the steps described in this post to resize my VM: https://coderwall.com/p/w-zk9w/increase-boot2docker-space-allocation-in-osx 
  • update 1: if you install Docker on Ubuntu, follow the instructions on the Docker site instead of using apt-get (e.g. apt-get install docker.io). I ran into issues when building images after installing the docker.io package.
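For reference, here is a minimal sketch of this setup (the exact boot2docker commands may vary slightly between versions, and my_company/some_service is a made-up image name):

# initialize and start the boot2docker VM, then point the docker client at it
boot2docker init
boot2docker up
eval "$(boot2docker shellinit)"

# run a container on the host's network stack so it shares the host's IP address
docker run -d --net=host my_company/some_service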

1.3 Commands

Docker comes with a command line tool called (you guessed it) docker, which is used to create images, start and stop containers etc. Here is a shortlist of the most important Docker commands and some of the arguments they support:

1. docker build [-t <tag>] <path>

This command builds a Docker image based on the Dockerfile located at the specified path. The convention for the tag is to specify a namespace and an image name separated by a slash (e.g. my_company/spark). The tags you specify when building an image are used in the Dockerfiles of derived images (in the FROM directive).
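For example, assuming the current directory contains a Dockerfile (my_company/spark is just an illustrative tag):

# build an image from the Dockerfile in the current directory and tag it
docker build -t my_company/spark .

A derived image’s Dockerfile can then reference this tag in its FROM directive:

FROM my_company/spark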

2. docker ps [-a]

The ps command shows the containers that are currently running. The -a argument shows all containers, whether running or not.

3. docker run [-it] [-d] [-e <ENV_VAR>=<value>] [-v <path_on_host>:<path_in_container>] [--rm] [--name <container_name>] 
[--entrypoint <executable>] <image_name> [<command>] [<args>]

This command starts a new container. It takes many parameters, so I listed the ones I use most often; an example follows the list:

  • -i and -t are usually used together when we want to interact with the container (e.g. through a bash shell). I normally use these arguments when I’m “debugging” an image (along with --rm)
  • -d is used to run the container in the background (detached), which basically means that you won’t see any output from the container. I use this option once I’ve figured out the container’s configuration and no longer need console access to it
  • --rm is used to automatically remove the container after it exits and is mutually exclusive with -d
  • -e is used to set up environment variables in the container
  • -v is used to mount a local directory (if you’re using boot2docker, note that this is a directory on your machine and not on the boot2docker VM) as a directory in the container
  • --name is used to set a name for the container which can be used subsequently in other docker commands (else a name is randomly generated)
  • after the argument list we need to specify the name of the Docker image that we want to run
  • optionally we can specify a command to run (such as bash). Note that if an entry point was specified for the container, all parameters following the name of the image are passed in as arguments for the entry point
  • finally we can pass in some arguments for the container’s command or entry point.
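Putting several of these arguments together, here is a sketch of a typical invocation (the container name, environment variable, paths and image name are all made up for illustration):

# run detached, with a fixed name, an environment variable and a mounted directory
docker run -d --name spark_master \
  -e SPARK_MASTER_PORT=7077 \
  -v /Users/me/docker-shared:/data \
  my_company/spark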
4. docker stop <container_name>

This command stops a running container.

5. docker exec [-it] <container_name> <command>

This command allows us to run commands in a running container. I usually use it to run a bash shell.
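For instance, to open a shell in a container that was started with --name spark_master (a made-up name):

docker exec -it spark_master bash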

6. docker start <container_name>

This command is used to start a stopped container (whereas ‘docker run’ creates a container). This is useful when we want to preserve the state of a container from one run to the next. 
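For example, to stop a container and later resume it with its file system state intact (again with a made-up name):

docker stop spark_master
docker start spark_master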

There’s obviously a lot more to Docker than I could squeeze into this tutorial; the official documentation at https://docs.docker.com is a good place to dig deeper.

1.4 Conventions

Here are some conventions that I adopted while working on this prototype: 

a) I’m using an Ubuntu 14.04 image as the base for all my images (downloaded from the Docker repository)

b) in terms of bootstrapping the containers, I’m using a script called /etc/bootstrap.sh as the entry point, and each new derived image appends to this bootstrap file (I discovered this method while examining SequenceIQ‘s images on the Docker repository)

c) for the services that have more than one instance type (e.g. Mesos, which has master and slave instances), I usually create a base image which contains the artifacts and then one image per instance type derived from it. The base image can be compared to a base class in OOP terminology, from which we derive concrete classes that are meant to be instantiated. I also created an overall base image for all my images which contains the (empty) bootstrap script and defines it as the entry point (see the sketch below)
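As a minimal sketch of this layering (the tags are hypothetical and the mesos-master invocation is purely illustrative):

# Dockerfile of the overall base image (tagged, say, my_company/base)
FROM ubuntu:14.04
RUN echo '#!/bin/bash' > /etc/bootstrap.sh && chmod +x /etc/bootstrap.sh
ENTRYPOINT ["/etc/bootstrap.sh"]

# Dockerfile of a derived image (tagged, say, my_company/mesos_master),
# which appends its startup command to the shared bootstrap script
FROM my_company/base
RUN echo 'mesos-master --ip=$MESOS_MASTER_IP &' >> /etc/bootstrap.sh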

d) for each image I also usually create a run.sh script to capture the docker run command’s arguments. So, to summarize, for each image I have a dedicated directory containing the Dockerfile, the bootstrap script, any files that need to be inserted in the image (as we’ll see later) and a run.sh file. Except for the Dockerfile, all of these files are optional, and unlike the others, the run.sh file may be placed elsewhere.
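Such a run.sh could look roughly like this (the container name, variable and image tag are placeholders matching the sketch above):

#!/bin/bash
# capture the docker run invocation for this image so it's reproducible
docker run -d --name mesos_master \
  -e MESOS_MASTER_IP=$_MESOS_MASTER_IP \
  my_company/mesos_master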

e) some of the services required for this prototype need the IP address or the hostname of another service as arguments. Since I’m trying to keep the environment as simple as possible, and I don’t want to run a DNS server or update the hosts file on several containers after each restart, I decided to have all the services that need to be exposed use the host’s IP address and hostname (which are those of the boot2docker VM if you use it). I will explain how this is done when we encounter the first such case.

f) tied to the previous item, I defined a few environment variables on my physical machine to simplify the configuration of the containers, such as _DOCKER_IP, which is the IP address part of DOCKER_HOST, and _DOCKER_SHARED_DIRS, which is a path to the directories I mount in Docker containers (DOCKER_HOST is an environment variable that you must have set in order to use the docker client). I’m using the underscore prefix because I once chose the same name for a variable that Mesos uses and lost hours before I figured out what was wrong.

update 2: After I started to deploy the containers on more than one machine, using the _DOCKER_IP variable obviously didn’t make sense anymore, as there was more than one Docker host. Instead I started defining a variable for each service, such as _MESOS_MASTER_IP or _HADOOP_NAMENODE_IP. For the purposes of running the cluster on your development machine both approaches work. I decided however to use the second one in the rest of this tutorial, as it’s more descriptive and prepares the way for deployment on a physical cluster.
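As an illustration of this convention, I keep definitions along these lines in my shell profile (the addresses are placeholders; on a single development machine they all point at the boot2docker VM’s IP, which boot2docker ip reports):

# per-service addresses, all identical on a single-host setup
export _MESOS_MASTER_IP=192.168.59.103
export _HADOOP_NAMENODE_IP=192.168.59.103
export _DOCKER_SHARED_DIRS=/Users/me/docker-shared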

g) I installed all software in the /opt folder of my images for consistency (when not using apt).

h) Docker containers stop when the entry point process exits. Some of my containers exited immediately even though the process launched from my bootstrap script was still running; I presume this happens because the process runs in the background, which causes the script itself, and thus the entry point, to exit. A trick that I found in SequenceIQ’s scripts that prevents this premature exit is to add the following snippet at the end of the bootstrap script:

while true; do sleep 1000; done
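Putting convention b) and this trick together, a derived image’s bootstrap script might end up looking like this (the mesos-slave invocation is just an illustration):

#!/bin/bash
# launch the service in the background...
mesos-slave --master=$MESOS_MASTER_IP:5050 &
# ...and keep the entry point alive so the container doesn't exit
while true; do sleep 1000; done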

i) there are two ways of installing software in Docker images (in addition to apt-get): either download the package as a step of building the image, or download the package on your machine and use the Dockerfile COPY command to place it in the image. I tend to prefer the second option, although it requires placing the files in the Dockerfile’s directory, and they can’t even be soft links. The benefit is that if you need to rebuild the image often, you only have to download the package once.
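Here are the two approaches side by side as Dockerfile snippets (the URL and file names are placeholders):

# option 1: download while building the image
RUN wget -q http://example.com/spark-1.4.1.tgz -O /opt/spark.tgz

# option 2: download once to the Dockerfile's directory, then copy it in
COPY spark-1.4.1.tgz /opt/spark.tgz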


With these technicalities about Docker out of the way, we can start building our cluster. In the second part of this tutorial I will cover the creation of the Mesos and Hadoop images, then we’ll run the containers and make sure that everything is running as expected.
