Setting up a standalone Apache Spark cluster of Raspberry Pi 2

In the previous post I explained how to install Apache Spark in “local” mode on a Raspberry Pi 2. In this post I will explain how to link together a collection of such nodes into a standalone Apache Spark cluster. Here, “standalone” refers to the fact that Spark is managing the cluster itself, and that it is not running on top of Hadoop or some other cluster management solution.

I will assume at least two Raspberry Pi 2 nodes on the same local network, with identical Spark distributions installed in the same directory of the same user account on each node. See the previous post for instructions on how to do this. I will use two such nodes, with Spark installed under the user account spark.

First, you must decide which of your nodes is to be the master. I have two nodes, raspi08 and raspi09, and I will set up raspi08 as the master. The spark account on the master needs to be able to SSH into the same account on every node that will run a worker (including the master itself, if it is also going to run a worker) without being prompted for a password, so it makes sense to begin by setting up passwordless SSH. Log in to the spark account on the master and generate SSH keys with:

ssh-keygen

Just press return when asked for a passphrase to keep the key passphrase-free. Copy the identity to each node that will run a worker. eg. to copy it to raspi09 I use:

ssh-copy-id spark@raspi09

You will obviously need to provide a password at this point. Once the identity is copied, SSH into each node to ensure that you can indeed connect without the need for a password.
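
eg. from the master, the following should drop you straight into a shell on raspi09 without prompting for anything (Ctrl-D or exit brings you back):

ssh spark@raspi09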

Once passwordless SSH is set up and working, log into the master node to configure the Spark installation. Within the conf/ directory of the Spark installation, create a file called slaves and list all of the nodes you want to have as “workers”, one per line. eg. mine looks like:

raspi08
raspi09

Note that in my case raspi08 is listed as a worker despite also acting as the master. That is perfectly fine. If you have plenty of nodes you might not want to do this, as the Pi 2 doesn’t really have enough RAM to run both roles well, but since I have only two nodes it seems like a good trade-off here.

Also within the conf/ directory, take a copy of the environment template:

cp spark-env.sh.template spark-env.sh

and then edit spark-env.sh according to your needs. I solved some obscure Akka errors about workers not being able to connect back to the master by hardcoding the IP address of the master node into the config file:

SPARK_MASTER_IP=192.168.1.177

You can find out the IP address of your node by running ifconfig. You should also set the memory that can be used by each worker node. This is a bit tricky, as RAM is tight on the Pi 2. I went for 512MB, leaving nearly half a gig for the OS.

SPARK_WORKER_MEMORY=512m
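
There are plenty of other settings in the template (SPARK_WORKER_CORES, for example, if you ever want a worker to use fewer than all four cores), but these two were all I needed, so the uncommented part of my spark-env.sh ends up as just:

SPARK_MASTER_IP=192.168.1.177
SPARK_WORKER_MEMORY=512m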

Once you are done, the edited spark-env.sh (but not the slaves list) needs to be copied into the conf/ directory on each worker. eg.

scp spark-env.sh spark@raspi09:spark-1.3.0-bin-hadoop2.4/conf/

You shouldn’t need to supply a password…
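
With more than a couple of workers this gets tedious, so you could loop over the slaves file instead. A rough sketch, run from within the conf/ directory, assuming the same install path on every node and skipping the master’s own entry:

for host in $(grep -v "$(hostname)" slaves); do
  scp spark-env.sh spark@"$host":spark-1.3.0-bin-hadoop2.4/conf/
done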

At this point, you should be ready to bring up the cluster. You can bring up the master and workers all in one go with:

sbin/start-all.sh
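
If you want a quick command-line check that the daemons actually started, jps (which ships with the JDK) should list both a Master and a Worker process on the master node, and a Worker on each of the other nodes:

jps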

When the master comes up, it starts a web service on port 8080. eg. I connect to it from any machine on the local network by pointing my browser at: http://raspi08:8080/

If a web page comes up, then the master is running. The page should give other diagnostic information, including a list of workers that have been brought up and are registered with the master. You can also access a lot of debugging info. If everything seems to look OK, make a note of the URL for the Spark master, which is displayed in large text at the top of the page. It should just be spark://192.168.1.177:7077 where the IP address is the IP address of your master node.

Try bringing up a spark shell on the master with:

bin/spark-shell --master spark://192.168.1.177:7077

Once the shell comes up, go back to your web browser and refresh the page to see the connection. Go back to the shell and try a simple test like:

sc.textFile("README.md").count
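
That works because README.md sits in the same place inside the Spark directory on both of my nodes (more on accessing files from the cluster below). For a test that doesn’t depend on any files at all, counting the even numbers in a parallelised range also does the job, and should come back with 5000:

sc.parallelize(1 to 10000).filter(_ % 2 == 0).count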

As usual, it is Ctrl-D to exit the shell. To bring down the cluster, use

sbin/stop-all.sh

Once you are happy that everything is working as it should, you probably want to reduce the amount of diagnostic debugging info that is echoed to the console. Do this by going back into the conf/ directory and copying the log4j template:

cp log4j.properties.template log4j.properties

and then editing log4j.properties. There is a line near the beginning of the file:

log4j.rootCategory=INFO, console

Change INFO to WARN so it reads:

log4j.rootCategory=WARN, console
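
If you prefer to make that edit from the command line (still in the conf/ directory), a sed one-liner along these lines should do it:

sed -i 's/rootCategory=INFO/rootCategory=WARN/' log4j.properties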

Then when you next bring up the cluster, everything should be a bit less noisy.

That’s it. You’ve built a Spark cluster! Note that when accessing files from Spark scripts (and applications) it is assumed that the file exists in the same directory on every worker node. For testing purposes, it is easy enough to use scp before running the script to copy the files to all of the workers. But that is obviously somewhat unsatisfactory in the long term.

Another possibility is to set up an NFS file server and mount it at the same mount point on each worker. Then make sure that any files you access are shared via the NFS file server. Even that solution isn’t totally satisfactory, due to the slow interconnect on the Pi 2.

Ultimately, it would be better to set up a proper distributed file system such as Hadoop’s HDFS on your cluster and then share files via HDFS. That is how most production Spark clusters are set up. I may look at that in another post, but in the meantime check out the Spark standalone documentation for further information.

Installing Apache Spark on a Raspberry Pi 2

In this post I will walk through how to install and run Apache Spark on a Raspberry Pi 2. In the next post I will show how to make a Spark cluster from multiple (two!) Raspberry Pi 2. Very little of this is actually specific to the Raspberry Pi – most of this post will apply to any Linux system with a recent Java JVM (7 or 8) and Python 2.x installed. Also note that “installation” of Spark is really just unpacking a tarball – it doesn’t need a “system-wide” installation or root access, so it’s not a big deal – you can just trash it later if you decide you don’t want it.
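
If you are working on some other Linux box and want to check what you already have, the usual version checks will tell you:

java -version
python --version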

Apache Spark is a Scala-based framework for the analysis of “big data”. Spark clusters are used for a huge range of tasks in big data analytics. It is something that every (aspiring) data scientist needs to know about. Spark runs fine on a (cluster of) Raspberry Pi 2. It does not run well on any Pi prior to the Raspberry Pi 2 – Spark really needs at least a gig of RAM and multiple cores to be useful. Even a cluster of Raspberry Pi 2 will struggle for “real” big data applications due to its limited RAM per node and slow networking. That said, the Pi 2 provides a simple and cheap way of learning about how this powerful technology works.

Raspbian provides an ideal base on which to install Spark – it comes with Oracle’s ARM-optimised JDK8 and Python 2.7.x pre-installed on the default image. So, starting from a Raspbian install, go straight to the Spark downloads page and download a pre-built package. It doesn’t really matter exactly which version, but it should be a 1.3.x release, and probably pre-built for a recent Hadoop. eg. spark-1.3.0-bin-hadoop2.4.tgz, which is the version I used.
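
If you would rather fetch the tarball from the command line on the Pi itself, releases are also kept in the Apache archive. For the version I used, something like this should work (check the downloads page if the URL has moved):

wget https://archive.apache.org/dist/spark/spark-1.3.0/spark-1.3.0-bin-hadoop2.4.tgz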

It is possible to install and run Spark from any user account, including the default “pi” account. This will be fine if you only have one Pi 2 and you are just going to run Spark on a single node in “local” mode. However, if you have multiple Pi 2, and you want to build any kind of Spark cluster, then it is really much better to create a dedicated user account (with the same name) on each Pi. With that in mind, I recommend creating a new user account. eg.

sudo adduser spark

Once created, log out and back in to the new user account, and then unpack Spark into the new home directory:

tar xvfz ~pi/spark-1.3.0-bin-hadoop2.4.tgz
cd spark-1.3.0-bin-hadoop2.4

Obviously, update the path to wherever you downloaded the tarball. Basically, that’s it! Test the installation with:

bin/run-example SparkPi 10
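
The line you are after contains “Pi is roughly”, so if you don’t fancy scrolling through the log output, you can just filter for it:

bin/run-example SparkPi 10 2>&1 | grep "Pi is roughly"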

This will churn away for several seconds dumping tons of junk to the console, but buried among it should be a (poor) approximation to π. Assuming that works, next test that the Spark shell works:

bin/spark-shell --master local[4]

This will start the shell in local mode with 4 cores. Try entering:

sc.textFile("README.md").count

at the prompt to do a simple file line count. Note that Spark starts a diagnostic web server on port 4040, so while the shell is running, point a web browser at it. eg. I connect to http://raspi08.home:4040/ from any machine on my home network. This is the Spark UI, which is very useful for debugging purposes.

When you are done, Ctrl-D to exit the shell. The dumping of logging information to the console is pretty annoying, but it’s possible to tone it down. I’ll explain how in the next post – I recommend putting up with it until your cluster is set up and working properly, as it does contain a lot of very useful debugging info.

Next, try the PySpark shell:

bin/pyspark --master local[4]

The Python test code is a bit more verbose than the Scala version 😉

sc.textFile("README.md").count()
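
If you want a check that doesn’t depend on README.md being readable from the current directory, a quick in-memory test also works, eg. counting the elements of a parallelised list:

sc.parallelize(range(1000)).count()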

Again, Ctrl-D to exit.

So, that is Spark installed and working in “local” mode on a single node. In the next post I’ll show how to create a simple Spark “standalone” cluster using two Raspberry Pi 2. For this, I will start by assuming that Spark has been installed under the same user account in the same directory on every Pi 2. For further information about Spark, start with the official Spark documentation overview.

Useful links

Below are some links I found useful in preparing this post (and the next one):

https://spark.apache.org/docs/latest/spark-standalone.html

https://docs.sigmoidanalytics.com/index.php/Installing_Spark_and_Setting_Up_Your_Cluster

http://mbonaci.github.io/mbo-spark/

http://jugnu-life.blogspot.com/2013/08/spark-standalone-mode-installation-steps.html

http://ju-jutsu.com/wp/?p=550