## Setting up a standalone Apache Spark cluster of Raspberry Pi 2

In the previous post I explained how to install Apache Spark in “local” mode on a Raspberry Pi 2. In this post I will explain how to link together a collection of such nodes into a standalone Apache Spark cluster. Here, “standalone” refers to the fact that Spark is managing the cluster itself, and that it is not running on top of Hadoop or some other cluster management solution.

I will assume at least two Raspberry Pi 2 nodes on the same local network, with identical Spark distributions installed in the same directory of the same user account on each node. See the previous post for instructions on how to do this. I will use two such nodes, with Spark installed under the user account spark.

First, you must decide on one of your nodes to be the master. I have two nodes, raspi08 and raspi09. I will set up raspi08 as the master. The spark account on the master needs to be able to SSH into the same account on all of the other nodes without the need to provide a password, so it makes sense to begin by setting up passwordless SSH. Log in to the spark account on the master and generate SSH keys with:

ssh-keygen


Just press return when asked for a password to keep it password free. Copy the identity to each other node. eg. to copy it to raspi09 I use:

ssh-copy-id spark@raspi09


You will obviously need to provide a password at this point. Once the identity is copied, SSH into each node to ensure that you can indeed connect without the need for a password.

Once passwordless SSH is set up and working, log into the master node to configure the Spark installation. Within the conf/ directory of the Spark installation, create a file called slaves and enter a list of all nodes you want to have as “workers”. eg. mine looks like:

raspi08
raspi09


Note that in my case raspi08 is listed as a worker despite also acting as master. That is perfectly possible. If you have plenty of nodes you might not want to do this, as the Pi 2 doesn’t really have quite enough RAM to be able to do this well, but since I have only two nodes, it seems like a good idea here.

Also within the conf/ directory, take a copy of the environment template:

cp spark-env.sh.template spark-env.sh


and then edit spark-env.sh according to your needs. I solved some obtuse Akka errors about workers not being able to connect back to the master by hardcoding the IP address of the Master node into the config file:

SPARK_MASTER_IP=192.168.1.177


You can find out the IP address of your node by running ifconfig. You should also set the memory that can be used by each worker node. This is a bit tricky, as RAM is a bit tight on the Pi 2. I went for 512MB, leaving nearly half a gig for the OS.

SPARK_WORKER_MEMORY=512m


Once you are done, the environment template (but not the slaves list) needs to be copied to each worker. eg.

scp spark-env.sh @raspi09:spark-1.3.0-bin-hadoop2.4/conf/


You shouldn’t need to supply a password…

At this point, you should be ready to bring up the cluster. You can bring up the master and workers all in one go with:

sbin/start-all.sh


When the master comes up, it starts a web service on port 8080. eg. I connect to it from any machine on the local network by pointing my browser at: http://raspi08:8080/

If a web page comes up, then the master is running. The page should give other diagnostic information, including a list of workers that have been brought up and are registered with the master. You can also access a lot of debugging info. If everything seems to look OK, make a note of the URL for the Spark master, which is displayed in large text at the top of the page. It should just be spark://192.168.1.177:7077 where the IP address is the IP address of your master node.

Try bringing up a spark shell on the master with:

bin/spark-shell --master spark://192.168.1.177:7077


Once the shell comes up, go back to your web browser and refresh the page to see the connection. Go back to the shell and try a simple test like:

sc.textFile("README.md").count


As usual, it is Ctrl-D to exit the shell. To bring down the cluster, use

sbin/stop-all.sh


Once you are happy that everything is working as it should, you probably want to reduce the amount of diagnostic debugging info that is echoed to the console. Do this by going back into the conf/ directory and copying the log4j template:

cp log4j.properties.template log4j.properties


and then editing log4j.properties. There is a line near the beginning of the file:

log4j.rootCategory=INFO, console


Change INFO to WARN so it reads:

log4j.rootCategory=WARN, console


Then when you next bring up the cluster, everything should be a bit less noisy.

That’s it. You’ve built a Spark cluster! Note that when accessing files from Spark scripts (and applications) it is assumed that the file exists in the same directory on every worker node. For testing purposes, it is easy enough to use scp before running the script to copy the files to all of the workers. But that is obviously somewhat unsatisfactory in the long term. Another possibility is to set up an NFS file server and mount it at the same mount point on each worker. Then make sure that any files you access are shared via the NFS file server. Even that solution isn’t totally satisfactory, due to the slow interconnect on the Pi 2. Ultimately, it would be better to set up a proper distributed file system such as Hadoop’s HDFS on your cluster and then share files via HDFS. That is how most production Spark clusters are set up. I may look at that in another post, but in the meantime check out the Spark standalone documentation for further information.

## Installing Apache Spark on a Raspberry Pi 2

In this post I will walk through how to install and run Apache Spark on a Raspberry Pi 2. In the next post I will show how to make a Spark cluster from multiple (two!) Raspberry Pi 2. Very little of this is actually specific to the Raspberry Pi – most of this post will apply to any Linux system with a recent Java JVM (7 or 8) and Python 2.x installed. Also note that “installation” of Spark is really just unpacking tarball – it doesn’t need a “system-wide” installation or root access, so it’s not a big deal – you can just trash it later if you decide you don’t want it.

Apache Spark is a Scala library for the analysis of “big data”. Spark clusters are used for a huge range of tasks in big data analytics. It is something that every (aspiring) data scientist needs to know about. Spark runs fine on a (cluster of) Raspberry Pi 2. It does not run well on any Pi prior to the Raspberry Pi 2. Spark really needs at least a gig of RAM and multiple cores to be useful. Even a cluster of Raspberry Pi 2 will struggle for “real” big data applications due to it’s limited RAM per node and slow networking. That said, the Pi 2 provides a simple and cheap way of learning about how this powerful technology works.

Raspbian provides an ideal base on which to install Spark – it comes with Oracle’s ARM-optimised JDK8 and Python 2.7.x pre-installed on the default image. So, starting from a Raspbian install, go straight to the Spark downloads page and download a pre-built package. It doesn’t really matter exactly which version, but it should be a 1.3.x release, and probably pre-built for a recent Hadoop. eg. spark-1.3.0-bin-hadoop2.4.tgz, which is the version I used.

It is possible to install and run Spark from any user account, including the default “pi” account. This will be fine if you only have one Pi 2 and you are just going to run Spark on a single node in “local” mode. However, if you have multiple Pi 2, and you want to build any kind of Spark cluster, then it is really much better to create a dedicated user account (with the same name) on each Pi. With that in mind, I recommend creating a new user account. eg.

sudo adduser spark


Once created, log out and back in to the new user account, and then unpack Spark into the new home directory:

tar xvfz ~pi/spark-1.3.0-bin-hadoop2.4.tgz


Obviously, update the path to wherever you downloaded the tarball. Basically, that’s it! Test the installation with:

bin/run-example SparkPi 10


This will churn away for several seconds dumping tons of junk to the console, but buried among it should be a (poor) approximation to $\pi$. Assuming that works, next test that the Spark shell works:

bin/spark-shell --master local[4]


This will start the shell in local mode with 4 cores. Try entering:

sc.textFile("README.md").count


at the prompt to do a simple file line count. Note that Spark starts a diagnostic web server on port 4040, so while the shell is running, point a web browser at it. eg. I connect to http://raspi08.home:4040/ from any machine on my home network. This is the Spark UI, which is very useful for debugging purposes.

When you are done, Ctrl-D to exit the shell. The dumping of logging information to the console is pretty annoying. It’s possible to turn it off. I’ll explain how to do it in the next post – I recommend putting up with it until your cluster is set up and working properly, as it does contain a lot of very useful debugging info.

Next, try the pySpark shell:

bin/pyspark --master local[4]


The python test code is a bit more verbose than the Scala version 😉

sc.textFile("README.md").count()


Again, Ctrl-D to exit.

So, that is Spark installed and working in “local” mode on a single node. In the next post I’ll show how to create a simple Spark “standalone” cluster using two Raspbery Pi 2. For this, I will start by assuming that Spark has been installed under the same user account in the same directory on every Pi 2. For further information about Spark, start with the official Spark documentation overview.

Below are some links I found useful in preparing this post (and the next one):

https://spark.apache.org/docs/latest/spark-standalone.html

https://docs.sigmoidanalytics.com/index.php/Installing_Spark_and_Setting_Up_Your_Cluster

http://mbonaci.github.io/mbo-spark/

http://jugnu-life.blogspot.com/2013/08/spark-standalone-mode-installation-steps.html

http://ju-jutsu.com/wp/?p=550

## Benchmarking MCMC codes on the Raspberry Pi 2

### Introduction

In the previous post I looked at running some MCMC codes in C and Scala on the Parallella. In that post I explained how the Parallella was significantly faster than the Raspberry Pi, and how it represented better “bang for buck” than the Raspberry Pi for computationally intensive MCMC codes. However, since that post was written, the Raspberry Pi 2 has been released. This board has a much better processor than the old Pi, and double the RAM, for the same price. This changes things, considerably. The processor is an ARM7 quad core. Each core is around twice as fast as the single core on the original Pi, and there are 4 of them. In this post I will re-run the codes from the previous post and compare against the Parallella.

### Gibbs sampler in C

I’m using the new Raspbian image for the Pi 2. This includes gcc by default, but not the GSL library. This can be installed with sudo apt-get install libgsl0-dev. Then the file gibbs.c from the previous post will compile and run. On the Pi 2 this runs in around 75 seconds – very similar to the time on the Parallella, and around twice as fast as all of the previous Raspberry Pis.

### Gibbs sampler in Scala

The Raspbian image ships with Oracle’s fast and efficient ARM-optimised JVM by default, so there’s no issue with installing Java at all. As usual, installing “sbt” is a simple matter of copying the launcher script (and jar) into your ~/bin directory. Then the Scala version of the Gibbs sampler can be run with a command like time sbt run > /dev/null. Again, it runs in around 4 minutes 40 seconds, just like on the Parallella. So, the ARM cores on the Parallella and the Pi 2 have very similar performance. However, the Parallella ARM chip has just two cores, whereas the Pi 2 is quad core.

### Parallel Monte Carlo in Scala

Again, as for the previous post, I next ran the Monte Carlo example from this github repo. This gives output like:

$sbt run [info] Set current project to monte-carlo (in build file:/home/pi/src/git/statslang-scala/monte-carlo/) [info] Running MonteCarlo Running with 1000000 iterations Idiomatic vectorised solution 1.00136 time: 6768.504487ms Fast efficient (serial) tail call 1.000705 time: 2473.331672ms Parallelised version 1.0007 time: 1391.2828ms Done [success]  Here again we see that the single threaded versions run in a similar time to the Parallella (and around twice as fast as the old Pis), but that the parallelised version runs significantly faster on the Pi 2 than on the Parallella (due to having 4 ARM cores rather than 2). ### Summary For my test MCMC codes, the cores on the Pi 2 are around twice as fast as the single core on the old Raspberry Pis, and a similar speed to the cores on the Parallella. However, multi-threaded codes run faster still, due to there being 4 cores on the Pi 2 (versus 2 on the Parallella and one on the old Pis). Furthermore, the Pi 2 is the same price as the old Pis (which are still being sold), and around a quarter of the price of the cheapest Parallella. So for standard single and multi-threaded codes running on the ARM cores, the Pi 2 wins hands down in terms of “bang for buck”, and is sufficiently quick and cheap that it starts looking like a credible platform for building cheap clusters for compute-intensive jobs like MCMC. Now to be fair to the Parallella, really the whole point of it is that it has a multi-core Epiphany co-processor that I’ve not been using or factoring in to the discussion at all so far. That said, the Pi 2 is so much cheaper than the Parallella (not to mention, less “fragile”), that I suspect that even for codes which effectively exploit the Epiphany chip it is unlikely that the Parallella will outperform the Pi 2 in terms of “bang for buck”. Now “bang per watt” is another matter entirely, and the Parallella may well outperform the Pi 2 in that regard if efficient use can be made of the Epiphany chip. But development time costs money too, and it’s really not clear that it’s going to be easy for me to run my multi-threaded Scala codes effectively on the Epiphany chip any time soon. So the Pi 2 currently looks like a real winner from my personal perspective. ## MCMC on the Parallella ### Introduction A very, very, very long time ago, I backed an interesting looking Kickstarter project called Parallella. The idea, somewhat inspired by the Raspberry Pi project, was to enable a small hardware company, Adapteva, to create a small (credit card sized), cheap, single board computer. The difference with the Raspberry Pi would be that this computer would have a more powerful CPU, and would also have an on-board FPGA and a 16 (or 64) core co-processor, called the Epiphany. In fact, the whole project is really a way to kick-start the production and development of the Epiphany chip, but that isn’t necessarily the main interest of the backers, many of whom are just geeky hobbyists like myself. The project was funded (I backed it at the$99 level), but then went fairly quiet for around 2 years(!) until I got a message from UK Customs telling me I had to pay import duty on my board… The board arrived earlier this year, but I’ve had lots of problems getting it going, so it’s only now that I have used it enough to warrant writing a blog post.

The tag-line of the Kickstarter project was “A Supercomputer For Everyone“, and that certainly attracted my interest. I do a lot of computationally intensive work, and have dabbled, on-and-off, with parallel, multi-core, and distributed computing for over 15 years. To have a serious parallel computer for not much more than the cost of two Raspberry Pis was just too good an opportunity to pass up on. The Parallella board looks very much like a Raspberry Pi, and is similar to set up and use, but generally a lot more flaky and fiddly – if you are new to single board computers, I would strongly recommend getting up to speed with a Raspberry Pi model B+ first! The Pi is just a bit simpler, easier, more robust, and has a large, well-organised, supportive user community. I had various issues with faulty/incompatible adaptor cables, cheap SD cards, and a home router which refused to give it an IP address… I won’t bore everyone with all of the details, but I am now in a position where I can get it to boot up and run well enough to be able to begin to put the thing through its paces.

There is a recommended Linux distro for the Parallella, based on Linaro, which, like Raspbian, is a Debian derived Linux for ARM. Just like the Pi (B+), you write the OS to a micro SD card and then use that as the boot media. Typing cat /proc/cpuinfo reveals two cores of an ARM (v7) processor. The distro comes with various dev tools pre-installed (git, make, gcc, etc.), as well as some example code to get you started, so it’s actually pretty quick to get going with once the thing boots and runs.

### Benchmarking C code on the ARM chip

To start off with I’m just looking at the dual core ARM processor – I will start and look at the Epiphany chip as time permits. To begin with, I benchmarked the processor using my standard C-based Gibbs sampling script. For completeness, the source code of gibbs.c is given below:

/*
gcc -O4 gibbs.c -lgsl -lgslcblas -lm -o gibbs
time ./gibbs > datac.tab
*/

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

void main()
{
int N=50000;
int thin=1000;
int i,j;
gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
double x=0;
double y=0;
printf("Iter x y\n");
for (i=0;i<N;i++) {
for (j=0;j<thin;j++) {
x=gsl_ran_gamma(r,3.0,1.0/(y*y+4));
y=1.0/(x+1)+gsl_ran_gaussian(r,1.0/sqrt(2*x+2));
}
printf("%d %f %f\n",i,x,y);
}
}


This code requires the GSL, which isn’t installed by default, but a simple sudo apt-get install libgsl0-dev soon fixes that (I love Debian-derived Linux distros). The code runs in about 75 seconds on (a single core of) the Parallella. This is about 10 times slower than my (very fancy) laptop. However, it’s about twice as fast as the Pi (which only has one core), and around the same speed as my (Intel Atom based) netbook. So, given that the Parallella has two cores, this very superficial and limited experiment suggests that the ARM chip on the Parallella is around 4 times as powerful as the ARM chip in the Raspberry Pi. Of course, this is ignoring the FPGA and the Epiphany chip, but even without these, you are arguably getting more bang for your buck (and more importantly, more bang for your Watt) with the Parallella than you do with the Pi (though without the FPGA and Epiphany, there wouldn’t be much in it).

### Scala code

These days I use Scala in preference to C whenever I possibly can. So, I was also interested to see how well Scala runs on this board. Scala is JVM based, and the Parallella distro has the OpenJDK JVM pre-installed, so if you are an sbt person (and if you use Scala you should be), it’s just a matter of copying the sbt launcher to ~/bin and you are good to go. Again, for completeness, a simple (non-idiomatic!) Scala program equivalent to the C version above is given below:

/*
Gibbs2.scala

time sbt run > gibbs.dat
*/

object Gibbs2 {

import java.util.Date
import scala.math.sqrt
import breeze.stats.distributions._

def main(args: Array[String]) {
val N=50000
val thin=1000
var x=0.0
var y=0.0
println("Iter x y")
for (i <- 0 until N) {
for (j <- 0 until thin) {
x=new Gamma(3.0,y*y+4).draw
y=new Gaussian(1.0/(x+1),1.0/sqrt(2*x+2)).draw
}
println(i+" "+x+" "+y)
}
}

}


An appropriate build.sbt file for resolving dependencies is given below:

name := "Gibbs2"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
"org.scalacheck" %% "scalacheck" % "1.11.4" % "test",
"org.scalatest" %% "scalatest" % "2.1.7" % "test",
"org.scalanlp" %% "breeze" % "0.10",
"org.scalanlp" %% "breeze-natives" % "0.10"
)

resolvers ++= Seq(
"Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
"Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)

scalaVersion := "2.11.1"


This Scala version of the code takes around 30 minutes to complete on the Parallella! This is pretty dire, but entirely expected. Scala runs on the JVM, and the OpenJDK JVM is known to perform very poorly on ARM architectures. If you want to run computationally intensive JVM code on ARM, you have no choice but to get the Oracle JDK8 for ARM. Although this isn’t real “free” software, it can be downloaded and used for free for personal use.

#### Using Oracle’s JVM

Download and unpack Oracle’s JVM somewhere sensible on your system. Then update your sbt launcher script to point at the new java binary location. You may also want to add a javaHome line to your sbt build script. Once this is done, using sbt is just as normal, but you get to use Oracle’s pretty fast ARM JVM. Running the above example with the new JVM takes around 4 minutes 40 seconds on my Parallella. This is more than 5 times faster than the OpenJDK JVM, and it is a similar story on the Raspberry Pi as well. But it still isn’t blazingly fast. On Intel this code runs within a factor of 2 of the C version. Here it is closer to a factor of 4 slower. But again, this is very similar to the situation on the Pi. The Oracle JVM is good, but for whatever reason, the ARM JVM doesn’t get as close to the speed of native code as it does on x86.

#### A simple Monte Carlo example in Scala

I recently gave a talk on using Scala for statistical computing, and for that I prepared a talk and some code examples. One of the examples was a simple Monte Carlo example, coded in several different ways, including one version which would automatically exploit multiple cores. There is a github repo containing the slides of the talk and all of the code examples. Running the Monte Carlo example (with 10^6 iterations) using the OpenJDK JVM produces the following results:

$sbt run [info] Set current project to monte-carlo (in build file:/home/linaro/git/statslang-scala/monte-carlo/) [info] Running MonteCarlo Running with 1000000 iterations Idiomatic vectorised solution 0.99498 time: 37922.050389ms Fast efficient (serial) tail call 0.999875 time: 25327.705376ms Parallelised version 0.99997 time: 15633.49492ms Done [success]  Again, using the OpenJDK leads to poor performance, but the important thing to note is that the multi-core version is nearly twice as fast as the single threaded version, due to the Parallel ARM chip having 2 cores. Re-running with the Oracle JVM leads to much better results: $ sbt run
[info] Set current project to monte-carlo (in build file:/home/linaro/git/statslang-scala/monte-carlo/)
[info] Running MonteCarlo
Running with 1000000 iterations
Idiomatic vectorised solution
1.002605
time: 8214.310728ms
Fast efficient (serial) tail call
0.997675
time: 2410.760695ms
Parallelised version
0.999525
time: 1800.324405ms
Done
[success]


Here, the Oracle JVM is almost 10 times faster than the OpenJDK JVM, and again the parallel version runs significantly faster than the serial version.

A Parallella gotcha to be aware of here is that I got weird certificate errors when cloning my github repos that turned out to be the clock. Like the Pi, the Parallella doesn’t have an RTC chip built in, but unlike the Pi, it doesn’t always seem to automatically sync when an internet connection comes up. Manually running ntpdate fixed the date and then git worked fine…

### Summary

The Parallella packs a lot of processing power in a very small, low power form-factor. Even just using the main ARM chip, it is a better starting point for computationally intensive workflows than the Raspberry Pi (provided that you can get it to work!). If you can find a way to exploit the Epiphany chip, things will look even better. Java and other JVM languages (eg. Scala) are fine on the Parallella, but only if you use Oracle’s JVM, otherwise you are going to have very poor performance. My next task is to have a play with the Epiphany examples that are bundled with the Parallella distro. At the moment it seems like the best way to use the Epiphany chip is via C+OpenCL. That involves a degree of masochism I tend not to subject myself to very often these days, so it may have to wait until I’m in an appropriate mood!