In this post I will walk through how to install and run Apache Spark on a Raspberry Pi 2. In the next post I will show how to make a Spark cluster from multiple (two!) Raspberry Pi 2. Very little of this is actually specific to the Raspberry Pi – most of this post will apply to any Linux system with a recent Java JVM (7 or 8) and Python 2.x installed. Also note that “installation” of Spark is really just unpacking tarball – it doesn’t need a “system-wide” installation or root access, so it’s not a big deal – you can just trash it later if you decide you don’t want it.
Apache Spark is a Scala library for the analysis of “big data”. Spark clusters are used for a huge range of tasks in big data analytics. It is something that every (aspiring) data scientist needs to know about. Spark runs fine on a (cluster of) Raspberry Pi 2. It does not run well on any Pi prior to the Raspberry Pi 2. Spark really needs at least a gig of RAM and multiple cores to be useful. Even a cluster of Raspberry Pi 2 will struggle for “real” big data applications due to it’s limited RAM per node and slow networking. That said, the Pi 2 provides a simple and cheap way of learning about how this powerful technology works.
Raspbian provides an ideal base on which to install Spark – it comes with Oracle’s ARM-optimised JDK8 and Python 2.7.x pre-installed on the default image. So, starting from a Raspbian install, go straight to the Spark downloads page and download a pre-built package. It doesn’t really matter exactly which version, but it should be a 1.3.x release, and probably pre-built for a recent Hadoop. eg. spark-1.3.0-bin-hadoop2.4.tgz, which is the version I used.
It is possible to install and run Spark from any user account, including the default “pi” account. This will be fine if you only have one Pi 2 and you are just going to run Spark on a single node in “local” mode. However, if you have multiple Pi 2, and you want to build any kind of Spark cluster, then it is really much better to create a dedicated user account (with the same name) on each Pi. With that in mind, I recommend creating a new user account. eg.
sudo adduser spark
Once created, log out and back in to the new user account, and then unpack Spark into the new home directory:
tar xvfz ~pi/spark-1.3.0-bin-hadoop2.4.tgz cd spark-1.3.0-bin-hadoop2.4
Obviously, update the path to wherever you downloaded the tarball. Basically, that’s it! Test the installation with:
bin/run-example SparkPi 10
This will churn away for several seconds dumping tons of junk to the console, but buried among it should be a (poor) approximation to . Assuming that works, next test that the Spark shell works:
bin/spark-shell --master local
This will start the shell in local mode with 4 cores. Try entering:
at the prompt to do a simple file line count. Note that Spark starts a diagnostic web server on port 4040, so while the shell is running, point a web browser at it. eg. I connect to http://raspi08.home:4040/ from any machine on my home network. This is the Spark UI, which is very useful for debugging purposes.
When you are done, Ctrl-D to exit the shell. The dumping of logging information to the console is pretty annoying. It’s possible to turn it off. I’ll explain how to do it in the next post – I recommend putting up with it until your cluster is set up and working properly, as it does contain a lot of very useful debugging info.
Next, try the pySpark shell:
bin/pyspark --master local
The python test code is a bit more verbose than the Scala version 😉
Again, Ctrl-D to exit.
So, that is Spark installed and working in “local” mode on a single node. In the next post I’ll show how to create a simple Spark “standalone” cluster using two Raspbery Pi 2. For this, I will start by assuming that Spark has been installed under the same user account in the same directory on every Pi 2. For further information about Spark, start with the official Spark documentation overview.
Below are some links I found useful in preparing this post (and the next one):