In the previous post I explained how to install Apache Spark in “local” mode on a Raspberry Pi 2. In this post I will explain how to link together a collection of such nodes into a standalone Apache Spark cluster. Here, “standalone” refers to the fact that Spark is managing the cluster itself, and that it is not running on top of Hadoop or some other cluster management solution.
I will assume at least two Raspberry Pi 2 nodes on the same local network, with identical Spark distributions installed in the same directory of the same user account on each node. See the previous post for instructions on how to do this. I will use two such nodes, with Spark installed under the user account spark.
First, you must decide on one of your nodes to be the master. I have two nodes, raspi08 and raspi09. I will set up raspi08 as the master. The spark account on the master needs to be able to SSH into the same account on all of the other nodes without the need to provide a password, so it makes sense to begin by setting up passwordless SSH. Log in to the spark account on the master and generate SSH keys with:
Just press return when asked for a password to keep it password free. Copy the identity to each other node. eg. to copy it to raspi09 I use:
You will obviously need to provide a password at this point. Once the identity is copied, SSH into each node to ensure that you can indeed connect without the need for a password.
Once passwordless SSH is set up and working, log into the master node to configure the Spark installation. Within the conf/ directory of the Spark installation, create a file called slaves and enter a list of all nodes you want to have as “workers”. eg. mine looks like:
Note that in my case raspi08 is listed as a worker despite also acting as master. That is perfectly possible. If you have plenty of nodes you might not want to do this, as the Pi 2 doesn’t really have quite enough RAM to be able to do this well, but since I have only two nodes, it seems like a good idea here.
Also within the conf/ directory, take a copy of the environment template:
cp spark-env.sh.template spark-env.sh
and then edit spark-env.sh according to your needs. I solved some obtuse Akka errors about workers not being able to connect back to the master by hardcoding the IP address of the Master node into the config file:
You can find out the IP address of your node by running ifconfig. You should also set the memory that can be used by each worker node. This is a bit tricky, as RAM is a bit tight on the Pi 2. I went for 512MB, leaving nearly half a gig for the OS.
Once you are done, the environment template (but not the slaves list) needs to be copied to each worker. eg.
scp spark-env.sh @raspi09:spark-1.3.0-bin-hadoop2.4/conf/
You shouldn’t need to supply a password…
At this point, you should be ready to bring up the cluster. You can bring up the master and workers all in one go with:
When the master comes up, it starts a web service on port 8080. eg. I connect to it from any machine on the local network by pointing my browser at: http://raspi08:8080/
If a web page comes up, then the master is running. The page should give other diagnostic information, including a list of workers that have been brought up and are registered with the master. You can also access a lot of debugging info. If everything seems to look OK, make a note of the URL for the Spark master, which is displayed in large text at the top of the page. It should just be spark://192.168.1.177:7077 where the IP address is the IP address of your master node.
Try bringing up a spark shell on the master with:
bin/spark-shell --master spark://192.168.1.177:7077
Once the shell comes up, go back to your web browser and refresh the page to see the connection. Go back to the shell and try a simple test like:
As usual, it is Ctrl-D to exit the shell. To bring down the cluster, use
Once you are happy that everything is working as it should, you probably want to reduce the amount of diagnostic debugging info that is echoed to the console. Do this by going back into the conf/ directory and copying the log4j template:
cp log4j.properties.template log4j.properties
and then editing log4j.properties. There is a line near the beginning of the file:
Change INFO to WARN so it reads:
Then when you next bring up the cluster, everything should be a bit less noisy.
That’s it. You’ve built a Spark cluster! Note that when accessing files from Spark scripts (and applications) it is assumed that the file exists in the same directory on every worker node. For testing purposes, it is easy enough to use scp before running the script to copy the files to all of the workers. But that is obviously somewhat unsatisfactory in the long term. Another possibility is to set up an NFS file server and mount it at the same mount point on each worker. Then make sure that any files you access are shared via the NFS file server. Even that solution isn’t totally satisfactory, due to the slow interconnect on the Pi 2. Ultimately, it would be better to set up a proper distributed file system such as Hadoop’s HDFS on your cluster and then share files via HDFS. That is how most production Spark clusters are set up. I may look at that in another post, but in the meantime check out the Spark standalone documentation for further information.