MCMC on the Parallella

Introduction

A very, very, very long time ago, I backed an interesting looking Kickstarter project called Parallella. The idea, somewhat inspired by the Raspberry Pi project, was to enable a small hardware company, Adapteva, to create a small (credit card sized), cheap, single board computer. The difference with the Raspberry Pi would be that this computer would have a more powerful CPU, and would also have an on-board FPGA and a 16 (or 64) core co-processor, called the Epiphany. In fact, the whole project is really a way to kick-start the production and development of the Epiphany chip, but that isn’t necessarily the main interest of the backers, many of whom are just geeky hobbyists like myself. The project was funded (I backed it at the $99 level), but then went fairly quiet for around 2 years(!) until I got a message from UK Customs telling me I had to pay import duty on my board… The board arrived earlier this year, but I’ve had lots of problems getting it going, so it’s only now that I have used it enough to warrant writing a blog post.

The tag-line of the Kickstarter project was “A Supercomputer For Everyone“, and that certainly attracted my interest. I do a lot of computationally intensive work, and have dabbled, on-and-off, with parallel, multi-core, and distributed computing for over 15 years. To have a serious parallel computer for not much more than the cost of two Raspberry Pis was just too good an opportunity to pass up on. The Parallella board looks very much like a Raspberry Pi, and is similar to set up and use, but generally a lot more flaky and fiddly – if you are new to single board computers, I would strongly recommend getting up to speed with a Raspberry Pi model B+ first! The Pi is just a bit simpler, easier, more robust, and has a large, well-organised, supportive user community. I had various issues with faulty/incompatible adaptor cables, cheap SD cards, and a home router which refused to give it an IP address… I won’t bore everyone with all of the details, but I am now in a position where I can get it to boot up and run well enough to be able to begin to put the thing through its paces.

There is a recommended Linux distro for the Parallella, based on Linaro, which, like Raspbian, is a Debian derived Linux for ARM. Just like the Pi (B+), you write the OS to a micro SD card and then use that as the boot media. Typing cat /proc/cpuinfo reveals two cores of an ARM (v7) processor. The distro comes with various dev tools pre-installed (git, make, gcc, etc.), as well as some example code to get you started, so it’s actually pretty quick to get going with once the thing boots and runs.

Benchmarking C code on the ARM chip

To start off with I’m just looking at the dual core ARM processor – I will start and look at the Epiphany chip as time permits. To begin with, I benchmarked the processor using my standard C-based Gibbs sampling script. For completeness, the source code of gibbs.c is given below:

/*
gcc -O4 gibbs.c -lgsl -lgslcblas -lm -o gibbs
time ./gibbs > datac.tab
*/

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
 
void main()
{
  int N=50000;
  int thin=1000;
  int i,j;
  gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
  double x=0;
  double y=0;
  printf("Iter x y\n");
  for (i=0;i<N;i++) {
    for (j=0;j<thin;j++) {
      x=gsl_ran_gamma(r,3.0,1.0/(y*y+4));
      y=1.0/(x+1)+gsl_ran_gaussian(r,1.0/sqrt(2*x+2));
    }
    printf("%d %f %f\n",i,x,y);
  }
}

This code requires the GSL, which isn’t installed by default, but a simple sudo apt-get install libgsl0-dev soon fixes that (I love Debian-derived Linux distros). The code runs in about 75 seconds on (a single core of) the Parallella. This is about 10 times slower than my (very fancy) laptop. However, it’s about twice as fast as the Pi (which only has one core), and around the same speed as my (Intel Atom based) netbook. So, given that the Parallella has two cores, this very superficial and limited experiment suggests that the ARM chip on the Parallella is around 4 times as powerful as the ARM chip in the Raspberry Pi. Of course, this is ignoring the FPGA and the Epiphany chip, but even without these, you are arguably getting more bang for your buck (and more importantly, more bang for your Watt) with the Parallella than you do with the Pi (though without the FPGA and Epiphany, there wouldn’t be much in it).

Scala code

These days I use Scala in preference to C whenever I possibly can. So, I was also interested to see how well Scala runs on this board. Scala is JVM based, and the Parallella distro has the OpenJDK JVM pre-installed, so if you are an sbt person (and if you use Scala you should be), it’s just a matter of copying the sbt launcher to ~/bin and you are good to go. Again, for completeness, a simple (non-idiomatic!) Scala program equivalent to the C version above is given below:

/*
Gibbs2.scala

time sbt run > gibbs.dat
*/

object Gibbs2 {
 
    import java.util.Date
    import scala.math.sqrt
    import breeze.stats.distributions._
 
    def main(args: Array[String]) {
        val N=50000
        val thin=1000
        var x=0.0
        var y=0.0
        println("Iter x y")
        for (i <- 0 until N) {
            for (j <- 0 until thin) {
                x=new Gamma(3.0,y*y+4).draw
                y=new Gaussian(1.0/(x+1),1.0/sqrt(2*x+2)).draw
            }
            println(i+" "+x+" "+y)
        }
    }
 
}

An appropriate build.sbt file for resolving dependencies is given below:

name := "Gibbs2"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
            "org.scalacheck" %% "scalacheck" % "1.11.4" % "test",
            "org.scalatest" %% "scalatest" % "2.1.7" % "test",
            "org.scalanlp" %% "breeze" % "0.10",
            "org.scalanlp" %% "breeze-natives" % "0.10"
)

resolvers ++= Seq(
            "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
            "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)

scalaVersion := "2.11.1"

This Scala version of the code takes around 30 minutes to complete on the Parallella! This is pretty dire, but entirely expected. Scala runs on the JVM, and the OpenJDK JVM is known to perform very poorly on ARM architectures. If you want to run computationally intensive JVM code on ARM, you have no choice but to get the Oracle JDK8 for ARM. Although this isn’t real “free” software, it can be downloaded and used for free for personal use.

Using Oracle’s JVM

Download and unpack Oracle’s JVM somewhere sensible on your system. Then update your sbt launcher script to point at the new java binary location. You may also want to add a javaHome line to your sbt build script. Once this is done, using sbt is just as normal, but you get to use Oracle’s pretty fast ARM JVM. Running the above example with the new JVM takes around 4 minutes 40 seconds on my Parallella. This is more than 5 times faster than the OpenJDK JVM, and it is a similar story on the Raspberry Pi as well. But it still isn’t blazingly fast. On Intel this code runs within a factor of 2 of the C version. Here it is closer to a factor of 4 slower. But again, this is very similar to the situation on the Pi. The Oracle JVM is good, but for whatever reason, the ARM JVM doesn’t get as close to the speed of native code as it does on x86.

A simple Monte Carlo example in Scala

I recently gave a talk on using Scala for statistical computing, and for that I prepared a talk and some code examples. One of the examples was a simple Monte Carlo example, coded in several different ways, including one version which would automatically exploit multiple cores. There is a github repo containing the slides of the talk and all of the code examples. Running the Monte Carlo example (with 10^6 iterations) using the OpenJDK JVM produces the following results:

$ sbt run
[info] Set current project to monte-carlo (in build file:/home/linaro/git/statslang-scala/monte-carlo/)
[info] Running MonteCarlo 
Running with 1000000 iterations
Idiomatic vectorised solution
0.99498
time: 37922.050389ms
Fast efficient (serial) tail call
0.999875
time: 25327.705376ms
Parallelised version
0.99997
time: 15633.49492ms
Done
[success]

Again, using the OpenJDK leads to poor performance, but the important thing to note is that the multi-core version is nearly twice as fast as the single threaded version, due to the Parallel ARM chip having 2 cores. Re-running with the Oracle JVM leads to much better results:

$ sbt run
[info] Set current project to monte-carlo (in build file:/home/linaro/git/statslang-scala/monte-carlo/)
[info] Running MonteCarlo 
Running with 1000000 iterations
Idiomatic vectorised solution
1.002605
time: 8214.310728ms
Fast efficient (serial) tail call
0.997675
time: 2410.760695ms
Parallelised version
0.999525
time: 1800.324405ms
Done
[success] 

Here, the Oracle JVM is almost 10 times faster than the OpenJDK JVM, and again the parallel version runs significantly faster than the serial version.

A Parallella gotcha to be aware of here is that I got weird certificate errors when cloning my github repos that turned out to be the clock. Like the Pi, the Parallella doesn’t have an RTC chip built in, but unlike the Pi, it doesn’t always seem to automatically sync when an internet connection comes up. Manually running ntpdate fixed the date and then git worked fine…

Summary

The Parallella packs a lot of processing power in a very small, low power form-factor. Even just using the main ARM chip, it is a better starting point for computationally intensive workflows than the Raspberry Pi (provided that you can get it to work!). If you can find a way to exploit the Epiphany chip, things will look even better. Java and other JVM languages (eg. Scala) are fine on the Parallella, but only if you use Oracle’s JVM, otherwise you are going to have very poor performance. My next task is to have a play with the Epiphany examples that are bundled with the Parallella distro. At the moment it seems like the best way to use the Epiphany chip is via C+OpenCL. That involves a degree of masochism I tend not to subject myself to very often these days, so it may have to wait until I’m in an appropriate mood!

Advertisements

Published by

darrenjw

I am Professor of Stochastic Modelling within the School of Mathematics & Statistics at Newcastle University, UK. I am also a computational systems biologist.

2 thoughts on “MCMC on the Parallella”

  1. I don’t remember where exactly it was, but I read that the Parallella board has a quirk where they changed the clock speed but didn’t modify the other things, so it thinks time is running faster than it is. IIRC, the base clock used to be 30 MHz and was changed to 33 MHz, so after it runs for a while, it will be out of sync with actual time, and you’ll run into issues with that. Also, it means it will lie about its own benchmarks (in its favor – it thinks it took 33 seconds for every 30).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s