Book on How to Cluster some Pis with Hadoop

To be honest, I expected more from a book with a title like Raspberry Pi Super Cluster. The author, Andrew K. Dennis, has a very clear vision of the subject (as in his previous book, Raspberry Pi Home Automation with Arduino, which I liked a lot). He's done his best to deliver an exhaustive set-up while staying concise, but it seems to me that this is clearly the wrong format for a book on the given topic.

Stack Pis for parallel power
With this book at hand, I finally got the chance to answer many of the questions I had about clustering and how it can be applied to a set of Raspberry Pis. The first impression is that it is very well structured and gradual. Let's see: the first two chapters are short introductions to parallel computing (background history and contemporary systems) and to the initial set-up, respectively. They're short and to the point. And that's the way it should be - it is presumed that if you're going parallel, then you're a somewhat advanced tinkerer already.
Actually, the second chapter is quite rich in detail on how to install the operating system and the required software and tools. I skimmed through it, because my two Raspberry Pi units were already pretty well equipped with what was needed.

The next chapter is the first encounter with parallel software, in the form of MPICH - one of the oldest and most widely adopted implementations of the MPI (Message Passing Interface) standard, which is designed for applications written in C, C++ or Fortran. In this chapter we also come to one tricky part - setting up the second (and, equally, the third, fourth, and so on) Raspberry Pi unit. It is tricky because it's a continuation of the set-up started in the second chapter and must be followed strictly, especially the part with the RSA key exchange. If you get just one thing wrong, you may have to start all over (as I did). The good news is that the procedure is short and not as obscure as one might expect from a set-up concerning security matters. Once the berries are prepared correctly, the only things you'll need to care about from then on are the parallel frameworks and your applications.

In chapter four, after we've calculated the number Pi with a small MPICH application written in C, we finally arrive at one of the most popular representatives of the modern trends in parallel software - Apache Hadoop. Its installation is quick, but the configuration is a bit detailed, especially when you take into account that most of the things have to be done at least twice. Here I met the biggest downside of the book - the lack of any troubleshooting for the situations when you get stuck. Although I followed every step verbatim, errors were logged on the console for which the book offered no help. Fortunately the messages are somewhat self-explanatory, so with a little deduction one can get to the next step fairly easily.
Another disadvantage, if I may call it that, is Hadoop's version. I don't know when the book was written, but when it says "Download the latest version", on the project's site you get version 2.2.0, while the book uses version 1.2.1. This wouldn't be much of a problem if Hadoop's architecture hadn't changed significantly between the two. So if you prefer the latest, the instructions in the chapter are of no use to you. If the author had a good reason to stick to the older branch of the software, this reason remains obscure to the reader.
There are a few lesser inaccuracies, such as a wrong documentation URL and incomplete scp commands, and a general sense that the text was a bit rushed.
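
For readers who haven't seen a parallel Pi calculation before, here is a minimal sketch of the idea behind the chapter-four exercise. The book's program is written in C against MPICH; this is plain Python that only simulates the ranks in a loop. The function names, the midpoint-rule integration of 4/(1+x^2), and the way intervals are dealt out to workers are assumptions for illustration, not the book's code.

```python
import math


def partial_pi(rank, size, intervals):
    """Midpoint-rule terms handled by one worker: every size-th interval."""
    h = 1.0 / intervals
    total = 0.0
    for i in range(rank, intervals, size):
        x = h * (i + 0.5)            # midpoint of interval i on [0, 1]
        total += 4.0 / (1.0 + x * x)
    return h * total


def estimate_pi(size=2, intervals=100_000):
    """Add up the partial sums of all workers.

    On a real cluster each rank would compute its own partial sum and a
    reduction step would combine them; here the ranks are simulated in a loop.
    """
    return sum(partial_pi(rank, size, intervals) for rank in range(size))


if __name__ == "__main__":
    approx = estimate_pi(size=2)
    print(f"pi ~ {approx:.10f}, error {abs(approx - math.pi):.2e}")
```

The point of the split is that each worker touches only its own share of the intervals, so with two Pis each one does roughly half of the arithmetic and only a single number per node has to travel over the network.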

With the framework for parallel computing in place, it is time to test it with an application. Since Hadoop is written in Java (as is typical for an Apache project), its main target implementation language is Java. (You might follow the book and install JDK 7 from the repositories, or you may try the new JDK 8, as I described in my previous post.) The test application counts words in an input file and is not particularly interesting, but it gives a simple and comprehensible introduction to the MapReduce concept. More interesting is the Monte Carlo approach described in the sixth chapter. The good thing is that it is compared side by side with an analogous C program for MPI. This is actually the culmination and the essence of the book. For further investigation of the concepts and of ways to apply parallelism in practice, help is available online. Many resources are given in the appendix.
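
To make the MapReduce idea concrete, here is a small sketch of both examples, the word count and a Monte Carlo estimate of Pi, run through the same map, group and reduce steps. It is plain Python with no Hadoop involved (the book's versions are written in Java), and every name in it (`map_reduce`, `word_mapper`, `dart_mapper`, and so on) is hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper on every record, group the values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def sum_reducer(key, values):
    """Reduce step shared by both examples: add up the values for one key."""
    return sum(values)


def word_mapper(line):
    """Emit (word, 1) for every word in one line of input."""
    for word in line.lower().split():
        yield word, 1


def dart_mapper(task):
    """Throw random darts at the unit square and count hits inside the quarter circle."""
    seed, samples = task
    rng = random.Random(seed)        # one seed per task keeps the run repeatable
    hits = sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    yield "hits", hits
    yield "samples", samples


def word_count(lines):
    return map_reduce(lines, word_mapper, sum_reducer)


def monte_carlo_pi(tasks=4, samples_per_task=50_000):
    work = [(seed, samples_per_task) for seed in range(tasks)]
    totals = map_reduce(work, dart_mapper, sum_reducer)
    return 4.0 * totals["hits"] / totals["samples"]


if __name__ == "__main__":
    print(word_count(["pi pi cluster", "Cluster of Pi"]))
    print(monte_carlo_pi())
```

Both jobs share the same reducer, which is the attraction of the model: only the mapper changes between counting words and throwing darts, and a framework is free to run the mappers on different nodes.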

The last chapter is quite handy in general and goes beyond the strict scope of the book. The instructions for booting the Raspberry Pi with an external USB HDD as auxiliary data storage seem very useful. Building a LEGO case for the cluster and the suggestions for alternative energy sources offer interesting perspectives on the Raspberry Pi in their own right.

All in all, setting up a cluster from Raspberry Pi units turns out to be less complex than expected. You only need to follow the correct set of steps, and follow them strictly at times. Even if it does not point you toward a particular project, this book at least puts you in a firm starting position on the road to parallelism.
