
Book on How to Cluster some Pis with Hadoop

To be honest, I expected more from a book with a title like Raspberry Pi Super Cluster. The author, Andrew K. Dennis, has a very clear vision of the subject (as in his previous book, Raspberry Pi Home Automation with Arduino, which I liked a lot). He has done his best to deliver an exhaustive set-up while staying concise, but it seems to me that this is clearly the wrong format for a book on the given topic.

Stack Pis for parallel power
Now that I have this book at hand, I finally got the chance to answer many of the questions I had about clustering and how it can be applied to a set of Raspberry Pis. The first impression is that it is very well structured and gradual. The first two chapters are short introductions to parallel computing (background history and contemporary systems) and to the initial set-up, respectively. They're short and to the point. And that's the way it should be - it is presumed that if you're going parallel, then you're already a somewhat advanced tinkerer.
Actually, the second chapter is quite rich in detail on how to install the operating system and the required software and tools. I skimmed through it, because my two Raspberry Pi units were already well equipped with what was needed.

The next chapter is the first encounter with parallel software, in the form of MPICH - one of the oldest and most widely adopted implementations of MPI (Message Passing Interface), designed for applications written in C, C++ or Fortran. In this chapter we also come to one tricky part - setting up the second Raspberry Pi unit (the same applies to the third, fourth, and so on). It is tricky because it continues the set-up started in the second chapter and must be followed strictly, especially the part with the RSA key exchange. If you get just one thing wrong, you may have to start all over (as I did). The good news is that the procedure is short and not as obscure as one might imagine for a set-up concerning security matters. Once the berries are prepared correctly, the only things you'll need to care about from then on are the parallel frameworks and your applications.

In chapter four, after we've calculated the number Pi with a small MPICH application written in C, we finally arrive at one of the most popular representatives of the modern trends in parallel software - Apache Hadoop. Its installation is quick, but the configuration is rather detailed, especially when you take into account that most of the steps have to be done at least twice. Here I met the biggest downside of the book - the lack of any troubleshooting advice for the situations when you get stuck. Although I followed every step verbatim, errors were logged on the console for which the book offered no help. Fortunately the messages are somewhat self-explanatory, so with a little deduction one can get to the next step fairly easily.
Another disadvantage, if I may call it that, is Hadoop's version. I don't know when the book was written, but when it says "Download the latest version", the project's site gives you version 2.2.0, while the book uses version 1.2.1. This wouldn't be much of a problem if Hadoop's architecture hadn't changed significantly between the two. So if you prefer the latest version, the instructions in the chapter are of no use to you. If the author had a good reason to stick to the older branch of the software, that reason remains obscure to the reader.
There are a few lesser inaccuracies, such as a wrong documentation URL and incomplete scp commands, and a sense that the text was a bit too rushed.
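
For readers who have not seen how a Pi calculation gets split across nodes, here is a minimal sketch of the idea. It is not the book's program: I am assuming the midpoint-rule integration of 4/(1+x^2) over [0, 1], I use Python instead of C, and the "ranks" are simulated with a plain loop rather than real MPI processes. The names `partial_pi` and `estimate_pi` are made up for this illustration.

```python
# Sketch only: ranks are simulated with a plain loop; no MPI library is used.
# Assumption: Pi is approximated by the midpoint rule on the integral of 4/(1+x^2).


def partial_pi(rank: int, size: int, intervals: int) -> float:
    """Return one rank's share of the midpoint-rule sum."""
    h = 1.0 / intervals
    total = 0.0
    # Each rank takes every size-th interval, starting at its own rank number.
    for i in range(rank, intervals, size):
        x = h * (i + 0.5)
        total += 4.0 / (1.0 + x * x)
    return h * total


def estimate_pi(size: int, intervals: int) -> float:
    """Add up the partial results, standing in for a reduce step on one node."""
    return sum(partial_pi(rank, size, intervals) for rank in range(size))


if __name__ == "__main__":
    # Two simulated ranks, matching a two-Pi cluster.
    print(estimate_pi(size=2, intervals=100_000))
```

In a real MPI program each rank would run the equivalent of `partial_pi` on its own Raspberry Pi, and the final sum would be collected on one node.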

Now that the framework for parallel computing is set up, it is time to test it with an application. Since Hadoop is written in Java (as is typical for an Apache project), its main target implementation language is Java. (You might follow the book and install JDK 7 from the repositories, or you may try the new JDK 8, as I described in my previous post.) The test application counts words in an input file and is not particularly interesting, but it gives a simple and comprehensible introduction to the MapReduce concept. More interesting is the Monte Carlo approach described in the sixth chapter. The good thing is that it is compared side by side with an analogous C program for MPI. This is actually the culmination and the essence of the book. For further investigation of the concepts and of ways to apply parallelism in practice, help is available online; many resources are given in the appendix.
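
To make the word-count example a little more concrete, here is a small sketch of the MapReduce idea. It is plain Python running in a single process, not Hadoop's Java API and not the book's code; the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are my own and purely illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: add up all the ones collected for a single word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce one after another in a single process."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["pi pi cluster", "Cluster pi"]))
```

On a real cluster the map calls run on different nodes against different parts of the input, and the framework takes care of the grouping before the reduce calls run.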

The last chapter is quite handy in general and goes beyond the scope of the book. The instructions for booting the Raspberry Pi with an external USB HDD as auxiliary data storage seem very useful. Building a LEGO case for the cluster and the suggestions for alternative energy sources offer interesting perspectives on the Raspberry Pi in their own right.

All in all, setting up a cluster from Raspberry Pi units turns out to be less complex than expected. You only need to follow the correct set of steps, and follow them strictly at times. Even if it doesn't give you an idea for a specific project, this book at least puts you in a firm starting position on the road to parallelism.
