High-Performance Distributed Computing

Dr. Greg Wolffe, wolffe@gvsu.edu

This project involved learning and using Hadoop, a new, open-source, high-performance distributed computing framework. As it is a relatively recent release from Yahoo, the first step was researching this cutting-edge technology. Since the project was intended to be a complete investigation from hardware to results, the next step was to set up a distributed computing platform on a blade server running the Ubuntu operating system. The infrastructure stage was completed with the installation and configuration of the Hadoop framework and filesystem. The next step, learning and using the features of the framework, was approached by writing a simple Hadoop application that implemented the well-known Traveling Salesman Problem. This helped in learning the basics of distributed computing with Hadoop, although it did not stress the file-handling capabilities of the system. To test that aspect, a second and much more complex application was developed: a social networking research application that performed data mining on a large data set drawn from the Wikipedia website. The final step involved gathering metrics to show the improvement in execution time of the distributed Hadoop applications over their serial versions. The project was time-consuming and complex, but offered numerous learning opportunities. Quite a few problems had to be overcome, ranging from hardware issues to language incompatibilities. Working through them afforded a deep reflection on the project, the framework, and the lessons learned.
