In Part 1 of this series, I discussed some of the basic concepts around Hadoop, specifically when it’s appropriate to use Hadoop to solve your data engineering problems and the terminology of the Hadoop ecosystem. This post will cover how to install your own Hadoop cluster on Amazon EC2 using Cloudera Manager.
Like prior posts about Amazon EC2, this one assumes some basic facility with Linux and working at the command line. Because really, if you’re interested in Hadoop, using the command line probably isn’t a limiting factor!
Building an 18-node Hadoop Cluster
The SlideShare presentation below shows the steps to building an 18-node Hadoop cluster, using a single m1.large EC2 instance as the ‘Name Node’ and 18 m1.medium EC2 instances as the ‘Data Nodes’. I chose 18 data nodes because, according to Cloudera, 20 instances is the maximum that can be activated at one time through the Amazon API; with the Name Node included, that’s 19 instances, which stays safely under the limit. It’s possible to add more instances later through the Cloudera Manager (up to 50 total), if so desired.
Note that going through this tutorial will cost $2.40/hr at current prices: $0.24/hr for the single m1.large instance, plus 18 × $0.12/hr = $2.16/hr for the m1.medium instances.
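If you want a quick sanity check that all 19 instances actually came up, here’s a minimal sketch using boto3, the AWS SDK for Python. The region name and the expected counts are assumptions based on the setup above; use whatever region you launched in:

```python
# Minimal sketch: count the running EC2 instances by type.
# Assumes boto3 is installed and AWS credentials are configured.
# The region name is an assumption; use the region you launched in.
from collections import Counter

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)

# For this tutorial, we expect one m1.large (the Name Node) and
# eighteen m1.mediums (the Data Nodes).
counts = Counter(
    inst["InstanceType"]
    for reservation in resp["Reservations"]
    for inst in reservation["Instances"]
)
print(counts)  # e.g. Counter({'m1.medium': 18, 'm1.large': 1})
```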
Since the SlideShare presentation can be hard on the eyes, I’ve also created a full-resolution PDF download.
Next Steps
Once you make it through all these steps to set up a Hadoop cluster, you are ready to do some analysis. Part 3 of this tutorial will cover how to upload data into HDFS using Hue.
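If you can’t wait for Part 3, one command-line-friendly alternative is the WebHDFS REST API. Below is a minimal Python sketch of an upload, assuming WebHDFS is enabled on the cluster and the Name Node’s web port (50070 in CDH4) is open in your security group; the hostname, HDFS path, and user name are placeholders:

```python
# Minimal sketch: upload a local file to HDFS via the WebHDFS REST API.
# The hostname, HDFS path, and user name below are placeholders;
# substitute your Name Node's public DNS and your own paths.
import requests

NAMENODE = "http://your-namenode-public-dns:50070"  # placeholder
HDFS_PATH = "/user/hdfs/sample.csv"                 # placeholder

# Step 1: ask the Name Node to create the file. It answers with a
# redirect whose Location header points at the Data Node to write to.
resp = requests.put(
    NAMENODE + "/webhdfs/v1" + HDFS_PATH,
    params={"op": "CREATE", "user.name": "hdfs", "overwrite": "true"},
    allow_redirects=False,
)
datanode_url = resp.headers["Location"]

# Step 2: stream the local file's contents to that Data Node.
with open("sample.csv", "rb") as f:
    resp = requests.put(datanode_url, data=f)
resp.raise_for_status()  # expect 201 Created on success
```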
Update, 7/13/13: As is the case with any open-source project, there have been several changes to the Cloudera Manager that make setup easier. When getting started, on the screen where it asks “Which Cloudera do you want to deploy?”, choose ‘Cloudera Standard’. Also, once you get to slides 13-14 where you click on the link to get started with Hue, the link now works correctly (you don’t need to search for the Amazon DNS anymore!)