Getting Started Using Hadoop, Part 2: Building a Cluster

In Part 1 of this series, I discussed some of the basic concepts around Hadoop, specifically when it’s appropriate to use Hadoop to solve your data engineering problems and the terminology of the Hadoop ecosystem. This post will cover how to install your own Hadoop cluster on Amazon EC2 using Cloudera Manager.

Like prior posts about Amazon EC2, this one assumes you have some basic facility with Linux, submitting commands via the command line, and so on. Because really, if you’re interested in Hadoop, using the command line probably isn’t a limiting factor!

Building an 18-node Hadoop Cluster

The SlideShare presentation below shows the steps to building an 18-node Hadoop cluster, using a single m1.large EC2 instance as the ‘Name Node’ and 18 m1.medium EC2 instances as the ‘Data Nodes’. I chose 18 because, according to Cloudera, 20 is the maximum number of instances that can be activated at one time through the Amazon API, so staying under that maximum avoids any errors. It’s possible to add more instances later through the Cloudera Manager (up to 50 total), if so desired.
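For readers who prefer the command line, the instance launch can be sketched with the AWS CLI. This is an assumption on my part — the slides walk through the EC2 console instead, and the AMI ID, key pair, and security group names below are placeholders. The commands are echoed rather than executed so you can review them first:

```shell
# Sketch only: AMI ID, key pair, and security group are placeholders.
AMI="ami-00000000"     # stand-in for a 64-bit Ubuntu 12.04 LTS AMI
KEY="my-hadoop-key"    # stand-in key pair name
SG="hadoop-cluster"    # stand-in security group name

# One m1.large for the Name Node:
echo aws ec2 run-instances --image-id "$AMI" --count 1 \
    --instance-type m1.large --key-name "$KEY" --security-groups "$SG"

# 18 m1.medium Data Nodes -- a single request of 18 stays under the
# 20-instance Amazon API limit mentioned above:
echo aws ec2 run-instances --image-id "$AMI" --count 18 \
    --instance-type m1.medium --key-name "$KEY" --security-groups "$SG"
```

Dropping the leading `echo` would actually launch the instances (and start billing), so double-check the placeholders first.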

Note that going through this tutorial will cost $2.40/hr at current prices: $0.24/hr for the single m1.large instance, plus $0.12/hr for each of the 18 m1.medium instances.
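As a quick sanity check on that number, the hourly cost is just one m1.large plus 18 m1.medium instances:

```shell
# Hourly cost at the prices quoted above:
# 1 x m1.large at $0.24/hr + 18 x m1.medium at $0.12/hr.
name_node_rate="0.24"
data_node_rate="0.12"
data_nodes=18

total=$(awk -v a="$name_node_rate" -v b="$data_node_rate" -v n="$data_nodes" \
            'BEGIN { printf "%.2f", a + n * b }')
echo "Estimated cost: \$${total}/hr"   # prints: Estimated cost: $2.40/hr
```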

Since the SlideShare presentation is potentially not so friendly on the eyes, I’ve also created a full-resolution PDF download.


Next Steps

Once you make it through all these steps to set up a Hadoop cluster, you are ready to do some analysis. Part 3 of this tutorial will cover how to upload data into HDFS using Hue.

 

Update, 7/13/13: As is the case with any open-source project, there have been several changes to the Cloudera Manager that make setup easier. When getting started, on the screen that asks “Which Cloudera do you want to deploy?”, choose ‘Cloudera Standard’. Also, once you get to slides 13-14, where you click on the link to get started with Hue, the link now works correctly (you no longer need to search for the Amazon DNS!)

Comments

  1. minnaaralex says:

    I followed your instructions though I was unable to access the Hue web UI. Could you suggest why this might be?

    • minnaaralex Without seeing your setup, it’s hard to say exactly, but I would make sure that your firewall settings are correct and that you are connecting via port 8888 on the Amazon public DNS link.
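If the firewall is the culprit, the fix is to allow inbound TCP on port 8888 in the cluster’s EC2 security group. As a hedged sketch using the AWS CLI (the tutorial itself configures this through the EC2 console, and the group name is a placeholder), echoed for review rather than executed:

```shell
SG="hadoop-cluster"   # placeholder security group name
HUE_PORT=8888         # Hue's default web UI port

echo aws ec2 authorize-security-group-ingress \
    --group-name "$SG" --protocol tcp --port "$HUE_PORT" --cidr 0.0.0.0/0
```

Opening the port to 0.0.0.0/0 is convenient for testing, but you’d want to restrict the CIDR range for anything longer-lived.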

  2. naterstd says:

    Hi,
    Your instructions work perfectly up until slide 12. I received no errors until then, and the inspection passed on all hosts. However, it says “Server Error    No hosts found” instead of starting up my services. Do you have any idea what could cause this? The other instructions have been followed exactly, except that I am only using 9 nodes.

    • naterstd says:

      It also gives me warnings that my recommended heap size is too small for host monitors:

      “The recommended heap size that supports a time series cache for a period of 370 minutes is at least 163052680 bytes, only 123403263 are configured.”

      • naterstd Unfortunately, I can’t be much help here. I just re-ran the instructions with 9 nodes and the process worked as expected.
        The only things I can think of are to make sure you picked the 64-bit 12.04 LTS instance type and check your firewall settings. For the firewall settings, you could try just opening all of the ports; generally, this is a bad idea for security, but for testing the process, it’s a good way to verify if there’s something weird going on in the settings.
        Good luck!

        • naterstd Actually, I also got the problem about the Java heap size, but mine appeared in the configuration area after the cluster started up correctly. So it appears that this setting is set too low in the Cloudera Manager setup.
          It’s an easy fix, assuming the cluster starts up (there’s a link on the mgmt1 screen to reset to the default value of 1GB), but I don’t see how you can change that setting during the build process.

        • naterstd says:

          randyzwitch Before I try to run again, I was wondering:
          1. Is 12.04 LTS the same as “Ubuntu Server 12.04.2 LTS” that appears in the quick start options, or do I need to select 12.04 LTS from the AWS Marketplace?
          2. When you re-ran with 9 nodes, did you re-download the Cloudera installer to get the most recent one, or did you use one you already had installed (so you would be using one of the older versions here: http://archive.cloudera.com/cm4/installer/ )?

        • naterstd Yes, 12.04 LTS is the same one as in the Classic Wizard, and I used the newest Cloudera installer.
          The only thing I re-used was my key-pair and firewall settings, which *should* be the same as what’s in this tutorial.

        • naterstd says:

          randyzwitch I ran the cluster from home and it worked, so it must be the firewall at work. Any idea what it is in a firewall that would be making it unable to find the hosts when starting up the services for the first time?

        • naterstd I think it might have just been random that it didn’t work at work but worked at home. 
          When you create any Amazon instances, the firewall settings apply to your Amazon instances, not your local machine. As long as you can SSH into the main EC2 image, your local firewall has all the permissions it needs. Then, the main EC2 image runs the Cloudera installer, and the instances all talk to each other over the internal IP addresses that Amazon assigns.
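To make that concrete, a quick connectivity check is just an SSH to the main node’s public DNS. The key file and hostname below are placeholders, and the command is echoed for review rather than run:

```shell
KEY_FILE="my-hadoop-key.pem"                       # placeholder key file
PUBLIC_DNS="ec2-0-0-0-0.compute-1.amazonaws.com"   # placeholder public DNS

# If this SSH works, your local firewall isn't the problem; the cluster
# nodes talk to each other over Amazon's internal IPs regardless.
echo ssh -i "$KEY_FILE" "ubuntu@${PUBLIC_DNS}"
```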

  3. Great article.

    Here’s a quick question: If I subscribe to Amazon EC2, will I need to have any software on my client machine (I typically use Redhat Linux)?

    Or, is everything done via the web interface?

    Thanks!

    • Randy Zwitch says:

      You can SSH into your EC2 instance from the terminal as well, but there isn’t really any client software required.
