Getting Started Using Hadoop, Part 2: Building a Cluster

In Part 1 of this series, I discussed some of the basic concepts around Hadoop, specifically when it’s appropriate to use Hadoop to solve your data engineering problems and the terminology of the Hadoop eco-system. This post will cover how to install your own Hadoop cluster on Amazon EC2 using Cloudera Manager.

Like prior posts talking about Amazon EC2, this post assumes you have some basic facility with Linux, submitting instructions via the command line, etc. Because really, if you’re interested in Hadoop, using the command line probably isn’t a limiting factor!

Building a 18-node Hadoop Cluster

The SlideShare presentation below shows the steps to building a 18-node Hadoop cluster, using a single m1.large EC2 instance as the ‘Name Node’ and 18 m1.medium EC2 instances as the ‘Data Nodes’.  I chose 18 nodes because according to Cloudera, 20 is the maximum that can be activated at one time through the Amazon API, so let’s stay under the max to avoid any errors. It’s possible to add more instances later through the Cloudera Manager (up to 50 total), if so desired.

Note that going through this tutorial will cost $2.40/hr at current prices ($0.24/hr per m1.large instance and $0.12/hr per m1.medium instance).

Since the SlideShare presentation is potentially not so friendly on the eyes, I’ve also created a PDF download that’s full resolution.

Next Steps

Once you make it through all these steps to set up a Hadoop cluster, you are ready to do some analysis. Part 3 of this tutorial will cover how to upload data into HDFS using Hue.

Update, 7/13/13: As is the case with any open-source project, there have been several changes to the Cloudera Manager that makes setup easier. When getting started, on the screen where it asks “Which Cloudera do you want to deploy?”, choose ‘Cloudera Standard’. Also, once you get to slides 13-14 where you click on the link to get started with Hue, the link now works correctly (you don’t need to search for the Amazon DNS any more!)

Getting Started Using Hadoop, Part 1: Intro

For the last couple of days I’ve been at the eMetrics conference in San Francisco. There were several panels that discussed big data, both from an engineering standpoint as well as how to adopt newer technologies from a business perspective.

Unfortunately, there wasn’t a whole lot of practical information on how to actually get started using ‘big data’ technologies, of which Hadoop is one.  Luckily, it’s fairly easy to create a proof-of-concept Hadoop cluster using Amazon EC2 and Cloudera.

This series will be at least 5 parts, as follows:

  1. Intro to Hadoop ecosystem and concepts
  2. Setting up Hadoop cluster on Amazon EC2 using Cloudera
  3. Populating HDFS with airline dataset files using Hue
  4. Use Hive and/or Pig to stack datasets into one master dataset
  5. Doing analytics on the combined Airline dataset using Pig and/or Hive

My aim with this series is to simply explain why you might want to consider using Hadoop for your data storage and processing. There’s a lot of marketing & vendor bullshit excitement surrounding the term ‘big data’, so for this blog series, I’m just going to focus on the most important points for an analyst/marketer to understand. And other than this sentence, there will be no mentions of MS Excel in terms of ‘big data’, which is barely an appropriate tool for analysis in general, let alone analysis at scale.

What Is Hadoop & Why Are People Talking About It?

At it’s simplest, Hadoop provides a parallel-processing computing framework for data storage and processing. The reason why a parallel-processing framework is important for enterprise-level analysis is due to physical limitations on how quickly a single machine can process information.

As an example, suppose  you want to create a report that looks at 1 trillion daily credit card transactions. It’s possible to do your calculations on your local desktop using a tool like SAS. However, the amount of time to process that much data on a desktop with 8GB-16GB of RAM might be 8 hours, 10 hours….24 hours?! So an analyst trying to get an answer can start a SINGLE business question at 8am and hope they get their answer before it’s time to leave at the end of the day. Suffice to say, not a particularly efficient way to run a business.

The solution might seem to add more processors and RAM to a desktop, but what happens when you add more users asking questions? Now you need an enterprise-class server such as Oracle or Teradata (and a few million dollars!). And for every terabyte of data you want to store, you’ll need a few thousand dollars. And that’s just for your nicely structured data…what happens when you want to start storing data such as free-form text that’s not so cleanly structured? Eventually, these types of engineering questions lead you towards a solution like Hadoop.

The reason why there is so much discussion around Hadoop as a data platform is that it solves the problems stated above: excessive time to process vast amounts of data and excessive cost of data storage. By using “commodity hardware” along with some fancy engineering, Hadoop provides an extremely cost-effective and flexible way to handle your enterprise data.

If Hadoop is so Great, Why Doesn’t Everyone Use It?

“Fast, Cheap And Good. Everyone should use Hadoop!” - Every vendor in marketplace

Just like you (probably) don’t use a screwdriver to stir a pot of chicken soup, not every data storage and analysis problem requires the extreme flexibility that Hadoop can provide. From the example above with credit card transactions, a standard relational database might continue to be an acceptable solution if you’re just running a basic SQL query to sum across the rows. But once your data starts moving beyond “rows and columns” and into things such as free-form text, images, clickstream data…the more Hadoop makes sense.

While it’s a tautology, how you know you need a solution like Hadoop is when you suspect you need a solution like Hadoop! If you already have a highly functioning data mart that answers your business questions, you probably don’t need to re-engineer everything just because. If you’re an Internet startup trying to create the next Facebook, then a standard relational database probably won’t cut it.

The best example I heard at eMetrics about the need for Hadoop was from Bob Page (now at Hortonworks, a Hadoop vendor): when Bob was at Ebay, for the longest time they were throwing away data, specifically images from the listings. So prior high storage costs leading to undesirable business outcome (deletion), unstructured data in the form of images…a Hadoop framework made sense to implement. Once implemented, Ebay could look across years of auctions to answer their business questions.

I’m An Analyst, Not An Engineer…What’s The Minimum I Need To Know To Get Started?

“MapReduce, Pigs, HCatalogs, Elephants, Bees, Zoos…Ooozie (Uzi’s)? WTF is everyone talking about?”

If you’ve made it this far and you’re not an engineer or DBA, you’re probably someone who’s interested in data science. You may be someone who already uses R, Python, Ruby or Java. Or, you’re a masochist. In any case, here are the minimum concepts I think you need to know to get started for later blog posts:

  • MapReduce: Not explicitly a Hadoop idea, but the idea that data can be split into chunks by a key (“Map”) and then processed into information by one or more functions/transformations (“Reduce”). In the Hadoop sense, MapReduce is generally a reference to a “job” written in Java that performs a data transformation
  • HDFS: Hadoop Distributed File System. Raw data gets imported into HDFS (either structured or unstructured), the distributed around to all of the various nodes to allow for parallel processing
  • Hive: SQL-like interface so that analysts don’t have to write MapReduce code directly
  • Pig: A scripting language used for analysis. Generally, an analyst will use Hive and/or Pig to do their work
  • HCatalog: A ‘Data Warehouse’ layer on top of HDFS, similar to how you define a database table (a series of columns in a table with formats)

Next Steps

With the above five Hadoop concepts in place, the next few posts will be to set up a proof-of-concept Hadoop cluster on Amazon EC2, processing ~12GB of publicly available data from the ‘Airline dataset’. That’s not ‘big’ as ‘big data’ goes, but it’s big enough to be fun to work with.

Instructions for Installing & Using R on Amazon EC2

If you’re an R user, you’ve surely heard all the hype around ‘big data’ and how R is commonly used to analyze these volumes of data. One thing that’s often missing from the discussion is HOW to work around issues using big data and R, specifically how to deal with the fact that R stores all its objects in-memory.

While you can use packages such as ff and bigmemory to overcome the in-memory limits of your local machine, these additional packages do require some re-engineering of your code. Instead, consider using Amazon EC2 to provision the resources you need.  Here are two ways to get started…

Use a Pre-Made AMI

In the great open-source tradition, there are already R Amazon EC2 AMI images available out there to use. The way I got started was using the pre-built images that Louis Aslett provides on his website.  Louis also provides great instructions on learning about EC2, so if you’ve never worked with R in the cloud or a just looking to get up and running fast, his website is a great means to do so.

Build Your Own Image

Alternatively, suppose you want to build your own customized image. For example, say you wanted to build a proof-of-concept ‘big data’ environment, so you want R, Python, MySQL and MongoDB.  The commands to accomplish this are listed below. Note that I’m assuming you have a basic understanding of working through the Amazon Web Service Console (AWS), including being able to get to the ‘Classic Wizard’ for launching an EC2 instance. You also should have a basic understanding of working from the command line.

Setting Up Amazon EC2 Instance

  1. Launch an Ubuntu 12.04.1 LTS 64-bit image. You can use a free “t1.micro” image while building, then provision more resources later once you’re ready for analysis.
  2. Accept defaults until you get to Key-Pair tab. The Key-Pair is what allows you to login securely to your Amazon EC2 image without a password. Create and download a Key-Pair if you don’t already have one or choose an existing Key-Pair if you do.
  3. When you get to the ‘Security Groups’ tab, create a security group that has the following ports open: 22 (SSH), 80 (HTTP), 443 (HTTPS), 3389 (RDP, optional), and 8787 (RStudio Server).
  4. Work through the rest of the Wizard until your instance is launched.

Connecting to Amazon EC2 Instance

  1. There are two ways to connect to your EC2 image, both of which can be found by going to the “Actions” tab in the AWS console, then selecting “Connect” from the drop-down. The rest of this tutorial assumes you connect via a stand-alone SSH client (such as Terminal for Mac OSX)
  2. Connect to your instance by typing the code provided to you, such as: ssh -i me-aws.pem
  3. Be sure that before you submit this code, you either modify the line to put the directory in front of your Key-Pair, or “cd” to the directory where the Key-Pair is located
  4. After submitting the connect code, you will get a warning saying that the ‘authenticity can’t be established, do you want to continue?’  Type yes and hit enter to log in.

Installing Base R

Once you are logged in, there are about a dozen commands that need to be submitted. Some commands run quickly, others can take 10-15 minutes to run through the entire installation process. Depending on how quickly each command completes, you may or may not need to type “sudo” in front of each command to have proper access rights for installation. Submit each line one at a time.

#Create a user, home directory and set password
sudo useradd rstudio
sudo mkdir /home/rstudio
sudo passwd rstudio
sudo chmod -R 0777 /home/rstudio

#Update all files from the default state
sudo apt-get update
sudo apt-get upgrade

#Add CRAN mirror to custom sources.list file using vi
sudo vi /etc/apt/sources.list.d/sources.list

#Add following line (or your favorite CRAN mirror)
deb precise/

#Update files to use CRAN mirror
#Don't worry about error message
sudo apt-get update

#Install latest version of R
#Install without verification
sudo apt-get install r-base

While not strictly required to run R, I also like to run the following commands to install the Curl and XML packages as well, which are useful if you want to use R to connect to any web data/APIs.

#Install in order to use RCurl & XML
sudo aptitude install libcurl4-openssl-dev
sudo apt-get install libxml2-dev

With these commands run, you will now be able to run R from the command line just by typing “R” at the prompt. However, it would be a crime to do all this work and not install RStudio Server, which makes working in R so much easier.

Installing RStudio Server

Once you’ve installed the above commands, you can now access RStudio through your local browser. Navigate to the Public DNS of your image on port 8787, similar to:

The login and password will be the values you used in the image creation process (I used rstudio as my username above).

Installing MySQL, Python, and MongoDB

If you’ve made it this far, I’m sure you realize that installing additional packages will only take a line or two of code. Even better, Python is installed by default on Linux, so we really only need to install MySQL and MongoDB.

#Install MySQL

sudo apt-get install mysql-common
sudo apt-get install mysql-server

#Install MongoDB

sudo apt-get install mongodb


While the steps above can be intimidating if you’ve never used Linux or worked on the command line, but once you get the hang of it, your ability to use R on ‘big data’ (however you define it) will be much improved. For only a few pennies to up to a few dollars per hour, you can use hardware having 16-64GB of RAM or more.

EDIT, 4/9: The code is wrapping weird on some monitors.  Click here for the commands in a .txt. file.

  • RSiteCatalyst Version 1.4.16 Release Notes
  • Using RSiteCatalyst With Microsoft PowerBI Desktop
  • RSiteCatalyst Version 1.4.14 Release Notes
  • RSiteCatalyst Version 1.4.13 Release Notes
  • RSiteCatalyst Version 1.4.12 (and 1.4.11) Release Notes
  • Self-Service Adobe Analytics Data Feeds!
  • RSiteCatalyst Version 1.4.10 Release Notes
  • WordPress to Jekyll: A 30x Speedup
  • Bulk Downloading Adobe Analytics Data
  • Adobe Analytics Clickstream Data Feed: Calculations and Outlier Analysis
  • Adobe: Give Credit. You DID NOT Write RSiteCatalyst.
  • RSiteCatalyst Version 1.4.8 Release Notes
  • Adobe Analytics Clickstream Data Feed: Loading To Relational Database
  • Calling RSiteCatalyst From Python
  • RSiteCatalyst Version 1.4.7 (and 1.4.6.) Release Notes
  • RSiteCatalyst Version 1.4.5 Release Notes
  • Getting Started: Adobe Analytics Clickstream Data Feed
  • RSiteCatalyst Version 1.4.4 Release Notes
  • RSiteCatalyst Version 1.4.3 Release Notes
  • RSiteCatalyst Version 1.4.2 Release Notes
  • Destroy Your Data Using Excel With This One Weird Trick!
  • RSiteCatalyst Version 1.4.1 Release Notes
  • Visualizing Website Pathing With Sankey Charts
  • Visualizing Website Structure With Network Graphs
  • RSiteCatalyst Version 1.4 Release Notes
  • Maybe I Don't Really Know R After All
  • Building JSON in R: Three Methods
  • Real-time Reporting with the Adobe Analytics API
  • RSiteCatalyst Version 1.3 Release Notes
  • Adobe Analytics Implementation Documentation in 60 Seconds
  • RSiteCatalyst Version 1.2 Release Notes
  • Clustering Search Keywords Using K-Means Clustering
  • RSiteCatalyst Version 1.1 Release Notes
  • Anomaly Detection Using The Adobe Analytics API
  • (not provided): Using R and the Google Analytics API
  • My Top 20 Least Useful Omniture Reports
  • For Maximum User Understanding, Customize the SiteCatalyst Menu
  • Effect Of Modified Bounce Rate In Google Analytics
  • Adobe Discover 3: First Impressions
  • Using Omniture SiteCatalyst Target Report To Calculate YOY growth
  • ODSC webinar: End-to-End Data Science Without Leaving the GPU
  • PyData NYC 2018: End-to-End Data Science Without Leaving the GPU
  • Data Science Without Leaving the GPU
  • Getting Started With OmniSci, Part 2: Electricity Dataset
  • Getting Started With OmniSci, Part 1: Docker Install and Loading Data
  • Parallelizing Distance Calculations Using A GPU With CUDAnative.jl
  • Building a Data Science Workstation (2017)
  • JuliaCon 2015: Everyday Analytics and Visualization (video)
  • Vega.jl, Rebooted
  • Sessionizing Log Data Using data.table [Follow-up #2]
  • Sessionizing Log Data Using dplyr [Follow-up]
  • Sessionizing Log Data Using SQL
  • Review: Data Science at the Command Line
  • Introducing Twitter.jl
  • Code Refactoring Using Metaprogramming
  • Evaluating BreakoutDetection
  • Creating A Stacked Bar Chart in Seaborn
  • Visualizing Analytics Languages With VennEuler.jl
  • String Interpolation for Fun and Profit
  • Using Julia As A "Glue" Language
  • Five Hard-Won Lessons Using Hive
  • Using SQL Workbench with Apache Hive
  • Getting Started With Hadoop, Final: Analysis Using Hive & Pig
  • Quickly Create Dummy Variables in a Data Frame
  • Using Amazon EC2 with IPython Notebook
  • Adding Line Numbers in IPython/Jupyter Notebooks
  • Fun With Just-In-Time Compiling: Julia, Python, R and pqR
  • Getting Started Using Hadoop, Part 4: Creating Tables With Hive
  • Tabular Data I/O in Julia
  • Hadoop Streaming with Amazon Elastic MapReduce, Python and mrjob
  • A Beginner's Look at Julia
  • Getting Started Using Hadoop, Part 3: Loading Data
  • Innovation Will Never Be At The Push Of A Button
  • Getting Started Using Hadoop, Part 2: Building a Cluster
  • Getting Started Using Hadoop, Part 1: Intro
  • Instructions for Installing & Using R on Amazon EC2
  • Video: SQL Queries in R using sqldf
  • Video: Overlay Histogram in R (Normal, Density, Another Series)
  • Video: R, RStudio, Rcmdr & rattle
  • Getting Started Using R, Part 2: Rcmdr
  • Getting Started Using R, Part 1: RStudio
  • Learning R Has Really Made Me Appreciate SAS