For the last couple of days I’ve been at the eMetrics conference in San Francisco. There were several panels that discussed big data, both from an engineering standpoint and in terms of how to adopt newer technologies from a business perspective.
Unfortunately, there wasn’t a whole lot of practical information on how to actually get started using ‘big data’ technologies, of which Hadoop is one. Luckily, it’s fairly easy to create a proof-of-concept Hadoop cluster using Amazon EC2 and Cloudera.
My aim with this series is to simply explain why you might want to consider using Hadoop for your data storage and processing. There’s a lot of marketing & vendor excitement surrounding the term ‘big data’, so for this blog series, I’m just going to focus on the most important points for an analyst/marketer to understand. And other than this sentence, there will be no mentions of MS Excel in terms of ‘big data’; Excel is barely an appropriate tool for analysis in general, let alone analysis at scale.
What Is Hadoop & Why Are People Talking About It?
At its simplest, Hadoop provides a parallel-processing computing framework for data storage and processing. A parallel-processing framework matters for enterprise-level analysis because of physical limitations on how quickly a single machine can process information.
As an example, suppose you want to create a report that looks at 1 trillion daily credit card transactions. It’s possible to do your calculations on your local desktop using a tool like SAS. However, the amount of time to process that much data on a desktop with 8GB-16GB of RAM might be 8 hours, 10 hours….24 hours?! So an analyst trying to get an answer can start a SINGLE business question at 8am and hope they get their answer before it’s time to leave at the end of the day. Suffice it to say, that’s not a particularly efficient way to run a business.
The solution might seem to be adding more processors and RAM to a desktop, but what happens when you add more users asking questions? Now you need an enterprise-class server such as Oracle or Teradata (and a few million dollars!). And for every terabyte of data you want to store, you’ll need a few thousand dollars. And that’s just for your nicely structured data…what happens when you want to start storing data such as free-form text that’s not so cleanly structured? Eventually, these types of engineering questions lead you towards a solution like Hadoop.
The reason why there is so much discussion around Hadoop as a data platform is that it solves the problems stated above: excessive time to process vast amounts of data and excessive cost of data storage. By using “commodity hardware” along with some fancy engineering, Hadoop provides an extremely cost-effective and flexible way to handle your enterprise data.
If Hadoop is so Great, Why Doesn’t Everyone Use It?
“Fast, Cheap And Good. Everyone should use Hadoop!” - Every vendor in the marketplace
Just like you (probably) don’t use a screwdriver to stir a pot of chicken soup, not every data storage and analysis problem requires the extreme flexibility that Hadoop can provide. From the example above with credit card transactions, a standard relational database might continue to be an acceptable solution if you’re just running a basic SQL query to sum across the rows. But the further your data moves beyond “rows and columns” and into things such as free-form text, images, and clickstream data, the more Hadoop makes sense.
It’s a tautology, but you know you need a solution like Hadoop when you suspect you need a solution like Hadoop! If you already have a highly functioning data mart that answers your business questions, you probably don’t need to re-engineer everything just because. If you’re an Internet startup trying to create the next Facebook, then a standard relational database probably won’t cut it.
The best example I heard at eMetrics about the need for Hadoop was from Bob Page (now at Hortonworks, a Hadoop vendor): when Bob was at eBay, for the longest time they were throwing away data, specifically images from the listings. High storage costs were leading to an undesirable business outcome (deletion), and the data in question was unstructured (images), so a Hadoop framework made sense to implement. Once implemented, eBay could look across years of auctions to answer their business questions.
I’m An Analyst, Not An Engineer…What’s The Minimum I Need To Know To Get Started?
If you’ve made it this far and you’re not an engineer or DBA, you’re probably someone who’s interested in data science. You may be someone who already uses R, Python, Ruby or Java. Or, you’re a masochist. In any case, here are the minimum concepts I think you need to know to get started for later blog posts:
MapReduce: Not explicitly a Hadoop idea, but the idea that data can be split into chunks by a key (“Map”) and then processed into information by one or more functions/transformations (“Reduce”). In the Hadoop sense, MapReduce is generally a reference to a “job” written in Java that performs a data transformation
HDFS: Hadoop Distributed File System. Raw data gets imported into HDFS (either structured or unstructured), then distributed around to all of the various nodes to allow for parallel processing
Hive: SQL-like interface so that analysts don’t have to write MapReduce code directly
Pig: A scripting language used for analysis. Generally, an analyst will use Hive and/or Pig to do their work
HCatalog: A ‘Data Warehouse’ layer on top of HDFS, similar to how you define a database table (a series of columns in a table with formats)
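To make the “Map” and “Reduce” steps from the list above a bit more concrete, here’s a rough sketch using ordinary shell tools rather than Hadoop itself. The sample records and the file name (transactions.txt) are made up for illustration. The map step emits a key/value pair per record, sort stands in for the step that groups identical keys together, and the reduce step sums the values for each key. On a real cluster the same two steps would run in parallel across many nodes.

```shell
#!/bin/sh
# Hypothetical sample data: one "card_type amount" record per line
cat > transactions.txt <<'EOF'
visa 10
amex 25
visa 5
mastercard 40
amex 15
EOF

# Map: emit a key<TAB>value pair for each record
map() {
  awk '{ print $1 "\t" $2 }'
}

# Reduce: sum the values for each key (input must already be grouped by key)
reduce() {
  awk -F '\t' '
    $1 != key { if (NR > 1) print key "\t" total; key = $1; total = 0 }
    { total += $2 }
    END { if (NR > 0) print key "\t" total }
  '
}

# sort plays the role of the grouping ("shuffle") step between map and reduce
result=$(map < transactions.txt | sort | reduce)
echo "$result"
```

Each output line is a card type followed by its total. Swapping in a different reduce function (a count, a max) changes the question being answered without touching the map step.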
With the above five Hadoop concepts in place, the next few posts will set up a proof-of-concept Hadoop cluster on Amazon EC2, processing ~12GB of publicly available data from the ‘Airline dataset’. That’s not ‘big’ as ‘big data’ goes, but it’s big enough to be fun to work with.
If you’re an R user, you’ve surely heard all the hype around ‘big data’ and how R is commonly used to analyze these volumes of data. One thing that’s often missing from the discussion is HOW to work around issues using big data and R, specifically how to deal with the fact that R stores all its objects in-memory.
While you can use packages such as ff and bigmemory to overcome the in-memory limits of your local machine, these additional packages do require some re-engineering of your code. Instead, consider using Amazon EC2 to provision the resources you need. Here are two ways to get started…
Use a Pre-Made AMI
In the great open-source tradition, there are already R AMIs for Amazon EC2 available out there to use. The way I got started was using the pre-built images that Louis Aslett provides on his website. Louis also provides great instructions on learning about EC2, so if you’ve never worked with R in the cloud or are just looking to get up and running fast, his website is a great means to do so.
Launch an Ubuntu 12.04.1 LTS 64-bit image. You can use a free-tier “t1.micro” instance while building, then provision more resources later once you’re ready for analysis.
Accept defaults until you get to the Key-Pair tab. The Key-Pair is what allows you to log in securely to your Amazon EC2 instance without a password. Create and download a Key-Pair if you don’t already have one, or choose an existing Key-Pair if you do.
When you get to the ‘Security Groups’ tab, create a security group that has the following ports open: 22 (SSH), 80 (HTTP), 443 (HTTPS), 3389 (RDP, optional), and 8787 (RStudio Server).
Work through the rest of the Wizard until your instance is launched.
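If you’d rather script the security group than click through the wizard, the port list above could be expressed roughly like this. This is only a sketch: it assumes you have the AWS command line tools installed and configured, the group name (rstudio-sg) is made up, and the commands are printed for review rather than run. The 0.0.0.0/0 range opens each port to any address, so narrow it to your own IP where you can.

```shell
#!/bin/sh
# Hypothetical security group name; use whatever you created
GROUP="rstudio-sg"
# Ports from the steps above: SSH, HTTP, HTTPS, RDP (optional), RStudio Server
PORTS="22 80 443 3389 8787"
# Assumption: open to any address; restrict this to your own IP if possible
CIDR="0.0.0.0/0"

# Build one ingress rule per port and print the commands instead of running them
commands=$(
  for port in $PORTS; do
    echo "aws ec2 authorize-security-group-ingress --group-name $GROUP --protocol tcp --port $port --cidr $CIDR"
  done
)
echo "$commands"
```

Once the printed commands look right, you can paste them into your terminal one at a time.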
Connecting to Amazon EC2 Instance
There are two ways to connect to your EC2 instance, both of which can be found by going to the “Actions” tab in the AWS console, then selecting “Connect” from the drop-down. The rest of this tutorial assumes you connect via a stand-alone SSH client (such as Terminal for Mac OSX).
Connect to your instance by typing the code provided to you, such as: ssh -i me-aws.pem firstname.lastname@example.org
Be sure that before you submit this code, you either modify the line to put the directory in front of your Key-Pair, or “cd” to the directory where the Key-Pair is located
After submitting the connect code, you will get a warning saying that the ‘authenticity can’t be established, do you want to continue?’ Type yes and hit enter to log in.
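As a small sketch of the connection steps above: the key location (~/Downloads) and the EC2_HOST value below are placeholders I’ve made up, so substitute the path where you saved your Key-Pair and the user/host shown in the “Connect” dialog. ssh will ignore a private key file that other users can read, so the sketch restricts its permissions first, then builds the connect command with the full path to the key so you don’t have to “cd” anywhere.

```shell
#!/bin/sh
# Assumptions: the Key-Pair was saved to ~/Downloads, and EC2_HOST is a
# placeholder for the user@host value the AWS console gives you
KEY_FILE="$HOME/Downloads/me-aws.pem"
EC2_HOST="user@your-instance-public-dns"

# Make the key readable by you only; skipped if the file isn't there yet
if [ -f "$KEY_FILE" ]; then
  chmod 400 "$KEY_FILE"
fi

# Full path to the key, so the command works from any directory
ssh_command="ssh -i $KEY_FILE $EC2_HOST"
echo "$ssh_command"
```

The script only prints the command; run the printed line yourself once the placeholders are filled in.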
Installing Base R
Once you are logged in, there are about a dozen commands that need to be submitted. Some commands run quickly; others can take 10-15 minutes to run through the entire installation process. Depending on the user you are logged in as, you may or may not need to type “sudo” in front of each command to have proper access rights for installation. Submit each line one at a time.
#Create a user, home directory and set password
sudo useradd rstudio
sudo mkdir /home/rstudio
sudo passwd rstudio
sudo chmod -R 0777 /home/rstudio
#Update all files from the default state
sudo apt-get update
sudo apt-get upgrade
#Add CRAN mirror to custom sources.list file using vi
sudo vi /etc/apt/sources.list.d/sources.list
#Add following line (or your favorite CRAN mirror)
deb http://lib.stat.cmu.edu/R/CRAN/bin/linux/ubuntu precise/
#Update files to use CRAN mirror
#Don't worry about error message
sudo apt-get update
#Install latest version of R
#Install without verification
sudo apt-get install r-base
While not strictly required to run R, I also like to run the following commands to install the Curl and XML packages as well, which are useful if you want to use R to connect to any web data/APIs.
#Install in order to use RCurl & XML
sudo aptitude install libcurl4-openssl-dev
sudo apt-get install libxml2-dev
With these commands run, you will now be able to run R from the command line just by typing “R” at the prompt. However, it would be a crime to do all this work and not install RStudio Server, which makes working in R so much easier.
Installing RStudio Server
Once you’ve run the above commands, you can now access RStudio through your local browser. Navigate to the Public DNS of your instance on port 8787, using an address of the form http://<your-public-dns>:8787
The login and password will be the values you used in the image creation process (I used rstudio as my username above).
Installing MySQL, Python, and MongoDB
If you’ve made it this far, I’m sure you realize that installing additional packages will only take a line or two of code. Even better, Python is installed by default on Ubuntu, so we really only need to install MySQL and MongoDB.
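Something along these lines should do it. This is a sketch rather than the exact commands: the package names (mysql-server and mongodb) are my assumption about what the default Ubuntu repositories call them, and the script prints the install commands for review instead of running them. Expect the MySQL install to prompt you for a root password.

```shell
#!/bin/sh
# Assumption: these are the package names in the default Ubuntu repositories
PACKAGES="mysql-server mongodb"

# Python should already be present; report what is found rather than assume
if command -v python >/dev/null 2>&1; then
  echo "python found at: $(command -v python)"
else
  echo "python not found on PATH"
fi

# Print one install command per package so you can review before running them
install_commands=$(
  for pkg in $PACKAGES; do
    echo "sudo apt-get install $pkg"
  done
)
echo "$install_commands"
```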
The steps above can be intimidating if you’ve never used Linux or worked on the command line, but once you get the hang of it, your ability to use R on ‘big data’ (however you define it) will be much improved. For anywhere from a few pennies to a few dollars per hour, you can use hardware having 16-64GB of RAM or more.
With the big release of R 3.0 today comes an unfortunate side effect of needing to re-install all of your packages. Luckily, R provides a pretty easy method of getting all of your packages into a list for automated re-install. Here’s how to do it for OSX users with a default install to the Library:
For Windows users, the same general process should work, assuming you change the file reference in the installed.packages function to the proper Windows location. The one downside to this method is that only packages that are listed on CRAN will be reinstalled, so if you installed anything using devtools, you’ll need to re-install those packages manually. But at the very least, the code snippet above is a quick way to re-install most of your packages.
EDIT, 4/4/13: Per Noam below, you can also use a more direct method: update.packages(ask=FALSE, checkBuilt = TRUE)