Getting Started Using Hadoop, Part 1: Intro

For the last couple of days I’ve been at the eMetrics conference in San Francisco. There were several panels that discussed big data, both from an engineering standpoint and from the business perspective of how to adopt newer technologies.

Unfortunately, there wasn’t a whole lot of practical information on how to actually get started using ‘big data’ technologies, of which Hadoop is one.  Luckily, it’s fairly easy to create a proof-of-concept Hadoop cluster using Amazon EC2 and Cloudera.

This series will be at least 5 parts, as follows:

  1. Intro to the Hadoop ecosystem and concepts
  2. Setting up a Hadoop cluster on Amazon EC2 using Cloudera
  3. Populating HDFS with airline dataset files using Hue
  4. Using Hive and/or Pig to stack the datasets into one master dataset
  5. Doing analytics on the combined Airline dataset using Pig and/or Hive

My aim with this series is to simply explain why you might want to consider using Hadoop for your data storage and processing. There’s a lot of marketing & vendor ~~bullshit~~ excitement surrounding the term ‘big data’, so for this blog series, I’m just going to focus on the most important points for an analyst/marketer to understand. And other than this sentence, there will be no mentions of MS Excel in terms of ‘big data’; Excel is barely an appropriate tool for analysis in general, let alone analysis at scale.

What Is Hadoop & Why Are People Talking About It?

At its simplest, Hadoop provides a parallel-processing framework for data storage and computation. The reason a parallel-processing framework matters for enterprise-level analysis is the physical limit on how quickly a single machine can process information.

As an example, suppose you want to create a report covering 1 trillion daily credit card transactions. It’s possible to do the calculations on your local desktop using a tool like SAS. However, processing that much data on a desktop with 8GB-16GB of RAM might take 8 hours, 10 hours...24 hours?! So an analyst can kick off a SINGLE business question at 8am and hope the answer arrives before it’s time to leave at the end of the day. Suffice it to say, that’s not a particularly efficient way to run a business.

The obvious solution might seem to be adding more processors and RAM to the desktop, but what happens when more users start asking questions? Now you need an enterprise-class database such as Oracle or Teradata (and a few million dollars!). And for every terabyte of data you want to store, budget a few thousand dollars more. And that’s just for your nicely structured data...what happens when you want to start storing data such as free-form text that isn’t so cleanly structured? Eventually, these types of engineering questions lead you toward a solution like Hadoop.

The reason why there is so much discussion around Hadoop as a data platform is that it solves the problems stated above: excessive time to process vast amounts of data and excessive cost of data storage. By using “commodity hardware” along with some fancy engineering, Hadoop provides an extremely cost-effective and flexible way to handle your enterprise data.

If Hadoop Is So Great, Why Doesn’t Everyone Use It?

“Fast, Cheap And Good. Everyone should use Hadoop!” - Every vendor in the marketplace

Just like you (probably) don’t use a screwdriver to stir a pot of chicken soup, not every data storage and analysis problem requires the extreme flexibility that Hadoop provides. In the credit card example above, a standard relational database might remain an acceptable solution if you’re just running a basic SQL query to sum across rows. But the further your data moves beyond “rows and columns” into things such as free-form text, images, and clickstream data, the more sense Hadoop makes.

While it’s a tautology, how you know you need a solution like Hadoop is when you suspect you need a solution like Hadoop! If you already have a highly functioning data mart that answers your business questions, you probably don’t need to re-engineer everything just because. If you’re an Internet startup trying to create the next Facebook, then a standard relational database probably won’t cut it.

The best example I heard at eMetrics about the need for Hadoop came from Bob Page (now at Hortonworks, a Hadoop vendor): when Bob was at eBay, for the longest time they threw away data, specifically images from listings. High storage costs were driving an undesirable business outcome (deletion), and the data itself, images, was unstructured, so a Hadoop framework made sense to implement. Once it was in place, eBay could look back across years of auctions to answer their business questions.

I’m An Analyst, Not An Engineer…What’s The Minimum I Need To Know To Get Started?

“MapReduce, Pigs, HCatalogs, Elephants, Bees, Zoos…Ooozie (Uzi’s)? WTF is everyone talking about?”

If you’ve made it this far and you’re not an engineer or DBA, you’re probably someone who’s interested in data science. You may be someone who already uses R, Python, Ruby or Java. Or, you’re a masochist. In any case, here are the minimum concepts I think you need to know to get started and to follow the later blog posts:

  • MapReduce: Not strictly a Hadoop idea, but the concept that a computation can be split into a “Map” step, which breaks the data into chunks and tags each record with a key, and a “Reduce” step, where one or more functions/transformations turn the records sharing a key into information (see the short sketch after this list). In the Hadoop sense, MapReduce generally refers to a “job”, typically written in Java, that performs a data transformation
  • HDFS: Hadoop Distributed File System. Raw data (structured or unstructured) gets imported into HDFS, then distributed across the various nodes to allow for parallel processing
  • Hive: SQL-like interface so that analysts don’t have to write MapReduce code directly
  • Pig: A scripting language used for analysis. Generally, an analyst will use Hive and/or Pig to do their work
  • HCatalog: A table-management (‘data warehouse’) layer on top of HDFS that lets you define table-like structures over your files, similar to how you define a database table (a series of columns with data types/formats)
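
To make the MapReduce idea concrete, here is a minimal sketch in plain Python (no Hadoop required) that counts and sums some toy credit card transactions by merchant, in the spirit of the example earlier in the post. The `transactions` records and field names are made up purely for illustration; in a real Hadoop job, the map and reduce steps would run in parallel across many nodes and the framework would handle the grouping.

```python
from itertools import groupby
from operator import itemgetter

# Toy records; the field names and values are hypothetical, for illustration only
transactions = [
    {"merchant": "grocery", "amount": 23.50},
    {"merchant": "airline", "amount": 412.00},
    {"merchant": "grocery", "amount": 8.75},
]

# "Map": emit a (key, value) pair for every record
mapped = [(t["merchant"], t["amount"]) for t in transactions]

# Shuffle/sort: group pairs by key (Hadoop does this between the Map and Reduce steps)
mapped.sort(key=itemgetter(0))

# "Reduce": aggregate all of the values that share a key
for merchant, pairs in groupby(mapped, key=itemgetter(0)):
    amounts = [amount for _, amount in pairs]
    print(f"{merchant}: {len(amounts)} transactions, ${sum(amounts):.2f} total")
```

This “tag each record with a key, then aggregate everything sharing that key” pattern is exactly the kind of logic Hive and Pig generate behind the scenes, which is why an analyst can usually stick to those tools instead of writing MapReduce jobs in Java.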

Next Steps

With the above five Hadoop concepts in place, the next few posts will walk through setting up a proof-of-concept Hadoop cluster on Amazon EC2 and processing ~12GB of publicly available data from the ‘Airline dataset’. That’s not ‘big’ as ‘big data’ goes, but it’s big enough to be fun to work with.
