For the last couple of days I’ve been at the eMetrics conference in San Francisco. There were several panels that discussed big data, both from an engineering standpoint as well as how to adopt newer technologies from a business perspective.
Unfortunately, there wasn’t a whole lot of practical information on how to actually get started using ‘big data’ technologies, of which Hadoop is one. Luckily, it’s fairly easy to create a proof-of-concept Hadoop cluster using Amazon EC2 and Cloudera.
This series will be at least 5 parts, as follows:
- Intro to Hadoop ecosystem and concepts
- Setting up Hadoop cluster on Amazon EC2 using Cloudera
- Populating HDFS with airline dataset files using Hue
- Use Hive and/or Pig to stack datasets into one master dataset
- Doing analytics on the combined Airline dataset using Pig and/or Hive
My aim with this series is to simply explain why you might want to consider using Hadoop for your data storage and processing. There’s a lot of marketing & vendor
bullshit excitement surrounding the term ‘big data’, so for this blog series, I’m just going to focus on the most important points for an analyst/marketer to understand. And other than this sentence, there will be no mentions of MS Excel in terms of ‘big data’, which is barely an appropriate tool for analysis in general, let alone analysis at scale.
What Is Hadoop & Why Are People Talking About It?
At it’s simplest, Hadoop provides a parallel-processing computing framework for data storage and processing. The reason why a parallel-processing framework is important for enterprise-level analysis is due to physical limitations on how quickly a single machine can process information.
As an example, suppose you want to create a report that looks at 1 trillion daily credit card transactions. It’s possible to do your calculations on your local desktop using a tool like SAS. However, the amount of time to process that much data on a desktop with 8GB-16GB of RAM might be 8 hours, 10 hours….24 hours?! So an analyst trying to get an answer can start a SINGLE business question at 8am and hope they get their answer before it’s time to leave at the end of the day. Suffice to say, not a particularly efficient way to run a business.
The solution might seem to add more processors and RAM to a desktop, but what happens when you add more users asking questions? Now you need an enterprise-class server such as Oracle or Teradata (and a few million dollars!). And for every terabyte of data you want to store, you’ll need a few thousand dollars. And that’s just for your nicely structured data…what happens when you want to start storing data such as free-form text that’s not so cleanly structured? Eventually, these types of engineering questions lead you towards a solution like Hadoop.
The reason why there is so much discussion around Hadoop as a data platform is that it solves the problems stated above: excessive time to process vast amounts of data and excessive cost of data storage. By using “commodity hardware” along with some fancy engineering, Hadoop provides an extremely cost-effective and flexible way to handle your enterprise data.
If Hadoop is so Great, Why Doesn’t Everyone Use It?
“Fast, Cheap And Good. Everyone should use Hadoop!” – Every vendor in marketplace
Just like you (probably) don’t use a screwdriver to stir a pot of chicken soup, not every data storage and analysis problem requires the extreme flexibility that Hadoop can provide. From the example above with credit card transactions, a standard relational database might continue to be an acceptable solution if you’re just running a basic SQL query to sum across the rows. But once your data starts moving beyond “rows and columns” and into things such as free-form text, images, clickstream data…the more Hadoop makes sense.
While it’s a tautology, how you know you need a solution like Hadoop is when you suspect you need a solution like Hadoop! If you already have a highly functioning data mart that answers your business questions, you probably don’t need to re-engineer everything just because. If you’re an Internet startup trying to create the next Facebook, then a standard relational database probably won’t cut it.
The best example I heard at eMetrics about the need for Hadoop was from Bob Page (now at Hortonworks, a Hadoop vendor): when Bob was at Ebay, for the longest time they were throwing away data, specifically images from the listings. So prior high storage costs leading to undesirable business outcome (deletion), unstructured data in the form of images…a Hadoop framework made sense to implement. Once implemented, Ebay could look across years of auctions to answer their business questions.
I’m An Analyst, Not An Engineer…What’s The Minimum I Need To Know To Get Started?
“MapReduce, Pigs, HCatalogs, Elephants, Bees, Zoos…Ooozie (Uzi’s)? WTF is everyone talking about?”
If you’ve made it this far and you’re not an engineer or DBA, you’re probably someone who’s interested in data science. You may be someone who already uses R, Python, Ruby or Java. Or, you’re a masochist. In any case, here are the minimum concepts I think you need to know to get started for later blog posts:
- MapReduce: Not explicitly a Hadoop idea, but the idea that data can be split into chunks by a key (“Map”) and then processed into information by one or more functions/transformations (“Reduce”). In the Hadoop sense, MapReduce is generally a reference to a “job” written in Java that performs a data transformation
- HDFS: Hadoop Distributed File System. Raw data gets imported into HDFS (either structured or unstructured), the distributed around to all of the various nodes to allow for parallel processing
- Hive: SQL-like interface so that analysts don’t have to write MapReduce code directly
- Pig: A scripting language used for analysis. Generally, an analyst will use Hive and/or Pig to do their work
- HCatalog: A ‘Data Warehouse’ layer on top of HDFS, similar to how you define a database table (a series of columns in a table with formats)
With the above five Hadoop concepts in place, the next few posts will be to set up a proof-of-concept Hadoop cluster on Amazon EC2, processing ~12GB of publicly available data from the ‘Airline dataset‘. That’s not ‘big’ as ‘big data’ goes, but it’s big enough to be fun to work with.