We’ve made it to the final post in this tutorial! In my prior posts about getting started with Hadoop, we covered the entire lifecycle, from setting up a small cluster using Amazon EC2 and Cloudera to loading data using Hue. With our data loaded in HDFS, we can finally move on to the actual analysis portion of the airline dataset using Hive and Pig.
Basic Descriptive Statistics Using Hive
In part 4 of this tutorial, we used a Hive script to create a view named “vw_airline” to hold all of our airline data. Running a simple query is as easy as typing it into the Hive window in Hue. Note that this is ANSI-standard SQL, even though we are submitting it through Hive.
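For instance, counting the number of flights by year (the column name year comes from the airline dataset’s schema) is as simple as:

-- Count the number of flights recorded for each year in the view
SELECT year, COUNT(*) AS num_flights
FROM vw_airline
GROUP BY year
ORDER BY year;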
A simple query like this is a great way to get a feel for the table, including determining whether or not the files were loaded correctly. Once the results are displayed, you can create simple visualizations like bar charts, line plots, scatterplots and pie charts. The results of this query are charted below; knowing this dataset, I can tell that the files were loaded incorrectly, because the dips at 1994 and 2004 represent too few records. Those years will need to be reloaded.
Chart: flight counts by year. 1994 and 2004 have too few rows, which was validated using wc -l 1994.csv at the command line (outside of Hadoop).
Beyond simple counts, Hive supports nearly all standard SQL syntax: aggregate functions such as SUM, COUNT, MIN and MAX, table joins, user-defined functions (UDFs), window functions…pretty much everything you are used to from other SQL tools. As far as I know, the only thing Hive doesn’t support is nested sub-queries, but that’s on the Stinger initiative’s roadmap for improving Hive. However, depending on the nested sub-query being performed, you might be able to accomplish the same thing using a LEFT SEMI JOIN.
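For example, a filter that would normally be written as a sub-query with IN can often be rewritten as a LEFT SEMI JOIN. Here’s a sketch against the vw_airline view, where busy_airports is a hypothetical lookup table used only for illustration:

-- Equivalent to: SELECT * FROM vw_airline WHERE origin IN (SELECT airport FROM busy_airports)
-- busy_airports is a hypothetical lookup table of airport codes
SELECT a.*
FROM vw_airline a
LEFT SEMI JOIN busy_airports b
  ON (a.origin = b.airport);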
Using Pig for Analytics
It’s important to realize that Hadoop isn’t just another RDBMS where you run SQL. Using Pig, you can write data-processing scripts in much the same way you would in other high-level languages such as Python or R.
For example, suppose we wanted to calculate the average distance for each route. A Pig script to calculate this might look like the following:
--Load data from view to use
air = LOAD 'default.vw_airline' USING org.apache.hcatalog.pig.HCatLoader();
--Use FOREACH to limit data to origin, dest, distance
--Concatenate origin and destination together, separated by a pipe
--CONCAT appears to only allow two arguments, which is why the function is called twice (to allow 3 arguments)
origindest = FOREACH air generate CONCAT(origin, CONCAT('|' , dest)) as route, distance;
--Group origindest dataset by route
groupedroutes = GROUP origindest BY (route);
--Calculate average distance by route
avg_distance = FOREACH groupedroutes GENERATE group, AVG(origindest.distance);
--Show results in Pig shell
dump avg_distance;
--Write out results to text file, separated by tab (default)
store avg_distance into '/user/hue/avg_distance';
While it is possible to calculate average distance using Hive and a GROUP BY statement, one of the benefits of using Pig is having control over every step of the data flow. So while Hive queries tend to answer a single question at a time, Pig allows an analyst to chain together any number of steps in a data flow. In the example above, we could pass the average distance for each route to another transformation, join it back to the original dataset or do anything else our analyst minds can imagine!
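For instance, here is a sketch of one possible extension of the script above (the aliases avg_named, joined and longer_than_avg are just illustrative): give the average a name, join it back to the route-level records, and keep only the flights that are longer than the average for their route.

--Sketch: give the per-route average an alias so it can be referenced after a join
avg_named = FOREACH groupedroutes GENERATE group AS route, AVG(origindest.distance) AS avgdist;
--Join the per-route average back to the original route-level records
joined = JOIN origindest BY route, avg_named BY route;
--Keep only flights longer than the average distance for their route
longer_than_avg = FILTER joined BY distance > avgdist;
dump longer_than_avg;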
Summary
Over these five blog posts, I’ve outlined how to get started with Hadoop and ‘Big Data’ using Amazon and Cloudera/Hortonworks. Hopefully I’ve been able to demystify the concepts and terminology behind Hadoop, show that setting up a Hadoop cluster using Cloudera on Amazon EC2 isn’t insurmountable, and demonstrate that loading and analyzing data with Hive and Pig isn’t dramatically different from using SQL on other database systems you may have encountered in the past.
While there’s a lot of hype around ‘Big Data’, data sizes aren’t going to be getting any smaller in the future. So spend the $20 in AWS charges and build a Hadoop cluster! There’s no better way to learn than by doing…
#Generate example dataframe with character column
example <- as.data.frame(c("A","A","B","F","C","G","C","D","E","F"))
names(example) <- "strcol"

#For every unique value in the string column, create a new 1/0 column
#This is what Factors do "under-the-hood" automatically when passed to function requiring numeric data
for(level in unique(example$strcol)){
  example[paste("dummy", level, sep = "_")] <- ifelse(example$strcol == level, 1, 0)
}
As the code above shows, it’s trivial to generate your own 1/0 columns of data instead of relying on Factors. There are two things to keep in mind when creating your own dummy variables:
The problem you are trying to solve
How much RAM you have available
While it may make sense to generate dummy variables for Customer State (~50 for the United States), if you were to use the code above on City Name, you’d likely either run out of RAM or find out that there are too many levels to be useful. Of course, with any qualitative statement such as “too many levels to be useful”, oftentimes the only way to definitively know is to try it! Just make sure you save your work before running this code, just in case you run out of RAM. Or, use someone else’s computer for testing 😉
Edit 1/2/14: John Myles White brought up a good point via Twitter about RAM usage:
@randyzwitch If you're running out of RAM with dummy variables, you probably want to use a sparse matrix instead of a data.frame.
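He’s right: for a column with many levels, a sparse representation avoids materializing mostly-zero dense columns. Here’s a minimal sketch using the Matrix package, applied to the example data frame above (sparse.model.matrix treats strcol as a factor, and the -1 drops the intercept so every level gets its own indicator column):

library(Matrix)

#Build the 0/1 indicator columns as a sparse matrix instead of dense data.frame columns
#The "-1" drops the intercept so every level of strcol gets its own column
sparse_dummies <- sparse.model.matrix(~ strcol - 1, data = example)
sparse_dummies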
When I was working as a digital analytics consultant, no question caused quite as many belly laughs AND as much angst as, “Can you send me an updated copy of your implementation documentation?” I saw companies spending six or seven figures annually on their analytics infrastructure and millions more in employee salaries, and yet the only way to understand what data they were collecting was to inspect their JavaScript code.
Luckily for Adobe Analytics customers, the API provides a means of generating the framework for a properly-documented implementation. Here’s how to do it using RSiteCatalyst.
Generating Adobe Analytics documentation file
The code below outlines the commands needed to generate an Excel file (see example) with six tabs containing the basic structure of an Adobe Analytics implementation. This report contains all of the report suites you have access to, the elements that reports can be broken down by, traffic variables (props), conversion variables (eVars), success events and segments available for reporting.
Additionally, each tab contains metadata with the various settings for each variable, so you’ll be able to document the expiration settings for eVars, participation, list variables, segment types and so on.
library("RSiteCatalyst")library("WriteXLS")#Validate that underlying Perl modules for WriteXLS are installed correctly#Will return "Perl found. All required Perl modules were found" if installed correctlytestPerl()#### 1. Pull data for all report suites to create one comprehensive report #####Authenticate with Adobe Analytics APISCAuth("user:company","sharedsecret")#Get Report Suitesreport_suites<-GetReportSuites()#Get Available Elementselements<-GetElements(report_suites$rsid)#Get eVarsevars<-GetEvars(report_suites$rsid)#Get Segmentssegments<-GetSegments(report_suites$rsid)#Get Success Eventsevents<-GetSuccessEvents(report_suites$rsid)#Get Traffic Varsprops<-GetProps(report_suites$rsid)#### 2. Generate a single Excel file#Create list of report suite objects, written as stringsobjlist<-c("report_suites","elements","evars","segments","events","props")#Write out Excel file with auto-width columns, a bolded header row and filters turned onWriteXLS(objlist,"/Users/randyzwitch/Desktop/adobe_analytics_implementation_doc.xlsx",AdjWidth=TRUE,BoldHeaderRow=TRUE,AutoFilter=TRUE)
There are only two “gotchas” to keep in mind when using the script above: the user running it will only receive data for report suites they have access to (which is determined by the Admin panel settings within Adobe Analytics), and you need to have the WriteXLS package installed to write to Excel. WriteXLS uses Perl under the hood, so you’ll need to validate that the package is installed correctly, which is done using the package’s testPerl() function.
This is pretty bare-bones, no?
After you run this code, you’ll have an Excel file with all of the underlying characteristics of your Adobe Analytics implementation. It’s important to realize that this is only the starting point; a great set of documentation will contain other pieces of information, such as where/when each value is set (on entry, every page, when certain events occur, etc.), a layman’s explanation of what the data element means, and other business context so your stakeholders can be confident they are using the data correctly. Additionally, you might consider creating a separate Excel file for every report suite in your implementation; it’s trivial to modify the code above to subset each data frame for a single value of rsid, then write out separate Excel files, as sketched below. Regardless of how you structure your documentation, DOCUMENT YOUR IMPLEMENTATION! The employees that come after you (and your future self!) will thank you.
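Here’s a rough sketch of that per-report-suite approach (it assumes each of the data frames pulled above contains an rsid column, which is worth verifying against your RSiteCatalyst version):

#Sketch: write one Excel file per report suite
#Assumes each data frame above contains an rsid column; verify before relying on this
for(rs in report_suites$rsid){
  report_suites_rs <- subset(report_suites, rsid == rs)
  elements_rs <- subset(elements, rsid == rs)
  evars_rs <- subset(evars, rsid == rs)
  segments_rs <- subset(segments, rsid == rs)
  events_rs <- subset(events, rsid == rs)
  props_rs <- subset(props, rsid == rs)

  objlist_rs <- c("report_suites_rs", "elements_rs", "evars_rs", "segments_rs", "events_rs", "props_rs")

  WriteXLS(objlist_rs, paste0("/Users/randyzwitch/Desktop/", rs, "_implementation_doc.xlsx"),
           AdjWidth = TRUE, BoldHeaderRow = TRUE, AutoFilter = TRUE)
}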
EDIT 2/4/2016: Thanks to reader @CSitty for pointing out the R code became a little stale. The documentation generating code should now work again for RSiteCatalyst versions >= 1.4 and WriteXLS >= 4.0 (basically, any current version as of the time of this update).