Maybe I Don't Really Know R After All

Lately, I’ve been feeling that I’m spreading myself too thin in terms of programming languages. At work, I spend most of my time in Hive/SQL, with the occasional Python for my smaller data. I really prefer Julia, but I’m alone at work on that one. And since I maintain a package on CRAN (RSiteCatalyst), I frequently spend my evenings bug fix programming in R. Then, there’s the desire to learn a Java-based language like Scala (or, Java)…maybe Spark for my Hadoop work…

So last night, when I ran into this series of follies with R, it really makes me wonder if I really understand how R works.

jsonlite:fromJSON

As part of the overall concept of my RSiteCatalyst package, I’m trying to make it as easy as possible for digital analysts to get their data via the Adobe Analytics API.  As such, I abstract away the need to build JSON to request reports and parse the API answer from JSON to a data frame. Sometimes it’s easy, but sometimes you get something like this:

nested_r_dataframe

In case it’s not clear what’s going on here, fromJSON() from jsonlite returns a data frame as best as it can, but we have a list (of data frames!) nested inside of a column named “breakdown”. There are 12 rows here, but the proper data structure would be to take the data frame inside of ‘breakdown’ and append all of the fields from the original 12 rows, repeating the values down the rows. So something like 72 rows (12 original rows, 6 row data frames inside of the ‘breakdown’ column).

Loop and Accumulate

Because this is such a small data frame, and because *apply functions are too frustrating in most cases, to parse this I went with the tried-and-true loop and accumulate. But instead of immediately getting what I wanted, I got this fantastic R error message:

1
2
3
4
5
6
7
8
9
10
11
12
#Loop over df and accumulate results
parsed_df <- data.frame()
for(i in 1:nrow(df)){
  temp <- cbind(df[i,],breakdown_list[[i]])
  parsed_df <- rbind(parsed_df, temp)
}

There were 12 warnings (use warnings() to see them)
>warnings()
Warning messages:
  1: In data.frame(..., check.names = FALSE) :
  row names were found from a short variable and have been discarded

Row names from a short variable? Off to StackOverflow, the savior of all language hackers, which lets me know I just need to add an argument to my cbind() function. Trying again:

1
2
3
4
5
6
7
8
9
10
11
12
13
#Loop over df and accumulate results
#Adding row.names = NULL fixes error message
parsed_df <- data.frame()
for(i in 1:nrow(df)){
  temp <- cbind(df[i,],breakdown_list[[i]], row.names = NULL)
  parsed_df <- rbind(parsed_df, temp)
}

names(parsed_df)

> names(parsed_df)
 [1] "name"           "year"           "month"          "day"            "hour"           "minute"         "breakdownTotal"
 [8] "name"           "trend"          "counts"  

So I successfully created an (84,10)-sized data frame, but cbind() allowed me to name two columns in the data frame “name”! Running ‘parsed_df$name’ at the REPL returns the first instance. So now, I have to use the unstable method of referring to the second ‘name’ column by position number if I want to access it (or, rename it using names() of course). The way I realized this behavior was occurring was that I tried to use plyr::rename and kept changing the name of two columns!

Final Solution

In order to get past my duplicate name issue, I eventually renamed the ‘name’ columns individually by each object, prior to cbind():

1
2
3
4
5
6
7
8
9
10
11
12
13
14
#Separate breakdown list and original data frame into different objects
df <- ex_df$report$data
breakdown_list <- df$breakdown
df$breakdown <- NULL

#Loop over df and accumulate results
parsed_df <- data.frame()
for(i in 1:nrow(df)){
  right_df <-  breakdown_list[[i]]
  right_df <- rename(right_df, replace=c("name" = report_raw$report$elements$id[2]))
  temp <- cbind(df[i,],right_df, row.names = NULL)
  parsed_df <- rbind(parsed_df, temp)
}
parsed_df <- rename(parsed_df, replace=c("counts" = report_raw$report$metrics$id))

In the end, I found an answer to my solution, but it seems like every time I use R the more oddities I’m able to encounter/generate. At this point, I’m starting to question whether I really understand the underpinnings of how R works. It might be time to stop trying to be a language polyglot so much and focus on really learning a few of these tools in-depth.


Using Julia As A "Glue" Language

While much of the focus in the Julia community has been on the performance aspects of Julia relative to other scientific computing languages, Julia is also perfectly suited to ‘glue’ together multiple data sources/languages. In this blog post, I will cover how to create an interactive plot using Gadfly.jl, by first preparing the data using Hadoop and Teradata Aster via ODBC.jl.

The example problem I am going to solve is calculating and visualizing the number of airplanes by hour in the air at any given time in the U.S. for the year 1987. Because of the structure and storage of the underlying data, I will need to write some custom Hive code, upload the data to Teradata Aster via a command-line utility, re-calculate the number of flights per hour using a built-in Aster function, then using Julia to visualize the data.

Step 1: Getting Data From Hadoop

In a prior set of blog posts, I talked about loading the airline dataset into Hadoop, then analyzing the dataset using Hive or Pig. Using ODBC.jl, we can use Hive via Julia to submit our queries. The hardest part of setting up this process is making sure that you have the appropriate Hive drivers for your Hadoop cluster and credentials (which isn’t covered here). Once you have your DSN set up, running Hive queries is as easy as the following:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
using ODBC

#Connect to Hadoop cluster via Hive (pre-defined Windows DSN in ODBC Manager)
hiveconn = ODBC.connect("Production hiveserver2"; usr="your-user-name", pwd="your-password-here")

#Clean data, return results directly to file
#Data returned with have origin of flight, flight takeoff, flight landing and elapsed time
hive_query_string =
"select
origin,
from_unixtime(flight_takeoff_datetime_origin) as flight_takeoff_datetime_origin,
from_unixtime(flight_takeoff_datetime_origin + (actualelapsedtime * 60)) as flight_landing_datetime_origin,
actualelapsedtime
from
(select
origin,
unix_timestamp(CONCAT(year,\"-\", month, \"-\", dayofmonth, \" \", SUBSTR(LPAD(deptime, 4, 0), 1, 2), \":\", SUBSTR(LPAD(deptime, 4, 0), 3, 4), \":\", \"00\"))  as flight_takeoff_datetime_origin,
actualelapsedtime
from vw_airline
where year = 1987 and actualelapsedtime > 0) inner_query;"

#Run query, save results directly to file
query(hive_query_string, hiveconn;output="C:\\airline_times.csv",delim=',')

In this code, I’ve written my query as a Julia string, to keep my code easily modifiable. Then, I pass the Julia string object to the query() function, along with my ODBC connection object. This query runs on Hadoop through Hive, then streams the result directly to my local hard drive, making this a very RAM efficient (though I/O inefficient!) operation.

Step 2: Shelling Out To Load Data To Aster

Once I created the file with my Hadoop results in it, I now have a decision point: I can either A) do the rest of the analysis in Julia or B) use a different tool for my calculations. Because this is a toy example, I’m going to use Teradata Aster to do my calculations, which provides a convenient function called burst() to regularize timestamps into fixed intervals. But before I can use Aster to ‘burst’ my data, I first need to upload it to the database.

While I could loop over the data within Julia and insert each record one at a time, Teradata provides a command-line utility to upload data in parallel. Running command-line scripts from within Julia is as easy as using the run() command, with each command surrounded in backticks:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
#Connect to Aster (pre-defined Windows DSN in ODBC Manager)
asterconn = ODBC.connect("aster01"; usr="your-user-name", pwd="your-password")

#Create table to hold airline results
create_airline_table_statement =
"create table ebi_temp.airline
(origin varchar,
flight_takeoff_datetime_origin timestamp,
flight_landing_datetime_origin timestamp,
actualelapsedtime int,
partition key (origin))"

#Execute query
query(create_airline_table_statement, asterconn)

#Create airport table
#Data downloaded from http://openflights.org/data.html
create_airport_table_statement =
"create table ebi_temp.airport
(airport_id int,
name varchar,
city varchar,
country varchar,
IATAFAA varchar,
ICAO varchar,
latitude float,
longitude float,
altitude int,
timezone float,
dst varchar,
partition key (country))"

#Execute query
query(create_airport_table_statement, asterconn)

#Upload data via run() command
#ncluster_loader utility already on Windows PATH
run(`ncluster_loader -h 192.168.1.1 -U your-user-name -w your-password -d aster01 -c --skip-rows=1 --el-enabled --el-table e_dist_error_2 --el-schema temp temp.airline C:\\airline_times.csv`)

run(`ncluster_loader -h 192.168.1.1 -U your-user-name -w your-password -d aster01 -c --el-enabled --el-table e_dist_error_2 --el-schema temp temp.airport C:\\airports.dat`)

While I could’ve run this at the command-line, having all of this within an IJulia Notebook keeps all my work together, should I need to re-run this in the future.

Step 3: Using Aster For Calculations

With my data now loaded in Aster, I can normalize the timestamps to UTC, then ‘burst’ the data into regular time intervals. Again, all of this can be done via ODBC from within Julia:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
#Normalize timestamps from local time to UTC time
aster_view_string = "
create view temp.vw_airline_times_utc as
select
row_number() over(order by flight_takeoff_datetime_origin) as unique_flight_number,
origin,
flight_takeoff_datetime_origin,
flight_landing_datetime_origin,
flight_takeoff_datetime_origin - (INTERVAL '1 hour' * timezone) as flight_takeoff_datetime_utc,
flight_landing_datetime_origin - (INTERVAL '1 hour' * timezone) as flight_landing_datetime_utc,
timezone
from temp.airline
left join temp.airport on (airline.origin = airport.iatafaa);"

#Execute query
query(aster_view_string, asterconn)

#Teradata Aster SQL-H functionality, accessed via ODBC query
burst_query_string =
"create table temp.airline_burst_hour distribute by hash (origin) as
SELECT
*,
\"INTERVAL_START\"::date as calendar_date,
extract(HOUR from \"INTERVAL_START\") as hour_utc
FROM BURST(
     ON (select
        unique_flight_number,
        origin,
        flight_takeoff_datetime_utc,
        flight_landing_datetime_utc
        FROM temp.vw_airline_times_utc
)
     START_COLUMN('flight_takeoff_datetime_utc')
     END_COLUMN('flight_landing_datetime_utc')
     BURST_INTERVAL('3600')
);"

#Execute query
query(burst_query_string, asterconn)

Since it might not be clear what I’m doing here, the burst() function in Aster takes a row of data with a start and end timestamp, and (potentially) returns multiple rows which normalize the time between the timestamps. If you’re familiar with pandas in Python, it’s a similar functionality to resample on a series of timestamps.

Step 4: Download Smaller Data Into Julia, Visualize

Now that the data has been processed from Hadoop to Aster through a series of queries, we now have a much smaller dataset that can be loaded into RAM and processed by Julia:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
#Calculate the number of flights per hour per day
flights_query = "
select
calendar_date,
hour_utc,
sum(1) as num_flights
from temp.airline_burst_hour
group by 1,2
order by 1,2;"

#Bring results into Julia DataFrame
flights_per_day = query(flights_query, asterconn)

using Gadfly

#Create boxplot, with one box plot per hour
set_default_plot_size(20cm, 12cm)
p = plot(flights_per_day , x="hour_utc", y="num_flights",
            Guide.xlabel("Hour UTC"),
            Guide.ylabel("Flights In Air"),
            Guide.title("Number of Flights In Air To/From U.S. By Hour - 1987"),
            Scale.y_continuous(minvalue=0, maxvalue=4000),
            Geom.boxplot)

The Gadfly code above produces the following plot:

gadfly

Since this chart is in UTC, it might not be obvious what the interpretation is of the trend. Because the airline dataset represents flights either leaving or returning to the United States, there are many fewer planes in the air overnight and the early morning hours (UTC 7-10, 2-5am Eastern). During the hours when the airports are open, there appears to be a limit of roughly 2500 planes per hour in the sky.

Why Not Do All Of This In Julia?

At this point, you might be tempted to wonder why go through all of this effort? Couldn’t this all be done in Julia?

Yes, you probably could do all of this work in Julia with a sufficiently large amount of RAM. As a proof-of-concept, I hope I’ve shown that there is much more to Julia than micro-benchmarking Julia’s speed relative to other scientific programming languages. You’ll notice that in none of my code have I used any type annotations, as none would really make sense (nor would they improve performance).  And although this is a toy example purposely using multiple systems, I much more frequently use Julia in this manner at work than doing linear algebra or machine learning.

So next time you’re tempted to use Python or R or shell scripting or whatever, consider Julia as well. Julia is just as at-home as a scripting language as a scientific computing language.


Five Hard-Won Lessons Using Hive

EDIT, 9/8/2016: Hive has come a long way in the two years since I’ve written this. While some of the code snippets might still work, it’s likely the case that this information is so out-of-date to be nothing more than a reflection of working with Hadoop in 2014.

I’ve been spending a ton of time lately on the data engineering side of ‘data science’, so I’ve been writing a lot of Hive queries. Hive is a great tool for querying large amounts of data, without having to know very much about the underpinnings of Hadoop. Unfortunately, there are a lot of things about Hive (version 0.12 and before) that aren’t quite the same as SQL and have caused me a bunch of frustration; here they are, in no particular order.

1. Set Hive Temp directory To Same As Final Output Directory

When doing a “Create Table As” (CTAS) statement in Hive, Hive allocates temp space for the Map and Reduce portions of the job. If you’re not lucky, the temp space for the job will be somewhere different than where your table actually ends up being saved, resulting in TWO I/O operations instead of just one. This can lead to a painful delay in when your Hive job says it is finished vs. when the table becomes available (one time, I saw a 30 hour delay writing 5TB of data).

If your Hive jobs seem to hang after the Job Tracker says they are complete, try this setting at the beginning of your session:

set hive.optimize.insert.dest.volume=true;

2. Column Aliasing In Group By/Order By

Not sure why this isn’t a default, but if you want to be able to reference your column names by position (i.e. group by 1,2) instead of by name (i.e. group by name, age), then run this at the beginning of your session:

set hive.groupby.orderby.position.alias=true;

3. Be Aware Of Predicate Push-Down Rules

In Hive, you can get great performance gains if you A) partition your table by commonly used columns/business concepts (i.e. Day, State, Market, etc.) and B) you use the partitions in a WHERE clause. These are known as partition-based queries. Otherwise, if you don’t use a partition in your WHERE clause, you will get a full table scan.

Unfortunately, when doing an OUTER JOIN, Hive will sometimes ignore the fact that your WHERE clause is on a partition and do a full table scan anyway. In order to get Hive to push your predicate down and avoid a full table scan, put your predicate on the JOIN instead of the WHERE clause:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
--#### Assume sales Hive table partitioned by day_id ####--

--Full Table Scan
select
employees.id,
b.sales
from employees
left join sales on (employees.id = sales.employee_id)
where day_id between '2014-03-01' and '2014-05-31';

--Partitioned-based query
select
employees.id,
b.sales
from employees
left join sales on (employees.id = sales.employee_id and sales.day_id between '2014-03-01' and '2014-05-31');

If you don’t want to think about the different rules, you can generally put your limiting clauses inside your JOIN clause instead of on your WHERE clause. It should just be a matter of preference (until your query performance indicates it isn’t!)

4. Calculate And Append Percentiles Using CROSS JOIN

Suppose you want to calculate the top 10% of your customers by sales. If you try to do the following, Hive will complain about needing a GROUP BY, because percentile_approx() is a summary function:

1
2
3
4
5
6
--Hive expects that you want to calculate your percentiles by account_number and sales
--This code will generate an error about a missing GROUP BY statement
select
account_number,
sales,
CASE WHEN sales > percentile_approx(sales, .9) THEN 1 ELSE 0 END as top10pct_sales

To get around the the need for a GROUP BY, we can use a CROSS JOIN. A CROSS JOIN is another name for a Cartesian Join, meaning all of the rows from the first table will be joined to ALL of the rows of the second table. Because the subquery only returns one row, the CROSS JOIN provides the desired affect of joining the percentile values back to the original table while keeping the same number of rows from the original table. Generally, you don’t want to do a CROSS JOIN (because relational data generally is joined on a key), but this is a good use case.

5.  Calculating a Histogram

Creating a histogram using Hive should be as simple as calling the histogram_numeric() function. However, the syntax and results of this function are just plain weird. To create a histogram, you can run the following:

1
2
3
4
5
6
7
select
histogram_numeric(salary, 20) as salary_hist
from
sample_08;

--Results
[{"x":23507.68627450983,"y":255.0},{"x":31881.7647058824,"y":340.0},{"x":39824.11498257844,"y":287.0},{"x":47615.58011049725,"y":181.0},{"x":55667.01219512195,"y":164.0},{"x":59952.499999999985,"y":8.0},{"x":66034.67153284674,"y":137.0},{"x":75642.31707317074,"y":82.0},{"x":82496.13636363638,"y":44.0},{"x":91431.66666666667,"y":60.0},{"x":100665.71428571428,"y":21.0},{"x":107326.66666666667,"y":15.0},{"x":121248.74999999999,"y":16.0},{"x":142070.0,"y":2.0},{"x":153896.6666666667,"y":6.0},{"x":162310.0,"y":6.0},{"x":169810.0,"y":2.0},{"x":176740.0,"y":2.0},{"x":193925.0,"y":8.0},{"x":206770.0,"y":2.0}]

The results of this query comes back as a list, which is very un-SQL like! To get the data as a table, we can use LATERAL VIEW and EXPLODE:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
SELECT
   CAST(hist.x as int) as bin_center,
   CAST(hist.y as bigint) as bin_height
FROM (select
      histogram_numeric(salary, 20) as salary_hist
      from
      sample_08) a
LATERAL VIEW explode(salary_hist) exploded_table as hist;

--Results
	bin_center	bin_height
0	23507	255
1	31881	340
2	39824	287
3	47615	181
4	55667	164
5	59952	8
6	66034	137
7	75642	82
8	82496	44
9	91431	60
10	100665	21
11	107326	15
12	121248	16
13	142070	2
14	153896	6
15	162310	6
16	169810	2
17	176740	2
18	193925	8
19	206770	2

However, now that we have a table of data, it’s still not clear how to create a histogram, as the center of variable-width bins is what is returned by Hive. The Hive documentation for histogram_numeric() references Gnuplot, Excel, Mathematica and MATLAB, which I can only assume can deal with plotting the centers?  Eventually I’ll figure out how to deal with this using R or Python, but for now, I just use the table as a quick gauge of what the data looks like.


  • Using RSiteCatalyst With Microsoft PowerBI Desktop
  • RSiteCatalyst Version 1.4.14 Release Notes
  • RSiteCatalyst Version 1.4.13 Release Notes
  • RSiteCatalyst Version 1.4.12 (and 1.4.11) Release Notes
  • Self-Service Adobe Analytics Data Feeds!
  • RSiteCatalyst Version 1.4.10 Release Notes
  • WordPress to Jekyll: A 30x Speedup
  • Bulk Downloading Adobe Analytics Data
  • Adobe Analytics Clickstream Data Feed: Calculations and Outlier Analysis
  • Adobe: Give Credit. You DID NOT Write RSiteCatalyst.
  • RSiteCatalyst Version 1.4.8 Release Notes
  • Adobe Analytics Clickstream Data Feed: Loading To Relational Database
  • Calling RSiteCatalyst From Python
  • RSiteCatalyst Version 1.4.7 (and 1.4.6.) Release Notes
  • RSiteCatalyst Version 1.4.5 Release Notes
  • Getting Started: Adobe Analytics Clickstream Data Feed
  • RSiteCatalyst Version 1.4.4 Release Notes
  • RSiteCatalyst Version 1.4.3 Release Notes
  • RSiteCatalyst Version 1.4.2 Release Notes
  • Destroy Your Data Using Excel With This One Weird Trick!
  • RSiteCatalyst Version 1.4.1 Release Notes
  • Visualizing Website Pathing With Sankey Charts
  • Visualizing Website Structure With Network Graphs
  • RSiteCatalyst Version 1.4 Release Notes
  • Maybe I Don't Really Know R After All
  • Building JSON in R: Three Methods
  • Real-time Reporting with the Adobe Analytics API
  • RSiteCatalyst Version 1.3 Release Notes
  • Adobe Analytics Implementation Documentation in 60 Seconds
  • RSiteCatalyst Version 1.2 Release Notes
  • Clustering Search Keywords Using K-Means Clustering
  • RSiteCatalyst Version 1.1 Release Notes
  • Anomaly Detection Using The Adobe Analytics API
  • (not provided): Using R and the Google Analytics API
  • My Top 20 Least Useful Omniture Reports
  • For Maximum User Understanding, Customize the SiteCatalyst Menu
  • Effect Of Modified Bounce Rate In Google Analytics
  • Adobe Discover 3: First Impressions
  • Using Omniture SiteCatalyst Target Report To Calculate YOY growth
  • Data Science Without Leaving the GPU
  • Getting Started With MapD, Part 2: Electricity Dataset
  • Getting Started With MapD, Part 1: Docker Install and Loading Data
  • Parallelizing Distance Calculations Using A GPU With CUDAnative.jl
  • Building a Data Science Workstation (2017)
  • JuliaCon 2015: Everyday Analytics and Visualization (video)
  • Vega.jl, Rebooted
  • Sessionizing Log Data Using data.table [Follow-up #2]
  • Sessionizing Log Data Using dplyr [Follow-up]
  • Sessionizing Log Data Using SQL
  • Review: Data Science at the Command Line
  • Introducing Twitter.jl
  • Code Refactoring Using Metaprogramming
  • Evaluating BreakoutDetection
  • Creating A Stacked Bar Chart in Seaborn
  • Visualizing Analytics Languages With VennEuler.jl
  • String Interpolation for Fun and Profit
  • Using Julia As A "Glue" Language
  • Five Hard-Won Lessons Using Hive
  • Using SQL Workbench with Apache Hive
  • Getting Started With Hadoop, Final: Analysis Using Hive & Pig
  • Quickly Create Dummy Variables in a Data Frame
  • Using Amazon EC2 with IPython Notebook
  • Adding Line Numbers in IPython/Jupyter Notebooks
  • Fun With Just-In-Time Compiling: Julia, Python, R and pqR
  • Getting Started Using Hadoop, Part 4: Creating Tables With Hive
  • Tabular Data I/O in Julia
  • Hadoop Streaming with Amazon Elastic MapReduce, Python and mrjob
  • A Beginner's Look at Julia
  • Getting Started Using Hadoop, Part 3: Loading Data
  • Innovation Will Never Be At The Push Of A Button
  • Getting Started Using Hadoop, Part 2: Building a Cluster
  • Getting Started Using Hadoop, Part 1: Intro
  • Instructions for Installing & Using R on Amazon EC2
  • Video: SQL Queries in R using sqldf
  • Video: Overlay Histogram in R (Normal, Density, Another Series)
  • Video: R, RStudio, Rcmdr & rattle
  • Getting Started Using R, Part 2: Rcmdr
  • Getting Started Using R, Part 1: RStudio
  • Learning R Has Really Made Me Appreciate SAS