Visualizing Analytics Languages With VennEuler.jl

It often doesn’t take much to get me off track, and on a holiday weekend…well, I was just begging for a fun way to shirk. Enter Harlan Harris:

Hey, I’m someone looking for something to do! And I like writing Julia code! So let’s have a look at recreating this diagram in Julia using VennEuler.jl (IJulia Notebook link):

Source: Revolution R/KDNuggets

http://blog.revolutionanalytics.com/2014/08/r-tops-kdnuggets-data-analysis-software-poll-for-4th-consecutive-year.html

Installing VennEuler.jl

Because VennEuler.jl is not in METADATA as of the time of writing, instead of using Pkg.add() you’ll need to run:

1
Pkg.clone("https://github.com/HarlanH/VennEuler.jl.git")

Note that VennEuler uses some of the more exotic packages (at least to me) like NLopt and Cairo, so you might need to have a few additional dependencies installed with the package.

Data

The data was a bit confusing to me at first, since the percentages add up to more than 100% (people could vote multiple times). In order to create a dataset to use, I took the percentages, multiplied by 1000, then re-created the voting pattern. The data for the graph can be downloaded from this link.

Code - Circles

With a few modifications, I basically re-purposed Harlan’s code from the package test files. The circle result is as follows:

1
2
3
4
5
6
7
8
9
10
11
12
13
using VennEuler

data, labels = readcsv("/home/rzwitch/Desktop/kdnuggets_language_survey_2014.csv", header=true)
data = bool(data)
labels = vec(labels)

#Circles
eo = make_euler_object(labels, data, EulerSpec()) # circles, for now

(minf,minx,ret) = optimize(eo, random_state(eo), ftol=-1, xtol=0.0025, maxtime=120, pop=1000)
println("got $minf at $minx (returned $ret)")

render("/home/rzwitch/Desktop/kd.svg", eo, minx)

venneulercircles

Since the percentage of R, SAS, and Python users isn’t too dramatically different (49.81%, 33.42%, 40.97% respectively) and the visualizations are circles, it’s a bit hard to tell that R is about 16% points higher than SAS and 9% points higher than Python.

Code - Rectangles

Alternatively, we can use rectangles to represent the areas:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
using VennEuler

data, labels = readcsv("/home/rzwitch/Desktop/kdnuggets_language_survey_2014.csv", header=true)
data = bool(data)
labels = vec(labels)

# Rectangles
eo = make_euler_object(labels, data, [EulerSpec(:rectangle), EulerSpec(:rectangle, [.5, .5, .4], [0, 0, 0]),
    EulerSpec(:rectangle)],
    sizesum=.3)


(minf,minx,ret) = optimize_iteratively(eo, random_state(eo), ftol=-1, xtol=0.0025, maxtime=5, pop=100)
println("phase 1: got $minf at $minx (returned $ret)")
(minf,minx,ret) = optimize(eo, minx, ftol=-1, xtol=0.001, maxtime=30, pop=100)
println("phase 2: got $minf at $minx (returned $ret)")

render("/home/rzwitch/Desktop/kd-rects.svg", eo, minx)

venneulerrectangles

Here, it’s a slight bit easier to see that SAS and Python are about the same area-wise and that R is larger, although the different dimensions do obscure this fact a bit.

Summary

If I spent more time with this package, I’m sure I could make something even more aesthetically pleasing. And for that matter, it’s still a pre-production package that will no doubt get better in the future. But at the very least, there is a way to create an area-proportional representation of relationships using VennEuler.jl in Julia.


String Interpolation for Fun and Profit

In a previous post, I showed how I frequently use Julia as a ‘glue’ language to connect multiple systems in a complicated data pipeline. For this blog post, I will show two more examples where I use Julia for general programming, rather than for computationally-intense programs.

String Building: Introduction

The Strings section of the Julia Manual provides a very in-depth treatment of the considerations when using strings within Julia. For the purposes of my examples, there are only three things to know:

  • Strings are immutable within Julia and 1-indexed
  • Strings are easily created through the a syntax familiar to most languages:
julia> authorname = "randy zwitch"
"randy zwitch"

julia> typeof(authorname)
String
  • String interpolation is easiest done using dollar-sign notation. Additionally, parenthesis can be used to avoid symbol ambiguity:
julia> interpolated = "the author of this blog post is $(authorname)"
"the author of this blog post is randy zwitch"

If you are using large volumes of textual data, you’ll want to pay attention to the difference between the various string types that Julia provides (UTF8/16/32, ASCII, Unicode, etc), but for the purposes of this blog post we’ll just be using the ASCIIString type by not explicitly declaring the string type and only using ASCII characters.

EDIT, 9/8/2016: Starting with version 0.5, Julia defaults to the String type, which is an UTF-8 character encoding.

Example 1: Repetitive Queries

As part of my data engineering responsibilities at work, I often get requests to pull a sample of every table in a new database in our Hadoop cluster. This type of request is usually from the business owner, who wants to evaluate the data set has been imported correctly, but doesn’t actually want to write any sort of queries. So using the ODBC.jl package, I repeatedly do the same select * from <tablename> query and save to individual .tab files:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
_
_       _ _(_)_     |  A fresh approach to technical computing
(_)     | (_) (_)    |  Documentation: http://docs.julialang.org
_ _   _| |_  __ _   |  Type "help()" to list help topics
| | | | | | |/ _` |  |
| | |_| | | | (_| |  |  Version 0.3.0-prerelease+4028 (2014-07-02 23:42 UTC)
_/ |\__'_|_|_|\__'_|  |  Commit 2185bd1 (11 days old master)
|__/                   |  x86_64-w64-mingw32
julia> using ODBC
julia> ODBC.connect("Production hiveserver2", usr="", pwd="")
ODBC Connection Object
----------------------
Connection Data Source: Production hiveserver2
Production hiveserver2 Connection Number: 1
Contains resultset? No
julia> tables = query("show tables in db;");
elapsed time: 0.167028049 seconds
julia> for tbl in tables[:tab_name]
query("select * from db.$(tbl) limit 1000;"; output="C:\\data_dump\\$(tbl).tab", delim='\t')
end

julia>

While the query is simple, writing/running this hundreds of times would be a waste of effort. So with a simple loop over the array of tables, I can provide a sample of hundreds of tables in .tab files with five lines of code.

Example 2: Generating Query Code

In another task, I was asked to join a handful of Hive tables, then transpose the table from “long” to “wide”, so that each id value only had one row instead of multiple. This is fairly trivial to do using CASE statements in SQL; the problem arises when you have thousands of potential row values to transpose into columns! Instead of getting carpal tunnel syndrome typing out thousands of CASE statements, I decided to use Julia to generate the SQL code itself:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
#Starting portion of query, the groupby columns
julia> groupbycols ="select
interact.interactionid,
interact.agentname,
interact.agentid,
interact.agentgroup,
interact.agentsupervisor,
interact.sitename,
interact.dnis,
interact.agentextension,
interact.interactiondirection,
interact.interactiontype,
interact.customerid,
interact.customercity,
interact.customerstate,
interact.interactiondatetime,
interact.durationinms,"

#Generate CASE statements based on the number of possible values of queryid
julia> function casestatements(repetitions::Int64)
	for queryid in 1:repetitions
		println("MAX(CASE WHEN q.queryid = $queryid then q.score END) as q$(queryid)score,")
	end
	for queryid in 1:repetitions
		println("MIN(CASE WHEN q.queryid = $queryid then q.startoffsetinms END) as q$(queryid)startoffset,")
	end
	for queryid in 1:repetitions
		println("MAX(CASE WHEN q.queryid = $queryid then q.endoffsetinms END) as q$(queryid)endoffset,")
	end
	#Last clause, so repeat it up to number of repetitions minus 1, then do simple print to get line without comma at end
	for queryid in 1:repetitions-1
		println("SUM(CASE WHEN q.queryid = $queryid and q.score > q.mediumthreshold THEN 1 END) as q$(queryid)hits,")
	end
	println("SUM(CASE WHEN q.queryid = $repetitions and q.score > q.mediumthreshold THEN 1 END) as q$(repetitions)hits")
end

#Ending table statement
julia> tablestatements = "from db.table1 as interact
left join db.table2 as q on (interact.interactionid = q.interactionid)
left join db.table3 as t on (interact.interactionid = t.interactionid)
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15;"

#Submitting all of the statements on one line is usually frowned upon, but this will generate my SQL code
julia> println(groupbycols);casestatements(5);println(tablestatements)
select
interact.interactionid,
interact.agentname,
interact.agentid,
interact.agentgroup,
interact.agentsupervisor,
interact.sitename,
interact.dnis,
interact.agentextension,
interact.interactiondirection,
interact.interactiontype,
interact.customerid,
interact.customercity,
interact.customerstate,
interact.interactiondatetime,
interact.durationinms,
MAX(CASE WHEN q.queryid = 1 then q.score END) as q1score,
MAX(CASE WHEN q.queryid = 2 then q.score END) as q2score,
MAX(CASE WHEN q.queryid = 3 then q.score END) as q3score,
MAX(CASE WHEN q.queryid = 4 then q.score END) as q4score,
MAX(CASE WHEN q.queryid = 5 then q.score END) as q5score,
MIN(CASE WHEN q.queryid = 1 then q.startoffsetinms END) as q1startoffset,
MIN(CASE WHEN q.queryid = 2 then q.startoffsetinms END) as q2startoffset,
MIN(CASE WHEN q.queryid = 3 then q.startoffsetinms END) as q3startoffset,
MIN(CASE WHEN q.queryid = 4 then q.startoffsetinms END) as q4startoffset,
MIN(CASE WHEN q.queryid = 5 then q.startoffsetinms END) as q5startoffset,
MAX(CASE WHEN q.queryid = 1 then q.endoffsetinms END) as q1endoffset,
MAX(CASE WHEN q.queryid = 2 then q.endoffsetinms END) as q2endoffset,
MAX(CASE WHEN q.queryid = 3 then q.endoffsetinms END) as q3endoffset,
MAX(CASE WHEN q.queryid = 4 then q.endoffsetinms END) as q4endoffset,
MAX(CASE WHEN q.queryid = 5 then q.endoffsetinms END) as q5endoffset,
SUM(CASE WHEN q.queryid = 1 and q.score > q.mediumthreshold THEN 1 END) as q1hits,
SUM(CASE WHEN q.queryid = 2 and q.score > q.mediumthreshold THEN 1 END) as q2hits,
SUM(CASE WHEN q.queryid = 3 and q.score > q.mediumthreshold THEN 1 END) as q3hits,
SUM(CASE WHEN q.queryid = 4 and q.score > q.mediumthreshold THEN 1 END) as q4hits,
SUM(CASE WHEN q.queryid = 5 and q.score > q.mediumthreshold THEN 1 END) as q5hits
from db.table1 as interact
left join db.table2 as q on (interact.interactionid = q.interactionid)
left join db.table3 as t on (interact.interactionid = t.interactionid)
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15;

julia>

The example here only repeats the CASE statements five times, which wouldn’t really be that much typing. However, for my actual application, the number of possible values was 2153, leading to a query result which was 8157 columns! Suffice to say, I’d still be writing that code if I decided to do it by hand.

Summary

Like my ‘glue language’ post, I hope this post has shown that Julia can be used for more than grunting about microbenchmark performance. Whereas I used to use Python for doing weird string operations like this, I’m finding that the dollar-sign syntax in Julia feels more comfortable for me than the Python string formatting mini-language (although that’s not particularly difficult either). So if you’ve been hesitant to jump into learning Julia because you think it’s only useful for doing Mandelbrot calculations or complex linear algebra, Julia is just as at-home doing quick general programming tasks as well.


Maybe I Don't Really Know R After All

Lately, I’ve been feeling that I’m spreading myself too thin in terms of programming languages. At work, I spend most of my time in Hive/SQL, with the occasional Python for my smaller data. I really prefer Julia, but I’m alone at work on that one. And since I maintain a package on CRAN (RSiteCatalyst), I frequently spend my evenings bug fix programming in R. Then, there’s the desire to learn a Java-based language like Scala (or, Java)…maybe Spark for my Hadoop work…

So last night, when I ran into this series of follies with R, it really makes me wonder if I really understand how R works.

jsonlite:fromJSON

As part of the overall concept of my RSiteCatalyst package, I’m trying to make it as easy as possible for digital analysts to get their data via the Adobe Analytics API.  As such, I abstract away the need to build JSON to request reports and parse the API answer from JSON to a data frame. Sometimes it’s easy, but sometimes you get something like this:

nested_r_dataframe

In case it’s not clear what’s going on here, fromJSON() from jsonlite returns a data frame as best as it can, but we have a list (of data frames!) nested inside of a column named “breakdown”. There are 12 rows here, but the proper data structure would be to take the data frame inside of ‘breakdown’ and append all of the fields from the original 12 rows, repeating the values down the rows. So something like 72 rows (12 original rows, 6 row data frames inside of the ‘breakdown’ column).

Loop and Accumulate

Because this is such a small data frame, and because *apply functions are too frustrating in most cases, to parse this I went with the tried-and-true loop and accumulate. But instead of immediately getting what I wanted, I got this fantastic R error message:

1
2
3
4
5
6
7
8
9
10
11
12
#Loop over df and accumulate results
parsed_df <- data.frame()
for(i in 1:nrow(df)){
  temp <- cbind(df[i,],breakdown_list[[i]])
  parsed_df <- rbind(parsed_df, temp)
}

There were 12 warnings (use warnings() to see them)
>warnings()
Warning messages:
  1: In data.frame(..., check.names = FALSE) :
  row names were found from a short variable and have been discarded

Row names from a short variable? Off to StackOverflow, the savior of all language hackers, which lets me know I just need to add an argument to my cbind() function. Trying again:

1
2
3
4
5
6
7
8
9
10
11
12
13
#Loop over df and accumulate results
#Adding row.names = NULL fixes error message
parsed_df <- data.frame()
for(i in 1:nrow(df)){
  temp <- cbind(df[i,],breakdown_list[[i]], row.names = NULL)
  parsed_df <- rbind(parsed_df, temp)
}

names(parsed_df)

> names(parsed_df)
 [1] "name"           "year"           "month"          "day"            "hour"           "minute"         "breakdownTotal"
 [8] "name"           "trend"          "counts"  

So I successfully created an (84,10)-sized data frame, but cbind() allowed me to name two columns in the data frame “name”! Running ‘parsed_df$name’ at the REPL returns the first instance. So now, I have to use the unstable method of referring to the second ‘name’ column by position number if I want to access it (or, rename it using names() of course). The way I realized this behavior was occurring was that I tried to use plyr::rename and kept changing the name of two columns!

Final Solution

In order to get past my duplicate name issue, I eventually renamed the ‘name’ columns individually by each object, prior to cbind():

1
2
3
4
5
6
7
8
9
10
11
12
13
14
#Separate breakdown list and original data frame into different objects
df <- ex_df$report$data
breakdown_list <- df$breakdown
df$breakdown <- NULL

#Loop over df and accumulate results
parsed_df <- data.frame()
for(i in 1:nrow(df)){
  right_df <-  breakdown_list[[i]]
  right_df <- rename(right_df, replace=c("name" = report_raw$report$elements$id[2]))
  temp <- cbind(df[i,],right_df, row.names = NULL)
  parsed_df <- rbind(parsed_df, temp)
}
parsed_df <- rename(parsed_df, replace=c("counts" = report_raw$report$metrics$id))

In the end, I found an answer to my solution, but it seems like every time I use R the more oddities I’m able to encounter/generate. At this point, I’m starting to question whether I really understand the underpinnings of how R works. It might be time to stop trying to be a language polyglot so much and focus on really learning a few of these tools in-depth.


  • RSiteCatalyst Version 1.4.12 (and 1.4.11) Release Notes
  • Self-Service Adobe Analytics Data Feeds!
  • RSiteCatalyst Version 1.4.10 Release Notes
  • WordPress to Jekyll: A 30x Speedup
  • Bulk Downloading Adobe Analytics Data
  • Adobe Analytics Clickstream Data Feed: Calculations and Outlier Analysis
  • Adobe: Give Credit. You DID NOT Write RSiteCatalyst.
  • RSiteCatalyst Version 1.4.8 Release Notes
  • Adobe Analytics Clickstream Data Feed: Loading To Relational Database
  • Calling RSiteCatalyst From Python
  • RSiteCatalyst Version 1.4.7 (and 1.4.6.) Release Notes
  • RSiteCatalyst Version 1.4.5 Release Notes
  • Getting Started: Adobe Analytics Clickstream Data Feed
  • RSiteCatalyst Version 1.4.4 Release Notes
  • RSiteCatalyst Version 1.4.3 Release Notes
  • RSiteCatalyst Version 1.4.2 Release Notes
  • Destroy Your Data Using Excel With This One Weird Trick!
  • RSiteCatalyst Version 1.4.1 Release Notes
  • Visualizing Website Pathing With Sankey Charts
  • Visualizing Website Structure With Network Graphs
  • RSiteCatalyst Version 1.4 Release Notes
  • Maybe I Don't Really Know R After All
  • Building JSON in R: Three Methods
  • Real-time Reporting with the Adobe Analytics API
  • RSiteCatalyst Version 1.3 Release Notes
  • Adobe Analytics Implementation Documentation in 60 Seconds
  • RSiteCatalyst Version 1.2 Release Notes
  • Clustering Search Keywords Using K-Means Clustering
  • RSiteCatalyst Version 1.1 Release Notes
  • Anomaly Detection Using The Adobe Analytics API
  • (not provided): Using R and the Google Analytics API
  • My Top 20 Least Useful Omniture Reports
  • For Maximum User Understanding, Customize the SiteCatalyst Menu
  • Effect Of Modified Bounce Rate In Google Analytics
  • Adobe Discover 3: First Impressions
  • Using Omniture SiteCatalyst Target Report To Calculate YOY growth
  • Google Analytics Individual Qualification (IQ) - Passed!
  • Google Analytics SEO reports: Not Ready For Primetime?
  • An Afternoon With Edward Tufte
  • Google Analytics Custom Variables: A Page-Level Example
  • Xchange 2011: Think Tank and Harbor Cruise
  • Google Analytics for WordPress: Two Methods
  • WordPress Stats or Google Analytics? Yes!
  • Building a Data Science Workstation (2017)
  • JuliaCon 2015: Everyday Analytics and Visualization (video)
  • Vega.jl, Rebooted
  • Sessionizing Log Data Using data.table [Follow-up #2]
  • Sessionizing Log Data Using dplyr [Follow-up]
  • Sessionizing Log Data Using SQL
  • Review: Data Science at the Command Line
  • Introducing Twitter.jl
  • Code Refactoring Using Metaprogramming
  • Evaluating BreakoutDetection
  • Creating A Stacked Bar Chart in Seaborn
  • Visualizing Analytics Languages With VennEuler.jl
  • String Interpolation for Fun and Profit
  • Using Julia As A "Glue" Language
  • Five Hard-Won Lessons Using Hive
  • Using SQL Workbench with Apache Hive
  • Getting Started With Hadoop, Final: Analysis Using Hive & Pig
  • Quickly Create Dummy Variables in a Data Frame
  • Using Amazon EC2 with IPython Notebook
  • Adding Line Numbers in IPython/Jupyter Notebooks
  • Fun With Just-In-Time Compiling: Julia, Python, R and pqR
  • Getting Started Using Hadoop, Part 4: Creating Tables With Hive
  • Tabular Data I/O in Julia
  • Hadoop Streaming with Amazon Elastic MapReduce, Python and mrjob
  • A Beginner's Look at Julia
  • Getting Started Using Hadoop, Part 3: Loading Data
  • Innovation Will Never Be At The Push Of A Button
  • Getting Started Using Hadoop, Part 2: Building a Cluster
  • Getting Started Using Hadoop, Part 1: Intro
  • Instructions for Installing & Using R on Amazon EC2
  • Video: SQL Queries in R using sqldf
  • Video: Overlay Histogram in R (Normal, Density, Another Series)
  • Video: R, RStudio, Rcmdr & rattle
  • Getting Started Using R, Part 2: Rcmdr
  • Getting Started Using R, Part 1: RStudio
  • Learning R Has Really Made Me Appreciate SAS