JuliaCon 2015: Everyday Analytics and Visualization (video)

At long last, here’s the video of my presentation from JuliaCon 2015, discussing common analytics tasks and visualization. This is really two talks, the first being an example of using the Citi Bike NYC API to analyze ridership of their public bike program, and the second a discussion of the Vega.jl package.

Speaking at JuliaCon 2015 at MIT CSAIL is the professional highlight of my year; hopefully even more of you will attend next year.

Enjoy!

Edit: For those of you who would like to follow along using the actual presentation code, it is available on GitHub.

CitiBank Bike Data

Vega.jl Presentation


Apple MacBook Pro Model A1286 Declared Vintage - The End Of An Era

It’s hard to believe it’s been over 2.5 years since I wrote about my experience with Apple trying to get my Broken MacBook Pro Hinge fixed. Since that time, my Late 2008 MacBook Pro continued to work flawlessly, most of the time keeping up with the scientific programming I would do in R, Python or Julia.

Unfortunately, it seems near impossible (if not completely impossible) to get an OEM A1281 battery as a drop-in replacement. When I went to the Apple Store at Suburban Square, PA, the “Genius” who looked at my computer took 15-20 minutes to look on the Apple website (which I obviously did before arriving, so no value-add there), only to show me a battery in stock that didn’t fit my model of computer. Only after I shamed him into looking up the actual part number was he able to utter the phrase:

Oh, no, we don’t have those any more. Your model MacBook Pro was declared “Vintage”. No more original parts are available from Apple.

Of course it is. After getting home, I was able to find this service bulletin from Apple, which outlines which models are obsolete. Apparently, it’s a hard and fast rule that once five years from the end of manufacturing arrives, a model is declared vintage (unless local laws require longer service). So even though the only “problem” with my MacBook Pro is that I was only getting one hour of battery life per charge (or less if I’m compiling code), the computer is destined for a new life somewhere else.

“Vintage” For Me, Powerful For Thee

While I realize I could go the 3rd-party route and get a replacement battery, at some point, you can only spend so much money keeping older technology alive. Since I use computers pretty intensively, I ended up getting a “new” (used) 2011 MacBook Pro from a neighborhood listing that has decent life on the OEM battery. Surprisingly, I was able to get $360 for my Late-2008 MacBook Pro, being fully honest about the condition, issues and battery life. The older woman who I sold it to fully understood, but worked at a desk and didn’t care about the battery! She also said:

This is easily the most powerful computer I’ve ever owned.

Apple, like I said in my original post, you’ve got a customer for life. And while I’ve moved on to a newer machine, it’s beyond amazing to me that a 7-year-old computer will continue to live on and work at a high level of performance. And with my 2011 MacBook Pro, I still have the option to upgrade the parts (though I don’t need to…SSD, 16GB of RAM and a quad-core i7 processor already!)

The Retina MacBooks are nice, but very incremental. Here’s hoping the 2011 MacBook Pro lasts as long as my Late 2008 MacBook Pro did!


Authenticated API Testing Using Travis CI

As I’ve become more serious about contributing in the open-source community, having quality tests for my packages has been something I’ve spent much more time on than when I was just writing quick-and-dirty code for my own purposes. My most used open-sourced package is RSiteCatalyst, which accesses the Adobe Analytics (authenticated) API, which poses a problem: how do you maintain a project on GitHub with a full test suite, while at the same time not hard-coding your credentials in plain sight for everyone to see?

The answer ends up being using encrypted environment variables within Travis CI.

Testthat!

Hadley Wickham provides a great testing framework in testthat; while I wouldn’t go as far as he does to say that the package makes testing fun, it certainly makes testing easy. Let’s take a look at some of the tests in RSiteCatalyst for the QueueOvertime function:

test_that("Validate QueueOvertime using legacy credentials", {

  skip_on_cran()

  #Correct [masked] credentials
  SCAuth(Sys.getenv("USER", ""), Sys.getenv("SECRET", ""))

  #Single Metric, No granularity (summary report)
  aa <- QueueOvertime("zwitchdev",
                      "2014-01-01",
                      "2014-12-31",
                      "visits",
                      "")

  #Validate returned value is a data.frame
  expect_is(aa, "data.frame")

  #Single Metric, Daily Granularity
  bb <- QueueOvertime("zwitchdev",
                      "2014-01-01",
                      "2014-12-31",
                      "visits",
                      "day")

  #Validate returned value is a data.frame
  expect_is(bb, "data.frame")

  #Single Metric, Week Granularity
  cc <- QueueOvertime("zwitchdev",
                      "2014-01-01",
                      "2014-12-31",
                      "visits",
                      "week")

  #Validate returned value is a data.frame
  expect_is(cc, "data.frame")

  #Two Metrics, Week Granularity
  dd <- QueueOvertime("zwitchdev",
                      "2014-01-01",
                      "2014-12-31",
                      c("visits", "pageviews"),
                      "week")

  #Validate returned value is a data.frame
  expect_is(dd, "data.frame")

  #Two Metrics, Month Granularity, Social Visitors
  ee <- QueueOvertime("zwitchdev",
                      "2014-01-01",
                      "2014-12-31",
                      c("visits", "pageviews"),
                      "month",
                      "5433e4e6e4b02df70be4ac63")

  #Validate returned value is a data.frame
  expect_is(ee, "data.frame")

  #Two Metrics, Day Granularity, Social Visitors, Anomaly Detection
  ff <- QueueOvertime("zwitchdev",
                      "2014-01-01",
                      "2014-12-31",
                      c("visits", "pageviews"),
                      "day",
                      "5433e4e6e4b02df70be4ac63",
                      anomaly.detection = "1")

  #Validate returned value is a data.frame
  expect_is(ff, "data.frame")

})

From the code above, you can see the tests are fairly simple; for several combinations of the function’s arguments, I test to see if a data frame was returned. This is because, for the most part, RSiteCatalyst is just a means of generating JSON calls, submitting them to the Adobe Analytics API, then parsing the results into an R data frame.

Since there is very little additional logic in the package, I don’t spend much time testing what data is actually returned (i.e. what is returned depends on the Adobe Analytics API, not R). What is interesting is line 6, the SCAuth() call: I reference Sys.getenv() twice in order to pass in my username and key for the Adobe Analytics API. That feels very “interactive R”, but the goal is automated testing. Filling in those two environment variables is where Travis CI comes in.
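One consequence of reading credentials from the environment is that the tests will fail on any machine where the variables are empty. A minimal sketch of a guard for that case is below; has_credentials() is a hypothetical helper of my own naming, not part of RSiteCatalyst, and the default variable names simply mirror the ones used in the test above. Keep in mind that USER is commonly defined by the login shell on Unix-like systems, so a non-empty value there doesn't prove the API username was supplied.

```r
# Hypothetical helper (not part of RSiteCatalyst): TRUE only when every
# named environment variable is set to a non-empty value.
has_credentials <- function(vars = c("USER", "SECRET")) {
  values <- Sys.getenv(vars, unset = "")
  all(nzchar(values))
}

# Inside a test_that() block, this could guard the authenticated calls:
#   testthat::skip_if_not(has_credentials(), "API credentials not set")
```

With a guard like this, a missing credential would show up as a skipped test rather than an authentication error.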

Travis CI Configuration

In order to have any automation using Travis CI, you need to create a .travis.yml configuration file. While you can read the Travis docs to create the .travis.yml file for R, you’re probably better off just using the use_travis function from devtools (also from Hadley, little surprise!) to create the file for you. To create encrypted keys to use with Travis, you’ll need the Travis CLI tool, which is distributed as a Ruby gem (i.e. package). If you view the RSiteCatalyst .travis.yml file, you can see that I define two global “secure” variables, the values of which are the output from running a command similar to the following in the Travis CLI tool:

$ travis encrypt RANDY=ZWITCH
Please add the following to your .travis.yml file:

  secure: "b6S4dBc7arvox8UpuFqkz+VP2UmAW/S/B/vgaAdZiZQqUp78YDR6VYdAYN3WisCK1VLGjOVVPQvGxLik0pQokF8FU3sjX0ekH6vSJeqg4utrEZmVtNvdDLEVAmagFy8Fyduow3U4CPW7rzXqvAE4cIVqGR5Lv2KLf8ANUGn+y3E="

Pro Tip: You can add it automatically by running with --add.

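If this seems insecure, note that every time you run the encrypt command with the same arguments, you get a different value. As I understand it, that is not because Travis CI creates new keys each time: each repository has a single RSA key pair, and the encryption step adds random padding, so the same input never produces the same ciphertext twice.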

Setting Up Authenticated Testing Locally

If you get as far as setting up encrypted Travis CI keys and tests using testthat, the final step is really for convenience. With the .travis.yml file, Travis CI sets the R environment variables on THEIR system; on your local machine, the environment variables aren’t set. Even if the environment variables were set, they would be set to the Travis CI encrypted values, which is not what I want to pass to the authentication function in my R package.

To set the authentication variables locally, so that they are available each time you hit ‘check’ to build and check for CRAN errors, you just need to modify the .Renviron file for R:

USER="myusername"
SECRET="mysecret"

With that minor change, in addition to the .travis.yml file, you’ll have a seamless environment for developing and testing R packages.
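If you want to confirm that R actually picks up values written in this format, here is a small sketch using base R’s readRenviron(). To avoid touching a real .Renviron, it writes a throwaway file to a temporary location; the DEMO_USER and DEMO_SECRET names and the missing_vars() helper are hypothetical, chosen only for illustration.

```r
# Write a throwaway file in .Renviron format (placeholder values only)
env_file <- tempfile(fileext = ".Renviron")
writeLines(c('DEMO_USER="myusername"', 'DEMO_SECRET="mysecret"'), env_file)

# readRenviron() loads the KEY=value pairs into the current R session
readRenviron(env_file)

# Hypothetical helper: names of the variables that are unset or empty
missing_vars <- function(vars) {
  vars[!nzchar(Sys.getenv(vars, unset = ""))]
}

# An empty result means every listed variable is visible to R
missing_vars(c("DEMO_USER", "DEMO_SECRET"))

# Clean up the temporary file
unlink(env_file)
```

For a real setup, the same missing_vars() call with your own variable names, run in a fresh R session, is a quick way to see whether the .Renviron edit took effect.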

Testing Is Like Flossing…

As easy as the testthat and devtools packages make testing, and as inexpensive as Travis CI is as a service (free for open source projects!), there’s really no excuse to provide packaged-up code and not include tests. Hopefully this blog post has demonstrated that it’s possible to include tests even when authentication is necessary, without compromising your credentials.

So let’s all be sure to include tests, not just pay lip service to the idea that testing is useful. Code testing only works if you actually do it 🙂

