It felt like it would never happen, but RSiteCatalyst v1.4 is now available on CRAN! There are numerous changes in this version of the package, so unlike previous posts, there won’t be any code examples.
THIS VERSION IS ONE BIG BREAKING CHANGE
While not the most important improvement, it can’t be stressed enough that migrating to v1.4 of RSiteCatalyst is likely going to require re-writing some of your prior code. There are numerous reasons for the breaking changes, including:
1. Adobe made breaking changes to the API between v1.3 and v1.4, so we had to as well
2. I partnered with Willem Paling, who merged his RAA codebase into RSiteCatalyst to contribute most of the code in this version
3. Better consistency in R functions around keywords and options
Of the changes listed above, I think #2 and #3 are the biggest benefit to end-users of RSiteCatalyst. The codebase is now much cleaner and more consistent in terms of the keyword arguments, has better error handling, and having a second person helping maintain the project has led to a better overall package.
Where you’ll see the most difference is that all keyword arguments are now lowercase, with multi-word keyword arguments separated by a period instead of underscores or weird caMelCAse. We tried to maintain the same keyword order where possible to minimize code re-writes.
Pathing and Fallout Reports
Probably the most useful improvement to RSiteCatalyst comes from those breaking changes by Adobe: the inclusion of Pathing and Fallout reports! I can’t say with absolute certainty, but I think that with these two additional reports, the API is pretty much at parity with the Adobe Analytics interface itself. So now you can create your funnels using ggplot2, make force-directed graphs or Sankey charts using d3Network, or just do simple reporting of top ‘Next Pages’ and the like.
Support for OAuth Authentication
As part of Adobe’s commitment to consolidating systems under the single Adobe Marketing Cloud, authentication with the API using OAuth is now possible. How to set up OAuth authentication is beyond the scope of this blog post, but you can get more information at this link: Adobe Marketing Cloud OAuth.
For those of you who don’t have OAuth credentials set up yet, the “legacy” version of authentication is still available in RSiteCatalyst.
GetClassifications, Inline Segmentation and More
Finally, there is now additional functionality on the descriptive side: you can download which Classifications are defined for a report suite, segments can be defined inline (i.e. from R) for the ‘Queue’ reports using the BuildClassificationValueSegment() function, and many functions that existed in previous versions of RSiteCatalyst now accept more options than before.
Summary/We Want To Hear From You
While this new version of RSiteCatalyst has some annoying breaking changes, overall the package is much more robust than prior versions. I think the increase in functionality is well worth the minor annoyance of re-writing some code. Additionally, Adobe will eventually deprecate v1.3 of their API, so it’s better to move over sooner rather than later.
But for all of the improvements that have been made, there’s always room for more, whether it’s fixing unforeseen bugs, adding new features, improving the documentation or anything else. For all suggestions, bug fixes and the like, please submit them to the GitHub repository so that Willem and I can evaluate and incorporate them. We’re also VERY open to any of you in the R community who are able to patch the code or add new features. As a friend in the data science community says, a Pull Request is always better than a Feature Request 🙂
Note that VennEuler.jl uses some packages that are exotic (at least to me), like NLopt and Cairo, so you might need to install a few additional dependencies along with the package.
The data was a bit confusing to me at first, since the percentages add up to more than 100% (people could vote multiple times). In order to create a dataset to use, I took the percentages, multiplied by 1000, then re-created the voting pattern. The data for the graph can be downloaded from this link.
Code - Circles
With a few modifications, I basically re-purposed Harlan’s code from the package test files. The circle result is as follows:
using VennEuler

data, labels = readcsv("/home/rzwitch/Desktop/kdnuggets_language_survey_2014.csv", header=true)
data = bool(data)
labels = vec(labels)

#Circles
eo = make_euler_object(labels, data, EulerSpec()) # circles, for now

(minf, minx, ret) = optimize(eo, random_state(eo), ftol=-1, xtol=0.0025, maxtime=120, pop=1000)
println("got $minf at $minx (returned $ret)")

render("/home/rzwitch/Desktop/kd.svg", eo, minx)
Since the percentages of R, SAS, and Python users aren’t too dramatically different (49.81%, 33.42%, and 40.97%, respectively) and the visualizations are circles, it’s a bit hard to tell that R is about 16 percentage points higher than SAS and 9 percentage points higher than Python.
Code - Rectangles
Alternatively, we can use rectangles to represent the areas:
using VennEuler

data, labels = readcsv("/home/rzwitch/Desktop/kdnuggets_language_survey_2014.csv", header=true)
data = bool(data)
labels = vec(labels)

# Rectangles
eo = make_euler_object(labels, data, [EulerSpec(:rectangle), EulerSpec(:rectangle, [.5, .5, .4], [0, 0, 0]),
     EulerSpec(:rectangle)], sizesum=.3)

(minf, minx, ret) = optimize_iteratively(eo, random_state(eo), ftol=-1, xtol=0.0025, maxtime=5, pop=100)
println("phase 1: got $minf at $minx (returned $ret)")
(minf, minx, ret) = optimize(eo, minx, ftol=-1, xtol=0.001, maxtime=30, pop=100)
println("phase 2: got $minf at $minx (returned $ret)")

render("/home/rzwitch/Desktop/kd-rects.svg", eo, minx)
Here, it’s slightly easier to see that SAS and Python are about the same area-wise and that R is larger, although the different dimensions do obscure this fact a bit.
If I spent more time with this package, I’m sure I could make something even more aesthetically pleasing. And for that matter, it’s still a pre-production package that will no doubt get better in the future. But at the very least, there is a way to create an area-proportional representation of relationships using VennEuler.jl in Julia.
In a previous post, I showed how I frequently use Julia as a ‘glue’ language to connect multiple systems in a complicated data pipeline. For this blog post, I will show two more examples where I use Julia for general programming, rather than for computationally-intense programs.
String Building: Introduction
The Strings section of the Julia Manual provides a very in-depth treatment of the considerations when using strings within Julia. For the purposes of my examples, there are only three things to know:
Strings are immutable within Julia and 1-indexed
Strings are easily created through a syntax familiar to most languages:
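For example, here's a minimal sketch of my own illustrating the first two points at the REPL:

```julia
# Strings are created with familiar double-quote syntax
s = "hello"

# 1-indexed: the first character is s[1], not s[0]
first_char = s[1]              # 'h'

# Immutable: s[1] = 'H' would throw an error; build a new string instead
t = string("H", s[2:end])      # "Hello"
```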
String interpolation is most easily done using dollar-sign notation. Additionally, parentheses can be used to avoid symbol ambiguity:
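A quick sketch of my own (variable names are just for illustration):

```julia
name = "Julia"
version = 0.3

# Basic dollar-sign interpolation
greeting = "Hello from $name!"        # "Hello from Julia!"

# Parentheses avoid ambiguity when the variable abuts other text;
# "$versionbeta" would look up a variable named 'versionbeta'
label = "v$(version)beta"             # "v0.3beta"
```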
If you are using large volumes of textual data, you’ll want to pay attention to the difference between the various string types that Julia provides (UTF8/16/32, ASCII, Unicode, etc), but for the purposes of this blog post we’ll just be using the ASCIIString type by not explicitly declaring the string type and only using ASCII characters.
EDIT, 9/8/2016: Starting with version 0.5, Julia defaults to the String type, which is a UTF-8 character encoding.
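If you're curious which concrete type a given literal gets, typeof will tell you (the output depends on your Julia version, per the note above):

```julia
# On the 0.3-era Julia used in this post, an ASCII-only literal is ASCIIString
# and a literal with non-ASCII characters is UTF8String; on 0.5+ both are String
typeof("hello")
```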
Example 1: Repetitive Queries
As part of my data engineering responsibilities at work, I often get requests to pull a sample of every table in a new database in our Hadoop cluster. This type of request usually comes from the business owner, who wants to verify that the data set has been imported correctly but doesn’t actually want to write any queries. So using the ODBC.jl package, I repeatedly run the same select * from <tablename> query and save the results to individual .tab files:
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.3.0-prerelease+4028 (2014-07-02 23:42 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 2185bd1 (11 days old master)
|__/                   |  x86_64-w64-mingw32

julia> using ODBC

julia> ODBC.connect("Production hiveserver2", usr="", pwd="")
ODBC Connection Object
----------------------
Connection Data Source: Production hiveserver2
Production hiveserver2 Connection Number: 1
Contains resultset? No

julia> tables = query("show tables in db;");
elapsed time: 0.167028049 seconds

julia> for tbl in tables[:tab_name]
           query("select * from db.$(tbl) limit 1000;"; output="C:\\data_dump\\$(tbl).tab", delim='\t')
       end

julia>
While the query is simple, writing/running this hundreds of times would be a waste of effort. So with a simple loop over the array of tables, I can provide a sample of hundreds of tables in .tab files with five lines of code.
Example 2: Generating Query Code
In another task, I was asked to join a handful of Hive tables, then transpose the table from “long” to “wide”, so that each id value only had one row instead of multiple. This is fairly trivial to do using CASE statements in SQL; the problem arises when you have thousands of potential row values to transpose into columns! Instead of getting carpal tunnel syndrome typing out thousands of CASE statements, I decided to use Julia to generate the SQL code itself:
#Starting portion of query, the group by columns
julia> groupbycols = "select
       interact.durationinms,"

#Generate CASE statements based on the number of possible values of queryid
julia> function casestatements(repetitions::Int64)
           for queryid in 1:repetitions
               println("MAX(CASE WHEN q.queryid = $queryid then q.score END) as q$(queryid)score,")
           end
           for queryid in 1:repetitions
               println("MIN(CASE WHEN q.queryid = $queryid then q.startoffsetinms END) as q$(queryid)startoffset,")
           end
           for queryid in 1:repetitions
               println("MAX(CASE WHEN q.queryid = $queryid then q.endoffsetinms END) as q$(queryid)endoffset,")
           end
           #Last clause, so repeat it up to number of repetitions minus 1, then do simple print to get line without comma at end
           for queryid in 1:repetitions-1
               println("SUM(CASE WHEN q.queryid = $queryid and q.score > q.mediumthreshold THEN 1 END) as q$(queryid)hits,")
           end
           println("SUM(CASE WHEN q.queryid = $repetitions and q.score > q.mediumthreshold THEN 1 END) as q$(repetitions)hits")
       end

#Ending table statement
julia> tablestatements = "from db.table1 as interact
       left join db.table2 as q on (interact.interactionid = q.interactionid)
       left join db.table3 as t on (interact.interactionid = t.interactionid)"
The example here only repeats the CASE statements five times, which wouldn’t really be that much typing. However, for my actual application, the number of possible values was 2153, leading to a query result with 8157 columns! Suffice it to say, I’d still be writing that code if I had decided to do it by hand.
Like my ‘glue language’ post, I hope this post has shown that Julia can be used for more than grunting about microbenchmark performance. Whereas I used to use Python for weird string operations like this, I’m finding that the dollar-sign syntax in Julia feels more comfortable to me than the Python string-formatting mini-language (although that’s not particularly difficult either). So if you’ve been hesitant to jump into learning Julia because you think it’s only useful for Mandelbrot calculations or complex linear algebra, know that Julia is just as at home doing quick general programming tasks.