Calling RSiteCatalyst From Python

This will be a very short post, because the only “new” information I’m going to provide is the minimal example to answer the question. Yes, it is in fact possible to call RSiteCatalyst from Python and seems to work well. The most important things are 1) making sure you install rpy2 and 2) loading Pandas (since so much of RSiteCatalyst are API calls returning data frames). It doesn’t hurt to already have experience using RSiteCatalyst in R, since all we’re doing here is using Python to pass code to R.

Setup Code: rpy2 and Pandas

To call an R package from Python, the rpy2 package works very well, both from the REPL and Jupyter Notebook. For RSiteCatalyst, here is the set up code:

With this code run, now you can make calls to the RSiteCatalyst R package, just as if you were in R itself.

Sample Call: GetReportSuites

Just to prove it works, here’s a code snippet using GetReportSuites():
And in Jupyter Notebook, you would see something similar to:
rsitecatalyst-rpy2

But, Why?

So that’s about it…if you wanted to, you could call RSiteCatalyst from Python without much effort. There aren’t a whole lot of reasons to do so, unless like Adam above, you’d rather just use Python. I suppose if you wanted to use some other Python packages, such as Flask to create a dashboard or Seaborn for visualization you might want to do this. Until I got this tweet, it never occurred to me to do this, so YMMV.

Edit, 2/26/16: Adam Gitzes, who originally asked the question, also provides a different solution using Jupyter Notebook magics at his blog post here

RSiteCatalyst Version 1.4.7 (and 1.4.6.) Release Notes

It seems as though I missed release notes for version RSiteCatalyst 1.4.6, so we’ll do those and RSiteCatalyst 1.4.7 (now on CRAN) and the same time…

RSiteCatalyst 1.4.6

This release was mostly tweaking some settings, specifically:

  • Adding a second `top` argument within the Queue* functions for more control on results returned. It used to be the case that a breakdown report with the `top` argument would return, say, the top 10 values of the first variable and up to 50,000 values for the breakdown. Now you can control the second level breakdown as well, such as the top 10 pages and top 5 browsers for those pages.
  • Disable checking of the API call before submitting. I never ran into this, but a user was seeing that the API would return errors in validation under high volume. So if you have any weird issues, disable validation using the `validate = FALSE` keyword argument.
  • The package now handles situation where API returns an unexpected type for the reportID and automatically converts it to the proper type (low-level issue, not a user-facing issue)

Those changes carry forward into version RSiteCatalyst 1.4.7, so there is no reason for a user to stick with this release.

RSiteCatalyst 1.4.7 – No more Unicode Errors!

I was surprised it took so long for someone to report this error, but #151 finally reported a case from a user in Germany where search keywords were being mangled due to the presence of an umlaut. UTF-8 encoding is now the default for both calling the API and processing the results, so this issue will hopefully not arise again.

Additionally, a `locale` argument has been added, to set the proper locale for your report suite. This is specified through the `SCAuth()` function, with the list of possible locales provided by the Adobe documentation. So if the even after using 1.4.7 with UTF-8 encoding by default, you are still seeing errors, try setting the locale to the country you are in/country setting of the report suite.

Feature Requests/Bugs

As always, if you come across bugs or have feature requests, please continue to use the RSiteCatalyst GitHub Issues page to submit issues. Don’t worry about cluttering up the page with tickets, please fill out a new issue for anything you encounter (with code you’ve already tried and is failing), unless you are SURE that it is the same problem someone else is facing.

However, outside of patching really serious bugs, I will likely not spend any more time improving this package in the future; my interests have changed, and RSiteCatalyst is pretty much complete as far as I’m concerned. That said, contributors are also very welcomed. If there is a feature you’d like added, and especially if you can fix an outstanding issue reported at GitHub, we’d love to have your contributions. Willem and I are both parents of young children and have real jobs outside of open-source software creation, so we welcome any meaningful contributions to RSiteCatalyst that anyone would like to contribute.

A Million Text Files And A Single Laptop

GNU Parallel Cat Unix

Wait…What? Why?

More often that I would like, I receive datasets where the data has only been partially cleaned, such as the picture on the right: hundreds, thousands…even millions of tiny files. Usually when this happens, the data all have the same format (such as having being generated by sensors or other memory-constrained devices).

The problem with data like this is that 1) it’s inconvenient to think about a dataset as a million individual pieces 2) the data in aggregate are too large to hold in RAM but 3) the data are small enough where using Hadoop or even a relational database seems like overkill.

Surprisingly, with judicious use of GNU Parallel, stream processing and a relatively modern computer, you can efficiently process annoying, “medium-sized” data as described above.

Continue reading

21st Century C: Error 64 on OSX When Using Make

To end 2015, I decided to finally learn C, instead of making it a 2016 resolution! I had previously done the ‘Learn C The Hard Way‘ tutorials, taken about a year off, and thus forgotten everything.

Rather than re-do the same material, I decided to get ’21st Century C’ from O’Reilly and work through that. Unfortunately, there is an error/misprint in the very beginning chapters that makes doing the exercises near impossible on OSX. This error manifests itself as c99: invalid argument `all’ to -W Error 64. If you encounter this error on OSX (I’m using OSX 10.11.2 El Capitan as of writing this post), here are three methods for fixing the issue.

Error 64!

When the discussion of using Makefiles begins on page 15, there is a discussion of the “smallest practicable makefile”, which is just six lines long:
Unfortunately, this doesn’t quite work on OSX. Page 11 in the book sort-of references that a fix is needed, but the directions aren’t so clear…

Error 64, solution 1: Book Fix, updated

To use the book fix, you are supposed to:

  1. Create a file named c99
  2. Put the lines (no quotes) ‘gcc –std=c99 $*’ OR ‘clang $*’ in the c99 file
  3. Add the file to your PATH in Terminal (such as ‘export PATH=”/Users/computeruser:$PATH”‘ if the c99 file were located in /Users/computeruser directory)
  4. Run (no quotes) ‘chmod +x c99’ on the file to make it executable

Once you add this work-around to your PATH, then open a fresh Terminal window (or run ‘source .bash_profile’ to refresh the Bash settings), you should be able to use Make to compile your C code.

But to be honest, this seems like a really weird “fix” to me, as it overrides the C compiler settings for any program run via Terminal. I prefer one of two alternate solutions.

Error 64, solution 2: Makefile Change

As I was researching this, a helpful Twitter user noted:

When you switch the ‘c99’ reference to just ‘cc’ in the Makefile, everything works fine. Here’s the subtlety different, corrected Makefile:

Error 64, solution 3: Switch to Clang

The final solution I came across is rather than using the GCC compiler, you can use an alternate compiler called Clang, which is also generally available on OSX (especially with XCode installed). Like solution 2 above, the Makefile is just subtlety different:

Whether to use GCC or Clang as your compiler is really beyond the scope of this blog post; as 21st Century C discusses, it really shouldn’t matter (especially when you are just learning the mechanics of the language).

Error 64, Be Gone!

There’s not really much more to say at this point; this blog post is mainly documentation for anyone who comes across this error in the future. I’ve continued on through the book using Clang, but suffice to say, it’s not the compiler that writes poor-quality, non-compiling code, it’s the user. Ah, the fun of learning 🙂

Four Tactics For Well Thought Out Business Requirements

One of the most common issues in business (especially large corporations) is trying to nail down the requirements for a given analysis request. The “business people” on the front-lines are talking to their higher-ups about what they think are important questions for the business to solve, but by the time the question gets to the analyst or developer, it sounds something like:

It would be interesting to model using SAS how our customers shop for our merchandise by channel and what overlaps there are between demographics, geography, product type and tenure. But we also have to timebox this, we can’t be boiling-the-ocean just looking for needles-in-a-haystack.

Say WHAT? Mr. Business Person, I cannot help you if you do not run that mess through Unsuck-It first.

In all seriousness, I’ve found there are a few great ways for an analyst to refine a “question” like the one above into an actionable plan of attack. So the next time you get a jargon-filled, completely generic analysis request such as the one above, try these four tactics.

1. All Requests Should Be Phrased In The Form Of A Question

The first thing to notice about the mock interaction above is that there are no question marks; it’s not a question! For an analyst or developer to work effectively, questions need to be presented, not bland statements. For example, a refinement series of questions from the analyst might include:

  • You need a model? What type of model? Do you mean a predictive model, a decision tree for understanding, a PivotTable for you to poke at, a one-page PowerPoint slide to give your boss?
  • You specified four attributes (demographics, geography, product type and tenure). Do you have a hypothesis around these attributes (or are you just brain-blabbing)?
  • What is meant by “shop”? Do you mean how do customers browse our goods online and in stores, the purchase cycle, what goods are frequently purchased together or something else?

Note that in all three of the refinement questions above, you are taking a generic idea and really drilling into what is needed. It is the analyst that is the expert in the techniques for analyzing data, so the analyst should be helping the business person to take a raw analysis request and make it into answerable questions.

2. Separate The Tools From The Question

The second thing to notice in the mock interaction above is the statement “using SAS“. I didn’t write that to pick on SAS, but rather, this exact statement was said to me really early in my career. I had a boss who would try and guess which tool was appropriate for the question he was asking. I presume that he was trying to gauge how hard he thought the problem was, or try to signal to me how hard he thought the problem was. In the end, a plain SQL query with the results copied into an Excel table was all that was necessary.

As the analyst, confirm whether the tool is actually part of the deliverable. Meaning, if you need to deliver a Tableau workbook, ok, specifying “use Tableau” is an important part of the business question. But if the requirement is “production-quality visualizations”, Tableau may or may not be the right tool or might just be one part of a larger workflow.

3. Every Question Is Interesting To Someone. Solve The Valuable Ones.

Paraphrasing the aphorism “The path to hell is paved with good intentions“, the path to doing low-value work your entire career is answering questions that start “Wouldn’t it be interesting if…”.

The basis for these statements are often tangents in other meetings, where high-level executives think there is information that should just be available at everyone’s fingertips. But if you were to ask “What business action would you take if you knew this piece of information?” or “Is it worth me stopping a project worth $1 million in Pre-Tax Profit per month to answer this for you?”, you’ll suddenly the question becomes a lot less interesting.

So always have estimates of the business impact of what you are currently working on and ask for the same estimate of those who ask for your time. Projects that are valuable to the business are “interesting”, everything else is just making work for other people.

4. Don’t Just Solved The Stated Question. Solve The Unstated Question Too.

Finally, when I read the mock interaction above, there are actually two questions:

  • Stated: Do we understand our customer’s purchasing behaviors?
  • Unstated: How do we optimize our business to take into account our customer’s purchasing behaviors?

For sure, a deep understanding of the customer base is important no matter the product. But the unstated question of “What are we doing to do about it?” is so much more valuable to answer (i.e. tactic #3).

So even if the refined question becomes ‘Build a customer segmentation based on past purchases’, go one step further and figure out how to implement your findings. Create a test plan for increasing email click-through-rates based on the segments or optimize your display bidding, maybe build a recommender system for your website…implementation of new ideas is always going to be more valuable than just analyzing the past.

Always Be Assertive.

If the key to sales is “Always Be Closing”, the key to quality analysis is “Always Be Assertive”. Ask questions. Make people think about what they are doing, what they ask of others and what can be done to improve the business. It’s a rare, ego-centric co-worker who doesn’t appreciate collaborating to get to a better quality question (and answer!) than they originally started with.

Being able to read into what other people are asking for, estimating its value, then delivering more than they even knew they were asking for has helped me tremendously throughout my career. Hopefully by doing some or all of the tactics above, you’ll see a marked improvement in your analysis and career as well!