Learning R Has Really Made Me Appreciate SAS

EDIT, 9/9/2016: Four years later, this blog post is a comical look back in time. It’s hard to believe that I could think this way! Having used R (and Python, Julia), I will never return to the constraints of using SAS. The inflexible nature of everything having to be a Dataset in SAS vs. the infinite flexibility of data structures in programming-oriented languages makes it no contest.

But I’ll leave this here to remind myself how today’s frustration leads to tomorrow’s breakthroughs.


For the past 18 months, it seems like all I’ve heard about in the digital marketing industry is “big data”, and with that, mentions of using Hadoop and R to solve these sorts of problems.  Why are these tools the most often mentioned?  Because they are open source, i.e. free of charge!

But as I’ve tried to learn R, I keep asking myself…are all of my colleagues out of their minds?  Or, am I just beyond learning something new?  As of right now, R is just one big hack on top of a hack to me, and the software is only “free” if you don’t consider lost productivity.

Need new functionality?  Just download another R package!

One of the biggest “pros” I see thrown around for R relative to a tool like SAS is that when new statistical techniques are invented, someone will code it in R immediately.  A company like SAS may take 5 years to implement the feature, or it may not get implemented at all.  That’s all fine and good, but the problem I’ve found is that there are 10 ways to do something in R, and I spend more time downloading packages (along with other packages that are dependencies) than I do learning A SINGLE WAY to do something correctly.

For example, take trying to get summary statistics by group.  In SAS, you use a Proc Summary statement, with either a BY group statement or a CLASS statement.  It’s fairly simple and it works.

proc summary data=hs0;
  var _numeric_;
  class prgtype;
  output out=results mean= / autolabel autoname inherit;
run;

In R, I ran the following code, which should be roughly equivalent:

by(hs0, hs0$prgtype, mean)

Very simple, fewer lines…and technically wrong, throwing 6 unhelpful errors for a single line of code.  Because it was decided that calling “mean” on a data frame would be deprecated in R.  WHY???  It’s so simple, why modify the language like that?

According to the error message, I’m supposed to use colMeans instead…but once you get to how, you’re on your own; the Help documentation is garbage.  Some combination of “by” and “colMeans” might work, but I don’t have an example to follow.
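For reference, here is a minimal sketch of how “by” and “colMeans” can be combined in base R, with no extra packages.  The small hs0 data frame below is a made-up stand-in (the real dataset isn’t reproduced in this post), and the column names read and write are assumptions for illustration only.

```r
# Hypothetical stand-in for hs0; only prgtype comes from the post,
# the read/write columns and their values are invented for illustration.
hs0 <- data.frame(
  prgtype = c("academic", "academic", "general", "general", "vocational"),
  read    = c(60, 50, 40, 44, 35),
  write   = c(58, 52, 42, 46, 37)
)

# Keep only the numeric columns (roughly what "var _numeric_" does in SAS).
numeric_cols <- sapply(hs0, is.numeric)
hs0_numeric  <- hs0[, numeric_cols, drop = FALSE]

# Option 1: by() + colMeans gives one vector of column means per group.
means_by <- by(hs0_numeric, hs0$prgtype, colMeans)

# Option 2: aggregate() returns a data frame, closer to "output out=results".
results <- aggregate(hs0_numeric, by = list(prgtype = hs0$prgtype), FUN = mean)
print(results)
```

The aggregate() version is probably the closer match to the Proc Summary step, since the result is itself a data frame with one row per prgtype.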

Google sent me to the Quick-R website, and I found a “descriptive statistics” article with by group processing…with the recommendation of using the “psych” package or the “doBy” package.  But CRAN won’t let me download all of the dependencies, so again, stuck trying to do the simplest thing in statistics.

Let’s be fast and run everything in RAM!

My next favorite hassle in R is that you are expected to continuously monitor how many data elements you have active in a workspace.  R runs completely in RAM (as opposed to SAS, which runs a combination of RAM for processing and hard disks for storage), so if you want to do something really “big”, you will quickly choke your computer.  I tried to work with a single day of Omniture data from the raw data feed, and my MacBook Pro with 6GB of memory was shot.  I believe the file was 700,000 rows by 300 columns, but I could be mis-remembering.  That’s not even enough data to think about performance-tuning a program in SAS; any sloppy code will run quickly.
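As a rough back-of-the-envelope check on those dimensions, the sketch below estimates the footprint of a 700,000 x 300 table, assuming (purely for illustration) that every column is stored as an 8-byte numeric.  A real raw data feed is mostly text, so the true figure will differ, and R can make additional copies of an object when it is modified.

```r
# Dimensions as recalled in the post (the author notes they may be off).
rows <- 700000
cols <- 300

# Assumption: every value is an 8-byte double, R's default numeric type.
bytes_per_double <- 8

# Estimated size of a single in-memory copy, in GiB.
est_gib <- rows * cols * bytes_per_double / 1024^3
print(round(est_gib, 2))
```

Under that assumption a single copy comes to roughly 1.56 GiB, so just a few copies of the data would be enough to exhaust 6GB.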

How does one solve these memory errors in R?  Porting to the Amazon cloud seems to be the most commonly given suggestion.  But that’s more setup time, getting an R instance over to Amazon, your data over to Amazon…and now you are renting hardware.

R is great for data visualization!

From what I’ve seen from the demo(graphics) tutorial, R does have some pretty impressive visualization capabilities.  Contour maps, histograms, boxplots…there seems to be a lot of capability here beyond the realm of a tool like Excel (which, besides not being free, isn’t really for visualization).  SAS has some graphics capabilities, but they are a bit hard to master.

But for all of the hassle to get your data formatted properly, downloading endless packages, avoiding memory errors, you could just pay for Tableau and get working.  Then, once you have your visualizations done in Tableau, if you are using Tableau server you can share interactive dashboards with others.  As far as I know, R graphics are static image exports, so you’re stuck with “flat” presentations.

Maybe it’s just me

For R diehards, the above verbiage probably just sounds like whining from someone who is too new to appreciate the greatness of R or too stuck in the “old SAS way”.  That’s certainly possible.  But from my first several weeks of trying to use R, the level of frustration is way beyond anything I experienced when I was learning SAS.

Luckily, I don’t currently have any consulting projects that require R or SAS, so I can continue to try and learn why everyone thinks R is so great.  But from where I sit right now, the licensing fee from SAS doesn’t seem so bad when it allows me to get to doing productive work instead of building my own statistics software piece-by-piece.


My Top 20 Least Useful Omniture Reports

Just because data CAN be captured doesn't mean it SHOULD be!

In a prior post about customizing the SiteCatalyst menu interface, I discussed how simple changes such as hiding empty Omniture variables/reports and re-organizing the menu structure will help improve understanding within your organization.  In the spirit of even further interface optimization, here are 20 reports within Omniture that I feel can be hidden due to their lack of business-actionable information.

Here are my Top 20, in no particular order:

  • Mobile:  Color Depth
  • Mobile:  Information Services
  • Mobile:  Decoration Mail Support
  • Mobile:  PTT
  • Mobile:  Device Number Transmit
  • Mobile:  Browser URL Length
  • Mobile:  DRM
  • Mobile:  Mail URL Length
  • Mobile:  Java version
  • Mobile:  Manufacturer
  • Technology:  Connection Types
  • Technology:  Monitor Color Depth
  • Technology:  JavaScript Version
  • Technology:  Monitor Resolutions
  • Visitor Profile:  Top-Level Domains
  • Visitor Profile:  Domains
  • Visitor Profile:  Geosegmentation
  • Traffic Sources:  All Search Page Ranking
  • Traffic Sources: Original Referring Domains
  • Custom Variable:  s.server report

Mobile reports

For the most part, the information in the separate reports can be determined just by knowing the device (which is also a default Omniture report). So, a single report can take the place of 10.

There’s also the pesky issue that the reports more often than not show “Unknown” for 90%+ of the mobile traffic (at least, in the U.S.).  So not only can the data be determined from knowing the mobile device being used, the additional reports aren’t even well populated.

Technology reports

The “Connection Type” report, along with “Monitor Color Depth”, measures things that haven’t been an issue in too many years to continue reporting on. The answers are always the same: LAN, 16-bit or higher.

“Monitor resolution” is irrelevant in the face of also having “Browser Width” & “Browser Height” reports (the true size of the web page “real estate” on screen).

Finally, JavaScript version?  The JavaScript report with “Enabled/Disabled” is likely more than enough information.  Or, you can just include jQuery in your website and know with 100% certainty what version is being used.

Visitor Profile reports

My dislike of the identified Visitor Profile reports is due to halfway implementation.  The “GeoSegmentation” report shows a nice map representation, but only of traffic metrics like Page Views and Visits.  Why not open this up to conversion variables and really make the visualization useful, instead of needing to rely on the “flat”, non-map Visitor Zip (s.zip) report?

For the “Domains” and “Top-Level Domains” reports, you have granularity issues; the “Top-Level Domains” report is sort of a country-level report, but the U.S. has several line items.  The “Domains” report shows what ISP people are using to access the Internet (which I think is generally useless in itself), but again…it spans geography, so the ISP network someone is on may not even have the same technology.  So what are we really measuring in these reports?

Traffic Sources reports

The “All Search Page Ranking” report seems like it could be useful, until you realize that 1) it aggregates all search engines (whose different algorithms provide different rankings) and 2) with personalized search, rankings are no longer static. Literally every single person could see a different link position for the same search term.  So while this report may have made sense for SEO measurement in the past, it’s really past its prime…use the right SEO tool for the job (Conductor, SEOmoz, and the like).

The “Original Referring Domains” report is weird in its own way…the absolute first URL that referred you to the site.  Really?  As Avinash has said, giving 100% credit to the first touchpoint is like giving your first girlfriend credit for you marrying your wife (paraphrased).  This report is very limited in its usefulness IMO, especially given the advances in attribution modeling in the past several years.

Custom Variable:  s.server report

The only custom variable report I have on this list is the s.server report; hopefully, all of your other custom variables are capturing only business-useful information!

The reason I dislike the s.server variable/report is the same reason I dislike the “All Search Page Ranking” report; use the right tool for the job.  This is a lazy way of monitoring server volume for load balancing.  But if you’re doing the job well on the back-end, shouldn’t every server have the same level of volume?

Even if the answer to the previous question is no (I’m not a network engineer, clearly), having an operational report like this doesn’t make much sense to me in a marketing reporting tool.

Hide in the menu, don’t restrict access

Hiding reports in the Omniture menu interface doesn’t mean the info stops being collected or becomes unavailable to all users.  Rather, the option to use the reports isn’t immediately obvious (since they don’t show up in the menu).  Power Users can still find these reports using the search box if necessary to answer an oddball question.

But in my experience, the information in these reports is generally not business-useful, or is lacking in some critical way.  If you can’t make regular, high-impact decisions with the info, then you’re better off never looking at it at all.


Apple Has Earned a Customer for Life

Broken MacBook Pro hinge (due to glue failure)

I used to think that when people talked about the “legendary Apple customer service” that there was plenty of hyperbole thrown in for good measure.  Until it happened to me with my broken MacBook Pro hinge.

“Broken MacBook Pro Hinge” - Plenty of search results

When the screen on my late 2008 15” MacBook Pro started separating from the hinge, the first thing I did was search Google.  There I found more than enough search results to make me believe this was a widespread issue with this vintage of laptop.  And since the laptop was out of warranty, most of the results talked about re-gluing the aluminum screen cover to the hinge.

After trying to re-attach the hinge to the screen using epoxy, I headed over to the Apple store in King of Prussia, PA.  To say this first encounter at the Genius Bar was frustrating is an understatement.

You should’ve bought AppleCare

Apple cashiers “Geniuses” and fanboys alike are very big on pushing the AppleCare warranty, selling you with tales that Apple will fix anything in that extended time period.  While that may be true, extended warranties generally don’t pay off for the consumer, and as such, I don’t buy them.

Not that it would have mattered for me anyway.  My MacBook Pro is well beyond 3 years old, one of the first unibody models that came out.  You’d think the Apple “Genius” would’ve known that after checking the serial number, but instead he just kept repeating robotically:

“You should’ve bought AppleCare.  You should’ve bought AppleCare.”

Even when I asked, “A glue failure doesn’t seem like a manufacturer’s defect?” or “I should’ve paid $349 for an extended warranty to protect against $0.05 of faulty glue?”

“You should’ve bought AppleCare.”

At that point, after being asked if I dropped the laptop, given a series of robotic answers, told that I should’ve spent $349 on something that wouldn’t have fixed my problem, and generally treated like a monkey, I felt like smashing the laptop right on the Genius Bar just to make a scene.  Instead, I walked out feeling worse than when I arrived, with crippled MacBook Pro in hand.

Maybe an Apple Certified Repair facility can help

Since I wasn’t going back for a second round of stupidity at the King of Prussia Apple Store, I decided to look up an independent shop to see what the cost of repair would be.  The repair guy immediately said “Oh, I’ve seen this a few times recently…it’s probably around $500-$600 to fix.”

$%^$&%*(#!  For $600, I’d be about 30-35% of the way to a new 15” MacBook Pro.  Again I left a store without doing anything, and feeling worse than when I arrived.  I’d either need to pay $600 or pay $2000+ to get the newer equivalent of my laptop.

One more trip to the Apple Store

Several weeks had passed and my laptop became pretty much unusable.  I decided to bite the bullet and pay to get the screen fixed.  I also decided to go back to an Apple Store (this time, in Ardmore, PA) to have them fix it.  I figured if I’d have to pay, might as well guarantee it would get fixed properly.

When I walked up to the Genius Bar, the Apple “Genius” still asked me if I dropped my laptop (sidebar:  Is this part of the mind tricks they give everyone?  There isn’t a scratch on the thing, let alone any dents).  After the Apple employee looked over the laptop, I told him in my most dejected voice that I wanted to find out how much it was to replace the screen.

Apple Genius:  “How about ‘free’?”

I damn near fell off the stool I was sitting on.  How could the Apple Store in King of Prussia have been so unhelpful, and then 5 minutes into the same explanation I get an offer to get the screen fixed FREE at the Suburban Square Apple Store in Ardmore?

Apple Genius:  “And we can probably get this back to you by tomorrow.”

Needless to say, I didn’t want to do anything except hit ‘Accept’ on the electronic repair form.  I’d come too far to mess this gift up!

Apple, you’ve earned yourself a lifetime customer

Maybe I got lucky.  Maybe it was perseverance.  Maybe this screen/hinge defect has shown up too many times in the last six weeks and Apple could no longer ignore it.

Maybe it’s because I asked twice at two different Genius appointments. Or maybe Apple has realized I’ve spent several thousand dollars with them in the past several years, with this MacBook Pro, iMac, several iPhones and an iPad.  That level of spend probably doesn’t even get me in the top 50% of non-business customers, but it’s not negligible either.

Whatever the reason, by comping me the $492.41, Apple has “bought” themselves a customer for life.

The cost of a broken MacBook Pro hinge? Apparently, $492.41!

Edit: To read the follow-up on what eventually became of this MacBook Pro, click here for an article about my replacement battery interaction with Apple.

