Quickly Create Dummy Variables in a Data Frame

Randy Zwitch × January 2, 2014 × DataScience

On Quora, someone asked how to fix the error thrown by the randomForest package in R when a categorical variable has more than 32 levels. Since I’ve also seen this question on the Kaggle forums, StackOverflow and elsewhere, here’s the answer: code your own dummy variables instead of relying on Factors!

#Generate example data frame with a character column
example <- data.frame(strcol = c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"),
                      stringsAsFactors = FALSE)

#For every unique value in the string column, create a new 1/0 column
#This is what Factors do "under the hood" automatically when passed to a function requiring numeric data
for(level in unique(example$strcol)){
  example[paste("dummy", level, sep = "_")] <- ifelse(example$strcol == level, 1, 0)
}

As the code above shows, it’s trivial to generate your own 1/0 columns of data instead of relying on Factors. There are two things to keep in mind when creating your own dummy variables:

1. The problem you are trying to solve
2. How much RAM you have available

While it may make sense to generate dummy variables for Customer State (~50 for the United States), running the code above on City Name would likely either exhaust your RAM or produce too many levels to be useful. Of course, with any qualitative statement such as “too many levels to be useful”, often the only way to know for sure is to try it! Save your work before running this code in case you do run out of RAM. Or use someone else’s computer for testing 😉
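Before trying it, a rough back-of-the-envelope check can tell you whether the dummy columns will even fit in memory. The sketch below is only an approximation: `estimate_dummy_bytes` is a hypothetical helper (not part of any package), it assumes each dummy value is stored as an 8-byte double (which is what `ifelse(condition, 1, 0)` produces), and it ignores data frame overhead.

```r
# Hypothetical helper: rough size of the dummy columns the loop would add.
# bytes_per_value = 8 assumes numeric (double) columns.
estimate_dummy_bytes <- function(x, bytes_per_value = 8) {
  n_levels <- length(unique(x))  # one dummy column per distinct value
  n_rows <- length(x)            # one value per row in each column
  n_rows * n_levels * bytes_per_value
}

strcol <- c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F")

length(unique(strcol))        # number of dummy columns that would be created
estimate_dummy_bytes(strcol)  # approximate bytes for those columns

# The same arithmetic at scale, using made-up numbers for illustration:
# 1 million rows and 20,000 distinct city names
1e6 * 2e4 * 8 / 1024^3        # approximate size in GiB
```

With the made-up City Name numbers, the estimate comes to roughly 149 GiB, which makes the point about RAM without having to crash your R session first.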

Edit 1/2/14: John Myles White brought up a good point via Twitter about RAM usage:

@randyzwitch If you're running out of RAM with dummy variables, you probably want to use a sparse matrix instead of a data.frame.

— John Myles White (@johnmyleswhite) January 2, 2014
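One way to follow that suggestion is the `sparse.model.matrix()` function from the Matrix package, which stores only the non-zero entries. This is a minimal sketch rather than a drop-in replacement for the loop above: it assumes the Matrix package is installed, and whether a given modeling function accepts a sparse matrix as input depends on the package you are using.

```r
library(Matrix)  # recommended package, included with most R installations

# Same example values as above, stored as a factor
example <- data.frame(
  strcol = factor(c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"))
)

# "- 1" drops the intercept so every level gets its own indicator column
dummies <- sparse.model.matrix(~ strcol - 1, data = example)

dim(dummies)       # rows by number of levels
colnames(dummies)  # one column per level, prefixed with the variable name
```

Because each row contains exactly one non-zero value, the storage needed grows with the number of rows rather than with rows times levels.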

Adobe Analytics Implementation Documentation in 60 Seconds

Randy Zwitch × December 9, 2013 × Analytics

When I was working as a digital analytics consultant, no question had quite the ability to cause belly laughs AND angst like, “Can you send me an updated copy of your implementation documentation?” I saw companies that were spending six or seven figures annually on their analytics infrastructure and multi-millions in salary for employees, and yet the only way to understand what data they were collecting was to inspect their JavaScript code.

Luckily for Adobe Analytics customers, the API provides a means of generating the framework for a properly-documented implementation. Here’s how to do it using RSiteCatalyst.

Generating Adobe Analytics documentation file

The code below outlines the commands needed to generate an Excel file (see example) with six tabs containing the basic structure of an Adobe Analytics implementation. This report contains all of the report suites you have access to, the elements that reports can be broken down by, traffic variables (props), conversion variables (eVars), success events and segments available for reporting.

Additionally, each tab includes metadata describing the settings for each variable, so you’ll be able to document the expiration settings for eVars, participation, list variables, segment types and so on.

library("RSiteCatalyst")
library("WriteXLS")

#Validate that underlying Perl modules for WriteXLS are installed correctly
#Will return "Perl found. All required Perl modules were found" if installed correctly
testPerl()

#### 1. Pull data for all report suites to create one comprehensive report ####

#Authenticate with Adobe Analytics API
SCAuth("user:company", "sharedsecret")

#Get Report Suites
report_suites <- GetReportSuites()

#Get Available Elements
elements <- GetElements(report_suites$rsid)

#Get eVars
evars <- GetEvars(report_suites$rsid)

#Get Segments
segments <- GetSegments(report_suites$rsid)

#Get Success Events
events <- GetSuccessEvents(report_suites$rsid)

#Get Traffic Vars
props <- GetProps(report_suites$rsid)

#### 2. Generate a single Excel file ####

#Create character vector holding the names of the data frames to write, one per tab
objlist <- c("report_suites", "elements", "evars", "segments", "events", "props")

#Write out Excel file with auto-width columns, a bolded header row and filters turned on
WriteXLS(objlist, "/Users/randyzwitch/Desktop/adobe_analytics_implementation_doc.xlsx",
         AdjWidth = TRUE, BoldHeaderRow = TRUE, AutoFilter = TRUE)

The only “gotchas” to keep in mind when using the script above are that the user running it will only receive data for report suites they have access to (which is determined by the Admin panel settings within Adobe Analytics) and that you need to have the WriteXLS package installed to write to Excel. The WriteXLS package uses Perl under the hood, so you’ll need to validate that Perl and the required Perl modules are installed correctly, which is done using the testPerl() function in the package.

This is pretty bare-bones, no?

After you run this code, you’ll have an Excel file that has all of the underlying characteristics of your Adobe Analytics implementation. It’s important to realize that this is only the starting point; a great set of documentation will contain other pieces of information such as where/when the value is set (on entry, every page, when certain events occur, etc.), a layman’s explanation of what the data element means and other business information so your stakeholders can be confident they are using the data correctly. Additionally, you might consider creating a separate Excel file for every report suite in your implementation. It’s trivial to modify the code above to subset each data frame for a single value of rsid, then write to separate Excel files. Regardless of how you structure your documentation, DOCUMENT YOUR IMPLEMENTATION! The employees that come after you (and your future self!) will thank you.
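As a rough sketch of that per-report-suite idea, the code below subsets a list of data frames by rsid. The two small data frames are toy stand-ins for the API results (every column except rsid is made up), `subset_by_rsid` is a hypothetical helper, and the output file names are only illustrative. The WriteXLS call is left as a comment so the sketch runs without Perl; it relies on WriteXLS accepting a named list of data frames, one tab per list element.

```r
# Toy stand-ins for the data frames returned by the RSiteCatalyst calls;
# only the rsid column is assumed to exist in the real results.
evars <- data.frame(rsid = c("suite_a", "suite_a", "suite_b"),
                    name = c("eVar1", "eVar2", "eVar1"),
                    stringsAsFactors = FALSE)
props <- data.frame(rsid = c("suite_a", "suite_b", "suite_b"),
                    name = c("prop1", "prop1", "prop2"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: keep only the rows belonging to one report suite
subset_by_rsid <- function(frames, suite) {
  lapply(frames, function(df) df[df$rsid == suite, , drop = FALSE])
}

frames <- list(evars = evars, props = props)
suites <- sort(unique(unlist(lapply(frames, function(df) df$rsid))))

per_suite_docs <- list()
for (suite in suites) {
  per_suite_docs[[suite]] <- subset_by_rsid(frames, suite)
  outfile <- paste0("adobe_analytics_implementation_doc_", suite, ".xlsx")
  # With WriteXLS and Perl available, each list could be written to its own file:
  # WriteXLS(per_suite_docs[[suite]], outfile,
  #          AdjWidth = TRUE, BoldHeaderRow = TRUE, AutoFilter = TRUE)
}

names(per_suite_docs)  # one entry per report suite
```

Keeping the data frames in a named list means the tab names come from the list names, so the same loop works no matter how many of the six data frames you include.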

EDIT 2/4/2016: Thanks to reader @CSitty for pointing out the R code became a little stale. The documentation generating code should now work again for RSiteCatalyst versions >= 1.4 and WriteXLS >= 4.0 (basically, any current version as of the time of this update).

Using Amazon EC2 with IPython Notebook

Randy Zwitch × November 21, 2013 × DataScience

Last week, I wrote a guest blog post at Bad Hessian about how to use IPython Notebook along with Amazon EC2 as your data science & analytics platform. I won’t reproduce the whole article here, but if you are interested in step-by-step instructions on how to set up an Amazon EC2 instance to use IPython Notebook, see the SlideShare presentation below, which outlines the steps needed to set up a remote IPython Notebook environment (or, PDF download).

If you already have experience setting up EC2 images and just need the IPython Notebook settings, here are the commands that are needed to set up your IPython public notebook server.

#### Start IPython, generate SHA1 password to use for IPython Notebook server

$ ipython
Python 2.7.5 |Anaconda 1.8.0 (x86_64)| (default, Oct 24 2013, 07:02:20)
Type "copyright", "credits" or "license" for more information.

IPython 1.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from IPython.lib import passwd

In [2]: passwd()
Enter password:
Verify password:
Out[2]: 'sha1:207eb1f4671f:92af695...'

#### Create nbserver profile

$ ipython profile create nbserver
[ProfileCreate] Generating default config file: u'/.ipython/profile_nbserver/ipython_config.py'
[ProfileCreate] Generating default config file: u'/.ipython/profile_nbserver/ipython_qtconsole_config.py'
[ProfileCreate] Generating default config file: u'/.ipython/profile_nbserver/ipython_notebook_config.py'
[ProfileCreate] Generating default config file: u'/.ipython/profile_nbserver/ipython_nbconvert_config.py'

#### Create self-signed SSL certificate

$ openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout mycert.pem -out mycert.pem

#### Modify ipython_notebook_config.py configuration file
#### Add these lines to the top of the file; no other changes necessary
#### Obviously, you'll want to add your path to the .pem key and your password

# Configuration file for ipython-notebook.

c = get_config()

# Kernel config
c.IPKernelApp.pylab = 'inline'  # if you want plotting support always

# Notebook config
c.NotebookApp.certfile = u'/home/ubuntu/certificates/mycert.pem'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:207eb1f4671f:92af695...'
# It is a good idea to put it on a known, fixed port
c.NotebookApp.port = 8888

#### Start IPython Notebook on the remote server

$ ipython notebook --profile=nbserver

Happy IPython Notebooking!
