#Generate example dataframe with character column
example <- as.data.frame(c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"))
names(example) <- "strcol"

#For every unique value in the string column, create a new 1/0 column
#This is what Factors do "under-the-hood" automatically when passed to function requiring numeric data
for(level in unique(example$strcol)){
  example[paste("dummy", level, sep = "_")] <- ifelse(example$strcol == level, 1, 0)
}
As the code above shows, it’s trivial to generate your own 1/0 columns of data instead of relying on Factors. There are two things to keep in mind when creating your own dummy variables:
The problem you are trying to solve
How much RAM you have available
While it may make sense to generate dummy variables for Customer State (~50 for the United States), if you were to use the code above on City Name, you’d likely either run out of RAM or find that there are too many levels to be useful. Of course, with a qualitative statement such as “too many levels to be useful”, often the only way to know for sure is to try it! Save your work before running this code in case you do run out of RAM. Or, use someone else’s computer for testing 😉
Edit 1/2/14: John Myles White brought up a good point via Twitter about RAM usage:
@randyzwitch If you're running out of RAM with dummy variables, you probably want to use a sparse matrix instead of a data.frame.
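To make the RAM point concrete, here is a minimal sketch, written in Python rather than R, that compares how many cells a dense 1/0 table holds with how many entries a sparse representation needs for the same column. The function names `dense_cell_count` and `sparse_entries` are made up for illustration; in R you would more likely reach for a sparse matrix class, such as those in the Matrix package, than roll your own.

```python
def dense_cell_count(values):
    """Cells in a dense 1/0 table: one row per value, one column per unique level."""
    return len(values) * len(set(values))


def sparse_entries(values):
    """Sparse form: keep only the (row, column) positions that would hold a 1."""
    # Assign each unique level a column index (sorted for a stable order)
    columns = {level: j for j, level in enumerate(sorted(set(values)))}
    return [(i, columns[v]) for i, v in enumerate(values)]


# Same values as the example strcol column
strcol = ["A", "A", "B", "F", "C", "G", "C", "D", "E", "F"]

print(dense_cell_count(strcol))     # rows x unique levels
print(len(sparse_entries(strcol)))  # one stored entry per row
```

The dense count grows with rows times levels, while the sparse list grows only with the number of rows, which is why a high-cardinality column like City Name hurts so much more in a data.frame.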
Luckily for Adobe Analytics customers, the API provides a means of generating the framework for a properly-documented implementation. Here’s how to do it using RSiteCatalyst.
Generating Adobe Analytics documentation file
The code below outlines the commands needed to generate an Excel file (see example) with six tabs containing the basic structure of an Adobe Analytics implementation. The file covers all of the report suites you have access to, the elements that reports can be broken down by, conversion variables (eVars), segments, success events and traffic variables (props).
Additionally, within each tab metadata is provided that contains the various settings for variables, so you’ll be able to document the expiration settings for eVars, participation, list variables, segment types and so on.
library("RSiteCatalyst")
library("WriteXLS")

#Validate that underlying Perl modules for WriteXLS are installed correctly
#Will return "Perl found. All required Perl modules were found" if installed correctly
testPerl()

#### 1. Pull data for all report suites to create one comprehensive report ####

#Authenticate with Adobe Analytics API
SCAuth("user:company", "sharedsecret")

#Get Report Suites
report_suites <- GetReportSuites()

#Get Available Elements
elements <- GetElements(report_suites$rsid)

#Get eVars
evars <- GetEvars(report_suites$rsid)

#Get Segments
segments <- GetSegments(report_suites$rsid)

#Get Success Events
events <- GetSuccessEvents(report_suites$rsid)

#Get Traffic Vars
props <- GetProps(report_suites$rsid)

#### 2. Generate a single Excel file ####

#Create list of report suite objects, written as strings
objlist <- c("report_suites", "elements", "evars", "segments", "events", "props")

#Write out Excel file with auto-width columns, a bolded header row and filters turned on
#(the file name below is a placeholder; substitute your own path)
WriteXLS(objlist, "adobe_analytics_implementation_doc.xlsx",
         AdjWidth = TRUE, BoldHeaderRow = TRUE, AutoFilter = TRUE)
The only “gotchas” to keep in mind when using the script above are that the user running it will only receive data for report suites they have access to (which is determined by the Admin panel settings within Adobe Analytics) and that you need the WriteXLS package installed to write to Excel. WriteXLS uses Perl as the underlying code, so you’ll need to validate that the package is installed correctly, which is done using the testPerl() function in the package.
This is pretty bare-bones, no?
After you run this code, you’ll have an Excel file that has all of the underlying characteristics of your Adobe Analytics implementation. It’s important to realize that this is only the starting point; a great set of documentation will contain other pieces of information such as where/when the value is set (on entry, every page, when certain events occur, etc.), a layman’s explanation of what the data element means and other business information so your stakeholders can be confident they are using the data correctly. Additionally, you might consider creating a separate Excel file for every report suite in your implementation. It’s trivial to modify the code above to subset each data frame to a single value of rsid, then write to separate Excel files. Regardless of how you structure your documentation, DOCUMENT YOUR IMPLEMENTATION! The employees that come after you (and your future self!) will thank you.
EDIT 2/4/2016: Thanks to reader @CSitty for pointing out the R code became a little stale. The documentation generating code should now work again for RSiteCatalyst versions >= 1.4 and WriteXLS >= 4.0 (basically, any current version as of the time of this update).
Last week, I wrote a guest blog post at Bad Hessian about how to use IPython Notebook along with Amazon EC2 as your data science & analytics platform. I won’t reproduce the whole article here, but if you are interested in step-by-step instructions on how to set up an Amazon EC2 instance to use IPython Notebook, see the SlideShare presentation below, which outlines the steps needed to set up a remote IPython Notebook environment (or, PDF download).
If you already have experience setting up EC2 images and just need the IPython Notebook settings, here are the commands that are needed to set up your IPython public notebook server.
#### Start IPython, generate SHA1 password to use for IPython Notebook server
$ ipython
Python 2.7.5 |Anaconda 1.8.0 (x86_64)| (default, Oct 24 2013, 07:02:20)
Type "copyright", "credits" or "license" for more information.

IPython 1.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from IPython.lib import passwd
In [2]: passwd()
Enter password:
Verify password:
Out[2]: 'sha1:207eb1f4671f:92af695...'

#### Create nbserver profile
$ ipython profile create nbserver
[ProfileCreate] Generating default config file: u'/.ipython/profile_nbserver/ipython_config.py'
[ProfileCreate] Generating default config file: u'/.ipython/profile_nbserver/ipython_qtconsole_config.py'
[ProfileCreate] Generating default config file: u'/.ipython/profile_nbserver/ipython_notebook_config.py'
[ProfileCreate] Generating default config file: u'/.ipython/profile_nbserver/ipython_nbconvert_config.py'

#### Create self-signed SSL certificate
$ openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem

#### Modify ipython_notebook_config.py configuration file
#### Add these lines to the top of the file; no other changes necessary
#### Obviously, you'll want to add your path to the .pem key and your password

# Configuration file for ipython-notebook.
c = get_config()

# Kernel config
c.IPKernelApp.pylab = 'inline'  # if you want plotting support always

# Notebook config
c.NotebookApp.certfile = u'/home/ubuntu/certificates/mycert.pem'
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = u'sha1:207eb1f4671f:92af695...'
# It is a good idea to put it on a known, fixed port
c.NotebookApp.port = 8888

#### Start IPython Notebook on the remote server
$ ipython notebook --profile=nbserver
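
If you are curious what the value you paste into c.NotebookApp.password contains, the string returned by passwd() has three colon-separated parts: the algorithm name, a random hex salt and a hex digest. The sketch below rebuilds a string of that shape using only the standard library. It assumes the digest is SHA-1 over the passphrase followed by the salt; the helper names `make_notebook_hash` and `check_notebook_hash` are hypothetical and are not part of IPython, so treat this as an illustration of the format and keep using passwd() for the real thing.

```python
import hashlib
import secrets


def make_notebook_hash(passphrase, salt=None):
    """Build an 'algorithm:salt:digest' string shaped like the passwd() output."""
    if salt is None:
        # 6 random bytes -> 12 hex characters, matching the salt length shown above
        salt = secrets.token_hex(6)
    # Assumption: the digest covers the passphrase followed by the salt
    digest = hashlib.sha1((passphrase + salt).encode("utf-8")).hexdigest()
    return ":".join(("sha1", salt, digest))


def check_notebook_hash(hashed, passphrase):
    """Return True if the passphrase reproduces the digest stored in hashed."""
    algorithm, salt, digest = hashed.split(":", 2)
    h = hashlib.new(algorithm)
    h.update((passphrase + salt).encode("utf-8"))
    return h.hexdigest() == digest


hashed = make_notebook_hash("my notebook password")
print(hashed.split(":")[0])                                 # algorithm name
print(check_notebook_hash(hashed, "my notebook password"))  # correct passphrase
print(check_notebook_hash(hashed, "not the password"))      # wrong passphrase
```

Because the salt is stored in the string itself, the server can re-derive the digest from whatever password is typed at login, and the configuration file never needs to hold the password in clear text.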