Quickly Create Dummy Variables in a Data Frame

A question on Quora asked how to fix the error you get when the randomForest package in R refuses a categorical variable with more than 32 levels. I’ve seen the same question on the Kaggle forums, StackOverflow and elsewhere, so here’s the answer: code your own dummy variables instead of relying on Factors!

Code snippet
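The embedded snippet is not reproduced here. As a minimal sketch of the idea, assuming a data frame named `example` with a character column `strcol` (the names used in the comments below) and an `ifelse`-based approach (which the comments mention), something like the following creates one 1/0 column per distinct value. The sample data and the `dummy_` column prefix are made up for illustration.

```r
# Hypothetical sample data; only the names `example` and `strcol` come from the post's comments
example <- data.frame(
  id = 1:6,
  strcol = c("A", "B", "C", "A", "B", "A"),
  stringsAsFactors = FALSE
)

# Add one 1/0 column per distinct value found in strcol
for (level in unique(example$strcol)) {
  new_col <- paste("dummy", level, sep = "_")  # e.g. "dummy_A"
  example[[new_col]] <- ifelse(example$strcol == level, 1, 0)
}

names(example)
# "id" "strcol" "dummy_A" "dummy_B" "dummy_C"
```

Each new column is plain numeric, so a model sees several two-valued predictors instead of one Factor with many levels.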

As the code above shows, it’s trivial to generate your own 1/0 columns of data instead of relying on Factors. There are two things to keep in mind when creating your own dummy variables:

  1. The problem you are trying to solve
  2. How much RAM you have available

While it may make sense to generate dummy variables for Customer State (~50 for the United States), if you were to use the code above on City Name, you’d likely either run out of RAM or find out that there are too many levels to be useful.
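To get a feel for the RAM question before running anything, a back-of-the-envelope estimate can help. The sketch below assumes the dummy columns are stored as ordinary R numerics (8 bytes per value) in a dense structure; `estimate_dummy_mb` is a hypothetical helper, and the row and level counts are invented for illustration.

```r
# Rough size of dense numeric 1/0 columns: rows x levels x 8 bytes per double
estimate_dummy_mb <- function(n_rows, n_levels, bytes_per_cell = 8) {
  n_rows * n_levels * bytes_per_cell / 1024^2
}

# One million rows with ~50 levels (state-sized): roughly 381 MB
estimate_dummy_mb(1e6, 50)

# One million rows with a hypothetical 20,000 levels (city-sized): roughly 152,588 MB
estimate_dummy_mb(1e6, 20000)
```

Storing the columns as integers would halve these figures, but the growth with the number of levels stays the same.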

Of course, with a qualitative statement like “too many levels to be useful”, often the only way to know for sure is to try it! Just make sure you save your work before running this code, in case you run out of RAM. Or, use someone else’s computer for testing ;)

Edit 1/2/14: John Myles White brought up a good point via Twitter about RAM usage:


Comments

  1. Nice hint! By the way, have you considered model.matrix(), which would also allow coding interactions? Using the example object above, something along these lines should work:

    example<-cbind(example, model.matrix(~example$strcol-1))

    • Randy Zwitch says:

      Thanks for stopping by! You’re right, there are any number of ways to handle this problem. I’ve never used model.matrix() before, so that’s something for me to try out. It looks like it’s got some great additional functionality.

  2. `ifelse` is not going to be the most efficient approach here.

    `table`, `model.matrix`, and solutions using matrix indexing will be quite a bit faster than your current solution. See this Gist for some timings on a 100k-row data.frame: https://gist.github.com/mrdwab/8242632

    Honestly, the result from `table` was a bit surprising to me. There are many who complain that `table` in R is slow, but I felt it performed quite well over here.

    • Randy Zwitch says:

      Thanks for that benchmark. Between your work and the other commenter, it looks pretty clear that model.matrix() is the way to go.

      What is the “-1” doing in the model.matrix code? I never would’ve guessed that “model.matrix(~example$strcol-1)” would be the syntax to create 1/0 columns from a character/Factor!
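On the “-1” question: in R’s formula syntax, `-1` removes the intercept term. With the intercept in place, the first level is treated as the baseline and gets no column of its own; without it, every level gets a 1/0 column. A small sketch with made-up data (only the names `example` and `strcol` come from the thread) shows the difference:

```r
# Hypothetical sample data reusing the names from the comments
example <- data.frame(strcol = c("A", "B", "C", "A"), stringsAsFactors = FALSE)

# Default: an intercept column is added and the first level ("A") is the baseline
with_intercept <- model.matrix(~ strcol, data = example)
colnames(with_intercept)
# "(Intercept)" "strcolB" "strcolC"

# With -1: no intercept, so every level gets its own 1/0 column
no_intercept <- model.matrix(~ strcol - 1, data = example)
colnames(no_intercept)
# "strcolA" "strcolB" "strcolC"
```

Passing the data frame through `data =` rather than writing `example$strcol` in the formula is only a choice made here to keep the column names short.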
