Performance Benchmarking for Dummy Variable Creation

R

How do the four popular methods of creating dummy variables perform on large datasets? Let’s find out!

Author

Rahul

Published

September 27, 2017

Motivation

Very recently, at work, we got into a discussion about the creation of dummy variables in R code. We were dealing with a fairly large dataset of roughly 500,000 observations and roughly 120 predictor variables. Almost all of them were categorical variables, many with a fairly large number of factor levels (think 20-100). The types of models we needed to investigate required the creation of dummy variables (think xgboost). There are a few ways to convert categoricals into dummy variables in R. However, I did not find any comparison of their performance on large datasets.

So here it goes.

Why do we need dummy variables?

I won’t say any more here. Plenty of good resources on the web: here, here, and here.

Ways to create dummy variables in R

These are the methods I’ve found to create dummy variables in R. I’ve explored each of them:

stats::model.matrix()

dummies::dummy.data.frame()

dummy::dummy()

caret::dummyVars()

Let’s prep some data to try these out, using the HairEyeColor dataset as an example. It consists of 3 categorical vars and 1 numerical var, which makes it perfect for experimenting. I’m adding a response variable Y too.
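The prep described above can be sketched as follows. HairEyeColor ships as a 3-dimensional contingency table, so it is flattened into a data frame first; the Y column here is just a made-up random response for exercising the formula interfaces.

```r
# HairEyeColor is a 3-D contingency table; as.data.frame() flattens it
# into Hair, Eye, Sex (factors) plus a numeric Freq column.
df <- as.data.frame(HairEyeColor)

# Add a fake response variable Y (assumption: any numeric response
# is fine for trying out the dummy-variable functions).
set.seed(42)
df$Y <- rnorm(nrow(df))

str(df)  # 32 obs. of 5 variables: 3 factors, 2 numeric columns
```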

dummy creates dummy variables of all the factors and character vectors in a data frame. It also supports settings in which the user only wants to compute dummies for the categorical values that were present in another data set. This is especially useful in the context of predictive modeling, in which the new (test) data has more or other categories than the training data. ^{1}

Some pros

Works with tibbles

Retains numerical columns as is

Can create dummy variables for numeric columns too

The p parameter can select only the most frequent levels

Returns just the dummy variables as a separate dataframe, so you can grab only those


Some cons

Doesn’t have a formula interface to specify what Y is; you need to manually remove the response variable from the dataframe
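A minimal sketch of dummy::dummy() in action, illustrating the con above: since there is no formula interface, the response is dropped by hand before the call, and the numeric columns (which dummy() ignores) are bound back on afterwards.

```r
library(dummy)

df <- as.data.frame(HairEyeColor)
df$Y <- rnorm(nrow(df))

# No formula interface: drop the response column manually first.
dummies_only <- dummy(df[, setdiff(names(df), "Y")])

# dummy() returns a separate data frame holding just the dummy columns
# (Hair: 4 levels, Eye: 4, Sex: 2 -> 10 columns); rebind the rest.
out <- cbind(dummies_only, Freq = df$Freq, Y = df$Y)
```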

Lastly, there’s the caret package’s dummyVars(). This follows a different paradigm. First, we create a recipe of sorts: an object that specifies how the dataframe gets dummy-fied. Then, we use predict() to make the actual conversions.

Some pros

Can create either a full-rank or a less-than-full-rank matrix post-conversion

Has a feature to keep only the level names in the final dummy columns

Can directly create a sparse matrix

Retains numerical columns as is

Some cons

Y needs to be a factor

If the categorical variables aren’t factors, you can’t use the sep=' ' feature
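The two-step caret paradigm looks roughly like this (a sketch, not the post’s exact code; fullRank = TRUE is one of the options mentioned above):

```r
library(caret)

df <- as.data.frame(HairEyeColor)
df$Y <- rnorm(nrow(df))

# Step 1: build the "recipe" object describing the conversion.
# fullRank = TRUE drops one level per factor to avoid collinearity.
dv <- dummyVars(Y ~ ., data = df, fullRank = TRUE)

# Step 2: apply it with predict() to get the dummy-fied data
# (Hair: 3 cols, Eye: 3, Sex: 1, plus numeric Freq -> 8 columns).
dummied <- predict(dv, newdata = df)
```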

I’ve run these benchmarks on my Macbook Pro with these specs:

Processor Name: Intel Core i5

Processor Speed: 2.4 GHz

Number of Processors: 1

Total Number of Cores: 2

L2 Cache (per Core): 256 KB

L3 Cache: 3 MB

Memory: 8 GB

Smaller datasets

The first dataset used is HairEyeColor: 32 rows, 1 numeric var, 3 categorical vars. All the resulting dataframes are as similar as possible; they all retain the Y variable at the end.

The results speak for themselves. stats::model.matrix() is clearly the fastest, with dummies and caret being a more distant 2nd and 3rd.
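One plausible way to reproduce this timing comparison, assuming the microbenchmark package and all four dummy-variable packages are installed (the exact calls the post timed are not shown, so treat this as a sketch):

```r
library(microbenchmark)

df <- as.data.frame(HairEyeColor)
df$Y <- rnorm(nrow(df))

# Time all four approaches on the same small data frame.
res <- microbenchmark(
  stats   = stats::model.matrix(Y ~ ., data = df),
  dummies = dummies::dummy.data.frame(df[, setdiff(names(df), "Y")]),
  dummy   = dummy::dummy(df[, setdiff(names(df), "Y")]),
  caret   = predict(caret::dummyVars(Y ~ ., data = df), newdata = df),
  times   = 100
)
res
```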

Large datasets

To leverage a large dataset for this analysis, I’m using the Accident & Traffic Flow dataset, which is fairly big: 570,011 rows and 33 columns. I’ve narrowed it down to 7 categorical variables to test the packages, and I’ve created a fake response variable as well.
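Since the original CSV isn’t bundled here, a synthetic stand-in with the same shape (570,011 rows, 7 factor columns with 20-100 levels each, plus a fake response) can be generated like this — the column names and level counts are made up:

```r
set.seed(1)
n <- 570011

# 7 categorical columns, each with a random number of levels between
# 20 and 100, mimicking the structure described in the post.
big <- lapply(1:7, function(i) {
  k <- sample(20:100, 1)
  factor(sample(paste0("lvl", 1:k), n, replace = TRUE))
})
names(big) <- paste0("cat", 1:7)
big <- as.data.frame(big)

# Fake response variable for the formula interfaces.
big$Y <- rnorm(n)

dim(big)  # 570011 rows, 8 columns
```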