Perf Benchmarking Dummy Variables - Part II

tl;dr

  • {stats} continues to dominate the speed tests
  • {fastDummies} reaches similar speeds only for data frames with ~1M rows
  • {dummy} and {dummies} are the slowest

Motivation

In 2017, I compared the performance of four packages ({stats}, {dummies}, {dummy} and {caret}) for creating dummy variables in this post.

Jacob Kaplan of UPenn has since created a new package, {fastDummies}, which claims to be faster than the other existing packages.

Let’s test it out.

Machine

I’m running these tests on a 2019 MacBook Pro running macOS Catalina (10.15.7) with a 2.4 GHz 8-Core Intel i9 and 32 GB of 2400 MHz DDR4, in a Docker container running:

platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          0.0                         
year           2020                        
month          04                          
day            24                          
svn rev        78286                       
language       R                           
version.string R version 4.0.0 (2020-04-24)
nickname       Arbor Day 

Perf Testing

A quick test

Create a test dataset…

NROW  <- 1e4
fac_levels <- c(4, 4, 5, 5, 7, 7, 9, 9)
input_data <- tibble::tibble(
    facVar_1 = as.factor(sample(LETTERS[1:fac_levels[1]], size = NROW, replace = TRUE)),
    facVar_2 = as.factor(sample(LETTERS[1:fac_levels[2]], size = NROW, replace = TRUE)),
    facVar_3 = as.factor(sample(LETTERS[1:fac_levels[3]], size = NROW, replace = TRUE)),
    facVar_4 = as.factor(sample(LETTERS[1:fac_levels[4]], size = NROW, replace = TRUE)),
    facVar_5 = as.factor(sample(LETTERS[1:fac_levels[5]], size = NROW, replace = TRUE)),
    facVar_6 = as.factor(sample(LETTERS[1:fac_levels[6]], size = NROW, replace = TRUE)),
    facVar_7 = as.factor(sample(LETTERS[1:fac_levels[7]], size = NROW, replace = TRUE)),
    facVar_8 = as.factor(sample(LETTERS[1:fac_levels[8]], size = NROW, replace = TRUE))
)
str(input_data)
## tibble [10,000 × 8] (S3: tbl_df/tbl/data.frame)
##  $ facVar_1: Factor w/ 4 levels "A","B","C","D": 3 1 3 3 3 3 3 3 3 4 ...
##  $ facVar_2: Factor w/ 4 levels "A","B","C","D": 3 2 3 3 2 2 3 1 2 3 ...
##  $ facVar_3: Factor w/ 5 levels "A","B","C","D",..: 4 2 3 4 1 4 2 5 3 3 ...
##  $ facVar_4: Factor w/ 5 levels "A","B","C","D",..: 1 2 2 2 5 5 2 4 2 3 ...
##  $ facVar_5: Factor w/ 7 levels "A","B","C","D",..: 4 7 7 5 6 2 5 7 4 4 ...
##  $ facVar_6: Factor w/ 7 levels "A","B","C","D",..: 1 2 2 5 2 6 7 3 7 2 ...
##  $ facVar_7: Factor w/ 9 levels "A","B","C","D",..: 4 2 5 1 3 9 1 4 5 8 ...
##  $ facVar_8: Factor w/ 9 levels "A","B","C","D",..: 4 6 9 8 6 3 5 8 4 2 ...

Run microbenchmark…

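library(dplyr)    # provides %>%, as_tibble() and summarize()
library(ggplot2)  # provides autoplot(), ggplot() and aes()
library(plotly)   # provides ggplotly()

# One wrapper per package, so each benchmark entry has the same call shape
stats_fn <- function(dat) stats::model.matrix(~ . - 1, dat)
dummies_fn <- function(dat) dummies::dummy.data.frame(as.data.frame(dat))
dummy_fn <- function(dat) dummy::dummy(dat)
caret_fn <- function(dat) {caret::dummyVars(formula = ~ ., data = dat) %>% predict(newdata = dat)}
fastDummies_fn <- function(dat) fastDummies::dummy_cols(dat)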

microbenchmark::microbenchmark(
    stats =       stats_fn(input_data),
    dummies =     dummies_fn(input_data),
    dummy =       dummy_fn(input_data),
    caret =       caret_fn(input_data),
    fastDummies = fastDummies_fn(input_data),
    times = 10L
    ) %>% autoplot()

stats is still clearly the fastest of all the packages, for this moderately sized dataset.
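
For readers who have not seen the output before, here is a minimal sketch of what the stats approach returns. The `toy` data frame is made up for illustration and is not part of the benchmark. Note that with the `~ . - 1` formula, `model.matrix()` keeps every level of the first factor but drops the reference level of each later factor, so its output may not match the other packages column for column.

```r
# Tiny hand-made data frame (hypothetical, for illustration only)
toy <- data.frame(
    colour = factor(c("red", "green", "red")),
    size   = factor(c("S", "L", "M"))
)

# Same call as stats_fn(): no intercept, all columns expanded
mm <- stats::model.matrix(~ . - 1, toy)

colnames(mm)  # one column per level of colour, and per non-reference level of size
dim(mm)       # rows are preserved; columns are the 0/1 indicators
```

Here `colour` contributes two columns, while `size` contributes only two of its three levels, because its first level (`L`) acts as the reference.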

Dig a bit deeper

How does the performance vary when the number of rows, columns, or levels per factor is scaled?

First, make some functions to create dataframes with varying rows/cols/levels per variable, run benchmarks & extract median execution times.

make_data <- function(NROW = 10, NCOL = 5, NFAC = 5){
    sapply(1:NCOL, 
           function(x) sample(LETTERS[1:NFAC], 
                              size = NROW, 
                              replace = TRUE)) %>% 
        as_tibble()
    
}
run_benchmark <- function(dat){
    microbenchmark::microbenchmark(
    stats =       stats_fn(dat),
    dummies =     dummies_fn(dat),
    dummy =       dummy_fn(dat),
    caret =       caret_fn(dat),
    fastDummies = fastDummies_fn(dat),
    times = 10L
    )
}
extract_median_time <- function(benchmarks){
    as_tibble(benchmarks) %>% 
        dplyr::group_by(expr) %>% 
        summarize(median_ms = median(time) * 1e-6)
}
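
The `* 1e-6` in extract_median_time is there because microbenchmark records each run's `time` in nanoseconds, so multiplying the median by 1e-6 gives milliseconds. A minimal base-R sketch of the same summary, using a hand-built data frame with made-up timings as a stand-in for real benchmark output:

```r
# Hypothetical timings in nanoseconds, shaped like a microbenchmark result
# (one row per run, with an `expr` label and a `time` value)
fake_benchmarks <- data.frame(
    expr = factor(c("stats", "stats", "stats", "caret", "caret", "caret")),
    time = c(2e6, 3e6, 10e6, 40e6, 50e6, 45e6)
)

# Median per expression, converted from nanoseconds to milliseconds
median_times <- aggregate(time ~ expr, data = fake_benchmarks,
                          FUN = function(ns) median(ns) * 1e-6)
names(median_times)[2] <- "median_ms"
median_times
```

The median is used rather than the mean so that a single slow run (such as the 10e6 ns outlier above) does not distort the comparison.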

make_data makes a pretty simple tibble:

make_data(NROW = 5, NCOL = 6, NFAC = 3)
## # A tibble: 5 x 6
##   V1    V2    V3    V4    V5    V6   
##   <chr> <chr> <chr> <chr> <chr> <chr>
## 1 A     B     C     A     C     A    
## 2 B     A     A     B     C     C    
## 3 A     C     C     B     B     A    
## 4 C     A     A     C     C     C    
## 5 C     A     C     B     C     C

How does performance scale by number of rows?

stats still rocks. Only with very large datasets (around 1M rows) does fastDummies approach similar speed.

experiment_rows <- tibble::tibble(
    nrows = 10^(1:6)
    ) %>% 
    dplyr::mutate(input_data = purrr::map(nrows, ~make_data(NROW = .x, NCOL = 5, NFAC = 5)),
                  benchmarks = purrr::map(input_data, ~run_benchmark(.x)),
                  median_times = purrr::map(benchmarks, ~extract_median_time(.x)))
experiment_rows %>% 
    dplyr::select(nrows, median_times) %>%
    tidyr::unnest(cols = c(median_times)) %>%
    dplyr::rename(Package = expr) %>% 
    tidyr::pivot_wider(names_from = Package, values_from = median_ms) %>% 
    dplyr::mutate(
        dummies = dummies/stats,
        dummy = dummy/stats,
        caret = caret/stats,
        fastDummies = fastDummies/stats,
        stats = 1
    ) %>%
    tidyr::pivot_longer(-nrows) %>% 
    ggplot(aes(nrows, value, color = name)) +
    geom_line() +
    geom_point(aes(text = glue::glue("<b>{title}</b> {verb} {y}x", 
                                     title = name, 
                                     verb = ifelse(name == "stats", ":", "slower by"), 
                                     y = ifelse(value > 2,
                                            round(value),
                                            round(value, digits = 1))))) +
    scale_y_log10(labels = scales::label_number(accuracy = 1, suffix = "x")) +
    scale_x_log10(breaks = 10^(1:6), labels = scales::label_number_si()) +
    labs(x = "Number of Rows", y = "Execution Time Relative to stats", 
         title = "Row Performance (log-log scale)") -> p
ggplotly(p, tooltip = "text")
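
The plot shows each package's median time divided by the stats median, so stats is always 1x and every other value reads as "slower by N times". A small sketch of that normalization and of the tooltip rounding rule, using made-up medians rather than measured ones:

```r
# Made-up median times in milliseconds, for illustration only
median_ms <- c(stats = 2, caret = 10, dummy = 50, fastDummies = 3)

# Normalize against stats, as the mutate() step does
relative <- median_ms / median_ms[["stats"]]

# Same rounding rule as the tooltip: whole numbers above 2x, one decimal otherwise
label <- ifelse(relative > 2, round(relative), round(relative, digits = 1))
label
```

Keeping one decimal below 2x matters near the crossover: a package that is 1.5x slower would otherwise be rounded to 2x.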

How does performance scale by number of columns?

stats is the clear winner here.
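
The code for this experiment is not reproduced here. As a rough, self-contained sketch only, column scaling for the stats approach alone can be probed with base R. This swaps microbenchmark for `system.time()`, which is much coarser, and the helpers `make_factor_data` and `median_elapsed` are hypothetical names introduced for this example.

```r
# Build a data frame of `ncol` factor columns, each with `nlev` levels
make_factor_data <- function(nrow, ncol, nlev = 5) {
    cols <- lapply(seq_len(ncol), function(i) {
        factor(sample(LETTERS[seq_len(nlev)], size = nrow, replace = TRUE),
               levels = LETTERS[seq_len(nlev)])
    })
    names(cols) <- paste0("V", seq_len(ncol))
    as.data.frame(cols)
}

# Median elapsed seconds over a few repetitions of the stats approach
median_elapsed <- function(dat, reps = 5) {
    median(replicate(
        reps,
        system.time(stats::model.matrix(~ . - 1, dat))[["elapsed"]]
    ))
}

set.seed(42)
ncols <- c(5, 10, 20)
col_times <- vapply(
    ncols,
    function(k) median_elapsed(make_factor_data(nrow = 1000, ncol = k)),
    numeric(1)
)
data.frame(ncols = ncols, median_s = col_times)
```

For a comparison across all five packages, the same idea would plug a varying `NCOL` into make_data and run_benchmark, mirroring the row experiment.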

How does performance scale by number of levels?

Interestingly, the number of levels per factor has little or no impact on performance for stats, caret and dummies. fastDummies & dummies slow down as the number of levels grows.

Conclusion

See tl;dr