EDA: R Package History

Visualization
EDA
Exploration of CRAN Package History
Author

Rahul

Published

July 31, 2022

Introduction

Welcome to the Exploratory Data Analysis of the CRAN historical data set. As you may already know, CRAN is a network of servers around the world that stores code and documentation for R packages over time. As of writing this EDA, CRAN had just over 18,000 packages available in its repository.

CRAN Website

Heads or Tails has done a great job of grabbing historical data, cleaning it up and preparing it for us R enthusiasts. Read about the approach he followed in his blogpost.

Initial Setup

Libraries

Code
# Data Manipulation
library(dplyr)
library(tidyr)
library(readr)
library(skimr)
library(purrr)
library(stringr)
library(urltools)
library(magrittr)

# Plots
library(ggplot2)
library(naniar)
library(packcircles)
library(ggridges)
if(!require(streamgraph))
  devtools::install_github("hrbrmstr/streamgraph", quiet = TRUE)
library(streamgraph)
library(patchwork)

# Tables
library(reactable)

# Settings
theme_set(theme_minimal(
  base_size = 14,
  base_family = "Menlo"))
theme_update(
  plot.title.position = "plot"
)

Read In

Let’s start by reading in the data. There are two CSV files in this dataset. From his dataset page:

  • cran_package_overview.csv: all R packages currently available through CRAN, with (usually) 1 row per package…
  • cran_package_history.csv: version history of virtually all packages in the previous table...
Code
hist_dt <- read_csv(
  "input/r-package-history-on-cran/cran_package_history.csv",
  col_types = cols(
    package = col_character(),
    version = col_character(),
    date = col_date(format = "%Y-%m-%d"),
    repository = col_character()
  )
)
ov_dt <- read_csv(
  "input/r-package-history-on-cran/cran_package_overview.csv",
  col_types = cols(
    package = col_character(),
    version = col_character(),
    depends = col_character(),
    imports = col_character(),
    license = col_character(),
    needs_compilation = col_logical(),
    author = col_character(),
    bug_reports = col_character(),
    url = col_character(),
    date_published = col_date(format = "%Y-%m-%d"),
    description = col_character(),
    title = col_character()
  )
)
glimpse(hist_dt, 80)
Rows: 119,464
Columns: 4
$ package    <chr> "A3", "A3", "A3", "AATtools", "ABACUS", "abbreviate", "abby…
$ version    <chr> "0.9.1", "0.9.2", "1.0.0", "0.0.1", "1.0.0", "0.1", "0.1", …
$ date       <date> 2013-02-07, 2013-03-26, 2015-08-16, 2020-06-14, 2019-09-20…
$ repository <chr> "Archive", "Archive", "CRAN", "CRAN", "CRAN", "CRAN", "Arch…
Code
glimpse(ov_dt, 80)
Rows: 18,388
Columns: 12
$ package           <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", …
$ version           <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", …
$ depends           <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R…
$ imports           <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2…
$ license           <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file…
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
$ author            <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "…
$ bug_reports       <chr> NA, "https://github.com/Spiritspeak/AATtools/issues"…
$ url               <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "htt…
$ date_published    <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 201…
$ description       <chr> "Supplies tools for tabulating and analyzing the res…
$ title             <chr> "Accurate, Adaptable, and Accessible Error Metrics f…

Quick View

I love to take the first peek into a dataset with the amazing {skimr} package. We can see that the right data types are set for all the columns, and the dates have been imported correctly.

We can see in the history data that the first package reported on CRAN was published 24 years ago, on 1998-02-25! Furthermore, the overview tells us there’s a package {pack} last published/updated on 2008-09-08.

While there’s no missing data in the history dataset, there are a bunch of missing values in the overview dataset. Let’s explore this a bit more.

Code
skimr::skim(hist_dt)
Data summary
Name hist_dt
Number of rows 119464
Number of columns 4
_______________________
Column type frequency:
character 3
Date 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
package 0 1 2 32 0 18372 0
version 0 1 3 15 0 10074 0
repository 0 1 4 7 0 2 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 1998-02-25 2022-07-17 2018-01-22 7119
Code
skimr::skim(ov_dt)
Data summary
Name ov_dt
Number of rows 18388
Number of columns 12
_______________________
Column type frequency:
character 10
Date 1
logical 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
package 0 1.00 2 32 0 18371 0
version 0 1.00 3 14 0 2112 0
depends 4862 0.74 2 218 0 4289 0
imports 4026 0.78 2 573 0 12041 0
license 0 1.00 3 54 0 159 0
author 0 1.00 4 4096 0 15652 0
bug_reports 10760 0.41 11 81 0 7578 0
url 8426 0.54 4 466 0 9548 0
description 0 1.00 5 7372 0 18368 0
title 0 1.00 5 210 0 18306 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date_published 1 1 2008-09-08 2022-07-17 2021-03-22 2542

Variable type: logical

skim_variable n_missing complete_rate mean count
needs_compilation 0 1 0.24 FAL: 13991, TRU: 4397
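
The skims above report 18,388 rows in ov_dt but only 18,371 distinct package values, so a handful of packages appear more than once (consistent with the "usually 1 row per package" note on the dataset page). A minimal base R sketch for listing such repeats follows. The toy data frame and its rows are invented for illustration and are not taken from the dataset.

```r
# Toy stand-in for ov_dt; these rows are invented for illustration only
toy <- data.frame(
  package = c("pkgA", "pkgB", "pkgB", "pkgC", "pkgC", "pkgC"),
  version = c("1.0.0", "0.1", "0.2", "2.0", "2.0", "2.1"),
  stringsAsFactors = FALSE
)

# Count rows per package and keep only the names that occur more than once
n_per_pkg <- table(toy$package)
repeated  <- names(n_per_pkg[n_per_pkg > 1])

# All rows belonging to a repeated package, for eyeballing the differences
dup_rows <- toy[toy$package %in% repeated, ]
print(dup_rows)
```

Looking at the full rows, rather than just the names, helps decide whether the repeats are exact duplicates or genuinely different entries.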

Data Quality

My favorite way of exploring missing data is to make it visible, using Nick Tierney’s amazing {naniar} package. There are a few columns with missing data. Let’s look at these more closely.

  • depends and imports have roughly a quarter of the data as <NA>. These are, roughly speaking, packages with no declared external dependencies of that type. The difference between the two can get a bit complex; best to learn about it in Hadley’s chapter here.
  • Roughly half of bug_reports and url are missing. These don’t seem to be data issues so much as authors who don’t have a place to report bugs or a website for their package, respectively.
  • date_published has only 1 row with a missing value, which seems like a data quality issue.
Code
ov_dt %>% 
  dplyr::arrange(date_published) %>% 
  vis_miss()
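
As a numeric complement to the plot, the share of missing values per column can be computed directly. This is a small base R sketch on an invented toy data frame. The column names mirror the overview data, but the values are made up.

```r
# Toy stand-in for a few ov_dt columns; values are invented
toy <- data.frame(
  depends = c("R (>= 3.5.0)", NA, "methods", NA),
  url     = c(NA, NA, "https://example.org", NA),
  license = c("GPL-3", "MIT + file LICENSE", "GPL-2", "GPL-3"),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; column means are the missing shares
miss_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(round(miss_share, 2))
```

Sorting in decreasing order puts the columns that need the most attention first.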

Interesting Questions

Since this is an open ended exploration - unlike other EDA with the purpose of building a predictive model - before I continue to the plotting, I’d like to posit some questions which will guide the flow of further work. The first five questions are from Martin’s blog, with further questions which I think would be interesting to explore.

  1. How long did packages take from their first release to version 1.0?
  2. Which packages have had the most version updates?
  3. What type of packages were most frequent in different years?
  4. Who are the most productive authors?
  5. Can you predict the growth toward 2025?
  6. How many packages use all CAPS, all small, or a mixture?
  7. Are there any temporal patterns to when versions are submitted to CRAN?
  8. Do authors use minor versions?
  9. What license is most used? Has there been a change over time? done
  10. How have the dependencies & imports changed over time? done
  11. Which repositories do packages use? Github/Bitbucket etc. How do these vary over time? done
  12. Do packages have URLs for bug reports? done
  13. Have titles & descriptions gotten longer over time? done

Feature Development

To aid answering many of these, I first need to create a few new features in the overview data set.

Read about the feature development in the tabs below. We go from 12 columns to 29 columns in the overview data set.

Version Numbers

Per the R package section in Hadley’s book, “an R package version is a sequence of at least two integers separated by either . or -. For example, 1.0 and 0.9.1-10 are valid versions, but 1 and 1.0-devel are not”. Typically, packages do follow the three number format of <major>.<minor>.<patch>. I’m making an assumption this is true, just to simplify things. I have a feeling it’ll capture most of the cases.

This feature could help answer questions about version number progressions.

Adding 3 columns here…

Code
split_versions <- function(dat) {
  stopifnot("version" %in% names(dat))
  
  dat %>%
    separate(
      version,
      into =
        c("major", "minor", "patch"),
      sep = "\\.",
      extra = "merge", # for versions like 1.0.3-3000, keep the '3-3000' together in the 3rd col
      fill = "right",
      remove = FALSE
    )
}

ov_dt <- ov_dt %>% split_versions()
hist_dt <- hist_dt %>% split_versions()
glimpse(ov_dt, 100)
Rows: 18,388
Columns: 15
$ package           <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version           <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major             <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor             <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch             <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends           <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports           <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license           <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author            <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports       <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url               <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published    <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description       <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title             <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
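
Because the three new columns are character, two caveats are worth keeping in mind: some versions have fewer or more than three components, and comparing version strings as text does not order them correctly. The base R sketch below illustrates both points on a few example version strings. It is an aside under those assumptions, not part of the feature pipeline above.

```r
versions <- c("1.0.0", "0.1", "0.9.1-10", "1.0.3-3000")

# Number of integer components per version ('.' and '-' both separate them)
n_parts <- lengths(strsplit(versions, "[.-]"))

# Comparing versions as text goes wrong once a component has two digits
text_order <- "0.10.0" > "0.9.0"

# numeric_version() (base R) compares component by component instead
version_order <- numeric_version("0.10.0") > numeric_version("0.9.0")

print(data.frame(version = versions, n_parts = n_parts))
print(c(text = text_order, numeric_version = version_order))
```

If the major, minor, and patch columns are later used for sorting or arithmetic, converting them to integers (or comparing whole versions with numeric_version()) avoids the text-ordering pitfall.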

Dependencies & Imports

For the last published version of each package, how many dependencies and/or imports does it have? My hypothesis is that packages in the past relied on fewer dependencies since they were more likely than not written in base R. With the recent explosion in the adoption of R, and of the tidyverse framework, more recent packages would have a larger set of dependencies.

Adding 2 columns here…

Code
ov_dt <- ov_dt %>% 
  mutate(
    # Dependencies
    num_dep = purrr::map_int(
      .x = depends,
      .f = function(x){
        x %>% 
          stringr::str_split(",", simplify = TRUE) %>% 
          length()
      }
    ),
    num_dep = ifelse(is.na(depends), 0, num_dep),
    # Imports
    num_imports = purrr::map_int(
      .x = imports,
      .f = function(x){
        x %>% 
          stringr::str_split(",", simplify = TRUE) %>% 
          length()
      }
    ),
    num_imports = ifelse(is.na(imports), 0, num_imports)
  )
glimpse(ov_dt, 100)
Rows: 18,388
Columns: 17
$ package           <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version           <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major             <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor             <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch             <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends           <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports           <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license           <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author            <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports       <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url               <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published    <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description       <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title             <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
$ num_dep           <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
$ num_imports       <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
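
Note that the comma count treats the R (>= x.y.z) entry in depends as a dependency, which is why A3, with "R (>= 2.15.0), xtable, pbapply", gets num_dep = 3. Whether to count it is a judgment call. If only package dependencies are of interest, a possible alternative looks like the sketch below. The helper name count_pkg_deps is hypothetical and is not used elsewhere in this analysis.

```r
# Hypothetical helper: count comma-separated entries, ignoring the R version
# requirement itself (e.g. "R (>= 3.6.0)") and treating NA as zero
count_pkg_deps <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(0L)
    parts <- trimws(strsplit(s, ",")[[1]])
    parts <- parts[nzchar(parts)]
    # "R" followed by whitespace, "(" or end of string is the R requirement;
    # names such as "Rcpp" or "R6" do not match
    sum(!grepl("^R([[:space:]]|\\(|$)", parts))
  }, integer(1), USE.NAMES = FALSE)
}

deps <- c("R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", NA, "Rcpp, R6")
print(count_pkg_deps(deps))
```

This changes the counts relative to the num_dep column, so the two should not be mixed in one plot.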

Authors

How many authors did the latest release have? Perhaps this could provide some insight into whether package authors are collaborating more than they used to. Is the R community working together?

Adding 1 column here…

Code
ov_dt <- ov_dt %>% 
  mutate(
    num_authors = purrr::map_int(
      .x = author,
      .f = function(x){
        x %>% 
          stringr::str_split(",", simplify = TRUE) %>% 
          length()
      }
    )
  )
glimpse(ov_dt, 100)
Rows: 18,388
Columns: 18
$ package           <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version           <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major             <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor             <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch             <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends           <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports           <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license           <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author            <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports       <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url               <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published    <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description       <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title             <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
$ num_dep           <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
$ num_imports       <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
$ num_authors       <int> 1, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3, 3, 4, 5, 4, 2, 2, 3, 14, 4, 3, 1, 6, 6,…
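
One caveat with counting commas: role tags such as [aut, cre] contain commas themselves, so "Sercan Kahveci [aut, cre]" is counted as 2 authors in the output above. A rough alternative strips bracketed roles and parenthetical notes before splitting. The helper name count_authors is hypothetical, and the cleaning rules are assumptions that will not cover every author string on CRAN.

```r
# Hypothetical helper: estimate the number of authors in an Author field
count_authors <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(0L)
    s <- gsub("\\[[^]]*\\]", "", s)  # drop role tags like [aut, cre]
    s <- gsub("\\([^)]*\\)", "", s)  # drop parentheticals such as ORCID links
    parts <- trimws(strsplit(s, ",|\\s+and\\s+", perl = TRUE)[[1]])
    sum(nzchar(parts))
  }, integer(1), USE.NAMES = FALSE)
}

authors <- c(
  "Scott Fortmann-Roe",
  "Sercan Kahveci [aut, cre]",
  "A One [aut], B Two [ctb] (<https://orcid.org/0000>), C Three",
  "Ann Lee and Bob Ray"
)
print(count_authors(authors))
```

The last two strings are invented examples. Real Author fields are free text, so any such count remains an approximation.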

Temporal

Temporal features are typically useful for aggregation downstream.

Adding 6 columns here…

Code
hist_dt <- hist_dt %>% 
  mutate(
    year = lubridate::year(date),
    month = lubridate::month(date, label = TRUE),
    day = lubridate::day(date),
    wday = lubridate::wday(date, label = TRUE),
    yr_mon = sprintf("%d-%s", year, month),
    dt = lubridate::ym(paste0(year, "-", month))
  )
ov_dt <- ov_dt %>% 
  filter(!is.na(date_published)) %>%
  mutate(
    year = lubridate::year(date_published),
    month = lubridate::month(date_published, label = TRUE),
    day = lubridate::day(date_published),
    wday = lubridate::wday(date_published, label = TRUE),
    yr_mon = sprintf("%d-%s", year, month),
    dt = lubridate::ym(paste0(year, "-", month))
  )
glimpse(ov_dt, 100)
Rows: 18,387
Columns: 24
$ package           <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version           <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major             <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor             <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch             <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends           <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports           <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license           <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author            <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports       <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url               <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published    <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description       <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title             <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
$ num_dep           <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
$ num_imports       <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
$ num_authors       <int> 1, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3, 3, 4, 5, 4, 2, 2, 3, 14, 4, 3, 1, 6, 6,…
$ year              <dbl> 2015, 2020, 2019, 2021, 2019, 2022, 2015, 2016, 2019, 2017, 2022, 2017, …
$ month             <ord> Aug, Jun, Sep, Dec, Jun, May, May, Oct, Nov, Mar, May, Nov, Feb, May, Ju…
$ day               <int> 16, 14, 20, 14, 25, 19, 5, 20, 13, 13, 28, 6, 4, 28, 17, 3, 20, 30, 22, …
$ wday              <ord> Sun, Sun, Fri, Tue, Tue, Thu, Tue, Thu, Wed, Mon, Sat, Mon, Thu, Thu, Tu…
$ yr_mon            <chr> "2015-Aug", "2020-Jun", "2019-Sep", "2021-Dec", "2019-Jun", "2022-May", …
$ dt                <date> 2015-08-01, 2020-06-01, 2019-09-01, 2021-12-01, 2019-06-01, 2022-05-01,…
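
The dt column is built by pasting the year and the abbreviated month label together and parsing the result back with lubridate::ym(). An equivalent first-of-month date can be derived without the round trip through month names. The base R sketch below assumes only the month bucket is needed; the variable names are illustrative.

```r
dates <- as.Date(c("2015-08-16", "2020-06-14", "2019-09-20"))

# First day of each date's month, as a Date
dt <- as.Date(format(dates, "%Y-%m-01"))

# A year-month key with numeric months, which also sorts chronologically as text
yr_mon <- format(dates, "%Y-%m")

print(data.frame(date = dates, dt = dt, yr_mon = yr_mon))
```

A numeric key such as "2015-08" differs from the "2015-Aug" labels used above, but it orders correctly when sorted as a string.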

Titles & Descriptions

How long are the titles and description fields in the latest package submissions? Any interesting trends over time?

Adding 2 columns here…

Code
ov_dt <- ov_dt %>%
  mutate(
    len_title = purrr::map_int(title, ~ stringr::str_count(.x, "\\w+")),
    len_desc = purrr::map_int(description, ~ stringr::str_count(.x, "\\w+"))
  )
glimpse(ov_dt, 100)
Rows: 18,387
Columns: 26
$ package           <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version           <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major             <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor             <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch             <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends           <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports           <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license           <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author            <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports       <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url               <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published    <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description       <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title             <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
$ num_dep           <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
$ num_imports       <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
$ num_authors       <int> 1, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3, 3, 4, 5, 4, 2, 2, 3, 14, 4, 3, 1, 6, 6,…
$ year              <dbl> 2015, 2020, 2019, 2021, 2019, 2022, 2015, 2016, 2019, 2017, 2022, 2017, …
$ month             <ord> Aug, Jun, Sep, Dec, Jun, May, May, Oct, Nov, Mar, May, Nov, Feb, May, Ju…
$ day               <int> 16, 14, 20, 14, 25, 19, 5, 20, 13, 13, 28, 6, 4, 28, 17, 3, 20, 30, 22, …
$ wday              <ord> Sun, Sun, Fri, Tue, Tue, Thu, Tue, Thu, Wed, Mon, Sat, Mon, Thu, Thu, Tu…
$ yr_mon            <chr> "2015-Aug", "2020-Jun", "2019-Sep", "2021-Dec", "2019-Jun", "2022-May", …
$ dt                <date> 2015-08-01, 2020-06-01, 2019-09-01, 2021-12-01, 2019-06-01, 2022-05-01,…
$ len_title         <int> 9, 9, 8, 3, 8, 6, 8, 6, 9, 3, 5, 7, 7, 7, 4, 5, 5, 3, 4, 4, 5, 3, 8, 11,…
$ len_desc          <int> 28, 24, 40, 21, 52, 36, 11, 32, 41, 163, 19, 56, 33, 89, 12, 25, 88, 65,…
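
To make the length measure concrete: the pattern \w+ counts runs of letters, digits, and underscores, so lengths here are word counts rather than character counts, and hyphenated terms are split into their parts. A base R sketch of the same idea on invented example titles (count_words is a hypothetical helper):

```r
# Hypothetical helper: count runs of word characters in each string
count_words <- function(x) lengths(regmatches(x, gregexpr("\\w+", x)))

titles <- c("Tools for Tidy Data", "State-of-the-art models", "")
print(count_words(titles))
```

The hyphenated example contributes one count per part, which slightly inflates lengths for titles heavy on compound terms.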

Licenses

The raw dataset has 159 unique levels for the license variable.

Code
ov_dt %>% 
  count(license) %>% 
  reactable(compact = TRUE, defaultSorted = "n", defaultSortOrder = "desc")

But with many of them being quite similar to each other, some binning is in order to extract some patterns. Here, I use case_when to bin together similar licenses. (I’m no expert in these licenses. I’m sure I’m taking some liberties in the grouping here.)

Adding 1 column here…

Code
ov_dt <- ov_dt %>% 
  mutate(
    license_cleaned = case_when(
      str_detect(license, "^GPL-3") ~ "GPL-3",
      str_detect(license, "^GPL\\s\\([\\s\\d\\.<=>]*3") ~ "GPL-3",
      str_detect(license, "^GPL-2") ~ "GPL-2",
      str_detect(license, "^GPL\\s\\([\\s\\d\\.<=>]*2") ~ "GPL-2",
      str_detect(license, "^AGPL") ~ "AGPL",
      str_detect(license, "^LGPL") ~ "LGPL",
      str_detect(license, "Apache") ~ "Apache",
      str_detect(license, "BSD") ~ "BSD",
      str_detect(license, "LGPL") ~ "LGPL",
      str_detect(license, "MIT") ~ "MIT",
      str_detect(license, "CC0") ~ "CC0",
      license == "GPL" ~ "GPL",
      TRUE ~ "Other"
      # Left these out after some trials with plots below:
      # str_detect(license, "GNU") ~ "GNU", 
      # str_detect(license, "MPL") ~ "MPL",
      # str_detect(license, "Unlimited") ~ "Unlimited",
      # str_detect(license, "^CC") ~ "CC",
      )
  )
glimpse(ov_dt, 100)
Rows: 18,387
Columns: 27
$ package           <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version           <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major             <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor             <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch             <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends           <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports           <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license           <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author            <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports       <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url               <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published    <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description       <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title             <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
$ num_dep           <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
$ num_imports       <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
$ num_authors       <int> 1, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3, 3, 4, 5, 4, 2, 2, 3, 14, 4, 3, 1, 6, 6,…
$ year              <dbl> 2015, 2020, 2019, 2021, 2019, 2022, 2015, 2016, 2019, 2017, 2022, 2017, …
$ month             <ord> Aug, Jun, Sep, Dec, Jun, May, May, Oct, Nov, Mar, May, Nov, Feb, May, Ju…
$ day               <int> 16, 14, 20, 14, 25, 19, 5, 20, 13, 13, 28, 6, 4, 28, 17, 3, 20, 30, 22, …
$ wday              <ord> Sun, Sun, Fri, Tue, Tue, Thu, Tue, Thu, Wed, Mon, Sat, Mon, Thu, Thu, Tu…
$ yr_mon            <chr> "2015-Aug", "2020-Jun", "2019-Sep", "2021-Dec", "2019-Jun", "2022-May", …
$ dt                <date> 2015-08-01, 2020-06-01, 2019-09-01, 2021-12-01, 2019-06-01, 2022-05-01,…
$ len_title         <int> 9, 9, 8, 3, 8, 6, 8, 6, 9, 3, 5, 7, 7, 7, 4, 5, 5, 3, 4, 4, 5, 3, 8, 11,…
$ len_desc          <int> 28, 24, 40, 21, 52, 36, 11, 32, 41, 163, 19, 56, 33, 89, 12, 25, 88, 65,…
$ license_cleaned   <chr> "GPL-2", "GPL-3", "GPL-3", "GPL-3", "MIT", "GPL-3", "GPL-3", "GPL-3", "G…
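
The binning relies on rules being tried in order, with the first matching rule deciding the bin. The sketch below re-creates that first-match-wins logic in base R with a subset of the patterns above, to show how a few license strings fall through the rules. The helper bin_license and the rules table are illustrative and are not the code that produced license_cleaned.

```r
# Ordered rules: earlier rows take precedence over later ones
rules <- data.frame(
  label   = c("GPL-3", "GPL-3", "GPL-2", "GPL-2", "LGPL", "MIT", "GPL"),
  pattern = c("^GPL-3", "^GPL\\s\\([\\s\\d\\.<=>]*3",
              "^GPL-2", "^GPL\\s\\([\\s\\d\\.<=>]*2",
              "^LGPL", "MIT", "^GPL$"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: assign each license the label of its first matching rule
bin_license <- function(x, rules) {
  out <- rep(NA_character_, length(x))
  for (i in seq_len(nrow(rules))) {
    hit <- is.na(out) & grepl(rules$pattern[i], x, perl = TRUE)
    out[hit] <- rules$label[i]
  }
  out[is.na(out)] <- "Other"  # nothing matched
  out
}

licenses <- c("GPL (>= 2)", "GPL-3", "MIT + file LICENSE", "LGPL-2.1",
              "Artistic-2.0", "GPL", "GPL-2 | GPL-3")
print(data.frame(license = licenses, bin = bin_license(licenses, rules)))
```

Note how "GPL (>= 2)" lands in GPL-2 even though it also permits later versions, and how the dual specification "GPL-2 | GPL-3" is binned by its first alternative. Both are consequences of the grouping liberties mentioned above.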

Domains

Which domains do package authors typically use? My guess is GitHub rules them all, but is that true? Can we see any rise of other offerings like GitLab or BitBucket?

Adding 2 columns here…

Code
# Extract the domain of a URL; NA_character_ keeps the return type
# character for map_chr() when the input is missing
get_domain <- function(x) {
  if (is.na(x)) NA_character_ else url_parse(x)$domain
}

ov_dt <- ov_dt %>%
  mutate(url_domain = map_chr(url, get_domain),
         bug_domain = map_chr(bug_reports, get_domain))
glimpse(ov_dt, 100)
Rows: 18,387
Columns: 29
$ package           <chr> "A3", "AATtools", "ABACUS", "abbreviate", "abbyyR", "abc", "abc.data", "…
$ version           <chr> "1.0.0", "0.0.1", "1.0.0", "0.1", "0.5.5", "2.2.1", "1.0", "0.9.0", "1.0…
$ major             <chr> "1", "0", "1", "0", "0", "2", "1", "0", "1", "1", "0", "0", "1", "1", "1…
$ minor             <chr> "0", "0", "0", "1", "5", "2", "0", "9", "0", "2", "3", "15", "2", "0", "…
$ patch             <chr> "0", "1", "0", NA, "5", "1", NA, "0", NA, "1", "0", "0", NA, "3", "3", N…
$ depends           <chr> "R (>= 2.15.0), xtable, pbapply", "R (>= 3.6.0)", "R (>= 3.1.0)", NA, "R…
$ imports           <chr> NA, "magrittr, dplyr, doParallel, foreach", "ggplot2 (>= 3.1.0), shiny (…
$ license           <chr> "GPL (>= 2)", "GPL-3", "GPL-3", "GPL-3", "MIT + file LICENSE", "GPL (>= …
$ needs_compilation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRU…
$ author            <chr> "Scott Fortmann-Roe", "Sercan Kahveci [aut, cre]", "Mintu Nath [aut, cre…
$ bug_reports       <chr> NA, "https://github.com/Spiritspeak/AATtools/issues", NA, NA, "http://gi…
$ url               <chr> NA, NA, "https://shiny.abdn.ac.uk/Stats/apps/", "https://github.com/sigb…
$ date_published    <date> 2015-08-16, 2020-06-14, 2019-09-20, 2021-12-14, 2019-06-25, 2022-05-19,…
$ description       <chr> "Supplies tools for tabulating and analyzing the results of predictive m…
$ title             <chr> "Accurate, Adaptable, and Accessible Error Metrics for Predictive\nModel…
$ num_dep           <dbl> 3, 1, 1, 0, 1, 6, 1, 1, 0, 1, 1, 0, 1, 0, 6, 5, 0, 0, 1, 0, 0, 1, 1, 1, …
$ num_imports       <dbl> 0, 4, 3, 0, 6, 0, 0, 3, 1, 1, 2, 4, 0, 1, 0, 0, 1, 0, 4, 4, 3, 2, 0, 8, …
$ num_authors       <int> 1, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3, 3, 4, 5, 4, 2, 2, 3, 14, 4, 3, 1, 6, 6,…
$ year              <dbl> 2015, 2020, 2019, 2021, 2019, 2022, 2015, 2016, 2019, 2017, 2022, 2017, …
$ month             <ord> Aug, Jun, Sep, Dec, Jun, May, May, Oct, Nov, Mar, May, Nov, Feb, May, Ju…
$ day               <int> 16, 14, 20, 14, 25, 19, 5, 20, 13, 13, 28, 6, 4, 28, 17, 3, 20, 30, 22, …
$ wday              <ord> Sun, Sun, Fri, Tue, Tue, Thu, Tue, Thu, Wed, Mon, Sat, Mon, Thu, Thu, Tu…
$ yr_mon            <chr> "2015-Aug", "2020-Jun", "2019-Sep", "2021-Dec", "2019-Jun", "2022-May", …
$ dt                <date> 2015-08-01, 2020-06-01, 2019-09-01, 2021-12-01, 2019-06-01, 2022-05-01,…
$ len_title         <int> 9, 9, 8, 3, 8, 6, 8, 6, 9, 3, 5, 7, 7, 7, 4, 5, 5, 3, 4, 4, 5, 3, 8, 11,…
$ len_desc          <int> 28, 24, 40, 21, 52, 36, 11, 32, 41, 163, 19, 56, 33, 89, 12, 25, 88, 65,…
$ license_cleaned   <chr> "GPL-2", "GPL-3", "GPL-3", "GPL-3", "MIT", "GPL-3", "GPL-3", "GPL-3", "G…
$ url_domain        <chr> NA, NA, "shiny.abdn.ac.uk", "github.com", "github.com", NA, NA, NA, NA, …
$ bug_domain        <chr> NA, "github.com", NA, NA, "github.com", NA, NA, NA, NA, NA, "github.com"…
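
The URL field of a DESCRIPTION file may hold several URLs separated by commas or whitespace, so a single field can point at more than one host. The sketch below is a base R approximation that takes the host of the first URL only. The helper host_of is hypothetical and is not the urltools-based extraction used above; the example URLs are illustrative.

```r
# Hypothetical helper: host of the first URL in a (possibly multi-URL) field
host_of <- function(u) {
  first <- sub("[,[:space:]].*$", "", trimws(u))           # keep the first URL
  host  <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", first)   # drop the scheme
  host  <- sub("[/:?#].*$", "", host)                      # drop port, path, query
  tolower(host)
}

urls <- c(
  "https://github.com/Spiritspeak/AATtools/issues",
  NA,
  "http://r-forge.r-project.org/projects/x/, https://example.org"
)
print(host_of(urls))
```

Missing values pass through as NA, so the result lines up row by row with the input column.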

Graphical EDA

Now that I have the data sets prepared and ready, it’s time for the fun part - being creative and creating some interesting visuals! Let’s attack those questions one at a time.

Package Dependencies

Q: How have the dependencies & imports changed over time?

Here’s a time series plot of the number of imports and dependencies since 2008. Since the values are integers, adding some jitter adds some much needed separation in the individual values.

The data supports my hypothesis that more recent packages would have a larger set of dependencies. We’re currently at a median of 6 imports. But look at the explosion of package dependencies in the recent past!

Also, what happens mid-2015? There’s a clear elbow in the trend right at that time. It doesn’t seem organic; did some popular package get released which others used as a dependency? Did CRAN’s measurement system change?

Code
plot_dotplot_ts <- function(dat,
                            title,
                            plotmean = FALSE,
                            dday = 90,
                            pcutoff = 0.95,
                            size = 1,
                            alpha = 0.06){
  
  stopifnot(all(c("dt", "date_published", "y", "median_y") %in% names(dat)))
  p95 <- quantile(dat$y, pcutoff)
  
  plot_mean <- function(g, dat, plotmean) {
    if (plotmean)
      g <- g +
        geom_smooth(aes(y = mean_y), color = "blue") +
        annotate(
          "text",
          x = max(dat$date_published) + lubridate::ddays(dday),
          y = max(dat$mean_y),
          label = sprintf("Mean: %0.0f", max(dat$mean_y, na.rm = TRUE)),
          vjust = 0.5,
          hjust = 0,
          color = "blue"
        )
    g
  }
  
  g <- dat %>%
    ggplot(aes(x = dt)) +
    geom_jitter(aes(y = y), alpha = alpha, size = size) 
  g <- g %>% plot_mean(dat, plotmean) 
  g +
    geom_smooth(aes(y = median_y), color = "red") +
    scale_x_date(
      date_breaks = "1 year",
      date_labels = "'%y",
      minor_breaks = NULL,
      expand = c(NA, 0.2)
    ) +
    scale_y_continuous(minor_breaks = NULL, limits = c(NA, p95)) +
    theme(
      panel.grid = element_blank(),
      axis.ticks.x.bottom = element_line(size = 0.8, colour = "gray"),
      axis.ticks.y.left = element_line(size = 0.8, colour = "gray"),
      plot.margin = margin(0, 50, 0, 50)
    ) +
    labs(title = title,
         caption = sprintf("Y axis clipped at percentile %0.0f", pcutoff * 100),
         x = NULL, y = NULL) +
    coord_cartesian(clip = "off") +
    annotate(
      "text",
      x = max(dat$date_published) + lubridate::ddays(dday),
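      y = max(dat$median_y, na.rm = TRUE),
      # %g rather than %d: a median can be fractional (e.g. 6.5), which %d rejects
      label = sprintf("Med: %g", max(dat$median_y, na.rm = TRUE)),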
      vjust = 0.5,
      hjust = 0,
      color = "red"
    )
}

ov_dt %>% 
  select(date_published, dt, y = num_imports) %>% 
  timetk::pad_by_time(date_published, .pad_value = 0) %>% 
  mutate(dt = lubridate::ym(paste0(lubridate::year(date_published), "-", lubridate::month(date_published)))) %>% 
  group_by(dt) %>%
  mutate(median_y = median(y)) %>%
  plot_dotplot_ts(
    title = "How have package `imports` changed over time?",
    dday = 90
  ) -> p1

ov_dt %>% 
  select(date_published, dt, y = num_dep) %>% 
  timetk::pad_by_time(date_published, .pad_value = 0) %>% 
  mutate(dt = lubridate::ym(paste0(lubridate::year(date_published), "-", lubridate::month(date_published)))) %>% 
  group_by(dt) %>%
  mutate(median_y = median(y))%>%
  plot_dotplot_ts(
    title = "How have package `dependencies` changed over time?",
    dday = 90
  ) -> p2

p1/p2
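
The monthly bucket `dt` in the pipelines above is built by pasting the year and month together and re-parsing the string. As a minimal sketch, using made-up dates rather than the dataset, `lubridate::floor_date()` should give the same buckets in a single call:

```r
library(lubridate)

# Hypothetical publication dates (not taken from the dataset)
date_published <- as.Date(c("2015-06-17", "2015-06-30", "2015-07-01"))

# Approach used in the pipelines above: paste year and month, then re-parse
dt_paste <- ym(paste0(year(date_published), "-", month(date_published)))

# Alternative: round each date down to the first day of its month
dt_floor <- floor_date(date_published, unit = "month")

# Both approaches assign every date to the same monthly bucket
all(dt_paste == dt_floor)
```

Either form works for the plots; `floor_date()` just avoids the round trip through a character string.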

Titles & Description

Interestingly, titles have settled at a median length of 7, while the mean and median description lengths have been increasing monotonically since 2015. It looks like authors are putting more effort into describing their packages in order to attract R users.

Code
ov_dt %>% 
  select(date_published, dt, y = len_title) %>% 
  timetk::pad_by_time(date_published, .pad_value = 0) %>% 
  mutate(dt = lubridate::ym(paste0(lubridate::year(date_published), "-", lubridate::month(date_published)))) %>% 
  group_by(dt) %>%
  mutate(median_y = median(y))%>%
  plot_dotplot_ts(
    title = "Title Lengths",
    dday = 90
  ) -> p1

ov_dt %>% 
  select(date_published, dt, y = len_desc) %>% 
  timetk::pad_by_time(date_published, .pad_value = 0) %>% 
  mutate(dt = lubridate::ym(paste0(lubridate::year(date_published), "-", lubridate::month(date_published)))) %>% 
  group_by(dt) %>%
  mutate(median_y = median(y),
         mean_y = mean(y))%>%
  plot_dotplot_ts(
    title = "Description Lengths",
    dday = 80,
    plotmean = TRUE
  ) -> p2

p1/p2

Another interesting way to look at the same data is with ridge plots from {ggridges}. It’s easy to see the large spread of description lengths and how it has been increasing over time.

Code
deps <- ov_dt %>%
  select(year,
         `Description Length` = len_desc,
         `Title Length` = len_title) %>%
  arrange(-year) %>%
  filter(!is.na(year), year > 2008) %>%
  mutate(year = factor(year, levels = seq(2008, 2022)))
deps %>%
  pivot_longer(-year) %>%
  ggplot(aes(y = year, x = value, fill = name)) +
  stat_density_ridges(
    bandwidth = 4,
    scale = .95,
    quantile_lines = TRUE,
    quantiles = 2,
    alpha = 0.7,
    rel_min_height = 0.01
  ) +
  scale_x_continuous(limits = c(0, 200), expand = c(0, 0)) +
  coord_cartesian(clip = "off") +
  theme_ridges(center = TRUE) +
  theme(legend.position = "top",
        legend.title = element_blank()) +
  labs(
    title = "Distribution of Description & Title Lengths since 2009",
    x = NULL,
    y = NULL
  )

Licenses

Q: What license is most used? Has there been a change over time?

While bar charts are certainly the go-to for comparing frequencies across categories, bubble charts are a fun way to visualize them. GPL licenses are clearly the most popular license category in use today.

Code
plot_bubbles <- function(dat,
                         .scale,
                         plot_radius,
                         bubble_radius,
                         textsize,
                         alpha,
                         color,
                         maxiter) {
  .qty <- nrow(dat)
  
  theta <- seq(0, 360, length.out = .qty + 1)
  
  dat$x <- plot_radius * cos(theta * pi / 180)[-1]
  dat$y <- plot_radius * sin(theta * pi / 180)[-1]
  dat$n_scaled <- dat$n / .scale
  
  xpack <- rep(dat$x, times = dat$n_scaled)
  ypack <- rep(dat$y, times = dat$n_scaled)
  
  coords <- tibble(
    x = xpack + runif(length(xpack)),
    y = ypack + runif(length(ypack)),
    r = bubble_radius,
    label = rep(dat$label, times = dat$n_scaled)
  )
  
  packed_coords <-
    circleRepelLayout(coords, sizetype = "r", maxiter = maxiter)
  
  packed_coords$layout %>%
    ggplot(aes(x, y)) +
    geom_point(aes(size = radius, color = coords$label), alpha = alpha) +
    coord_equal() +
    theme_minimal() +
    theme(
      legend.position = "none",
      panel.grid = element_blank(),
      axis.title = element_blank(),
      axis.text = element_blank()
    ) +
    geom_text(
      aes(
        x = x,
        y = y,
        label = label
      ),
      color = color,
      size = textsize,
      fontface = "bold",
      data = dat,
      hjust = "center",
      vjust = "center"
    )
}

set.seed(31415)
ov_dt %>%
  count(license_cleaned) %>%
  top_n(8, n) %>%
  mutate(label = sprintf(
    "%s\n%s",
    license_cleaned,
    scales::label_comma(suffix = " Pkgs")(n)
  )) %>%
  arrange(runif(1:n())) %>%
  plot_bubbles(
    .scale = 50,
    plot_radius = 8,
    bubble_radius = .32,
    textsize = 8,
    alpha = 0.4,
    color = "gray30",
    maxiter = 1000
  )
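
To make the layout logic in `plot_bubbles()` easier to follow, here is a base-R sketch of its seeding step only, using made-up counts. Each category gets a centre at an evenly spaced angle on a circle, and that centre is repeated once per bubble (`n / .scale`) before `circleRepelLayout()` pushes the bubbles apart. The names `toy`, `scale_by` and `seeds` are illustrative and not part of the original function.

```r
# Hypothetical counts for three license categories (not from the dataset)
toy <- data.frame(label = c("A", "B", "C"), n = c(300, 150, 50))
scale_by <- 50      # one bubble per 50 packages, mirroring .scale = 50
plot_radius <- 8

# Evenly spaced angles; drop the first (0 degrees) since it coincides with 360
theta <- seq(0, 360, length.out = nrow(toy) + 1)[-1]
toy$x <- plot_radius * cos(theta * pi / 180)
toy$y <- plot_radius * sin(theta * pi / 180)

# Bubbles per category; floor() makes explicit the truncation rep() would do
toy$bubbles <- floor(toy$n / scale_by)

# One seed row per bubble, all starting at their category's centre
seeds <- toy[rep(seq_len(nrow(toy)), times = toy$bubbles), c("label", "x", "y")]
table(seeds$label)
```

Because of the truncation, a category with fewer than `.scale` packages gets no bubbles at all, which is worth keeping in mind when choosing `.scale`.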

While those bubble charts are nice to look at, a standard-issue bar chart is the clear winner for visualizing the relative sizes of the categories, especially the steep drop-off after the top 3 license types.

Code
ov_dt %>% 
  count(license_cleaned) %>% 
  ggplot(aes(x = forcats::fct_reorder(license_cleaned, n), y = n, fill = license_cleaned)) +
  geom_col() +
  coord_flip() +
  theme_minimal() +
  guides(fill = "none") +
  labs(x = NULL, y = NULL, title = "Which license is the most popular?")

GPL-2, GPL-3 and MIT licenses seem to have the same adoption rate over time. Again, we see a sudden shift in the number of data points available after mid-2015.

Code
ov_dt %>% count(license_cleaned) %>% arrange(-n) -> lic_n
ov_dt %>% 
  group_by(dt) %>% 
  count(license_cleaned) %>% 
  mutate(license_cleaned = factor(license_cleaned, levels = lic_n$license_cleaned)) %>% 
  ggplot(aes(x= dt, y = n, color = license_cleaned)) +
  geom_smooth(span = 0.4, se = FALSE, show.legend = FALSE, size = .3) +
  geom_jitter(alpha = 0.8) +
  theme_light() +
  theme(
    legend.title = element_blank()
  ) +
  labs(
    x = NULL,
    y = "Count",
    title = "Adoption of Licenses"
  )

A striking way to visualize time-series data is with streamgraphs. The {streamgraph} package is an R interface to the D3 library for making these beautiful plots. While the previous time-series plot is minimal and makes it easy to spot a change in trend or compare groups, this streamgraph is just a joy to look at.

Here, I’m plotting the series from 2015 to Jun-2022.

Code
ov_dt %>% 
  group_by(dt) %>% 
  count(license_cleaned) %>%
  ungroup() %>%
  timetk::pad_by_time(.date_var = dt, .by = "month", .pad_value = 0) %>%
  group_by(dt) %>%
  mutate(license_cleaned = factor(license_cleaned, levels = lic_n$license_cleaned),
         val = sum(n),
         pc = n / val) -> pdat

pdat %>% 
  filter(dt > "2015-01-01", dt < "2022-06-30") %>% 
  streamgraph(license_cleaned, n, dt)  %>%
  sg_axis_x(1, "year", "%Y") %>%
  sg_legend(show=TRUE, label="License: ") %>%
  sg_fill_brewer("Spectral") %>%
  sg_title("License Adoption since 2015")

URL Domain

Q: Which domains do packages use for their website?

GitHub is definitely the preferred choice for a majority of the packages, followed by ROpenSci and CRAN-RProject. Note: here, I’m grouping all the remaining levels with counts below 50 into ‘Other’.

Code
ov_dt %>% 
  filter(url_domain != "") %>% 
  mutate(url_domain = forcats::fct_lump_min(url_domain, 50)) %>% 
  count(url_domain) %>% 
  mutate(label = sprintf("%s\n%s", url_domain, scales::label_comma()(n))) %>% 
  arrange(runif(1:n())) %>% 
  plot_bubbles(
    .scale = 50,
    plot_radius = 5,
    bubble_radius = .3,
    textsize = 6,
    alpha = 0.4,
    color = "gray30",
    maxiter = 1300
  )
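
The lumping step used above relies on `forcats::fct_lump_min()`, which collapses every level that appears fewer than `min` times into a single "Other" level. A small sketch on made-up domains (not the dataset) shows the effect:

```r
library(forcats)

# Hypothetical domains: "github.com" x5, "r-forge" x2, "gitlab.com" x1
domains <- factor(c(rep("github.com", 5), rep("r-forge", 2), "gitlab.com"))

# Levels appearing fewer than `min` times are collapsed into "Other"
lumped <- fct_lump_min(domains, min = 3)
table(lumped)
```

In the EDA the threshold is 50 for website domains and 20 for bug-reporting domains, so the same call keeps only the frequently used domains as separate bubbles.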

A similar story when seen over time.

Code
ov_dt %>%
  filter(url_domain != "") %>%
  mutate(url_domain = forcats::fct_lump_min(url_domain, 50)) %>%
  group_by(dt) %>%
  count(url_domain) %>%
  ggplot(aes(x = dt, y = n, color = url_domain)) +
  geom_smooth(
    span = 0.4,
    se = FALSE,
    show.legend = FALSE,
    size = .3
  ) +
  geom_jitter(alpha = 0.8) +
  theme_light() +
  theme(legend.title = element_blank()) +
  labs(x = NULL,
       y = "Count",
       title = "Domains used for Websites")

Bug Reporting Domain

Q: Which domains do packages use for reporting bugs and filing feature requests? How do these vary over time?

This one’s not even close. GitHub is the dominant player by a long shot.

Code
ov_dt %>% 
  filter(bug_domain != "") %>% 
  mutate(bug_domain = forcats::fct_lump_min(bug_domain, 20)) %>% 
  count(bug_domain) %>% 
  mutate(label = sprintf("%s\n%s", bug_domain, scales::label_comma()(n))) %>% 
  arrange(runif(1:n())) %>% 
  plot_bubbles(
    .scale = 35,
    plot_radius = 5,
    bubble_radius = .3,
    textsize = 6,
    alpha = 0.4,
    color = "gray30",
    maxiter = 1300
  )

And it has been from the start.

Code
ov_dt %>%
  filter(bug_domain != "") %>%
  mutate(bug_domain = forcats::fct_lump_min(bug_domain, 20)) %>%
  group_by(dt) %>%
  count(bug_domain) %>%
  ggplot(aes(x = dt, y = n, color = bug_domain)) +
  geom_smooth(
    span = 0.4,
    se = FALSE,
    show.legend = FALSE,
    size = .3
  ) +
  geom_jitter(alpha = 0.8) +
  theme_light() +
  theme(legend.title = element_blank()) +
  labs(x = NULL,
       y = "Count",
       title = "Domains used for Bug Reporting")


That’s it for now. I might update this EDA as I find time and think of more questions to explore.

Thanks for reading!

