Docker based RStudio & PostgreSQL

This is part one of the two part post related to Docker, PostgreSQL databases and Anomaly data-sets.

Background

In recent LinkedIn posts (original and Rami’s repost) and tweet, I asked the internet for their favorite datasets for anomaly detection problems, particularly in the time-series domain. I got lots of responses, and now have a massive amount of data to play with, thank you folks who responded.

To play with these data, I wanted a better solution than keeping data as CSV files, qs objects, or in R Packages. I wanted a database to store the raw input data, my processed data and algorithm results. While I’ve setup databases in the past the traditional way, this time I wanted the entire codebase to be portable & reproducible.

As a result, I setup everything using Docker and it works seemlessly. If you’d like to learn how to do so, follow along my next posts:

  • Part I - (this post) Will teach you how to setup a simple reproducible Docker based workflow for a personal PostgreSQL Database + RStudio for code development
  • Part II - (next post) Will be the ETL pipelines for anomaly data

Why should you read this?

At the end of this tutorial, you’ll be able to rapidly setup a Docker based personal1 PostgreSQL database. You will learn how to quickly deploy PostgreSQL & RStudio using docker-compose. You will be able to access the database in R and begin development immediately. And most importantly, the whole process will be fully reproducible as you inherit the benefits of setup scripts and Docker.

This tutorial assumes you’re familiar with Docker and RStudio in a Docker environment. If you’re not, I recommend reading Reproducible Work in R first.

Overview

You will be launching two Docker images:

  1. A PostgreSQL image. I choose postgres:13.3
  2. An RStudio image. I choose hatmatrix/blog:base 2

To permanently store your data beyond the life of the containers, you will mount two volumes, one for each container. I chose:

  1. For PostgreSQL: $HOME/docker/volumes/postgres
  2. For R Projects: $HOME/github

3

None of these paths, except for the in-container PostgreSQL are special; you can customize the others to your liking. By default, postgres:13.3 expects the database to be at /var/lib/postgresql/data. If you choose another database, modify this accordingly.

I use docker-compose to launch both PostgreSQL and RStudio services together. It’s convenient while also ensuring the PostgreSQL service runs first followed by RStudio. It’s easy to start or stop all the services using just a few commands.

First Time Setup

You need to run these steps the first time you’re setting up the PostgreSQL database. I’ve stored these steps in 00-postgres-init.sh.

1 - Directory Setup

You need a local directory to store the PostgreSQL database in. Lines 3-10 take care of this for you.

# create directory if does not exist
if [ -d "$HOME/docker/volumes/postgres" ] 
then
    echo "Directory $HOME/docker/volumes/postgres exists." 
else
    echo "Error: Directory $HOME/docker/volumes/postgres does not exists."
    mkdir -p $HOME/docker/volumes/postgres
fi

2 - PostgreSQL Setup

Now it’s time to setup the database. You need two steps at a minimum to get started:

  1. A new ‘role’ (akin to a login) with rights to create new databases.
  2. At least one database to work in. In my script, I’m making two: work and anomaly.

To manipulate the database, you need a PostgreSQL server running to process the psql commands. You’ll launch one using docker run. You need the correct volume mounted using -v. Next, we create the role and databases by piping psql commands into docker exec. Then, we stop the container.

# launch the postgres image called 'post_setup',
# attach it to the local volume
docker run --rm --name post_setup \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_PASSWORD=docker \
  -d \
  -p 5432:5432 \
  -v $HOME/docker/volumes/postgres:/var/lib/postgresql/data \
  postgres:13.3

# create a new role, and two databases
echo "CREATE ROLE rahul WITH PASSWORD 'pass' CREATEDB LOGIN;
CREATE DATABASE work; CREATE DATABASE anomaly;" | \
 docker exec -i post_setup \
 psql -U postgres

# stop the docker container
docker stop post_setup

In summary, now I have a PostgresSQL database:

  • stored at $HOME/docker/volumes/postgres
  • with a new role rahul and password pass
  • with 2 databases: work and anomaly

Daily Workflow

tldr: How do you get going?

  1. Store docker-compose.yml in a local directory
  2. Modify it if you’ve changed my chosen images/directories
  3. In shell, run docker-compose up -d

protip: to launch a browser (firefox for me) directly into RStudio as well, run this command in the directory where you have docker-compose.yml:

docker-compose up -d; firefox localhost:8787

pro-protip: save an alias and generalize the command. The -f arg instructs docker-compose which file you’d like to use. Now engage can be run from anywhere in the system.

alias engage='docker-compose \
            -f $HOME/github/docker/docker-compose.yml \
            up -d; firefox localhost:8787' 

The breakdown

What’s in docker-compose.yml? We’re creating two services, one called db and the other rstudio.4

Let’s look at db first. Most of the arguments will look familiar if you’re familiar with docker run args. What’s new here is the restart: unless-stopped arg which tells Docker to only start PostgreSQL if it’s currently stopped.

version: "3.3"
services:
  db:
    image: postgres:13.3
    restart: unless-stopped
    environment:
      POSTGRES_DB: "anomaly"
      POSTGRES_USER: "rahul"
      POSTGRES_PASSWORD: "pass"
    ports:
      - "5432:5432"
    volumes:
      - type: "bind"
        source: "$HOME/docker/volumes/postgres"
        target: "/var/lib/postgresql/data"

The 2nd service is rstudio. Apart from the typical args, the interesting arg here is depends_on which tells Docker to only run this image after the database is up and running. Fantastic!

  rstudio:
    image: hatmatrix/blog:base
    ports:
      - "8787:8787"
      - "3838:3838"
    environment:
      DISABLE_AUTH: "true"
    volumes:
      - type: "bind"
        source: "$HOME/github"
        target: "/home/rstudio"
    depends_on:
      - "db"

Connecting via R

Use postgres.R to test your connection. Run your DBI:: commands you would normally, except for one key difference.

While making the connection, make sure the name of the host is the name of the database service you’ve chosen in docker-compose.yml. (Outside docker, you would have typically used localhost to connect to a local PostgreSQL server).

con <- DBI::dbConnect(
  drv = RPostgres::Postgres(),
  dbname = "anomaly",
  host = "db", # this needs to be the name of the postgres service
               # (line 3 in docker-compose.yml)
  user = "rahul",
  password = "pass",
  port = 5432
)

That’s it! You’re off to the races now. Use the DB as you normally would using {DBI}.

con %>% DBI::dbListTables()
con %>% dplyr::tbl("table_name")

To Stop Services

You have two options here:

  1. docker-compose stop will stop the services, which you can restart using docker-compose start.
  2. docker-compose down will and remove containers as well. Run docker-compose up to get going once again.

  1. I harp on ‘personal’ given I’m not setting up appropriate roles, auth etc needed in a work environment. But, it’s good enough for my personal use.↩︎

  2. This is my own image based off of rocker/rstudio↩︎

  3. DB icon by Pixel perfect
    Folder icon by Icongeek26
    RAM icon by Freepik
    Storage icon by Smashicons
    ↩︎
  4. These are just labels, you can call them what you’d like↩︎