Better software, better research

IMAS, Hobart, Tasmania

Nicholas Tierney

Telethon Kids Institute

2024-01-31

The story so far

  • 2008 - 2012: Undergraduate + honours in Psychology

  • 2013 - 2017: PhD Statistics, QUT

    • Exploratory Data Analysis (EDA)
    • Bayesian / Geospatial statistics / Optimal placement
  • 2018 - 2020: Research Fellow / Lecturer at Monash

    • Design and improve tools for EDA
  • 2020 - Now: Research Software Engineer @ Telethon Kids Institute

    • Maintain and design tools for data analysis

Define: Research Software Engineer

A Research Software Engineer (RSE) combines professional software engineering expertise with an intimate understanding of research.

– (from https://society-rse.org/about/)

What sorts of things does an RSE do?

  • Create software to solve research problems

  • Develop tools that abstract the right components to facilitate research

  • Help researchers to find and learn good tools

  • Support researchers with (computational) reproducibility

  • (adapted from Heidi Seibold’s UseR2021 Keynote talk)

visdat::vis_dat(airquality)

naniar::gg_miss_var(airquality)

naniar::gg_miss_var(airquality, facet = Month)

naniar::gg_miss_upset(airquality)

brolgar - take spaghetti

brolgar - spread spaghetti

brolgar - identify spaghetti
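The three brolgar slides above (take, spread, identify spaghetti) can be sketched in R. This is a minimal sketch, assuming brolgar's built-in heights data and the sample_n_keys(), facet_sample(), and keys_near() helpers from the package documentation; argument names should be checked against the current release:

```r
library(brolgar)  # longitudinal data tools; ships the `heights` dataset
library(ggplot2)

# take spaghetti: sample a handful of series instead of plotting all of them
ggplot(sample_n_keys(heights, size = 10),
       aes(x = year, y = height_cm, group = country)) +
  geom_line()

# spread spaghetti: small multiples, a few series per facet
ggplot(heights, aes(x = year, y = height_cm, group = country)) +
  geom_line() +
  facet_sample(n_per_facet = 3, n_facets = 9)

# identify spaghetti: find the series nearest to summary statistics
# (e.g. closest to the five-number summary of height_cm)
keys_near(heights, var = height_cm)
```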

maxcovr - cover facilities

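The maxcovr slide above is about the maximal covering location problem: pick new facility sites so that as many users as possible fall within a distance cutoff of some facility. A hedged sketch of the max_coverage() call, assuming the york listed-buildings and york_crime data that ship with the package, and splitting candidate sites by building grade as in the package README (treat the split as an illustrative assumption):

```r
library(maxcovr)
library(dplyr)

# assumed split: grade I buildings host existing facilities,
# the rest are candidate sites for new ones
york_selected   <- york %>% filter(grade == "I")
york_unselected <- york %>% filter(grade != "I")

mc_result <- max_coverage(
  existing_facility = york_selected,   # current facility locations
  proposed_facility = york_unselected, # candidate sites
  user              = york_crime,      # points we want covered
  distance_cutoff   = 100,             # coverage radius in metres
  n_added           = 20               # how many new sites to choose
)

summary(mc_result)
```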

greta is R code

stan

data {
  real alpha;
  real beta;
  real<lower=0> sigma2;
  int<lower=0> J;
  array[J] int y;
  vector[J] Z;
  array[J] int n;
}
transformed data {
  real<lower=0> sigma;
  sigma = sqrt(sigma2);
}
parameters {
  real theta1;
  real theta2;
  vector[J] X;
}
model {
  array[J] real p;
  theta1 ~ normal(0, 32); // 32^2 = 1024 
  theta2 ~ normal(0, 32);
  X ~ normal(alpha + beta * Z, sigma);
  y ~ binomial_logit(n, theta1 + theta2 * X);
}

JAGS

for(j in 1 : J) {
   y[j] ~ dbin(p[j], n[j])
   logit(p[j]) <- theta[1] + theta[2] * X[j]
   X[j] ~ dnorm(mu[j], tau)
   mu[j] <- alpha + beta * Z[j]
}
theta[1] ~ dnorm(0.0, 0.001)
theta[2] ~ dnorm(0.0, 0.001)

greta

theta <- normal(0, 32, dim = 2)
mu <- alpha + beta * Z
X <- normal(mu, sigma)
p <- ilogit(theta[1] + theta[2] * X)
distribution(y) <- binomial(n, p)

google tensorflow

  • automatic differentiation
  • efficient linear algebra
  • highly parallel

extendable

greta.gp

greta.gp extends greta so you can define Gaussian processes as part of your model. It provides a syntax for creating and combining GP kernels, and for using them to define either full-rank or sparse Gaussian processes.

# kernel & GP
kernel <- rbf(rbf_len, rbf_var) + 
            bias(1)
f <- gp(x, kernel)
# likelihood
distribution(y) <- normal(f, obs_sd)
# prediction
f_plot <- project(f, x_plot)

why ‘greta’ ?

Grete Hermann (1901 - 1984)

wrote the first algorithms for computer algebra

… without a computer

(The package is spelled greta, rather than grete, so people don’t pronounce it ‘greet’)

What greta looks like

\[ \alpha \sim Normal(0, 5) \]

\[ \beta \sim Normal(0, 3) \]

\[ \sigma \sim logNormal(0, 5) \]

\[ \mu = \alpha + \beta X \]

\[ Y \sim Normal(\mu, \sigma) \]

x <- penguins$bill_length_mm
y <- penguins$flipper_length_mm
alpha <- normal(0, 5)
beta <- normal(0, 3)
sd <- lognormal(0, 5)
mu <- alpha + beta * x
distribution(y) <- normal(mu, sd)
m <- model(alpha, beta, sd)
draws <- mcmc(m)

IDEM group

Research pipelines

A well-intentioned set of ordered scripts:

  • 00-setup.R
  • 01-read.R
  • 02-clean.R
  • 03-model.R

Can turn into a headache:

  • 00-setup.R
  • 01-read.R
  • 02-clean.R
  • 03-model.R
  • 03-model2A.R
  • 05-results.R
  • Did I run model already?
  • I made changes in “model2”, but do I need to re-run EVERYTHING?
  • Can I save some time and skip parts?

{targets} asks you to embrace a functional approach

clean_storms <- function(storms_raw) {
  storms_raw %>% 
    filter(year >= 2000)
}

model_storms <- function(storms_tidy) {
  lm(wind ~ pressure + year + month + lat + long,
     data = storms_tidy)
}

{targets} asks you to embrace a functional approach

source("./packages.R")
tar_source()

tar_plan(
  tar_file(storms_file, "data/storms.csv"),
  storms_raw = read_csv(storms_file),
  storms_tidy = clean_storms(storms_raw),
  storms_model = model_storms(storms_tidy),
  model_summary = summary(storms_model),
)
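With the plan above saved as _targets.R, running and inspecting the pipeline is a few calls. tar_make() rebuilds only the targets whose upstream code or data changed, which answers the “do I need to re-run EVERYTHING?” question directly:

```r
library(targets)

tar_make()               # build only the targets that are outdated
tar_outdated()           # which targets would rebuild if tar_make() ran now?
tar_read(model_summary)  # load a finished target into the current session
tar_visnetwork()         # interactive graph of the pipeline's dependencies
```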

Research pipelines: Let’s iterate together?

https://github.com/njtierney/targets-storms

General R Question time

Take homes

  • functions are a way to express intent
  • targets can be used to save on iteration pain
  • software can be really useful

Thanks

  • Nick Golding
  • Saras Windecker
  • Di Cook
  • Miles McBain
  • Mike Sumner
  • Nicole Hill

Resources

  • visdat: github.com/ropensci/visdat
  • naniar: github.com/njtierney/naniar
  • brolgar: github.com/njtierney/brolgar
  • maxcovr: github.com/njtierney/maxcovr
  • greta: github.com/greta-dev/greta
  • targets: github.com/ropensci/targets

Colophon

End.