Better software, better research

IMAS, Hobart, Tasmania

Nicholas Tierney

Telethon Kids Institute

2024-01-31

The story so far

  • 2008 - 2012: Undergraduate + honours in Psychology

  • 2013 - 2017: PhD Statistics, QUT

    • Exploratory Data Analysis (EDA)
    • Bayesian / Geospatial statistics / Optimal placement
  • 2018 - 2020: Research Fellow / Lecturer at Monash

    • Design and improve tools for EDA
  • 2020 - Now: Research Software Engineer @ Telethon Kids Institute

    • Maintain and design tools for data analysis

Define: Research Software Engineer

A Research Software Engineer (RSE) combines professional software engineering expertise with an intimate understanding of research.

– (from https://society-rse.org/about/)

What sorts of things does an RSE do?

  • Create software to solve research problems

  • Develop tools that abstract the right components to facilitate research

  • Help researchers to find and learn good tools

  • Support researchers with (computational) reproducibility

  • (adapted from Heidi Seibold’s UseR2021 Keynote talk)

visdat::vis_dat(airquality)

naniar::gg_miss_var(airquality)

naniar::gg_miss_var(airquality, facet = Month)

naniar::gg_miss_upset(airquality)

brolgar - take spaghetti

brolgar - spread spaghetti

brolgar - identify spaghetti
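The three brolgar slides above (take, spread, identify spaghetti) can be sketched in R. This is a minimal sketch, assuming brolgar's built-in heights data and the sample_n_keys(), facet_sample(), and keys_near() helpers from the package documentation; argument names should be checked against the current release:

```r
library(brolgar)  # longitudinal data tools; ships the `heights` dataset
library(ggplot2)

# take spaghetti: sample a handful of series instead of plotting all of them
ggplot(sample_n_keys(heights, size = 10),
       aes(x = year, y = height_cm, group = country)) +
  geom_line()

# spread spaghetti: small multiples, a few series per facet
ggplot(heights, aes(x = year, y = height_cm, group = country)) +
  geom_line() +
  facet_sample(n_per_facet = 3, n_facets = 9)

# identify spaghetti: find the series nearest to summary statistics
# (e.g. closest to the five-number summary of height_cm)
keys_near(heights, var = height_cm)
```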

maxcovr - cover facilities

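The maxcovr slide above is about the maximal covering location problem: pick new facility sites so that as many users as possible fall within a distance cutoff of some facility. A hedged sketch of the max_coverage() call, assuming the york listed-buildings and york_crime data that ship with the package, and splitting candidate sites by building grade as in the package README (treat the split as an illustrative assumption):

```r
library(maxcovr)
library(dplyr)

# assumed split: grade I buildings host existing facilities,
# the rest are candidate sites for new ones
york_selected   <- york %>% filter(grade == "I")
york_unselected <- york %>% filter(grade != "I")

mc_result <- max_coverage(
  existing_facility = york_selected,   # current facility locations
  proposed_facility = york_unselected, # candidate sites
  user              = york_crime,      # points we want covered
  distance_cutoff   = 100,             # coverage radius in metres
  n_added           = 20               # how many new sites to choose
)

summary(mc_result)
```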

greta is R code

stan

data {
  real alpha;
  real beta;
  real<lower=0> sigma2;
  int<lower=0> J;
  array[J] int y;
  vector[J] Z;
  array[J] int n;
}
transformed data {
  real<lower=0> sigma;
  sigma = sqrt(sigma2);
}
parameters {
  real theta1;
  real theta2;
  vector[J] X;
}
model {
  array[J] real p;
  theta1 ~ normal(0, 32); // 32^2 = 1024 
  theta2 ~ normal(0, 32);
  X ~ normal(alpha + beta * Z, sigma);
  y ~ binomial_logit(n, theta1 + theta2 * X);
}

JAGS

for(j in 1 : J) {
   y[j] ~ dbin(p[j], n[j])
   logit(p[j]) <- theta[1] + theta[2] * X[j]
   X[j] ~ dnorm(mu[j], tau)
   mu[j] <- alpha + beta * Z[j]
}
theta[1] ~ dnorm(0.0, 0.001)
theta[2] ~ dnorm(0.0, 0.001)

greta

theta <- normal(0, 32, dim = 2)
mu <- alpha + beta * Z
X <- normal(mu, sigma)
p <- ilogit(theta[1] + theta[2] * X)
distribution(y) <- binomial(n, p)

google tensorflow

  • automatic differentiation
  • efficient linear algebra
  • highly parallel

extendable

greta.gp

greta.gp extends greta so you can define Gaussian processes as part of your model. It provides a syntax for creating and combining GP kernels, and for using them to define either full-rank or sparse Gaussian processes.

# kernel & GP
kernel <- rbf(rbf_len, rbf_var) + 
            bias(1)
f <- gp(x, kernel)
# likelihood
distribution(y) <- normal(f, obs_sd)
# prediction
f_plot <- project(f, x_plot)

why ‘greta’ ?

Grete Hermann (1901 - 1984)

wrote the first algorithms for computer algebra

… without a computer

(The package is spelled greta, rather than grete, so people don’t pronounce it ‘greet’)

What greta looks like

\[ \alpha \sim Normal(0, 5) \]

\[ \beta \sim Normal(0, 3) \]

\[ \sigma \sim logNormal(0, 5) \]

\[ \mu = \alpha + \beta X \]

\[ Y \sim Normal(\mu, \sigma) \]

x <- penguins$bill_length_mm
y <- penguins$flipper_length_mm
alpha <- normal(0, 5)
beta <- normal(0, 3)
sd <- lognormal(0, 5)
mu <- alpha + beta * x
distribution(y) <- normal(mu, sd)
m <- model(alpha, beta, sd)
draws <- mcmc(m)

IDEM group

Research pipelines

A well-intentioned set of ordered scripts:

  • 00-setup.R
  • 01-read.R
  • 02-clean.R
  • 03-model.R

Can turn into a headache:

  • 00-setup.R
  • 01-read.R
  • 02-clean.R
  • 03-model.R
  • 03-model2A.R
  • 05-results.R
  • Did I run model already?
  • I made changes in “model2”, but do I need to re-run EVERYTHING?
  • Can I save some time and skip parts?

{targets} asks you to embrace a functional approach

clean_storms <- function(storms_raw) {
  storms_raw %>% 
    filter(year >= 2000)
}

model_storms <- function(storms_tidy) {
  lm(wind ~ pressure + year + month + lat + long,
     data = storms_tidy)
}

{targets} asks you to embrace a functional approach

source("./packages.R")
tar_source()

tar_plan(
  tar_file(storms_file, "data/storms.csv"),
  storms_raw = read_csv(storms_file),
  storms_tidy = clean_storms(storms_raw),
  storms_model = model_storms(storms_tidy),
  model_summary = summary(storms_model),
)
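With the plan above saved as _targets.R, running and inspecting the pipeline is a few calls. tar_make() rebuilds only the targets whose upstream code or data changed, which answers the “do I need to re-run EVERYTHING?” question directly:

```r
library(targets)

tar_make()               # build only the targets that are outdated
tar_outdated()           # which targets would rebuild if tar_make() ran now?
tar_read(model_summary)  # load a finished target into the current session
tar_visnetwork()         # interactive graph of the pipeline's dependencies
```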

Research pipelines: Let’s iterate together?

https://github.com/njtierney/targets-storms

General R Question time

Take homes

  • functions are a way to express intent
  • targets can be used to save on iteration pain
  • software can be really useful

Thanks

  • Nick Golding
  • Saras Windecker
  • Di Cook
  • Miles McBain
  • Mike Sumner
  • Nicole Hill

Resources

  • visdat: github.com/ropensci/visdat
  • naniar: github.com/njtierney/naniar
  • brolgar: github.com/njtierney/brolgar
  • maxcovr: github.com/njtierney/maxcovr
  • greta: github.com/greta-dev/greta
  • targets: github.com/ropensci/targets

Colophon

End.