Temp | Solar.R |
---|---|
67 | 190 |
72 | 118 |
74 | 149 |
62 | 313 |
56 | NA |
66 | NA |
Curtin University
2025-04-15
# Temp values when Solar.R is missing (x) and when it is observed (y)
x <- na.omit(airquality$Temp[which(is.na(airquality$Solar.R))])
y <- na.omit(airquality$Temp[which(!is.na(airquality$Solar.R))])
# sample standard deviation and standard error of x
x_mean_diff <- (x - (sum(x) / length(x)))^2
sum_x_mean_diff <- sum(x_mean_diff)
s_x <- sqrt((1 / (length(x) - 1)) * sum_x_mean_diff)
se_x <- s_x / sqrt(length(x))
# sample standard deviation and standard error of y
y_mean_diff <- (y - (sum(y) / length(y)))^2
sum_y_mean_diff <- sum(y_mean_diff)
s_y <- sqrt((1 / (length(y) - 1)) * sum_y_mean_diff)
se_y <- s_y / sqrt(length(y))
# Welch t statistic: difference in means over the combined standard error
numerator <- mean(x) - mean(y)
denominator <- sqrt(se_x^2 + se_y^2)
result <- numerator / denominator
var_missing <- airquality$Solar.R
var_interest <- airquality$Temp
which_missing <- which(is.na(var_missing))
which_complete <- which(!is.na(var_missing))
var_interest_miss <- var_interest[which_missing]
var_interest_complete <- var_interest[which_complete]
result <- t.test(var_interest_miss, y = var_interest_complete)
result
Welch Two Sample t-test
data: var_interest_miss and var_interest_complete
t = -0.98706, df = 6.2689, p-value = 0.3602
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-17.669258 7.436381
sample estimates:
mean of x mean of y
73.00000 78.11644
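As a sanity check on the two approaches above, the hand-rolled statistic can be compared with what `t.test()` reports. This is only a sketch: `welch_t()` is a hypothetical helper name introduced here, not something from the talk, and it assumes the built-in `airquality` data.

```r
# Temp values split by whether Solar.R is missing (airquality ships with R)
temp_miss     <- airquality$Temp[is.na(airquality$Solar.R)]
temp_complete <- airquality$Temp[!is.na(airquality$Solar.R)]

# Hypothetical helper: Welch t statistic, i.e. the difference in means
# divided by the square root of the summed squared standard errors
welch_t <- function(x, y) {
  se2_x <- var(x) / length(x)
  se2_y <- var(y) / length(y)
  (mean(x) - mean(y)) / sqrt(se2_x + se2_y)
}

by_hand  <- welch_t(temp_miss, temp_complete)
built_in <- unname(t.test(temp_miss, temp_complete)$statistic)

# TRUE when the two calculations agree
all.equal(by_hand, built_in)
```

Writing the calculation once, as a small function, also makes it much easier to spot a slip such as a mistyped variable name.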
var_missing <- airquality$Ozone
var_interest <- airquality$Wind
which_missing <- which(is.na(var_missing))
which_complete <- which(!is.na(var_missing))
var_interest_miss <- var_interest[which_missing]
var_interest_complete <- var_interest[which_complete]
result_ozone_wind <- t.test(var_interest_miss, y = var_interest_complete)
result_ozone_wind
Welch Two Sample t-test
data: var_interest_miss and var_interest_complete
t = 0.60911, df = 63.646, p-value = 0.5446
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.8999377 1.6893132
sample estimates:
mean of x mean of y
10.256757 9.862069
var_missing <- airquality$Ozone
var_interest <- airquality$Wind
which_missing <- which(is.na(var_missing))
which_complete <- which(!is.na(var_missing))
var_interest_miss <- var_interest[which_missing]
var_interest_complete <- var_interest[which_complete]
result_ozone_wind <- t.test(var_interest_miss, y = var_interest_complete)
# Do I just, like, DO IT ALL AGAIN??
var_missing <- airquality$Solar.R
var_interest <- airquality$Temp
which_missing <- which(is.na(var_missing))
which_complete <- which(!is.na(var_missing))
var_interest_miss <- var_interest[which_missing]
var_interest_complete <- var_interest[which_complete]
result_solar_temp <- t.test(var_interest_miss, y = var_interest_complete)
missingness_impact <- function(when_missing, is_different){
  when_missing_index <- which(is.na(when_missing))
  when_complete_index <- which(!is.na(when_missing))
  is_different_miss <- is_different[when_missing_index]
  is_different_complete <- is_different[when_complete_index]
  result <- t.test(is_different_miss, y = is_different_complete)
  result
}
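The calls that produced the next two printed results are not shown in the source. Judging by the reported group means, they are consistent with calls along the following lines; treat the argument choices as an inference, not as the author's exact code. The object names `solar_temp` and `ozone_temp` are made up for this sketch.

```r
# Same definition as above, repeated so the sketch runs on its own
missingness_impact <- function(when_missing, is_different){
  when_missing_index <- which(is.na(when_missing))
  when_complete_index <- which(!is.na(when_missing))
  is_different_miss <- is_different[when_missing_index]
  is_different_complete <- is_different[when_complete_index]
  result <- t.test(is_different_miss, y = is_different_complete)
  result
}

# Is Temp different when Solar.R is missing?
solar_temp <- missingness_impact(airquality$Solar.R, airquality$Temp)
# Is Temp different when Ozone is missing?
ozone_temp <- missingness_impact(airquality$Ozone, airquality$Temp)

solar_temp
ozone_temp
```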
Welch Two Sample t-test
data: is_different_miss and is_different_complete
t = -0.98706, df = 6.2689, p-value = 0.3602
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-17.669258 7.436381
sample estimates:
mean of x mean of y
73.00000 78.11644
Welch Two Sample t-test
data: is_different_miss and is_different_complete
t = 0.026831, df = 60.447, p-value = 0.9787
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.546847 3.643306
sample estimates:
mean of x mean of y
77.91892 77.87069
[1] 32
[1] 77
Trivialises functions
Functions should be relevant to you
Functions are expression; they are ideas
Functions manage complexity
If you copy and paste the same code 3 times, write a function
is_different <- airquality$Temp
when_missing_index <- which(is.na(airquality$Ozone))
when_complete_index <- which(!is.na(airquality$Ozone))
is_different_miss <- is_different[when_missing_index]
is_different_complete <- is_different[when_complete_index]
result_ozone_temp <- t.test(is_different_miss, is_different_complete)
is_different <- airquality$Wind
when_missing_index <- which(is.na(airquality$Solar.R))
when_complete_index <- which(!is.na(airquality$Solar.R))
is_different_miss <- is_different[when_missing_index]
is_different_complete <- is_different[when_complete_index]
result_solar_wind <- t.test(is_different_miss, is_different_complete)
If you re-read your code 3 times, write a function
Functions are tools for managing complexity
AKA abstraction, or abstracting away
So, what complexity do we want to manage?
What do we want to abstract away?
is_different <- airquality$Temp
when_missing_index <- which(is.na(airquality$Ozone))
when_complete_index <- which(!is.na(airquality$Ozone))
is_different_miss <- is_different[when_missing_index]
is_different_complete <- is_different[when_complete_index]
result_ozone_temp <- t.test(is_different_miss, is_different_complete)
Start with the bones
What am I interested in?
misstest <- function(Temp, Ozone){
  is_different <- airquality$Temp
  when_missing_index <- which(is.na(airquality$Ozone))
  when_complete_index <- which(!is.na(airquality$Ozone))
  is_different_miss <- is_different[when_missing_index]
  is_different_complete <- is_different[when_complete_index]
  result_ozone_temp <- t.test(is_different_miss, is_different_complete)
}
What do I name things?
misstest <- function(is_different, Ozone){
  # is_different <- airquality$Temp
  when_missing_index <- which(is.na(airquality$Ozone))
  when_complete_index <- which(!is.na(airquality$Ozone))
  is_different_miss <- is_different[when_missing_index]
  is_different_complete <- is_different[when_complete_index]
  result_ozone_temp <- t.test(is_different_miss, is_different_complete)
}
Naming things can be tricky; that’s OK
misstest <- function(is_different, when_missing){
  # is_different <- airquality$Temp
  # when_missing_index <- which(is.na(airquality$Ozone))
  when_missing_index <- which(is.na(when_missing))
  when_complete_index <- which(!is.na(when_missing))
  is_different_miss <- is_different[when_missing_index]
  is_different_complete <- is_different[when_complete_index]
  result_ozone_temp <- t.test(is_different_miss, is_different_complete)
}
We return the last thing
misstest <- function(is_different, when_missing){
  # is_different <- airquality$Temp
  # when_missing_index <- which(is.na(airquality$Ozone))
  when_missing_index <- which(is.na(when_missing))
  when_complete_index <- which(!is.na(when_missing))
  is_different_miss <- is_different[when_missing_index]
  is_different_complete <- is_different[when_complete_index]
  result_ozone_temp <- t.test(is_different_miss, is_different_complete)
  result_ozone_temp
}
Cleaning up old lettuce (removing unused comments)
misstest <- function(is_different, when_missing){
  when_missing_index <- which(is.na(when_missing))
  when_complete_index <- which(!is.na(when_missing))
  is_different_miss <- is_different[when_missing_index]
  is_different_complete <- is_different[when_complete_index]
  result_ozone_temp <- t.test(is_different_miss, is_different_complete)
  result_ozone_temp
}
Name the function something that evokes the action
missingness_impact <- function(is_different, when_missing){
  when_missing_index <- which(is.na(when_missing))
  when_complete_index <- which(!is.na(when_missing))
  is_different_miss <- is_different[when_missing_index]
  is_different_complete <- is_different[when_complete_index]
  result_ozone_temp <- t.test(is_different_miss, is_different_complete)
  result_ozone_temp
}
You need to use the function
missingness_impact(
  when_missing = airquality$Ozone,
  is_different = airquality$Temp
)
Welch Two Sample t-test
data: is_different_miss and is_different_complete
t = 0.026831, df = 60.447, p-value = 0.9787
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.546847 3.643306
sample estimates:
mean of x mean of y
77.91892 77.87069
And write the output to a variable
temp_difference_ozone_missing <- missingness_impact(
  when_missing = airquality$Ozone,
  is_different = airquality$Temp
)
temp_difference_ozone_missing
Welch Two Sample t-test
data: is_different_miss and is_different_complete
t = 0.026831, df = 60.447, p-value = 0.9787
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.546847 3.643306
sample estimates:
mean of x mean of y
77.91892 77.87069
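Once the idea has a name, it can be reused without re-reading the body. The sketch below is not from the talk: it assumes we want to ask the same question of two variables, `Temp` and `Wind`, when `Ozone` is missing. The names `vars`, `ozone_results` and `p_values` are made up for illustration.

```r
# Same definition as above, repeated so the sketch runs on its own
missingness_impact <- function(is_different, when_missing){
  when_missing_index <- which(is.na(when_missing))
  when_complete_index <- which(!is.na(when_missing))
  is_different_miss <- is_different[when_missing_index]
  is_different_complete <- is_different[when_complete_index]
  t.test(is_different_miss, is_different_complete)
}

# Hypothetical follow-up: run the same comparison for several variables
vars <- c("Temp", "Wind")
ozone_results <- lapply(
  airquality[vars],          # each column is passed as is_different
  missingness_impact,
  when_missing = airquality$Ozone
)

# One p-value per variable, named by the variable
p_values <- vapply(ozone_results, function(res) res$p.value, numeric(1))
p_values
```

Storing each result in a list keeps the full test objects around, so estimates and confidence intervals can still be pulled out later.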
(heard via Stat545 functions chapter)
blindly_functionalising <- function(){
  is_different <- airquality$Temp
  when_missing_index <- which(is.na(airquality$Ozone))
  when_complete_index <- which(!is.na(airquality$Ozone))
  is_different_miss <- is_different[when_missing_index]
  is_different_complete <- is_different[when_complete_index]
  result_ozone_temp <- t.test(is_different_miss,
                              is_different_complete)
}
DRY has benefits - but I think expression and complexity are more powerful
[W]e want to establish the idea that a computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute.
— Structure and Interpretation of Computer Programs. Abelson, Sussman, and Sussman, 1984.
There are only two hard things in Computer Science: cache invalidation and naming things.
– Phil Karlton
Converting temperature?
The idea of inputs and outputs isn’t hard
What is hard is taking code (like the code in a data analysis) and finding the parts that need to change
There’s a level of “I got it to work” and there’s a level of “It works, and I can reason about it”
– Joe Cheng, “You have to be able to reason about it” | Data Science Hangout
…how do you take all this complexity and break it down into smaller pieces
each of which you can reason about
each of which you can hold in your head
each of which you can look at and be like “yup, I can fully ingest this entire function definition, I can read it line by line and prove to myself this is definitely correct”…
So software engineering… is a lot about this: How do you break up inherently complicated things that we are trying to do into small pieces that are individually easy to reason about. That’s half the battle…
The other half of the battle is how do we combine them in ways that can be reliable and also easy to reason about
how do you take all this complexity and break it down into smaller pieces, each of which you can reason about, each of which you can hold in your head
Our working memory = 7 +/- 2 (5-9) chunks
“The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information” George A. Miller, Psychological Review, 1956, 63 (2)
Memory is not limited by bits, but by chunks
1-8-0-0-1-3-1-0-8-6
1800 131 086
So practice breaking code into chunks
50 lines of code
Is not 50 ideas
Chunk code into ideas
Reason with them
Find the complexity
Abstract complexity
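One way to practise this on the running example is to split `missingness_impact()` into two smaller ideas and then combine them. This is a sketch only: `split_by_missingness()` and `compare_groups()` are hypothetical helper names, not functions from the talk.

```r
# Idea 1: split a variable by whether another variable is missing
split_by_missingness <- function(is_different, when_missing) {
  list(
    miss     = is_different[is.na(when_missing)],
    complete = is_different[!is.na(when_missing)]
  )
}

# Idea 2: compare the two groups with a Welch t-test
compare_groups <- function(groups) {
  t.test(groups$miss, groups$complete)
}

# Combine the two ideas; each piece can be read and checked on its own
missingness_impact <- function(is_different, when_missing) {
  groups <- split_by_missingness(is_different, when_missing)
  compare_groups(groups)
}

temp_groups <- split_by_missingness(airquality$Temp, airquality$Ozone)
lengths(temp_groups)  # how many values fall in each group
missingness_impact(airquality$Temp, airquality$Ozone)
```

Each helper holds one chunk, so checking the split (for example with `lengths()`) no longer requires thinking about the test at the same time.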
browser()
# example function code here - something useful though
tidy_education <- function(data, names){
  names <- trimws(names)
  browser() # jump into this point in the function
  # "n" Next line
  tidied <-
  # "s" Step into highlighted function
  combined <- combine_custom_data(tidied, data)
  # "f" Finish execution of loop or function
  # "Q" Quit browser
  # "help" - loads a list of helper code
}
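The example above is only an outline, so here is a small runnable sketch of the same pattern. `summarise_temperature()` and its `debug` argument are made up for illustration, and the `interactive()` guard is an added assumption so that the function does not pause when run from a script.

```r
# Hypothetical function, used only to show where browser() could sit
summarise_temperature <- function(data, debug = FALSE) {
  temps <- data$Temp
  # Pause here only when asked to, and only in an interactive session
  if (debug && interactive()) browser()
  # At the Browse prompt: "n" runs the next line, "c" continues, "Q" quits
  temp_range <- range(temps, na.rm = TRUE)
  temp_mean <- mean(temps, na.rm = TRUE)
  c(min = temp_range[1], mean = temp_mean, max = temp_range[2])
}

temp_summary <- summarise_temperature(airquality)
temp_summary

# Run this one at the console to step through the function body:
# summarise_temperature(airquality, debug = TRUE)
```

While paused, every object defined so far in the function (here, `temps`) can be inspected by typing its name.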
The ambiguity [of the S language] is real and goes to a key objective: we wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important
convey_key_ideas()
browser() or other debugging tools
njtierney.github.io/funfun-curtin/
njtierney/funfun-curtin
nicholas.tierney@gmail.com
debug() and friends
debug(function_name): drops a browser() into the code of function_name(), until end of session or undebug(function_name)
debugonce(function_name): debug(function_name), once
options(error = recover): watch Miles McBain’s “stop() - breathe - recover()” video
options(error = browser): Will drop a browser() in your code when you hit an error. Kind of scary.
options(error = NULL): turn off these special modes
fnmate: milesmcbain/fnmate
targets: ropensci/targets
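The debugging helpers listed above can be tried on a toy function. In this sketch `add_one()` is a made-up example; the calls that would open the debugger are wrapped in `interactive()` so the code is safe to run from a script.

```r
# Toy function to practise on
add_one <- function(x) {
  x + 1
}

debug(add_one)                     # flag add_one() so calls enter the browser
flagged <- isdebugged(add_one)     # TRUE while the flag is set
undebug(add_one)                   # remove the flag again
unflagged <- !isdebugged(add_one)  # TRUE once the flag is gone

if (interactive()) {
  debugonce(add_one)  # step through the next call only
  add_one(1)

  old_opts <- options(error = recover)  # on error, pick a frame to inspect
  # ... run the code that fails ...
  options(old_opts)                     # restore the previous error handler
}
```

Saving the value returned by `options()` makes it easy to put the previous error handler back instead of guessing what it was.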