Homework - Week 04
Q1
Revisit the marriage, age, and happiness collider bias example from Chapter 6. Run models m6.9 and m6.10 again (pages 178–179). Compare these two models using both PSIS and WAIC. Which model is expected to make better predictions, according to these criteria, and which model yields the correct causal inference?
.9 <- tar_read(h04_q1a)
m6.10 <- tar_read(h04_q1b)
m6
.9 <- add_criterion(m6.9, criterion = "waic")
m6.9 <- add_criterion(m6.9, criterion = "loo")
m6
.10 <- add_criterion(m6.10, criterion = "waic")
m6.10 <- add_criterion(m6.10, criterion = "loo")
m6
loo_compare(m6.9, m6.10)
elpd_diff se_diff
m6.9 0.0 0.0
m6.10 -194.2 17.6
We see that the model with both age and marriage is the best predictor using both metrics. Causally, marriage status in this problem is a collider because it is a common consquence of age and happiness. Therefore, the model with just age is the causally correct model.
Q2
Reconsider the urban fox analysis from last week’s homework. On the basis of PSIS and WAIC scores, which combination of variables best predicts body weight (W, weight)? What causal interpretation can you assign each coefficient (parameter) from the best scoring model??
<- tar_read(h03_q2)
food <- tar_read(h03_q3)
food_groupsize <- tar_read(h04_q2)
food_groupsize_area
<- add_criterion(food, criterion = "waic")
food <- add_criterion(food, criterion = "loo")
food
<- add_criterion(food_groupsize, criterion = "waic")
food_groupsize <- add_criterion(food_groupsize, criterion = "loo")
food_groupsize
<- add_criterion(food_groupsize_area, criterion = "waic")
food_groupsize_area <- add_criterion(food_groupsize_area, criterion = "loo")
food_groupsize_area
loo_compare(food, food_groupsize, food_groupsize_area)
elpd_diff se_diff
food_groupsize_area 0.0 0.0
food_groupsize -0.3 1.7
food -5.2 3.4
For prediction, the best model is the model with food, group size, and area included (though it is very similar to the food + group size model). Food is a descendent of area, which means that conditioning on both means we are estimating the effect of area independent of food availability. In this model, we are only causally assessing the effect of food on weight, independent of area.
Q3 (optional)
The data in data(Dinosaurs) are body mass estimates at different estimated ages for six different dinosaur species. See ?Dinosaurs for more details. Choose one or more of these species (at least one, but as many as you like) and model its growth. To be precise: Make a predictive model of body mass using age as a predictor. Consider two or more model types for the function relating age to body mass and score each using PSIS and WAIC. Which model do you think is best, on predictive grounds? On scientific grounds? If your answers to these questions differ, why? This is a challenging exercise, because the data are so scarce. But it is also a realistic example, because people publish Nature papers with even less data. So do your best, and I look forward to seeing your growth curves.
Going to attempt to use the Von Bertanaffy growth equation in brms (source):
\[ L_T = L_0e^{-rt} + L_{max}(1 - e^{-rt}) \]
<- Dinosaurs %>%
soi filter(species == "Massospondylus carinatus")
## Model 1: Linear Model
<- tar_read(h04_q3a)
linear
<- linear %>%
epred epred_draws(soi)
ggplot(data = soi, aes(x = age)) +
stat_lineribbon(aes(x = age, y = .epred), data = epred) +
geom_jitter(aes(y = mass), width = .1, height = 0, alpha = 0.7) +
scale_fill_brewer(palette = "Greys") +
labs(y = 'Linear Model Predicted Mass (kg)', x = "Age (yrs)", fill = "Credible Interval") +
theme_classic()
## Model 2: Growth Curve
<- Dinosaurs %>%
lagged_obs filter(species == "Massospondylus carinatus") %>%
mutate(time_diff = age - lag(age),
sizelast = lag(mass)) %>%
slice(-1)
# Von Bertanaffy growth equation
<- bf(mass ~ sizelast * exp(- exp(logR) * time_diff) +
vb_formula * (1 - exp(-exp(logR) * time_diff)),
sizeMax ~ 1,
logR ~ 1, nl = TRUE)
sizeMax
<- tar_read(h04_q3b)
growth
<- growth %>%
epredg epred_draws(lagged_obs)
ggplot(data = lagged_obs, aes(x = age)) +
stat_lineribbon(aes(x = age, y = .epred), data = epredg) +
geom_jitter(aes(y = mass), width = .1, height = 0, alpha = 0.7) +
scale_fill_brewer(palette = "Greys") +
labs(y = 'Growth Model Predicted Mass (kg)', x = "Age (yrs)", fill = "Credible Interval") +
theme_classic()
## Model 3: Exponential
<- tar_read(h04_q3c)
exponential
<- exponential %>%
eprede epred_draws(soi)
ggplot(data = soi, aes(x = age)) +
stat_lineribbon(aes(x = age, y = .epred), data = eprede) +
geom_jitter(aes(y = mass), width = .1, height = 0, alpha = 0.7) +
scale_fill_brewer(palette = "Greys") +
labs(y = 'Exponential Model Predicted Mass (kg)', x = "Age (yrs)", fill = "Credible Interval") +
theme_classic()
# Compare performance
data.frame(model = c('linear', 'growth', 'exponential'),
waic = c(waic(linear)[[3]], waic(growth)[[3]], waic(exponential)[[3]]),
psis = c(loo(linear)[[1]][[3]], loo(growth)[[1]][[3]], loo(exponential)[[1]][[3]]))
model waic psis
1 linear -51.88081 105.51477
2 growth -46.54505 94.01374
3 exponential -52.33163 117.97171
Growth model performs best predictively. On scientific grounds, maybe the growth model as well since it is rooted in literature and an established equation?