Lecture 07 - Fitting Over & Under
Rose / Thorn
Rose: prediction is different from causal inference
Thorn: could be clearer that this is all about prediction/have an example for prediction
Problems of Prediction
what function describes the data (fitting, compression)
what function explains these points (causal inference)
what would happen if we changed the data (intervention)
what is the next observation from the same process (prediction)
prediction is the absence of intervention
prediction does not require causal inference
Leave-one-out cross-validation
- drop one point
- fit line to remaining
- predict dropped point
- repeat with the next point until every point has been dropped once
- the score is the error on the dropped points
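A minimal sketch of the procedure in R, assuming a toy linear dataset and base-R lm (illustrative, not lecture code):

# leave-one-out cross-validation for a simple linear fit
set.seed(1)
x <- runif(30, -2, 2)
y <- rnorm(30, 1 + 0.5 * x, 0.7)
errors <- sapply(seq_along(y), function(i) {
    fit <- lm(y[-i] ~ x[-i])                    # fit line to the remaining points
    pred <- coef(fit)[1] + coef(fit)[2] * x[i]  # predict the dropped point
    (y[i] - pred)^2                             # error on the dropped point
})
mean(errors)   # out-of-sample score: mean squared error over the dropped points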
cross-validation is a task used to assess the expected predictive accuracy of a statistical procedure
score in-sample: fit to the sample used for estimation / score out-of-sample: accuracy of predictions for the dropped points
LPPD (log pointwise predictive density) is used as the cross-validation score because it uses the entire posterior distribution, not just a point estimate
more flexible patterns generally perform better in sample and worse out of sample (at least for simple models)
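A small sketch of how the LPPD is computed from posterior samples, assuming a toy intercept-only Gaussian model fit with the rethinking package (illustrative, not lecture code):

library(rethinking)
set.seed(2)
y <- rnorm(20, 1, 2)   # toy sample
m <- quap(alist(
    y ~ dnorm(mu, sigma),
    mu ~ dnorm(0, 10),
    sigma ~ dexp(1)
), data = list(y = y))
post <- extract.samples(m, n = 1000)   # draws from the entire posterior
# log-likelihood of each observation under each posterior draw (draws x observations)
ll <- sapply(y, function(yi) dnorm(yi, post$mu, post$sigma, log = TRUE))
lppd <- sum(log(colMeans(exp(ll))))    # average density over draws, then sum the logs
lppd

(In practice the average is taken on the log scale to avoid numerical underflow.)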
Cross-Validation
for simple models (no hyperparameters), more parameters improves fit to sample BUT may reduce accuracy of predictions out of sample
accurate models trade off flexibility with overfitting
there’s usually an optimal flexibility
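A quick illustration of that trade-off, assuming toy data whose true relation is linear and base-R polynomial fits (illustrative, not lecture code):

set.seed(1)
n <- 20
x <- runif(n, -2, 2)
y <- rnorm(n, x, 0.5)         # true relation is linear
x_new <- runif(n, -2, 2)
y_new <- rnorm(n, x_new, 0.5) # fresh sample from the same process
for (p in 1:6) {
    fit <- lm(y ~ poly(x, p))
    in_err  <- mean((y - fitted(fit))^2)
    out_err <- mean((y_new - predict(fit, data.frame(x = x_new)))^2)
    cat(sprintf("degree %d  in-sample MSE %.3f  out-of-sample MSE %.3f\n",
                p, in_err, out_err))
}

In-sample error keeps falling as the degree rises; out-of-sample error typically bottoms out at low degree and then worsens.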
Regularization
regular means learning the important/regular features of the sample - not getting too excited by every datapoint
regularization improves models, whereas LOO only compares models (all of the compared models can be bad)
overfitting depends upon the priors
don’t be too excited about every point in the sample, because not every point in the sample is regular (not all points are representative)
skeptical priors regularize models/inference - have tighter variance that reduces flexibility
- downweights improbable values
skeptical priors improve model prediction - regularize so that models learn regular features and ignore irregular features
- priors can be too tight for prediction (unless the sample size is small)
In sample gets worse with tighter priors, out of sample gets better with tighter priors
Regularizing priors -> for pure prediction uses, you can tune the prior using cross-validation
- causal inference uses science to choose priors
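A sketch of what tuning a regularizing prior can look like, assuming a toy regression and the rethinking package's quap and compare (model names and data here are illustrative):

library(rethinking)
set.seed(2)
x <- rnorm(30)
y <- rnorm(30, 0.2 * x, 1)
d_toy <- list(x = x, y = y)
m_wide <- quap(alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b * x,
    a ~ dnorm(0, 10),
    b ~ dnorm(0, 10),    # flat-ish prior: very flexible, prone to overfitting
    sigma ~ dexp(1)
), data = d_toy)
m_tight <- quap(alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b * x,
    a ~ dnorm(0, 0.5),
    b ~ dnorm(0, 0.5),   # skeptical prior: downweights improbable slopes
    sigma ~ dexp(1)
), data = d_toy)
compare(m_wide, m_tight, func = PSIS)   # estimated out-of-sample comparison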
Prediction Penalty
For N points, cross-validation requires fitting N models
- feasible for few data points but for many data points gets unwieldy
Importance sampling (PSIS, Pareto-smoothed importance sampling) and information criteria (WAIC, the widely applicable information criterion) let you estimate the prediction penalty from a single model's posterior distribution (for predictive models)
WAIC, PSIS, cross-validation (CV) measure overfitting
- regularization manages overfitting
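As a sketch, both estimates come from a single fitted model; here a toy regression fit with quap from the rethinking package (illustrative, not lecture code):

library(rethinking)
set.seed(3)
x <- rnorm(50)
y <- rnorm(50, 0.5 * x, 1)
m <- quap(alist(
    y ~ dnorm(mu, sigma),
    mu <- a + b * x,
    a ~ dnorm(0, 1),
    b ~ dnorm(0, 1),
    sigma ~ dexp(1)
), data = list(x = x, y = y))
WAIC(m)   # widely applicable information criterion: lppd minus a penalty term
PSIS(m)   # Pareto-smoothed importance sampling estimate of leave-one-out CV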
Causal inference is not addressed by measuring or managing overfitting
these tools are addressing the performance of a predictive model, not a causal model
should not select causal models based on these values because they are not associated with causality
these are all predictive metrics
Model Mis-selection
Do not use predictive criteria (WAIC, PSIS, CV) to choose a causal estimate
Predictive criteria prefer models that include confounds and colliders
- because these improve predictive accuracy even as they bias causal estimates
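A toy simulation sketch of that failure mode, assuming Z is a collider of X and Y (names and values are illustrative): conditioning on Z improves prediction of Y, so WAIC tends to prefer the collider model, yet its coefficient on X no longer estimates the causal effect.

library(rethinking)
set.seed(4)
N <- 200
X <- rnorm(N)
Y <- rnorm(N, 0.3 * X)    # true causal effect of X on Y is 0.3
Z <- rnorm(N, X + Y)      # collider: caused by both X and Y
d_sim <- list(X = X, Y = Y, Z = Z)
m_correct <- quap(alist(
    Y ~ dnorm(mu, sigma),
    mu <- a + bX * X,
    a ~ dnorm(0, 0.5), bX ~ dnorm(0, 0.5), sigma ~ dexp(1)
), data = d_sim)
m_collider <- quap(alist(
    Y ~ dnorm(mu, sigma),
    mu <- a + bX * X + bZ * Z,
    a ~ dnorm(0, 0.5), bX ~ dnorm(0, 0.5), bZ ~ dnorm(0, 0.5), sigma ~ dexp(1)
), data = d_sim)
compare(m_correct, m_collider)   # the collider model typically wins on WAIC
precis(m_collider)               # but bX is biased away from the true 0.3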
Outliers & Robust Regression
some points are more influential than others - ‘outliers’
outliers are information - don’t necessarily want to remove them
but they often have high leverage/weight because they are “surprising”
dropping outliers ignores the problem - predictions will still be bad
model is wrong, not the data
can quantify the influence of each point on the posterior distribution using cross-validation
can also use a mixture model/robust regression to address outliers
divorce rate example
Maine and Idaho are outliers in divorce/age relationship
quantify influence of outliers using PSIS k statistic or WAIC penalty term
unmodelled sources of variation cause outliers -> error distributions are not constant across the sample
assuming the data are a mixture of Gaussian error distributions with the same mean but different variances leads to Student-t distributed errors
Gaussian distribution has extremely thin tails, so it is very skeptical of extreme values
Student-t distribution has much wider tails, so it is less skeptical, less influenced by outliers, and more robust
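For intuition, compare the densities far out in the tail (base-R, illustrative):

dnorm(6, 0, 1)   # roughly 6e-09: a 6-sigma point is essentially impossible under a Gaussian
dt(6, df = 2)    # roughly 0.004: the thick-tailed Student-t (2 df) is far less surprised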
library(rethinking)   # provides quap, standardize, dstudent, and the WaffleDivorce data
data(WaffleDivorce)
d <- WaffleDivorce

dat <- list(
    D = standardize(d$Divorce),
    M = standardize(d$Marriage),
    A = standardize(d$MedianAgeMarriage)
)

# Gaussian error model
m5.3 <- quap(alist(
    D ~ dnorm(mu, sigma),
    mu <- a + bM*M + bA*A,
    a ~ dnorm(0, 0.2),
    bM ~ dnorm(0, 0.5),
    bA ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
), data = dat)

# Student-t error model (nu = 2): thicker tails, more robust to outliers
m5.3t <- quap(alist(
    D ~ dstudent(2, mu, sigma),
    mu <- a + bM*M + bA*A,
    a ~ dnorm(0, 0.2),
    bM ~ dnorm(0, 0.5),
    bA ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
), data = dat)
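The influence diagnostics and model comparison then look roughly like this (a sketch assuming the rethinking functions PSIS, WAIC, and compare with pointwise output):

PSIS_m5.3 <- PSIS(m5.3, pointwise = TRUE)   # per-point Pareto k values
WAIC_m5.3 <- WAIC(m5.3, pointwise = TRUE)   # per-point penalty terms
plot(PSIS_m5.3$k, WAIC_m5.3$penalty,
     xlab = "PSIS Pareto k", ylab = "WAIC penalty")   # Idaho and Maine stand out
compare(m5.3, m5.3t, func = PSIS)   # the Student-t model is less influenced by them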
Robust Regressions
unobserved heterogeneity in sample -> mixture of Gaussian errors
- thicker tails means model is less surprised/more robust
hard to choose the degrees of freedom (tail thickness) of the Student-t distribution because extreme values are rare - you can fit the model at several values and report all of them (see the sketch at the end of this section)
Student-t regression can be a good default for undertheorized domains
- because Gaussian distribution is so skeptical
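A sketch of that sensitivity check, reusing the dat list above and passing the degrees of freedom nu through the data (illustrative, not lecture code):

fit_nu <- function(nu) quap(alist(
    D ~ dstudent(nu, mu, sigma),
    mu <- a + bM*M + bA*A,
    a ~ dnorm(0, 0.2),
    bM ~ dnorm(0, 0.5),
    bA ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
), data = c(dat, list(nu = nu)))
m_nu2 <- fit_nu(2)
m_nu5 <- fit_nu(5)
m_nu30 <- fit_nu(30)
precis(m_nu2); precis(m_nu5); precis(m_nu30)   # report the posterior under each choice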
Prediction
what is the next observation from the same process? = prediction
possible to make very good predictions without knowing causes
optimizing prediction does not reliably reveal causes