Lecture 14 - Correlated Features
Multivariate Prior
\(C_i \sim Bernoulli(p_i)\)
\(logit(p_i) = \alpha_{D[i]} + \beta_{D[i]}U_i\)
\(\alpha_j \sim Normal(\bar{\alpha}, \sigma)\)
\(\beta_j \sim Normal(\bar{\beta}, \tau)\)
\(\bar{\alpha}, \bar{\beta} \sim Normal(0, 1)\)
\(\sigma, \tau \sim Exponential(1)\)
if we combine the two varying effects into a single multivariate prior, we can also learn the correlation between them:
\([\alpha_j, \beta_j] \sim MVNormal([\bar{\alpha}, \bar{\beta}], R, [\sigma, \tau])\)
- \([\alpha_j, \beta_j]\): vector of features for cluster j
- \([\bar{\alpha}, \bar{\beta}]\): vector of means, the same length as the number of features
- \(R\): correlation matrix
- \([\sigma, \tau]\): vector of standard deviations, one for each feature
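To build intuition for what this prior implies, we can sample cluster effects from it directly. A minimal sketch using rethinking's rmvnorm2; the means, standard deviations, and correlation here are illustrative assumptions (R is fixed for now; its prior comes next):
library(rethinking)
# sample 1000 clusters' [alpha_j, beta_j] pairs from the multivariate prior
Mu <- c( 0 , 0 )                          # [abar, bbar] (assumed)
sigmas <- c( 1 , 0.5 )                    # [sigma, tau] (assumed)
Rho <- matrix( c(1,0.6,0.6,1) , nrow=2 )  # assumed correlation of 0.6
ab <- rmvnorm2( 1000 , Mu=Mu , sigma=sigmas , Rho=Rho )
plot( ab[,1] , ab[,2] , xlab="alpha_j" , ylab="beta_j" , col=col.alpha(2,0.3) )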
\(R \sim LKJCorr(4)\)
LKJ is a prior specifically for correlation matrices
- bigger values of its shape parameter are more skeptical of extreme correlations
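We can see this skepticism by sampling correlation matrices from the prior with rethinking's rlkjcorr and plotting the density of the off-diagonal correlation (a sketch; the eta values are the only assumptions):
library(rethinking)
R4 <- rlkjcorr( 1e4 , K=2 , eta=4 )   # skeptical of extreme correlations
R1 <- rlkjcorr( 1e4 , K=2 , eta=1 )   # flat over correlation matrices
dens( R4[,1,2] , lwd=3 , col=2 , xlab="correlation" )
dens( R1[,1,2] , add=TRUE , lty=2 )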
###########
# non-centered varying slopes with and without covariance
library(rethinking)
data(bangladesh)
d <- bangladesh
dat <- list(
    C = d$use.contraception,
    D = as.integer(d$district),
    U = d$urban,
    A = standardize(d$age.centered),
    K = d$living.children )
# no covariance
mCDUnc <- ulam(
    alist(
        C ~ bernoulli(p),
        logit(p) <- a[D] + b[D]*U,
        # define effects using other parameters
        save> vector[61]:a <<- abar + za*sigma,
        save> vector[61]:b <<- bbar + zb*tau,
        # z-scored effects
        vector[61]:za ~ normal(0,1),
        vector[61]:zb ~ normal(0,1),
        # ye olde hyper-priors
        c(abar,bbar) ~ normal(0,1),
        c(sigma,tau) ~ exponential(1)
    ) , data=dat , chains=4 , cores=4 )
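After fitting, the hyper-parameters can be summarized with precis (a usage sketch):
precis( mCDUnc , pars=c("abar","bbar","sigma","tau") )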
it is hard to learn the correlation from any finite sample
LKJ correlation matrix priors - priors for correlation matrices
- the shape parameter controls how concentrated the distribution is, so the prior can be skeptical of extreme correlations
# covariance - centered
mCDUcov <- ulam(
    alist(
        C ~ bernoulli(p),
        logit(p) <- a[D] + b[D]*U,
        # define effects using other parameters
        transpars> vector[61]:a <<- v[,1],
        transpars> vector[61]:b <<- v[,2],
        # priors - centered correlated varying effects
        matrix[61,2]:v ~ multi_normal(abar,Rho,sigma),
        vector[2]:abar ~ normal(0,1),
        corr_matrix[2]:Rho ~ lkj_corr(4),
        vector[2]:sigma ~ exponential(1)
    ) , data=dat , chains=4 , cores=4 )
centering vs non-centering to increase efficiency of models
centered = hyper-parameters appear inside the priors of the varying effects (priors of priors)
non-centered = re-express the model (mathematically equivalent) so the varying-effect priors contain no hyper-parameters - using z-scores - which can increase sampling efficiency
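The z-score trick written out for the intercepts (this is exactly what mCDUnc does above):
centered: \(\alpha_j \sim Normal(\bar{\alpha}, \sigma)\)
non-centered (equivalent): \(z_j \sim Normal(0, 1)\), \(\alpha_j = \bar{\alpha} + \sigma z_j\)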
# covariance - non-centered
mCDUcov_nc <- ulam(
    alist(
        C ~ bernoulli(p),
        logit(p) <- a[D] + b[D]*U,
        # define effects using other parameters
        # this is the non-centered Cholesky machine
        transpars> vector[61]:a <<- abar[1] + v[,1],
        transpars> vector[61]:b <<- abar[2] + v[,2],
        transpars> matrix[61,2]:v <-
            compose_noncentered( sigma , L_Rho , Z ),
        # priors - note that none have parameters inside them
        # that is what makes them non-centered
        matrix[2,61]:Z ~ normal( 0 , 1 ),
        vector[2]:abar ~ normal(0,1),
        cholesky_factor_corr[2]:L_Rho ~ lkj_corr_cholesky( 4 ),
        vector[2]:sigma ~ exponential(1),
        # convert Cholesky to Corr matrix
        gq> matrix[2,2]:Rho <<- Chol_to_Corr(L_Rho)
    ) , data=dat , chains=4 , cores=4 )
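One way to compare sampling efficiency between the two parameterizations is to plot the distributions of effective sample sizes (a sketch, assuming both covariance models above were fit and that the precis table exposes an n_eff column in your version of rethinking):
neff_c  <- precis( mCDUcov , depth=3 )$n_eff
neff_nc <- precis( mCDUcov_nc , depth=3 )$n_eff
plot( density( neff_nc , na.rm=TRUE ) , lwd=3 , col=2 , xlab="effective sample size" , main="" )
lines( density( neff_c , na.rm=TRUE ) , lwd=3 )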
nice to compare the prior to the posterior distribution to make sure the model learned something - for the correlation matrix, as sketched below
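A sketch of that check, assuming mCDUcov_nc was fit as above (posterior correlation vs the lkj_corr(4) prior):
post <- extract.samples( mCDUcov_nc )
dens( post$Rho[,1,2] , lwd=3 , col=2 , xlab="correlation between a and b" )
R_prior <- rlkjcorr( 1e4 , K=2 , eta=4 )
dens( R_prior[,1,2] , add=TRUE , lty=2 )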
correlated varying effects models are easier to fit in a Bayesian framework
priors learn correlation structure - sometimes the research question is about this
varying effects can be correlated even if the prior doesn't learn the correlations
- if you don't model the correlation, you just won't have a parameter for it, and thus no partial pooling through it
Inconvenient Posteriors
reparameterizing the priors can help with divergent transitions because it changes the shape of the posterior the sampler has to explore
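A quick check (a sketch; rethinking's divergent() counts the divergent transitions in a fit):
divergent( mCDUcov )     # centered
divergent( mCDUcov_nc )  # non-centered - typically fewer or zero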
brms uses the non-centered parameterization by default in multilevel models
- not always better
centered and non-centered parameterizations work better in different contexts
centered: lots of data in each cluster (data probability dominant)
non-centered: many clusters, sparse evidence (prior dominant)