Lecture 06 - Good & Bad Controls
Rose / Thorn
Rose: identifying and working through bad controls, table 2 fallacy
Thorn:
Randomization
- randomizing the treatment can remove confounding (only available in experiments)
- effectively removes all the other arrows into X
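A minimal sketch of this idea (my own, not from the lecture): simulate a treatment that shares an unobserved cause with the outcome, then a randomized version of the same treatment, and compare the naive regression estimates. All names and effect sizes are invented for illustration.

```r
# hypothetical illustration: randomization deletes the U -> X arrow
set.seed(1)
n <- 1e4
U <- rnorm(n)                       # unobserved confounder
X_obs  <- rnorm(n, U)               # observational: U influences X
X_rand <- rnorm(n)                  # experimental: X assigned at random
Y_obs  <- rnorm(n, 0*X_obs + U)     # true effect of X on Y is zero
Y_rand <- rnorm(n, 0*X_rand + U)
coef( lm(Y_obs ~ X_obs) )['X_obs']     # biased away from zero by U
coef( lm(Y_rand ~ X_rand) )['X_rand']  # near zero: confound removed
```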
Causal Thinking
in an experiment, we cut causes of the treatment -> we randomize
simulating intervention mimics randomization
do(X) means intervene on X
example: simple confound
```r
library(ggdag)

dag <- dagify(
  X ~ U,
  Y ~ U + X
)
ggdag(dag) +
  theme_dag()
```
stratifying by U blocks the backdoor path and allows us to estimate the effect of X -> Y
marginalize or average over control variables
- the coefficient alone is usually not satisfactory; we need to marginalize
- the causal effect of X on Y is the distribution of Y when we change X, averaged over the distributions of the control variables
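Written as the standard adjustment formula in do-notation (not spelled out in these notes), for a control/adjustment variable \(U\):

\[ P(Y \mid do(X)) = \sum_{u} P(Y \mid X, U = u)\, P(U = u) \]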
```r
dag <- dagify(
  Baboons ~ Cheetahs,
  Gazelle ~ Baboons + Cheetahs
)
ggdag(dag) +
  theme_dag()
```
the population of each species influences the others
when cheetahs are present, baboons are scared off and do not influence the gazelle population
when cheetahs are absent, baboons eat and regulate the gazelle population
to assess the causal effect of baboons, we need to average over the cheetah population (see the sketch below)
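A toy simulation of that averaging (my own; all effect sizes are invented): estimate the baboon effect within each cheetah stratum, then weight the two estimates by how often each stratum occurs.

```r
# hypothetical: baboons suppress gazelles only when cheetahs are absent
set.seed(2)
n <- 1e4
C <- rbinom(n, 1, 0.5)                    # cheetahs present?
B <- rnorm(n, -0.5*C)                     # cheetahs suppress baboons
G <- rnorm(n, ifelse(C==1, 0, -1)*B - 0.5*C)
b0 <- coef( lm(G[C==0] ~ B[C==0]) )[2]    # baboon effect, cheetahs absent
b1 <- coef( lm(G[C==1] ~ B[C==1]) )[2]    # baboon effect, cheetahs present
mean(C==0)*b0 + mean(C==1)*b1             # marginal (average) causal effect
```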
Do-Calculus
allows us to determine if it is possible to answer our question using a DAG
do-calculus tells us if we need to make additional functional assumptions
backdoor criterion
a shortcut for applying do-calculus by eye
a rule for finding a set of variables to stratify by that yields an estimate of our estimand
- identify all paths connecting treatment (X) to outcome (Y)
- paths with arrows entering X are backdoor paths (non-causal paths)
- find adjustment set that closes/blocks all backdoor paths
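The dagitty package can apply the backdoor criterion mechanically. A quick check (my own) on the DAG used in the simulation below, where Z lies on the backdoor path X <- Z <- U -> Y:

```r
library(dagitty)

g <- dagitty("dag { U -> Z ; Z -> X ; U -> Y ; X -> Y }")
# minimal sets that close all backdoor paths from X to Y: { U } or { Z };
# since U is unobserved, we stratify by Z
adjustmentSets(g, exposure = "X", outcome = "Y")
```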
```r
library(rethinking)

# simulate confounded Y
N <- 200
b_XY <- 0
b_UY <- -1
b_UZ <- -1
b_ZX <- 1

set.seed(10)
U <- rbern(N)
Z <- rnorm(N, b_UZ*U)
X <- rnorm(N, b_ZX*Z)
Y <- rnorm(N, b_XY*X + b_UY*U)
d <- list(Y=Y, X=X, Z=Z)
```
```r
# ignore U,Z
m_YX <- quap(alist(
  Y ~ dnorm(mu, sigma),
  mu <- a + b_XY*X,
  a ~ dnorm(0, 1),
  b_XY ~ dnorm(0, 1),
  sigma ~ dexp(1)
), data = d)

# stratify by Z
m_YXZ <- quap(alist(
  Y ~ dnorm(mu, sigma),
  mu <- a + b_XY*X + b_Z*Z,
  a ~ dnorm(0, 1),
  c(b_XY, b_Z) ~ dnorm(0, 1),
  sigma ~ dexp(1)
), data = d)
```
```r
post  <- extract.samples(m_YX)
post2 <- extract.samples(m_YXZ)

density_1 <- density(post$b_XY)
density_2 <- density(post2$b_XY)
plot(density_1, col=2,
     xlim = c(min(density_1$x, density_2$x), max(density_1$x, density_2$x)),
     xlab = "posterior b_XY", main = "red = confounded, blue = stratified")
lines(density_2, col=4)
```
when you add a variable to a model as part of the adjustment set (i.e., as a control), its coefficient is usually not interpretable as a causal effect
the minimum adjustment set is not always the best set
it doesn't consider statistical efficiency
sometimes you want to stratify by variables outside the minimum adjustment set to make your model more efficient (see the sketch below)
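A rough illustration of the efficiency point (my own sketch; names and numbers invented): W causes Y but is not needed for identification, yet including it shrinks the standard error on X.

```r
# hypothetical: W -> Y only, so W is in no adjustment set for X -> Y
set.seed(3)
n <- 100
W <- rnorm(n)
X <- rnorm(n)
Y <- rnorm(n, X + 2*W)
summary( lm(Y ~ X) )$coefficients['X', 'Std. Error']      # larger
summary( lm(Y ~ X + W) )$coefficients['X', 'Std. Error']  # smaller: W soaks up noise in Y
```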
```r
dag <- dagify(
  P ~ G + U,
  C ~ P + G + U
)
ggdag(dag) +
  theme_dag()
```
P is both a mediator and a collider, so we can't get the direct effect of G on C because we don't observe U
- we can still estimate the total effect of G on C
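dagitty makes the same point mechanically (my own check; the `[latent]` tag marks U as unobserved): an adjustment set exists for the total effect but not for the direct effect.

```r
library(dagitty)

g <- dagitty("dag { G -> P ; G -> C ; P -> C ; U -> P ; U -> C ; U [latent] }")
adjustmentSets(g, exposure = "G", outcome = "C", effect = "total")   # {} : nothing to adjust
adjustmentSets(g, exposure = "G", outcome = "C", effect = "direct")  # no sets: not identifiable
```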
Good & Bad Controls
control variable: variable introduced to an analysis so that a causal estimate is possible
- good and bad controls
collinearity (or its absence) is not a good reason for including or excluding variables
- collinearity can arise from many causal processes
post-treatment variables are often risky controls
if a variable does not lie on a backdoor path from treatment to outcome, you don't need to control for it
```r
# sim confounding by post-treatment variable
f <- function(n=100, bXZ=1, bZY=1) {
  X <- rnorm(n)
  u <- rnorm(n)
  Z <- rnorm(n, bXZ*X + u)
  Y <- rnorm(n, bZY*Z + u)
  bX  <- coef( lm(Y ~ X) )['X']
  bXZ <- coef( lm(Y ~ X + Z) )['X']
  return( c(bX, bXZ) )
}
sim <- mcreplicate( 1e4, f(), mc.cores = 1 )
```
```r
density_1 <- density(sim[1,])
density_2 <- density(sim[2,])
plot(density_1, xlim=c(-1,0.8), ylim=c(0,2.7), xlab = "posterior mean",
     main = "red = confounded, black = correct")
lines(density_2, col=2)
```
case control bias (selection on outcome)
it is very bad to add descendants of your outcome to your model
stratifying by a descendant of the outcome (e.g., Z) weakly stratifies by the outcome itself
```r
dag <- dagify(
  Y ~ X,
  Z ~ Y
)
ggdag(dag) +
  theme_dag()
```
```r
f <- function(n=100, bXY=1, bYZ=1) {
  X <- rnorm(n)
  Y <- rnorm(n, bXY*X)
  Z <- rnorm(n, bYZ*Y)
  bX  <- coef( lm(Y ~ X) )['X']
  bXZ <- coef( lm(Y ~ X + Z) )['X']
  return( c(bX, bXZ) )
}
sim <- mcreplicate( 1e4, f(), mc.cores = 1 )
```
```r
density_1 <- density(sim[1,])
density_2 <- density(sim[2,])
plot(density_1, xlim=c(-1,0.8), ylim=c(0,2.7), xlab = "posterior mean",
     main = "red = confounded, black = correct")
lines(density_2, col=2)
```
precision parasite
no backdoor paths, because Z is connected to Y only through X
stratifying by Z adds no bias, but it removes variation in X and so reduces the precision of the estimate
```r
dag <- dagify(
  Y ~ X,
  X ~ Z
)
ggdag(dag) +
  theme_dag()
```
```r
f <- function(n=100, bZX=1, bXY=1) {
  Z <- rnorm(n)
  X <- rnorm(n, bZX*Z)
  Y <- rnorm(n, bXY*X)
  bX  <- coef( lm(Y ~ X) )['X']
  bXZ <- coef( lm(Y ~ X + Z) )['X']
  return( c(bX, bXZ) )
}
sim <- mcreplicate( 1e4, f(n=50), mc.cores = 1 )
```
```r
density_1 <- density(sim[1,])
density_2 <- density(sim[2,])
plot(density_1, xlim=c(-1,0.8), ylim=c(0,2.7), xlab = "posterior mean",
     main = "black = X alone, red = also stratified by Z")
lines(density_2, col=2)
```
bias amplification
X and Y are confounded by u
adding Z biases the answer even more, because it amplifies the influence of the confound
```r
dag <- dagify(
  Y ~ X + u,
  X ~ Z + u
)
ggdag(dag) +
  theme_dag()
```
```r
f <- function(n=100, bZX=1, bXY=1) {
  Z <- rnorm(n)
  u <- rnorm(n)
  X <- rnorm(n, bZX*Z + u)
  Y <- rnorm(n, bXY*X + u)
  bX  <- coef( lm(Y ~ X) )['X']
  bXZ <- coef( lm(Y ~ X + Z) )['X']
  return( c(bX, bXZ) )
}

# true value zero
sim <- mcreplicate( 1e4, f(bXY=0), mc.cores = 1 )
```
```r
density_1 <- density(sim[1,])
density_2 <- density(sim[2,])
plot(density_1, xlim=c(0,1), ylim=c(0,6), xlab = "posterior mean",
     main = "red = more biased, black = biased")
lines(density_2, col=2)
```
covariation in X & Y requires variation in their causes
within each level of Z, less variation in X
additionally, the confound u becomes relatively more important within each level of Z
double biasing
```r
# draw a regression line with a white underlay so it stands out
abline_w <- function(..., col=1, lwd=1, dlwd=2) {
  abline(..., col="white", lwd=lwd+dlwd)
  abline(..., col=col, lwd=lwd)
}

n <- 1000
Z <- rbern(n)
u <- rnorm(n)
X <- rnorm(n, 7*Z + u)
Y <- rnorm(n, 0*X + u)

cols <- c( col.alpha(2,0.5), col.alpha(4,0.5) )
plot( X, Y, col=cols[Z+1], lwd=2 )
abline_w( lm(Y ~ X), lwd=3 )                      # overall fit (biased by u)
abline_w( lm(Y[Z==1] ~ X[Z==1]), lwd=3, col=4 )   # within Z = 1
abline_w( lm(Y[Z==0] ~ X[Z==0]), lwd=3, col=2 )   # within Z = 0
```
- the damage increases once stratified by Z (red and blue lines)
Summary
adding control variables can be worse than omitting them
there are good controls - backdoor criterion
make assumptions explicit
Bonus - Table 2 Fallacy
not all coefficients are causal effects
a statistical model designed to identify the effect of X -> Y will not, in general, also identify the effects of the control variables
\(Y_i \sim \text{Normal}(\mu_i, \sigma)\)
\(\mu_i = \alpha + \beta_X X_i + \beta_S S_i + \beta_A A_i\)
think through the DAG for each control variable to see what its coefficient actually means
no interpretation without causal representation
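A hypothetical DAG to make this concrete (my own construction; the roles of S and A are assumed from the notation above, not taken from the lecture):

```r
library(ggdag)

# assume S and A confound X -> Y, and A also causes S
dag <- dagify(
  Y ~ X + S + A,
  X ~ S + A,
  S ~ A
)
ggdag(dag) + theme_dag()
```

Under this DAG, \(\beta_X\) estimates the total effect of X (its backdoors are closed by S and A), but \(\beta_S\) and \(\beta_A\) estimate only direct effects, since each one's path through X is blocked by conditioning on X: three different estimands sitting in one table.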
TODO
- read Table 2 Fallacy paper
- what are other types of causal inference that are not multiple regression?