Lecture 06 - Good & Bad Controls

Author

Isabella C. Richmond

Published

February 22, 2024

Rose / Thorn

Rose: identifying and working through bad controls; the Table 2 fallacy

Thorn:

Randomization

  • randomizing the treatment removes confounding (only available in experiments)
  • effectively removes all other arrows into X
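  • a minimal sketch (my own, with hypothetical variable names) of why this works: the confound U still affects Y, but once X is assigned at random the arrow U -> X is gone

# hypothetical sketch: U confounds X and Y; the true effect of X on Y is zero
set.seed(1)
N <- 1000
U <- rnorm(N)

# observational: U also drives the treatment
X_obs <- rnorm(N, U)
Y_obs <- rnorm(N, 0*X_obs + U)
coef(lm(Y_obs ~ X_obs))['X_obs']     # biased away from zero

# experimental: X assigned at random, cutting U -> X
X_rand <- rnorm(N)
Y_rand <- rnorm(N, 0*X_rand + U)
coef(lm(Y_rand ~ X_rand))['X_rand']  # approximately zero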

Causal Thinking

  • in an experiment, we cut the causes of the treatment by randomizing it

  • simulating intervention mimics randomization

  • do(X) means intervene on X

  • example: simple confound

library(ggdag)

dag <- dagify(
    X ~ U,
    Y ~ U + X
)

ggdag(dag) +
    theme_dag()

  • stratifying by U blocks the backdoor path and allows us to estimate the effect of X -> Y

  • marginalize or average over control variables

    • the coefficient alone is usually not satisfactory; we need to marginalize
    • the causal effect of X on Y is the distribution of Y when we change X, averaged over the distributions of the control variables
dag <- dagify(
    Baboons ~ Cheetahs,
    Gazelle ~ Baboons + Cheetahs
)

ggdag(dag) +
    theme_dag()

  • the population of each of these species influences the others

    • when cheetahs are present, baboons are scared and do not influence gazelle population

    • when cheetahs are absent, baboons eat and regulate gazelle population

    • to assess causal effect of baboons, need to average over cheetah population
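    • a minimal sketch of the marginalization (my own simulation, assuming cheetahs simply switch the baboon effect off): estimate the baboon effect within each cheetah stratum, then average over the cheetah distribution

# hypothetical sketch: baboons regulate gazelles only when cheetahs are absent
set.seed(2)
n <- 1e4
cheetahs <- rbinom(n, 1, 0.5)
baboons  <- rnorm(n, -cheetahs)            # cheetahs suppress baboons
b_BG     <- ifelse(cheetahs == 1, 0, -1)   # baboon effect depends on cheetahs
gazelles <- rnorm(n, b_BG*baboons - cheetahs)

# stratum-specific effects of baboons on gazelles
b0 <- coef(lm(gazelles[cheetahs==0] ~ baboons[cheetahs==0]))[2]  # about -1
b1 <- coef(lm(gazelles[cheetahs==1] ~ baboons[cheetahs==1]))[2]  # about 0

# average causal effect: marginalize over the cheetah distribution
mean(cheetahs==0)*b0 + mean(cheetahs==1)*b1                      # about -0.5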

Do-Calculus

  • allows us to determine if it is possible to answer our question using a DAG

  • do-calculus tells us if we need to make additional functional assumptions

  • backdoor criterion

    • a shortcut for applying do-calculus by eye

    • a rule for finding a set of variables to stratify by that yields an estimate of our estimand

      1. identify all paths connecting treatment (X) to outcome (Y)
      2. paths with arrows entering X are backdoor paths (non-causal paths)
      3. find adjustment set that closes/blocks all backdoor paths
library(rethinking)

# simulate confounded Y
N <- 200
b_XY <- 0
b_UY <- -1
b_UZ <- -1
b_ZX <- 1

set.seed(10)
U <- rbern(N)
Z <- rnorm(N, b_UZ*U)
X <- rnorm(N, b_ZX*Z)
Y <- rnorm(N, b_XY*X + b_UY*U)
d <- list(Y=Y, X=X, Z=Z)

# ignore U,Z 
m_YX <- quap(alist(
  Y ~ dnorm(mu, sigma),
  mu <- a + b_XY*X,
  a ~ dnorm(0,1),
  b_XY ~ dnorm(0, 1),
  sigma ~ dexp(1)
), data = d)

# stratify by Z
m_YXZ <- quap(alist(
  Y ~ dnorm(mu, sigma),
  mu <- a + b_XY*X + b_Z*Z,
  a ~ dnorm(0,1),
  c(b_XY, b_Z) ~ dnorm(0, 1),
  sigma ~ dexp(1)
), data = d)

post <- extract.samples(m_YX)
post2 <- extract.samples(m_YXZ)

density_1 <- density(post$b_XY)
density_2 <- density(post2$b_XY)
plot(density_1, col=2, xlim = c(min(density_1$x,density_2$x), max(density_1$x,density_2$x)),
     xlab = "posterior b_XY", main = "red = confounded, blue = stratified")
lines(density_2, col=4)

  • the coefficients of variables added as part of the adjustment set (i.e., to control for) are usually not interpretable as causal effects

  • minimum adjustment set is not always the best set

    • doesn’t consider statistical efficiency

    • sometimes want to stratify by things that are not in the minimum adjustment set to make your model more efficient
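    • a minimal sketch of that point (my own simulation): here Z causes only Y, so it belongs to no adjustment set, yet stratifying by it soaks up residual variance and tightens the estimate of b_XY

# hypothetical sketch: Z -> Y only, so Z opens no backdoor
set.seed(3)
f <- function(n=50, bXY=1, bZY=2) {
    X <- rnorm(n)
    Z <- rnorm(n)
    Y <- rnorm(n, bXY*X + bZY*Z)
    c( coef(lm(Y ~ X))['X'], coef(lm(Y ~ X + Z))['X'] )
}
sim <- replicate(1e3, f())
apply(sim, 1, sd)   # both unbiased, but smaller spread when Z is included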

dag <- dagify(
    P ~ G + U,
    C ~ P + G + U
)

ggdag(dag) +
    theme_dag()

  • P is a mediator and collider, so we can’t get the direct effect of G on C because we don’t have U

    • we can still estimate the total effect of G on C (see the sketch below)
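    • a minimal sketch of that claim (my own simulation, all direct effects set to 1): regressing C on G alone recovers the total effect, while stratifying by the collider P yields neither the total nor the direct effect

# hypothetical sketch: P ~ G + U, C ~ P + G + U
set.seed(4)
n <- 1e4
G <- rnorm(n)
U <- rnorm(n)
P <- rnorm(n, G + U)
C <- rnorm(n, P + G + U)     # total effect of G on C is 1 + 1*1 = 2

coef(lm(C ~ G))['G']         # about 2: total effect identified
coef(lm(C ~ G + P))['G']     # about 0.5: collider bias through P <- U -> C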

Good & Bad Controls

  • control variable: variable introduced to an analysis so that a causal estimate is possible

    • good and bad controls
  • collinearity (or the lack of it) is not a good reason for including or excluding variables

    • collinearity can arise from many causal processes
  • post-treatment variables are often risky controls

  • if a variable does not sit on a backdoor path between treatment and outcome, you don’t need to control for it

# sim confounding by post-treatment variable

f <- function(n=100,bXZ=1,bZY=1) {
    X <- rnorm(n)
    u <- rnorm(n)                      # unobserved confound of Z and Y
    Z <- rnorm(n, bXZ*X + u)           # Z is post-treatment: X -> Z
    Y <- rnorm(n, bZY*Z + u )
    bX <- coef( lm(Y ~ X) )['X']       # correct: X alone
    bXZ <- coef( lm(Y ~ X + Z) )['X']  # confounded: stratified by Z
    return( c(bX,bXZ) )
}

sim <- mcreplicate( 1e4 , f(), mc.cores = 1)
density_1 <- density(sim[1,])
density_2 <- density(sim[2,])
plot(density_1, xlim=c(-1,0.8), ylim=c(0,2.7), xlab = "posterior mean",
     main = "red = confounded, black = correct")
lines(density_2, col=2)

  • case-control bias (selection on the outcome)

    • very bad to add descendants of your outcome to your model

    • stratifying by Z (a descendant of Y) is effectively a weak stratification by the outcome itself

dag <- dagify(
    Y ~ X,
    Z ~ Y
)

ggdag(dag) +
    theme_dag()

f <- function(n=100,bXY=1,bYZ=1) {
    X <- rnorm(n)
    Y <- rnorm(n, bXY*X )
    Z <- rnorm(n, bYZ*Y )              # Z is a descendant of the outcome
    bX <- coef( lm(Y ~ X) )['X']       # correct: X alone
    bXZ <- coef( lm(Y ~ X + Z) )['X']  # biased: stratified by Z
    return( c(bX,bXZ) )
}

sim <- mcreplicate( 1e4 , f(), mc.cores = 1 )
density_1 <- density(sim[1,])
density_2 <- density(sim[2,])
plot(density_1, xlim=c(-1,0.8), ylim=c(0,2.7), xlab = "posterior mean",
     main = "red = biased (stratified by Z), black = correct")
lines(density_2, col=2)

  • precision parasite

    • no backdoors because Z is not connected to Y except through X

    • not good to stratify by Z: it explains away part of the variation in X, leaving less variation with which to estimate the effect of X on Y

dag <- dagify(
    Y ~ X,
    X ~ Z
)

ggdag(dag) +
    theme_dag()

f <- function(n=100,bZX=1,bXY=1) {
    Z <- rnorm(n)
    X <- rnorm(n, bZX*Z )              # Z only influences X
    Y <- rnorm(n, bXY*X )
    bX <- coef( lm(Y ~ X) )['X']       # correct: X alone
    bXZ <- coef( lm(Y ~ X + Z) )['X']  # unbiased but less precise
    return( c(bX,bXZ) )
}

sim <- mcreplicate( 1e4 , f(n=50), mc.cores = 1 )
density_1 <- density(sim[1,])
density_2 <- density(sim[2,])
plot(density_1, xlim=c(-1,0.8), ylim=c(0,2.7), xlab = "posterior mean",
     main = "red = less precise (stratified by Z), black = correct")
lines(density_2, col=2)

  • bias amplification

    • X and Y confounded by u

    • adding Z biases your answer because it “double” activates the confound: it removes the benign variation in X, so what remains is dominated by u

dag <- dagify(
    Y ~ X + u,
    X ~ Z + u
)

ggdag(dag) +
    theme_dag()

f <- function(n=100,bZX=1,bXY=1) {
    Z <- rnorm(n)
    u <- rnorm(n)
    X <- rnorm(n, bZX*Z + u )
    Y <- rnorm(n, bXY*X + u )
    bX <- coef( lm(Y ~ X) )['X']
    bXZ <- coef( lm(Y ~ X + Z) )['X']
    return( c(bX,bXZ) )
}

# true value zero
sim <- mcreplicate( 1e4 , f(bXY=0), mc.cores = 1)
density_1 <- density(sim[1,])
density_2 <- density(sim[2,])
plot(density_1, xlim=c(0,1), ylim=c(0,6), xlab = "posterior mean",
     main = "red = more biased, black = biased")
lines(density_2, col=2)

  • covariation in X & Y requires variation in their causes

  • within each level of Z, there is less variation in X

  • additionally, the confound u becomes relatively more important within each level of Z

  • double biasing

# draw each line with a white outline so overlapping lines stay visible
abline_w <- function(...,col=1,lwd=1,dlwd=2) {
    abline(...,col="white",lwd=lwd+dlwd)
    abline(...,col=col,lwd=lwd)
}

n <- 1000
Z <- rbern(n)
u <- rnorm(n)
X <- rnorm(n, 7*Z + u )
Y <- rnorm(n, 0*X + u )

cols <- c( col.alpha(2,0.5) , col.alpha(4,0.5) )
plot( X , Y , col=cols[Z+1] , lwd=2 )
abline_w( lm(Y~X) , lwd=3 )                      # overall fit
abline_w( lm(Y[Z==1]~X[Z==1]) , lwd=3 , col=4 )  # within Z = 1
abline_w( lm(Y[Z==0]~X[Z==0]) , lwd=3 , col=2 )  # within Z = 0

  • the damage increases once we stratify by Z (red and blue lines)

Summary

  • adding control variables can be worse than omitting them

  • there are good controls: use the backdoor criterion to find them

  • make assumptions explicit

Bonus - Table 2 Fallacy

  • not all coefficients are causal effects

  • a statistical model designed to identify the effect of X -> Y will not, in general, also identify the effects of the control variables

  • \(Y_i \sim \text{Normal}(\mu_i, \sigma)\)

  • \(\mu_i = \alpha + \beta_X X_i + \beta_S S_i + \beta_A A_i\)

  • think through DAG for each control variable to see what the coefficient actually means

  • no interpretation without causal representation
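  • a minimal simulation sketch of the fallacy (my own, assuming the hypothetical DAG A -> S -> X -> Y with A -> X, A -> Y, S -> Y and every direct effect set to 1): the same regression returns the total effect of X but only the direct effect of S

# hypothetical sketch: one regression, a different estimand per coefficient
set.seed(5)
n <- 1e4
A <- rnorm(n)
S <- rnorm(n, A)
X <- rnorm(n, S + A)
Y <- rnorm(n, X + S + A)

b <- coef(lm(Y ~ X + S + A))
b['X']                    # about 1: total (= direct) effect of X
b['S']                    # about 1: only the DIRECT effect of S
coef(lm(Y ~ S + A))['S']  # about 2: the TOTAL effect of S (1 direct + 1 via X)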

TODO

  • read Table 2 Fallacy paper
  • what are other types of causal inference that are not multiple regression?