Lecture 02 - The Garden of Forking Data

Author

Isabella C. Richmond

Published

January 23, 2023

Rose / Thorn

Rose: understanding the posterior predictive this time

Thorn: what if you don’t know your misclassification rate

Globe Example

estimand: proportion of the globe covered in water - do people know what an estimand is?

  • a research/scientific question

estimator: statistical way of producing an estimate

  • generative model tests your estimator

Generative Model (globe)

  • p = proportion of water
  • W = water observations
  • N = number of tosses
  • L = land observations
  • N influences W and L because more tosses mean higher counts of each
  • start with how the variables influence each other, i.e., causal model
    • start conceptually, with science
    • intervention on N is also an intervention on W and L
library(dagitty)
library(rethinking)
d <- dagitty("dag {
                p -> W
                p -> L
                N -> W 
                N -> L }")

drawdag(d)

\[ W,L = f(p,N) \]

  • W and L are functions of p and N
  • DAGs are not generative themselves, but they encode the causal model, which lets us build generative models
    • each arrow represents a functional relationship
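A minimal sketch of that functional relationship (sim_counts is a hypothetical helper, and the binomial form is my assumption, not the lecture's code):

# W, L = f(p, N) as a simple simulator:
# W is a binomial count of water observations, L is the remainder
sim_counts <- function(p = 0.7, N = 9) {
  W <- rbinom(1, size = N, prob = p)
  c(W = W, L = N - W)
}

sim_counts()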

Bayesian Data Analysis

For each possible explanation of the sample,

  1. count all the ways the sample could happen (given a particular explanation)

  2. explanations with more ways to produce the sample are more plausible
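A worked miniature (using the four-sided globe from the code below): for the sample W L W, the conjecture “one W side, three L sides” (p = 0.25) yields 1 × 3 × 1 = 3 ways, while “three W sides, one L side” (p = 0.75) yields 3 × 1 × 3 = 9 ways, so p = 0.75 is three times more plausible.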

Garden of Forking Data

  • relies on samples being independent

  • relative differences between probabilities depend on sample size (differences are smaller with smaller samples because there is less evidence)

  • normalizing to probability allows for interpretability and easier math

  • collection of probabilities is a posterior distribution

sample <- c("W", "L", "W", "W", "W", "L", "W", "L", "W")

W <- sum(sample == "W") # number of W observations
L <- sum(sample == "L") # number of L observations
p <- c(0, 0.25, 0.5, 0.75, 1) # conjectured proportions of water
# count the ways each conjecture could produce the sample,
# treating the globe as 4-sided: q*4 W sides, (1-q)*4 L sides
ways <- sapply(p, function(q) (q*4)^W * ((1-q)*4)^L)
prob <- ways/sum(ways) # normalize counts to probabilities
cbind(p, ways, prob)
        p ways       prob
[1,] 0.00    0 0.00000000
[2,] 0.25   27 0.02129338
[3,] 0.50  512 0.40378549
[4,] 0.75  729 0.57492114
[5,] 1.00    0 0.00000000
# simulate N tosses of a globe with true proportion of water p
sim_globe <- function(p = 0.7, N = 9){
  sample(c("W", "L"), size = N, prob = c(p, 1-p), replace = TRUE)
}

sim_globe()
[1] "L" "L" "W" "W" "W" "W" "W" "L" "L"

Testing

  • have to test
  • test using extreme values where you intuitively know the right answer
  • explore different sampling designs
sim_globe(p = 1, N = 11) # extreme value test: should be all W
 [1] "W" "W" "W" "W" "W" "W" "W" "W" "W" "W" "W"
sum(sim_globe(p = 0.5, N = 1e4) == "W")/1e4 # should be close to 0.5
[1] 0.5021

Function to compute the posterior and display it with text bars:

library(crayon)

# draw a text bar of '#' characters proportional to probability q
make_bar <- function(q, size = 20) {
    n <- round(q*size)
    s1 <- concat( rep("#", n) )
    s2 <- concat( rep(" ", size-n) )
    concat(s1, s2)
}

compute_posterior <- function(the_sample, poss = c(0, 0.25, 0.5, 0.75, 1)){
  W <- sum(the_sample == "W") # number of W observed
  L <- sum(the_sample == "L") # number of L observed
  # ways each conjectured proportion could produce the sample (4-sided globe)
  ways <- sapply(poss, function(q) (q*4)^W * ((1-q)*4)^L)
  post <- ways/sum(ways) # normalize to probability
  bars <- sapply(post, function(q) make_bar(q))
  data.frame(poss, ways, post = round(post, 3), bars)
}

compute_posterior(sim_globe())
  poss ways  post                 bars
1 0.00    0 0.000                     
2 0.25   27 0.021                     
3 0.50  512 0.404 ########            
4 0.75  729 0.575 ###########         
5 1.00    0 0.000                     

Real Number Sampling

  • more possibilities = less probability in each option/outcome

    • probability is spread out across many options
  • normalizing to probability lets the same equation handle an infinite number of “sides”/outcomes

    • \[ p^W(1-p)^L \]

    • p = probability

    • density = the analogue of probability when we assess an infinite number of possibilities

  • shape of the posterior embodies the sample size

    • no minimum sample size -> a smaller sample just yields a more uncertain posterior

  • no point estimates! estimate is entire posterior distribution

    • can use summary points from post dist for communication purposes
  • intervals are merely indicators of the shape of the posterior distribution

    • there is no “true interval”, i.e., a 95% CI is not special

    • an interval just reports the lower/upper bounds of a region of the distribution
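A minimal grid sketch of the continuous case (assuming the W = 6, L = 3 sample from earlier; not the lecture's code): normalize p^W (1-p)^L over a fine grid of values of p.

# approximate the continuous posterior on a fine grid of p values
p_grid <- seq(from = 0, to = 1, length.out = 1000)
post <- p_grid^6 * (1 - p_grid)^3 # relative plausibility with W = 6, L = 3
post <- post / sum(post)          # normalize to probability

plot(p_grid, post, type = "l", xlab = "proportion water", ylab = "posterior probability")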

Analyze Sample + Summarize

# sample from the analytic posterior, Beta(W+1, L+1), with W = 6 and L = 3
post_samples <- rbeta(1e3, 6+1, 3+1)

dens(post_samples, lwd = 4, col = 2, xlab = "prop water", adj = 0.1)

# overlay the exact beta density for comparison
curve(dbeta(x, 6+1, 3+1), add = TRUE, lty = 2, lwd = 3)
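Summary points can be read straight off the samples; a sketch using rethinking's PI for a percentile interval:

mean(post_samples)            # posterior mean, one possible point summary
PI(post_samples, prob = 0.89) # 89% percentile interval (just lower/upper bounds)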

  • posterior prediction = “what would we bet?”

    • how many W’s do we expect to see in the next 10 tosses
  • for each sample from the posterior distribution, we can create a predictive distribution; combining these gives the posterior predictive

    • incorporates uncertainty from posterior distribution
post_samples <- rbeta(1e4, 6+1, 3+1)

# for each posterior sample of p, simulate 10 tosses and count the W's
pred_post <- sapply(post_samples, function(p) 
  sum(sim_globe(p, 10) == "W"))

tab_post <- table(pred_post)

# plot the posterior predictive distribution: one spike per possible count of W
plot(tab_post, lwd = 4, col = 4, xlab = "number of W in next 10 tosses", ylab = "count")
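For contrast, a sketch of what a point estimate misses (pred_point is my own name, not from the lecture): predicting from a single value of p ignores posterior uncertainty and gives a narrower distribution than the posterior predictive above.

# predictions from the posterior mean alone, for comparison
pred_point <- rbinom(1e4, size = 10, prob = mean(post_samples))
table(pred_point)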

Misclassification (Bonus Round)

  • W* (Wm in the DAG below) is a misclassified version of the true W, due to measurement error in the sampling process

    • the true W is unobserved
d <- dagitty("dag {
                p -> W
                N -> W
                W -> Wm
                M -> Wm
             }")

drawdag(d)

  • incorporate measurement error with x (error rate of 10%)
# simulate globe tosses where each observation is misrecorded with probability x
sim_globe2 <- function(p = 0.7, N = 9, x = 0.1){
  
  true_sample <- sample(c("W", "L"), size = N, prob = c(p, 1-p), replace = TRUE)
  
  # with probability x, flip the recorded category
  obs_sample <- ifelse(runif(N) < x,
                       ifelse(true_sample == "W", "L", "W"),
                       true_sample)
  
  return(obs_sample)
  
}
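A quick extreme-value test in the spirit of the testing section above (a sketch): with p = 1 every true toss is W, so any L in the output is pure measurement error, roughly x*N of the N tosses.

sim_globe2(p = 1, N = 20) # any "L" here is a misclassified "W"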
  • how do you know the error rate?

  • I don’t understand the mechanism behind incorporating x yet, but I do understand why x needs to be incorporated and the consequences of leaving it out
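For reference, the mechanism is just probability algebra: an observed W* arises either from a true W that was not flipped or from a true L that was, so the probability of observing a single W* becomes

\[ \Pr(W^*) = p(1 - x) + (1 - p)x \]

and symmetrically for L*.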

TODO