Statistical Modeling for a Case Control Study

This example demonstrates how to use cmest for a case control study. For this purpose, we simulate some data containing a continuous baseline confounder \(C_1\), a binary baseline confounder \(C_2\), a binary exposure \(A\), a binary mediator \(M\) and a binary outcome \(Y\). We sample 2000 cases out of all cases and sample 2000 controls out of all controls. The true regression models for \(A\), \(M\) and \(Y\) are: \[logit(E(A|C_1,C_2))=0.2+0.5C_1+0.1C_2\] \[logit(E(M|A,C_1,C_2))=1+2A+1.5C_1+0.8C_2\] \[logit(E(Y|A,M,C_1,C_2)))=-5+0.8A-1.8M+0.5AM+0.3C_1-0.6C_2\]

set.seed(1)
# data simulation
expit <- function(x) exp(x)/(1+exp(x))
n <- 1000000
C1 <- rnorm(n, mean = 1, sd = 0.1)
C2 <- rbinom(n, 1, 0.6)
A <- rbinom(n, 1, expit(0.2 + 0.5*C1 + 0.1*C2))
M <- rbinom(n, 1, expit(1 + 2*A + 1.5*C1 + 0.8*C2))
Y <- rbinom(n, 1, expit(-5 + 0.8*A - 1.8*M + 0.5*A*M + 0.3*C1 - 0.6*C2))
yprevalence <- sum(Y)/n
data <- data.frame(A, M, Y, C1, C2)
case_indice <- sample(which(data$Y == 1), 2000, replace = FALSE)
control_indice <- sample(which(data$Y == 0), 2000, replace = FALSE)
data <- data[c(case_indice, control_indice), ]

The DAG for this scientific setting is:

library(CMAverse)

## Registered S3 methods overwritten by 'lme4':
##   method                          from
##   cooks.distance.influence.merMod car 
##   influence.merMod                car 
##   dfbeta.influence.merMod         car 
##   dfbetas.influence.merMod        car

cmdag(outcome = "Y", exposure = "A", mediator = "M",
      basec = c("C1", "C2"), postc = NULL, node = TRUE, text_col = "white")

For a case control study, we set the casecontrol argument to be TRUE. It requires that either the prevalence of the case be known or the case be rare. We use the regression-based approach for illustration.

If the prevalence of the case is known, we specify it by the yprevalence argument. The results are:

res_yprevelence <- cmest(data = data, model = "rb", casecontrol = TRUE, yprevalence = yprevalence,
                         outcome = "Y", exposure = "A",
                         mediator = "M", basec = c("C1", "C2"), EMint = TRUE,
                         mreg = list("logistic"), yreg = "logistic",
                         astar = 0, a = 1, mval = list(1), 
                         estimation = "paramfunc", inference = "delta")

summary(res_yprevelence)

## Causal Mediation Analysis
## 
## # Outcome regression:
## 
## Call:
## svyglm(formula = Y ~ A + M + A * M + C1 + C2, design = getCall(x$reg.output$yreg)$design, 
##     family = getCall(x$reg.output$yreg)$family)
## 
## Survey design:
## survey::svydesign(~1, data = getCall(x$reg.output.summary$yreg$survey.design)$data, 
##     weights = getCall(x$reg.output.summary$yreg$survey.design)$weights)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.97446    0.39866 -12.478  < 2e-16 ***
## A            1.08507    0.39955   2.716  0.00664 ** 
## M           -1.66061    0.22032  -7.537  5.9e-14 ***
## C1           0.14732    0.34430   0.428  0.66875    
## C2          -0.61153    0.06887  -8.879  < 2e-16 ***
## A:M          0.19323    0.40922   0.472  0.63682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 0.9919739)
## 
## Number of Fisher Scoring iterations: 9
## 
## 
## # Mediator regressions: 
## 
## Call:
## svyglm(formula = M ~ A + C1 + C2, design = getCall(x$reg.output$mreg[[1L]])$design, 
##     family = getCall(x$reg.output$mreg[[1L]])$family)
## 
## Survey design:
## survey::svydesign(~1, data = getCall(x$reg.output.summary$mreg[[1L]]$survey.design)$data, 
##     weights = getCall(x$reg.output.summary$mreg[[1L]]$survey.design)$weights)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.7836     1.5384   0.509   0.6105    
## A             1.9558     0.3439   5.687 1.38e-08 ***
## C1            1.6832     1.5600   1.079   0.2807    
## C2            0.7054     0.2965   2.379   0.0174 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1.045659)
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## # Effect decomposition on the odds ratio scale for a case control study via the regression-based approach
##  
## Closed-form parameter function estimation with 
##  delta method standard errors, confidence intervals and p-values 
##  
##                Estimate Std.error  95% CIL 95% CIU    P.val    
## Rcde            3.59054   0.31733  3.01948   4.270  < 2e-16 ***
## Rpnde           3.44231   0.37396  2.78213   4.259  < 2e-16 ***
## Rtnde           3.56427   0.30685  3.01087   4.219  < 2e-16 ***
## Rpnie           0.83794   0.04473  0.75471   0.930 0.000925 ***
## Rtnie           0.86763   0.05490  0.76644   0.982 0.024823 *  
## Rte             2.98666   0.27610  2.49171   3.580  < 2e-16 ***
## ERcde           2.09752   0.25249  1.60265   2.592  < 2e-16 ***
## ERintref        0.34479   0.25148 -0.14810   0.838 0.170361    
## ERintmed       -0.29359   0.21623 -0.71740   0.130 0.174540    
## ERpnie         -0.16206   0.04473 -0.24972  -0.074 0.000291 ***
## ERcde(prop)     1.05580   0.03792  0.98148   1.130  < 2e-16 ***
## ERintref(prop)  0.17355   0.12318 -0.06788   0.415 0.158862    
## ERintmed(prop) -0.14778   0.10636 -0.35624   0.061 0.164683    
## ERpnie(prop)   -0.08157   0.02901 -0.13843  -0.025 0.004921 ** 
## pm             -0.22935   0.11082 -0.44656  -0.012 0.038494 *  
## int             0.02577   0.01922 -0.01190   0.063 0.179931    
## pe             -0.05580   0.03792 -0.13013   0.019 0.141133    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Rcde: controlled direct effect odds ratio; Rpnde: pure natural direct effect odds ratio; Rtnde: total natural direct effect odds ratio; Rpnie: pure natural indirect effect odds ratio; Rtnie: total natural indirect effect odds ratio; Rte: total effect odds ratio; ERcde: excess relative risk due to controlled direct effect; ERintref: excess relative risk due to reference interaction; ERintmed: excess relative risk due to mediated interaction; ERpnie: excess relative risk due to pure natural indirect effect; ERcde(prop): proportion ERcde; ERintref(prop): proportion ERintref; ERintmed(prop): proportion ERintmed; ERpnie(prop): proportion ERpnie; pm: overall proportion mediated; int: overall proportion attributable to interaction; pe: overall proportion eliminated)
## 
## Relevant variable values: 
## $a
## [1] 1
## 
## $astar
## [1] 0
## 
## $yval
## [1] "1"
## 
## $mval
## $mval[[1]]
## [1] 1
## 
## 
## $basecval
## $basecval[[1]]
## [1] 1.002184
## 
## $basecval[[2]]
## [1] 0.5255

If the prevalence of the case is unknown but we know the case is rare, we set the yrare argument to be TRUE. The results are:

res_yrare <- cmest(data = data, model = "rb", casecontrol = TRUE, yrare = TRUE,
                   outcome = "Y", exposure = "A",
                   mediator = "M", basec = c("C1", "C2"), EMint = TRUE,
                   mreg = list("logistic"), yreg = "logistic",
                   astar = 0, a = 1, mval = list(1),
                   estimation = "paramfunc", inference = "delta")

summary(res_yrare)

## Causal Mediation Analysis
## 
## # Outcome regression:
## 
## Call:
## glm(formula = Y ~ A + M + A * M + C1 + C2, family = binomial(), 
##     data = getCall(x$reg.output$yreg)$data, weights = getCall(x$reg.output$yreg)$weights)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.02664  -1.14853  -0.07283   1.04234   1.79068  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.87856    0.38849   2.261   0.0237 *  
## A            0.96088    0.38782   2.478   0.0132 *  
## M           -1.69239    0.21731  -7.788 6.82e-15 ***
## C1           0.07022    0.33423   0.210   0.8336    
## C2          -0.62047    0.06664  -9.310  < 2e-16 ***
## A:M          0.33475    0.39764   0.842   0.3999    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5545.2  on 3999  degrees of freedom
## Residual deviance: 5163.4  on 3994  degrees of freedom
## AIC: 5175.4
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## # Mediator regressions: 
## 
## Call:
## glm(formula = M ~ A + C1 + C2, family = binomial(), data = getCall(x$reg.output$mreg[[1L]])$data, 
##     weights = getCall(x$reg.output$mreg[[1L]])$weights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2695   0.1089   0.1460   0.2770   0.5010  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.7862     1.4552   0.540   0.5890    
## A             1.9627     0.3474   5.650  1.6e-08 ***
## C1            1.6876     1.4587   1.157   0.2473    
## C2            0.7016     0.2982   2.353   0.0186 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.89  on 1999  degrees of freedom
## Residual deviance: 406.21  on 1996  degrees of freedom
## AIC: 414.21
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## # Effect decomposition on the odds ratio scale for a case control study via the regression-based approach
##  
## Closed-form parameter function estimation with 
##  delta method standard errors, confidence intervals and p-values 
##  
##                Estimate Std.error  95% CIL 95% CIU    P.val    
## Rcde            3.65329   0.32168  3.07422   4.341  < 2e-16 ***
## Rpnde           3.40407   0.35328  2.77754   4.172  < 2e-16 ***
## Rtnde           3.60919   0.30906  3.05155   4.269  < 2e-16 ***
## Rpnie           0.83319   0.04532  0.74893   0.927 0.000794 ***
## Rtnie           0.88340   0.04962  0.79131   0.986 0.027294 *  
## Rte             3.00715   0.27763  2.50939   3.604  < 2e-16 ***
## ERcde           2.13416   0.25499  1.63438   2.634  < 2e-16 ***
## ERintref        0.26991   0.21978 -0.16085   0.701 0.219412    
## ERintmed       -0.23012   0.18887 -0.60029   0.140 0.223076    
## ERpnie         -0.16681   0.04532 -0.25564  -0.078 0.000233 ***
## ERcde(prop)     1.06328   0.03649  0.99175   1.135  < 2e-16 ***
## ERintref(prop)  0.13447   0.10691 -0.07507   0.344 0.208453    
## ERintmed(prop) -0.11465   0.09217 -0.29530   0.066 0.213549    
## ERpnie(prop)   -0.08311   0.02927 -0.14047  -0.026 0.004517 ** 
## pm             -0.19776   0.09648 -0.38686  -0.009 0.040401 *  
## int             0.01983   0.01642 -0.01235   0.052 0.227195    
## pe             -0.06328   0.03649 -0.13481   0.008 0.082916 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Rcde: controlled direct effect odds ratio; Rpnde: pure natural direct effect odds ratio; Rtnde: total natural direct effect odds ratio; Rpnie: pure natural indirect effect odds ratio; Rtnie: total natural indirect effect odds ratio; Rte: total effect odds ratio; ERcde: excess relative risk due to controlled direct effect; ERintref: excess relative risk due to reference interaction; ERintmed: excess relative risk due to mediated interaction; ERpnie: excess relative risk due to pure natural indirect effect; ERcde(prop): proportion ERcde; ERintref(prop): proportion ERintref; ERintmed(prop): proportion ERintmed; ERpnie(prop): proportion ERpnie; pm: overall proportion mediated; int: overall proportion attributable to interaction; pe: overall proportion eliminated)
## 
## Relevant variable values: 
## $a
## [1] 1
## 
## $astar
## [1] 0
## 
## $yval
## [1] "1"
## 
## $mval
## $mval[[1]]
## [1] 1
## 
## 
## $basecval
## $basecval[[1]]
## [1] 1.002184
## 
## $basecval[[2]]
## [1] 0.5255

2024-07-01