Causal Diagram

Model Averaging for Causal Effect with Given Causal Diagrams

Recently I read an interesting paper [1] by a group of epidemiologists and biostatisticians on combining estimates of a treatment effect from several regression models that are all generated by a single causal diagram.

  • Their philosophy is that an a priori strategy of model averaging provides a means of integrating uncertainty in the selection among candidate causal models, while also avoiding the temptation to report the most attractive estimate from a suite of equally valid alternatives.
  • Directed acyclic graphs (DAGs), built from background causal and substantive knowledge, are a useful tool for specifying one or more subsets of adjustment variables from which to obtain the corresponding causal effect estimates. The authors note that a growing body of research supports the DAG as the first (and sometimes last) step in etiologic disease modelling. In many cases, however, a DAG will support multiple sufficient or minimally sufficient adjustment sets. Even though all of these may theoretically produce unbiased effect estimates, in practice they may yield somewhat different values, and the need to select among these models once again makes the research enterprise vulnerable to wish bias (bias toward the result the researcher hoped for a priori).
  • The goal of their averaging is to base inference on the evidence from multiple regression models rather than on a single selected regression model. The coefficient of interest in each regression model is estimated with a (generalized) linear model.
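
In generic notation (the symbols here are introduced for illustration and are not taken from the paper), if \(K\) candidate regression models give estimates \(\hat\beta_1, \dots, \hat\beta_K\) of the same coefficient of interest, a model-averaged estimate is a weighted combination of them:

\[
\hat\beta_{\mathrm{MA}} = \sum_{k=1}^{K} w_k \, \hat\beta_k, \qquad w_k \ge 0, \qquad \sum_{k=1}^{K} w_k = 1 .
\]

Selecting a single model is the special case in which one \(w_k\) equals 1 and the rest are 0.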

What did they do?

  • They focus on estimating the impact of some exposure (\(E\)) on an outcome (\(O\); for example some disease, \(D\)), for which a DAG is developed to characterize the subject-matter knowledge about potential confounders [2].

  • They assume that there are no important effect measure modifiers [3] of this relationship, and that the DAG is a complete and accurate reflection of the causal relations in the target population.

  • They noticed that a single DAG may support many theoretically unbiased adjustment sets. Further, many equally defensible regression models may lead to different conclusions regarding the research question of interest. Typically, a researcher will select one among multiple adjustment sets for risk modeling when reporting results. Instead, they suggest combining the results from the different adjustment sets (sets of confounders) with model averaging techniques, to obtain causal estimates based on multiple theoretically unbiased models.

  • For each adjustment set, a generalized linear regression model is fitted, and the impact of \(E\) on \(O\) (a risk ratio) is obtained from the estimated regression coefficient of \(E\).

    • They use three techniques for averaging the results across the candidate models: information-criterion weighting, inverse-variance weighting, and bootstrapping. However, these methods come with no theoretical guarantee.
  • They illustrate these approaches with a real data example from the Pregnancy, Infection, and Nutrition (PIN) study. The question of interest is the impact of body weight on the method of delivery.
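
As a rough illustration of the first two weighting schemes listed above, here is a minimal Python sketch. It assumes each candidate model has already been fitted and has produced a log risk ratio, a standard error, and an AIC. All numbers are invented, the function names are hypothetical, and the formulas are the textbook Akaike-weight and inverse-variance forms, which may differ in detail from what the paper implements. Bootstrapping is not shown.

```python
import math


def akaike_weights(aics):
    """Information-criterion weights: w_k proportional to exp(-(AIC_k - AIC_min) / 2)."""
    best = min(aics)
    raw = [math.exp(-0.5 * (a - best)) for a in aics]
    total = sum(raw)
    return [r / total for r in raw]


def inverse_variance_weights(std_errors):
    """Inverse-variance weights: w_k proportional to 1 / SE_k^2."""
    raw = [1.0 / se ** 2 for se in std_errors]
    total = sum(raw)
    return [r / total for r in raw]


def averaged_risk_ratio(log_rrs, weights):
    """Weighted mean of the log risk ratios, transformed back to the ratio scale."""
    return math.exp(sum(w * b for w, b in zip(weights, log_rrs)))


# Invented results for three models, one per adjustment set (assumption, not paper data).
log_rrs = [math.log(1.50), math.log(1.60), math.log(1.55)]
std_errors = [0.10, 0.20, 0.10]
aics = [1000.0, 1002.0, 1000.0]

w_ic = akaike_weights(aics)
w_iv = inverse_variance_weights(std_errors)
rr_ic = averaged_risk_ratio(log_rrs, w_ic)
rr_iv = averaged_risk_ratio(log_rrs, w_iv)

print("AIC weights:", [round(w, 3) for w in w_ic], "averaged RR:", round(rr_ic, 3))
print("IV weights: ", [round(w, 3) for w in w_iv], "averaged RR:", round(rr_iv, 3))
```

Averaging is done on the log scale because that is the scale on which the regression coefficient is estimated. Both schemes give less weight to the second model, one because of its worse AIC and the other because of its larger standard error.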


Comments

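In my view, this problem is not worth further study:

  1. Because the work applies model averaging to (generalized) linear models, its value lies only in applied data analysis; a real-data example of this kind could be worth adding to some papers.

  2. This is because the work does not study uncertainty in the causal relations (the DAG) itself, only how to use a given DAG. Assuming the DAG is correct, all of the regression models are (theoretically) correct, so model averaging may be unnecessary.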

From this paper I learned that the following should be settled before applying a model averaging method:

  1. the model, the candidate models, and the source of model uncertainty;
  2. the loss function used to choose the weights (or the estimates).

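I will focus on how to account for uncertainty in the causal relations within the model averaging process, that is, how to treat the causal structures themselves as the candidate models (rather than averaging over estimation methods for the causal effect), and on prediction problems rather than pure estimation problems.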

Questions to understand

  • What issues do researchers usually focus on in causal inference, i.e., what kind of relationship between a treatment (T) and an outcome (O) is worth studying?

    • Case-control studies [4]: The aim of case-control studies is to test the existence of possible risk factors of interest, and to estimate their association with the presence or absence of a disease, after adjusting for possible confounders.
    • Causal Inference:
  • How to assess these relationships?

  • What is a model and how about model uncertainty in causal learning?

  • What can model selection contribute to this line of work? Is there any research on linear model selection?

  • Causal Diagram (Need more review)

    • About DAG and adjustment set (need to think about this from a more fundamental level)

      DAG: provides a visual summary of the investigators’ beliefs about the relationships between variables of interest. This is based on a priori knowledge obtained from previous research or other relevant literature.

      Adjustment set: a subset of variables, adjustment for which removes confounding of the T-O relationship. Within a DAG, one may identify minimally sufficient adjustment sets: sets which fully adjust for confounding, but from which no element may be removed without their becoming insufficient.

      • Is a DAG a model or an algorithm (i.e., an estimation procedure)?

        A DAG leads to one (or more) adjustment sets, each of which results in a regression model where the coefficient of the treatment T corresponds to the causal impact of T on O.

      • Does an adjustment set always exist?

      • How to get the adjustment sets from a DAG?

      • How to estimate the causal effect based on an adjustment set?

        • Just by (generalized) linear model. (Need more theory)
    • What did Judea Pearl do?

    • What did Zhi Geng (耿直) do?

      Learn one or more DAGs (i.e., Bayesian networks) from observational data.

    • How can we combine causal diagrams with model averaging? (In a novel way)

      We first need models, then consider averaging them.
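
On the question above of how to get adjustment sets from a DAG, one standard graphical answer is Pearl's backdoor criterion: a set Z is sufficient if it contains no descendant of T and d-separates T from O once the arrows leaving T are removed. The sketch below is a minimal brute-force check of that criterion on a small made-up DAG. The graph, the variable names, and the function names are all hypothetical; the backdoor criterion is only one sufficient criterion, and enumerating every subset is only practical for small graphs.

```python
from itertools import combinations


def ancestors_of(parents, nodes):
    """Return `nodes` together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents.get(v, ()))
    return seen


def descendants_of(edges, node):
    """Return all strict descendants of `node`."""
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
    seen, stack = set(), [node]
    while stack:
        for c in children.get(stack.pop(), ()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen


def d_separated(edges, x, y, z):
    """True if the set z d-separates x and y (moralized ancestral graph test)."""
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    keep = ancestors_of(parents, {x, y} | set(z))
    nbrs = {v: set() for v in keep}
    for child in keep:
        ps = sorted(parents.get(child, ()))
        for p in ps:  # undirected parent-child edges
            nbrs[p].add(child)
            nbrs[child].add(p)
        for p, q in combinations(ps, 2):  # connect parents that share a child
            nbrs[p].add(q)
            nbrs[q].add(p)
    seen, stack = {x}, [x]
    while stack:  # look for a path from x to y that avoids z
        v = stack.pop()
        if v == y:
            return False
        for w in nbrs[v] - set(z) - seen:
            seen.add(w)
            stack.append(w)
    return True


def satisfies_backdoor(edges, t, o, z):
    """Check the backdoor criterion for adjustment set z relative to (t, o)."""
    z = set(z)
    if z & descendants_of(edges, t):
        return False
    no_out = [(a, b) for a, b in edges if a != t]  # drop arrows leaving t
    return d_separated(no_out, t, o, z)


# Made-up DAG: A confounds T and O through B; M mediates the effect of T on O.
edges = [("A", "T"), ("A", "B"), ("B", "O"), ("T", "M"), ("M", "O")]
covariates = ["A", "B", "M"]

sufficient = [set(z) for r in range(len(covariates) + 1)
              for z in combinations(covariates, r)
              if satisfies_backdoor(edges, "T", "O", z)]
minimal = [s for s in sufficient if not any(other < s for other in sufficient)]

print("sufficient:", sufficient)
print("minimal:   ", minimal)
```

In this toy graph both {A} and {B} are minimally sufficient, which is exactly the situation where several equally defensible regression models arise from one DAG. The mediator M is never part of a valid set because it is a descendant of T.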

Data analysis tasks [5]

| task | key characteristics and concepts | example analytical tools/methods | causal knowledge needed? | example question |
| --- | --- | --- | --- | --- |
| description | A quantitative overview of the data. The metrics of interest may range from simple descriptive statistics to complex visualization techniques. | mean ± sd, box plots, proportions, unsupervised cluster analyses, time trends, generalized regression [6] | no | What is the central tendency and spread of T-cell count, a marker of immune function, in wild spotted hyenas in Kenya? |
| prediction | Identification of a set of explanatory variables that optimize variation explained in a dependent variable, with no focus on the causal or temporal structure among the explanatory variables of interest. This task often involves use of automated procedures to maximize model fit and leverages the joint distribution of multiple variables. | tree-based techniques, recurrent neural networks, unsupervised machine learning algorithms, generalized regression [6] | some | What set of social and ecological factors explain maximum variation in T-cell count in wild spotted hyenas in Kenya? |
| association | Assessment of the unadjusted relationship between two variables of interest. This relationship may be explored within strata of a few key other variables that may influence the association of interest and can inform future causal inference studies. | Pearson or Spearman correlation coefficients, estimates from unadjusted generalized regression [6] | some | How does social connectedness correlate with T-cell count in wild spotted hyenas in Kenya? |
| causal inference | Obtain a causal (i.e., unbiased) effect of X on Y. This type of analysis requires knowledge on the causal and temporal relationship between X and Y, as well as third variables (confounders, mediators, effect modifiers, colliders) that may influence this relationship in order to control bias. | use of directed acyclic graphs to reflect the research question, followed by an appropriate analytical strategy which can involve but are not limited to generalized regression [6], inverse probability weighting, structural equation modelling, path analysis, Rubin causal inference, and G-methods | yes | Does social connectedness affect T-cell count in wild spotted hyenas in Kenya? |

Reference

  1. Hamra, Ghassan B., Jay S. Kaufman, and Anjel Vahratian. “Model averaging for improving inference from causal diagrams.” International Journal of Environmental Research and Public Health 12.8 (2015): 9391-9407.

  2. Confounding is a distortion of the association between an exposure (E) and an outcome (O) that occurs when the study groups differ with respect to other factors that influence the outcome. Unlike selection and information bias, which can be introduced by the investigator or by the subjects, confounding is a type of bias that can be adjusted for in the analysis, provided that the investigators have information on the status of study subjects with respect to potential confounding factors. See this lecture for more details. 

  3. Effect modification is distinct from confounding; it occurs when the magnitude of the effect of the primary exposure on an outcome (i.e., the association) differs depending on the level of a third variable. See this lecture for more details. 

  4. Viallefont, Valerie, Adrian E. Raftery, and Sylvia Richardson. “Variable selection and Bayesian model averaging in case-control studies.” Statistics in Medicine 20.21 (2001): 3215-3230.

  5. Laubach, Zachary M., et al. “A biologist’s guide to model selection and causal inference.” Proceedings of the Royal Society B 288.1943 (2021): 20202815. 

  6. Includes generalized regression models (e.g. linear, Poisson, negative binomial, logistic) and generalized mixed models (e.g. linear mixed models, segmented mixed models, mixed models with splines).