Semiparametric zero-inflated modeling in multi-ethnic study of atherosclerosis (MESA)

Published 1 Oct 2012 in stat.AP | (1210.0300v1)

Abstract: We analyze the Agatston score of coronary artery calcium (CAC) from the Multi-Ethnic Study of Atherosclerosis (MESA) using the semiparametric zero-inflated modeling approach, where the observed CAC scores from this cohort consist of high frequency of zeroes and continuously distributed positive values. Both partially constrained and unconstrained models are considered to investigate the underlying biological processes of CAC development from zero to positive, and from small amount to large amount. Different from existing studies, a model selection procedure based on likelihood cross-validation is adopted to identify the optimal model, which is justified by comparative Monte Carlo studies. A shrinkaged version of cubic regression spline is used for model estimation and variable selection simultaneously. When applying the proposed methods to the MESA data analysis, we show that the two biological mechanisms influencing the initiation of CAC and the magnitude of CAC when it is positive are better characterized by an unconstrained zero-inflated normal model. Our results are significantly different from those in published studies, and may provide further insights into the biological mechanisms underlying CAC development in humans. This highly flexible statistical framework can be applied to zero-inflated data analyses in other areas.

Citations (10)

Summary

  • The paper develops a flexible semiparametric zero-inflated normal model using penalized splines that identify distinct effects on CAC initiation and progression.
  • Likelihood-based cross-validation shows that unconstrained models outperform models with proportional constraints in predicting coronary artery calcium.
  • The analysis of MESA data reveals that nonlinear effects of covariates such as age and systolic blood pressure are crucial, while diastolic blood pressure shows no significant influence.

Semiparametric Zero-Inflated Modeling of Coronary Artery Calcium in MESA

Introduction and Motivation

This paper addresses the statistical challenges associated with analyzing Agatston scores of coronary artery calcium (CAC) arising from the Multi-Ethnic Study of Atherosclerosis (MESA). The essential challenge is the zero-inflated nature of the CAC data: a significant portion of the population has undetectable CAC (zero values), while the remainder have positive, continuous scores. Traditional parametric models that assume linearity and absence of zero-inflation are not well suited for such data, necessitating the use of advanced modeling approaches that capture both the zero-inflation and complex, potentially nonlinear covariate effects.

The authors develop a flexible semiparametric zero-inflated normal (ZIN) modeling framework that represents nonlinear relationships and performs variable selection via penalized splines with shrinkage. This formulation enables formal testing of whether covariates act proportionally on both the risk of having positive CAC and the quantity of CAC when present. Model selection uses likelihood-based cross-validation rather than the more rigid and less empirically justified selection metrics relied on previously.

Model Formulation

The semiparametric ZIN model assumes that the observed response $Y$ (log-plus-one transformed CAC) is generated as a two-part process: with probability $1-p$, $Y=0$ (i.e., no detectable CAC), and with probability $p$, $Y$ is drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$. The parameterization is as follows:

  • The zero-inflation component models $p$ via a link function (e.g., logit or probit) as a sum of parametric effects (e.g., sex, ethnicity, diabetes, smoking) and nonparametric smooth functions of continuous covariates (e.g., age, BMI, blood pressure, cholesterol).
  • The positive response component similarly models $\mu$ with its own set of parametric and smooth functions.
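
As a compact restatement of the two components above, the model can be written as follows. The symbols $g$ (link function), $\mathbf{x}$ (parametric covariates), $z_j$ (continuous covariates), $\boldsymbol\beta$, $\boldsymbol\gamma$, $s_j$, and $t_j$ are notation assumed here for illustration and may differ from the paper's own.

$$
f(y) = (1-p)\, I(y = 0) + p\, \frac{1}{\sigma}\,\phi\!\left(\frac{y-\mu}{\sigma}\right) I(y \neq 0)
$$

$$
g(p) = \mathbf{x}^\top \boldsymbol\beta + \sum_j s_j(z_j), \qquad \mu = \mathbf{x}^\top \boldsymbol\gamma + \sum_j t_j(z_j)
$$

Here $\phi$ is the standard normal density and $I(\cdot)$ is the indicator function, so a zero contributes $1-p$ to the likelihood and a positive value contributes $p$ times a normal density.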

A central investigation is whether the effects of certain covariates are proportionally related in the two parts. The framework allows for unconstrained (completely independent), fully constrained (all effects proportional), or partially constrained (some effects proportional, others not) specifications.
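
One way to make the three specifications concrete is sketched below. Write $s_j$ for the smooth effect of covariate $j$ on the link-scale probability of positive CAC and $t_j$ for its smooth effect on the mean of the positive part. The scalar $\tau_j$ and the index set $C$ are assumed notation for illustration; the paper's exact form of the constraint may differ.

$$
t_j(z) = \tau_j\, s_j(z) \quad \text{for } j \in C
$$

Under this sketch, the unconstrained model has $C = \emptyset$ (every $s_j$ and $t_j$ estimated freely), the fully constrained model has $C$ equal to all covariates, and a partially constrained model has $C$ equal to a proper, nonempty subset.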

Estimation is performed with penalized likelihood, using cubic regression splines augmented with shrinkage penalties for simultaneous variable selection and effect estimation. This inclusion is critical for high-dimensional covariate spaces typical of epidemiological data.
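
The penalized-likelihood objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the generic penalty matrix `penalty_matrix`, and the single smoothing parameter `lam` are assumptions, and constructing the actual shrinkage spline basis and penalty is not shown.

```python
import math


def zin_loglik(y, p, mu, sigma):
    """Log-likelihood of one observation under a zero-inflated normal model.

    p is the probability of a positive response; a zero has probability 1 - p.
    """
    if y == 0:
        return math.log(1.0 - p)
    z = (y - mu) / sigma
    log_normal_pdf = -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    return math.log(p) + log_normal_pdf


def quadratic_penalty(coefs, penalty_matrix):
    """Return coefs' S coefs for a square penalty matrix S (list of lists)."""
    total = 0.0
    for i, ci in enumerate(coefs):
        for j, cj in enumerate(coefs):
            total += ci * penalty_matrix[i][j] * cj
    return total


def penalized_loglik(ys, ps, mus, sigma, coefs, penalty_matrix, lam):
    """Sum of ZIN log-likelihood terms minus (lam / 2) * coefs' S coefs.

    ps and mus are the fitted probability and mean for each observation;
    coefs stands in for the spline coefficients being penalized (assumed).
    """
    fit = sum(zin_loglik(y, p, mu, sigma) for y, p, mu in zip(ys, ps, mus))
    return fit - 0.5 * lam * quadratic_penalty(coefs, penalty_matrix)
```

Larger values of `lam` pull the penalized coefficients toward zero, which is the mechanism by which a shrinkage penalty can remove a covariate's effect altogether.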

Model Selection and Evaluation

Given the complexity of partially constrained models and their potential biological interpretations, model selection is non-trivial. The authors adopt Monte Carlo cross-validation using out-of-sample log-likelihood as the primary criterion, arguing for its robustness and interpretability in the context of mixture distributions, supported by simulation studies. Competing metrics such as AUC and MSE (and bias-corrected MSE) are also considered but found less reliable for the overall zero-inflated context.
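
The selection criterion can be illustrated with a small Monte Carlo cross-validation loop that scores candidate models by their mean held-out log-likelihood. This is a sketch under stated assumptions: the candidates are covariate-free ZIN models with closed-form fits, and the split fraction, number of splits, and all function names are hypothetical choices, not the settings used in the paper.

```python
import math
import random


def zin_loglik(y, p, mu, sigma):
    """Log-likelihood of one observation under a zero-inflated normal model."""
    if y == 0:
        return math.log(1.0 - p)
    z = (y - mu) / sigma
    return math.log(p) - 0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def _clip_p(p):
    # Keep p away from 0 and 1 so held-out log-likelihoods stay finite.
    return min(max(p, 1e-6), 1.0 - 1e-6)


def fit_intercept_only(train):
    """Candidate A: closed-form MLE of (p, mu, sigma) without covariates."""
    positives = [y for y in train if y != 0]
    if not positives:
        return _clip_p(0.0), 0.0, 1.0
    mu = sum(positives) / len(positives)
    var = sum((y - mu) ** 2 for y in positives) / len(positives)
    return _clip_p(len(positives) / len(train)), mu, max(math.sqrt(var), 1e-6)


def fit_mean_fixed_at_zero(train):
    """Candidate B: same model, but with mu constrained to 0 (misspecified)."""
    positives = [y for y in train if y != 0]
    if not positives:
        return _clip_p(0.0), 0.0, 1.0
    var = sum(y * y for y in positives) / len(positives)
    return _clip_p(len(positives) / len(train)), 0.0, max(math.sqrt(var), 1e-6)


def mc_cv_loglik(ys, fit, n_splits=50, test_frac=0.2, seed=0):
    """Average per-observation held-out log-likelihood over random splits."""
    rng = random.Random(seed)
    n_test = max(1, int(round(test_frac * len(ys))))
    scores = []
    for _ in range(n_splits):
        idx = list(range(len(ys)))
        rng.shuffle(idx)
        test = [ys[i] for i in idx[:n_test]]
        train = [ys[i] for i in idx[n_test:]]
        p, mu, sigma = fit(train)
        scores.append(sum(zin_loglik(y, p, mu, sigma) for y in test) / len(test))
    return sum(scores) / len(scores)


def simulate_zin(n, p, mu, sigma, seed=0):
    """Draw n values: 0 with probability 1 - p, else Normal(mu, sigma)."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) if rng.random() < p else 0.0 for _ in range(n)]
```

For example, with `ys = simulate_zin(400, 0.5, 4.0, 1.5, seed=1)`, comparing `mc_cv_loglik(ys, fit_intercept_only)` against `mc_cv_loglik(ys, fit_mean_fixed_at_zero)` favors the candidate with the higher score; using the same `seed` for both calls evaluates them on identical splits.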

Simulation studies confirm the high power (>90% for moderate to large $n$) of cross-validated likelihood to correctly identify the true model structure, outperforming alternative criteria especially as sample size grows.

Application to MESA

The method is applied to MESA data ($n=6672$), using $\log(\mathrm{CAC}+1)$ as the response and a comprehensive set of demographic and clinical predictors. Both parametric (linear) and nonparametric (smoothed) effects are allowed, with partially constrained models included to test for shared biological mechanisms influencing both the presence and magnitude of CAC.

Key findings include:

  • The unconstrained semiparametric ZIN model exhibits superior prediction performance over all considered constrained or partially constrained models, as quantified by cross-validated likelihood. The addition of partial proportional constraints on age or systolic blood pressure (SBP) does not improve out-of-sample fit.
  • The statistical evidence rejects the proportionality assumption between the covariate effects on the zero-inflation and mean response parts, contradicting earlier MESA studies which favored constrained models.
  • Nonlinear effects are prominent: age and SBP show marked nonlinearities, notably in their effects on CAC progression.
  • LDL is associated with the presence of CAC but not with its magnitude, a result only attainable via the shrinkage spline approach.
  • Diastolic blood pressure (DBP) is not a significant predictor for either process, its effect being shrunk to zero.

Theoretical and Practical Implications

Practically, the results have critical implications for risk stratification and understanding pathophysiological mechanisms: the factors that initiate CAC are statistically and mechanistically distinct from those driving its progression. All evidence supports modeling them separately to avoid misspecification errors.

Theoretically, the proposed flexible semiparametric zero-inflated modeling framework, with penalized spline-based variable selection and robust, likelihood-based cross-validation for model selection, is readily applicable to a wide variety of biomedical and environmental zero-inflated outcome settings. Its flexibility enables joint estimation of nonlinear effects and parsimonious removal of irrelevant predictors, which is crucial when working with high-dimensional data and complex biological interrelationships.

The rejection of proportional mechanisms in CAC development suggests that biomarker-driven process modeling in cardiovascular epidemiology (and by extension, other domains with similar data structures) must accommodate heterogeneous pathways for disease initiation and progression.

Future Directions

Open methodological directions include theoretical exploration of the penalized spline shrinkage properties in mixture models and extension to generalized link or random effect structures to accommodate longitudinal or clustered designs. There are also opportunities to investigate causal inference within this semiparametric zero-inflated framework, especially in the presence of complex mediation and interaction effects.

Conclusion

Semiparametric zero-inflated modeling with flexible partial constraints and rigorous likelihood-based cross-validation provides a statistically robust and biologically informative framework for analyzing zero-inflated, high-dimensional data. Application to MESA data demonstrates strong evidence for distinct biological processes underlying coronary artery calcium initiation and progression, recommending against proportionality assumptions implicit in many previously reported models. This approach is widely extensible to other zero-inflated settings requiring nuanced, nonparametric effect estimation and principled model selection.
