CAP Model for HD Diagnosis Prediction
- The CAP model is a two-covariate accelerated-failure-time model that predicts time to Huntington disease diagnosis by combining age and HTT CAG repeat length.
- It employs a biologically motivated CAP score, formulated as Age × (CAG + β2), ensuring simplicity and reliable discrimination under heavy censoring.
- The model facilitates clinical trial enrichment and sample size estimation by providing interpretable risk thresholds for patient selection.
The CAG-Age Product (CAP) model is a parametric, two-covariate accelerated-failure-time (AFT) model designed for predicting time to clinical diagnosis in Huntington disease (HD). It combines genetic information (HTT CAG repeat length) and demographic data (age at enrollment) in a biologically motivated product form, delivering a simple yet robust prognostic framework with direct applications in clinical risk stratification, trial enrichment, and study design.
1. Mathematical Formulation and Model Specification
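The CAP model specifies time to diagnosis, $T_i$, for subject $i$ as follows:
$$\log(T_i) = \beta_0 + \beta_1 \cdot \mathrm{Age}_i \cdot (\mathrm{CAG}_i + \beta_2) + \varepsilon_i$$
where:
- $T_i$: years from study enrollment to HD diagnosis,
- $\mathrm{Age}_i$: age at enrollment,
- $\mathrm{CAG}_i$: HTT CAG repeat length,
- $\varepsilon_i$: model error term.
The model introduces a "CAP score", $\mathrm{CAP}_i = \mathrm{Age}_i \times (\mathrm{CAG}_i + \beta_2)$, so that the model reduces to:
$$\log(T_i) = \beta_0 + \beta_1 \cdot \mathrm{CAP}_i + \varepsilon_i$$
Parameter interpretations:
- $\beta_0$ (intercept): expected $\log(T_i)$ when $\mathrm{CAP}_i = 0$.
- $\beta_1$ (scale): controls the compressive effect of $\mathrm{CAP}_i$ on time to diagnosis.
- $\beta_2$ (centering constant): shifts the CAG axis to set an interpretable reference value and improve model fit.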
Standard published estimates of $\beta_0$, $\beta_1$, and $\beta_2$ are those reported by Zhang et al. (2011).
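The CAP score and the resulting point prediction can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the parameter values are placeholders chosen only to give plausible magnitudes; they should be replaced with the published estimates. The median formula assumes an error term that is symmetric about zero (as with a logistic error), so that the median of $\log T$ equals the linear predictor.

```python
import math


def cap_score(age: float, cag: float, beta2: float) -> float:
    """CAP = Age * (CAG + beta2), the product form used by the model."""
    return age * (cag + beta2)


def predicted_median_years(age: float, cag: float,
                           beta0: float, beta1: float, beta2: float) -> float:
    """Median time to diagnosis, exp(beta0 + beta1 * CAP).

    Assumes the error term is symmetric about zero, so the median of
    log(T) is the linear predictor beta0 + beta1 * CAP.
    """
    return math.exp(beta0 + beta1 * cap_score(age, cag, beta2))


# Placeholder parameters for illustration only (assumption, not quoted
# from the source text): substitute the published estimates before use.
BETA0, BETA1, BETA2 = 4.4, -0.0065, -33.66

if __name__ == "__main__":
    for age in (30, 40, 50):
        cap = cap_score(age, 45, BETA2)
        med = predicted_median_years(age, 45, BETA0, BETA1, BETA2)
        print(f"age={age} CAG=45 CAP={cap:.1f} median years={med:.2f}")
```

With a negative $\beta_1$, a larger CAP score shortens the predicted time to diagnosis, which is the "compressive" effect described in the parameter list.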
2. Theoretical Underpinnings and Assumptions
The CAP model is developed under several key assumptions:
- AFT structure: Covariates act multiplicatively on the time scale, which is suitable for progressive disorders like HD.
- Parametric error: Uses a logistic error term, yielding a log-logistic distribution for $T_i$ given the covariates.
- Independent censoring: Censoring is conditionally independent of diagnostic time, given age and CAG.
- Model selection: Zhang et al. (2011) conducted model comparisons across main effects and interaction models using AIC and prediction error, with the log-logistic AFT interaction form (the CAP model) emerging as optimal.
This model’s simplicity contrasts with multivariate approaches while preserving interpretability and biological relevance.
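The link between the logistic error and the log-logistic form can be written out explicitly. As a sketch, assume $\varepsilon_i = \sigma W_i$, where $W_i$ is standard logistic and $\sigma > 0$ is a scale parameter (the symbol $\sigma$ is introduced here for illustration and is not part of the specification above), and write $\mu_i = \beta_0 + \beta_1 \cdot \mathrm{CAP}_i$:

$$
\begin{aligned}
P(T_i > t \mid \mathrm{CAP}_i) &= P\!\left(W_i > \frac{\log t - \mu_i}{\sigma}\right) \\
&= \frac{1}{1 + \exp\!\left(\dfrac{\log t - \mu_i}{\sigma}\right)} \\
&= \frac{1}{1 + \left(t / e^{\mu_i}\right)^{1/\sigma}}
\end{aligned}
$$

This is the survival function of a log-logistic distribution with median $e^{\mu_i}$. Under this assumption, a one-unit increase in the CAP score multiplies the median time to diagnosis by $e^{\beta_1}$, which is the multiplicative time-scale effect named in the AFT assumption.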
3. Validation and Performance Metrics under Censoring
Validation of the CAP model leverages external ENROLL-HD data and metrics robust to high right-censoring (approx. 78%). Core metrics include:
3.1 Uno’s C-statistic
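Adjusted for censoring, Uno’s C measures the probability that among comparable subject pairs, the one diagnosed sooner had a higher risk score:
$$\hat{C}_\tau = \frac{\sum_{i}\sum_{j} \Delta_i\, \hat{G}(X_i)^{-2}\, I(X_i < X_j,\ X_i < \tau)\, I(\hat{\eta}_i > \hat{\eta}_j)}{\sum_{i}\sum_{j} \Delta_i\, \hat{G}(X_i)^{-2}\, I(X_i < X_j,\ X_i < \tau)}$$
where $X_i$ is the observed follow-up time, $\Delta_i$ is the diagnosis indicator, $\tau$ is the truncation time, $\hat{G}$ is the Kaplan–Meier survival estimate for censoring, and $\hat{\eta}_i$ is the fitted risk score.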
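A direct, unoptimized computation of Uno's C might look like the sketch below. It assumes right-censored data given as parallel lists of follow-up times, event indicators (1 = diagnosed, 0 = censored), and risk scores; the function names are hypothetical. The censoring survival function is evaluated just before each event time, a common convention for handling censorings tied with an event, and pairs with tied scores receive no credit.

```python
def censoring_km(times, events):
    """Kaplan-Meier estimate for the censoring distribution.

    Censored observations (event == 0) are treated as the 'events'.
    Returns a function g_minus(t) = estimated P(censoring time >= t),
    i.e. the left limit of the censoring survival curve at t.
    """
    steps = []
    for s in sorted({t for t, e in zip(times, events) if e == 0}):
        at_risk = sum(1 for t in times if t >= s)
        n_cens = sum(1 for t, e in zip(times, events) if t == s and e == 0)
        steps.append((s, 1.0 - n_cens / at_risk))

    def g_minus(t):
        g = 1.0
        for s, factor in steps:
            if s < t:
                g *= factor
        return g

    return g_minus


def uno_c(times, events, scores, tau):
    """Uno's C: inverse-probability-of-censoring weighted concordance.

    A pair (i, j) is comparable when subject i is diagnosed before tau
    and before subject j's follow-up time; each pair is weighted by
    1 / G(X_i)^2.
    """
    g_minus = censoring_km(times, events)
    num = den = 0.0
    for i in range(len(times)):
        if events[i] != 1 or times[i] >= tau:
            continue
        weight = g_minus(times[i]) ** -2
        for j in range(len(times)):
            if times[i] < times[j]:
                den += weight
                if scores[i] > scores[j]:
                    num += weight
    return num / den if den > 0 else float("nan")
```

The double loop is quadratic in the number of subjects, which is acceptable for a sketch but would normally be replaced by a vectorized or library implementation on registry-scale data.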
3.2 Time-Dependent ROC and AUC
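For each time $t$, time-dependent TPR and FPR are defined as functions of the risk score threshold $c$, for a subject with fitted risk score $\hat{\eta}$ and diagnosis time $T$:
$$\mathrm{TPR}_t(c) = P(\hat{\eta} > c \mid T \le t), \qquad \mathrm{FPR}_t(c) = P(\hat{\eta} > c \mid T > t)$$
The ROC curve at $t$ is then the locus of points in $(\mathrm{FPR}_t(c), \mathrm{TPR}_t(c))$ space as $c$ varies, with the corresponding AUC quantifying discriminative accuracy.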
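The construction of the ROC curve at a fixed time can be sketched as follows. This is a simplified illustration with hypothetical function names: it assumes each subject's case status at the chosen time is already known (diagnosed by then, or still undiagnosed), and it accepts optional per-subject weights so that censoring-adjusted weights could be plugged in. Computing those weights is outside the sketch.

```python
def roc_points(scores, is_case, weights=None):
    """Return (FPR, TPR) points obtained by sweeping the score threshold.

    is_case[i] is True when subject i is diagnosed by the chosen time.
    weights are optional per-subject weights (assumed precomputed, e.g.
    inverse-probability-of-censoring weights); default is 1 for everyone.
    A subject counts as test-positive when score >= c, which matches
    'score > c' for a threshold placed just below c.
    """
    if weights is None:
        weights = [1.0] * len(scores)
    case_total = sum(w for w, d in zip(weights, is_case) if d)
    ctrl_total = sum(w for w, d in zip(weights, is_case) if not d)
    points = [(0.0, 0.0)]
    for c in sorted(set(scores), reverse=True):
        tp = sum(w for s, d, w in zip(scores, is_case, weights) if d and s >= c)
        fp = sum(w for s, d, w in zip(scores, is_case, weights) if not d and s >= c)
        points.append((fp / ctrl_total, tp / case_total))
    return points


def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

With unit weights this reduces to the ordinary empirical ROC curve, so any censoring adjustment enters only through the weights.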
4. Comparative Model Performance
Performance on the ENROLL-HD dataset was benchmarked with the censoring-adjusted Uno’s C and time-dependent AUC (at 1 year, 3 years, and globally):
| Model | Uno's C (yr) | Uno's C (global) | AUC (yr) | AUC (global) |
|---|---|---|---|---|
| CAP | 0.88 | 0.80 | 0.84 | 0.79 |
| Langbehn | 0.87 | 0.80 | 0.83 | 0.78 |
| PIN | 0.90 | 0.84 | 0.86 | 0.82 |
| MRS | 0.91 | 0.86 | 0.88 | 0.85 |
All models exhibited strong risk stratification (random: 0.50). While the MRS model (incorporating additional covariates) achieved best-in-class accuracy, the CAP model performed closely and was more logistically feasible due to its reliance on only two routinely available predictors.
5. Clinical Trial Enrichment and Sample Size Calculations
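The CAP model admits direct application in subject selection and trial power analysis. For a randomized controlled trial (RCT) of fixed duration $\tau$, an enrichment threshold $c$ on the CAP score targets higher-risk participants. Letting
- $p_0 = P(T \le \tau \mid \mathrm{CAP} \ge c)$ (untreated event rate),
- $p_1 = (1 - \delta)\, p_0$ (treatment event rate; relative reduction $\delta$),
the required per-arm sample size under large-sample approximation is
$$n = \frac{(z_{1-\alpha/2} + z_{\mathrm{pow}})^2 \left[\, p_0 (1 - p_0) + p_1 (1 - p_1) \,\right]}{(p_0 - p_1)^2}$$
or, under the proportional-hazards/log-rank approximation:
$$n = \frac{4\,(z_{1-\alpha/2} + z_{\mathrm{pow}})^2}{(p_0 + p_1)\,(\log \mathrm{HR})^2}, \qquad \mathrm{HR} = \frac{\log(1 - p_1)}{\log(1 - p_0)}$$
where $z_{1-\alpha/2}$ and $z_{\mathrm{pow}}$ are the standard normal quantiles for the two-sided significance level $\alpha$ and the target power.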
Example (3-year trial): sample-size estimates based on event rates that were not corrected for censoring were 10–20% too small, because ignoring heavy censoring biases the estimated event rate upward.
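The two sample-size approximations named in this section can be sketched in Python. The formulas coded here are the standard two-proportion normal approximation and Schoenfeld's events formula for the log-rank test with 1:1 allocation; whether they match the source's exact expressions is an assumption. The function names and the inputs in the usage lines (a 30% relative reduction, 5% two-sided alpha, 80% power) are hypothetical, and the 37% untreated rate is taken from the 3-year row of the threshold table in Section 6.

```python
import math
from statistics import NormalDist


def _z_sum(alpha: float, power: float) -> float:
    """z_{1-alpha/2} + z_{power} from the standard normal distribution."""
    nd = NormalDist()
    return nd.inv_cdf(1.0 - alpha / 2.0) + nd.inv_cdf(power)


def per_arm_n_proportions(p0, delta, alpha=0.05, power=0.80):
    """Per-arm n from the two-proportion large-sample approximation."""
    p1 = (1.0 - delta) * p0  # event rate under treatment
    variance_sum = p0 * (1.0 - p0) + p1 * (1.0 - p1)
    n = _z_sum(alpha, power) ** 2 * variance_sum / (p0 - p1) ** 2
    return math.ceil(n)


def per_arm_n_logrank(p0, delta, alpha=0.05, power=0.80):
    """Per-arm n from Schoenfeld's log-rank approximation (1:1 allocation).

    Under proportional hazards, HR = log(1 - p1) / log(1 - p0); the
    required number of events is 4 * z^2 / log(HR)^2, and dividing by
    (p0 + p1) converts events into subjects per arm.
    """
    p1 = (1.0 - delta) * p0
    log_hr = math.log(math.log(1.0 - p1) / math.log(1.0 - p0))
    events = 4.0 * _z_sum(alpha, power) ** 2 / log_hr ** 2
    return math.ceil(events / (p0 + p1))


if __name__ == "__main__":
    # Hypothetical design inputs, for illustration only.
    print(per_arm_n_proportions(p0=0.37, delta=0.30))
    print(per_arm_n_logrank(p0=0.37, delta=0.30))
```

Because both formulas depend strongly on $p_0$, an upward-biased event rate yields a sample size that is too small, which is the mechanism behind the underestimation noted above.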
6. Implementation, Parameterization, and Risk Thresholding
Software and Fitting
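CAP can be fitted in R with the survival package, holding beta2 at a fixed numeric value and wrapping the product in I() so that it is evaluated arithmetically rather than expanded as a formula interaction:

    library(survival)
    survreg(Surv(time, event) ~ I(Age * (CAG + beta2)), dist = "loglogistic")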
Recommended CAP Score Thresholds
| Trial Horizon | CAP Score Threshold | Diagnosis Rate |
|---|---|---|
| 2 years | 370.2 | 32% |
| 3 years | 370.2 | 37% |
| 4 years | 370.2 | 40% |
| 5 years | 368.5 | 41% |
These cutoffs are derived from ROC/Youden analysis on censoring-adjusted ENROLL-HD data.
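A Youden-type cutoff search can be sketched as below. This is an illustrative outline only: it assumes case status at the trial horizon is known for each subject, takes optional weights as a stand-in for a censoring adjustment, and uses a hypothetical function name. It does not reproduce the analysis that produced the table above.

```python
def youden_threshold(scores, is_case, weights=None):
    """Return (cutoff, J) maximizing Youden's J = TPR - FPR.

    Subjects with score >= cutoff are classified as high risk.
    weights are optional per-subject weights (assumed precomputed).
    Ties in J are resolved in favour of the lowest cutoff.
    """
    if weights is None:
        weights = [1.0] * len(scores)
    case_total = sum(w for w, d in zip(weights, is_case) if d)
    ctrl_total = sum(w for w, d in zip(weights, is_case) if not d)
    best_cutoff, best_j = None, float("-inf")
    for c in sorted(set(scores)):
        tpr = sum(w for s, d, w in zip(scores, is_case, weights)
                  if d and s >= c) / case_total
        fpr = sum(w for s, d, w in zip(scores, is_case, weights)
                  if not d and s >= c) / ctrl_total
        if tpr - fpr > best_j:
            best_cutoff, best_j = c, tpr - fpr
    return best_cutoff, best_j
```

Because candidate cutoffs are limited to observed scores, the selected value depends on the sample; in practice a cutoff would be checked on held-out data before being used for enrollment.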
Practical Notes
Using standard cutoffs derived from uncensored data (e.g., PREDICT-HD) for HD clinical trials leads to underpowered studies due to unadjusted upward bias in observed event rates under high censoring.
7. Context, Advantages, and Limitations
The CAP model’s main strengths lie in its parsimony, interpretability, and biological rationale. With only two required covariates, it is feasible in contexts lacking broader phenotypic or biomarker surveillance. It performs robustly when properly validated with metrics that adjust for heavy right-censoring. Its main limitation, reflected in comparative benchmarks, is a modest performance deficit against more complex models (e.g., PIN, MRS), which may be preferred if additional covariates and resources are available. Nonetheless, its logistical simplicity makes it suitable for preventive trials and clinical contexts requiring scalable, interpretable, and externally validated risk prediction using minimal, routinely collected data.