CAP Model for HD Diagnosis Prediction

Updated 10 November 2025

The CAP model is a two-covariate accelerated-failure-time model that predicts time to Huntington disease diagnosis by combining age and HTT CAG repeat length.
It employs a biologically motivated CAP score, formulated as Age × (CAG + β2), ensuring simplicity and reliable discrimination under heavy censoring.
The model facilitates clinical trial enrichment and sample size estimation by providing interpretable risk thresholds for patient selection.

The CAG-Age Product (CAP) model is a parametric, two-covariate accelerated-failure-time (AFT) model designed for predicting time to clinical diagnosis in Huntington disease (HD). It combines genetic information (HTT CAG repeat length) and demographic data (age at enrollment) in a biologically motivated product form, delivering a simple yet robust prognostic framework with direct applications in clinical risk stratification, trial enrichment, and study design.

1. Mathematical Formulation and Model Specification

The CAP model specifies time to diagnosis, $X_i$ , for subject $i$ as follows:

$\log(X_i) = \beta_0 + \beta_1\,\bigl[\mathrm{Age}_i\,(\mathrm{CAG}_i + \beta_2)\bigr] + \varepsilon_i$

where:

$X_i$ : years from study enrollment to HD diagnosis,
$\mathrm{Age}_i$ : age at enrollment,
$\mathrm{CAG}_i$ : HTT CAG repeat length,
$\varepsilon_i \sim \mathrm{Logistic}(0, \sigma)$ : model error term.

The model introduces a "CAP score" $S_i^{\rm CAP}$ : $S_i^{\rm CAP} = \mathrm{Age}_i\,(\mathrm{CAG}_i + \beta_2)$ so that the model reduces to: $\log(X_i) = \beta_0 + \beta_1\,S_i^{\rm CAP} + \varepsilon_i$

Parameter interpretations:

$\beta_0$ (intercept): expected $\log(X)$ when $S_i^{\rm CAP}=0$ .
$\beta_1$ (slope): controls how strongly $S_i^{\rm CAP}$ compresses time to diagnosis; a negative value means that higher CAP scores shorten the expected time.
$\beta_2$ (centering constant): shifts the CAG axis to set an interpretable reference value and improve model fit.

Standard published estimates are $\widehat\beta_0\approx 4.35$ , $\widehat\beta_1\approx -0.025$ , $\widehat\beta_2\approx -33.0$ .
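As a minimal sketch of how the score and the log-logistic AFT form fit together, the following Python computes the CAP score and the implied probability of diagnosis by time $t$. The function names are illustrative, and the values of `b0`, `b1`, and `s` in the usage lines are placeholders chosen only to exercise the code, not fitted estimates; only $\beta_2 = -33.0$ is taken from the text.

```python
import math


def cap_score(age: float, cag: int, beta2: float = -33.0) -> float:
    """CAP score S = Age * (CAG + beta2); beta2 = -33.0 as quoted in the text."""
    return age * (cag + beta2)


def median_time(score: float, beta0: float, beta1: float) -> float:
    """Median time to diagnosis, exp(beta0 + beta1 * S): the logistic error has median 0."""
    return math.exp(beta0 + beta1 * score)


def prob_diagnosed_by(t: float, score: float, beta0: float, beta1: float, sigma: float) -> float:
    """P(X <= t | S) under log(X) = beta0 + beta1 * S + eps, eps ~ Logistic(0, sigma)."""
    z = (math.log(t) - beta0 - beta1 * score) / sigma
    return 1.0 / (1.0 + math.exp(-z))


# Placeholder parameters for illustration only (NOT published estimates).
b0, b1, s = 4.0, -0.005, 0.5
score = cap_score(age=40, cag=42)                   # 40 * (42 - 33) = 360
t_med = median_time(score, b0, b1)
p_med = prob_diagnosed_by(t_med, score, b0, b1, s)  # 0.5 at the median, by construction
```

Passing the coefficients explicitly, rather than hard-coding them, keeps the sketch usable with whichever fitted estimates are adopted.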

2. Theoretical Underpinnings and Assumptions

The CAP model is developed under several key assumptions:

AFT structure: Covariates act multiplicatively on the time scale, which is suitable for progressive disorders like HD.
Parametric error: Uses a logistic error term, yielding a log-logistic marginal distribution for $X_i$ .
Independent censoring: Censoring is conditionally independent of time to diagnosis, given age and CAG.
Model selection: Zhang et al. (2011) conducted model comparisons across main effects and interaction models using AIC and prediction error, with the log-logistic AFT interaction form (the CAP model) emerging as optimal.

This model’s simplicity contrasts with multivariate approaches while preserving interpretability and biological relevance.
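The log-logistic form implied by the AFT structure and the logistic error can be written out explicitly. Writing $\mu_i = \beta_0 + \beta_1 S_i^{\rm CAP}$ (a shorthand introduced here for convenience), the probability of remaining undiagnosed at time $t$ follows directly from the logistic distribution function:

$\Pr(X_i > t \mid S_i^{\rm CAP}) = \Pr(\varepsilon_i > \log t - \mu_i) = \frac{1}{1 + \exp\{(\log t - \mu_i)/\sigma\}} = \frac{1}{1 + (t\,e^{-\mu_i})^{1/\sigma}}$

Setting this probability to $1/2$ gives the median time to diagnosis, $e^{\mu_i}$. More generally, a one-unit increase in the CAP score multiplies every quantile of $X_i$ by $e^{\beta_1}$, which is the sense in which covariates act multiplicatively on the time scale.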

3. Validation and Performance Metrics under Censoring

Validation of the CAP model leverages external ENROLL-HD data and metrics robust to high right-censoring (approx. 78%). Core metrics include:

3.1 Uno’s C-statistic

Adjusted for censoring, Uno’s C measures the probability that among comparable subject pairs, the one diagnosed sooner had a higher risk score:

$\widehat C_{\rm Uno}(\tau) = \frac{ \sum_{i\ne j} \Delta_i\,\widehat G(W_i)^{-2}\, \mathbf1\{W_i < W_j,\, W_i < \tau\}\, \mathbf1\{\widehat\eta_i > \widehat\eta_j\} }{ \sum_{i\ne j} \Delta_i\,\widehat G(W_i)^{-2}\, \mathbf1\{W_i < W_j,\, W_i < \tau\} }$

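where $W_i$ is the observed follow-up time (the earlier of diagnosis and censoring), $\Delta_i$ is the event indicator (1 if diagnosis was observed), $\widehat G$ is the Kaplan–Meier estimate of the censoring survival function, and $\widehat\eta_i$ is the fitted risk score (higher values indicating earlier expected diagnosis).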
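The estimator above can be sketched in plain Python as follows. This is an illustrative implementation, not the code used in the validation: the function names are hypothetical, the censoring distribution is evaluated just before each event time (a common convention, assumed here), and the quadratic loops are chosen for readability rather than speed.

```python
def censoring_km(times, events):
    """Return G(t-): Kaplan-Meier estimate of P(C >= t), treating censorings as 'events'."""
    steps = []  # (time, censoring survival just after that time)
    surv = 1.0
    for u in sorted(set(times)):
        at_risk = sum(1 for w in times if w >= u)
        censored = sum(1 for w, d in zip(times, events) if w == u and d == 0)
        surv *= 1.0 - censored / at_risk
        steps.append((u, surv))

    def g_left(t):
        g = 1.0
        for u, s_u in steps:
            if u < t:
                g = s_u
            else:
                break
        return g

    return g_left


def uno_c(times, events, risk, tau):
    """Uno's C: inverse-probability-of-censoring weighted concordance, truncated at tau."""
    g_left = censoring_km(times, events)
    num = den = 0.0
    n = len(times)
    for i in range(n):
        if events[i] != 1 or times[i] >= tau:
            continue  # only observed diagnoses before tau anchor a pair
        weight = 1.0 / g_left(times[i]) ** 2
        for j in range(n):
            if i != j and times[i] < times[j]:
                den += weight
                if risk[i] > risk[j]:
                    num += weight
    if den == 0.0:
        raise ValueError("no comparable pairs before tau")
    return num / den


# Toy data: subject 3 is censored; risk scores are perfectly ordered.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]
c_good = uno_c(times, events, risk=[4.0, 3.0, 2.0, 1.0], tau=5.0)
c_bad = uno_c(times, events, risk=[1.0, 2.0, 3.0, 4.0], tau=5.0)
```

With perfectly ordered risk scores every comparable pair is concordant, and with the order reversed none is, which gives a quick sanity check on the weighting logic.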

3.2 Time-Dependent ROC and AUC

For each time $t$, the time-dependent TPR and FPR are defined as functions of the risk score threshold $q$:

$\widehat{\rm TPR}(q, t) = \frac{\bigl\{1-\widehat S(t\mid\widehat\eta_i > q)\bigr\}\;\widehat{\Pr}(\widehat\eta_i > q)}{1-\widehat S(t)}$

$\widehat{\rm FPR}(q, t) = 1-\frac{\widehat S(t\mid\widehat\eta_i \le q)\;\times\;\widehat{\Pr}(\widehat\eta_i \le q)}{\widehat S(t)}$

The ROC curve at $t$ is then the locus in $(\widehat{\rm FPR}, \widehat{\rm TPR})$ space, with corresponding AUC quantifying discriminative accuracy.
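A minimal sketch of these two estimators, assuming $\widehat S$ is a Kaplan–Meier estimate computed on the whole sample and on the subsets above and below the threshold, might look like this. The function names are hypothetical, and the code assumes $0 < \widehat S(t) < 1$ so that both denominators are nonzero.

```python
def km_survival(times, events, t):
    """Kaplan-Meier estimate of P(X > t); returns 1.0 for an empty sample."""
    surv = 1.0
    event_times = sorted({w for w, d in zip(times, events) if d == 1 and w <= t})
    for u in event_times:
        at_risk = sum(1 for w in times if w >= u)
        diagnosed = sum(1 for w, d in zip(times, events) if w == u and d == 1)
        surv *= 1.0 - diagnosed / at_risk
    return surv


def tpr_fpr(times, events, risk, q, t):
    """Time-dependent (TPR, FPR) at horizon t for the rule 'risk score > q'."""
    n = len(times)
    high = [(w, d) for w, d, r in zip(times, events, risk) if r > q]
    low = [(w, d) for w, d, r in zip(times, events, risk) if r <= q]
    s_all = km_survival(times, events, t)
    s_high = km_survival([w for w, _ in high], [d for _, d in high], t)
    s_low = km_survival([w for w, _ in low], [d for _, d in low], t)
    tpr = (1.0 - s_high) * (len(high) / n) / (1.0 - s_all)
    fpr = 1.0 - s_low * (len(low) / n) / s_all
    return tpr, fpr


# Uncensored toy data: the estimators reduce to simple proportions.
times = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
events = [1, 1, 1, 1, 1, 1]
risk = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
tpr, fpr = tpr_fpr(times, events, risk, q=4.5, t=3.5)  # 2 of 3 cases flagged, 0 of 3 controls
```

Sweeping `q` over the observed risk scores and collecting the resulting pairs traces out the ROC curve at horizon `t`.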

4. Comparative Model Performance

Performance on the ENROLL-HD dataset was benchmarked with censoring-adjusted Uno’s $C$ and time-dependent AUC (at 1 year, 3 years, and globally):

Model	Uno's $C$ ( $\tau=1$ yr)	Uno's $C$ (global)	AUC ( $\tau=3$ yr)	AUC (global)
CAP	0.88	0.80	0.84	0.79
Langbehn	0.87	0.80	0.83	0.78
PIN	0.90	0.84	0.86	0.82
MRS	0.91	0.86	0.88	0.85

All models exhibited strong risk stratification (random: 0.50). While the MRS model (incorporating additional covariates) achieved best-in-class accuracy, the CAP model performed closely and was more logistically feasible due to its reliance on only two routinely available predictors.

5. Clinical Trial Enrichment and Sample Size Calculations

The CAP model admits direct application in subject selection and trial power analysis. For a randomized controlled trial (RCT) of fixed duration $t$ , an enrichment threshold $q^*$ on the CAP score $S_i^{\rm CAP}$ targets higher-risk participants. Letting

$\pi_0 = \Pr(X \le t \mid S_i^{\rm CAP}\ge q^*)$ (untreated event rate),
$\pi_1 = (1-\delta)\pi_0$ (treatment event rate; $\delta =$ relative reduction),

the required per-arm sample size under large-sample approximation is

$n = \frac{ \bigl[ Z_{1-\alpha/2}\sqrt{\pi_0(1-\pi_0)+\pi_1(1-\pi_1)} + Z_{1-\beta}\sqrt{\pi_0(1-\pi_0)+\pi_1(1-\pi_1)} \bigr]^2 }{ (\pi_0-\pi_1)^2 }$

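Alternatively, under the proportional-hazards/log-rank (Schoenfeld) approximation with 1:1 allocation, treating $1-\delta$ as the hazard ratio, the total number of required events $D$ and the per-arm sample size are approximately: $D = \frac{4\,(Z_{1-\alpha/2} + Z_{1-\beta})^2}{\{\ln(1-\delta)\}^2}, \quad n \approx \frac{D}{2\,\pi_0}$

Example: For $q^* \approx 370.2$ (3-year trial, $\pi_0\approx0.37$ ), $\delta=0.5$ , $\alpha=0.05$ , $1-\beta=0.8$ : $D \approx \frac{4\,(1.96+0.84)^2}{[\ln(0.5)]^2} \approx 65$ events, so $n \approx 65 / (2 \times 0.37) \approx 88$ per arm. This agrees with the two-proportion formula above, which also gives $n \approx 88$, and is close to the reported requirement of $n \approx 92$. Earlier estimates that did not adjust for censoring underestimated the required sample size by 10–20%, because uncorrected heavy censoring biases the estimated event rate upward.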
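The two-proportion formula can be checked numerically with a short Python sketch. Because both square-root terms in that formula are identical, it simplifies to $(Z_{1-\alpha/2} + Z_{1-\beta})^2\,\{\pi_0(1-\pi_0)+\pi_1(1-\pi_1)\}/(\pi_0-\pi_1)^2$, which is what the hypothetical helper below evaluates, using only the standard library.

```python
import math
from statistics import NormalDist


def per_arm_sample_size(pi0: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Unrounded per-arm n from the two-proportion large-sample approximation."""
    pi1 = (1.0 - delta) * pi0                      # event rate under treatment
    z_a = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    z_b = NormalDist().inv_cdf(power)              # quantile for the target power
    var_sum = pi0 * (1.0 - pi0) + pi1 * (1.0 - pi1)
    return (z_a + z_b) ** 2 * var_sum / (pi0 - pi1) ** 2


# Inputs from the 3-year example: pi0 = 0.37, delta = 0.5, alpha = 0.05, power = 0.8.
n_raw = per_arm_sample_size(pi0=0.37, delta=0.5)
n_per_arm = math.ceil(n_raw)  # round up to whole participants
```

For these inputs the unrounded value is just over 88, so rounding up gives 89 participants per arm. Lowering `pi0` raises the requirement sharply, which is why an upward-biased event rate leads to an underpowered trial.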

6. Implementation, Parameterization, and Risk Thresholding

Software and Fitting

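CAP can be fitted in R with the survival package. Because $\widehat\beta_2$ is conventionally fixed at $-33.0$, the CAP score enters the model as a single precomputed covariate:

    library(survival)
    fit <- survreg(Surv(time, event) ~ I(Age * (CAG - 33.0)), data = dat, dist = "loglogistic")

Here dat is assumed to be a data frame with columns time, event, Age, and CAG. Risk scores are calculated as $S_i^{\rm CAP} = \mathrm{Age}_i\,(\mathrm{CAG}_i - 33.0)$.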

Recommended CAP Score Thresholds

Trial Horizon ( $t$ )	Threshold $q^*$	Diagnosis Rate
2 years	370.2	32%
3 years	370.2	37%
4 years	370.2	40%
5 years	368.5	41%

These cutoffs are derived from ROC/Youden analysis on censoring-adjusted ENROLL-HD data.
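To illustrate how a Youden-type cutoff is chosen and then applied for enrichment, here is a small Python sketch. The ROC points and the cohort are made up for illustration, and the helper names are hypothetical; only the score definition and $\beta_2 = -33.0$ come from the text.

```python
def youden_threshold(roc_points):
    """roc_points: iterable of (q, tpr, fpr); return the q maximising J = TPR - FPR."""
    best_q, _, _ = max(roc_points, key=lambda p: p[1] - p[2])
    return best_q


def enrich(cohort, q_star, beta2=-33.0):
    """Keep participants whose CAP score Age * (CAG + beta2) is at least q_star."""
    return [p for p in cohort if p["age"] * (p["cag"] + beta2) >= q_star]


# Made-up ROC points (threshold, TPR, FPR) at a fixed horizon.
roc = [(300.0, 0.90, 0.55), (370.2, 0.78, 0.30), (450.0, 0.50, 0.15)]
q_star = youden_threshold(roc)  # J is largest (0.48) at the middle point

# Made-up screening cohort; CAP scores are 405, 240, and 418 respectively.
cohort = [
    {"id": "A", "age": 45, "cag": 42},
    {"id": "B", "age": 30, "cag": 41},
    {"id": "C", "age": 38, "cag": 44},
]
selected = enrich(cohort, q_star)
```

In practice the TPR and FPR values fed to `youden_threshold` would come from censoring-adjusted estimates at the chosen trial horizon.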

Practical Notes

Using standard cutoffs derived from uncensored data (e.g., PREDICT-HD) for HD clinical trials leads to underpowered studies due to unadjusted upward bias in observed event rates under high censoring.

7. Context, Advantages, and Limitations

The CAP model’s main strengths lie in its parsimony, interpretability, and biological rationale. With only two required covariates, it is feasible in settings that lack broader phenotypic or biomarker data. It performs robustly when validated with metrics that adjust for heavy right-censoring. Its main limitation, reflected in the comparative benchmarks, is a modest performance deficit against more complex models (e.g., PIN, MRS), which may be preferred when additional covariates and resources are available. Nonetheless, its logistical simplicity makes it suitable for preventative trials and clinical contexts that require scalable, interpretable, and externally validated risk prediction from minimal, routinely collected data.
