
CAP Model for HD Diagnosis Prediction

Updated 10 November 2025
  • The CAP model is a two-covariate accelerated-failure-time model that predicts time to Huntington disease diagnosis by combining age and HTT CAG repeat length.
  • It employs a biologically motivated CAP score, formulated as Age × (CAG + β2), ensuring simplicity and reliable discrimination under heavy censoring.
  • The model facilitates clinical trial enrichment and sample size estimation by providing interpretable risk thresholds for patient selection.

The CAG-Age Product (CAP) model is a parametric, two-covariate accelerated-failure-time (AFT) model designed for predicting time to clinical diagnosis in Huntington disease (HD). It combines genetic information (HTT CAG repeat length) and demographic data (age at enrollment) in a biologically motivated product form, delivering a simple yet robust prognostic framework with direct applications in clinical risk stratification, trial enrichment, and study design.

1. Mathematical Formulation and Model Specification

The CAP model specifies time to diagnosis, $X_i$, for subject $i$ as follows:

$$\log(X_i) = \beta_0 + \beta_1\,\bigl[\mathrm{Age}_i\,(\mathrm{CAG}_i + \beta_2)\bigr] + \varepsilon_i$$

where:

  • $X_i$: years from study enrollment to HD diagnosis,
  • $\mathrm{Age}_i$: age at enrollment,
  • $\mathrm{CAG}_i$: HTT CAG repeat length,
  • $\varepsilon_i \sim \mathrm{Logistic}(0, \sigma)$: model error term.

The model introduces a "CAP score" $S_i^{\rm CAP}$:

$$S_i^{\rm CAP} = \mathrm{Age}_i\,(\mathrm{CAG}_i + \beta_2)$$

so that the model reduces to:

$$\log(X_i) = \beta_0 + \beta_1\,S_i^{\rm CAP} + \varepsilon_i$$

Parameter interpretations:

  • $\beta_0$ (intercept): expected $\log(X)$ when $S_i^{\rm CAP}=0$.
  • $\beta_1$ (scale): controls the compressive effect of $S_i^{\rm CAP}$ on time to diagnosis.
  • $\beta_2$ (centering constant): shifts the CAG axis to set an interpretable reference value and improve model fit.

Standard published estimates are $\widehat\beta_0\approx 4.35$, $\widehat\beta_1\approx -0.025$, $\widehat\beta_2\approx -33.0$.
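The CAP score itself is simple arithmetic, and a short sketch may make the definition concrete. The function name `cap_score` and the example subjects below are illustrative and not from the source; only the centering constant $\beta_2 = -33.0$ is taken from the estimates quoted above.

```python
def cap_score(age, cag, beta2=-33.0):
    """CAP score S = Age * (CAG + beta2); beta2 = -33.0 is the quoted estimate."""
    return age * (cag + beta2)


# Hypothetical subjects, chosen only to illustrate the arithmetic.
subjects = [
    {"id": "A", "age": 45, "cag": 42},
    {"id": "B", "age": 30, "cag": 40},
    {"id": "C", "age": 45, "cag": 40},
]

# Rank from highest to lowest CAP score. With a negative beta1 in the AFT
# model, a higher score corresponds to a shorter expected time to diagnosis.
ranked = sorted(subjects, key=lambda s: cap_score(s["age"], s["cag"]), reverse=True)
for s in ranked:
    print(s["id"], cap_score(s["age"], s["cag"]))
```

Both an older age at enrollment and a longer CAG repeat raise the score, which is the sense in which the product form combines the two covariates.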

2. Theoretical Underpinnings and Assumptions

The CAP model is developed under several key assumptions:

  • AFT structure: Covariates act multiplicatively on the time scale, which is suitable for progressive disorders like HD.
  • Parametric error: Uses a logistic error term, yielding a log-logistic marginal distribution for $X_i$.
  • Independent censoring: Censoring is conditionally independent of diagnostic time, given age and CAG.
  • Model selection: Zhang et al. (2011) conducted model comparisons across main effects and interaction models using AIC and prediction error, with the log-logistic AFT interaction form (the CAP model) emerging as optimal.

This model’s simplicity contrasts with multivariate approaches while preserving interpretability and biological relevance.
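To illustrate what the AFT structure and logistic error imply, the sketch below writes out the log-logistic survival function that follows from $\log(X) = \mu + \varepsilon$ with $\varepsilon \sim \mathrm{Logistic}(0, \sigma)$, namely $\Pr(X > t) = 1 / \bigl(1 + \exp((\log t - \mu)/\sigma)\bigr)$. The function names and the numeric values of `mu` and `sigma` are placeholders chosen for illustration, not fitted CAP estimates.

```python
import math


def loglogistic_survival(t, mu, sigma):
    """P(X > t) when log(X) = mu + eps and eps ~ Logistic(0, sigma)."""
    return 1.0 / (1.0 + math.exp((math.log(t) - mu) / sigma))


def median_time(mu):
    """The logistic error is symmetric about 0, so the median of X is exp(mu)."""
    return math.exp(mu)


# Placeholder values (assumptions for illustration only).
mu, sigma = 1.2, 0.5

# AFT property: shifting the linear predictor by `shift` rescales time by
# exp(shift), so S(t | mu + shift) equals S(t * exp(-shift) | mu).
shift = -0.3
t = 4.0
lhs = loglogistic_survival(t, mu + shift, sigma)
rhs = loglogistic_survival(t * math.exp(-shift), mu, sigma)
print(round(lhs, 6), round(rhs, 6))
```

The equality of the two printed values is what "covariates act multiplicatively on the time scale" means in practice: a change in the linear predictor stretches or compresses the whole time axis rather than altering the hazard proportionally.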

3. Validation and Performance Metrics under Censoring

Validation of the CAP model leverages external ENROLL-HD data and metrics robust to high right-censoring (approx. 78%). Core metrics include:

3.1 Uno’s C-statistic

Adjusted for censoring, Uno’s C measures the probability that among comparable subject pairs, the one diagnosed sooner had a higher risk score:

$$\widehat C_{\rm Uno}(\tau) = \frac{ \sum_{i\ne j} \Delta_i\,\widehat G(W_i)^{-2}\, \mathbf{1}\{W_i < W_j,\, W_i < \tau\}\, \mathbf{1}\{\widehat\eta_i > \widehat\eta_j\} }{ \sum_{i\ne j} \Delta_i\,\widehat G(W_i)^{-2}\, \mathbf{1}\{W_i < W_j,\, W_i < \tau\} }$$

where $W_i$ is the observed (possibly censored) follow-up time, $\Delta_i$ is the event indicator (1 if diagnosis was observed, 0 if censored), $\widehat G$ is the Kaplan–Meier survival estimate for censoring, and $\widehat\eta_i$ is the fitted risk score.
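The estimator can be sketched directly from the formula. The version below is a minimal pure-Python illustration, not a reference implementation: the function names are hypothetical, $\widehat G$ is evaluated at $W_i$ exactly as the formula is written (some implementations use the left limit instead), and ties in the risk score are counted as non-concordant.

```python
def km_censoring_survival(times, events):
    """Kaplan-Meier estimate G(t) of the censoring survival function.

    Censoring is treated as the 'event' here, so the curve steps down at
    censoring times (events[k] == 0).
    """
    steps = []
    for u in sorted({w for w, d in zip(times, events) if d == 0}):
        at_risk = sum(1 for w in times if w >= u)
        censored = sum(1 for w, d in zip(times, events) if w == u and d == 0)
        steps.append((u, 1.0 - censored / at_risk))

    def G(t):
        g = 1.0
        for u, factor in steps:
            if u <= t:
                g *= factor
        return g

    return G


def uno_c(times, events, risk, tau=float("inf")):
    """Censoring-weighted concordance following the formula above."""
    G = km_censoring_survival(times, events)
    n = len(times)
    num = den = 0.0
    for i in range(n):
        # Only observed diagnoses before tau anchor a comparable pair.
        if events[i] != 1 or times[i] >= tau:
            continue
        later = [j for j in range(n) if times[j] > times[i]]
        if not later:
            continue
        weight = G(times[i]) ** -2  # inverse-probability-of-censoring weight
        for j in later:
            den += weight
            if risk[i] > risk[j]:
                num += weight
    return num / den if den > 0 else float("nan")


# Toy data (hypothetical): subject 1 is censored at time 2.
times = [1, 2, 3, 4]
events = [1, 0, 1, 1]
risk = [4, 3, 1, 2]
print(uno_c(times, events, risk))
```

Without censoring every weight equals 1 and the statistic reduces to the ordinary concordance index; with censoring, pairs anchored at later event times are up-weighted to compensate for subjects lost to follow-up.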

3.2 Time-Dependent ROC and AUC

For each time $t$, time-dependent TPR and FPR are defined as functions of the risk score threshold $q$:

$$\widehat{\rm TPR}(q, t) = \frac{\bigl[1-\widehat S(t\mid\widehat\eta_i > q)\bigr]\;\times\;\widehat{\Pr}(\widehat\eta_i > q)}{1-\widehat S(t)}$$

$$\widehat{\rm FPR}(q, t) = 1-\frac{\widehat S(t\mid\widehat\eta_i \le q)\;\times\;\widehat{\Pr}(\widehat\eta_i \le q)}{\widehat S(t)}$$

The ROC curve at $t$ is then the locus in $(\widehat{\rm FPR}, \widehat{\rm TPR})$ space, with corresponding AUC quantifying discriminative accuracy.
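A small sketch can show how these two quantities are assembled from Kaplan–Meier estimates. It assumes that $\widehat S$ is the Kaplan–Meier estimator, computed overall and within the groups above and at or below the threshold, and that $\widehat{\Pr}$ is the empirical proportion; the function names and toy data are illustrative only.

```python
def km_survival(times, events, t):
    """Kaplan-Meier estimate of P(X > t) from right-censored data."""
    surv = 1.0
    for u in sorted({w for w, d in zip(times, events) if d == 1 and w <= t}):
        at_risk = sum(1 for w in times if w >= u)
        diagnosed = sum(1 for w, d in zip(times, events) if w == u and d == 1)
        surv *= 1.0 - diagnosed / at_risk
    return surv


def tpr_fpr(times, events, scores, q, t):
    """Time-dependent TPR and FPR at threshold q and horizon t."""
    n = len(times)
    hi = [k for k in range(n) if scores[k] > q]
    lo = [k for k in range(n) if scores[k] <= q]
    s_all = km_survival(times, events, t)
    if s_all <= 0.0 or s_all >= 1.0:
        raise ValueError("TPR/FPR are undefined when S(t) is 0 or 1")
    # An empty group contributes nothing because its proportion is 0.
    s_hi = km_survival([times[k] for k in hi], [events[k] for k in hi], t)
    s_lo = km_survival([times[k] for k in lo], [events[k] for k in lo], t)
    tpr = (1.0 - s_hi) * (len(hi) / n) / (1.0 - s_all)
    fpr = 1.0 - s_lo * (len(lo) / n) / s_all
    return tpr, fpr


# Toy uncensored data (hypothetical), so the result can be checked by counting.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
scores = [9, 8, 2, 7, 1, 3]
tpr, fpr = tpr_fpr(times, events, scores, q=5, t=3.5)
print(tpr, fpr)
```

In this uncensored example the estimates match simple counting: two of the three subjects diagnosed by $t = 3.5$ score above the threshold, and one of the three not yet diagnosed does. Sweeping $q$ over the observed scores traces out the ROC curve at that horizon.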

4. Comparative Model Performance

Performance on the ENROLL-HD dataset was benchmarked with censoring-adjusted Uno's $C$ and time-dependent AUC (at 1 year, 3 years, and globally):

| Model | Uno's $C$ ($\tau = 1$ yr) | Uno's $C$ (global) | AUC ($\tau = 3$ yr) | AUC (global) |
|---|---|---|---|---|
| CAP | 0.88 | 0.80 | 0.84 | 0.79 |
| Langbehn | 0.87 | 0.80 | 0.83 | 0.78 |
| PIN | 0.90 | 0.84 | 0.86 | 0.82 |
| MRS | 0.91 | 0.86 | 0.88 | 0.85 |

All models exhibited strong risk stratification (random: 0.50). While the MRS model (incorporating additional covariates) achieved best-in-class accuracy, the CAP model performed closely and was more logistically feasible due to its reliance on only two routinely available predictors.

5. Clinical Trial Enrichment and Sample Size Calculations

The CAP model admits direct application in subject selection and trial power analysis. For a randomized controlled trial (RCT) of fixed duration $t$, an enrichment threshold $q^*$ on the CAP score $S_i^{\rm CAP}$ targets higher-risk participants. Letting

  • $\pi_0 = \Pr(X \le t \mid S_i^{\rm CAP}\ge q^*)$ (untreated event rate),
  • $\pi_1 = (1-\delta)\pi_0$ (treatment event rate; $\delta$ = relative reduction),

the required per-arm sample size under large-sample approximation is

$$n = \frac{ \bigl[ Z_{1-\alpha/2}\sqrt{\pi_0(1-\pi_0)+\pi_1(1-\pi_1)} + Z_{1-\beta}\sqrt{\pi_0(1-\pi_0)+\pi_1(1-\pi_1)} \bigr]^2 }{ (\pi_0-\pi_1)^2 }$$

or, alternatively, under the proportional-hazards/log-rank approximation, with $D$ the required number of events:

$$D = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2}{\{\ln(1-\delta)\}^2}, \quad n \approx D / \pi_0$$

Example: For $q^* \approx 370.2$ (3-year trial, $\pi_0\approx0.37$), $\delta=0.5$, $\alpha=0.05$, $1-\beta=0.8$, the reported calculation is

$$n \approx \frac{(1.96+0.84)^2}{[\ln(0.5)]^2} \bigg/ 0.37 \approx 92$$

Earlier estimates that did not adjust for censoring underestimated the required sample size by 10–20%, because high censoring biases event-rate estimation upward if not corrected.
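Both sample-size expressions are easy to evaluate numerically, which helps when exploring other event rates or effect sizes. The sketch below codes them exactly as written in this section, using exact normal quantiles from the standard library; the function names are illustrative. With the example inputs, the two-proportion formula evaluates to about 88 per arm and the log-rank form to about 44, so neither reproduces the quoted figure of 92 exactly; the quoted value may reflect rounding or design details not shown in this section.

```python
import math
from statistics import NormalDist


def z(p):
    """Standard normal quantile."""
    return NormalDist().inv_cdf(p)


def n_two_proportions(pi0, delta, alpha=0.05, power=0.8):
    """Per-arm n from the large-sample two-proportion formula, as written above."""
    pi1 = (1.0 - delta) * pi0
    sd = math.sqrt(pi0 * (1.0 - pi0) + pi1 * (1.0 - pi1))
    numerator = (z(1.0 - alpha / 2.0) * sd + z(power) * sd) ** 2
    return numerator / (pi0 - pi1) ** 2


def n_logrank(pi0, delta, alpha=0.05, power=0.8):
    """n = D / pi0 with D from the log-rank expression, as written above."""
    d = (z(1.0 - alpha / 2.0) + z(power)) ** 2 / math.log(1.0 - delta) ** 2
    return d / pi0


# Inputs from the worked example: pi0 = 0.37, delta = 0.5, alpha = 0.05, power = 0.8.
n_prop = n_two_proportions(0.37, 0.5)
n_lr = n_logrank(0.37, 0.5)
print(round(n_prop, 1), round(n_lr, 1))
```

In either form the required size grows as $\pi_0$ shrinks, which is why an overestimated event rate translates directly into an underpowered trial.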

6. Implementation, Parameterization, and Risk Thresholding

Software and Fitting

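CAP can be fitted in R using:

    library(survival)  # provides Surv() and survreg()
    fit <- survreg(Surv(time, event) ~ I(Age * (CAG - 33.0)), data = dat, dist = "loglogistic")

with $\widehat\beta_2$ fixed at its conventional value of $-33.0$. The I() wrapper makes R evaluate the product arithmetically; without it, the formula operators * and + would expand into separate main-effect and interaction terms instead of the single CAP covariate. Here dat is a placeholder for a data frame with columns time, event, Age, and CAG. Risk scores are calculated as $S_i^{\rm CAP} = \mathrm{Age}_i\,(\mathrm{CAG}_i - 33.0)$.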

| Trial Horizon ($t$) | Threshold $q^*$ | Diagnosis Rate |
|---|---|---|
| 2 years | 370.2 | 32% |
| 3 years | 370.2 | 37% |
| 4 years | 370.2 | 40% |
| 5 years | 368.5 | 41% |

These cutoffs are derived from ROC/Youden analysis on censoring-adjusted ENROLL-HD data.
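As an illustration of how the tabulated cutoffs might be applied when screening candidates, the sketch below encodes the table as lookups and flags subjects whose CAP score meets the threshold for a given trial horizon. The names `THRESHOLDS`, `DIAGNOSIS_RATE`, `is_eligible`, and `expected_diagnoses` are hypothetical, and the example subjects are invented; the numeric cutoffs and rates are the ones in the table.

```python
# Horizon in years -> CAP threshold q* and diagnosis rate, from the table above.
THRESHOLDS = {2: 370.2, 3: 370.2, 4: 370.2, 5: 368.5}
DIAGNOSIS_RATE = {2: 0.32, 3: 0.37, 4: 0.40, 5: 0.41}


def is_eligible(age, cag, horizon_years, beta2=-33.0):
    """True if Age * (CAG + beta2) meets the threshold for this horizon."""
    return age * (cag + beta2) >= THRESHOLDS[horizon_years]


def expected_diagnoses(n_enrolled, horizon_years):
    """Expected number of diagnoses among n enriched participants."""
    return n_enrolled * DIAGNOSIS_RATE[horizon_years]


# Invented candidates: (age, CAG). Their CAP scores are 405, 210, and 369.
candidates = [(45, 42), (30, 40), (41, 42)]
for horizon in (3, 5):
    flags = [is_eligible(age, cag, horizon) for age, cag in candidates]
    print(horizon, flags)
```

The third candidate, with a score of 369, falls just below the 3-year cutoff but above the 5-year one, which shows how the choice of horizon changes the enrolled population even when the thresholds differ only slightly.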

Practical Notes

Using standard cutoffs derived from uncensored data (e.g., PREDICT-HD) for HD clinical trials leads to underpowered studies due to unadjusted upward bias in observed event rates under high censoring.

7. Context, Advantages, and Limitations

The CAP model's main strengths lie in its parsimony, interpretability, and biological rationale. With only two required covariates, it is feasible in contexts lacking broader phenotypic or biomarker surveillance. It performs robustly when properly validated with metrics that adjust for heavy right-censoring. Its main limitation, reflected in comparative benchmarks, is a modest performance deficit against more complex models (e.g., PIN, MRS), which may be preferred if additional covariates and resources are available. Nonetheless, its logistical simplicity makes it suitable for preventative trials and clinical contexts requiring scalable, interpretable, and externally validated risk prediction using minimal, routinely collected data.
