Bayesian Latent Class Framework for EHR Phenotyping
- The Bayesian latent class framework is a probabilistic model that identifies hidden subpopulations in heterogeneous biomedical datasets using informative priors and explicit modeling of data missingness.
- It employs tailored likelihood functions for continuous and binary variables, modeling missing not at random (MNAR) scenarios to reveal clinically meaningful phenotypes such as the T2-high subtype in asthma.
- Using Hamiltonian Monte Carlo for posterior inference, the framework robustly quantifies uncertainty in class assignments, supporting precise cohort discovery and evidence-driven precision medicine.
A Bayesian latent class framework is a probabilistic modeling approach for identifying and characterizing latent (unobserved) subpopulations within heterogeneous biomedical cohorts, with explicit incorporation of prior knowledge, principled handling of uncertainty, and quantitative accounting for informative missingness. This methodology provides a coherent strategy for hypothesis generation and cohort identification in complex electronic health record (EHR) datasets, supporting the discovery of clinically meaningful phenotypes and subtypes, particularly where the true disease or trait stratification is unknown.
1. Structural Specification of the Bayesian Latent Class Model
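The framework models each subject $i$ as belonging to one of $K$ latent classes, with a latent indicator variable $C_i \in \{0, 1\}$ in the case of a binary class system. For each patient, three tiers of data are observed: baseline covariates $X_i$ (e.g., demographic variables), continuous clinical features $Y_{ij}$ (biomarkers, utilization, etc.), and binary features $B_{ik}$ (e.g., diagnosis codes, binary test outcomes). Missing data in continuous features are captured by missingness indicators $R_{ij}$, taken here to equal 1 when $Y_{ij}$ is observed and 0 when it is missing. With any dependence on $X_i$ left implicit, the joint likelihood for patient $i$, covering both the data and the missingness pattern, can be written in the form:

$$
L_i = \sum_{c \in \{0, 1\}} P(C_i = c) \prod_{j} P(R_{ij} \mid C_i = c)\, f(Y_{ij} \mid C_i = c)^{R_{ij}} \prod_{k} P(B_{ik} \mid C_i = c)
$$

where $P(C_i = c)$ represents the class prior, $P(R_{ij} \mid C_i = c)$ is a missing not at random (MNAR) logistic model, $f(Y_{ij} \mid C_i = c)$ denotes the normal density for continuous features, and $P(B_{ik} \mid C_i = c)$ is a Bernoulli likelihood for binary features. The exponent $R_{ij}$ drops the normal density for values that were not observed. The key design choice is the explicit joint modeling of missingness, which captures patterns that may be informative about class membership.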
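The four likelihood components described above can be combined in a few lines. The following is a minimal sketch for a two-class model, not the source's implementation: the dictionary layout, the feature names `eos` and `allergy_code`, and every parameter value are hypothetical and chosen only for illustration. The per-class terms are accumulated on the log scale and the class is summed out with a log-sum-exp.

```python
import math

def log_normal_pdf(y, mean, sd):
    """Log density of a normal distribution evaluated at y."""
    z = (y - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def class_log_joint(patient, params, c):
    """Log of class prior x missingness x normal x Bernoulli terms for class c."""
    total = math.log(params["class_prior"][c])
    for name, y in patient["continuous"].items():
        p_obs = params["p_observed"][name][c]
        if y is None:
            # Missing value: only the missingness term contributes, because the
            # normal density integrates to one over the unobserved value.
            total += math.log(1.0 - p_obs)
        else:
            mean, sd = params["normal"][name][c]
            total += math.log(p_obs) + log_normal_pdf(y, mean, sd)
    for name, b in patient["binary"].items():
        p = params["bernoulli"][name][c]
        total += math.log(p) if b == 1 else math.log(1.0 - p)
    return total

def marginal_log_likelihood(patient, params):
    """Sum the latent class out of the joint likelihood on the log scale."""
    return log_sum_exp([class_log_joint(patient, params, c) for c in (0, 1)])

# Hypothetical parameter values, indexed by class (0 = non-T2, 1 = T2).
params = {
    "class_prior": [0.6, 0.4],
    "p_observed": {"eos": [0.5, 0.8]},
    "normal": {"eos": [(0.0, 1.0), (1.0, 1.0)]},
    "bernoulli": {"allergy_code": [0.1, 0.5]},
}
patient_observed = {"continuous": {"eos": 1.2}, "binary": {"allergy_code": 1}}
patient_missing = {"continuous": {"eos": None}, "binary": {}}

print(marginal_log_likelihood(patient_observed, params))
print(marginal_log_likelihood(patient_missing, params))
```

For `patient_missing`, the likelihood reduces to the class prior weighted by the per-class probability that the measurement is absent, which is how a missing value alone can carry information about class membership.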
2. Incorporation of Informative Priors and Domain Knowledge
A distinguishing feature is the direct encoding of clinical prior knowledge into parameter priors, especially for features known to define critical disease axes. For example, class means and slopes for Type 2 (T2) inflammation-associated biomarkers were assigned informative priors to bias inference toward clinically meaningful separation.
- For blood eosinophil count (EOS), the class 0 (non-T2) intercept is given a tight normal prior, and the prior on the T2 slope is set to target empirical separation at the 0.15 × 10³/μL clinical cut-point, achieving sensitivity 75% and specificity 76%.
- Skin-prick test positivity (PST) and allergy ICD code indicators were assigned truncated normal priors to enforce high specificity or concordance with observed cohort prevalence.
For all other features (utilization rates, medication prescriptions), weakly informative priors were used, allowing the data to determine their role in class separation. This differential weighting embeds clinical semantics into the unsupervised learning process, steering the posterior toward interpretable phenotypic axes.
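To illustrate how a target sensitivity and specificity at a cut-point can be translated into an intercept-to-slope separation, the sketch below assumes each class is normally distributed with a common standard deviation on a standardized scale. That equal-variance assumption, the function names, and the choice of a class 0 mean of zero are illustrative assumptions; the source's actual prior values are not reproduced here.

```python
from statistics import NormalDist

def separation_for_targets(sensitivity, specificity, sd=1.0):
    """Return (cut-point offset from the class 0 mean, class 1 mean shift).

    Assumes both classes are normal with the same standard deviation `sd`.
    """
    std = NormalDist()
    cut_offset = sd * std.inv_cdf(specificity)
    slope = sd * (std.inv_cdf(specificity) + std.inv_cdf(sensitivity))
    return cut_offset, slope

def sens_spec(mean0, slope, sd, cut):
    """Sensitivity and specificity of the rule 'value > cut means class 1'."""
    class0 = NormalDist(mean0, sd)
    class1 = NormalDist(mean0 + slope, sd)
    specificity = class0.cdf(cut)        # class 0 patients below the cut
    sensitivity = 1.0 - class1.cdf(cut)  # class 1 patients above the cut
    return sensitivity, specificity

# Targets taken from the text: sensitivity 75%, specificity 76%.
cut_offset, slope = separation_for_targets(0.75, 0.76)
sens, spec = sens_spec(mean0=0.0, slope=slope, sd=1.0, cut=cut_offset)
print(cut_offset, slope, sens, spec)
```

Centering an informative slope prior on a value derived this way is one way to encode a clinical cut-point; a tighter prior standard deviation pulls the posterior more strongly toward that separation.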
3. Explicit Modeling of Data Missingness
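The missingness mechanism for each continuous feature $j$ is modeled with a class-dependent logistic regression. Writing $R_{ij} = 1$ when feature $j$ is observed for patient $i$ and $C_i$ for the latent class, one form consistent with this description is:

$$
\operatorname{logit} P(R_{ij} = 1 \mid C_i) = \alpha_j + \beta_j C_i
$$

This approach enables the model to learn, for instance, that laboratory measurements are more frequently missing in one latent class (as often occurs in EHR cohorts), which directly influences the posterior assignment of $C_i$. Missing feature values are integrated out analytically within the likelihood, under the assumption that the absence of data itself carries signal about the latent state.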
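A small worked example shows how a missingness pattern alone can shift class membership. The sketch assumes a logistic model in which the probability of observing a measurement depends only on the latent class; the coefficient values and the prior class probability are hypothetical.

```python
import math

def sigmoid(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def posterior_class1_from_missingness(prior1, intercept, slope, observed):
    """P(class 1 | whether one measurement was observed), by Bayes' rule.

    The observation probability is sigmoid(intercept) in class 0 and
    sigmoid(intercept + slope) in class 1 (assumed form).
    """
    p_obs = [sigmoid(intercept), sigmoid(intercept + slope)]
    like = [p if observed else 1.0 - p for p in p_obs]
    numerator = prior1 * like[1]
    return numerator / (numerator + (1.0 - prior1) * like[0])

# Hypothetical values: the lab is ordered more often for class 1 patients.
prior1 = 0.4
print(posterior_class1_from_missingness(prior1, 0.0, 1.5, observed=True))
print(posterior_class1_from_missingness(prior1, 0.0, 1.5, observed=False))
```

With a positive slope, an observed measurement raises the class 1 probability above the prior and a missing one lowers it; with a slope of zero the missingness pattern is uninformative and the prior is returned unchanged.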
4. Posterior Inference and Class Assignment
Posterior exploration is performed using Hamiltonian Monte Carlo (Stan NUTS), marginalizing the latent class variable in the log-likelihood. Good posterior mixing is indicated by convergence diagnostics (the $\hat{R}$ statistic for all parameters and the effective sample sizes). In the application to 44,642 adult asthma patients, the framework yielded a distribution of patient-level posterior membership probabilities $P(C_i = 1 \mid \text{data})$ that was bimodal, with density clustering near 0.03 and 0.98, signifying robust class separation.
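For readers unfamiliar with the $\hat{R}$ diagnostic, the sketch below computes the classic Gelman-Rubin potential scale reduction factor from several chains of draws for one parameter. This is the textbook version, not the rank-normalized split variant that current Stan releases report, and the chains shown are toy values rather than output from the model.

```python
from statistics import mean, variance

def gelman_rubin_rhat(chains):
    """Classic potential scale reduction factor for equal-length chains.

    Requires at least two chains, each with at least two draws and nonzero
    within-chain variance.
    """
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    within = mean(variance(chain) for chain in chains)  # W: mean within-chain variance
    between = n * variance(chain_means)                 # B: between-chain variance
    pooled = (n - 1) / n * within + between / n         # pooled variance estimate
    return (pooled / within) ** 0.5

# Toy chains: the first pair agrees, the second pair sits in different regions.
agreeing = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]]
disagreeing = [[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]]
print(gelman_rubin_rhat(agreeing))
print(gelman_rubin_rhat(disagreeing))
```

Values close to 1 indicate that chains started from different points are sampling the same distribution, while values well above 1 signal that the chains have not mixed.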
The population fraction for the T2-inflammation-informed class (posterior mean) was 38.7%, with a 95% credible interval for the mixing weight of (37.9%, 39.5%). Under a hard threshold on the posterior membership probability, 36.3% of patients were classified as T2-high.
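The relationship between soft membership probabilities and a hard classification can be sketched as follows. Given the per-class log joint terms for a patient, the membership probability is their normalized exponential, and a threshold turns it into a label. The 0.5 threshold and the example probabilities are illustrative assumptions.

```python
import math

def membership_probability(log_joint0, log_joint1):
    """P(class 1 | data) from the two per-class log joint terms, computed stably."""
    m = max(log_joint0, log_joint1)
    w0 = math.exp(log_joint0 - m)
    w1 = math.exp(log_joint1 - m)
    return w1 / (w0 + w1)

def summarize(probabilities, threshold):
    """Return (share above threshold, mean membership probability)."""
    n = len(probabilities)
    hard_fraction = sum(p > threshold for p in probabilities) / n
    soft_fraction = sum(probabilities) / n
    return hard_fraction, soft_fraction

# Hypothetical membership probabilities for four patients.
probabilities = [0.03, 0.98, 0.60, 0.40]
hard_fraction, soft_fraction = summarize(probabilities, threshold=0.5)
print(hard_fraction, soft_fraction)
```

The two summaries generally differ because patients with intermediate probabilities contribute fractionally to the soft average but count as all or nothing under a threshold, which is one reason a thresholded share need not match the estimated mixing weight.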
5. Phenotype Discovery and Clinical Characterization
The class characterized by informative priors for T2-inflammation ("T2-informed" class, $C = 1$) exhibited:
- Markedly elevated blood eosinophils (higher mean standardized score for $C = 1$ than for $C = 0$)
- Higher total IgE (median back-transformed $245.2$ kU/L vs. $98.7$ kU/L for non-T2)
- Higher odds of skin-prick positivity, allergy ICD codes, and ICS/LABA prescriptions
- Greater health-care utilization: asthma encounters per year ($4.06$ vs. $1.80$), ED visits ($0.18$ vs. $0.05$), SABA prescriptions ($3.12$ vs. $1.67$)
Despite weak priors on utilization features, the latent class structure identified an "uncontrolled T2-high" sub-phenotype that combines classic biomarker elevations with high clinical burden, aligning with evidence-based treatment stratification for biologic therapies. This suggests that even without a direct prior on utilization, informative biomarker priors can propagate to clinically meaningful secondary axes.
| Feature | $C = 0$ (Non-T2) | $C = 1$ (T2-informed) |
|---|---|---|
| Eosinophils (std) | lower | higher |
| Total IgE (kU/L) | $98.7$ | $245.2$ |
| Asthma encounters/year | $1.80$ | $4.06$ |
| ICS/LABA Rx (%) | lower | higher |
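Class-level summaries of this kind can be computed without hard assignment by weighting each patient by the posterior membership probability. The sketch below shows that idea for a single feature; it is an assumption about one reasonable way to produce such summaries, not a description of how the figures above were derived, and the input values are made up.

```python
def weighted_class_means(values, prob_class1):
    """Membership-weighted mean of a feature in class 0 and class 1.

    Each patient contributes to class 1 with weight p and to class 0 with
    weight 1 - p, where p is the posterior membership probability.
    """
    weight1 = sum(prob_class1)
    weight0 = len(values) - weight1
    mean1 = sum(v * p for v, p in zip(values, prob_class1)) / weight1
    mean0 = sum(v * (1.0 - p) for v, p in zip(values, prob_class1)) / weight0
    return mean0, mean1

# Hypothetical encounters per year and membership probabilities.
encounters = [1.0, 2.0, 4.0, 5.0]
prob_class1 = [0.05, 0.10, 0.90, 0.95]
mean0, mean1 = weighted_class_means(encounters, prob_class1)
print(mean0, mean1)
```

When the membership probabilities are all 0 or 1, the weighted means reduce to ordinary group means.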
6. Interpretability, Uncertainty, and Cohort Assignment
The Bayesian framework supports probabilistic patient-level class assignment, yielding interpretable and uncertainty-quantified cohort demarcations. This is fundamentally different from hard-threshold rule-based phenotyping or ad hoc clustering: each individual receives a posterior probability of latent class membership, which can be thresholded at a high cut-off for high-purity cohort selection, used in downstream clinical decision support, or leveraged for hypothesis generation (e.g., identifying new utilization phenotypes within T2-high patients).
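High-purity cohort selection from membership probabilities can be sketched as a simple filter. If the model is well calibrated, the mean membership probability among selected patients estimates the share who truly belong to the class. The 0.9 cut-off and the probabilities below are hypothetical.

```python
def select_cohort(probabilities, threshold):
    """Return indices of patients at or above the threshold and the expected purity.

    Expected purity is the mean membership probability among the selected
    patients; it is None when nobody is selected.
    """
    selected = [i for i, p in enumerate(probabilities) if p >= threshold]
    if not selected:
        return selected, None
    expected_purity = sum(probabilities[i] for i in selected) / len(selected)
    return selected, expected_purity

# Hypothetical posterior membership probabilities for five patients.
probabilities = [0.03, 0.98, 0.55, 0.92, 0.10]
selected, expected_purity = select_cohort(probabilities, threshold=0.9)
print(selected, expected_purity)
```

Raising the threshold trades cohort size for purity, so the cut-off can be tuned to the tolerance for misclassification in a given study.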
A plausible implication is the suitability of this approach for precision medicine studies, clinical trial enrichment, and algorithmic clinical pathway development in diseases such as asthma, where gold-standard endotypes are not universally established.
7. Implications and Extensions for Biomedical Research
Embedding expert priors into a flexible latent class machinery enables the simultaneous discovery and validation of clinically plausible phenotypes directly from large EHR datasets. Explicit modeling of informative missingness enables use in real-world data where measurement patterns often convey diagnostic cues. The framework’s operationalization in a general Bayesian engine (Stan NUTS) allows for extension to richer data types (e.g., time-series, multi-class settings), varied outcome spaces, and joint modeling with supervised endpoints.
In summary, the Bayesian latent class framework formalizes unsupervised, prior-informed clustering with uncertainty quantification, interpretable phenotype recovery, and robust handling of missing and heterogeneous clinical data. Its application to asthma EHRs demonstrates identification of a well-separated, clinically interpretable "uncontrolled T2-high" phenotype, supporting both comprehensive cohort discovery and evidence-driven hypothesis generation (Mayer et al., 3 Nov 2025).