Bayesian Latent Class Framework for EHR Phenotyping

Updated 9 November 2025

The Bayesian latent class framework is a probabilistic model that identifies hidden subpopulations in heterogeneous biomedical datasets using informative priors and explicit modeling of data missingness.
It employs tailored likelihood functions for continuous and binary variables, modeling missing not at random (MNAR) scenarios to reveal clinically meaningful phenotypes such as the T2-high subtype in asthma.
Using Hamiltonian Monte Carlo for posterior inference, the framework robustly quantifies uncertainty in class assignments, supporting precise cohort discovery and evidence-driven precision medicine.

A Bayesian latent class framework is a probabilistic modeling approach for identifying and characterizing latent (unobserved) subpopulations within heterogeneous biomedical cohorts, with explicit incorporation of prior knowledge, principled handling of uncertainty, and quantitative accounting for informative missingness. This methodology provides a coherent strategy for hypothesis generation and cohort identification in complex electronic health record (EHR) datasets, supporting the discovery of clinically meaningful phenotypes and subtypes, particularly where the true disease or trait stratification is unknown.

1. Structural Specification of the Bayesian Latent Class Model

The framework models each subject $i=1,\ldots,N$ as belonging to one of $K$ latent classes, with a latent indicator variable $z_i\in\{0,1\}$ in the case of a binary class system. For each patient, three tiers of data are observed: baseline covariates $X_i$ (e.g., demographic variables), continuous clinical features $Y_i$ (biomarkers, utilization, etc.), and binary features $W_i$ (e.g., diagnosis codes, binary test outcomes). Missing data in continuous features are captured by missingness indicators $R_i$, with $R_{ij}=1$ when continuous feature $j$ is unobserved for subject $i$. The joint likelihood, which covers both the observed data and the missingness pattern, is defined as:

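$P(Y_i, R_i, W_i, z_i \mid X_i, \Theta) = P(z_i \mid \pi) \prod_{j=1}^{J} P(R_{ij} \mid z_i, X_i, \beta^R_j) \left[ f_N(Y_{ij} \mid \mu_j(z_i), \sigma_j^2) \right]^{1 - R_{ij}} \prod_{m=1}^{M} P(W_{im} \mid z_i, \beta^W_m)$

where $J$ and $M$ denote the numbers of continuous and binary features, $P(z_i \mid \pi)$ is the class prior, $P(R_{ij} \mid \ldots)$ is a missing not at random (MNAR) logistic model whose coefficients are collected in $\beta^R_j$ (its form is given in Section 3), $f_N$ is the normal density for continuous features, and $P(W_{im} \mid \ldots)$ is a Bernoulli likelihood for binary features. The exponent $1 - R_{ij}$ drops the normal density whenever a value is missing ($R_{ij}=1$), so a missing feature contributes only through its missingness term. The key design choice is this explicit joint modeling of missingness, which captures patterns that may be informative about class membership.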
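The factorization above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the nested-list parameter containers, and all numeric values are hypothetical, and the class-specific missingness and binary-feature probabilities are assumed to have been evaluated already at the subject's covariates.

```python
import math

def log_normal_pdf(y, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at y."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2)

def log_bernoulli(value, p):
    """Log probability of a binary value under success probability p."""
    return math.log(p) if value == 1 else math.log(1.0 - p)

def log_joint(z, y, r, w, pi, p_miss, mu, sigma, p_w):
    """Log joint likelihood of one subject's data and a fixed class z (0 or 1).

    p_miss[z][j], mu[z][j] and p_w[z][b] are class-specific parameters held in
    hypothetical nested lists; y[j] is never read when r[j] == 1 (missing).
    """
    total = math.log(pi if z == 1 else 1.0 - pi)          # class prior P(z | pi)
    for j, r_j in enumerate(r):
        total += log_bernoulli(r_j, p_miss[z][j])          # missingness term
        if r_j == 0:                                       # observed value only
            total += log_normal_pdf(y[j], mu[z][j], sigma[j])
    for w_b, p_b in zip(w, p_w[z]):
        total += log_bernoulli(w_b, p_b)                   # binary feature term
    return total

# Hypothetical parameters: two continuous features, one binary feature.
params = dict(
    pi=0.4,
    p_miss=[[0.6, 0.5], [0.2, 0.5]],
    mu=[[-0.2, 0.0], [0.2, 0.5]],
    sigma=[1.0, 1.0],
    p_w=[[0.2], [0.6]],
)
y, r, w = [0.3, None], [0, 1], [1]    # the second continuous feature is missing
log_joint_z0 = log_joint(0, y, r, w, **params)
log_joint_z1 = log_joint(1, y, r, w, **params)
```

Because the normal density is skipped for a missing value, whatever is stored in `y[1]` has no effect on either result; only the missingness probability for that feature enters.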

2. Incorporation of Informative Priors and Domain Knowledge

A distinguishing feature is the direct encoding of clinical prior knowledge into parameter priors, especially for features known to define critical disease axes. For example, class means and slopes for Type 2 (T2) inflammation-associated biomarkers were assigned informative priors to bias inference toward clinically meaningful separation.

For blood eosinophil count (EOS), the class 0 (non-T2) intercept is given a tight normal prior, $N(-0.68, 0.05^2)$, and the T2 slope is given the prior $N(0.70 - \text{intercept}, 0.05^2)$. Together these target empirical separation at the $0.15 \times 10^3/\mu\text{L}$ clinical cut-point, achieving sensitivity $\approx$ 75% and specificity $\approx$ 76%.
Skin-prick test positivity (PST) and allergy ICD code indicators are assigned truncated normal priors to enforce high specificity or concordance with observed cohort prevalence.
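A quick prior simulation shows what the EOS priors imply for the two class means. The sketch below assumes the class mean is parameterized as $\mu(z) = \text{intercept} + \text{slope}\cdot z$, which is an illustrative assumption rather than something the source states; the function name and the number of draws are also hypothetical.

```python
import random
import statistics

random.seed(0)

def draw_eos_class_means(n_draws=20000):
    """Draw (class-0 mean, class-1 mean) pairs from the stated EOS priors.

    Assumes mu(z) = intercept + slope * z (illustrative parameterization).
    """
    draws = []
    for _ in range(n_draws):
        intercept = random.gauss(-0.68, 0.05)            # non-T2 class mean
        slope = random.gauss(0.70 - intercept, 0.05)     # centred on 0.70 - intercept
        draws.append((intercept, intercept + slope))     # (mu for z=0, mu for z=1)
    return draws

draws = draw_eos_class_means()
mean_z0 = statistics.fmean([m0 for m0, _ in draws])
mean_z1 = statistics.fmean([m1 for _, m1 in draws])
sd_z1 = statistics.stdev([m1 for _, m1 in draws])
```

Under this parameterization the intercept cancels in the sum, so the implied prior on the T2 class mean is $N(0.70, 0.05^2)$ regardless of the intercept draw. Centering the slope on $0.70 - \text{intercept}$ is therefore a way of placing a tight prior directly on the T2 class mean.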

For all other features (utilization rates, medication prescriptions), weakly informative $N(0,1)$ priors were used, allowing the data to determine their role in class separation. This differential weighting embeds clinical semantics into the unsupervised learning process, steering the posterior toward interpretable phenotypic axes.

3. Explicit Modeling of Data Missingness

The missingness mechanism for each continuous feature $j$ is modeled as:

$P(R_{ij} = 1 \mid z_i, X_i) = \mathrm{logit}^{-1}( \alpha^{(R)}_{0j} + \alpha^{(R)}_{1j} z_i + \gamma_j^T X_i )$

This approach lets the model learn, for instance, that laboratory measurements are missing more often in one latent class (as often occurs in EHR cohorts), so that the missingness pattern directly influences the posterior assignment of $z_i$. Missing feature values are integrated out analytically within the likelihood, on the assumption that the absence of data itself carries signal about the latent state.
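To make the effect of the class coefficient $\alpha^{(R)}_{1j}$ concrete, the sketch below evaluates the logistic missingness model and then applies Bayes' rule using the missingness indicator as the only evidence. All coefficient values are hypothetical and chosen purely for illustration.

```python
import math

def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-x))

def p_missing(z, x, alpha0, alpha1, gamma):
    """P(R_ij = 1 | z_i, X_i) under the logistic missingness model."""
    linear = alpha0 + alpha1 * z + sum(g * x_c for g, x_c in zip(gamma, x))
    return inv_logit(linear)

def posterior_given_missingness(r, x, pi, alpha0, alpha1, gamma):
    """P(z = 1 | R = r, X = x) when the indicator r is the only evidence."""
    def likelihood(z):
        p = p_missing(z, x, alpha0, alpha1, gamma)
        return p if r == 1 else 1.0 - p
    numerator = pi * likelihood(1)
    return numerator / (numerator + (1.0 - pi) * likelihood(0))

# Hypothetical coefficients: the lab is missing less often in class 1.
coef = dict(alpha0=0.5, alpha1=-1.5, gamma=[0.2])
prior_pi = 0.4
x_i = [0.0]
post_if_missing = posterior_given_missingness(1, x_i, prior_pi, **coef)
post_if_observed = posterior_given_missingness(0, x_i, prior_pi, **coef)
```

With a negative class coefficient, a missing lab pulls the class-1 probability below the prior and an observed lab pushes it above. Setting the class coefficient to zero leaves the prior unchanged, which corresponds to missingness that is uninformative about class.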

4. Posterior Inference and Class Assignment

Posterior exploration is performed using Hamiltonian Monte Carlo (the No-U-Turn Sampler, NUTS, as implemented in Stan). Because HMC operates on continuous parameters, the discrete latent class variable is marginalized out of the log-likelihood. Good posterior mixing is indicated by convergence diagnostics ($\hat{R}<1.1$ for all parameters, effective sample sizes $\gtrsim 1{,}000$). In the application to 44,642 adult asthma patients, the framework yielded a posterior distribution of $P(z_i=1\mid\text{data})$ that was bimodal, with density clustering near 0.03 and 0.98, signifying robust class separation.
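For readers unfamiliar with the $\hat{R}$ diagnostic, the sketch below computes the classic Gelman-Rubin potential scale reduction factor for one parameter from several equal-length chains. It is an illustration only: Stan's own diagnostics use refined split-chain variants, and the simulated chains here are hypothetical stand-ins for real sampler output.

```python
import math
import random
import statistics

def gelman_rubin_rhat(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])   # W
    between = n * statistics.variance(chain_means)                        # B
    var_hat = (n - 1) / n * within + between / n                          # pooled estimate
    return math.sqrt(var_hat / within)

random.seed(1)
# Four hypothetical chains that all sample the same N(0, 1) target.
mixed = [[random.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Two hypothetical chains stuck in different regions.
stuck = [[random.gauss(0.0, 1.0) for _ in range(1000)],
         [random.gauss(3.0, 1.0) for _ in range(1000)]]

rhat_mixed = gelman_rubin_rhat(mixed)
rhat_stuck = gelman_rubin_rhat(stuck)
```

Chains that agree give a value close to 1, while chains centred on different values inflate the between-chain variance and push the statistic well above the 1.1 threshold.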

The population fraction for the T2-inflammation-informed class (posterior mean) was 38.7%, with a 95% credible interval for the mixing weight $\pi$ of (37.9%, 39.5%). Under a hard threshold at $P(z_i=1\mid\text{data})\geq 0.5$, 36.3% of patients were classified as T2-high.
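The marginalization and the resulting class probabilities can be sketched as follows. The helper names and the input values are hypothetical; the class-conditional log-likelihoods are assumed to be computed elsewhere and to exclude the class prior.

```python
import math

def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def marginal_and_posterior(log_lik_z0, log_lik_z1, pi):
    """Return (marginal log-likelihood, P(z=1 | data)) for one subject."""
    a0 = math.log(1.0 - pi) + log_lik_z0
    a1 = math.log(pi) + log_lik_z1
    log_marginal = log_sum_exp(a0, a1)      # latent class summed out
    return log_marginal, math.exp(a1 - log_marginal)

def summarize_assignments(post_probs, threshold=0.5):
    """Return (mean posterior probability, fraction at or above threshold)."""
    soft = sum(post_probs) / len(post_probs)
    hard = sum(p >= threshold for p in post_probs) / len(post_probs)
    return soft, hard

log_marginal, p_class1 = marginal_and_posterior(-10.0, -8.0, 0.4)

# Hypothetical posterior probabilities for six patients.
probs = [0.03, 0.02, 0.45, 0.98, 0.97, 0.40]
soft_fraction, hard_fraction = summarize_assignments(probs)
```

In this toy example the average posterior probability exceeds the hard-threshold fraction, because patients just below 0.5 still contribute probability mass to the soft summary. That is one way a gap between a mixing-weight estimate and a thresholded prevalence can arise.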

5. Phenotype Discovery and Clinical Characterization

The class characterized by informative priors for T2-inflammation ("T2-informed" class, $z=1$ ) exhibited:

- Elevated blood eosinophils (mean standardized score $+0.22$ for $z=1$ vs. $-0.18$ for $z=0$)
- Higher total IgE (median back-transformed $245.2$ kU/L vs. $98.7$ kU/L for non-T2)
- Higher rates of skin-prick positivity ($4.0\%$ vs. $0.1\%$), allergy ICD codes ($62.6\%$ vs. $23.4\%$), and ICS/LABA prescriptions ($85.9\%$ vs. $42.1\%$)
- Greater health-care utilization: asthma encounters per year ($4.06$ vs. $1.80$), ED visits ($0.18$ vs. $0.05$), SABA prescriptions ($3.12$ vs. $1.67$)

Despite weak priors on utilization features, the latent class structure identified an "uncontrolled T2-high" sub-phenotype that combines classic biomarker elevations with high clinical burden, aligning with evidence-based treatment stratification for biologic therapies. This suggests that even without a direct prior on utilization, informative biomarker priors can propagate to clinically meaningful secondary axes.

| Feature | $z=0$ (Non-T2) | $z=1$ (T2-informed) |
|---|---|---|
| Eosinophils (std) | $-0.18$ ($[-0.19,-0.17]$) | $+0.22$ ($[0.21,0.23]$) |
| Total IgE (kU/L) | $98.7$ ($[92.3,105.4]$) | $245.2$ ($[237.1,253.7]$) |
| Asthma encounters/year | $1.80$ ($[1.77,1.83]$) | $4.06$ ($[3.95,4.17]$) |
| ICS/LABA Rx (%) | $42.1\%$ ($[41.7,42.5]$) | $85.9\%$ ($[85.5,86.3]$) |

6. Interpretability, Uncertainty, and Cohort Assignment

The Bayesian framework supports probabilistic patient-level class assignment, yielding interpretable cohort definitions with quantified uncertainty. This is fundamentally different from hard-threshold rule-based phenotyping or ad hoc clustering: each individual receives a posterior probability of latent class membership, which can be thresholded for high-purity cohort selection (e.g., $P(z_i=1\mid\text{data})\geq 0.9$), used in downstream clinical decision support, or leveraged for hypothesis generation (e.g., identifying new utilization phenotypes within T2-high patients).
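Threshold-based cohort selection can be sketched as below. The function name and the example probabilities are hypothetical. The "expected purity" reported here is simply the mean posterior probability among selected patients, which is the model-implied share of true class members only if the model is well calibrated.

```python
def select_cohort(post_probs, threshold=0.9):
    """Return (selected indices, expected purity) for a posterior threshold.

    Expected purity is the mean posterior probability of the selected
    patients, or None when nobody meets the threshold.
    """
    selected = [i for i, p in enumerate(post_probs) if p >= threshold]
    if not selected:
        return selected, None
    purity = sum(post_probs[i] for i in selected) / len(selected)
    return selected, purity

# Hypothetical posterior probabilities for six patients.
post_probs = [0.03, 0.55, 0.92, 0.98, 0.40, 0.95]
strict_ids, strict_purity = select_cohort(post_probs, threshold=0.9)
loose_ids, loose_purity = select_cohort(post_probs, threshold=0.5)
```

Raising the threshold trades cohort size for purity: the stricter cut keeps fewer patients but with a higher average membership probability, which is the trade-off relevant to uses such as trial enrichment.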

This plausibly makes the approach well suited to precision medicine studies, clinical trial enrichment, and algorithmic clinical pathway development in diseases such as asthma, where gold-standard endotypes are not universally established.

7. Implications and Extensions for Biomedical Research

Embedding expert priors into a flexible latent class machinery enables the simultaneous discovery and validation of clinically plausible phenotypes directly from large EHR datasets. Explicit modeling of informative missingness enables use in real-world data, where measurement patterns often convey diagnostic cues. Because the framework is implemented in a general-purpose Bayesian engine (Stan NUTS), it can be extended to richer data types (e.g., time series), multi-class settings, varied outcome spaces, and joint modeling with supervised endpoints.

In summary, the Bayesian latent class framework formalizes unsupervised, prior-informed clustering with uncertainty quantification, interpretable phenotype recovery, and robust handling of missing and heterogeneous clinical data. Its application to asthma EHRs demonstrates identification of a well-separated, clinically interpretable "uncontrolled T2-high" phenotype, supporting both comprehensive cohort discovery and evidence-driven hypothesis generation (Mayer et al., 3 Nov 2025).


References (1)

Enhancing Phenotype Discovery in Electronic Health Records through Prior Knowledge-Guided Unsupervised Learning (2025)
