NPCNet: Deep Clustering for Sepsis Phenotyping

Updated 10 February 2026

NPCNet is a deep clustering network that encodes temporal EHR data as pseudo text to preserve sequence fidelity for sepsis phenotyping.
It integrates a navigator module that uses outcome labels to drive cluster formation, aligning phenotypes with clinical endpoints such as discharge status and mortality.
Evaluation shows NPCNet produces distinct phenotypes with improved clustering metrics and informs treatment timing decisions like vasopressor administration.

NPCNet (Navigator-Driven Pseudo Text for Deep Clustering of Early Sepsis Phenotyping) is a deep clustering network designed to uncover clinically actionable sepsis phenotypes directly from temporal electronic health records (EHR), overcoming limitations in prior clustering approaches that suffer from lossy aggregation, imputation artifacts, and lack of clinical outcome alignment. NPCNet introduces a pseudo-text encoding for temporal EHR variables and a navigator module to steer cluster formation toward clinical significance, enabling identification of distinct patient subgroups with divergent prognostic and treatment response profiles (Tsai et al., 3 Feb 2026).

1. Problem Motivation and Conceptual Framework

Sepsis is a heterogeneous clinical syndrome with substantial variation in host response, progression, and outcomes. Previous patient clustering studies typically aggregate time-series EHR data into summary features, discard measurement frequency and ordering, impute missing values (which can introduce bias), and lack outcome-driven guidance that would make clusters clinically interpretable. NPCNet addresses these gaps by:

Encoding the complete sequence of time-stamped measurements ("pseudo text"), preserving temporal fidelity without imputation.
Injecting clinical relevance into clustering via a navigator module that incorporates outcome labels (e.g., discharge status) during training.

The primary objective is to achieve clusters (phenotypes) that are compact, well-separated, and maximally informative for prognosis and tailored interventions, such as vasopressor timing.

2. Model Architecture and Components

NPCNet comprises three major modules:

2.1 Text Embedding Generator

Binning and Tokenization: Each time-varying variable is mapped to $B$ quantile bins ( $\text{bin}(x)\in\{0,1,\dotsc,B-1\}$ ), forming tokens of the form "VARIABLE–BIN" ordered by timestamp into a pseudo text of length $l$ .
Embedding and Fusion:
- Token Embeddings: $P\in\mathbb{R}^{l\times d}$ retrieved via learned vocabulary.
- Order Encoding: $O\in\mathbb{R}^{l\times d}$ , standard positional encodings.
- Static Features: $S\in\mathbb{R}^d$ , summed category embeddings for demographics/comorbidities.
- Fusion Formula: $x = w \cdot (P + O) + (1-w) \cdot S$ with $w\in[0,1]$ balancing temporal and static contributions.
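
The binning, tokenization, and fusion steps above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function names, the use of `statistics.quantiles` to choose cut points, the ASCII hyphen in the token, and the broadcast of $S$ across all $l$ positions are assumptions made here for the example.

```python
from bisect import bisect_right
from statistics import quantiles


def fit_bin_edges(values, n_bins):
    """Return the n_bins - 1 interior quantile cut points for one variable."""
    return quantiles(values, n=n_bins)


def to_token(variable, value, edges):
    """Map one measurement to a 'VARIABLE-BIN' token with BIN in 0..B-1."""
    return f"{variable}-{bisect_right(edges, value)}"


def pseudo_text(events, edges_by_var):
    """events: (timestamp, variable, value) tuples. Returns tokens in time order.

    Only measurements that were actually recorded appear, so nothing is imputed.
    """
    ordered = sorted(events, key=lambda event: event[0])
    return [to_token(var, val, edges_by_var[var]) for _, var, val in ordered]


def fuse(P, O, S, w):
    """x = w * (P + O) + (1 - w) * S, with S added to every position (assumed)."""
    return [
        [w * (p + o) + (1 - w) * s for p, o, s in zip(p_row, o_row, S)]
        for p_row, o_row in zip(P, O)
    ]


# Hypothetical training values used only to fit the bin edges (B = 4).
history = {
    "LACTATE": [0.8, 1.0, 1.2, 1.5, 2.0, 2.5, 3.0, 4.0],
    "HR": [60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 120.0, 130.0],
}
edges_by_var = {var: fit_bin_edges(vals, 4) for var, vals in history.items()}

# Two measurements for one patient, deliberately out of time order.
events = [(2, "LACTATE", 3.2), (1, "HR", 95.0)]
print(pseudo_text(events, edges_by_var))
```

Because tokens are emitted only for recorded measurements, the length $l$ of the pseudo text varies per patient and reflects how often each variable was sampled.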

2.2 Deep Clustering Operator

Encoder $f_\theta$ and Decoder $g_\phi$ : Standard autoencoder over fused embeddings.
Latent Representations: $E\in\mathbb{R}^d$ for each patient.
Cluster Centroids and Assignments: $M\in\mathbb{R}^{k\times d}$ and $s_i\in\{0,1\}^k$ (one-hot).

2.3 Target Navigator

Two auxiliary supervision heads guide latent representations:

Probability Head: Linear-softmax layer predicts discharge status $y\in\{0,1\}$ (alive/dead).
Distance Head: Triplet sampling that penalizes embeddings in which an anchor lies closer to a negative (different $y$ ) than to a positive (same $y$ ), up to a margin.

3. Training Objectives and Optimization

NPCNet optimizes a multi-term loss:

Reconstruction Loss:

$\mathcal{L}_{\mathrm{rec}} = \sum_{i=1}^N \|x_i - \hat{x}_i\|_2^2$

Clustering Loss (k-means):

$\mathcal{L}_{\mathrm{clustering}} = \sum_{i=1}^N \|E_i - M^\top s_i\|_2^2, \;\; s_i \in \{0,1\}^k,\; \mathbf{1}^\top s_i = 1$

Navigator Loss (sum of):
- Focal Probability Loss:

$\mathcal{L}_{\mathrm{prob}} = -\frac{1}{N}\sum_{j=1}^N\sum_{i=1}^c w_i(1-p_{t,i}^j)^\gamma \log p_{t,i}^j$

- Triplet Distance Loss:

$\mathcal{L}_{\mathrm{dist}} = \max\left\{d(a,p) - d(a,n) + \text{margin},\, 0\right\}$

- Combined:

$\mathcal{L}_{\mathrm{navigator}} = \kappa_1 \mathcal{L}_{\mathrm{prob}} + \kappa_2 \mathcal{L}_{\mathrm{dist}}$

Total:

$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{rec}} + \lambda_2 \mathcal{L}_{\mathrm{clustering}} + \lambda_3 \mathcal{L}_{\mathrm{navigator}}$
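
To make the loss terms concrete, the following sketch evaluates each of them on plain Python lists. It is not the authors' code. Three points are assumptions made here: $p_{t,i}^j$ is read as $p_i^j$ when class $i$ is the true class of patient $j$ and $1-p_i^j$ otherwise, $d$ is taken to be Euclidean distance, and the default weights ($\gamma$, margin, $\kappa$, $\lambda$) are placeholders rather than values from the source.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def reconstruction_loss(X, X_hat):
    """Sum over patients of ||x_i - x_hat_i||^2."""
    return sum(sq_dist(x, xh) for x, xh in zip(X, X_hat))


def clustering_loss(E, M, assign):
    """Sum of squared distances to the assigned centroid.

    assign[i] is the position of the single 1 in the one-hot vector s_i.
    """
    return sum(sq_dist(e, M[a]) for e, a in zip(E, assign))


def focal_probability_loss(probs, labels, class_weights, gamma=2.0):
    """Focal loss under the per-class reading of p_t described above (assumed)."""
    total = 0.0
    for p, y in zip(probs, labels):
        for i, p_i in enumerate(p):
            p_t = p_i if i == y else 1.0 - p_i
            p_t = max(p_t, 1e-12)  # guard against log(0)
            total -= class_weights[i] * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def triplet_distance_loss(anchor, positive, negative, margin=1.0):
    """max(d(a, p) - d(a, n) + margin, 0) with Euclidean d (assumed)."""
    return max(math.dist(anchor, positive) - math.dist(anchor, negative) + margin, 0.0)


def total_loss(rec, clu, prob, dist, lambdas=(1.0, 1.0, 1.0), kappas=(1.0, 1.0)):
    """lambda_1 * rec + lambda_2 * clu + lambda_3 * (kappa_1 * prob + kappa_2 * dist)."""
    navigator = kappas[0] * prob + kappas[1] * dist
    return lambdas[0] * rec + lambdas[1] * clu + lambdas[2] * navigator
```

The triplet term is zero whenever the negative is already at least `margin` farther from the anchor than the positive, so it only contributes gradient for triplets that violate the desired ordering.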

Alternating optimization steps update autoencoder parameters and perform k-means clustering in latent space. The navigator acts as an outcome-driven regularizer, shaping cluster geometry beyond intra-cluster spread minimization.
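
The k-means half of this alternation can be sketched as follows, with the latent embeddings held fixed. The encoder and decoder update (a gradient step on the total loss) is omitted, and the function names and the handling of empty clusters are assumptions of this example rather than details from the source.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def assign_clusters(E, M):
    """Hard assignment: index of the nearest centroid for each embedding."""
    return [min(range(len(M)), key=lambda c: sq_dist(e, M[c])) for e in E]


def update_centroids(E, assign, M):
    """Move each centroid to the mean of its members.

    A centroid with no members is left where it was (an assumption).
    """
    new_M = []
    for c, old in enumerate(M):
        members = [e for e, a in zip(E, assign) if a == c]
        if members:
            new_M.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_M.append(list(old))
    return new_M


# Toy latent embeddings and two initial centroids.
E = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
M = [[1.0, 1.0], [9.0, 9.0]]

assign = assign_clusters(E, M)       # step 1: fix centroids, update assignments
M = update_centroids(E, assign, M)   # step 2: fix assignments, update centroids
print(assign, M)
```

In a full training loop these two steps would be interleaved with updates to the autoencoder, so the embeddings themselves keep moving between k-means steps.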

4. Deep Clustering Methodology and Phenotype Identification

Clustering is conducted in the learned embedding space with four clusters ( $k=4$ ), yielding phenotypes $\alpha, \beta, \gamma, \delta$ . The identified groups exhibit:

$\alpha$ : Younger, few comorbidities, lab derangements but rapidly improving SOFA, 1.4% in-hospital mortality.
$\beta$ : Moderate lab/inflammatory abnormalities, intermediate risk, 8.1% mortality.
$\gamma$ : Inflammatory and renal dysfunction, moderate SOFA worsening, 21.3% mortality.
$\delta$ : Older, high comorbidity, highest early SOFA (median 11), 46.2% in-hospital, 70.3% one-year mortality.

SOFA (Sequential Organ Failure Assessment) trajectories, computed hourly over 7–24 hours post-ICU admission, reveal marked divergence among the phenotypes. Significance of divergence is measured by the Trajectory Divergence Index (TDI):

$\text{TDI} = \frac{\text{\# of pairwise phenotype comparisons with significant GAMM difference}}{\text{total \# of pairs} \times \text{time points}}$

NPCNet achieved the highest TDI among methods evaluated, indicating distinct trajectory separation (Tsai et al., 3 Feb 2026).
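
A minimal sketch of the TDI ratio is shown below. It assumes the GAMM comparisons have already been run elsewhere and reduced to one boolean per phenotype pair and time point; the function name and the input layout are hypothetical.

```python
from itertools import combinations


def trajectory_divergence_index(significant, phenotypes, n_time_points):
    """Fraction of (pair, time point) comparisons flagged as significant.

    significant maps each unordered phenotype pair, written in the order the
    phenotypes are listed, to a list of n_time_points booleans.
    """
    pairs = list(combinations(phenotypes, 2))
    hits = sum(sum(bool(flag) for flag in significant[pair]) for pair in pairs)
    return hits / (len(pairs) * n_time_points)


# Toy example with three phenotypes and two time points.
flags = {
    ("alpha", "beta"): [True, False],
    ("alpha", "gamma"): [True, True],
    ("beta", "gamma"): [False, False],
}
print(trajectory_divergence_index(flags, ["alpha", "beta", "gamma"], 2))
```

With $k=4$ phenotypes there are six pairs, so the denominator is six times the number of time points evaluated.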

5. Evaluation, Ablation, and Comparative Results

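The model was benchmarked using both internal clustering metrics and clinical outcomes. Higher Silhouette and Calinski–Harabasz values and lower Davies–Bouldin values indicate more compact, better-separated clusters: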

Metric	NPCNet	DCN	Nesti-Net	PCPNet	PCA
Silhouette Index (SI)	0.447	0.344	—	—	—
Calinski–Harabasz Index	2.051	1.482	—	—	—
Davies–Bouldin Index	0.670	1.140	—	—	—

Kaplan–Meier survival curves confirmed ordered risk across phenotypes $(\alpha<\beta<\gamma<\delta)$ .
Ablation analyses demonstrated superiority of binning plus order encoding, Transformer-based pseudo text backbones, navigator with dual losses, and the use of discharge status as the navigator target.
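
For readers unfamiliar with the internal metrics, the sketch below computes the Silhouette and Davies–Bouldin indices from their standard definitions using Euclidean distance. It is a plain-Python illustration on toy data, not the evaluation code behind the reported numbers, and it assumes at least two clusters.

```python
from math import dist
from statistics import mean


def _groups(X, labels):
    """Collect points by cluster label."""
    groups = {}
    for x, lab in zip(X, labels):
        groups.setdefault(lab, []).append(x)
    return groups


def silhouette_index(X, labels):
    """Mean of (b - a) / max(a, b) over points. Higher is better."""
    groups = _groups(X, labels)
    scores = []
    for x, lab in zip(X, labels):
        own = groups[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members (x itself contributes 0)
        a = sum(dist(x, y) for y in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(mean(dist(x, y) for y in pts)
                for other, pts in groups.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)


def davies_bouldin_index(X, labels):
    """Mean over clusters of the worst (S_i + S_j) / d(c_i, c_j). Lower is better."""
    groups = _groups(X, labels)
    centroids = {lab: [mean(col) for col in zip(*pts)] for lab, pts in groups.items()}
    scatter = {lab: mean(dist(p, centroids[lab]) for p in pts)
               for lab, pts in groups.items()}
    worst = [max((scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
                 for j in groups if j != i)
             for i in groups]
    return mean(worst)


# Two tight, well-separated toy clusters on a line.
X = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
print(silhouette_index(X, labels), davies_bouldin_index(X, labels))
```

On this toy data the Silhouette index is close to 1 and the Davies–Bouldin index is close to 0, which is the direction of improvement for each metric.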

6. Treatment Effect Analysis and Clinical Implications

A treatment effect study in 2,013 patients receiving norepinephrine for early hypotension investigated interactions between NPCNet phenotypes and responses to vasopressor timing and fluid volume:

Fluid Volume: No significant effect on mortality in any phenotype group.
Time to Vasopressor: Each 1 h delay increased the odds of death by 31.8% in $\alpha$ , 15.6% in $\beta$ , and 16.0% in $\delta$ ; no effect in $\gamma$ . A sensitivity analysis reported E-values of 1.36–1.56, which quantify how strong an unmeasured confounder would have to be to explain away these associations.
These findings suggest that phenotype assignment via NPCNet has implications for precision timing of vasopressor therapy.
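
The arithmetic behind these figures can be illustrated with a short sketch. Two assumptions are made here that the source does not state: the per-hour effect is treated as multiplicative on the odds (as in a logistic model with a linear delay term), and the E-value is computed with the VanderWeele–Ding point-estimate formula $\text{RR} + \sqrt{\text{RR}(\text{RR}-1)}$ using the square-root approximation $\text{RR} \approx \sqrt{\text{OR}}$ for common outcomes.

```python
import math


def odds_multiplier(pct_increase_per_hour, hours):
    """Cumulative odds ratio after `hours` of delay (multiplicative effect assumed)."""
    return (1 + pct_increase_per_hour / 100) ** hours


def e_value_from_odds_ratio(odds_ratio):
    """E-value for an odds ratio >= 1 with a common outcome (assumed method)."""
    rr = math.sqrt(odds_ratio)  # square-root approximation of the risk ratio
    return rr + math.sqrt(rr * (rr - 1))


# Phenotype alpha: 31.8% higher odds of death per hour of delay.
print(round(odds_multiplier(31.8, 3), 2))           # odds ratio for a 3 h delay
print(round(e_value_from_odds_ratio(1.318), 2))     # per-hour E-value, alpha
print(round(e_value_from_odds_ratio(1.156), 2))     # per-hour E-value, beta
```

Under these assumptions the per-hour odds ratios for $\alpha$ and $\beta$ give E-values of about 1.56 and 1.36, which line up with the reported range, and a 3 h delay in $\alpha$ corresponds to roughly 2.3 times the baseline odds of death.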

7. Limitations and Future Directions

NPCNet currently relies on conventional labs and vitals, omitting multi-omics and advanced biomarkers. The TDI captures only whether trajectories separate, not the magnitude of that separation. The retrospective study design may be affected by timestamp artifacts, and the intervention analysis is limited to vasopressor-exposed patients. Future research directions include expansion to heterogeneous data (e.g., genomics), advanced metrics of clinical divergence, and prospective trials to validate phenotype-guided treatment interventions (Tsai et al., 3 Feb 2026).

References (1)

NPCNet: Navigator-Driven Pseudo Text for Deep Clustering of Early Sepsis Phenotyping (2026)
