GPT-PPG: Transformer for Photoplethysmography

Updated 2 June 2026

GPT-PPG is a generative transformer designed for continuous photoplethysmography signals, offering improved clinical prediction and signal reconstruction.
It employs an autoregressive, patch-based architecture that processes 30-second PPG segments to capture structured temporal patterns effectively.
The model supports versatile applications including arrhythmia detection, heart rate estimation, and zero-shot denoising, while incorporating bias-aware fine-tuning strategies.

GPT-PPG (Generative Pre-trained Transformer for Photoplethysmography) refers to a class of large-scale, generative, transformer-based foundation models tailored for photoplethysmography (PPG) signals. These models adapt autoregressive transformer architectures—originally developed for natural language—to model the continuous, highly structured temporal data represented by PPG waveforms, enabling general-purpose signal understanding, downstream health prediction, and generative tasks such as denoising or inpainting. GPT-PPG models have established new benchmarks for PPG representation learning, demonstrating versatility across clinical and wearable domains.

1. Model Architecture and Pretraining Paradigm

GPT-PPG adopts a unidirectional, decoder-only transformer design, with modifications suited to continuous-valued biosignals. Each 30 s PPG segment (acquired at 40 Hz; 1200 samples) is divided into 30 non-overlapping “patches” (1 s, 40 samples each). Each patch $x_i\in\mathbb{R}^{40}$ is linearly embedded into a $d$-dimensional vector $h_i\in\mathbb{R}^d$, and a learnable start-of-sequence vector $h_s$ is prepended. The embeddings pass through $N$ stacked transformer decoder layers with rotary positional encoding. Each layer combines multi-head causal self-attention, root-mean-square (RMS) normalization, and a two-layer feed-forward subnetwork with GELU or SiLU activation. The causal mask enforces autoregressive dependence, so each patch is modeled as $p(x_i\mid x_{<i})$.
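The input pipeline described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the embedding width `D_MODEL`, the random weights, and the zero start vector are all assumptions made for the demo, and the transformer layers themselves are omitted.

```python
import random

FS_HZ = 40        # sampling rate stated in the text
SEGMENT_S = 30    # segment length in seconds
PATCH_LEN = 40    # one-second patch at 40 Hz
D_MODEL = 8       # hypothetical embedding width, kept small for the demo


def patchify(segment, patch_len=PATCH_LEN):
    """Split a 1-D segment into non-overlapping patches of patch_len samples."""
    if len(segment) % patch_len != 0:
        raise ValueError("segment length must be a multiple of patch_len")
    return [segment[k:k + patch_len] for k in range(0, len(segment), patch_len)]


def linear_embed(patch, weight, bias):
    """Compute h = W x + b, where weight holds D_MODEL rows of PATCH_LEN entries."""
    return [sum(w * x for w, x in zip(row, patch)) + b for row, b in zip(weight, bias)]


def causal_mask(n_tokens):
    """mask[query][key] is True when the query position may attend to the key position."""
    return [[key <= query for key in range(n_tokens)] for query in range(n_tokens)]


rng = random.Random(0)
segment = [rng.random() for _ in range(FS_HZ * SEGMENT_S)]   # synthetic stand-in signal
weight = [[rng.gauss(0.0, 0.02) for _ in range(PATCH_LEN)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL
start_vector = [0.0] * D_MODEL   # stands in for the learnable h_s

patches = patchify(segment)
tokens = [start_vector] + [linear_embed(p, weight, bias) for p in patches]
mask = causal_mask(len(tokens))
print(len(patches), len(tokens), len(tokens[0]))   # 30 31 8
```

The lower-triangular mask is what restricts each position to its own and earlier tokens, which is the property the autoregressive factorization relies on.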

Unlike discrete-token GPTs, GPT-PPG must model real values in $(0,1)$ . The pretraining objective maximizes the log-likelihood of observed data under an autoregressive logit-Laplace density:

$L_{\text{phsch}} = -\sum_{i=1}^N\sum_{j=1}^{40} \log f(x_{i,j}\mid\mu_{i,j},b_{i,j}),$

where $f(x; \mu, b) = \frac{1}{2bx(1-x)}\exp(-|\text{logit}(x) - \mu| / b)$, $i$ indexes patches, $j$ indexes samples within a patch, and $(\mu_{i,j}, b_{i,j})$ are the model outputs for each signal sample. Inputs are min–max normalized to $[0,1]$, then linearly remapped to a slightly narrower interval strictly inside $(0,1)$ to avoid loss singularities at the interval boundaries (Chen et al., 11 Mar 2025).
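To make the objective concrete, the following sketch evaluates the per-sample negative log-likelihood of the logit-Laplace density given above. The margin `eps` used for the remapping is a hypothetical value chosen for illustration; the source does not state the interval used here.

```python
import math


def logit(x):
    """Log-odds of x for x strictly inside (0, 1)."""
    return math.log(x / (1.0 - x))


def remap(segment, eps=0.05):
    """Min-max normalize to [0, 1], then squeeze into [eps, 1 - eps] (eps is assumed)."""
    lo, hi = min(segment), max(segment)
    if hi == lo:
        raise ValueError("a constant segment cannot be min-max normalized")
    return [eps + (1.0 - 2.0 * eps) * (v - lo) / (hi - lo) for v in segment]


def logit_laplace_nll(x, mu, b):
    """-log f(x; mu, b) with f = exp(-|logit(x) - mu| / b) / (2 b x (1 - x))."""
    return math.log(2.0 * b * x * (1.0 - x)) + abs(logit(x) - mu) / b


def segment_loss(xs, mus, bs):
    """Sum the per-sample negative log-likelihoods over a flattened segment."""
    return sum(logit_laplace_nll(x, m, b) for x, m, b in zip(xs, mus, bs))


xs = remap([1.0, 2.0, 3.0])
loss = segment_loss(xs, mus=[0.0, 0.0, 0.0], bs=[1.0, 1.0, 1.0])
```

The `log(2 b x (1 - x))` term is the part that diverges as `x` approaches 0 or 1, which is why the inputs are kept away from the interval boundaries.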

Pretraining employs >200 million 30 s PPG segments (2.6 million hours) from ICU monitors resampled at 40 Hz, without explicit augmentation apart from optional patch masking. Model capacity ranges from 19 M to 1 B parameters.
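
As a quick consistency check on these figures, the stated duration can be converted into an equivalent number of 30 s segments (plain arithmetic on the numbers above, with no additional data assumed):

$2.6\times10^{6}\ \text{h} \times \frac{3600\ \text{s/h}}{30\ \text{s/segment}} = 3.12\times10^{8}\ \text{segments},$

which is consistent with the stated lower bound of more than 200 million segments.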

2. Supervised Fine-tuning and Mixed-Objective Head

Fine-tuning for downstream regression or classification attaches two heads in parallel:

(a) Signal modeling head: A linear layer maps hidden states to $(\mu_{i,j}, b_{i,j})$ pairs for all signal samples, enabling reconstruction via the logit-Laplace loss.
(b) Prediction head: Final hidden states undergo attention pooling, then a gated MLP with SiLU activations and multiplicative gating predicts the downstream target.
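
The prediction head in (b) can be illustrated with a small sketch. The exact pooling and gating formulas are not given in this text, so the version below assumes a common form: softmax attention pooling against a single learned query vector, followed by a gated MLP in which a SiLU-activated branch multiplies a linear branch. The names `query`, `w_gate`, `w_up`, and `w_down` are hypothetical.

```python
import math


def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def attention_pool(hidden_states, query):
    """Softmax-weighted average of hidden states, scored against one query vector."""
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(w * state[k] for w, state in zip(weights, hidden_states))
              for k in range(dim)]
    return pooled, weights


def gated_mlp(x, w_gate, w_up, w_down):
    """Project down the elementwise product of a SiLU gate and a linear branch."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])


hidden_states = [[1.0, 0.0], [0.0, 1.0]]
pooled, weights = attention_pool(hidden_states, query=[0.0, 0.0])
prediction = gated_mlp([1.0, 2.0], w_gate=[[1.0, 0.0]], w_up=[[0.0, 1.0]], w_down=[[1.0]])
```

With a zero query the pooling weights are uniform, so `pooled` is the plain average of the hidden states; a trained query would instead emphasize the most informative positions.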

The total fine-tuning loss combines the task loss ($L_{\text{task}}$; MSE for regression, cross-entropy for classification) with the generative logit-Laplace loss ($L_{\text{phsch}}$) for signal modeling:

$L_{\text{total}} = L_{\text{task}} + \lambda L_{\text{phsch}},$

where the weight $\lambda$ is annealed to zero during training. Extensions include bidirectional feature extraction via mask-and-reconstruct training with bidirectional attention (for richer features); parameter-efficient fine-tuning that freezes the transformer layers, used as a fallback; and test-time personalized domain adaptation through self-supervised alignment on 5–10% of the test PPG (Chen et al., 11 Mar 2025).
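
The annealing of the generative term can be sketched as follows. This assumes the total loss is the task loss plus a weighted generative loss, and it assumes a linear decay of the weight; the source only states that the weight is annealed to zero, so the schedule shape, `lambda_start`, and the function names are illustrative choices.

```python
def lambda_schedule(step, total_steps, lambda_start=1.0):
    """Linearly decay the generative-loss weight from lambda_start to 0 (assumed shape)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lambda_start * (1.0 - progress)


def mixed_loss(task_loss, generative_loss, lam):
    """Combine the task loss with the weighted generative (signal modeling) loss."""
    return task_loss + lam * generative_loss


# Example: the generative term fades out as fine-tuning progresses.
for step in (0, 50, 100):
    lam = lambda_schedule(step, total_steps=100)
    print(step, lam, mixed_loss(task_loss=2.0, generative_loss=3.0, lam=lam))
```

Early in fine-tuning the generative term keeps the representation close to what pretraining learned; once the weight reaches zero, only the task loss drives the updates.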

3. Evaluation: Downstream Performance and Scaling

GPT-PPG establishes state-of-the-art performance (benchmarks without additional label-based filtering):

Atrial fibrillation (AF) detection (Stanford; F1):

[comparison values missing from the source text]

Heart rate estimation (MAE in bpm):

[comparison values missing from the source text]

Respiration rate estimation (MAE in brpm, 5-fold CV):

SiamQuality: 0.89 | GPT-1B: 0.93

Blood pressure estimation (MAE in mmHg):

[comparison values missing from the source text]

Scaling up model size from 19 M → 85 M → 345 M → 1 B yields monotonic but diminishing accuracy gains (most improvement ≤85 M parameters) (Chen et al., 11 Mar 2025).

4. Generative Modeling and Zero-Shot Denoising

GPT-PPG supports zero-shot generative denoising through its autoregressive next-patch prediction head. Masking a fraction of the input patches in sequence lets the model reconstruct the missing regions; for masking fractions up to a reported threshold, MAE remains <0.10, and qualitative fidelity degrades gracefully at higher fractions. Unlike models trained solely for classification, GPT-PPG does not require explicit denoising examples, because autoregressive pretraining with the logit-Laplace objective already enables signal inpainting in both in-distribution and out-of-distribution settings (Chen et al., 11 Mar 2025).
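
The masking-and-reconstruction procedure can be sketched with a toy stand-in for the model. Here `predict_next` is a hypothetical callable that returns the per-sample location parameters for the next patch given the preceding patches; the real model is replaced by `repeat_last_patch`, which simply assumes the next patch repeats the previous one. Decoding a value as `sigmoid(mu)` uses the fact that this is the median of the logit-Laplace density.

```python
import math


def logit(x):
    return math.log(x / (1.0 - x))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def inpaint(patches, masked, predict_next):
    """Fill masked patches left to right, conditioning on already filled patches."""
    filled = []
    for idx, patch in enumerate(patches):
        if idx in masked:
            mus = predict_next(filled)
            filled.append([sigmoid(m) for m in mus])   # median of the predicted density
        else:
            filled.append(list(patch))
    return filled


def masked_mae(reference, reconstructed, masked):
    """Mean absolute error computed over the masked patches only."""
    errors = [abs(a - b)
              for idx in sorted(masked)
              for a, b in zip(reference[idx], reconstructed[idx])]
    return sum(errors) / len(errors)


def repeat_last_patch(context):
    """Toy predictor (not the model): assume the next patch repeats the previous one."""
    return [logit(v) for v in context[-1]]


# Synthetic 6 s signal at 40 Hz with exactly one cycle per 1 s patch, values in [0.1, 0.9].
signal = [0.5 + 0.4 * math.sin(2.0 * math.pi * t / 40.0) for t in range(240)]
patches = [signal[k:k + 40] for k in range(0, 240, 40)]
masked = {2, 5}
restored = inpaint(patches, masked, repeat_last_patch)
error = masked_mae(patches, restored, masked)
```

Because the synthetic signal is perfectly periodic, the toy predictor reconstructs it almost exactly; with a real PPG signal the error would depend on how well the model captures beat-to-beat variation.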

5. Transfer, Fairness, and Bias-Aware Fine-Tuning

Naive fine-tuning of GPT-PPG (also termed PPG-GPT) markedly reduces error but can increase demographic disparities, especially under domain or hardware shifts. The FairTune framework investigates three bias-mitigation strategies using the 19 M model: inverse-frequency class weighting (IF), Group Distributionally Robust Optimization (GroupDRO), and adversarial debiasing (ADV). IF and GroupDRO substantially reduce gender-based fairness gaps (ΔMAE), with task MAE minimally affected. Embedding analyses (MMD, Silhouette) confirm that IF and GroupDRO decorrelate demographic structure in representation space. ADV yields modest fairness gains but is unstable with respect to hyperparameters. The principal recommendation is to apply inverse-frequency class weighting for consumer device deployment and to audit both MAE and subgroup gaps under realistic domain shift scenarios (Panchumarthi et al., 20 Sep 2025).
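
Two of the ingredients above, inverse-frequency weighting and the subgroup error gap, are easy to illustrate. The sketch below is not the FairTune code: it assumes weights are computed per demographic group as `n_samples / (n_groups * group_count)`, measures the gap as the difference between the largest and smallest per-group MAE, and shows the standard exponentiated group-weight update used in GroupDRO. All function names and the example values are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(groups):
    """Per-sample weights that give every group the same total weight."""
    counts = Counter(groups)
    n_samples, n_groups = len(groups), len(counts)
    return [n_samples / (n_groups * counts[g]) for g in groups]


def group_mae(y_true, y_pred, groups):
    """Mean absolute error computed separately for each group."""
    totals, counts = {}, Counter(groups)
    for t, p, g in zip(y_true, y_pred, groups):
        totals[g] = totals.get(g, 0.0) + abs(t - p)
    return {g: totals[g] / counts[g] for g in totals}


def mae_gap(per_group):
    """Fairness gap: worst-group MAE minus best-group MAE."""
    return max(per_group.values()) - min(per_group.values())


def groupdro_update(group_weights, group_losses, step_size):
    """Upweight groups with higher loss, then renormalize to sum to 1."""
    raw = {g: group_weights[g] * math.exp(step_size * group_losses[g])
           for g in group_weights}
    total = sum(raw.values())
    return {g: v / total for g, v in raw.items()}


groups = ["F", "M", "M", "M"]            # made-up labels for illustration
y_true = [60.0, 70.0, 80.0, 90.0]        # made-up heart rates in bpm
y_pred = [64.0, 71.0, 81.0, 89.0]
weights = inverse_frequency_weights(groups)
per_group = group_mae(y_true, y_pred, groups)
gap = mae_gap(per_group)
new_q = groupdro_update({"F": 0.5, "M": 0.5}, per_group, step_size=0.1)
```

In this toy example the under-represented group receives the largest sample weight and, because its error is higher, also gains weight under the GroupDRO update. Reporting the gap next to the overall MAE follows the auditing recommendation in the paragraph above.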

6. Connections to Broader Foundation Models and Future Directions

GPT-PPG is part of a broader ecosystem of PPG and multivariate biosignal foundation models, including MMR (Masked Multiscale Reconstruction), PaPaGei, and conventional masked-encoder transformers (Thukral et al., 18 Jan 2026, Pillai et al., 2024). While MMR leverages wavelet-domain masking and decoding to explicitly encode spectral-temporal structure, GPT-PPG's core advantage is in generative, autoregressive patch-level modeling—critical for downstream synthesis and zero-shot tasks. Both architectures achieve strong data efficiency, but MMR results suggest that multi-resolution features boost certain tasks (e.g., hypertension or arrhythmia detection).

Identified limitations for GPT-PPG include its ICU-centric pretraining set (which limits generalization across devices), the cost and impracticality of deploying large models (≫300 M parameters) at the edge, and incomplete transfer between clinical and wearable domains. Future work focuses on broader pretraining (multi-site, multi-device), robust contrastive or BERT-style masked pretraining, model distillation, and integration with more efficient patch-free sequence architectures (e.g., Mamba) (Chen et al., 11 Mar 2025).

7. Applications and Interpretability

GPT-PPG is applicable to a wide spectrum of PPG analysis tasks: clinical event prediction (e.g., in-hospital cardiac arrest; AUROC up to 0.82 one hour before the event), arrhythmia classification, cardiorespiratory vital sign prediction, and signal quality enhancement. Feature extractor–aggregator pipelines (e.g., FEAN) use GPT-PPG representations in cascades for structured event risk forecasting (Kataria et al., 12 Feb 2025). Embedding trajectories projected via dimensionality reduction reveal smooth paths corresponding to patient deterioration, which offers one avenue for model interpretability, though explicit feature attribution remains an open problem.

GPT-PPG and its derivatives thus constitute a general-purpose generative modeling approach for PPG biosignals, supporting classification, regression, generative synthesis, and domain-adaptive personalization, with strong benchmark performance and early evidence that bias-aware fine-tuning can support equitable real-world deployment.