PhysFormer: Transformer-based rPPG
- The paper introduces an end-to-end transformer architecture that aggregates local and global spatio-temporal features for robust rPPG estimation.
- PhysFormer++ employs a dual SlowFast pathway with periodic and cross-attention modules to capture complex temporal dynamics effectively.
- Spiking-PhysFormer leverages hybrid neural networks to drastically reduce power consumption while maintaining competitive accuracy across benchmark datasets.
Remote photoplethysmography (rPPG) refers to the estimation of physiological signals such as heart rate (HR), respiration frequency (RF), and heart rate variability (HRV) from facial videos without physical contact. The PhysFormer family of models—PhysFormer, PhysFormer++, and Spiking-PhysFormer—leverages architectural innovations in video transformers for robust rPPG measurement in challenging, real-world scenarios.
1. Motivation and Conceptual Foundations
PhysFormer addresses shortcomings in conventional rPPG approaches, which are limited either by handcrafted signal processing pipelines or by the local receptive fields typical of convolutional neural networks (CNNs). Standard CNN-based solutions capture only restricted spatio-temporal relationships, often failing to model the quasi-periodic behavior and subtle skin color changes induced by physiological signals. PhysFormer introduces an end-to-end video transformer architecture designed to aggregate both local and global spatio-temporal features directly from raw facial video inputs (Yu et al., 2021). PhysFormer++ extends this principle with a two-pathway SlowFast design to accommodate complex temporal structures (Yu et al., 2023). Spiking-PhysFormer explores the domain of hybrid neural networks by integrating spiking neural network (SNN) mechanisms, emphasizing power efficiency for edge deployment (Liu et al., 2024).
2. Model Architecture
PhysFormer
PhysFormer employs a hierarchical transformer structure with the following key components:
- Shallow Stem: Initial convolutional blocks extract coarse spatio-temporal features from the input video frames.
- Tube Tokenization: Partitions the shallow feature map into non-overlapping 3D tube tokens along the temporal, height, and width dimensions.
- Cascaded Temporal Difference Transformer Blocks: Built from Temporal Difference Multi-head Self-Attention (TD-MHSA) and Spatio-temporal Feed-forward (ST-FF) modules.
- rPPG Predictor Head: Applies temporal upsampling, spatial averaging, and a final projection to generate 1D physiological signals.
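The tube-tokenization step above can be made concrete with a small shape walk-through. The tube size (4, 4, 4) and feature-map dimensions below are illustrative assumptions, not the paper's exact configuration:

```python
# Hypothetical shape walk-through of PhysFormer's tube tokenization.
# Tube size and stem output dimensions are illustrative assumptions.

def tube_token_count(frames, height, width, tube=(4, 4, 4)):
    """Number of non-overlapping 3D tube tokens after tokenization."""
    tt, th, tw = tube
    assert frames % tt == 0 and height % th == 0 and width % tw == 0
    return (frames // tt) * (height // th) * (width // tw)

# A 160-frame clip with 32x32 stem features yields the token sequence
# length fed into the cascaded TD-MHSA blocks.
n_tokens = tube_token_count(160, 32, 32)
print(n_tokens)  # 40 * 8 * 8 = 2560
```

Because the tubes are non-overlapping, the sequence length scales inversely with the tube volume, which is what keeps self-attention over full video clips tractable.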
PhysFormer++
PhysFormer++ implements dual pathway encoding modeled after SlowFast networks. The “Slow” pathway processes lower temporal resolution with higher channel capacity while the “Fast” pathway samples frames more densely with fewer channels, capturing fine details. Temporal Difference Periodic Transformer and Temporal Difference Cross-Attention modules augment the self-attention mechanism with learnable contextual positional encodings and cross-stream feature fusion.
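The dual-pathway sampling can be sketched in a few lines; the stride value is an illustrative assumption, not the paper's configuration:

```python
# Minimal sketch of SlowFast-style dual temporal sampling as used
# conceptually in PhysFormer++: the Slow path subsamples frames, the
# Fast path keeps every frame. The stride of 4 is an assumption.

def slowfast_split(frames, slow_stride=4):
    slow = frames[::slow_stride]   # sparse temporal sampling
    fast = frames                  # dense temporal sampling
    return slow, fast

slow, fast = slowfast_split(list(range(16)), slow_stride=4)
print(len(slow), len(fast))  # 4 16
```

The Slow stream trades temporal resolution for channel capacity, while the Fast stream preserves the fine-grained frame-to-frame changes that carry the pulse signal.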
Spiking-PhysFormer
Spiking-PhysFormer employs a hybrid neural network approach:
- ANN-Based Patch Embedding Block: Embeds input patches via linear projection.
- SNN-Based Transformer Blocks: Central feature extraction stages leverage spike-driven parallel transformer blocks with a simplified self-attention mechanism that omits the value parameter, enabling threshold-based or sparse lookup operations.
- ANN-Based Predictor Head: Aggregates SNN-enhanced features and regresses physiological signals.
3. Core Computational Modules
Temporal Difference Convolution (TDC)
The TDC operation is defined as

$$\mathrm{TDC}(x)(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n)\, x(p_0 + p_n) + \theta \cdot \Big(-x(p_0) \sum_{p_n \in \mathcal{R}'} w(p_n)\Big),$$

where $p_0$ is the target spatio-temporal index, $\mathcal{R}$ and $\mathcal{R}'$ are the sampled neighborhoods across space and adjacent time, and $\theta$ weights the difference term. This TDC precedes the query and key projections in the TD-MHSA module, yielding features sensitive to temporal skin color changes.
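A one-dimensional sketch makes the structure of TDC clear: a vanilla convolution plus a theta-weighted term that subtracts the center value aggregated over the kernel taps. The 3-tap kernel and theta value are illustrative assumptions:

```python
# 1D sketch of Temporal Difference Convolution on a scalar signal:
# vanilla convolution plus a theta-weighted central-difference term.
# Kernel weights and theta are illustrative assumptions.

def tdc_1d(x, w, theta=0.7):
    """y[t] = sum_n w[n]*x[t+n-1] + theta * (-x[t] * sum(w)), 3-tap kernel."""
    out = []
    for t in range(1, len(x) - 1):
        vanilla = w[0] * x[t - 1] + w[1] * x[t] + w[2] * x[t + 1]
        diff = -x[t] * sum(w)  # difference term anchored at the centre
        out.append(vanilla + theta * diff)
    return out

# On a constant signal the difference term cancels the DC response
# (exactly, when theta = 1), so the output reacts mainly to temporal
# change -- the subtle frame-to-frame skin color variation.
print(tdc_1d([1.0, 1.0, 1.0, 1.0], w=[0.25, 0.5, 0.25], theta=1.0))  # [0.0, 0.0]
```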
Self-Attention and Feed-forward Layers
Within each TD-MHSA block, multi-head attention is calculated as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\mathsf{T}}}{\tau}\right)V,$$

using a temperature $\tau$ smaller than the default $\sqrt{d_k}$ scaling for increased sparsity, appropriate to quasi-periodic rPPG signals.
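The effect of the reduced temperature can be seen in a toy single-query example; the vectors and temperature values below are illustrative, not from the paper:

```python
import math

# Toy illustration of temperature in scaled attention: a smaller tau
# sharpens (sparsifies) the softmax over keys. Values are assumptions.

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_weights(q, keys, tau):
    """Attention of one query over keys: softmax(q . k / tau)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / tau for k in keys]
    return softmax(scores)

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
sharp = attention_weights(q, keys, tau=0.5)   # small tau -> peaky weights
smooth = attention_weights(q, keys, tau=4.0)  # large tau -> flatter weights
print(sharp[0] > smooth[0])  # True
```

Sharper attention concentrates mass on the few tokens whose temporal phase matches the query, which suits a quasi-periodic target signal.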
The ST-FF module incorporates depthwise 3D convolution and batch normalization to refine local context, counterbalancing global transformer attention.
Innovations in PhysFormer++
PhysFormer++ introduces:
- Temporal Difference Periodic Transformer: Augments attention with learnable periodic positional encodings, computing both periodic and standard self-attention maps and fusing them, with dedicated scalar parameters modulating the fusion weight and attention sparsity, respectively.
- Temporal Difference Cross-Attention Transformer: Enables cross-path interaction wherein the Fast stream queries the Slow stream’s richer semantic tokens.
Spiking-PhysFormer Mechanisms
Spiking self-attention is simplified relative to standard transformer attention, using spike-coded queries/keys and thresholding rather than full softmax. Binary spike-based activations reduce computational load and facilitate asynchronous parallelism.
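A toy sketch of the idea: queries and keys become binary spike trains, and attention reduces to thresholded coincidence counting rather than a softmax over real-valued scores. The encoding and firing thresholds are illustrative assumptions, not the paper's mechanism in detail:

```python
# Toy sketch of spike-based attention: binarised queries/keys and a
# thresholded coincidence count replace softmax. Thresholds are
# illustrative assumptions.

def spike_encode(x, threshold=0.5):
    """Rate-free binarisation: emit a spike where activation >= threshold."""
    return [1 if v >= threshold else 0 for v in x]

def spike_attention(q, k, fire_threshold=2):
    """Count coincident spikes; fire an output spike above the threshold."""
    coincidences = sum(qi & ki for qi, ki in zip(q, k))
    return 1 if coincidences >= fire_threshold else 0

q = spike_encode([0.9, 0.1, 0.7, 0.8])  # -> [1, 0, 1, 1]
k = spike_encode([0.6, 0.2, 0.9, 0.4])  # -> [1, 0, 1, 0]
print(spike_attention(q, k))  # 1: two coincidences meet the threshold
```

Because the operations are integer comparisons and additions on sparse binary events, they map naturally onto low-power, event-driven hardware.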
4. Supervision, Loss Functions, and Learning Strategies
Label Distribution Learning
Instead of regressing a single HR value, rPPG prediction is formulated as a multi-label classification over discrete HR classes spanning the physiological BPM range, with the ground-truth HR expanded into a Gaussian distribution over the class bins:

$$p_k = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(k - \mathrm{HR})^2}{2\sigma^2}\right).$$
This Gaussian label smoothing propagates similarity across neighboring classes, improves robustness, and mitigates overfitting in limited data regimes.
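Generating such a soft target is straightforward; the class range of 40–180 BPM and the sigma value below are common choices for this kind of loss and should be treated as assumptions:

```python
import math

# Sketch of a Gaussian label-distribution target: the ground-truth HR is
# expanded into a normalised soft distribution over discrete HR bins, so
# neighbouring classes share probability mass. The bin range [40, 180]
# BPM and sigma are illustrative assumptions.

def hr_label_distribution(hr, lo=40, hi=180, sigma=1.0):
    ks = range(lo, hi + 1)
    raw = [math.exp(-((k - hr) ** 2) / (2 * sigma ** 2)) for k in ks]
    z = sum(raw)
    return [r / z for r in raw]  # normalise to a valid distribution

dist = hr_label_distribution(72.0)
peak_bin = 40 + max(range(len(dist)), key=dist.__getitem__)
print(peak_bin)  # 72: the distribution peaks at the true HR bin
```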
Curriculum Learning with Frequency-Domain Dynamic Constraint
The overall loss is defined as

$$\mathcal{L}_{\text{overall}} = \alpha \cdot \mathcal{L}_{\text{time}} + \beta \cdot (\mathcal{L}_{\text{CE}} + \mathcal{L}_{\text{LD}}),$$

combining a time-domain signal loss with the frequency-domain cross-entropy and label-distribution losses, where the weight $\beta$ is exponentially ramped per epoch while $\alpha$ stays fixed. This dynamic scheduling enforces frequency-domain constraints gradually, aiding convergence to precise periodic signal estimation.
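An exponential ramp of this kind can be sketched as follows; the starting weight beta0 and growth rate eta are illustrative hyperparameters, not the paper's values:

```python
# Sketch of an exponentially ramped frequency-domain loss weight: beta
# starts small so time-domain supervision dominates early epochs, then
# grows multiplicatively. beta0 and eta are illustrative assumptions.

def beta_schedule(epoch, beta0=0.1, eta=1.1):
    """Exponential ramp: beta at a given epoch index (0-based)."""
    return beta0 * (eta ** epoch)

weights = [round(beta_schedule(e), 4) for e in range(5)]
print(weights)  # monotonically increasing across epochs
```

The easy time-domain objective shapes the signal first; the stricter frequency-domain constraint only takes over once the network already produces roughly periodic outputs.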
5. Benchmarking and Empirical Results
PhysFormer and PhysFormer++ were extensively tested on VIPL-HR, MAHNOB-HCI, MMSE-HR, and OBF datasets (Yu et al., 2021, Yu et al., 2023). All models utilize subject-exclusive cross-validation and are trained from scratch, requiring no large-scale pretraining. Data preprocessing includes MTCNN-based face detection, temporal up-/down-sampling, and horizontal flipping.
Performance is reported as follows:
- PhysFormer achieves low RMSE (in BPM) and high Pearson correlation on VIPL-HR, outperforming CNN baselines (PhysNet, DeepPhys, AutoHR) and classical ROI-based methods.
- Cross-dataset generalization (VIPL-HR → MMSE-HR) demonstrates strong robustness.
- OBF results include accurate HRV (“LF,” “HF,” “LF/HF”) and RF measurement.
- Ablation studies confirm critical roles for tube tokenization, TD-MHSA/TD-MHPSA/TD-MHCSA blocks, and dynamic frequency-supervised loss.
Spiking-PhysFormer (Liu et al., 2024), evaluated on the PURE, UBFC-rPPG, UBFC-Phys, and MMPD datasets, maintains competitive accuracy while substantially reducing overall power consumption, with the largest savings in the transformer blocks.
6. Comparison with Preceding Methodologies
PhysFormer overcomes the principal weaknesses of non-end-to-end and CNN architectures: non-end-to-end models rely on preprocessed signals and manual ROI selection, whereas CNN architectures have constrained spatio-temporal receptive fields. In contrast:
- PhysFormer integrates local and long-range attention, enhancing quasi-periodic feature extraction directly from videos without explicit ROI preselection or reliance on downstream signal processing.
- PhysFormer++ further exploits multi-timescale analysis via parallel paths, and unique periodic/cross-attention mechanisms refine the extraction of complex physiological signals.
- Spiking-PhysFormer demonstrates that hybrid neural network architectures can maintain prediction fidelity while drastically reducing power consumption.
7. Practical Implications and Directions for Future Research
The PhysFormer family sets a strong transformer-based baseline for the rPPG community. Key implications include:
- Edge Deployment: Spiking-PhysFormer’s power savings suggest viability for mobile and wearable devices in real-time monitoring.
- Longer-Sequence rPPG: Future developments may focus on quantization, binarization, and more scalable attention for persistent measurement.
- Multi-modal Biometrics: The transformer backbone may be adapted for fusion with depth or thermal channels, increasing robustness.
- Broader Video Analysis: Temporal difference mechanisms are suitable for other fine-grained or periodic tasks such as action recognition, repetition counting, and affective computing.
- Efficient Attention: Designing “more accurate yet efficient spatio-temporal self-attention mechanisms” is cited as a priority for long video sequences.
The PhysFormer lineage represents a substantive advance in end-to-end, transformer-based rPPG measurement, validated across benchmarks and increasingly adapted for low-power, real-world applications.