- The paper presents a novel framework using neural CDEs to model cardiac periodicity and improve remote heart rate estimation.
- It introduces temporal-spatial state space duality with a complex state transition matrix to achieve efficient, low-latency inference.
- Experimental results demonstrate a 49% reduction in heart rate estimation error and up to an 83% reduction in per-frame latency across diverse datasets.
FacePhys: Efficient State Space Modeling for Real-Time Remote Physiological Measurement
Introduction and Motivation
The paper proposes FacePhys, a framework for remote photoplethysmography (rPPG): the non-contact estimation of physiological signals such as heart rate from video. It addresses major limitations of deep learning-based rPPG, including computational inefficiency and the loss of long-range temporal dependencies, particularly in practical deployments where on-device, real-time, private inference is paramount. FacePhys builds on a neural Controlled Differential Equation (CDE)-based state space model (SSM) and introduces temporal-spatial state space duality (TSD), which enables both efficient training on arbitrarily long videos and low-latency, memory-efficient online inference.
Methodology
FacePhys frames the cardiac signal as a latent dynamical system, modeled via a neural CDE, which captures the physiological periodicity of the heart rather than exploiting generic temporal dependencies as in RNNs.
Figure 1: The discretized state space form enables efficient computation, in contrast to the ideal but computationally inefficient continuous-time CDE formulation of heart state evolution.
The model is discretized using the Zero-Order Hold (ZOH) method, which treats the input as constant between samples and thereby preserves equivalence between the continuous and discrete state transitions. This matters in practice because video arrives as discrete frames. The key mathematical point is that, under ZOH, the hidden-state recursion has linear complexity in sequence length, while its dual attention-like form enables memory-efficient long-sequence modeling.
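As a minimal sketch of the ZOH step, assume a diagonal continuous-time system h'(t) = a·h(t) + b·x(t). The function names `zoh_discretize` and `scan`, and the diagonal simplification itself, are illustrative assumptions and not the paper's implementation.

```python
import numpy as np

def zoh_discretize(a, b, dt):
    """Zero-Order Hold for a diagonal system h'(t) = a*h(t) + b*x(t).

    Hypothetical helper: returns (a_bar, b_bar) such that
    h_k = a_bar * h_{k-1} + b_bar * x_k is exact whenever the input is
    constant over each sampling interval of length dt.
    """
    a = np.asarray(a, dtype=complex)
    b = np.asarray(b, dtype=complex)
    a_bar = np.exp(dt * a)
    # b_bar = (exp(dt*a) - 1) / a * b, with the limit dt*b when a == 0.
    safe_a = np.where(a == 0, 1.0, a)
    b_bar = np.where(a == 0, dt * b, (a_bar - 1.0) / safe_a * b)
    return a_bar, b_bar

def scan(a_bar, b_bar, x):
    """Run the discrete recursion once over the inputs: O(T) time, O(N) state."""
    h = np.zeros_like(a_bar)
    states = []
    for x_k in x:
        h = a_bar * h + b_bar * x_k
        states.append(h.copy())
    return np.stack(states)

# Usage: a decaying mode driven by a constant input, sampled every 0.1 s.
a_bar, b_bar = zoh_discretize([-1.0], [1.0], dt=0.1)
states = scan(a_bar, b_bar, np.ones(5))
```

For a constant input, the discrete states coincide with the continuous solution 1 - exp(-t) at the sampling instants, which is the equivalence that ZOH provides.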
Figure 2: The FacePhys framework combines the SSM dual as a discretization solver for the heart state CDE and as a linear attention mechanism, with temporal normalization for stability and a complex state transition matrix facilitating periodic attention.
Temporal features are stabilized by a dedicated Temporal Normalization (TN) module, which removes trends (via least-squares detrending in training and a recursive moving average in inference) and then standardizes the result. This ensures numerical stability and keeps the per-frame inference cost constant, even over extended video streams.
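The two TN modes can be sketched as follows. This is only an illustration under stated assumptions: the offline version fits a linear trend, and the streaming version uses exponential moving estimates with a hypothetical `momentum` parameter. The summary does not specify the exact trend model or recursion the paper uses.

```python
import numpy as np

def temporal_norm_offline(x, eps=1e-6):
    """Training-style TN sketch: subtract a least-squares linear trend, then standardize."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x), dtype=float)
    slope, intercept = np.polyfit(t, x, deg=1)  # least-squares line fit
    residual = x - (slope * t + intercept)
    return residual / (residual.std() + eps)

class StreamingTemporalNorm:
    """Inference-style TN sketch: O(1) work and O(1) state per frame.

    Assumption: exponential moving estimates of mean and variance stand in
    for the "recursive moving average" mentioned in the text.
    """

    def __init__(self, momentum=0.99, eps=1e-6):
        self.momentum = momentum
        self.eps = eps
        self.mean = None
        self.var = 0.0

    def step(self, value):
        value = float(value)
        if self.mean is None:  # initialise on the first frame
            self.mean = value
        m = self.momentum
        self.mean = m * self.mean + (1.0 - m) * value
        self.var = m * self.var + (1.0 - m) * (value - self.mean) ** 2
        return (value - self.mean) / (np.sqrt(self.var) + self.eps)

# Usage: a drifting sinusoid, normalized offline and frame by frame.
t = np.arange(300)
signal = 0.05 * t + 3.0 + np.sin(2 * np.pi * t / 25)
offline = temporal_norm_offline(signal)
tn = StreamingTemporalNorm()
online = np.array([tn.step(v) for v in signal])
```

The streaming class keeps only two scalars of state, so its cost per frame does not grow with the length of the stream.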
Spatial and temporal modeling are decoupled: spatial features are extracted using 2D convolutions, while temporal dependencies are modeled via SSM duality, allowing for training/inference over long sequences without quadratic memory or computation costs, a critical advantage over transformer-based approaches.
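The duality between the recurrent and attention-like views of an SSM can be checked numerically. The sketch below assumes a time-invariant diagonal transition `a` with per-step input and output projections `b` and `c`; the function names and shapes are illustrative and not taken from the paper. It verifies only that the two forms agree, and does not show how the paper schedules or chunks them.

```python
import numpy as np

def ssm_recurrent(a, b, c, x):
    """Recurrent view: h_t = a * h_{t-1} + b_t * x_t, y_t = c_t . h_t (O(T) time, O(N) state)."""
    h = np.zeros_like(a)
    y = np.empty(len(x), dtype=a.dtype)
    for t in range(len(x)):
        h = a * h + b[t] * x[t]
        y[t] = np.dot(c[t], h)  # plain dot product, no conjugation
    return y

def ssm_attention_form(a, b, c, x):
    """Dual view: y = M x with M[i, j] = sum_n c[i, n] * a[n]**(i - j) * b[j, n] for j <= i."""
    T = len(x)
    lag = np.arange(T)[:, None] - np.arange(T)[None, :]  # i - j
    causal = lag >= 0
    # Clamp negative lags to 0 before exponentiating; they are masked out below.
    powers = a[None, None, :] ** np.where(causal, lag, 0)[:, :, None]  # shape (T, T, N)
    M = np.einsum("in,ijn,jn->ij", c, powers, b) * causal
    return M @ x

# Usage: random projections with a stable complex diagonal transition.
rng = np.random.default_rng(0)
T, N = 16, 4
a = 0.9 * np.exp(1j * rng.uniform(0.0, 1.0, N))
b = rng.standard_normal((T, N)) + 0j
c = rng.standard_normal((T, N)) + 0j
x = rng.standard_normal(T)
y_rec = ssm_recurrent(a, b, c, x)
y_att = ssm_attention_form(a, b, c, x)
```

The matrix form materializes a T×T causal mask and so costs quadratic memory in this naive version, while the recurrence carries only an N-dimensional state. This is why the recurrent view suits streaming inference.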
Periodicity via Complex State Transition Matrix
To encode physiological cardiac periodicity, FacePhys parameterizes the diagonal elements of the state transition matrix A using trainable complex numbers. This yields hidden-state evolution with oscillatory components, aligning the inductive bias of the network directly with the periodic nature of cardiac signals.
Figure 3: Introducing trainable complex numbers in the diagonal state transition matrix A generates oscillatory terms in the solution, functionally corresponding to periodic attention.
Mathematically, each complex eigenvalue of the diagonal A contributes an exponential decay (from its real part) and a sinusoidal oscillation (from its imaginary part). In the SSM’s convolutional expansion this is dual to a periodic attention pattern, so long-range temporal dependencies are modeled with the periodic structure needed for accurate physiological signal recovery.
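To make the decay-plus-oscillation claim concrete, consider a single diagonal entry. The symbols below (decay rate α, angular frequency ω, sampling interval Δ) are notation assumed for this illustration and may differ from the paper's.

```latex
\begin{align}
  \lambda &= -\alpha + i\omega, \qquad \alpha > 0, \\
  \bar{\lambda} &= e^{\Delta\lambda} = e^{-\alpha\Delta}\, e^{i\omega\Delta}, \\
  \bar{\lambda}^{k} &= e^{-\alpha k\Delta}
      \bigl(\cos(\omega k\Delta) + i\,\sin(\omega k\Delta)\bigr).
\end{align}
```

The weight given to an input k frames in the past is proportional to this k-th power, so it shrinks at rate α and rotates in phase by ωΔ per frame. As an illustrative example with assumed values, a 72 bpm heart rate (1.2 Hz) sampled at 30 frames per second gives ωΔ = 2π · 1.2 / 30 = 2π / 25, so the weighting pattern repeats every 25 frames, which is one cardiac cycle.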
Experimental Evidence and Results
Comprehensive experiments across five large-scale, diverse datasets (MMPD, PURE, UBFC, RLAP, VitalVideo) showcase FacePhys’s superior intra- and cross-dataset generalization, with substantial improvements in accuracy, inference latency, and memory efficiency relative to state-of-the-art baselines. The framework achieves a 49% reduction in heart rate estimation error and up to 83% reduction in per-frame latency versus prior leading methods, confirmed by extensive ablation and chunk-length studies.
Figure 4: FacePhys achieves markedly better model accuracy and latency compared to existing approaches, validating the superiority of heart state space modeling.
FacePhys attains the lowest memory footprint (3.6 MB) and supports real-time streaming inference with a per-frame latency of 9.46 ms, outperforming all compared methods, including transformer and advanced SSM architectures (e.g., PhysFormer, Mamba, RhythmMamba). The model supports training on full-length video sequences, a regime previously inaccessible in other frameworks because of memory explosion, and demonstrates stable performance across varied real-world (e.g., mobile, low-bandwidth) settings.
Ablation studies confirm the necessity of each architectural module: removing TN, SSM duality, or the complex oscillatory state transition matrix A sharply degrades performance.
Implications and Future Directions
Practically, FacePhys’s computational and memory efficiency makes deployment viable on resource-constrained edge and mobile devices, enabling instant, privacy-preserving, real-time cardiac monitoring, a critical step toward ubiquitous health sensing and telemedicine.
Theoretically, the work demonstrates the efficacy of embedding domain-specific priors by aligning the model’s inductive bias (periodicity, physiological dynamics) with latent state space structure, thereby improving generalization and robustness under distribution shift and cross-dataset transfer. The approach outperforms both CNNs (limited receptive field) and transformers/vanilla SSMs (high computational burden, lack of physiological periodicity).
The explicit periodic structure induced by complex diagonal SSMs establishes a powerful blueprint for modeling other quasi-periodic physiological time series.
Limitations remain for clinical and hospital environments and for populations with cardiovascular disease, where clinical validation is still required. The authors indicate future work on extending FacePhys to multimodal sensor fusion (e.g., thermal, IMU), blood oxygen and blood pressure estimation, and further optimization for real-time operation at higher frame rates.
Conclusion
FacePhys establishes a new standard for efficiency and accuracy in remote camera-based physiological measurement, uniting neural CDE modeling with structured SSM duality and domain-specific periodicity constraints. This results in robust long-range temporal modeling, extremely low latency, and suitability for on-device and streaming physiological monitoring. By introducing structured modeling aligned with the underlying biomedical phenomenon, FacePhys lays the groundwork for advanced, generalizable, and practical health sensing AI systems.
Reference: "FacePhys: State of the Heart Learning" (2512.06275)