Cross State-Space Propagation (CSSP)
- CSSP is a mechanism that propagates features across video frames using state-space modeling to enhance alignment in medical video super-resolution.
- It uses neighbor-driven updates and distant-driven observations through learnable convolutional modules to filter out misaligned or artifact-prone inputs.
- Empirical studies show that CSSP improves temporal consistency and PSNR, demonstrating robustness against camera shake and tissue deformation.
Cross State-Space Propagation (CSSP) is a recurrent feature propagation mechanism central to the MedVSR framework for medical video super-resolution. CSSP addresses the alignment challenges in low-resolution medical videos—namely, camera shake, noise, abrupt frame transitions, and the nuanced, continuous structures of tissue—by embedding information from both neighboring and distant frames into learnable state-space dynamics. CSSP achieves this by projecting distant frames into the observation matrices of a state-space model (SSM), allowing consistent and informative features to propagate recurrently while filtering out misaligned or artifact-prone content (Liu et al., 25 Sep 2025).
1. Fundamental Principles of Cross State-Space Propagation
CSSP leverages a linear, discrete-time state-space model structurally akin to the Kalman filter. Let $h_t$ denote the hidden state at step $t$, $x_t$ the input token, and $y_t$ the output token. The standard SSM equations are given by:

$$h_t = A\,h_{t-1} + B\,x_t, \qquad y_t = C\,h_t,$$

where $A$ is the state transition matrix, $B$ the input matrix, and $C$ the observation matrix.
Unlike classical state-space approaches, CSSP parameterizes $A$ (as diagonal or block-diagonal, fixed or learned), but crucially makes $B$ and $C$ input-dependent. Specifically, $B$ is modulated by neighbor-frame features and $C$ dynamically incorporates warped distant-frame features.
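To make the recurrence concrete, the following NumPy sketch unrolls a diagonal state-space model whose input and observation matrices change per token. It is a minimal illustration rather than the MedVSR implementation: the function name `selective_ssm_scan`, the single scalar channel, and the array shapes are assumptions chosen for readability.

```python
import numpy as np

def selective_ssm_scan(x, A_diag, B, C):
    """Unroll h_t = A h_{t-1} + B_t x_t, y_t = C_t h_t over a token sequence.

    x:      (T,)   one scalar input per token (a single feature channel)
    A_diag: (N,)   diagonal of the state transition matrix A
    B:      (T, N) per-token input matrix (neighbor-driven in CSSP)
    C:      (T, N) per-token observation matrix (distant-driven in CSSP)
    """
    T, N = B.shape
    h = np.zeros(N)          # hidden state starts at zero (assumption)
    y = np.empty(T)
    for t in range(T):
        h = A_diag * h + B[t] * x[t]   # state update
        y[t] = C[t] @ h                # observation / output
    return y

# Impulse input: the output decays with the transition coefficient 0.5.
y = selective_ssm_scan(
    np.array([1.0, 0.0, 0.0]),
    np.array([0.5]),
    np.ones((3, 1)),
    np.ones((3, 1)),
)
print(y)  # [1.   0.5  0.25]
```

The impulse response shows how the transition matrix controls how long earlier tokens keep influencing later outputs, while `B` and `C` decide what enters and what is read out at each step.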
2. CSSP Mechanism: Cross-Frame Feature Projection
At each temporal step and in each propagation branch, CSSP operates as follows:
- Feature extraction: Neighbor features are taken from the adjacent frame; distant features are taken from a non-consecutive frame and warped via composite flow.
- Tokenization: Both feature maps are partitioned into non-overlapping local windows, forming one token sequence per feature map.
- Parameter computation: Through 1D convolutions with LayerNorm and gating, CSSP yields:
  - Input and state-update terms from the neighbor-frame tokens
  - The observation matrix from the warped distant-frame tokens via a learnable position embedding (LPE).
The SSM is unrolled over the tokens, with the hidden state propagated by neighbor-frame features and the output projected through matrices modulated by distant-frame features. After per-token gating, concatenation, and MLP-based residual lifting, a deformable convolution fuses the propagated features with the remaining feature inputs to complete the alignment.
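The tokenization step can be illustrated with a small NumPy sketch that splits a feature map into non-overlapping square windows. The function name `window_tokenize`, the `(C, H, W)` layout, and the requirement that the window size divide the spatial dimensions are assumptions for this example; the source does not specify these details.

```python
import numpy as np

def window_tokenize(feat, win):
    """Split a (C, H, W) feature map into non-overlapping win x win windows.

    Returns an array of shape (num_windows, win * win, C): one token
    sequence per window, each token being a C-dimensional feature vector.
    """
    C, H, W = feat.shape
    if H % win or W % win:
        raise ValueError("H and W must be divisible by the window size")
    x = feat.reshape(C, H // win, win, W // win, win)
    x = x.transpose(1, 3, 2, 4, 0)   # (H/win, W/win, win, win, C)
    return x.reshape(-1, win * win, C)

feat = np.arange(16).reshape(1, 4, 4)
tokens = window_tokenize(feat, 2)
print(tokens.shape)       # (4, 4, 1)
print(tokens[0, :, 0])    # [0 1 4 5]
```

Windows are ordered row by row, and tokens inside a window follow raster order, so the first window of the 4 x 4 example contains the top-left 2 x 2 block.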
3. Cross-Frame Coupling and Robustness
The cross aspect of CSSP lies in its design: neighbor-frame tokens govern the state update, while warped distant-frame tokens control the observation, so cross-frame information is integrated within the SSM recurrence itself. Rather than warping directly between non-consecutive frames, where optical flow may be compromised, the architecture uses a composite warp towards the nearest frame, thereby increasing robustness to scene discontinuities and flow estimation errors. The token-wise gating mechanism further suppresses the influence of misaligned regions by downweighting inconsistent outputs.
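One common way to build a composite flow across two frame gaps, used in second-order propagation schemes, is to chain two adjacent-frame flows instead of estimating the long-range flow directly. The sketch below illustrates that idea; whether MedVSR composes flows in exactly this way is an assumption, as are the function name `compose_flows`, the `(dx, dy)` channel order, and the nearest-neighbor sampling (bilinear sampling is more usual in practice).

```python
import numpy as np

def compose_flows(flow_ab, flow_bc):
    """Chain two dense flows into a flow from frame a to frame c.

    flow_ab, flow_bc: (H, W, 2) arrays of (dx, dy) displacements in pixels,
    each defined on the pixel grid of its source frame.
    """
    H, W, _ = flow_ab.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Where each pixel of frame a lands in frame b (rounded and clipped).
    xb = np.clip(np.rint(xs + flow_ab[..., 0]).astype(int), 0, W - 1)
    yb = np.clip(np.rint(ys + flow_ab[..., 1]).astype(int), 0, H - 1)
    # Total displacement = first hop + second hop sampled at the landing point.
    return flow_ab + flow_bc[yb, xb]

# Two constant translations simply add up.
f_ab = np.tile(np.array([1.0, 0.0]), (4, 4, 1))
f_bc = np.tile(np.array([2.0, 1.0]), (4, 4, 1))
print(compose_flows(f_ab, f_bc)[0, 0])  # [3. 1.]
```

Because each hop spans only adjacent frames, errors from a single unreliable long-range estimate are avoided, at the cost of accumulating the errors of the two short-range estimates.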
Ablation studies demonstrate that this cross-coupling confers a PSNR improvement of approximately 0.3 dB over single-frame SSM propagation and that omitting distant-frame control leads to a degradation of 0.32 dB, establishing the centrality of cross-frame dynamics in effective temporal alignment (Liu et al., 25 Sep 2025).
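For a sense of scale, a PSNR difference can be translated into a ratio of mean squared errors. The short derivation below assumes that both models are scored against the same peak value and that PSNR is computed from a single aggregate MSE. This is a simplification, since an average of per-frame PSNR values does not map exactly onto one MSE ratio.

```latex
\Delta\mathrm{PSNR}
  = 10\log_{10}\frac{\mathrm{MSE}_{\text{base}}}{\mathrm{MSE}_{\text{new}}}
\quad\Longrightarrow\quad
\frac{\mathrm{MSE}_{\text{new}}}{\mathrm{MSE}_{\text{base}}}
  = 10^{-\Delta\mathrm{PSNR}/10},
\qquad
10^{-0.3/10} \approx 0.933
```

Under these assumptions, a gain of about 0.3 dB corresponds to roughly a 7% reduction in mean squared error.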
4. Algorithmic Workflow and Pseudocode
The following workflow summarizes CSSP for a single time-step and branch:
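- Inputs: Neighbor-frame and distant-frame features, optical flow
- Warp: Compute the warped distant-frame features via composite warp
- Tokenize: Obtain the neighbor and distant token sequences using local windowing
- Convolutional processing: Compute the input-dependent SSM parameters via gated Conv1d and LayerNorm
- SSM propagation: For each token $x_t$ in the neighbor-frame token sequence
  - $h_t = A\,h_{t-1} + B_t\,x_t$ (state update, with $B_t$ derived from neighbor-frame tokens)
  - $y_t = C_t\,h_t$ (output, with $C_t$ derived from warped distant-frame tokens)
- Gating and aggregation: Apply token-wise gating and output projection
- Residual enhancement: Combine with MLP and residual connection
- Alignment: Fuse the result via deformable convolution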
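The steps above can be sketched end to end in NumPy. This is a simplified illustration under stated assumptions, not the authors' pseudocode: plain linear projections stand in for the gated Conv1d and LayerNorm modules, the gate is computed from the distant tokens, the MLP and deformable-convolution stages are omitted, and all names (`cssp_step_sketch`, `W_x`, `W_b`, `W_c`, `W_g`, `a`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cssp_step_sketch(tok_nbr, tok_dst, params):
    """Cross state-space propagation over one token sequence (sketch).

    tok_nbr: (T, D) tokens from the neighbor frame
    tok_dst: (T, D) tokens from the warped distant frame
    params:  hypothetical weights W_x (D, D), W_b (D, N), W_c (D, N),
             W_g (D, D) and a diagonal transition a (N,)
    """
    x = tok_nbr @ params["W_x"]               # inputs from neighbor tokens
    B = tok_nbr @ params["W_b"]               # per-token input matrix
    C = tok_dst @ params["W_c"]               # per-token observation matrix
    gate = sigmoid(tok_dst @ params["W_g"])   # token-wise gate (assumed source)

    T, D = x.shape
    N = B.shape[1]
    h = np.zeros((D, N))                      # one N-dim state per channel
    y = np.empty((T, D))
    for t in range(T):
        h = params["a"] * h + np.outer(x[t], B[t])   # neighbor-driven update
        y[t] = h @ C[t]                              # distant-driven readout
    return tok_nbr + gate * y                 # gated output with residual

rng = np.random.default_rng(0)
T, D, N = 5, 3, 4
params = {
    "W_x": rng.normal(size=(D, D)), "W_b": rng.normal(size=(D, N)),
    "W_c": rng.normal(size=(D, N)), "W_g": rng.normal(size=(D, D)),
    "a": np.full(N, 0.9),
}
out = cssp_step_sketch(rng.normal(size=(T, D)), rng.normal(size=(T, D)), params)
print(out.shape)  # (5, 3)
```

In this sketch the distant tokens never enter the hidden state directly; they only shape what is read out of it and how strongly it is gated, which mirrors the split between neighbor-driven updates and distant-driven observations described above.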
5. Empirical Performance and Training Regimen
MedVSR with CSSP improves on BasicVSR++ across medical video datasets (HyperKvasir, LDPolyp, EndoVis18), achieving up to 0.37 dB higher PSNR while using fewer parameters. Qualitative assessments indicate that artifacts from misaligned distant frames are effectively suppressed. The pipeline employs bidirectional CSSP branches, tokenization with local windows, a fixed SSM hidden dimension, and learnable 2D depthwise Conv-based position embeddings. Training uses the Charbonnier loss, cosine learning rate decay, fixed-size HR patches, additive Gaussian noise, and bicubic downsampling. SpyNet optical flow is employed for the composite warp. Inner State-Space Reconstruction (ISSR) and large-kernel separable convolutions enhance final reconstruction (Liu et al., 25 Sep 2025).
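The training loss, evaluation metric, and schedule named above have standard definitions, sketched here in NumPy. The specific values used (`eps`, the peak value of 1.0, the learning rates) are illustrative assumptions, not the settings reported for MedVSR.

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Smooth L1-like loss: mean of sqrt(diff^2 + eps^2)."""
    return np.mean(np.sqrt((pred - target) ** 2 + eps ** 2))

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    cos = np.cos(np.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)

target = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)
print(round(psnr(pred, target), 2))            # 20.0
print(round(cosine_lr(50, 100, 1e-4), 6))      # 5e-05
```

The Charbonnier loss behaves like an absolute error for large differences and like a squared error near zero, which makes it less sensitive to outliers than a pure L2 loss.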
6. Architectural Choices and Practical Implications
CSSP achieves effective temporal alignment and information selection in challenging video conditions. Its architectural innovations—a small recurrent SSM with neighbor-driven updates and distant-driven observations, input-dependent parameterization via convolutional features, and robust gating—directly address instability in alignment when classical optical flow is unreliable due to camera shake or tissue deformation. The method is readily extensible to other video analysis domains where temporal consistency and artifact rejection are critical, particularly when imaging scenes with repetitive or ambiguous textures.
7. Context and Impact in Medical Video Super-Resolution
By integrating cross-frame feature propagation into the state-space modeling framework, CSSP advances the fidelity and reliability of medical video super-resolution, a domain where stringent alignment is necessary to avoid diagnostic artifacts. The cross-recurrence mechanism ensures that only consistent features propagate, reducing the risk of reconstructing misleading content in frames with challenging optical flow. The demonstrated improvements in PSNR and qualitative artifact rejection situate CSSP as a robust solution for real-world low-resolution medical video enhancement and set a precedent for future state-space recurrent modeling in video restoration tasks (Liu et al., 25 Sep 2025).