Cross State-Space Propagation (CSSP)

Updated 3 July 2026

CSSP is a mechanism that propagates features across video frames using state-space modeling to enhance alignment in medical video super-resolution.
It uses neighbor-driven updates and distant-driven observations through learnable convolutional modules to filter out misaligned or artifact-prone inputs.
Empirical studies show that CSSP improves temporal consistency and PSNR, demonstrating robustness against camera shake and tissue deformation.

Cross State-Space Propagation (CSSP) is a recurrent feature propagation mechanism central to the MedVSR framework for medical video super-resolution. CSSP addresses the alignment challenges in low-resolution medical videos—namely, camera shake, noise, abrupt frame transitions, and the nuanced, continuous structures of tissue—by embedding information from both neighboring and distant frames into learnable state-space dynamics. CSSP achieves this by projecting distant frames into the observation matrices of a state-space model (SSM), allowing consistent and informative features to propagate recurrently while filtering out misaligned or artifact-prone content (Liu et al., 25 Sep 2025).

1. Fundamental Principles of Cross State-Space Propagation

CSSP leverages a linear, discrete-time state-space model structurally akin to the Kalman filter. Let $h_i\in\mathbb{R}^N$ denote the hidden state at step $i$, $x_i\in\mathbb{R}^d$ the input token, and $y_i\in\mathbb{R}^m$ the output token. The standard SSM equations are $h_i = A h_{i-1} + B x_i,\quad y_i = C h_i$, where $A \in \mathbb{R}^{N \times N}$ is the state transition matrix, $B \in \mathbb{R}^{N \times d}$ the input matrix, and $C \in \mathbb{R}^{m \times N}$ the observation matrix.

Unlike classical state-space approaches, CSSP parameterizes $A$ (as diagonal or block-diagonal, fixed or learned), but crucially makes $B$ and $C$ input-dependent. Specifically, $B$ is modulated by neighbor-frame features and $C$ dynamically incorporates warped distant-frame features.
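
The recurrence with input-dependent matrices can be illustrated with a minimal NumPy sketch. It assumes a diagonal $A$ and takes the per-token matrices as given arrays; the function name `ssm_scan` and all shapes and values are illustrative assumptions, not taken from the MedVSR implementation.

```python
import numpy as np

def ssm_scan(x, a_diag, B, C):
    """Unroll h_i = A h_{i-1} + B_i x_i, y_i = C_i h_i with a diagonal A.

    x:      (L, d)     input tokens
    a_diag: (N,)       diagonal of the state transition matrix A
    B:      (L, N, d)  one input matrix per token (input-dependent)
    C:      (L, m, N)  one observation matrix per token (input-dependent)
    """
    h = np.zeros(a_diag.shape[0])          # initial hidden state h_0 = 0
    ys = []
    for i in range(x.shape[0]):
        h = a_diag * h + B[i] @ x[i]       # state update
        ys.append(C[i] @ h)                # observation
    return np.stack(ys)

# Illustrative sizes only (hypothetical values).
rng = np.random.default_rng(0)
L, d, N, m = 5, 3, 4, 2
x = rng.normal(size=(L, d))
a_diag = np.full(N, 0.9)
B = rng.normal(size=(L, N, d))
C = rng.normal(size=(L, m, N))
y = ssm_scan(x, a_diag, B, C)
print(y.shape)
```

In CSSP the arrays playing the role of `B` would be derived from neighbor-frame features and those playing the role of `C` from warped distant-frame features; here both are random stand-ins.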

2. CSSP Mechanism: Cross-Frame Feature Projection

At each temporal step and for each propagation branch, CSSP operates as follows:

Feature extraction: A neighbor-frame feature map from the adjacent frame; a distant-frame feature map from a non-adjacent frame, warped via composite flow.
Tokenization: Both feature maps are partitioned into non-overlapping local windows, forming a neighbor token sequence and a distant token sequence.
Parameter computation: Through 1D convolutions with LayerNorm and gating, CSSP yields:
- Inputs and state update ($x$, $B$) from the neighbor tokens
- Observation $C$ from the warped distant tokens via a learnable position embedding (LPE).

The SSM is unrolled over the tokens, with the hidden state propagated by neighbor-frame features and the output projected through matrices modulated by distant-frame features. After per-token gating, concatenation, and MLP-based residual lifting, a deformable convolution fuses the propagated features with the input feature maps for alignment.
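
The tokenization step can be sketched as a reshape of a feature map into non-overlapping windows. The helper names `window_partition` and `window_merge`, the channels-last layout, and the window size of 4 are assumptions made for illustration; the window size used in the paper is not reproduced here.

```python
import numpy as np

def window_partition(feat, w):
    """Split an (H, W, C) feature map into non-overlapping w x w windows.

    Returns an array of shape (num_windows, w * w, C): one token
    sequence per window, with tokens in row-major order.
    """
    H, W, C = feat.shape
    assert H % w == 0 and W % w == 0, "H and W must be multiples of w"
    x = feat.reshape(H // w, w, W // w, w, C)
    x = x.transpose(0, 2, 1, 3, 4)          # (rows, cols, w, w, C)
    return x.reshape(-1, w * w, C)

def window_merge(tokens, H, W, w):
    """Inverse of window_partition: rebuild the (H, W, C) feature map."""
    C = tokens.shape[-1]
    x = tokens.reshape(H // w, W // w, w, w, C)
    x = x.transpose(0, 2, 1, 3, 4)          # (rows, w, cols, w, C)
    return x.reshape(H, W, C)

feat = np.arange(8 * 8 * 2, dtype=np.float64).reshape(8, 8, 2)
tokens = window_partition(feat, 4)
print(tokens.shape)  # (4, 16, 2)
```

Applying the same partition to the neighbor and distant feature maps keeps their tokens spatially paired, so token `i` of one sequence corresponds to the same location as token `i` of the other.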

3. Cross-Frame Coupling and Robustness

The cross aspect of CSSP lies in its design: neighbor-frame tokens govern the state update, while warped distant-frame tokens control the observation, so cross-frame information is integrated within the SSM recurrence itself. Even when optical flow between non-consecutive frames is unreliable, the architecture avoids direct warping in favor of a composite warp towards the nearest frame, which increases robustness to scene discontinuities and flow estimation errors. The token-wise gating mechanism further suppresses misaligned regions by downweighting inconsistent outputs.

Ablation studies demonstrate that this cross-coupling confers a PSNR improvement of approximately 0.3 dB over single-frame SSM propagation and that omitting distant-frame control leads to a degradation of 0.32 dB, establishing the centrality of cross-frame dynamics in effective temporal alignment (Liu et al., 25 Sep 2025).
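
To put a gain of a few tenths of a dB in perspective, the short sketch below computes PSNR and the mean-squared-error ratio implied by a given PSNR difference. It assumes images scaled to a peak value of 1.0; the helper names are illustrative.

```python
import numpy as np

def psnr(ref, est, peak=1.0):
    """Peak signal-to-noise ratio in dB for arrays on a [0, peak] scale."""
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def mse_ratio_for_gain(delta_db):
    """Factor by which MSE shrinks when PSNR rises by delta_db."""
    return 10.0 ** (-delta_db / 10.0)

ref = np.zeros((4, 4))
est = np.full((4, 4), 0.1)
print(psnr(ref, est))            # about 20 dB (MSE = 0.01)
print(mse_ratio_for_gain(0.3))   # about 0.933
```

Under this definition, a 0.3 dB improvement corresponds to roughly 7% lower mean squared error.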

4. Algorithmic Workflow and Pseudocode

The following workflow summarizes CSSP for a single time-step and branch:

Inputs: Neighbor-frame and distant-frame feature maps, optical flow
Warp: Compute the warped distant-frame features via composite warp
Tokenize: Obtain the neighbor and distant token sequences using local windowing
Convolutional processing: Compute the input-dependent SSM parameters ($x$, $B$, $C$) via gated Conv1d and LayerNorm
SSM propagation: For each token index $i$ in the sequence
- $h_i = A h_{i-1} + B_i x_i$
- $y_i = C_i h_i$
Gating and aggregation: Apply token-wise gating and output projection
Residual enhancement: Combine with MLP and residual connection
Alignment: Fuse with the input feature maps via deformable convolution
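
The core of the workflow above, from parameter computation through gating and the residual connection, can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the MedVSR implementation: plain linear projections stand in for the gated Conv1d modules and the learnable position embedding, the gate is computed from the neighbor tokens, and warping, the MLP, and the deformable-convolution fusion are omitted. All names (`cross_ssm_step`, `init_params`, `W_x`, `W_B`, `W_C`, `W_g`) and sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_norm(t, eps=1e-5):
    mu = t.mean(axis=-1, keepdims=True)
    var = t.var(axis=-1, keepdims=True)
    return (t - mu) / np.sqrt(var + eps)

def init_params(d, N, rng, scale=0.1):
    """Random stand-ins for the learned projections (assumption)."""
    return {
        "a_diag": np.full(N, 0.9),                      # diagonal A
        "W_x": scale * rng.normal(size=(d, d)),
        "W_B": scale * rng.normal(size=(d, N * d)),
        "W_C": scale * rng.normal(size=(d, d * N)),
        "W_g": scale * rng.normal(size=(d, d)),
    }

def cross_ssm_step(neigh_tok, dist_tok, p):
    """neigh_tok, dist_tok: (L, d) tokens from the neighbor frame and
    from the warped distant frame, paired by spatial position."""
    L, d = neigh_tok.shape
    N = p["a_diag"].shape[0]
    u = layer_norm(neigh_tok)
    v = layer_norm(dist_tok)
    x = u @ p["W_x"]                            # inputs from neighbor tokens
    B = (u @ p["W_B"]).reshape(L, N, d)         # B_i from neighbor tokens
    C = (v @ p["W_C"]).reshape(L, d, N)         # C_i from distant tokens
    gate = sigmoid(u @ p["W_g"])                # token-wise gate (assumed source)
    h = np.zeros(N)
    y = np.empty((L, d))
    for i in range(L):
        h = p["a_diag"] * h + B[i] @ x[i]       # h_i = A h_{i-1} + B_i x_i
        y[i] = C[i] @ h                         # y_i = C_i h_i
    return neigh_tok + gate * y                 # gating + residual connection

rng = np.random.default_rng(0)
L, d, N = 16, 8, 4
p = init_params(d, N, rng)
neigh = rng.normal(size=(L, d))
dist = rng.normal(size=(L, d))
out = cross_ssm_step(neigh, dist, p)
print(out.shape)  # (16, 8)
```

In this sketch the distant frame influences the output only through `C`, so zeroing `W_C` reduces the step to an identity mapping of the neighbor tokens.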

5. Empirical Performance and Training Regimen

MedVSR with CSSP demonstrates significant improvements over BasicVSR++ on medical video datasets (HyperKvasir, LDPolyp, EndoVis18), achieving up to 0.37 dB higher PSNR while using fewer parameters. Qualitative assessments indicate that artifacts from misaligned distant frames are effectively suppressed. The pipeline employs bidirectional CSSP branches, tokenization with local windows, a fixed SSM hidden dimension, and learnable position embeddings based on 2D depthwise convolutions. Training uses the Charbonnier loss, cosine learning rate decay, cropped HR patches, additive Gaussian noise, and bicubic downsampling. SpyNet optical flow is employed for the composite warp. Inner State-Space Reconstruction (ISSR) and large-kernel separable convolutions enhance final reconstruction (Liu et al., 25 Sep 2025).
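
For reference, the two training ingredients named above have standard forms that can be sketched briefly. The `eps` value, the learning rates, and the step counts below are illustrative assumptions, not settings reported for MedVSR.

```python
import math
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier loss: a smooth approximation of the L1 distance."""
    diff = pred - target
    return np.mean(np.sqrt(diff * diff + eps * eps))

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

pred = np.zeros((2, 3))
target = np.ones((2, 3))
print(charbonnier_loss(pred, target))   # close to 1.0 (the L1 distance)
print(cosine_lr(0, 1000, 1e-4))         # starts at lr_max
print(cosine_lr(1000, 1000, 1e-4))      # ends at lr_min
```

The square root keeps the gradient bounded for large residuals while staying differentiable at zero, which is the usual motivation for using it in restoration tasks.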

6. Architectural Choices and Practical Implications

CSSP achieves effective temporal alignment and information selection in challenging video conditions. Its architectural innovations—a small recurrent SSM with neighbor-driven updates and distant-driven observations, input-dependent parameterization via convolutional features, and robust gating—directly address instability in alignment when classical optical flow is unreliable due to camera shake or tissue deformation. The method is readily extensible to other video analysis domains where temporal consistency and artifact rejection are critical, particularly when imaging scenes with repetitive or ambiguous textures.

7. Context and Impact in Medical Video Super-Resolution

By integrating cross-frame feature propagation into the state-space modeling framework, CSSP advances the fidelity and reliability of medical video super-resolution, a domain where stringent alignment is necessary to avoid diagnostic artifacts. The cross-recurrence mechanism ensures that only consistent features propagate, reducing the risk of reconstructing misleading content in frames with challenging optical flow. The demonstrated improvements in PSNR and qualitative artifact rejection situate CSSP as a robust solution for real-world low-resolution medical video enhancement and set a precedent for future state-space recurrent modeling in video restoration tasks (Liu et al., 25 Sep 2025).

References (1)

Liu et al. MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation (2025)
