
MedVSR: Medical Video Super-Resolution

Updated 3 July 2026
  • The paper presents MedVSR, a novel two-stage state-space framework that enhances clinical videos by addressing artifacts like camera shake and physiological motion.
  • It leverages Cross State-Space Propagation for robust cross-frame feature alignment and Inner State-Space Reconstruction for intra-frame detail enhancement.
  • Evaluations across multiple benchmarks show that MedVSR achieves superior PSNR/SSIM with fewer parameters and lower FLOPs, supporting improved diagnostic imaging.

MedVSR is a medical video super-resolution (VSR) framework designed to address the unique challenges of low-resolution (LR) clinical video data. Existing VSR models often fail to account for medical-specific artifacts—including camera shake, physiological motion, and abrupt scene changes—which degrade the alignment quality and introduce spurious features that may compromise diagnostic accuracy. MedVSR introduces a two-stage state-space modeling strategy, integrating Cross State-Space Propagation (CSSP) for cross-frame feature selection and Inner State-Space Reconstruction (ISSR) for intra-frame structural enhancement. This architecture delivers leading quantitative and qualitative results across multiple medical VSR benchmarks, while maintaining computational efficiency (Liu et al., 25 Sep 2025).

1. Architectural Overview

MedVSR operates on LR video inputs $F_1, \dots, F_T$ to produce high-resolution (HR) outputs $F'_1, \dots, F'_T$ with an upsampling factor (typically $\times 4$). The architecture proceeds in three principal stages:

  1. 2D Convolutional Feature Extraction: Each LR frame is independently encoded to generate spatial features.
  2. Bidirectional CSSP Propagation Branches: For each time step $t$ and branch $j$, bidirectionally propagated features $\{f_{t-1}^j, f_{t-2}^j\}$ are aligned and filtered using a state-space mechanism, yielding $\hat{f}_t^j$.
  3. ISSR Fusion and Upscaling: Features $\{\hat{f}_t^1, \dots, \hat{f}_t^J\}$ are fused and passed through a second state-space module for long-range spatial aggregation, followed by a large-kernel separable block (LKSB) for localized detail enhancement, generating $g_t$. A final pixel-shuffle layer reconstructs the SR frame $F'_t$.

This modular structure supports efficient sequence modeling with high-fidelity reconstruction, as evidenced by Figure 1 in (Liu et al., 25 Sep 2025).
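The final pixel-shuffle step can be illustrated in isolation. The following is a minimal NumPy sketch of the usual pixel-shuffle rearrangement, assuming a channels-first feature map; the function name `pixel_shuffle` and the toy tensor sizes are illustrative choices, not taken from the MedVSR code.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) feature map into (C, H*r, W*r).

    Output pixel (c, h*r + i, w*r + j) is read from input channel
    c*r*r + i*r + j at spatial location (h, w).
    """
    c_rr, h, w = x.shape
    assert c_rr % (r * r) == 0, "channel count must be divisible by r*r"
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)

# Toy example: 16 channels at 2x3 become 1 channel at 8x12 for r = 4.
feat = np.arange(16 * 2 * 3, dtype=np.float32).reshape(16, 2, 3)
sr = pixel_shuffle(feat, 4)
```

With an upsampling factor of 4, the layer before the shuffle has to produce 16 channels for every output channel, which is why the spatial resolution can stay low throughout propagation and reconstruction.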

2. Cross State-Space Propagation (CSSP)

CSSP extends the classical discrete linear state-space model (SSM):

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$

Discretized by zero-order hold with step $\Delta$, the update becomes:

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t$$

where $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$.
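To make the zero-order-hold recurrence concrete, here is a minimal NumPy sketch of a generic linear SSM scan. It assumes a diagonal state matrix with nonzero entries (a common simplification, not something the source states about MedVSR), and the names `zoh_discretize` and `ssm_scan` are hypothetical.

```python
import numpy as np

def zoh_discretize(a_diag, b, delta):
    """Zero-order-hold discretization for a diagonal A (entries must be nonzero).

    A_bar = exp(delta * a),  B_bar = (exp(delta * a) - 1) / a * b
    """
    a_bar = np.exp(delta * a_diag)
    b_bar = (a_bar - 1.0) / a_diag * b
    return a_bar, b_bar

def ssm_scan(a_bar, b_bar, c, x):
    """Run h_t = A_bar * h_{t-1} + B_bar * x_t, y_t = c . h_t over a 1-D input."""
    h = np.zeros_like(a_bar)
    ys = []
    for x_t in x:
        h = a_bar * h + b_bar * x_t   # state update
        ys.append(float(c @ h))       # scalar readout
    return np.array(ys)

# Toy example with two state dimensions and a constant input.
a = np.array([-1.0, -2.0])
b = np.ones(2)
c = np.ones(2)
a_bar, b_bar = zoh_discretize(a, b, delta=0.1)
y = ssm_scan(a_bar, b_bar, c, np.ones(3))
```

For a piecewise-constant input the zero-order hold is exact, so after three steps of size 0.1 the output matches the continuous-time solution at time 0.3.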

Within CSSP, warp-aligned distant-frame features and neighbor-frame features are windowed. Frame-dependent parameters are estimated:

  • SSM input
  • SSM input transform
  • Control matrix

The hidden state is initialized, and the discretized state-space update is then applied token by token using these frame-dependent parameters.

Tokens are reassembled and gated, and the gated result is used in a deformable convolution to produce the aligned output $\hat{f}_t^j$. CSSP thereby filters spatio-temporal features, preserving only consistent components across frames.

3. Inner State-Space Reconstruction (ISSR)

Post-CSSP features are fused by convolution, partitioned into windows, and processed via the Inner State-Space Block (ISSB):

  • Data-dependent SSM parameters (input, transforms, gate) are computed from the windowed features.
  • A window-wise state-space scan is performed, and the resulting tokens are aggregated.
  • The LKSB module refines spatial structure, using a depthwise and pointwise convolution factorization with a large kernel.
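The appeal of a depthwise plus pointwise factorization for large kernels is easiest to see from the weight count. The sketch below is a back-of-the-envelope comparison; the kernel size 7 and channel width 64 are hypothetical values chosen only for illustration, and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a dense k x k convolution (no bias)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv (no bias)."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 channel mixing
    return depthwise + pointwise

# Hypothetical configuration, not taken from the paper.
k, c = 7, 64
dense = standard_conv_params(k, c, c)
separable = separable_conv_params(k, c, c)
ratio = dense / separable
```

Because the depthwise term grows with the kernel area but not with the product of channel counts, the kernel can be enlarged at a cost that stays small relative to a dense convolution.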

ISSR thus combines nonlocal tissue structure preservation with local texture aggregation, critical for artifact suppression and boundary restoration.
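The window partitioning that precedes the window-wise scan can be sketched as a pair of reshapes. This is a minimal NumPy illustration assuming a channels-last feature map whose height and width are divisible by the window size; `window_partition`, `window_reverse`, and the window size 4 are illustrative, not the paper's settings.

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) map into (num_windows, ws*ws, C) token sequences."""
    h, w, c = x.shape
    x = x.reshape(h // ws, ws, w // ws, ws, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (rows, cols, ws, ws, C)
    return x.reshape(-1, ws * ws, c)

def window_reverse(windows, ws, h, w):
    """Inverse of window_partition: reassemble tokens into an (H, W, C) map."""
    c = windows.shape[-1]
    x = windows.reshape(h // ws, w // ws, ws, ws, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (rows, ws, cols, ws, C)
    return x.reshape(h, w, c)

x = np.arange(8 * 8 * 2).reshape(8, 8, 2)
wins = window_partition(x, 4)                 # 4 windows of 16 tokens each
restored = window_reverse(wins, 4, 8, 8)
```

Each window becomes an independent token sequence, so a sequential scan only has to cover the tokens of one window at a time rather than the whole frame.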

4. Training Regimen and Propagation Variants

MedVSR is trained with the Charbonnier loss:

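$$\mathcal{L} = \sqrt{\lVert F'_t - F^{\mathrm{HR}}_t \rVert^2 + \epsilon^2}$$

where $F^{\mathrm{HR}}_t$ denotes the ground-truth HR frame and $\epsilon$ is a small constant.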
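As a rough illustration, the Charbonnier loss can be written in a few lines of NumPy. This sketch uses the per-pixel form averaged over all elements, which is a common implementation choice rather than a confirmed detail of MedVSR, and the default `eps` is an assumed value.

```python
import numpy as np

def charbonnier_loss(pred, target, eps=1e-3):
    """Mean of sqrt((pred - target)^2 + eps^2) over all elements.

    Behaves like L1 for large errors while remaining smooth near zero.
    """
    diff = pred.astype(np.float64) - target.astype(np.float64)
    return float(np.mean(np.sqrt(diff ** 2 + eps ** 2)))

# Toy frames: a constant error of 3 with eps = 4 gives sqrt(9 + 16).
pred = np.full((2, 2), 3.0)
target = np.zeros((2, 2))
loss = charbonnier_loss(pred, target, eps=4.0)
```

The smooth behaviour near zero error is what distinguishes this loss from a plain L1 penalty, whose gradient is discontinuous at zero.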

No adversarial or perceptual objectives are used. Training uses the Adam optimizer with a cosine-decayed learning rate. Training crops are degraded by bicubic downsampling and i.i.d. Gaussian noise. Four propagation branches ($J = 4$) are used. All LR test inputs are generated consistently with this synthetic degradation scheme.
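A cosine learning-rate decay of the kind described can be sketched as a simple function of the iteration index. The numeric values in the example are placeholders chosen for illustration, not the paper's settings, and `cosine_lr` is a hypothetical helper name.

```python
import math

def cosine_lr(step, total_steps, lr_init, lr_min):
    """Cosine decay from lr_init at step 0 to lr_min at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)   # clamp to [0, 1]
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))

# Placeholder schedule: decay from 1e-4 to 1e-7 over 1000 steps.
lr_start = cosine_lr(0, 1000, 1e-4, 1e-7)
lr_mid = cosine_lr(500, 1000, 1e-4, 1e-7)
lr_end = cosine_lr(1000, 1000, 1e-4, 1e-7)
```

The schedule changes slowly at the start and end of training and fastest in the middle, reaching the midpoint between the two rates exactly halfway through.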

5. Evaluation Datasets and Preprocessing

MedVSR is benchmarked on four scenarios:

  • HyperKvasir: Gastrointestinal endoscopy; standard train/val/test split.
  • LDPolypVideo: Colonoscopy polyp scenes.
  • EndoVis18: Robotic surgical instruments with abrupt scene changes.
  • Cataract-101: Ophthalmic cataract surgery videos (101 sequences).

All training and test LR samples use bicubic downsampling plus Gaussian noise. For domain specialization, separate models are trained and evaluated for gastrointestinal and cataract scenarios.

6. Performance Metrics and Comparative Results

Quantitative assessment is provided for PSNR, SSIM, model size, FLOPs, and latency (4090 GPU). MedVSR achieves the leading scores across all datasets (see Table 1, Table 2 below; abbreviations: HK = HyperKvasir, LP = LDPolyp, EV = EndoVis18):

| Model | Params (M) | FLOPs (T) | HK PSNR/SSIM | LP PSNR/SSIM | EV PSNR/SSIM | Cataract PSNR/SSIM |
|---|---|---|---|---|---|---|
| EDVR | 20.6 | 45.4 | 27.12/0.8483 | 30.08/0.8417 | 21.67/0.7883 | 31.78/0.9103 |
| BasicVSR | 6.3 | 8.6 | 31.46/0.8990 | 31.68/0.8650 | 30.72/0.8950 | 35.29/0.9232 |
| BasicVSR++ | 7.3 | 9.5 | 31.73/0.9042 | 31.67/0.8620 | 30.80/0.8958 | 35.54/0.9251 |
| VSRT | 32.6 | 112.6 | 31.78/0.8999 | 31.63/0.8596 | 30.59/0.8916 | 35.25/0.9230 |
| RVRT | 10.8 | 44.4 | 29.26/0.8944 | 31.32/0.8578 | 30.72/0.8932 | 35.64/0.9281 |
| TCNet | 9.6 | 39.6 | 31.11/0.8889 | 31.39/0.8547 | 30.17/0.8859 | 34.60/0.9177 |
| IART | 13.4 | 44.7 | 31.30/0.9030 | 31.70/0.8671 | 30.47/0.8924 | 36.08/0.9275 |
| MedVSR (Ours) | 7.2 | 9.5 | 32.10/0.9069 | 31.83/0.8673 | 30.83/0.8960 | 36.23/0.9308 |

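MedVSR has the second-fewest parameters (7.2M, behind only BasicVSR at 6.3M) and near-lowest FLOPs (9.5T, tied with BasicVSR++ and behind BasicVSR at 8.6T), and its runtime is reported as lowest or near-lowest. Its PSNR margin over the next-best model ranges from 0.03 dB on EndoVis18 to 0.32 dB on HyperKvasir.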
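For readers unfamiliar with the headline metric, PSNR can be computed as follows. This is a minimal NumPy sketch of the standard definition, assuming images scaled to the range [0, 1]; the exact evaluation protocol used in the paper (colour space, cropping) is not specified here.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")           # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Toy example: a uniform error of 0.1 gives MSE = 0.01, i.e. 20 dB.
target = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)
score = psnr(pred, target)
```

Because the scale is logarithmic, a fixed gain in dB corresponds to a fixed relative reduction in mean squared error, regardless of the baseline.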

Ablation analyses show PSNR drops of 0.29, 0.28, and 0.36 dB when removing position embedding, local windows, or separate projections in CSSB, respectively (Table 3). In ISSB, eliminating windows or replacing the concatenation gate also changes PSNR. The LKSB performs best at a single kernel size, with performance dropping for other sizes.

7. Qualitative Analysis and Significance

MedVSR demonstrates effective suppression of motion-jitter and alignment artifacts, as shown in qualitative comparisons (Figure 2). It preserves tissue textures and subtle boundaries (Figure 3 and Figure 4), avoiding the creation of hallucinated structures observed in alternative methods. In cataract surgeries, tool and lens boundaries are maintained, which may support surgical navigation and decision-making.

This suggests that the adoption of state-space mechanisms for both temporal propagation (CSSP) and spatial detail enhancement (ISSR) confers a significant advantage in reconstructing diagnostically faithful imagery in challenging clinical video scenarios.


For further details and code, see the repository at https://github.com/CUHK-AIM-Group/MedVSR (Liu et al., 25 Sep 2025).
