MedVSR: Medical Video Super-Resolution
- The paper presents MedVSR, a novel two-stage state-space framework that enhances clinical videos by addressing artifacts like camera shake and physiological motion.
- It leverages Cross State-Space Propagation for robust cross-frame feature alignment and Inner State-Space Reconstruction for intra-frame detail enhancement.
- Evaluations across multiple benchmarks show that MedVSR achieves superior PSNR/SSIM with fewer parameters and lower FLOPs, supporting improved diagnostic imaging.
MedVSR is a medical video super-resolution (VSR) framework designed to address the unique challenges of low-resolution (LR) clinical video data. Existing VSR models often fail to account for medical-specific artifacts—including camera shake, physiological motion, and abrupt scene changes—which degrade the alignment quality and introduce spurious features that may compromise diagnostic accuracy. MedVSR introduces a two-stage state-space modeling strategy, integrating Cross State-Space Propagation (CSSP) for cross-frame feature selection and Inner State-Space Reconstruction (ISSR) for intra-frame structural enhancement. This architecture delivers leading quantitative and qualitative results across multiple medical VSR benchmarks, while maintaining computational efficiency (Liu et al., 25 Sep 2025).
1. Architectural Overview
MedVSR takes LR video frames as input and produces high-resolution (HR) frames at a fixed upsampling factor. The architecture proceeds in three principal stages:
- 2D Convolutional Feature Extraction: Each LR frame is independently encoded to generate spatial features.
- Bidirectional CSSP Propagation Branches: For each time step and each propagation branch, bidirectionally propagated features are aligned and filtered using a state-space mechanism, yielding aligned features for that branch.
- ISSR Fusion and Upscaling: The branch features are fused and passed through a second state-space module for long-range spatial aggregation, followed by a large-kernel separable block (LKSB) for localized detail enhancement. A final pixel-shuffle layer reconstructs the super-resolved (SR) frame.
This modular structure supports efficient sequence modeling with high-fidelity reconstruction, as evidenced by Figure 1 in (Liu et al., 25 Sep 2025).
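The final upscaling step can be illustrated with a minimal, dependency-free sketch of pixel shuffle (depth-to-space). The function name `pixel_shuffle`, the nested-list layout, and the factor `r = 2` are assumptions made for illustration. The channel ordering follows the convention commonly used by deep-learning libraries and is not taken from the MedVSR code.

```python
def pixel_shuffle(feat, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r)."""
    channels_in = len(feat)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(feat[0]), len(feat[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # row offset inside each r x r cell
            for j in range(r):      # column offset inside each r x r cell
                src = feat[c * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        out[c][y * r + i][x * r + j] = src[y][x]
    return out


# Hypothetical toy input: 4 channels of size 1x2, upscaled by r = 2.
lr_feat = [[[1, 2]], [[3, 4]], [[5, 6]], [[7, 8]]]
sr = pixel_shuffle(lr_feat, 2)
```

Each group of `r*r` input channels supplies the `r x r` sub-pixels of one output channel, so spatial resolution grows by `r` in each dimension without any interpolation.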
2. Cross State-Space Propagation (CSSP)
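CSSP builds on the classical linear state-space model (SSM), which in continuous time reads:
$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t).$$
Discretized by zero-order hold with step $\Delta$, the update becomes:
$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t,$$
where $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$.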
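As a rough illustration of zero-order-hold discretization, the sketch below runs the resulting recurrence for a scalar system, where the matrix exponential reduces to `math.exp`. The function names and all parameter values are hypothetical; this is the textbook SSM, not the frame-dependent variant used inside CSSP.

```python
import math


def zoh_discretize(a, b, delta):
    """Zero-order-hold discretization of the scalar system h' = a*h + b*x."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b   # scalar form of (ΔA)^-1 (exp(ΔA) - I) ΔB
    return a_bar, b_bar


def ssm_scan(xs, a, b, c, delta, h0=0.0):
    """Run h_t = a_bar*h_{t-1} + b_bar*x_t, y_t = c*h_t over a sequence."""
    a_bar, b_bar = zoh_discretize(a, b, delta)
    h, ys = h0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys


# Hypothetical parameters; a < 0 gives a stable system.
ys = ssm_scan([1.0] * 200, a=-1.0, b=2.0, c=1.0, delta=0.1)
```

For a constant input, the discrete output converges to the continuous steady state `-b*c/a`, which is a quick sanity check that the discretization is consistent with the continuous model.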
Within CSSP, the warp-aligned features of the distant frame and the features of the neighboring frame are partitioned into windows. Three frame-dependent quantities are then estimated from these windowed features:
- the SSM input,
- the SSM input transform,
- the control matrix.
After the hidden state is initialized, the state-space update is applied token by token within each window. The resulting tokens are reassembled and gated, and the gated features are passed through a deformable convolution to produce the aligned output. CSSP thereby filters spatio-temporal features, preserving only components that are consistent across frames.
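The windowing and reassembly of tokens described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `window_partition` and `window_reverse` are hypothetical, windows are non-overlapping, and each token is a single number, whereas real tokens would be feature vectors.

```python
def window_partition(feat, win):
    """Split an H x W map into non-overlapping win x win windows of tokens.

    Returns a list of windows in row-major order; each window is a flat
    list of win*win tokens, also in row-major order.
    """
    h, w = len(feat), len(feat[0])
    if h % win or w % win:
        raise ValueError("H and W must be multiples of the window size")
    windows = []
    for top in range(0, h, win):
        for left in range(0, w, win):
            windows.append([feat[top + dy][left + dx]
                            for dy in range(win) for dx in range(win)])
    return windows


def window_reverse(windows, win, h, w):
    """Inverse of window_partition: reassemble tokens into an H x W map."""
    out = [[None] * w for _ in range(h)]
    per_row = w // win
    for idx, tokens in enumerate(windows):
        top, left = (idx // per_row) * win, (idx % per_row) * win
        for k, tok in enumerate(tokens):
            out[top + k // win][left + k % win] = tok
    return out


feat = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 map
wins = window_partition(feat, 2)
```

A state-space scan would run over each flat token list independently, after which `window_reverse` restores the spatial layout.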
3. Inner State-Space Reconstruction (ISSR)
Post-CSSP features are fused by a convolution, partitioned into windows, and processed by the Inner State-Space Block (ISSB):
- Data-dependent SSM parameters (input, transforms, gate) are computed from the windowed features.
- A window-wise state-space scan is performed, and the resulting tokens are aggregated.
- The LKSB module then refines spatial structure, using a large-kernel convolution factorized into depthwise and pointwise convolutions.
ISSR thus combines nonlocal tissue structure preservation with local texture aggregation, critical for artifact suppression and boundary restoration.
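To see why a depthwise and pointwise factorization makes large kernels affordable, the following sketch compares weight counts for a dense convolution and its separable counterpart. The channel count and kernel size are hypothetical values chosen for illustration, not settings from the paper, and biases are ignored.

```python
def conv_params(channels, kernel):
    """Weights in a dense kernel x kernel conv with `channels` in and out."""
    return channels * channels * kernel * kernel


def separable_conv_params(channels, kernel):
    """Weights in a depthwise kernel x kernel conv plus a 1x1 pointwise conv."""
    depthwise = channels * kernel * kernel   # one k x k filter per channel
    pointwise = channels * channels          # 1x1 conv mixing channels
    return depthwise + pointwise


# Hypothetical sizes, not taken from the paper: 64 channels, 7x7 kernel.
dense = conv_params(64, 7)
separable = separable_conv_params(64, 7)
ratio = dense / separable
```

With these assumed sizes the dense layer needs 200,704 weights against 7,232 for the separable one, and the gap widens as the kernel grows.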
4. Training Regimen and Propagation Variants
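MedVSR is trained with the Charbonnier loss, a smooth approximation of the L1 loss:
$$\mathcal{L} = \sqrt{\lVert \hat{I} - I \rVert^2 + \epsilon^2},$$
where $\hat{I}$ is the SR frame, $I$ is the HR ground truth, and $\epsilon$ is a small constant.
No adversarial or perceptual objectives are used. The Adam optimizer is used with a learning rate that is cosine-decayed from its initial value over the course of training. Training uses cropped patches degraded by bicubic downsampling and i.i.d. Gaussian noise. Four propagation branches are used. All LR test inputs are generated with the same synthetic degradation scheme.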
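The training objective and schedule can be sketched in a few lines. This is a minimal illustration, assuming the per-pixel averaged form of the Charbonnier loss that implementations commonly use; `eps`, the learning rates, and the step counts are hypothetical values, not the paper's settings.

```python
import math


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean of sqrt((p - t)^2 + eps^2) over flat sequences of pixel values."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    total = sum(math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target))
    return total / len(pred)


def cosine_lr(step, total_steps, lr_init, lr_min):
    """Cosine-annealed learning rate from lr_init (step 0) to lr_min."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical values for illustration only.
loss = charbonnier_loss([0.2, 0.5, 0.9], [0.2, 0.4, 1.0])
lr_mid = cosine_lr(step=500, total_steps=1000, lr_init=1e-4, lr_min=1e-7)
```

Because of the `eps` term, the loss stays differentiable at zero error and behaves like an L1 penalty for large errors, which makes it less sensitive to outliers than a squared error.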
5. Evaluation Datasets and Preprocessing
MedVSR is benchmarked on four scenarios:
- HyperKvasir: Gastrointestinal endoscopy; standard train/val/test split.
- LDPolypVideo: Colonoscopy polyp scenes.
- EndoVis18: Robotic surgical instruments with abrupt scene changes.
- Cataract-101: Ophthalmic cataract surgery videos (101 sequences).
All training and test LR samples use bicubic downsampling plus Gaussian noise. For domain specialization, separate models are trained and evaluated for gastrointestinal and cataract scenarios.
6. Performance Metrics and Comparative Results
Quantitative assessment covers PSNR, SSIM, model size, FLOPs, and latency (measured on an RTX 4090 GPU). MedVSR achieves the leading PSNR and SSIM on all four datasets (see Table 1, Table 2 below; abbreviations: HK = HyperKvasir, LP = LDPolypVideo, EV = EndoVis18):
| Model | Params (M) | FLOPs (T) | HK-PSNR/SSIM | LP-PSNR/SSIM | EV-PSNR/SSIM | Cataract PSNR/SSIM |
|---|---|---|---|---|---|---|
| EDVR | 20.6 | 45.4 | 27.12/0.8483 | 30.08/0.8417 | 21.67/0.7883 | 31.78/0.9103 |
| BasicVSR | 6.3 | 8.6 | 31.46/0.8990 | 31.68/0.8650 | 30.72/0.8950 | 35.29/0.9232 |
| BasicVSR++ | 7.3 | 9.5 | 31.73/0.9042 | 31.67/0.8620 | 30.80/0.8958 | 35.54/0.9251 |
| VSRT | 32.6 | 112.6 | 31.78/0.8999 | 31.63/0.8596 | 30.59/0.8916 | 35.25/0.9230 |
| RVRT | 10.8 | 44.4 | 29.26/0.8944 | 31.32/0.8578 | 30.72/0.8932 | 35.64/0.9281 |
| TCNet | 9.6 | 39.6 | 31.11/0.8889 | 31.39/0.8547 | 30.17/0.8859 | 34.60/0.9177 |
| IART | 13.4 | 44.7 | 31.30/0.9030 | 31.70/0.8671 | 30.47/0.8924 | 36.08/0.9275 |
| MedVSR (Ours) | 7.2 | 9.5 | 32.10/0.9069 | 31.83/0.8673 | 30.83/0.8960 | 36.23/0.9308 |
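With 7.2 M parameters and 9.5 T FLOPs, MedVSR matches BasicVSR++ (7.3 M, 9.5 T) in cost and is cheaper than every other baseline in the table except BasicVSR (6.3 M, 8.6 T), while reporting the lowest or near-lowest runtime.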
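For readers unfamiliar with the PSNR figures in the table, the metric can be computed as in the sketch below. The function name and the `[0, 1]` pixel range are assumptions; the paper's exact evaluation protocol (color space, cropping, averaging over frames) is not specified here.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for flat sequences of pixel values."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")      # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


# Toy example: every pixel is off by 0.1 on a [0, 1] scale, so MSE is about 0.01.
value = psnr([0.6, 0.3], [0.5, 0.2])
```

Since PSNR is logarithmic in the mean squared error, the toy example lands at roughly 20 dB, and a 1 dB gain corresponds to about a 21% reduction in MSE.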
Ablation analyses show PSNR drops of 0.29, 0.28, and 0.36 dB when removing position embedding, local windows, or separate projections in CSSB, respectively (Table 3). In ISSB, eliminating the windows or replacing the concatenation gate also changes PSNR, and the LKSB performs best at a single kernel size, with performance dropping for other sizes.
7. Qualitative Analysis and Significance
MedVSR demonstrates effective suppression of motion-jitter and alignment artifacts, as shown in qualitative comparisons (Figure 2). It preserves tissue textures and subtle boundaries (Figure 3 and Figure 4), avoiding the creation of hallucinated structures observed in alternative methods. In cataract surgeries, tool and lens boundaries are maintained, which may support surgical navigation and decision-making.
This suggests that the adoption of state-space mechanisms for both temporal propagation (CSSP) and spatial detail enhancement (ISSR) confers a significant advantage in reconstructing diagnostically faithful imagery in challenging clinical video scenarios.
For further details and code, see the repository at https://github.com/CUHK-AIM-Group/MedVSR (Liu et al., 25 Sep 2025).