MedVSR: Medical Video Super-Resolution

Updated 3 July 2026

The paper presents MedVSR, a novel two-stage state-space framework that enhances clinical videos by addressing artifacts like camera shake and physiological motion.
It leverages Cross State-Space Propagation for robust cross-frame feature alignment and Inner State-Space Reconstruction for intra-frame detail enhancement.
Evaluations across multiple benchmarks show that MedVSR achieves superior PSNR/SSIM with fewer parameters and lower FLOPs, supporting improved diagnostic imaging.

MedVSR is a medical video super-resolution (VSR) framework designed for low-resolution (LR) clinical video data. Existing VSR models often fail to account for artifacts specific to medical video, including camera shake, physiological motion, and abrupt scene changes. These artifacts degrade alignment quality and introduce spurious features that may compromise diagnostic accuracy. MedVSR introduces a two-stage state-space modeling strategy that integrates Cross State-Space Propagation (CSSP) for cross-frame feature selection and Inner State-Space Reconstruction (ISSR) for intra-frame structural enhancement. The architecture achieves leading quantitative and qualitative results across multiple medical VSR benchmarks while remaining computationally efficient (Liu et al., 25 Sep 2025).

1. Architectural Overview

MedVSR operates on LR video inputs $F_1, \dots, F_T$ to produce high-resolution (HR) outputs $F'_1, \dots, F'_T$ at a given upsampling factor (typically $\times 4$). The architecture proceeds in three principal stages:

2D Convolutional Feature Extraction: Each LR frame is independently encoded to generate spatial features.
Bidirectional CSSP Propagation Branches: For each time step $t$ and branch $j$, bidirectionally propagated features $\{f_{t-1}^j, f_{t-2}^j\}$ are aligned and filtered using a state-space mechanism, yielding $\hat{f}_t^j$.
ISSR Fusion and Upscaling: Features $\{\hat{f}_t^1, \dots, \hat{f}_t^J\}$ are fused and passed through a second state-space module for long-range spatial aggregation, followed by a large-kernel separable block (LKSB) for localized detail enhancement, generating $g_t$. A final pixel-shuffle layer reconstructs the SR frame $F'_t$.

This modular structure supports efficient sequence modeling with high-fidelity reconstruction, as illustrated in Figure 1 of Liu et al. (25 Sep 2025).
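The final pixel-shuffle step can be illustrated with a minimal sketch in plain Python. It assumes the common sub-pixel convolution layout, in which a tensor of shape (C·r², H, W) is rearranged into (C, H·r, W·r). The function name `pixel_shuffle`, the nested-list representation, and the toy factor r = 2 are illustrative choices, not taken from the MedVSR code.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r)."""
    channels_in, height, width = len(x), len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    out = [[[0.0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(r):          # row offset inside each r x r output cell
            for j in range(r):      # column offset inside each r x r output cell
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out


# Toy input: 4 channels of size 1x2 become one 2x4 channel for r = 2.
features = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
upscaled = pixel_shuffle(features, 2)
```

Each group of r² input channels supplies the r × r sub-pixel positions of one output channel, so spatial resolution grows without interpolation.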

2. Cross State-Space Propagation (CSSP)

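CSSP extends the classical linear state-space model (SSM), which in continuous time reads:

$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$

Discretized by zero-order hold with step $\Delta$, the update becomes:

$h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k$

where $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$.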
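As a minimal sketch of zero-order-hold discretization, the following plain-Python example treats a scalar state, for which $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\exp(\Delta A) - 1)\,B / A$. The function names and the toy parameter values are assumptions made for illustration. The frame-dependent parameterization used in MedVSR is not reproduced here.

```python
import math


def zoh_discretize(a, b, delta):
    """Zero-order-hold discretization of h'(t) = a*h(t) + b*x(t), scalar a != 0."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar


def ssm_scan(xs, a, b, c, delta, h0=0.0):
    """Run h_k = a_bar*h_{k-1} + b_bar*x_k and y_k = c*h_k over a sequence."""
    a_bar, b_bar = zoh_discretize(a, b, delta)
    h, ys = h0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys


# Constant input: the discrete outputs match the continuous solution
# h(t) = 1 - exp(-t) sampled at t = 0.5, 1.0, 1.5, 2.0.
outputs = ssm_scan([1.0, 1.0, 1.0, 1.0], a=-1.0, b=1.0, c=1.0, delta=0.5)
```

For a piecewise-constant input, zero-order hold is exact, which is why the toy outputs coincide with the sampled continuous-time solution.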

Within CSSP, the warp-aligned features of the distant frame and the features of the neighboring frame are partitioned into local windows. Three frame-dependent quantities are then estimated:

SSM input
SSM input transform
Control matrix

The hidden state is initialized, and the discretized state-space update is applied to each token in turn using these frame-dependent parameters.

Tokens are reassembled and gated, and the gated result is passed through a deformable convolution to produce the aligned output $\hat{f}_t^j$. CSSP thereby filters spatio-temporal features, preserving only components that are consistent across frames.
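The windowing and token reassembly steps can be sketched as follows. This is only an illustration of non-overlapping window partitioning on a single-channel grid. The window size, the row-by-row token order, and the helper names `window_partition` and `window_reverse` are assumptions, since the text does not specify them.

```python
def window_partition(feature, win):
    """Split an H x W grid (nested lists) into non-overlapping win x win windows.

    Each window is flattened row by row into a token sequence.
    """
    height, width = len(feature), len(feature[0])
    if height % win or width % win:
        raise ValueError("height and width must be multiples of win")
    windows = []
    for top in range(0, height, win):
        for left in range(0, width, win):
            tokens = [feature[top + i][left + j]
                      for i in range(win) for j in range(win)]
            windows.append(tokens)
    return windows


def window_reverse(windows, win, height, width):
    """Inverse of window_partition: reassemble token sequences into an H x W grid."""
    feature = [[None] * width for _ in range(height)]
    per_row = width // win
    for idx, tokens in enumerate(windows):
        top, left = (idx // per_row) * win, (idx % per_row) * win
        for n, value in enumerate(tokens):
            feature[top + n // win][left + n % win] = value
    return feature


grid = [[r * 4 + c for c in range(4)] for r in range(4)]
windows = window_partition(grid, 2)
restored = window_reverse(windows, 2, 4, 4)
```

Scanning within windows keeps each token sequence short, so the state-space recurrence runs over local neighborhoods rather than the full frame.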

3. Inner State-Space Reconstruction (ISSR)

Post-CSSP features are fused by a convolution, partitioned into windows, and processed via the Inner State-Space Block (ISSB):

Data-dependent SSM parameters (input, transforms, gate) are computed from the windowed features.
A window-wise state-space scan is performed, and the resulting tokens are aggregated.

The LKSB module then refines spatial structure using a large-kernel convolution factorized into depthwise and pointwise convolutions.

ISSR thus combines nonlocal tissue structure preservation with local texture aggregation, critical for artifact suppression and boundary restoration.
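The appeal of the depthwise and pointwise factorization in LKSB can be seen from a simple weight count. The sketch below compares a dense k × k convolution with a depthwise k × k convolution followed by a 1 × 1 pointwise convolution. The channel count (64) and kernel size (7) are hypothetical values chosen only for illustration and are not MedVSR's configuration.

```python
def dense_conv_params(channels_in, channels_out, kernel):
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return channels_in * channels_out * kernel * kernel


def separable_conv_params(channels_in, channels_out, kernel):
    """Weights in a depthwise kernel x kernel conv followed by a 1x1 pointwise conv."""
    depthwise = channels_in * kernel * kernel
    pointwise = channels_in * channels_out
    return depthwise + pointwise


# Hypothetical sizes, chosen only for illustration.
channels, kernel = 64, 7
dense = dense_conv_params(channels, channels, kernel)          # 200704 weights
separable = separable_conv_params(channels, channels, kernel)  # 7232 weights
```

Because the depthwise term grows with k² per channel rather than per channel pair, large kernels remain affordable under this factorization.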

4. Training Regimen and Propagation Variants

MedVSR is trained with the Charbonnier loss:

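$\mathcal{L} = \sqrt{\lVert F'_t - F^{\mathrm{GT}}_t \rVert^2 + \epsilon^2}$

where $F^{\mathrm{GT}}_t$ denotes the ground-truth HR frame and $\epsilon$ is a small constant.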

No adversarial or perceptual objectives are used. The Adam optimizer is used with a cosine-decayed learning rate. Training uses cropped patches degraded by bicubic $\times 4$ downsampling and i.i.d. Gaussian noise. Four propagation branches ($J = 4$) are used. All LR test inputs are generated consistently with this synthetic degradation scheme.
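A minimal sketch of the two training ingredients named above, the Charbonnier loss and a cosine learning-rate decay, is given below in plain Python. The per-pixel averaging, the value of `eps`, and the learning-rate endpoints and step count are assumptions made for illustration. They are not the paper's settings.

```python
import math


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean of sqrt((p - t)^2 + eps^2) over paired pixel values."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    total = sum(math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target))
    return total / len(pred)


def cosine_lr(step, total_steps, lr_max, lr_min):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Hypothetical values for illustration only.
loss = charbonnier_loss([0.5, 0.2, 0.9], [0.5, 0.4, 0.6])
schedule = [cosine_lr(s, 100, 1e-4, 1e-7) for s in (0, 50, 100)]
```

The Charbonnier penalty behaves like an L1 loss for large errors while remaining differentiable at zero, which is one reason it is widely used for restoration tasks.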

5. Evaluation Datasets and Preprocessing

MedVSR is benchmarked on four scenarios:

HyperKvasir: Gastrointestinal endoscopy; standard train/val/test split.
LDPolypVideo: Colonoscopy polyp scenes.
EndoVis18: Robotic surgical instruments with abrupt scene changes.
Cataract-101: Ophthalmic cataract surgery videos (101 sequences).

All training and test LR samples use bicubic downsampling plus Gaussian noise. For domain specialization, separate models are trained and evaluated for gastrointestinal and cataract scenarios.

6. Performance Metrics and Comparative Results

Quantitative assessment covers PSNR, SSIM, model size, FLOPs, and latency (measured on an NVIDIA RTX 4090 GPU). MedVSR achieves the highest PSNR and SSIM on all four datasets (see Table 1, Table 2 below; abbreviations: HK = HyperKvasir, LP = LDPolypVideo, EV = EndoVis18):

Model	Params (M)	FLOPs (T)	HK-PSNR/SSIM	LP-PSNR/SSIM	EV-PSNR/SSIM	Cataract PSNR/SSIM
EDVR	20.6	45.4	27.12/0.8483	30.08/0.8417	21.67/0.7883	31.78/0.9103
BasicVSR	6.3	8.6	31.46/0.8990	31.68/0.8650	30.72/0.8950	35.29/0.9232
BasicVSR++	7.3	9.5	31.73/0.9042	31.67/0.8620	30.80/0.8958	35.54/0.9251
VSRT	32.6	112.6	31.78/0.8999	31.63/0.8596	30.59/0.8916	35.25/0.9230
RVRT	10.8	44.4	29.26/0.8944	31.32/0.8578	30.72/0.8932	35.64/0.9281
TCNet	9.6	39.6	31.11/0.8889	31.39/0.8547	30.17/0.8859	34.60/0.9177
IART	13.4	44.7	31.30/0.9030	31.70/0.8671	30.47/0.8924	36.08/0.9275
MedVSR (Ours)	7.2	9.5	32.10/0.9069	31.83/0.8673	30.83/0.8960	36.23/0.9308

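With 7.2M parameters and 9.5T FLOPs, MedVSR is among the lightest models in the comparison. Only BasicVSR is smaller (6.3M parameters, 8.6T FLOPs), and BasicVSR++ matches its FLOPs with slightly more parameters (7.3M). Runtime is likewise reported as lowest or near-lowest.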

Ablation analyses show PSNR drops of 0.29, 0.28, and 0.36 dB when removing position embedding, local windows, or separate projections in CSSB, respectively (Table 3). In ISSB, eliminating windows or replacing the concatenation gate also changes PSNR. The LKSB performs best at one particular kernel size, with performance drops for other sizes.
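For reference, the PSNR figures quoted in this section follow the standard definition $10 \log_{10}(\mathrm{peak}^2 / \mathrm{MSE})$. The sketch below computes it for flat lists of pixel values. The function name, the [0, 1] intensity range, and the toy inputs are assumptions for illustration. The paper's exact evaluation protocol (for example, color space or cropping) is not specified here.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy example: two of four pixels are off by 0.1, so MSE = 0.005.
reference = [0.0, 0.5, 1.0, 0.5]
estimate = [0.1, 0.5, 0.9, 0.5]
value = psnr(reference, estimate)
```

Because PSNR is logarithmic in the mean squared error, a gain of a few tenths of a dB corresponds to a reduction in MSE of several percent.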

7. Qualitative Analysis and Significance

MedVSR demonstrates effective suppression of motion-jitter and alignment artifacts, as shown in qualitative comparisons (Figure 2). It preserves tissue textures and subtle boundaries (Figure 3 and Figure 4), avoiding the creation of hallucinated structures observed in alternative methods. In cataract surgeries, tool and lens boundaries are maintained, which may support surgical navigation and decision-making.

This suggests that the adoption of state-space mechanisms for both temporal propagation (CSSP) and spatial detail enhancement (ISSR) confers a significant advantage in reconstructing diagnostically faithful imagery in challenging clinical video scenarios.

For further details and code, see the repository at https://github.com/CUHK-AIM-Group/MedVSR (Liu et al., 25 Sep 2025).


References (1)

Liu et al. MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation (25 Sep 2025)
