AIM 2025 Benchmark Overview
- AIM 2025 Benchmark is a comprehensive suite of state-of-the-art evaluation tasks in image processing, generative modeling, and scientific machine learning that leverages diverse, domain-specific datasets.
- The benchmark outlines strict evaluation protocols and efficiency budgets across tasks such as RAW video denoising, motion deblurring, perceptual super resolution, inverse tone mapping, and AI idea generation.
- It enforces reproducibility and open science by mandating standardized submission procedures, public data/code releases, and rigorous metric-based comparisons to ensure fair algorithm assessments.
The AIM 2025 Benchmark refers to a series of rigorously designed benchmarks covering diverse, state-of-the-art challenges in image processing, generative modeling, and scientific machine learning, all released and evaluated as part of the 2025 Advances in Image Manipulation (AIM) workshop series. It encompasses multiple public datasets, standardized evaluation protocols, and curated challenge tasks intended to foster fair, reproducible comparison of new algorithms in both low-level computer vision restoration and emerging AI domains.
1. Benchmark Scope and Dataset Design
AIM 2025 introduces domain-specific datasets and evaluation tracks in areas including low-light RAW video denoising, high FPS non-uniform motion deblurring, efficient perceptual super resolution, real-world RAW image denoising, efficient single-image deblurring, rip current instance segmentation, inverse tone mapping, and scientific idea generation. Each dataset is constructed with explicit attention to domain realism, ground-truth rigor, and cross-device/cross-condition generalization:
- Low-Light RAW Video Denoising: 756 ten-frame video bursts from 14 smartphone sensors, with exposure and illuminance variation and high-SNR references via burst averaging. Test scenes are held out for fair evaluation (Yakovenko et al., 22 Aug 2025).
- High FPS Motion Deblurring: MIORe dataset, with algorithmic motion blur synthesis based on high-speed camera sequences, designed for moderate and extreme blur tracks (Ciubotariu et al., 8 Sep 2025).
- Efficient Perceptual Super Resolution (PSR4K): 500 4K test images subjected to multiple blind degradations, assessed with NR-IQA metrics under compute/parameter constraints (Longarela et al., 14 Oct 2025).
- Real-World RAW Image Denoising: Four DSLR models, extensive low-light indoor/outdoor captures, with per-image dark frames and calibrated gain/offset metadata enabling precise synthetic noise modeling (Li et al., 8 Oct 2025).
- Efficient Real-World Deblurring: New 420-pair test split from the RSBlur dataset, captured with a beam-splitter dual-camera apparatus for pixel-precise blurred/sharp pairs (Feijoo et al., 14 Oct 2025).
- Rip Current Segmentation: RipVIS corpus of 27,718 annotated images across diverse geographies and rip types for robust instance segmentation under natural visual ambiguity (Dumitriu et al., 18 Aug 2025).
- Inverse Tone Mapping (ITM): ~19,000 simulated LDR/HDR training pairs, 100 validation and 100 test images, using a perceptually uniform photometric encoding (Wang et al., 19 Aug 2025).
- AI Research Idea Generation: AI Idea Bench 2025, comprising 3,495 AI papers and 5 “inspiration references” per paper, with paired automatic ground-truth summaries (Qiu et al., 19 Apr 2025).
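The burst-averaging idea behind the high-SNR references in the low-light RAW video track can be illustrated with a minimal sketch. It assumes the frames are already aligned and the scene is static, and it uses synthetic Gaussian noise; the function name `burst_average`, the frame count, and the noise level are illustrative choices, not details taken from the challenge report.

```python
import numpy as np

def burst_average(frames: np.ndarray) -> np.ndarray:
    """Average a stack of aligned frames with shape (N, H, W) into one reference."""
    if frames.ndim != 3:
        raise ValueError("expected a stack of frames with shape (N, H, W)")
    # Accumulate in float64 so integer sensor data neither overflows nor rounds.
    return frames.astype(np.float64).mean(axis=0)

# Synthetic demo (assumption): a constant scene plus zero-mean Gaussian noise.
rng = np.random.default_rng(0)
clean = np.full((32, 32), 100.0)
noise_sigma = 5.0
burst = clean + rng.normal(0.0, noise_sigma, size=(64, 32, 32))

reference = burst_average(burst)
single_frame_std = float((burst[0] - clean).std())
reference_std = float((reference - clean).std())
# For independent noise the residual std shrinks roughly as sigma / sqrt(N).
print(f"single frame: {single_frame_std:.2f}, averaged: {reference_std:.2f}")
```

Real captures would also need alignment and outlier handling before averaging; this sketch only shows why averaging many frames raises the SNR of the reference.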
2. Challenge Tasks and Protocols
Each AIM 2025 benchmark specifies the problem to be solved (supervised regression, enhancement, or generation) together with the procedure for valid participation and test submission:
- Input/Output Constraints: RAW, mosaicked, or LDR inputs must be processed directly under strict limitations; for example, the RAW video and RAW image denoising tracks permit no demosaicking, ISP processing, or non-linear color conversion (Yakovenko et al., 22 Aug 2025, Li et al., 8 Oct 2025).
- Submission Requirements: Test submissions are sometimes limited to central crops, RAW format, or byte precision to prevent trivial overfitting or data leakage. Reproducibility is enforced through mandatory code submission and verification checks (Dumitriu et al., 18 Aug 2025).
- Efficiency Budgets: Many tracks enforce strict model constraints (e.g., ≤5M parameters and ≤2,000 GFLOPs in super resolution, ≤200 GMACs in efficient deblurring) to reflect deployment needs (Longarela et al., 14 Oct 2025, Feijoo et al., 14 Oct 2025).
- Holdout Test Sets: Benchmark leaderboards are determined solely by performance on private, previously unreleased test splits, ensuring strong generalization validity (Yakovenko et al., 22 Aug 2025, Li et al., 8 Oct 2025).
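A rough sketch of how a parameter and MAC budget of the kind listed above could be checked for a small convolutional model follows. The `Conv2d` description class, the three-layer toy network, and the 256×256 input size are hypothetical; the resolution at which each track measures its budget is not stated here, and official counts come from the organizers' own profiling tools.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

@dataclass(frozen=True)
class Conv2d:
    """Hypothetical layer description, not tied to any framework."""
    in_ch: int
    out_ch: int
    kernel: int
    stride: int = 1
    bias: bool = True

    def params(self) -> int:
        weights = self.kernel * self.kernel * self.in_ch * self.out_ch
        return weights + (self.out_ch if self.bias else 0)

    def macs(self, h: int, w: int) -> Tuple[int, int, int]:
        """Return (MACs, out_h, out_w), assuming 'same' padding."""
        out_h = -(-h // self.stride)  # ceiling division
        out_w = -(-w // self.stride)
        per_pixel = self.kernel * self.kernel * self.in_ch * self.out_ch
        return per_pixel * out_h * out_w, out_h, out_w

def profile(layers: Iterable[Conv2d], h: int, w: int) -> Tuple[int, int]:
    """Sum parameters and multiply-accumulates over a sequential stack."""
    total_params = total_macs = 0
    for layer in layers:
        total_params += layer.params()
        layer_macs, h, w = layer.macs(h, w)
        total_macs += layer_macs
    return total_params, total_macs

def within_budget(params: int, macs: int, max_params: float, max_gmacs: float) -> bool:
    return params <= max_params and macs / 1e9 <= max_gmacs

# Toy network (assumption) profiled at an assumed 256x256 input.
toy = [Conv2d(3, 32, 3), Conv2d(32, 32, 3), Conv2d(32, 3, 3)]
total_params, total_macs = profile(toy, 256, 256)
# Budget figures taken from the text: 5M parameters, 200 GMACs.
ok = within_budget(total_params, total_macs, max_params=5e6, max_gmacs=200)
print(total_params, f"{total_macs / 1e9:.3f} GMACs", ok)
```

Note that FLOP and MAC conventions differ between tools (one MAC is often counted as two FLOPs), so a budget stated in GFLOPs should be compared using the same convention as the challenge's profiler.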
3. Evaluation Metrics and Ranking
Each benchmark adopts domain-relevant objective metrics and consensus ranking procedures, calibrated to the task's intended end goals:
| Track | Metrics (higher is better unless marked ↓) | Aggregation/Ranking |
|---|---|---|
| RAW Video Denoising | PSNR, SSIM | Mean(rank_PSNR, rank_SSIM) |
| Motion Deblurring | PSNR, SSIM, LPIPS (↓) | Mean(ranks), LPIPS as perceptual |
| Super-Resolution (PSR4K) | PI (↓), CLIP-IQA, MANIQA | Weighted exp-based score, lower is better |
| Real-World RAW Denoising | PSNR, SSIM, LPIPS (↓), ARNIQA, TOPIQ | Mean ranks, stratified by scene type |
| Rip Current Segmentation | F₁, F₂, AP₅₀, AP₍₅₀:₉₅₎ | 0.3×F₁ + 0.3×F₂ + 0.3×AP₅₀ + 0.1×AP₍₅₀:₉₅₎ |
| Inverse Tone Mapping | PU21-PSNR, PU21-SSIM | Report best/mean, model size as reference |
| Idea Generation | Alignment (I2T, I2I, IMCQ), Reference-based (IC, NA, FA, FPS) | No single aggregate score; all subscores reported |
All metrics are computed under standardized pre-processing (e.g., pixel alignment, RAW domain, perceptually uniform encoding) as defined in the respective challenge protocols.
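The aggregation rules in the table can be sketched in a few lines. The code below is a minimal illustration, assuming that tied teams share the average of their positions (the tie-breaking rule is not specified here); the team names and scores are invented for the demo, and `f_beta` is the standard F-measure definition rather than code from any challenge kit.

```python
def ranks(scores, higher_is_better=True):
    """Rank teams 1..N (1 = best); tied teams share the average position."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    result = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + 1 + j + 1) / 2
        for team in ordered[i:j + 1]:
            result[team] = shared
        i = j + 1
    return result

def mean_rank(metric_tables, lower_is_better=()):
    """Average each team's per-metric rank; a lower mean rank is better."""
    per_metric = [
        ranks(table, higher_is_better=name not in lower_is_better)
        for name, table in metric_tables.items()
    ]
    teams = list(next(iter(metric_tables.values())))
    return {t: sum(r[t] for r in per_metric) / len(per_metric) for t in teams}

def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; beta > 1 weights recall more heavily."""
    denom = beta * beta * precision + recall
    return 0.0 if denom == 0 else (1 + beta * beta) * precision * recall / denom

def ripseg_composite(f1, f2, ap50, ap50_95):
    """Weighted sum from the rip current segmentation row of the table."""
    return 0.3 * f1 + 0.3 * f2 + 0.3 * ap50 + 0.1 * ap50_95

# Invented scores for three hypothetical teams.
tables = {
    "PSNR": {"A": 40.1, "B": 39.5, "C": 40.6},
    "SSIM": {"A": 0.970, "B": 0.975, "C": 0.965},
    "LPIPS": {"A": 0.10, "B": 0.12, "C": 0.08},
}
final = mean_rank(tables, lower_is_better={"LPIPS"})
leaderboard = sorted(final, key=final.get)
print(leaderboard, final)
print(round(ripseg_composite(0.8, 0.6, 0.7, 0.4), 4))
```

Rank averaging makes the result insensitive to the scale of each metric, which is why PSNR in decibels and SSIM in [0, 1] can be combined without normalization.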
4. State-of-the-Art Baseline Methods
AIM 2025 explicitly benchmarks a diverse suite of architectures—ranging from CNNs and transformers to generative diffusion and hybrid pipelines—under normalized training/evaluation and efficiency constraints. Brief highlights:
- NAFNet and Variants: U-Net backbones built from NAFBlocks, used in denoising, deblurring, and ITM, often augmented with lightweight or global attention and with reparameterization for efficiency (Yakovenko et al., 22 Aug 2025, Li et al., 8 Oct 2025, Jbara, 15 Oct 2025, Feijoo et al., 14 Oct 2025).
- Restormer and Transformer Models: Multi-Dconv Head Transposed Attention, Gated Dconv FFN, and progressive scale training in real-world deblurring and denoising (Feijoo et al., 14 Oct 2025, Li et al., 8 Oct 2025).
- VPEG/EVSSM: Frequency-domain transformer hybrids, efficient state-space mixing, and high-quality recovery in deblurring and perceptual super-resolution (Ciubotariu et al., 8 Sep 2025, Longarela et al., 14 Oct 2025).
- Domain Generalization and Data Synthesis: Unsupervised domain adaptation (RipSeg HRDA), hybrid noise models (e.g., PMNNP, dark-frame resampling), multi-scale neural augmentation (FACM), and diffusion-based augmentation (HDRer) (Dumitriu et al., 18 Aug 2025, Li et al., 8 Oct 2025, Wang et al., 19 Aug 2025).
Reported leaderboard results show that lightweight designs with tailored attention and multi-stage training now match or exceed larger models in most fidelity and perceptual tracks, especially when robust data synthesis and domain adaptation are implemented.
5. Reproducibility, Open Science, and Best Practices
AIM 2025 enforces strict dataset splits, code release, and standardization to maximize comparability and community utility:
- Repository Structure: All data (including RAWs, clean/dark, degradations, annotations) and baseline code are released via public GitHub/Codalab links, with prescribed input pipelines and inference APIs (Yakovenko et al., 22 Aug 2025, Li et al., 8 Oct 2025).
- Documentation and Quick Start: Each challenge provides demonstration notebooks (e.g., for CIF parsing, graph construction, evaluation scripts), and explicit model conversion paths for deployment in PyTorch Geometric, DGL, or competition-specific frameworks (Han et al., 4 Jun 2025, Li et al., 8 Oct 2025).
- Extension Procedures: Datasets are open for pull requests, with prespecified protocols for new data inclusion and recomputation of derived features or labels using consistent tools and pseudopotentials (Han et al., 4 Jun 2025).
- Best Practices: Strict exclusion of test entries from pre-training/validation, error breakdown by class or system type, reporting of both fidelity and perceptual metrics, and consideration of class imbalance or domain shifts in modeling and evaluation (Han et al., 4 Jun 2025, Li et al., 8 Oct 2025).
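One simple way to act on the test-exclusion practice above is to fingerprint files and look for overlap between splits. The sketch below is an assumption-laden illustration: the file names and byte payloads are made up, and exact SHA-256 matching only catches verbatim duplicates, not re-encoded or cropped copies.

```python
import hashlib

def fingerprint(payload: bytes) -> str:
    """Content hash of a file's raw bytes."""
    return hashlib.sha256(payload).hexdigest()

def find_leaks(train_items, test_items):
    """Return names of training items whose bytes also appear in the test split."""
    test_hashes = {fingerprint(data) for data in test_items.values()}
    return sorted(
        name for name, data in train_items.items()
        if fingerprint(data) in test_hashes
    )

# In-memory stand-ins for files (hypothetical names and contents).
train = {"scene_001.raw": b"\x00\x01\x02", "scene_002.raw": b"\x03\x04\x05"}
test = {"held_out_17.raw": b"\x03\x04\x05"}

leaks = find_leaks(train, test)
print(leaks)
```

In practice the same check would be run over files read from disk, and near-duplicate detection (for example, perceptual hashing) would be needed to catch altered copies.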
6. Impact, Key Insights, and Future Directions
The AIM 2025 Benchmarks have advanced the state of the art in multiple image restoration and generative AI subfields, as evidenced by leaderboard results, method diversity, and metric gains:
- Substantial performance improvements: Multi-frame/temporal aggregation is shown to provide significant gains over single-frame baselines in extreme RAW denoising; efficient attention or frequency-aware models demonstrate near-oracle performance at fractions of previous compute cost (Yakovenko et al., 22 Aug 2025, Feijoo et al., 14 Oct 2025).
- Evaluation bottlenecks: NR-IQA metrics for perceptual SR and deblurring do not always penalize hallucinated details or artifacts, indicating an open need for more artifact-sensitive, robust evaluation frameworks (Longarela et al., 14 Oct 2025).
- Open research questions: Persistent failures include over-smoothing, residual artifacts in low SNR, and imperfect domain generalization—highlighted by leaderboard plateaus in RipSeg (max 0.68 composite score), super-resolution, and extreme deblurring (Dumitriu et al., 18 Aug 2025, Longarela et al., 14 Oct 2025, Ciubotariu et al., 8 Sep 2025).
- Future directions: These include enforcing even stricter resource constraints, integrating metadata-aware or self-adaptive models, expanding datasets to cover additional modalities (e.g., temporal or cross-domain), and extending benchmarks to unsupervised or self-supervised protocols (Li et al., 8 Oct 2025, Qiu et al., 19 Apr 2025).
The AIM 2025 Benchmarks serve as reproducible testbeds for the next generation of low-level vision and scientific machine learning algorithms, enabling researchers to conduct credible, head-to-head comparisons that meaningfully drive progress in the field.