
Res-Bench: Testing MLLM Resolution Robustness

Updated 26 October 2025
  • Res-Bench is a benchmarking suite that evaluates multimodal language models by systematically testing performance across 12 image resolution levels to reflect real-world variability.
  • It introduces stability-oriented metrics (Acc₍res₎, ACE, RCE, and Spearman’s ρ) that quantify both accuracy and performance volatility, distinguishing model-centric from task-centric robustness.
  • Experiments with the framework show that super-resolution preprocessing and mixed-resolution fine-tuning can both significantly improve model resilience under variable image quality.

Res-Bench is a benchmarking suite specifically created to evaluate the robustness of multimodal LLMs (MLLMs) when presented with images at varying input resolutions. Unlike previous paradigms that primarily assess semantic accuracy, Res-Bench introduces quantitative stability metrics to characterize performance volatility as image resolution changes. This enables rigorous diagnosis of both model-centric and task-centric resolution robustness, furnishing insights into the reliability and adaptability of MLLMs in real-world scenarios where input quality is variable.

1. Dataset Construction and Structure

Res-Bench is constructed from 1,200 manually verified image–question samples spanning a broad set of multimodal vision–language tasks. Each base sample is systematically downsampled to 12 distinct resolution levels, yielding 14,400 evaluation instances in total. These resolution levels allow fine-grained probing of model behavior under controlled degradation of image quality.
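The exact pixel dimensions of the 12 levels are not reproduced here; as an illustrative sketch (the geometric scale schedule and function name are assumptions, not the published ladder), the resolution variants for one base image could be generated with Pillow as follows:

```python
from PIL import Image

def build_resolution_ladder(path, n_levels=12, min_scale=0.05):
    """Downsample one base image into n_levels resolution variants.

    The geometric scale schedule (1.0 down to min_scale) is an
    illustrative assumption, not Res-Bench's published ladder.
    """
    img = Image.open(path).convert("RGB")
    w, h = img.size
    ladder = []
    for i in range(n_levels):
        scale = min_scale ** (i / (n_levels - 1))  # 1.0 at i=0, min_scale at i=n-1
        size = (max(1, int(w * scale)), max(1, int(h * scale)))
        ladder.append(img.resize(size, Image.BICUBIC))
    return ladder  # 1,200 base samples x 12 levels = 14,400 instances
```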

The benchmark encompasses six primary capability dimensions:

1. Coarse-grained Perception (e.g., style, scene type)
2. Fine-grained Perception (e.g., attribute, location, counting)
3. Instance Reasoning (e.g., cross-instance, spatial relations)
4. Logical Reasoning (e.g., chart/diagram, science/tech)
5. Mathematical Reasoning (e.g., function calculation, geometry, statistics)
6. Optical Character Recognition (OCR; key information, scene-text understanding)

The six dimensions are further subdivided into 15 sub-capabilities, supporting thorough cross-domain, fine-grained evaluation of visual–textual tasks with diverse semantic and structural demands.

2. Robustness-Oriented Evaluation Metrics

To move beyond standard accuracy measurement, Res-Bench introduces metrics that rigorously quantify how performance changes with input resolution:

  • Accuracy at Specific Resolution (Acc₍res₎): Binary scoring (1 if correct, 0 if not) of MLLM answers given images at each resolution level.

$$\text{Acc}_{(\text{res})} = \text{Score}\left(\text{GT},\ \text{MLLM}\left(I_{(\text{res})}\right)\right)$$

  • Average Accuracy (Acc₍avg₎): Aggregates accuracy across all resolution levels for a model/sample.

$$\text{Acc}_{(\text{avg})} = \frac{1}{N_{\text{res}}} \sum_{i=1}^{N_{\text{res}}} \text{Acc}_{(\text{res}_i)}$$

  • Spearman’s Correlation Coefficient (ρ): Measures the monotonicity between accuracy and resolution rank; a high ρ indicates stable, predictable degradation (or improvement) as resolution changes.

$$\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the difference between the resolution rank and the accuracy rank of level $i$, and $n$ is the number of resolution levels.

  • Absolute Continuous Error (ACE): Sums the absolute difference in accuracy between adjacent resolution levels, capturing volatility:

$$\text{ACE} = \sum_{i=1}^{n-1} \left| \text{Acc}_{(\text{res}_{i+1})} - \text{Acc}_{(\text{res}_i)} \right|$$

  • Relative Continuous Error (RCE): Normalizes ACE by average accuracy to allow fair comparison of fluctuation magnitude across models with differing overall competence:

$$\text{RCE} = \frac{\text{ACE}}{\text{Acc}_{(\text{avg})}}$$

Together, these metrics facilitate both absolute and stability-based assessment, enabling nuanced understanding of model behaviors that would otherwise be concealed under pure accuracy reporting.
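As a concrete reading of these definitions, the following minimal Python sketch (function name and example accuracies are illustrative, not from the benchmark) computes Acc₍avg₎, ACE, RCE, and Spearman’s ρ from a vector of per-resolution accuracies:

```python
import numpy as np
from scipy.stats import spearmanr

def resolution_metrics(acc):
    """Compute Res-Bench-style stability metrics from per-resolution
    accuracies, ordered from lowest to highest resolution."""
    acc = np.asarray(acc, dtype=float)
    acc_avg = acc.mean()                                   # Acc_avg
    ace = np.abs(np.diff(acc)).sum()                       # ACE: adjacent-level volatility
    rce = ace / acc_avg if acc_avg > 0 else float("inf")   # RCE: ACE normalized by Acc_avg
    rho, _ = spearmanr(np.arange(len(acc)), acc)           # rho vs. resolution rank
    return {"acc_avg": acc_avg, "ace": ace, "rce": rce, "rho": rho}

# Example with hypothetical accuracies over 12 resolution levels.
print(resolution_metrics([0.42, 0.45, 0.43, 0.55, 0.60, 0.58,
                          0.66, 0.71, 0.70, 0.74, 0.76, 0.78]))
```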

3. Model-Centric and Task-Centric Robustness Analysis

Evaluation protocols encompass two axes:

  • Model-centric: Comprehensive testing of SOTA MLLMs, both proprietary and open-source, reveals a pervasive lack of resolution robustness. Models able to process images natively at high resolution often achieve superior peak accuracy, but their performance exhibits abrupt drops or non-monotonic shifts as input resolution decreases. Patch-based approaches distributing processing across smaller image regions confer improved consistency (lower ACE/RCE) but at the expense of lower absolute accuracy ceilings.
  • Task-centric: Resolution sensitivity is highly task-dependent. Coarse-grained perception tasks are relatively robust to downsampling, maintaining performance over a broad resolution spectrum. In contrast, fine-grained tasks (e.g., counting, detailed attribute recognition, OCR) display pronounced susceptibility, with accuracy fluctuating sharply as resolution degrades. The combination of Spearman’s ρ, ACE, and RCE uncovers cases where high average accuracy masks significant volatility, as the numerical sketch after this list illustrates.
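To make the masking effect concrete, consider two hypothetical models with identical average accuracy but very different stability (all values invented for illustration):

```python
import numpy as np

# Two hypothetical accuracy curves over 12 resolution levels (invented values).
stable   = np.array([0.60, 0.61, 0.62, 0.62, 0.63, 0.64,
                     0.64, 0.65, 0.66, 0.66, 0.67, 0.68])
volatile = np.array([0.40, 0.75, 0.50, 0.80, 0.45, 0.85,
                     0.55, 0.78, 0.48, 0.82, 0.60, 0.70])

for name, acc in [("stable", stable), ("volatile", volatile)]:
    ace = np.abs(np.diff(acc)).sum()
    # Both curves average 0.64, but ACE is 0.08 vs. 3.14: identical
    # Acc_avg conceals an order-of-magnitude difference in volatility.
    print(f"{name}: Acc_avg={acc.mean():.2f}  ACE={ace:.2f}  RCE={ace / acc.mean():.2f}")
```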

A plausible implication is that optimal architecture selection and deployment must be tailored not just to average semantic needs, but to the risk profile of resolution-dependent instabilities for each target application.

4. Preprocessing Strategies: Padding and Super-Resolution

Two principal preprocessing strategies are evaluated for their effects on resolution robustness:

  • Padding: Non-informative (typically zero-valued) pixels are appended to low-resolution images to match the sequence length of high-resolution inputs; a minimal sketch follows this list. Moderate padding yields modest performance stabilization, but excessive padding dilutes information, raises the proportion of irrelevant tokens, and tends to reduce accuracy.
  • Super-resolution (SR): A diffusion-based SR model (DiffIR) is applied to reconstruct missing detail in downsampled images. Models fed SR-enhanced inputs demonstrably outperform padding-only approaches in both accuracy and stability, indicating that restoring input quality at the visual representation level is an effective way to bolster robustness. Advanced SR preprocessing pipelines may therefore become important for MLLMs deployed in environments where image quality cannot be controlled.
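A minimal sketch of the padding strategy, assuming it amounts to placing the low-resolution pixels on a zero-valued canvas of the target size (function and parameter names are illustrative):

```python
from PIL import Image

def zero_pad_to(img, target_w, target_h):
    """Center a low-resolution image on a zero (black) canvas so the
    encoder sees the same spatial extent, and hence roughly the same
    token count, as a high-resolution input."""
    canvas = Image.new("RGB", (target_w, target_h), (0, 0, 0))
    # The surrounding zeros carry no information, which is why heavy
    # padding dilutes the proportion of useful visual tokens.
    canvas.paste(img, ((target_w - img.width) // 2,
                       (target_h - img.height) // 2))
    return canvas
```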

5. Fine-Tuning for Resolution Robustness

Direct improvement of robustness is possible via model fine-tuning. The benchmark studies the effects of additional mixed-resolution training on Qwen2.5-VL-3B, emphasizing spatial reasoning.

Results show that fine-tuning on a balanced mixture of resolutions significantly reduces performance fluctuation (lower ACE/RCE) and increases overall stability, albeit with a minor trade-off: sometimes a loss in peak performance on out-of-domain tasks. This suggests that robustness is a learnable, controllable trait, with practical training pipeline implications for domain optimization where volatile input quality is anticipated.
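The paper's exact training recipe is not reproduced here; the sketch below illustrates the general idea, assuming mixed-resolution training amounts to randomly downsampling each training image to one of the 12 levels (the scale ladder and names are assumptions):

```python
import random
from PIL import Image

# Illustrative 12-level scale ladder; the benchmark's actual levels may differ.
RESOLUTION_SCALES = [1.0, 0.85, 0.7, 0.6, 0.5, 0.4,
                     0.3, 0.25, 0.2, 0.15, 0.1, 0.05]

def mixed_resolution_augment(img: Image.Image) -> Image.Image:
    """Randomly downsample a training image to one of the 12 levels so
    fine-tuning sees a balanced mixture of resolutions rather than
    only full-resolution inputs."""
    scale = random.choice(RESOLUTION_SCALES)
    if scale >= 1.0:
        return img
    size = (max(1, int(img.width * scale)), max(1, int(img.height * scale)))
    return img.resize(size, Image.BICUBIC)
```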

6. Implications and Applications

The Res-Bench framework introduces operational metrics and methodologies for resolution robustness in MLLMs, establishing a standard by which models can be assessed not only for competence, but for reliability across real-world input variability. This benchmarking is critical for applications in fields such as healthcare, autonomous systems, scientific imaging, and OCR, where images may be captured under varying conditions, and consistent performance is more valuable than isolated peaks.

The diagnostic insight provided by the ACE, RCE, and ρ metrics, augmented by architectural and training-strategy evaluation, enables informed decisions regarding model selection, pipeline design, and further research into continual learning approaches for robust perception.

7. Future Directions

Further work may extend Res-Bench by increasing the diversity of modalities, augmenting the number of resolution levels, or integrating more complex, variable input artifacts (e.g., compression, occlusion). Systematic exploration of advanced SR methods and hybrid processing architectures is suggested. Integrating robustness metrics into model selection and automated deployment workflows will enhance operational reliability in practical MLLM systems.

In summary, Res-Bench constitutes a comprehensive, quantitative standard for benchmarking resolution robustness in MLLMs, enabling both fine-grained technical analysis and practical engineering improvements for multimodal perception under dynamic input conditions (Li et al., 19 Oct 2025).
