- The paper presents RC-Bench, the first resolution-centric benchmark that evaluates VLM performance on images with varied resolutions and extreme aspect ratios.
- It proposes NativeRes-LLaVA, an open-source framework that processes native-resolution images using advanced techniques like 2D RoPE and multimodal sequence packing.
- Experiments show that native visual encoding improves fine-grained detail recognition and robustness on resolution-sensitive tasks compared to cropping-based methods.
The paper "Native Visual Understanding: Resolving Resolution Dilemmas in Vision-LLMs" (2506.12776) addresses significant challenges faced by Vision-LLMs (VLMs) when dealing with the wide variety of image resolutions and aspect ratios found in real-world visual data. Most existing VLMs are designed to process images at a fixed, often low, resolution, leading to a "Resolution Dilemma" where models struggle with tasks requiring fine-grained visual understanding. This dilemma is compounded by existing benchmarks that fail to adequately evaluate VLM performance under diverse visual conditions.
To tackle this, the authors introduce two main contributions: RC-Bench and NativeRes-LLaVA.
RC-Bench: A Resolution-Centric Benchmark
RC-Bench is presented as the first benchmark specifically designed to systematically evaluate VLM capabilities under extreme visual conditions, emphasizing resolution and aspect ratio variations. The paper analyzes existing multimodal benchmarks and finds they often lack images with extreme aspect ratios (very wide or very tall) and diverse resolution distributions. RC-Bench is constructed by carefully balancing the distribution of images across area (resolution) and aspect-ratio categories, integrating existing public datasets with proprietary data. The dataset creation process involves selecting high-quality images, manually verifying them, and using padding and resizing to augment the distribution of extreme cases. The result is 1,750 images covering seven resolution levels and five aspect-ratio categories.
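To make the balancing concrete, here is a minimal sketch of how images could be assigned to area and aspect-ratio buckets before sampling; the bucket edges below are illustrative, not the paper's actual category thresholds:

```python
from collections import Counter

# Illustrative bucket edges; the paper's exact boundaries may differ.
AREA_EDGES = [0.25e6, 0.5e6, 1e6, 2e6, 4e6, 8e6]   # pixel areas -> 7 resolution levels
RATIO_EDGES = [1 / 3, 2 / 3, 1.5, 3.0]              # width/height -> 5 aspect-ratio categories

def bucket(value, edges):
    """Index of the first edge the value falls below (last bucket otherwise)."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def categorize(width, height):
    """Map an image size to its (area level, aspect-ratio category) cell."""
    return bucket(width * height, AREA_EDGES), bucket(width / height, RATIO_EDGES)

# Count how a candidate pool fills the 7 x 5 grid, then pad/resize or collect
# more images for the under-represented cells.
sizes = [(640, 480), (4032, 3024), (3000, 400), (400, 3000)]
grid = Counter(categorize(w, h) for w, h in sizes)
print(grid)
```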
Question-answer pairs for RC-Bench are generated using a hybrid approach: automated generation with GPT-4o followed by manual screening by human annotators. Questions are designed to require strong visual grounding and often focus on recognizing fine-grained details like text, numbers, or chart elements, classifying them as "Resolution-Centric" tasks, distinct from "Semantic-Centric" tasks (like object recognition) that are less dependent on high resolution. The evaluation methodology uses Exact Match (EM) for concise answers and Average Normalized Levenshtein Similarity (ANLS) for longer ones, incorporating normalization strategies for units and phrasing to enhance robustness. RC-Bench evaluates models by providing scores across different resolution and aspect ratio dimensions, offering a detailed view of performance stability.
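As an illustration of the ANLS scoring, here is a minimal sketch for a single question with multiple reference answers, using the commonly used 0.5 similarity threshold; the paper's unit and phrasing normalization is not reproduced here:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def anls(prediction: str, references: list[str], tau: float = 0.5) -> float:
    """Best normalized Levenshtein similarity over references, zeroed below tau."""
    best = 0.0
    for ref in references:
        p, r = prediction.lower().strip(), ref.lower().strip()
        nl = levenshtein(p, r) / max(len(p), len(r), 1)
        best = max(best, 1.0 - nl)
    return best if best >= tau else 0.0

print(anls("42 apples", ["42 apples", "forty-two apples"]))  # 1.0
```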
NativeRes-LLaVA: A Framework for Native Visual Encoding
NativeRes-LLaVA is proposed as an open-source training framework to enable VLMs to effectively process images at their native resolutions and aspect ratios, thereby preserving maximum image detail. The framework architecture consists of four main modules:
- A native-resolution Vision Encoder, specifically designed to handle varying resolutions and aspect ratios using 2D Rotary Position Embedding (RoPE) for positional encoding (a minimal sketch of 2D RoPE follows this list).
- A compression module to reduce the number of visual tokens from the encoder, improving efficiency.
- A two-layer Multilayer Perceptron (MLP) to project visual features into the language embedding space.
- An advanced LLM backbone.
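To illustrate the positional-encoding idea, here is a minimal sketch of 2D RoPE for image patches: half of each head's channels are rotated according to the patch's row index and the other half according to its column index. The dimension split and frequency base are assumptions for illustration, not the paper's exact recipe:

```python
import torch

def rope_freqs(dim: int, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Angles of shape (len(positions), dim // 2) as in standard 1D RoPE."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive channel pairs of x (..., dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

def apply_2d_rope(q: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
    """q: (num_patches, head_dim). First half encodes row position, second half column."""
    half = q.shape[-1] // 2
    q_row = apply_rope(q[..., :half], rope_freqs(half, rows))
    q_col = apply_rope(q[..., half:], rope_freqs(half, cols))
    return torch.cat((q_row, q_col), dim=-1)

# A 3x5 patch grid flattened in row-major order; any aspect ratio works the same way.
h, w, head_dim = 3, 5, 64
rows = torch.arange(h).repeat_interleave(w)   # 0,0,0,0,0,1,1,...
cols = torch.arange(w).repeat(h)              # 0,1,2,3,4,0,1,...
q = torch.randn(h * w, head_dim)
print(apply_2d_rope(q, rows, cols).shape)     # torch.Size([15, 64])
```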
A key technical contribution for efficient processing of variable-length visual sequences is the use of multimodal sequence packing, inspired by NaViT's Patch n' Pack [dehghani2023patchnpacknavit]. Instead of padding variable-length patch sequences to a fixed maximum length, patches from multiple images in a batch are concatenated into a single packed sequence. Variable Length Flash Attention [dao2023flashattention2] is used within the Vision Transformer so that each image's patches attend only to one another inside the packed sequence, eliminating the computational redundancy that padding would introduce.
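Here is a minimal sketch of the packing idea in plain PyTorch, under the following assumptions: per-image patch sequences are concatenated, cumulative sequence lengths (`cu_seqlens`) record the image boundaries, and attention is restricted to each image's own span via a block-diagonal mask. A real implementation would pass `cu_seqlens` directly to flash-attn's varlen kernels instead of materializing the mask:

```python
import torch
import torch.nn.functional as F

def pack_images(patch_seqs: list[torch.Tensor]):
    """Concatenate per-image patch embeddings (L_i, D) into one (sum L_i, D) sequence."""
    lengths = torch.tensor([p.shape[0] for p in patch_seqs])
    cu_seqlens = F.pad(lengths.cumsum(0), (1, 0))      # e.g. [0, L0, L0+L1, ...]
    return torch.cat(patch_seqs, dim=0), cu_seqlens

def varlen_self_attention(packed: torch.Tensor, cu_seqlens: torch.Tensor) -> torch.Tensor:
    """Block-diagonal self-attention: patches attend only within their own image."""
    total = packed.shape[0]
    ids = torch.bucketize(torch.arange(total), cu_seqlens[1:], right=True)
    mask = ids[:, None] == ids[None, :]                 # True inside each image's block
    q = k = v = packed.unsqueeze(0)                     # (1, total, D), single head for brevity
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask).squeeze(0)

# Three images with different native resolutions -> different patch counts.
patches = [torch.randn(n, 64) for n in (196, 48, 330)]
packed, cu_seqlens = pack_images(patches)
out = varlen_self_attention(packed, cu_seqlens)
print(packed.shape, cu_seqlens.tolist(), out.shape)
```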
The framework is open-sourced, including training codebases for pre-training and visual instruction tuning, and supports flexible configuration, addressing the fragmentation in existing open-source native resolution VLM efforts.
Implementation and Experiments
The authors implement NativeRes-LLaVA using Qwen2-7B-Instruct [yang2024qwen2] as the LLM and a native resolution ViT initialized from Qwen2-VL-2B [wang2024qwen2]. Training is performed on 8 NVIDIA A100-80G GPUs, following a two-stage process:
- VLM pre-training: Fine-tuning the Visual Projector on the LLaVA-Pretrain dataset (558K) with frozen encoder and LLM.
- Visual Instruction Tuning: Two configurations are used: one fine-tuning the projector and LLM on LLaVA-mix665k (665K), and another fine-tuning all components, including the Vision Encoder, on LLaVA-NeXT-Data (779K), which includes OCR-specific examples. (A parameter-freezing sketch follows this list.)
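Here is a minimal sketch of how the two stages might toggle trainability of each module in PyTorch; the module names (`vision_encoder`, `projector`, `llm`) are illustrative placeholders, not the repository's actual attribute names:

```python
import torch.nn as nn

class DummyVLM(nn.Module):
    """Stand-in with the three trainable parts discussed in the text."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)
        self.projector = nn.Sequential(nn.Linear(8, 8), nn.GELU(), nn.Linear(8, 8))
        self.llm = nn.Linear(8, 8)

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model: DummyVLM, stage: str, tune_vision: bool = False) -> None:
    if stage == "pretrain":                      # stage 1: projector only
        set_trainable(model.vision_encoder, False)
        set_trainable(model.projector, True)
        set_trainable(model.llm, False)
    else:                                        # stage 2: projector + LLM (+ optionally the ViT)
        set_trainable(model.vision_encoder, tune_vision)
        set_trainable(model.projector, True)
        set_trainable(model.llm, True)

model = DummyVLM()
configure_stage(model, "pretrain")
print(sum(p.numel() for p in model.parameters() if p.requires_grad))  # projector params only
```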
Experiments are conducted on a wide range of general benchmarks, categorized as Resolution-Centric (e.g., TextVQA, OCRBench, DocVQA, ChartQA, HR-Bench) and Semantic-Centric (e.g., MMBench, MME, MathVista, AI2D). The evaluation protocol assesses performance based on training data size, maximum supported resolution (MaxRes), resolution strategy, and capabilities on both types of tasks, including detailed evaluation on RC-Bench using accuracy, Area Coefficient of Variation (ACV), and Ratio Coefficient of Variation (RCV).
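Interpreting ACV and RCV as the standard coefficient of variation (std/mean) of accuracy across the area buckets and aspect-ratio buckets respectively, here is a minimal sketch of that computation; the paper's exact formulation and weighting may differ, and the per-bucket scores below are hypothetical:

```python
import statistics

def coefficient_of_variation(scores: list[float]) -> float:
    """Std/mean of per-bucket accuracies; lower means more stable across buckets."""
    return statistics.pstdev(scores) / statistics.mean(scores)

# Hypothetical per-bucket accuracies on RC-Bench.
area_bucket_scores = [0.62, 0.58, 0.55, 0.49, 0.41, 0.36, 0.30]   # 7 resolution levels
ratio_bucket_scores = [0.52, 0.57, 0.60, 0.55, 0.44]               # 5 aspect-ratio categories

acv = coefficient_of_variation(area_bucket_scores)
rcv = coefficient_of_variation(ratio_bucket_scores)
print(f"ACV={acv:.3f}  RCV={rcv:.3f}")
```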
The results show that NativeRes-LLaVA achieves superior performance on Resolution-Centric tasks compared to baselines like LLaVA-NeXT, even with less specialized data initially. Training with data including RC-type samples dramatically improves performance on resolution-sensitive benchmarks. The model also maintains competitive performance on Semantic-Centric tasks. Ablation studies demonstrate that the performance gains are primarily due to the Native Resolution strategy itself, not just a better Vision Transformer, and that higher MaxRes significantly improves performance on RC tasks. A detailed comparison between Cropping-based and Native Resolution strategies on RC-Bench reveals that while Cropping can perform well in specific cases that align with pre-training (e.g., balanced aspect ratios at near-standard resolutions), Native Resolution shows superior robustness and performance on images with extreme aspect ratios or areas. Experiments across different LLM scales and backbones further validate the effectiveness and compatibility of the Native Resolution strategy.
Limitations
The paper acknowledges limitations, including computational resource constraints that prevented training a state-of-the-art model from scratch and performing large-scale training comparable to commercial models. The model's effective resolution generalization is limited by the distribution of existing training datasets, which may not contain sufficient high-resolution or extreme aspect ratio examples to fully utilize the architecture's capabilities. Furthermore, the quadratic scaling of self-attention with the number of patches remains a computational bottleneck for processing extremely high-resolution images, although techniques like window attention (as used in some newer models) are mentioned as future directions.
Conclusion
In conclusion, the paper highlights the critical "Resolution Dilemma" in VLMs and contributes a novel benchmark (RC-Bench) and an open-source framework (NativeRes-LLaVA) to address it. The empirical results strongly support the superiority of native resolution visual encoding for tasks requiring fine-grained detail, demonstrating improved accuracy and robustness across diverse visual conditions. The work provides a systematic framework for resolution-centric evaluation and a practical, open-source approach for developing VLMs capable of handling native-resolution inputs.