HR-Bench 4K: Benchmarking High-Res MLLMs
- HR-Bench 4K is a benchmark designed to evaluate high-resolution multimodal models by testing fine-grained single-instance and cross-instance perception tasks.
- It employs manually curated 4K image crops from 8K sources to provide precise object-level attribute evaluations and spatial reasoning analyses.
- The DC² framework enhances MLLM performance by dividing, processing, and recombining image patches to mitigate information loss from fixed patch encoders.
HR-Bench 4K is a benchmark dedicated to evaluating the ability of multimodal LLMs (MLLMs) to perceive and reason over 4K-resolution images. Unlike existing benchmarks, which typically limit assessment to ≤2K images, HR-Bench 4K quantifies model performance and granularity of understanding on 4K image content. This enables systematic measurement of model limitations and advances in fine-grained visual and cross-modal reasoning in high-resolution scenarios, where significant visual information is frequently lost to downsampling.
1. Motivation and Design of HR-Bench 4K
HR-Bench 4K was developed to close the gap in evaluating MLLMs on content that matches the increasingly high resolutions of modern imaging pipelines and display hardware. Existing state-of-the-art models claim support for 4K image inputs; however, their practical ability to process and extract intricate details is largely untested, mainly due to the absence of appropriately scaled benchmarks (Wang et al., 28 Aug 2024).
The HR-Bench suite consists of two versions:
- HR-Bench 8K: Utilizes full 8K images sourced from public datasets and manual curation.
- HR-Bench 4K: Derives 4K content by cropping object-centric regions from 8K images, enabling object-level attribute evaluations in a 4K regime.
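The cropping step itself is mechanically simple. The minimal Pillow sketch below illustrates extracting an object-centric UHD (3840×2160) window from an 8K source; the file names, coordinates, and the UHD size convention are assumptions for illustration, and the actual benchmark crops were selected manually.

```python
# Illustrative only: HR-Bench 4K crops were curated manually, so this sketch
# merely shows the mechanical step of extracting a UHD (3840x2160) window
# centred on an object of interest from an 8K (7680x4320) source with Pillow.
# File names, coordinates, and the UHD convention are assumptions.
from PIL import Image

def crop_4k_region(src_path: str, center_x: int, center_y: int, out_path: str,
                   width: int = 3840, height: int = 2160) -> None:
    """Crop a 4K window around (center_x, center_y), clamped to the image bounds."""
    img = Image.open(src_path)
    left = max(0, min(center_x - width // 2, img.width - width))
    top = max(0, min(center_y - height // 2, img.height - height))
    img.crop((left, top, left + width, top + height)).save(out_path)

# Hypothetical usage:
# crop_4k_region("scene_8k.png", center_x=5200, center_y=1900, out_path="scene_4k.png")
```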
Each image is paired with manually annotated query–answer pairs, classified into two major task types:
- Fine-grained Single-instance Perception (FSP): Targeting the extraction of attributes (color, material, embedded text) from individual objects.
- Fine-grained Cross-instance Perception (FCP): Assessing the model’s capacity to reason about spatial relations, maintain multi-object awareness, and analyze complex layouts (e.g., charts, maps).
The rationale for separately benchmarking on 4K-cropped images is to ensure detailed evaluation of object properties unencumbered by excessive visual clutter or context loss, facilitating the discrimination between local object representation and broader scene understanding.
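To make the two task types concrete, the hypothetical entries below illustrate what FSP- and FCP-style multiple-choice query–answer pairs look like; the questions, options, and schema are invented for exposition and are not items from the benchmark.

```python
# Hypothetical illustrations of the two HR-Bench task types; the questions,
# options, and schema below are invented for exposition and are not actual
# benchmark items.
fsp_example = {
    "task": "FSP",  # fine-grained single-instance perception: attributes of one object
    "question": "What colour is the small sign mounted on the kiosk in the lower-left region?",
    "options": ["A. Red", "B. Blue", "C. Green", "D. Yellow"],
    "answer": "B",
}

fcp_example = {
    "task": "FCP",  # fine-grained cross-instance perception: relations between objects
    "question": "Which object is closer to the fire hydrant: the parked bicycle or the trash bin?",
    "options": ["A. The parked bicycle", "B. The trash bin"],
    "answer": "A",
}
```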
2. Impact of Downsampling and Benchmark Findings
The evaluation protocol embedded in HR-Bench 4K exposes a substantial performance gap between machine and human perception at 4K resolution. State-of-the-art (SOTA) MLLMs (e.g., InternVL-2-llama3-76B) achieve up to 82% accuracy on FSP tasks and around 60% on FCP tasks, for an average accuracy of approximately 71%, while the human baseline reaches about 82% average accuracy on the 4K tasks.
This discrepancy is principally attributed to the forced downsampling inherent in current MLLM architectures. Typical visual encoders (e.g., CLIP preprocessors) accept only fixed-size patches (commonly 336×336 px), which leads to irremediable loss of pixel-level information and fine detail present in the original 4K image. The consequence is degraded recognition of subtle attributes (e.g., small inscriptions, precise textures, spatial relationships), even in models that ingest “high-resolution” images by construction.
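A back-of-the-envelope calculation makes the scale of this loss explicit. The snippet below assumes a standard UHD 4K frame (3840×2160) resized to a single 336×336 encoder input; the exact figures vary with the source resolution and any tiling scheme the model applies.

```python
# Back-of-the-envelope illustration of the downsampling bottleneck: resizing a
# 4K frame to a single 336x336 encoder input. The 4K dimensions assumed here
# (3840x2160) follow the common UHD convention.
src_w, src_h = 3840, 2160      # assumed 4K input
dst_w, dst_h = 336, 336        # typical CLIP-style patch encoder input

pixel_ratio = (src_w * src_h) / (dst_w * dst_h)
print(f"Each encoder pixel aggregates ~{pixel_ratio:.0f} source pixels "
      f"({src_w * src_h:,} -> {dst_w * dst_h:,} pixels, "
      f"~{100 * (1 - 1 / pixel_ratio):.1f}% of raw pixel information discarded)")
# -> Each encoder pixel aggregates ~73 source pixels
#    (8,294,400 -> 112,896 pixels, ~98.6% of raw pixel information discarded)
```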
3. The DC² Framework: Methodology
To systematically compensate for downsampling-induced information loss, the authors propose DC² (Divide, Conquer, and Combine)—a training-free, plug-and-play enhancement for MLLMs operating on HR-Bench 4K images (Wang et al., 28 Aug 2024). The methodology is as follows:
- Divide: The input 4K image is recursively partitioned into smaller patches compatible with the MLLM’s encoder. Hierarchical clustering is applied at each recursion step, merging similar patches based on a similarity metric and threshold θ to prevent combinatorial explosion:
  - Split: each patch $p$ is divided into sub-patches $\{p_1, \dots, p_n\}$.
  - Cluster: sub-patches whose pairwise similarity exceeds θ are grouped into clusters.
  - Merge: each cluster is collapsed into a single representative patch before the next recursion level.
- Conquer: Each resulting patch is independently processed: for leaf patches, the MLLM is prompted (e.g., “Please describe this image”); for non-leaf nodes, generated descriptions from child patches are collected and refined by the model using more complex prompts.
  - For leaves: $(t_i, O_i) = \mathcal{M}(p_i)$, where $\mathcal{M}$ denotes the MLLM, $t_i$ is the textual description, and $O_i$ is the set of detected objects for patch $p_i$.
  - For non-leaves: $(t, O) = \mathcal{M}(\{t_c\}, \{O_c\})$, i.e., the model refines the children’s descriptions $\{t_c\}$ and object sets $\{O_c\}$ into a description and object set for the parent.
- Combine: Global context is reconstructed by intersecting the detected object sets at each recursion level, eliminating hallucinated or fragmented instances. The retained object information (with coordinates and confidence scores) is then aggregated in a visual memory. At inference, a retriever matches user queries to the most salient memory entries and concatenates their descriptions with the original prompt for robust final answer generation.
This approach is explicitly training-free and model-agnostic, operating as a structured pre-/post-processing pipeline around any MLLM.
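The control flow can be summarized in a short sketch. The code below is a minimal, training-free illustration of the divide/conquer/combine recursion under stated assumptions: `describe_with_mllm`, `refine_with_mllm`, `patch_similarity`, the quadrant split, and both thresholds are placeholders rather than the authors’ reference implementation.

```python
# A minimal, training-free sketch of the DC^2 recursion described above.
# `describe_with_mllm`, `refine_with_mllm`, `patch_similarity`, and the two
# thresholds stand in for whatever backbone, similarity function, and settings
# are actually used; this illustrates the divide / conquer / combine control
# flow, not the authors' reference code.
from dataclasses import dataclass, field
from itertools import combinations
from PIL import Image

ENCODER_SIZE = 336   # assumed fixed encoder input side (CLIP-style)
THETA = 0.9          # assumed similarity threshold for merging patches

@dataclass
class Node:
    description: str
    objects: set[str]
    children: list["Node"] = field(default_factory=list)

def describe_with_mllm(patch: Image.Image, prompt: str) -> tuple[str, set[str]]:
    """Placeholder: prompt the MLLM on a single patch, return (text, detected objects)."""
    raise NotImplementedError

def refine_with_mllm(child_descriptions: list[str]) -> str:
    """Placeholder: ask the MLLM to merge and refine child-patch descriptions."""
    raise NotImplementedError

def patch_similarity(a: Image.Image, b: Image.Image) -> float:
    """Placeholder: embedding-based similarity score in [0, 1]."""
    raise NotImplementedError

def split_into_quadrants(img: Image.Image) -> list[Image.Image]:
    w, h = img.size
    return [img.crop(box) for box in ((0, 0, w // 2, h // 2), (w // 2, 0, w, h // 2),
                                      (0, h // 2, w // 2, h), (w // 2, h // 2, w, h))]

def merge_similar(patches: list[Image.Image], theta: float = THETA) -> list[Image.Image]:
    """Greedy stand-in for hierarchical clustering: keep one patch per near-duplicate group."""
    kept: list[Image.Image] = []
    for p in patches:
        if all(patch_similarity(p, q) < theta for q in kept):
            kept.append(p)
    return kept

def dc2(img: Image.Image) -> Node:
    # Conquer (leaf): the patch already fits the encoder, so describe it directly.
    if max(img.size) <= ENCODER_SIZE:
        text, objects = describe_with_mllm(img, "Please describe this image.")
        return Node(text, objects)
    # Divide: split into sub-patches, merging near-duplicates to bound the recursion.
    children = [dc2(p) for p in merge_similar(split_into_quadrants(img))]
    # Combine: keep objects corroborated by more than one child (one reading of the
    # intersection step) to suppress hallucinated or fragmented instances, then
    # refine the child descriptions into a parent-level summary.
    shared = {o for a, b in combinations(children, 2) for o in a.objects & b.objects}
    objects = shared if shared else set().union(*(c.objects for c in children))
    return Node(refine_with_mllm([c.description for c in children]), objects, children)
```

In a fuller implementation, the greedy `merge_similar` stand-in would be replaced by the hierarchical clustering described above, and the retained objects would be written into the visual memory consulted at inference time.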
4. Experimental Outcomes and Comparative Analysis
Empirical results on HR-Bench 4K demonstrate:
- SOTA MLLMs, without DC², exhibit marked deficits relative to human accuracy in both FSP and FCP tasks (best model average: ~71%; human: ~82%).
- Application of DC² produces consistent improvements; for example, InternVL-1.5-26B demonstrates a +5.7% absolute gain in FSP and a +3.4% average improvement on HR-Bench 8K, and comparable improvements are observed on HR-Bench 4K across a variety of open-source and commercial models.
- The gains extend to generic multimodal benchmarks, not only 4K-specific scenarios, and reductions in object hallucination are observed, attributable to the intersective filtering in the Combine step.
These findings indicate that data partitioning, patch-level semantic enrichment, and memory-based recombination can effectively compensate for fixed-encoder bottlenecks in high-resolution vision-language tasks.
| Model | FSP (%) | FCP (%) | Average (%) | Human baseline (%) |
|---|---|---|---|---|
| InternVL-2-llama3-76B | ~82 | ~60 | ~71 | ~82 (4K) |
| LLaVA variants | up to +3 pts w/ DC² | – | +1–3 pts w/ DC² | – |

(The LLaVA row reports absolute accuracy gains from applying DC², not raw accuracy.)
5. Technical Considerations and Implementation
The pipeline remains training-free and does not require alteration of internal model weights—constituting an augmentation and not a retraining regime. Key technical details include:
- Patch merging via hierarchical clustering limits both GPU memory consumption and the number of MLLM calls. The splitting/merging threshold θ must be set according to application demands and model sensitivity.
- Visual memory encodes each object as a tuple of coordinates, textual descriptors, and a confidence value. Retrieval at inference uses a threshold (α) to manage the trade-off between recall and precision (a minimal sketch of this step follows this list).
- Object fragmentation and hallucination are major risks with aggressive partitioning; intersective aggregation and object persistence tracking through recursion help mitigate these artifacts, but require careful threshold calibration.
- The entire process is compatible with batched MLLM inference, and runtime is expected to scale roughly linearly with the number of patches multiplied by per-patch inference cost.
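As a concrete, deliberately simplified picture of the retrieval step, the sketch below stores object entries with coordinates, confidence, and a description, then filters them against a query with an α threshold; the entry schema and the lexical scoring function are assumptions, and a real retriever would use text embeddings.

```python
# Illustrative sketch of the visual-memory retrieval step. The entry schema,
# the scoring function, and the alpha-threshold semantics are assumptions for
# exposition, not the paper's exact formulation.
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    name: str                        # detected object label
    bbox: tuple[int, int, int, int]  # (x0, y0, x1, y1) in source-image coordinates
    confidence: float                # detector/MLLM confidence in [0, 1]
    description: str                 # patch-level textual description

def relevance(query: str, entry: MemoryEntry) -> float:
    """Toy lexical relevance score; a real retriever would use text embeddings."""
    q = set(query.lower().split())
    d = set((entry.name + " " + entry.description).lower().split())
    return entry.confidence * len(q & d) / max(len(q), 1)

def retrieve(query: str, memory: list[MemoryEntry], alpha: float = 0.3) -> list[MemoryEntry]:
    """Keep entries scoring above alpha; raising alpha trades recall for precision."""
    scored = [(relevance(query, e), e) for e in memory]
    return [e for s, e in sorted(scored, key=lambda t: t[0], reverse=True) if s >= alpha]

def build_prompt(query: str, memory: list[MemoryEntry], alpha: float = 0.3) -> str:
    """Concatenate retrieved descriptions with the original question."""
    context = " ".join(e.description for e in retrieve(query, memory, alpha))
    return f"{context}\n\nQuestion: {query}"
```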
A plausible implication is that, beyond HR-Bench, the DC² framework can be modularly integrated into workflows for OCR, spatial reasoning, and content verification in broader multimodal systems operating on high-resolution visual data.
6. Implications and Future Directions
HR-Bench 4K highlights the need for and utility of evaluation corpora that match realistic, high-detail visual demands encountered in production and research. By offering a rigorous 4K benchmark and establishing a practical, model-agnostic enhancement strategy, the approach enables:
- Development of models with explicit local-global fusion strategies for vision encoding.
- A paradigm where textual and patch-level context are used synergistically to offset downsampling losses instead of pursuing cost-prohibitive instruction-tuning on original HR data.
- Expansion to advanced token compression (token merging) to enable end-to-end transformer-based processing of larger images while retaining global structure—a direction suggested in the paper.
HR-Bench 4K thus sets a new empirical standard for benchmarking multimodal models in the 4K regime and provides a roadmap for further research into model architectures, post-processing strategies, and cross-modal reasoning in high-detail tasks (Wang et al., 28 Aug 2024).