Multi-resolution Retrieval-Detection (MRD)
- Multi-resolution Retrieval-Detection (MRD) is a framework that fuses multi-scale semantic retrieval with open-vocabulary detection to overcome fixed low-resolution input limitations in MLLMs.
- It employs multi-resolution semantic fusion to combine cues from various crop sizes, preserving fine details and spatial coherence in high-resolution images.
- Integrating sliding-window open-vocabulary detection improves global object localization, boosting performance on attribute recognition, spatial reasoning, and fine-grained analysis.
Multi-resolution Retrieval-Detection (MRD) is a training-free framework designed to enhance high-resolution (HR) image understanding in Multimodal LLMs (MLLMs). MRD addresses the difficulty imposed by fixed, low input resolutions common to MLLMs by integrating multi-resolution semantic retrieval and open-vocabulary object detection. These complementary components enable robust performance on tasks such as attribute recognition, spatial reasoning, and fine-grained object localization in high-resolution images (Yang et al., 2 Dec 2025).
1. Limitations of Prior High-Resolution Image Understanding Methods
High-resolution images typically cannot be fed directly into standard MLLMs, whose input resolution is limited (e.g., 224×224 pixels). Direct downsampling leads to information loss, blurring, and reduction in the fidelity of small or spatially distributed objects. Early approaches such as "Zoom Eye" and "DC²" apply hierarchical or tree-based crop selection strategies but risk fragmenting objects or missing small details. Retrieval-Augmented Perception (RAP) extends the Retrieval-Augmented Generation (RAG) approach from language to vision by extracting uniformly sized crops and leveraging a pretrained vision RAG (VisRAG) model to compute per-crop semantic similarities against a query, subsequently selecting a set of crops for input into the MLLM.
However, RAP and similar single-resolution methods suffer several shortcomings:
- Large objects may be split across multiple patches, diluting their semantic similarity and retrieval consistency.
- Small objects are prone to disappearance at low resolutions.
- Crop resolution selection becomes a critical and sensitive hyperparameter.
These issues result in performance degradation on HR image understanding, particularly for precise object localization and reasoning tasks.
2. The MRD Framework: Architectural Components
MRD enhances RAP by jointly leveraging two strategies: Multi-resolution Semantic Fusion and Open-vocabulary Detection. The architectural pipeline consists of three core stages:
- Multi-resolution Semantic Fusion: Conduct per-patch retrieval at multiple crop resolutions and aggregate their semantic cues.
- Open-vocabulary Detection (OVD): Directly localize target objects globally using a sliding-window, open-vocabulary detector guided by context-extracted object classes.
- Semantic–Detection Fusion: Linearly combine the semantic and detector confidence maps into a unified guidance map for the final Retrieval-Exploration (RE) search (a minimal end-to-end sketch follows this list).
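As an orientation only, the sketch below shows how the three stages could compose into a single guidance map; the function names, the NumPy placeholder maps, and the default fusion weight are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def multires_semantic_fusion(image, query, resolutions):
    """Stage 1 (placeholder): per-patch retrieval at each resolution,
    fused onto the finest patch grid (see Section 3)."""
    return np.random.rand(32, 32)  # dummy semantic map M_sem

def open_vocab_detection(image, query):
    """Stage 2 (placeholder): sliding-window open-vocabulary detection
    aggregated into a global confidence map (see Section 4)."""
    return np.random.rand(32, 32)  # dummy detection map M_det

def mrd_guidance_map(image, query, resolutions=(1024, 2048), lam=0.5):
    """Stage 3: linearly fuse the semantic and detection maps."""
    m_sem = multires_semantic_fusion(image, query, resolutions)
    m_det = open_vocab_detection(image, query)
    return lam * m_sem + (1.0 - lam) * m_det  # guidance map for the RE search

guidance = mrd_guidance_map(image=None, query="the red umbrella")
print(guidance.shape)  # (32, 32)
```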
3. Multi-resolution Semantic Fusion
Given an HR image $I$, MRD selects multiple crop resolutions (e.g., a coarser resolution $r_1$ and a finer resolution $r_2$), partitions $I$ at each resolution $r$ into non-overlapping patches $\{P_i^{(r)}\}$, and proceeds as follows:
- Semantic Similarity Computation: For a text query $q$, text and image patch embeddings are computed with encoders $E_T$ and $E_V$. The cosine similarity score (normalized to $[0,1]$) for patch $P_i^{(r)}$ is:

$$s_i^{(r)} = \tfrac{1}{2}\left(1 + \cos\big(E_T(q),\, E_V(P_i^{(r)})\big)\right)$$

- High-Resolution Map Projection: The per-resolution similarity maps are projected onto the finest patch grid, with patches matched across resolutions via a mapping operator $\phi$ (here $\phi_k(i)$ denotes the patch at resolution $r_k$ that covers finest-grid location $i$).
- Consistency Fusion: At each spatial index $i$ of the finest grid, the fused similarity is obtained via the geometric mean:

$$s_i = \sqrt{s^{(r_1)}_{\phi_1(i)} \cdot s^{(r_2)}_{i}}$$

or, for $K$ resolutions,

$$s_i = \Big(\prod_{k=1}^{K} s^{(r_k)}_{\phi_k(i)}\Big)^{1/K}$$

- 2D Semantic Map Generation: The fused scores $\{s_i\}$ are reshaped into a 2D semantic map $M_{\mathrm{sem}}$.
The multi-resolution approach corrects low-resolution bias and preserves object integrity by amplifying high-resolution cues in regions that are fragmented or under-scored at a single resolution.
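A minimal NumPy sketch of this fusion step is given below; the embedding dimension, the 8×8/16×16 patch grids, the $[0,1]$ rescaling, and the nearest-neighbour projection of coarser maps onto the finest grid are illustrative assumptions (the paper's mapping operator and encoders are not reproduced here).

```python
import numpy as np

def patch_similarities(patch_embeds, query_embed):
    """Cosine similarity of each patch embedding to the query, rescaled to [0, 1]."""
    p = patch_embeds / np.linalg.norm(patch_embeds, axis=-1, keepdims=True)
    q = query_embed / np.linalg.norm(query_embed)
    return (p @ q + 1.0) / 2.0

def fuse_multiresolution(sim_maps, fine_shape):
    """Geometric-mean fusion of per-resolution similarity maps on the finest grid.
    Coarser maps are upsampled by nearest-neighbour repetition (an assumption)."""
    fused = np.ones(fine_shape)
    for sim in sim_maps:
        ry, rx = fine_shape[0] // sim.shape[0], fine_shape[1] // sim.shape[1]
        fused *= np.repeat(np.repeat(sim, ry, axis=0), rx, axis=1)
    return fused ** (1.0 / len(sim_maps))  # geometric mean over K resolutions

# Toy example: two resolutions giving 8x8 (coarse) and 16x16 (fine) patch grids.
rng = np.random.default_rng(0)
query_embed = rng.normal(size=512)
coarse = patch_similarities(rng.normal(size=(8 * 8, 512)), query_embed).reshape(8, 8)
fine = patch_similarities(rng.normal(size=(16 * 16, 512)), query_embed).reshape(16, 16)
m_sem = fuse_multiresolution([coarse, fine], fine_shape=(16, 16))
print(m_sem.shape)  # (16, 16) fused semantic map
```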
4. Open-vocabulary Detection for Global Object Localization
The open-vocabulary detection (OVD) stage operates at the global level and employs sliding-window detection as follows:
- Object Class Extraction: An LLM parses the query $q$ to extract the set of target object classes $\mathcal{C}$ via in-context learning.
- Sliding-window Detection: The image is split into a patch grid (e.g., at resolution $r_1$), and detection windows $W_j$ (size $w \times w$ patches, stride $t$) are defined over this grid. An open-vocabulary detector (LLMDet) predicts objects of the classes in $\mathcal{C}$ within each window, yielding sets of bounding boxes with confidence scores. Bounding boxes are filtered by a confidence threshold $\tau$.
- Per-window Confidence Mapping: For each window $W_j$, a local confidence map $D_j$ propagates to every patch the maximum confidence among the predicted boxes covering it.
- Global Detection Map Construction: Each patch $i$ aggregates the detection signals from all windows containing it:

$$M_{\mathrm{det}}(i) = \max_{j \,:\, i \in W_j} D_j(i)$$
The OVD module enables explicit, class-level detection and supports localization at global image scale, even for open-ended queries.
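The sketch below illustrates the window-level and global aggregation just described; the patch-coordinate box format, the 0.3 threshold, and the max-based aggregation across overlapping windows are assumptions for illustration, and the actual LLMDet interface is not reproduced.

```python
import numpy as np

def window_confidence_map(boxes, scores, window_grid, threshold=0.3):
    """Local confidence map for one window: each patch takes the maximum score
    among detected boxes (given in patch coordinates) that cover it."""
    conf = np.zeros(window_grid)
    for (x0, y0, x1, y1), s in zip(boxes, scores):
        if s >= threshold:
            conf[y0:y1, x0:x1] = np.maximum(conf[y0:y1, x0:x1], s)
    return conf

def global_detection_map(windows, grid_shape):
    """Aggregate per-window maps into a global map: each patch keeps the
    maximum confidence over all windows that contain it."""
    m_det = np.zeros(grid_shape)
    for (oy, ox), local in windows:  # (window offset in patches, local map)
        h, w = local.shape
        m_det[oy:oy + h, ox:ox + w] = np.maximum(m_det[oy:oy + h, ox:ox + w], local)
    return m_det

# Toy example: two overlapping 8x8-patch windows on a 16x16 patch grid.
w0 = window_confidence_map([(1, 1, 4, 4)], [0.9], window_grid=(8, 8))
w1 = window_confidence_map([(2, 2, 6, 6)], [0.7], window_grid=(8, 8))
m_det = global_detection_map([((0, 0), w0), ((4, 4), w1)], grid_shape=(16, 16))
print(m_det.max())  # 0.9
```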
5. Semantic–Detection Fusion and Guidance Map
MRD linearly combines the semantic and detection maps to yield a final guidance map:

$$M = \lambda\, M_{\mathrm{sem}} + (1 - \lambda)\, M_{\mathrm{det}}$$

The weighting parameter $\lambda$ determines the balance between semantic retrieval and detection cues. This fusion mechanism ensures that both the intra-object detail derived from multi-resolution retrieval and the precise localization of open-vocabulary detection are exploited for final object and attribute grounding.
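As a small worked example (the $\lambda$ value, grid size, and top-k selection are assumptions about how the fused guidance map could seed the RE search, not the paper's exact procedure):

```python
import numpy as np

def top_k_patches(guidance, k=8):
    """Indices (row, col) of the k highest-scoring patches in the fused guidance
    map M = lam * M_sem + (1 - lam) * M_det, e.g. as candidate crops for the
    Retrieval-Exploration (RE) search."""
    flat = np.argsort(guidance, axis=None)[::-1][:k]
    return [tuple(np.unravel_index(i, guidance.shape)) for i in flat]

# Toy example on a 16x16 patch grid.
rng = np.random.default_rng(1)
m_sem, m_det = rng.random((16, 16)), rng.random((16, 16))
guidance = 0.6 * m_sem + 0.4 * m_det  # lambda = 0.6 (illustrative)
print(top_k_patches(guidance, k=4))
```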
6. Experimental Protocol and Empirical Results
Datasets
Evaluation is conducted on V* Bench (2246×1582 images; attribute recognition and spatial reasoning), HRBench-4K, and HRBench-8K (Fine-grained Single-instance and Cross-instance Perception) (Yang et al., 2 Dec 2025).
Models and Baselines
- Open-source MLLMs: LLaVA-ov-0.5B, LLaVA-v1.5-7B, LLaVA-v1.6, InternVL, Yi-VL.
- Closed-source MLLMs: GPT-4o, Qwen-VL-Max.
- Baselines: Zoom Eye, DC², RAP.
Metrics
Task accuracy (%) is reported for Attribute, Spatial, FSP (Fine-grained Single-instance Perception), FCP (Fine-grained Cross-instance Perception), and Overall measures.
Quantitative Results
MRD demonstrates consistent improvements:
| Method (V* Bench) | Overall (%) |
|---|---|
| RAP only | 83.6 |
| OVD only | 84.9 (+1.3) |
| RAP + Multi-res | 85.8 (+2.2) |
| RAP + OVD | 86.2 (+2.6) |
| RAP + OVD + Multi-res (MRD) | 89.3 (+5.7) |
On LLaVA-v1.5-7B, MRD yields 95.6% overall on V* Bench (RAP: 91.1%), with improvements of +2.6% on HRBench-4K and +1.1% on HRBench-8K. Gains manifest across both fine-grained single-instance and cross-instance tasks.
Qualitative Findings
- Standard single-resolution semantic maps frequently under-cover large objects or flag irrelevant background regions.
- Multi-resolution fusion reconstructs fragmented or missing object segments.
- Combined detection maps suppress off-target areas and better maintain instance coverage in cross-instance settings.
7. Limitations, Prospective Enhancements, and Implications
Limitations
- Sliding-window open-vocabulary detection is computationally intensive, especially for very high-resolution images (e.g., 8K), though window size and stride provide trade-offs.
- The fusion weight $\lambda$ is fixed across images; adaptive, context-sensitive weighting remains unexplored.
- OVD may be less effective for heavily occluded or extremely small objects.
Potential Extensions
- Learning dynamic fusion weights or gating mechanisms to combine maps per-query or per-image.
- Adaptive selection of crop resolutions based on image content or query complexity.
- Joint end-to-end fine-tuning of detection and retrieval encoders to minimize reliance on pretrained frozen modules.
- Integration of detection-derived region proposals into the Retrieval-Exploration process for efficiency.
Implications
MRD illustrates that combining multi-scale semantic retrieval with open-vocabulary detection is an effective paradigm for high-resolution visual understanding in MLLMs. Incorporation of these components enables robust handling of gigapixel or panoramic content without the prohibitive cost of full-image encoding. The framework indicates the value of adaptive, multi-resolution pipelines for next-generation MLLMs tasked with complex, long-context image analysis (Yang et al., 2 Dec 2025).