Multi-resolution Retrieval-Detection (MRD)
- Multi-resolution Retrieval-Detection (MRD) is a framework that fuses multi-scale semantic retrieval with open-vocabulary detection to overcome fixed low-resolution input limitations in MLLMs.
- It employs multi-resolution semantic fusion to combine cues from various crop sizes, preserving fine details and spatial coherence in high-resolution images.
- Integrating sliding-window open-vocabulary detection improves global object localization, boosting performance on attribute recognition, spatial reasoning, and fine-grained analysis.
Multi-resolution Retrieval-Detection (MRD) is a training-free framework designed to enhance high-resolution (HR) image understanding in Multimodal LLMs (MLLMs). MRD addresses the difficulty imposed by fixed, low input resolutions common to MLLMs by integrating multi-resolution semantic retrieval and open-vocabulary object detection. These complementary components enable robust performance on tasks such as attribute recognition, spatial reasoning, and fine-grained object localization in high-resolution images (Yang et al., 2 Dec 2025).
1. Limitations of Prior High-Resolution Image Understanding Methods
High-resolution images typically cannot be fed directly into standard MLLMs, whose input resolution is limited (e.g., 224×224 pixels). Direct downsampling leads to information loss, blurring, and reduction in the fidelity of small or spatially distributed objects. Early approaches such as "Zoom Eye" and "DC²" apply hierarchical or tree-based crop selection strategies but risk fragmenting objects or missing small details. Retrieval-Augmented Perception (RAP) extends the Retrieval-Augmented Generation (RAG) approach from language to vision by extracting uniformly sized crops and leveraging a pretrained vision RAG (VisRAG) model to compute per-crop semantic similarities against a query, subsequently selecting a set of crops for input into the MLLM.
However, RAP and similar single-resolution methods suffer several shortcomings:
- Large objects may be split across multiple patches, diluting their semantic similarity and retrieval consistency.
- Small objects are prone to disappearance at low resolutions.
- Crop resolution selection becomes a critical and sensitive hyperparameter.
These issues result in performance degradation on HR image understanding, particularly for precise object localization and reasoning tasks.
2. The MRD Framework: Architectural Components
MRD enhances RAP by jointly leveraging two strategies: Multi-resolution Semantic Fusion and Open-vocabulary Detection. The architectural pipeline consists of three core stages:
- Multi-resolution Semantic Fusion: Conduct per-patch retrieval at multiple crop resolutions and aggregate their semantic cues.
- Open-vocabulary Detection (OVD): Directly localize target objects globally using a sliding-window, open-vocabulary detector guided by context-extracted object classes.
- Semantic–Detection Fusion: Linearly combine the semantic and detector confidence maps into a unified guidance map for the final Retrieval-Exploration (RE) search (a minimal end-to-end sketch follows this list).
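As an orientation only, the sketch below shows how the three stages could compose into a single guidance map; the function names, the NumPy placeholder maps, and the default fusion weight are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def multires_semantic_fusion(image, query, resolutions):
    """Stage 1 (placeholder): per-patch retrieval at each resolution,
    fused onto the finest patch grid (see Section 3)."""
    return np.random.rand(32, 32)  # dummy semantic map M_sem

def open_vocab_detection(image, query):
    """Stage 2 (placeholder): sliding-window open-vocabulary detection
    aggregated into a global confidence map (see Section 4)."""
    return np.random.rand(32, 32)  # dummy detection map M_det

def mrd_guidance_map(image, query, resolutions=(1024, 2048), lam=0.5):
    """Stage 3: linearly fuse the semantic and detection maps."""
    m_sem = multires_semantic_fusion(image, query, resolutions)
    m_det = open_vocab_detection(image, query)
    return lam * m_sem + (1.0 - lam) * m_det  # guidance map for the RE search

guidance = mrd_guidance_map(image=None, query="the red umbrella")
print(guidance.shape)  # (32, 32)
```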
3. Multi-resolution Semantic Fusion
Given an HR image $I$, MRD selects multiple crop resolutions (e.g., a coarser resolution $r_1$ and a finer resolution $r_2$), partitions $I$ at each resolution $r$ into non-overlapping patches $\{P_i^{(r)}\}$, and proceeds as follows:
- Semantic Similarity Computation: For a text query $q$, text and image patch embeddings are computed with encoders $E_T$ and $E_V$. The cosine similarity score (normalized to $[0,1]$) for patch $P_i^{(r)}$ is:

$$s_i^{(r)} = \tfrac{1}{2}\left(1 + \cos\big(E_T(q),\, E_V(P_i^{(r)})\big)\right)$$

- High-Resolution Map Projection: The per-resolution similarity maps are projected onto the finest patch grid, with patches matched across resolutions via a mapping operator $\phi$ (here $\phi_k(i)$ denotes the patch at resolution $r_k$ that covers finest-grid location $i$).
- Consistency Fusion: At each spatial index $i$ of the finest grid, the fused similarity is obtained via the geometric mean:

$$s_i = \sqrt{s^{(r_1)}_{\phi_1(i)} \cdot s^{(r_2)}_{i}}$$

or, for $K$ resolutions,

$$s_i = \Big(\prod_{k=1}^{K} s^{(r_k)}_{\phi_k(i)}\Big)^{1/K}$$

- 2D Semantic Map Generation: The fused scores $\{s_i\}$ are reshaped into a 2D semantic map $M_{\mathrm{sem}}$.
The multi-resolution approach corrects low-resolution bias and preserves object integrity by amplifying high-resolution cues in regions that are fragmented or under-scored at a single resolution.
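A minimal NumPy sketch of this fusion step is given below; the embedding dimension, the 8×8/16×16 patch grids, the $[0,1]$ rescaling, and the nearest-neighbour projection of coarser maps onto the finest grid are illustrative assumptions (the paper's mapping operator and encoders are not reproduced here).

```python
import numpy as np

def patch_similarities(patch_embeds, query_embed):
    """Cosine similarity of each patch embedding to the query, rescaled to [0, 1]."""
    p = patch_embeds / np.linalg.norm(patch_embeds, axis=-1, keepdims=True)
    q = query_embed / np.linalg.norm(query_embed)
    return (p @ q + 1.0) / 2.0

def fuse_multiresolution(sim_maps, fine_shape):
    """Geometric-mean fusion of per-resolution similarity maps on the finest grid.
    Coarser maps are upsampled by nearest-neighbour repetition (an assumption)."""
    fused = np.ones(fine_shape)
    for sim in sim_maps:
        ry, rx = fine_shape[0] // sim.shape[0], fine_shape[1] // sim.shape[1]
        fused *= np.repeat(np.repeat(sim, ry, axis=0), rx, axis=1)
    return fused ** (1.0 / len(sim_maps))  # geometric mean over K resolutions

# Toy example: two resolutions giving 8x8 (coarse) and 16x16 (fine) patch grids.
rng = np.random.default_rng(0)
query_embed = rng.normal(size=512)
coarse = patch_similarities(rng.normal(size=(8 * 8, 512)), query_embed).reshape(8, 8)
fine = patch_similarities(rng.normal(size=(16 * 16, 512)), query_embed).reshape(16, 16)
m_sem = fuse_multiresolution([coarse, fine], fine_shape=(16, 16))
print(m_sem.shape)  # (16, 16) fused semantic map
```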
4. Open-vocabulary Detection for Global Object Localization
The open-vocabulary detection (OVD) stage operates at the global level and employs sliding-window detection as follows:
- Object Class Extraction: An LLM parses the query $q$ to extract the set of target object classes $\mathcal{C}$ via in-context learning.
- Sliding-window Detection: The image is split into a patch grid (e.g., at resolution $r_1$), and detection windows $W_j$ (size $w \times w$ patches, stride $t$) are defined over this grid. An open-vocabulary detector (LLMDet) predicts objects of the classes in $\mathcal{C}$ within each window, yielding sets of bounding boxes with confidence scores. Bounding boxes are filtered by a confidence threshold $\tau$.
- Per-window Confidence Mapping: For each window $W_j$, a local confidence map $D_j$ propagates to every patch the maximum confidence among the predicted boxes covering it.
- Global Detection Map Construction: Each patch $i$ aggregates the detection signals from all windows containing it:

$$M_{\mathrm{det}}(i) = \max_{j \,:\, i \in W_j} D_j(i)$$
The OVD module enables explicit, class-level detection and supports localization at global image scale, even for open-ended queries.
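The sketch below illustrates the window-level and global aggregation just described; the patch-coordinate box format, the 0.3 threshold, and the max-based aggregation across overlapping windows are assumptions for illustration, and the actual LLMDet interface is not reproduced.

```python
import numpy as np

def window_confidence_map(boxes, scores, window_grid, threshold=0.3):
    """Local confidence map for one window: each patch takes the maximum score
    among detected boxes (given in patch coordinates) that cover it."""
    conf = np.zeros(window_grid)
    for (x0, y0, x1, y1), s in zip(boxes, scores):
        if s >= threshold:
            conf[y0:y1, x0:x1] = np.maximum(conf[y0:y1, x0:x1], s)
    return conf

def global_detection_map(windows, grid_shape):
    """Aggregate per-window maps into a global map: each patch keeps the
    maximum confidence over all windows that contain it."""
    m_det = np.zeros(grid_shape)
    for (oy, ox), local in windows:  # (window offset in patches, local map)
        h, w = local.shape
        m_det[oy:oy + h, ox:ox + w] = np.maximum(m_det[oy:oy + h, ox:ox + w], local)
    return m_det

# Toy example: two overlapping 8x8-patch windows on a 16x16 patch grid.
w0 = window_confidence_map([(1, 1, 4, 4)], [0.9], window_grid=(8, 8))
w1 = window_confidence_map([(2, 2, 6, 6)], [0.7], window_grid=(8, 8))
m_det = global_detection_map([((0, 0), w0), ((4, 4), w1)], grid_shape=(16, 16))
print(m_det.max())  # 0.9
```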
5. Semantic–Detection Fusion and Guidance Map
MRD linearly combines the semantic and detection maps to yield a final guidance map:

$$M = \lambda\, M_{\mathrm{sem}} + (1 - \lambda)\, M_{\mathrm{det}}$$

The weighting parameter $\lambda$ determines the balance between semantic retrieval and detection cues. This fusion mechanism ensures that both the intra-object detail derived from multi-resolution retrieval and the precise localization of open-vocabulary detection are exploited for final object and attribute grounding.
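As a small worked example (the $\lambda$ value, grid size, and top-k selection are assumptions about how the fused guidance map could seed the RE search, not the paper's exact procedure):

```python
import numpy as np

def top_k_patches(guidance, k=8):
    """Indices (row, col) of the k highest-scoring patches in the fused guidance
    map M = lam * M_sem + (1 - lam) * M_det, e.g. as candidate crops for the
    Retrieval-Exploration (RE) search."""
    flat = np.argsort(guidance, axis=None)[::-1][:k]
    return [tuple(np.unravel_index(i, guidance.shape)) for i in flat]

# Toy example on a 16x16 patch grid.
rng = np.random.default_rng(1)
m_sem, m_det = rng.random((16, 16)), rng.random((16, 16))
guidance = 0.6 * m_sem + 0.4 * m_det  # lambda = 0.6 (illustrative)
print(top_k_patches(guidance, k=4))
```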
6. Experimental Protocol and Empirical Results
Datasets
Evaluation is conducted on V* Bench (2246×1582 images; attribute recognition and spatial reasoning), HRBench-4K, and HRBench-8K (Fine-grained Single-instance and Cross-instance Perception) (Yang et al., 2 Dec 2025).
Models and Baselines
- Open-source MLLMs: LLaVA-ov-0.5B, LLaVA-v1.5-7B, LLaVA-v1.6, InternVL, Yi-VL.
- Closed-source MLLMs: GPT-4o, Qwen-VL-Max.
- Baselines: Zoom Eye, DC², RAP.
Metrics
Task accuracy (%) is reported for Attribute, Spatial, FSP (Fine-grained Single-instance Perception), FCP (Fine-grained Cross-instance Perception), and Overall measures.
Quantitative Results
MRD demonstrates consistent improvements:
| Method (V* Bench) | Overall (%) |
|---|---|
| RAP only | 83.6 |
| OVD only | 84.9 (+1.3) |
| RAP + Multi-res | 85.8 (+2.2) |
| RAP + OVD | 86.2 (+2.6) |
| RAP + OVD + Multi-res (MRD) | 89.3 (+5.7) |
On LLaVA-v1.5-7B, MRD yields 95.6% overall on V* Bench (RAP: 91.1%), with improvements of +2.6% on HRBench-4K and +1.1% on HRBench-8K. Gains manifest across both fine-grained single-instance and cross-instance tasks.
Qualitative Findings
- Standard single-resolution semantic maps frequently under-cover large objects or flag irrelevant background regions.
- Multi-resolution fusion reconstructs fragmented or missing object segments.
- Combined detection maps suppress off-target areas and better maintain instance coverage in cross-instance settings.
7. Limitations, Prospective Enhancements, and Implications
Limitations
- Sliding-window open-vocabulary detection is computationally intensive, especially for very high-resolution images (e.g., 8K), though window size and stride provide trade-offs.
- The fusion weight $\lambda$ is fixed across images; adaptive, context-sensitive weighting remains unexplored.
- OVD may be less effective for heavily occluded or extremely small objects.
Potential Extensions
- Learning dynamic fusion weights or gating mechanisms to combine maps per-query or per-image.
- Adaptive selection of crop resolutions based on image content or query complexity.
- Joint end-to-end fine-tuning of detection and retrieval encoders to minimize reliance on pretrained frozen modules.
- Integration of detection-derived region proposals into the Retrieval-Exploration process for efficiency.
Implications
MRD illustrates that combining multi-scale semantic retrieval with open-vocabulary detection is an effective paradigm for high-resolution visual understanding in MLLMs. Incorporation of these components enables robust handling of gigapixel or panoramic content without the prohibitive cost of full-image encoding. The framework indicates the value of adaptive, multi-resolution pipelines for next-generation MLLMs tasked with complex, long-context image analysis (Yang et al., 2 Dec 2025).