
Multi-resolution Retrieval-Detection (MRD)

Updated 7 December 2025
  • Multi-resolution Retrieval-Detection (MRD) is a framework that fuses multi-scale semantic retrieval with open-vocabulary detection to overcome fixed low-resolution input limitations in MLLMs.
  • It employs multi-resolution semantic fusion to combine cues from various crop sizes, preserving fine details and spatial coherence in high-resolution images.
  • The integration of sliding-window object detection improves global object localization, boosting tasks like attribute recognition, spatial reasoning, and fine-grained analysis.

Multi-resolution Retrieval-Detection (MRD) is a training-free framework designed to enhance high-resolution (HR) image understanding in Multimodal LLMs (MLLMs). MRD addresses the difficulty imposed by fixed, low input resolutions common to MLLMs by integrating multi-resolution semantic retrieval and open-vocabulary object detection. These complementary components enable robust performance on tasks such as attribute recognition, spatial reasoning, and fine-grained object localization in high-resolution images (Yang et al., 2 Dec 2025).

1. Limitations of Prior High-Resolution Image Understanding Methods

High-resolution images typically cannot be fed directly into standard MLLMs due to their limited input size (e.g., 224×224 pixels). Direct downsampling leads to information loss, blurring, and reduction in the fidelity of small or spatially distributed objects. Early approaches such as "Zoom Eye" and "DC²" apply hierarchical or tree-based crop selection strategies but risk fragmenting objects or missing small details. Retrieval-Augmented Perception (RAP) extends the Retrieval-Augmented Generation (RAG) approach from language to vision by extracting uniformly sized crops and leveraging a pretrained vision RAG (VisRAG) model to compute per-crop semantic similarities against a query, subsequently selecting a set of crops for input into the MLLM.

However, RAP and similar single-resolution methods suffer several shortcomings:

  • Large objects may be split across multiple patches, diluting their semantic similarity and retrieval consistency.
  • Small objects are prone to disappearance at low resolutions.
  • Crop resolution selection becomes a critical and sensitive hyperparameter.

These issues result in performance degradation on HR image understanding, particularly for precise object localization and reasoning tasks.

2. The MRD Framework: Architectural Components

MRD enhances RAP by jointly leveraging two strategies: Multi-resolution Semantic Fusion and Open-vocabulary Detection. The architectural pipeline consists of three core stages:

  1. Multi-resolution Semantic Fusion: Conduct per-patch retrieval at multiple crop resolutions and aggregate their semantic cues.
  2. Open-vocabulary Detection (OVD): Directly localize target objects globally using a sliding-window, open-vocabulary detector guided by context-extracted object classes.
  3. Semantic–Detection Fusion: Linearly combine the semantic and detector confidence maps into a unified guidance map for the final Retrieval-Exploration (RE) search (a minimal composition sketch follows this list).
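
A minimal orchestration sketch of the three stages, assuming hypothetical `semantic_stage` and `detection_stage` callables that return per-patch maps on a common grid; the function names and signatures are illustrative (not the authors' implementation), and the fusion weight `w` anticipates the parameter defined in Section 5.

```python
import numpy as np
from typing import Callable, Sequence

# Hypothetical stage interfaces: the real system wires in VisRAG-style
# retrieval for Stage 1 and an LLMDet-based sliding-window detector for Stage 2.
SemanticStage = Callable[[np.ndarray, str, Sequence[int]], np.ndarray]
DetectionStage = Callable[[np.ndarray, str], np.ndarray]

def mrd_pipeline(image: np.ndarray,
                 query: str,
                 crop_resolutions: Sequence[int],
                 semantic_stage: SemanticStage,
                 detection_stage: DetectionStage,
                 w: float = 0.5) -> np.ndarray:
    """Compose the three MRD stages into a single H x W guidance map."""
    s_fused = semantic_stage(image, query, crop_resolutions)  # Stage 1
    c_global = detection_stage(image, query)                  # Stage 2
    return (1.0 - w) * s_fused + w * c_global                 # Stage 3 (Section 5)
```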

3. Multi-resolution Semantic Fusion

Given an HR image $I$, MRD selects multiple crop resolutions $R_i$ (e.g., $\ell$ and $\hat\ell = k\ell$), partitions $I$ at each $R_i$ into non-overlapping patches $P^{(R_i)}$, and proceeds as follows:

  1. Semantic Similarity Computation: For a text query $q$, text and image patch embeddings are computed with encoders $f(\cdot)$ and $g(\cdot)$. The cosine similarity score (normalized to $[0,1]$) for patch $p_j^{(R)}$ is:

$$s^{(R)}_j = \frac{1}{2}\left(1 + \frac{f(q)\cdot g(p_j^{(R)})}{\|f(q)\|\,\|g(p_j^{(R)})\|}\right)$$

  2. High-Resolution Map Projection: High-resolution similarity maps are projected onto the finest grid, matching patches between resolutions via the mapping operator $H$.
  3. Consistency Fusion: At each spatial index $t$, the fused similarity is obtained via the geometric mean:

$$s^{\mathrm f}_t = \sqrt{\,s^{(\ell)}_t \cdot \tilde s_t\,}$$

where $\tilde s_t$ is the high-resolution score projected onto index $t$ in step 2,

or, for $n$ resolutions,

$$s^{\mathrm f}_t = \left(\prod_{i=1}^{n} s_t^{(R_i)}\right)^{1/n}$$

  4. 2D Semantic Map Generation: The fused scores are reshaped into an $H \times W$ 2D semantic map $S^{\mathrm f}$ (see the sketch following these steps).
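
A minimal numeric sketch of these four steps, assuming patch embeddings have already been extracted by the retrieval encoders; `similarity_map`, `project_to_finest`, and `fused_semantic_map` are illustrative names, and the nearest-neighbour projection is only one plausible realization of the mapping operator $H$.

```python
import numpy as np

def similarity_map(query_emb: np.ndarray, patch_embs: np.ndarray) -> np.ndarray:
    """Per-patch cosine similarity with the query, rescaled to [0, 1].

    query_emb: (D,) text embedding f(q); patch_embs: (H_r, W_r, D) image
    patch embeddings g(p) for one crop resolution.
    """
    q = query_emb / np.linalg.norm(query_emb)
    p = patch_embs / np.linalg.norm(patch_embs, axis=-1, keepdims=True)
    return 0.5 * (1.0 + p @ q)

def project_to_finest(coarse: np.ndarray, finest_shape: tuple[int, int]) -> np.ndarray:
    """Nearest-neighbour stand-in for the mapping operator H: every fine patch
    inherits the score of the coarse patch that covers it."""
    H, W = finest_shape
    rows = (np.arange(H) * coarse.shape[0]) // H
    cols = (np.arange(W) * coarse.shape[1]) // W
    return coarse[np.ix_(rows, cols)]

def fused_semantic_map(query_emb: np.ndarray,
                       patch_embs_per_res: list[np.ndarray]) -> np.ndarray:
    """Geometric-mean fusion of all per-resolution maps on the finest grid."""
    maps = [similarity_map(query_emb, pe) for pe in patch_embs_per_res]
    finest = max(maps, key=lambda m: m.size).shape   # grid of the smallest crops
    stacked = np.stack([project_to_finest(m, finest) for m in maps])
    return stacked.prod(axis=0) ** (1.0 / len(stacked))
```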

The multi-resolution approach corrects low-resolution bias and preserves object integrity by amplifying high-resolution cues in regions that are fragmented or under-scored at a single resolution.

4. Open-vocabulary Detection for Global Object Localization

The open-vocabulary detection (OVD) stage operates at the global level and employs sliding-window detection as follows:

  1. Object Class Extraction: An LLM parses the query $Q$ to extract the set of target classes $O$ via in-context learning.
  2. Sliding-window Detection: The image is split into a grid (e.g., at resolution $\ell$), and detection windows $W_t$ (size $h \times w$, stride $s$) are defined. An open-vocabulary detector (LLMDet) predicts objects in each window, yielding sets of bounding boxes $\mathcal{B}_t$ with confidence scores $s_{tk}$. Bounding boxes are filtered by a confidence threshold $\tau$.
  3. Per-window Confidence Mapping: For each window $t$, a local confidence map $c^{w}_t(p,q)$ propagates the maximum confidence among predicted boxes covering each patch.
  4. Global Detection Map Construction: Each patch $(i,j)$ aggregates detection signals from all windows containing it (see the sketch after the equation):

$$c^{\mathrm g}(i,j) = \frac{1}{|\mathcal T_{i,j}|} \sum_{t \in \mathcal T_{i,j}} c^{w}_t(t_i, t_j)$$

where $\mathcal T_{i,j}$ is the set of windows covering patch $(i,j)$ and $(t_i, t_j)$ are the patch's local coordinates within window $t$.
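
A sketch of steps 2–4 on plain arrays, with the detector's per-window outputs assumed to be already available as boxes and scores in window-local patch coordinates; the helper names, the box format, and the default threshold are illustrative, not LLMDet's actual interface.

```python
import numpy as np

def window_confidence_map(window_shape: tuple[int, int],
                          boxes: list[tuple[int, int, int, int]],
                          scores: list[float],
                          tau: float = 0.3) -> np.ndarray:
    """Local map c^w_t: each patch keeps the max confidence over boxes covering it.

    boxes are (row0, col0, row1, col1) in window-local patch coordinates;
    detections below the threshold tau are discarded.
    """
    cmap = np.zeros(window_shape)
    for (r0, c0, r1, c1), s in zip(boxes, scores):
        if s >= tau:
            cmap[r0:r1, c0:c1] = np.maximum(cmap[r0:r1, c0:c1], s)
    return cmap

def global_detection_map(grid_shape: tuple[int, int],
                         windows: list[tuple[int, int, np.ndarray]]) -> np.ndarray:
    """Global map c^g: average per-window confidences over all windows covering each patch.

    windows: (top, left, local_map) tuples produced by the sliding detector.
    """
    acc = np.zeros(grid_shape)
    count = np.zeros(grid_shape)
    for top, left, local in windows:
        h, w = local.shape
        acc[top:top + h, left:left + w] += local
        count[top:top + h, left:left + w] += 1
    return np.divide(acc, count, out=np.zeros_like(acc), where=count > 0)
```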

The OVD module enables explicit, class-level detection and supports localization at global image scale, even for open-ended queries.

5. Semantic–Detection Fusion and Guidance Map

MRD linearly combines semantic and detection maps to yield a final guidance map:

$$S^{\mathrm F}(i,j) = (1-w)\, S^{\mathrm f}(i,j) + w\, c^{\mathrm g}(i,j)$$

The weighting parameter $w \in [0,1]$ determines the balance between semantic retrieval and detection cues. This fusion mechanism ensures that both the intra-object detail derived from multi-resolution retrieval and the precise localization provided by open-vocabulary detection are exploited for final object and attribute grounding.
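
For concreteness, a toy sketch of this linear fusion; the array values and the weight of 0.7 are arbitrary illustrations of how $w$ trades off the two cues, not values from the paper.

```python
import numpy as np

def fuse_maps(s_f: np.ndarray, c_g: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Final guidance map: S^F = (1 - w) * S^f + w * c^g."""
    return (1.0 - w) * s_f + w * c_g

# A patch weakly supported by retrieval but strongly flagged by detection
# still receives substantial guidance when w leans toward the detector.
s_f = np.array([[0.2, 0.8], [0.5, 0.1]])
c_g = np.array([[0.9, 0.0], [0.6, 0.0]])
print(fuse_maps(s_f, c_g, w=0.7))  # detection-heavy weighting
```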

6. Experimental Protocol and Empirical Results

Datasets

Evaluation is conducted on V* Bench (2246×1582 images; attribute recognition and spatial reasoning), HRBench-4K, and HRBench-8K (Fine-grained Single-instance and Cross-instance Perception) (Yang et al., 2 Dec 2025).

Models and Baselines

  • Open-source MLLMs: LLaVA-ov-0.5B, LLaVA-v1.5-7B, LLaVA-v1.6, InternVL, Yi-VL.
  • Closed-source MLLMs: GPT-4o, Qwen-VL-Max.
  • Baselines: Zoom Eye, DC², RAP.

Metrics

Task accuracy (%) is reported for Attribute recognition, Spatial reasoning, Fine-grained Single-instance Perception (FSP), Fine-grained Cross-instance Perception (FCP), and Overall.

Quantitative Results

MRD demonstrates consistent improvements:

Method (V* Bench)              Overall (%)
RAP only                       83.6
OVD only                       84.9 (+1.3)
RAP + Multi-res                85.8 (+2.2)
RAP + OVD                      86.2 (+2.6)
RAP + OVD + Multi-res (MRD)    89.3 (+5.7)

On LLaVA-v1.5-7B, MRD yields 95.6% overall on V* Bench (RAP: 91.1%), with improvements of +2.6% on HRBench-4K and +1.1% on HRBench-8K. Gains manifest across both fine-grained single- and multi-object tasks.

Qualitative Findings

  • Standard single-resolution semantic maps frequently under-cover large objects or flag irrelevant background regions.
  • Multi-resolution fusion reconstructs fragmented or missing object segments.
  • Combined detection maps suppress off-target areas and better maintain instance coverage in cross-instance settings.

7. Limitations, Prospective Enhancements, and Implications

Limitations

  • Sliding-window open-vocabulary detection is computationally intensive, especially for very high-resolution images (e.g., 8K), though window size and stride provide trade-offs.
  • The fusion weight ww is fixed across images; adaptive, context-sensitive weighting remains unexplored.
  • OVD may be less effective for heavily occluded or extremely small objects.

Potential Extensions

  • Learning dynamic fusion weights or gating mechanisms to combine maps per-query or per-image.
  • Adaptive selection of crop resolutions based on image content or query complexity.
  • Joint end-to-end fine-tuning of detection and retrieval encoders to minimize reliance on pretrained frozen modules.
  • Integration of detection-derived region proposals into the Retrieval-Exploration process for efficiency.

Implications

MRD illustrates that combining multi-scale semantic retrieval with open-vocabulary detection is an effective paradigm for high-resolution visual understanding in MLLMs. Incorporation of these components enables robust handling of gigapixel or panoramic content without the prohibitive cost of full-image encoding. The framework indicates the value of adaptive, multi-resolution pipelines for next-generation MLLMs tasked with complex, long-context image analysis (Yang et al., 2 Dec 2025).
