
DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

Published 11 Feb 2026 in cs.CV and cs.IR | (2602.10809v1)

Abstract: Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-LLMs to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.

Summary

  • The paper introduces DeepImageSearch, a context-aware retrieval paradigm that leverages multimodal reasoning and agentic exploration of visual histories.
  • It details a novel benchmark, DISBench, which uses a hybrid human-model pipeline and memory graph formation to simulate complex, real-world retrieval scenarios.
  • The agentic ImageSeeker framework reveals significant reasoning challenges on inter-event queries, underscoring the need for improved memory architectures and spatiotemporal reasoning.

DeepImageSearch: Context-Aware Image Retrieval via Agentic Multimodal Reasoning

Motivation and Paradigm Shift

DeepImageSearch proposes a novel paradigm for image retrieval: context-dependent, multi-step exploration of visual histories, contrasting with standard instance-wise retrieval, which scores images independently against queries. The motivation follows from the limitations of current retrieval benchmarks and models—while recent multimodal models exhibit robust semantic alignment, they lack the capacity to resolve queries requiring reasoning about distributed evidence and latent associations within complex visual streams. This deficiency is critical for real-world applications, where information richness and ambiguity often exceed the representational scope of a single image or textual embedding.

Benchmark Construction: DISBench

DISBench is introduced as a large-scale benchmark specifically tailored to the context-aware retrieval task. Construction involves a hybrid human-model pipeline:

  • Automated Association Mining: Vision-LLMs (VLMs) parse user photo histories extracted from the YFCC100M dataset, identifying salient visual entities, recurring persons, scene cues, and structural metadata.
  • Memory Graph Formation: Extracted elements are organized into a heterogeneous graph capturing photos, events, objects, and verified cross-image associations.
  • Query Synthesis: Subgraphs are sampled to induce context-rich queries, with multi-step reasoning required to bridge visual ambiguity among targets. Queries are then paraphrased and rigorously validated by expert annotators to enforce ambiguity, association-based identifiability, and non-trivial reasoning paths.
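The memory-graph step above can be illustrated with a minimal sketch. The node/edge layout, attribute names, and the two-hop traversal below are assumptions for illustration, not the paper's actual data model:

```python
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class MemoryGraph:
    """Toy heterogeneous graph over photos, events, and entities."""
    nodes: dict = field(default_factory=dict)  # node_id -> {"type": ..., attrs}
    edges: defaultdict = field(default_factory=lambda: defaultdict(set))

    def add_node(self, node_id, node_type, **attrs):
        self.nodes[node_id] = {"type": node_type, **attrs}

    def add_association(self, a, b):
        # Cross-image associations are undirected here.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def neighbors(self, node_id, node_type=None):
        out = self.edges[node_id]
        if node_type is None:
            return set(out)
        return {n for n in out if self.nodes[n]["type"] == node_type}

# Two photos from different events linked through a shared entity.
g = MemoryGraph()
g.add_node("photo_1", "photo", event="lake_trip")
g.add_node("photo_2", "photo", event="garage_cleanup")
g.add_node("entity_kayak", "entity", label="red kayak")
g.add_association("photo_1", "entity_kayak")
g.add_association("photo_2", "entity_kayak")

# A context-dependent query could anchor on photo_1's event and only be
# resolvable by a two-hop traversal through the shared entity.
linked = {p for e in g.neighbors("photo_1", "entity")
          for p in g.neighbors(e, "photo")} - {"photo_1"}
```

Sampling such multi-hop subgraphs is what makes the synthesized queries identifiable only through associations rather than single-image matching.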

DISBench comprises 122 queries spanning 109,467 images from 57 users, capturing both event-local and cross-event ("intra-event" and "inter-event") relationships, challenging models to act on distributed, temporally-ordered, and user-centric data.

Agentic Framework: ImageSeeker

To target this task, the authors present the ImageSeeker agentic framework:

  • Tool-Based Exploration: The agent leverages image retrieval, metadata filtering, and visual verification tools. These tools are composable and stateful, enabling persistent narrowing of candidate sets.
  • Dual-Memory System: State memory explicitly stores intermediate photo subsets, enabling chaining of reasoning steps, while compressed context memory preserves plan and history within context-length limits.
  • Structured Prompting and Planning: Agents are prompted to decompose queries into events, logical steps, and targets, preventing conflation of anchor evidence and intended retrieval sets.
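A minimal sketch of the dual-memory loop described above, assuming hypothetical tool names and a pre-decomposed plan (the real agent plans with a VLM and uses richer tools):

```python
class DualMemoryAgent:
    """Toy agent: stateful tools narrow a candidate set step by step."""
    def __init__(self, tools, max_steps=10):
        self.tools = tools            # name -> callable(candidates, arg)
        self.state_memory = {}        # named intermediate photo subsets
        self.context_memory = []      # compressed plan/history entries
        self.max_steps = max_steps

    def run(self, plan, initial_candidates):
        """Execute a plan given as (tool_name, arg, save_as) steps."""
        candidates = initial_candidates
        for step, (tool_name, arg, save_as) in enumerate(plan):
            if step >= self.max_steps:
                break
            candidates = self.tools[tool_name](candidates, arg)
            self.state_memory[save_as] = candidates   # persist for chaining
            self.context_memory.append(f"{tool_name}({arg!r}) -> {len(candidates)}")
        return candidates

# Toy tools over (photo_id, metadata) pairs; filter names are invented.
def filter_year(cands, year):
    return [c for c in cands if c[1]["year"] == year]

def filter_tag(cands, tag):
    return [c for c in cands if tag in c[1]["tags"]]

photos = [
    ("p1", {"year": 2019, "tags": {"beach"}}),
    ("p2", {"year": 2019, "tags": {"beach", "sunset"}}),
    ("p3", {"year": 2021, "tags": {"sunset"}}),
]
agent = DualMemoryAgent({"filter_year": filter_year, "filter_tag": filter_tag})
result = agent.run([("filter_year", 2019, "s1"), ("filter_tag", "sunset", "s2")],
                   photos)  # -> [("p2", ...)]
```

Persisting each named subset in state memory is what lets a later step revisit an earlier candidate pool instead of restarting the search from the full corpus.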

This setup enables rigorous testing of automated reasoning over long photo corpora, a substantially harder challenge than instance-wise ranking.

Experimental Findings and Numerical Results

DISBench reveals a substantial gap between existing model capabilities and the requirements for corpus-level contextual reasoning:

  • Baseline Performance: The best-performing agent (Claude-Opus-4.5-20251101) reaches an Exact Match score of 28.7 and F1 of 55.0. For direct embedding-based retrieval systems, Recall@3 is below 14%, confirming the inadequacy of traditional one-shot retrieval mechanisms.
  • Query Type Challenge: Inter-event queries, requiring the chaining of evidence across events, consistently present higher difficulty than intra-event queries, with strong models exhibiting clear performance drops on inter-event tasks.
  • Ablation Studies: Removal of metadata tools and explicit memory mechanisms leads to pronounced performance deterioration (e.g., F1 dropping by over 5 points without GetMetadata), particularly for inter-event reasoning.
  • Test-Time Scaling: Running agents in parallel and selecting the best path unlocks latent performance (Best@k F1 up to 60.8), though typical prediction aggregation (majority voting) does not close the gap, highlighting open challenges in robust reasoning path selection.
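The gap between Best@k and majority voting can be made concrete with a small sketch. Set-level F1 and the two aggregation strategies below follow standard definitions; the paper may compute its scores differently:

```python
from collections import Counter

def f1(pred, gold):
    """Set-level F1 between predicted and gold photo-id sets."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def best_at_k(runs, gold):
    """Oracle upper bound: best F1 among k parallel agent runs."""
    return max(f1(run, gold) for run in runs)

def majority_vote(runs, threshold=0.5):
    """Keep items predicted by at least `threshold` fraction of runs."""
    counts = Counter(x for run in runs for x in set(run))
    k = len(runs)
    return {x for x, c in counts.items() if c / k >= threshold}

gold = {"a", "b"}
runs = [{"a"}, {"a", "b"}, {"a", "c"}]
oracle = best_at_k(runs, gold)           # 1.0: the second run is exact
voted = f1(majority_vote(runs), gold)    # lower: voting keeps only "a"
```

The toy case mirrors the reported finding: an oracle that can pick the best reasoning path recovers the correct answer, while majority voting discards evidence found by only one run.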

The error analysis shows that reasoning breakdowns and state-management failures far outnumber raw visual-discrimination errors, indicating that planning, context tracking, and association mining are the core bottlenecks.

Implications and Future Directions

DeepImageSearch and DISBench concretize the corpus-contextual reasoning deficit in multimodal models, establishing new evaluation and development frontiers. Empirical findings argue that scaling vision-language backbones or embeddings alone cannot solve this deficit—progress mandates advances in memory architectures, agentic planning modules, and spatiotemporal reasoning.

Practical implications include the development of intelligent personal assistants capable of navigating visual memory over years rather than retrieving isolated instances. This moves the field toward models that not only match on content but can reconstruct high-level narrative structure and reason over distributed, temporally-evolving evidence.

Methodologically, the semi-automated pipeline for mining and vetting association-centric benchmarks can generalize to new domains requiring scalable, high-quality, multi-modal annotation, including other types of personal data or dynamic corpora.

Ethically, the work foregrounds privacy concerns inherent in reasoning over user-centric visual histories, stressing the need for deployment safeguards, conservative data practice, and transparency.

Future research is likely to focus on learned planning, dynamic memory compression, robust error recovery in agentic search, and automated selection of reasoning strategies. Advanced forms of self-reflection, tool-based backtracking, and continual learning mechanisms will be essential to push agentic image retrieval closer to practical, user-facing deployment.

Conclusion

DeepImageSearch reframes visual retrieval as an agentic reasoning task demanding the synthesis of distributed contextual evidence from visual histories. DISBench and the proposed ImageSeeker agent reveal significant unsolved challenges for state-of-the-art models, substantiating the necessity for new algorithms and architectures focused on corpus-level, context-aware inference. The work establishes rigorous standards for benchmarking and points toward the development of autonomous systems capable of reasoning over complex, temporally-structured user data (2602.10809).
