Contextual reasoning over visual histories for image retrieval

Develop robust algorithms for corpus-level contextual reasoning over user visual histories that accurately retrieve the target image sets specified by natural-language queries in the DeepImageSearch task and its DISBench benchmark. Solving these queries requires multi-step exploration and cross-event association discovery.

Background

DeepImageSearch reframes image retrieval from independent semantic matching to agentic, multi-step exploration over a user’s chronologically ordered photo corpus, where evidence and targets may appear in different images and events. DISBench operationalizes this setting with queries that require either intra-event localization and filtering or inter-event comparison under temporal/spatial constraints.
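
To make the two query types concrete, here is a minimal toy sketch in Python. It is not the benchmark's or the authors' implementation. The `Photo` record, the `event_id` grouping, the text `tags` standing in for visual content, and both search functions are hypothetical names introduced only for illustration. A real system would have to discover events and visual evidence from the images themselves.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Photo:
    photo_id: str
    taken_at: datetime
    event_id: str        # hypothetical: assumes events are already segmented
    tags: frozenset      # hypothetical: text tags stand in for visual content


def intra_event_search(corpus, evidence_tag, target_tag):
    """Locate events that contain the evidence, then filter targets inside them."""
    events = {p.event_id for p in corpus if evidence_tag in p.tags}
    return {p.photo_id for p in corpus
            if p.event_id in events and target_tag in p.tags}


def inter_event_search(corpus, anchor_tag, target_tag):
    """Return targets from other events that occurred before the anchor evidence."""
    anchors = [p for p in corpus if anchor_tag in p.tags]
    if not anchors:
        return set()
    anchor_time = min(p.taken_at for p in anchors)
    anchor_events = {p.event_id for p in anchors}
    return {p.photo_id for p in corpus
            if target_tag in p.tags
            and p.event_id not in anchor_events
            and p.taken_at < anchor_time}


corpus = [
    Photo("p1", datetime(2024, 5, 1), "trip", frozenset({"beach"})),
    Photo("p2", datetime(2024, 5, 1), "trip", frozenset({"beach", "dog"})),
    Photo("p3", datetime(2024, 6, 10), "concert", frozenset({"poster"})),
    Photo("p4", datetime(2024, 6, 10), "concert", frozenset({"guitarist"})),
    Photo("p5", datetime(2024, 7, 4), "picnic", frozenset({"dog"})),
    Photo("p6", datetime(2024, 7, 4), "picnic", frozenset({"guitarist"})),
]

# Evidence ("poster") and target ("guitarist") sit in different images of one event.
intra_result = intra_event_search(corpus, "poster", "guitarist")

# Target ("dog") must come from an event earlier than the one with the poster.
inter_result = inter_event_search(corpus, "poster", "dog")
```

In this toy corpus the intra-event query selects only `p4`, because `p6` shows a guitarist in an event that lacks the poster evidence. The inter-event query selects only `p2`, because `p5` was taken after the anchor event. Per-image semantic matching on the target tag alone would return both candidates in each case, which is the gap the task is designed to expose.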

Even with a specialized agent framework, the strong multimodal models benchmarked achieve low Exact Match scores (best 28.7), with frequent failures in long-horizon planning and cross-event association discovery. Based on these observations, the authors explicitly state that contextual reasoning over visual histories remains an open problem.
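
The summary above does not define the Exact Match metric. The sketch below assumes one common reading: a query counts as correct only when the retrieved image set equals the target set exactly, and the score is the percentage of such queries. The function names and the sample IDs are hypothetical, and the benchmark's actual scoring may differ.

```python
def exact_match(predicted, target):
    """Assumed definition: all-or-nothing equality of image-ID sets."""
    return set(predicted) == set(target)


def exact_match_score(predictions, targets):
    """Percentage of queries whose predicted set equals the target set."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(exact_match(p, t) for p, t in zip(predictions, targets))
    return 100.0 * hits / len(targets)


# Four illustrative queries: two exact hits, one partial hit, one empty answer.
predictions = [{"a", "b"}, {"c"}, {"d", "e"}, set()]
targets = [{"a", "b"}, {"c", "x"}, {"e", "d"}, {"f"}]

score = exact_match_score(predictions, targets)
```

Under this assumed definition a partially correct set, such as retrieving `c` but missing `x`, earns no credit. That strictness is one reason a set-retrieval task of this kind can yield low scores even when models find some of the right images.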

References

Extensive experiments demonstrate that this task poses significant challenges for state-of-the-art models, confirming that contextual reasoning over visual histories remains an open problem.

DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories (2602.10809 - Deng et al., 11 Feb 2026) in Section 6. Conclusion