
Dialog-based Interactive Image Retrieval (1805.00145v3)

Published 1 May 2018 in cs.CV and cs.AI

Abstract: Existing methods for interactive image retrieval have demonstrated the merit of integrating user feedback, improving retrieval results. However, most current systems rely on restricted forms of user feedback, such as binary relevance responses, or feedback based on a fixed set of relative attributes, which limits their impact. In this paper, we introduce a new approach to interactive image search that enables users to provide feedback via natural language, allowing for more natural and effective interaction. We formulate the task of dialog-based interactive image retrieval as a reinforcement learning problem, and reward the dialog system for improving the rank of the target image during each dialog turn. To mitigate the cumbersome and costly process of collecting human-machine conversations as the dialog system learns, we train our system with a user simulator, which is itself trained to describe the differences between target and candidate images. The efficacy of our approach is demonstrated in a footwear retrieval application. Experiments on both simulated and real-world data show that 1) our proposed learning framework achieves better accuracy than other supervised and reinforcement learning baselines and 2) user feedback based on natural language rather than pre-specified attributes leads to more effective retrieval results, and a more natural and expressive communication interface.

Dialog-based Interactive Image Retrieval

This paper presents a novel approach to interactive image retrieval that leverages dialog-based user feedback in natural language, moving beyond conventional methods that rely on binary relevance feedback or predetermined sets of attributes. The proposed system introduces a multi-modal dialog protocol where users can iteratively refine image search results by providing free-form natural language feedback. This approach addresses the inherent limitations of pre-defined feedback mechanisms, enabling more expressive and effective user interactions.

Methodology

The core of this system involves framing the interactive image retrieval task as a reinforcement learning (RL) problem. The dialog agent is rewarded for improving the ranking position of the target image with each dialog turn. This RL formulation allows the system to directly optimize retrieval performance, promoting the target image toward the top of the retrieval list rather than optimizing simpler, attribute-defined surrogate metrics.
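As a rough illustration of this reward idea, the sketch below computes a per-turn reward from the target's rank among candidate similarity scores. The function name, scoring scheme, and normalization are illustrative assumptions, not the paper's exact reward shaping.

```python
import numpy as np

def turn_reward(scores: np.ndarray, target_idx: int) -> float:
    """Hypothetical per-turn reward: higher (closer to 0) when the
    target image ranks nearer the top of the retrieval list.

    scores: similarity scores of all candidate images under the
    current dialog state.
    """
    # Rank of the target image (0 = best) among all candidates.
    rank = int(np.sum(scores > scores[target_idx]))
    # Penalize low ranks; the paper's actual reward shaping may differ.
    return -rank / len(scores)

scores = np.array([0.2, 0.9, 0.5, 0.7])
print(turn_reward(scores, target_idx=1))  # target is top-ranked, so no penalty
```

Because the reward depends only on the target's rank, the agent is optimized for the retrieval metric itself rather than for any intermediate attribute-prediction objective.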

A significant challenge addressed in this research is the expense involved in collecting data from human-machine interactions for training purposes. To overcome this, the authors utilize a user simulator trained to generate natural language feedback that captures visual differences between images. This simulator substitutes for human users during the initial training phases, allowing the dialog agent to learn efficiently without extensive human dialogue data.
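A toy stand-in for such a simulator is sketched below: it compares hypothetical attribute dictionaries for the target and the shown candidate and emits templated feedback about their differences. The paper's actual simulator is a learned relative captioner that generates free-form natural language from image pairs; the attribute dictionaries and templates here are purely illustrative.

```python
def simulate_feedback(target_attrs: dict, candidate_attrs: dict) -> str:
    """Toy user simulator: describe how the shown candidate differs
    from the (hidden) target. Stands in for the paper's learned
    relative captioner, which generates free-form captions."""
    phrases = []
    for attr, want in target_attrs.items():
        have = candidate_attrs.get(attr)
        if have != want:
            phrases.append(f"{want} {attr} instead of {have}")
    return "I want " + " and ".join(phrases) if phrases else "This one is perfect"

print(simulate_feedback(
    {"color": "black", "heel": "flat"},
    {"color": "brown", "heel": "high"},
))
```

Training against a simulator like this lets the agent take many dialog turns cheaply; the learned captioner plays the same role with far richer language.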

The paper details the architecture of the proposed dialog manager, consisting of three main components: a response encoder, a state tracker, and a candidate generator. The response encoder creates joint visual-semantic representations from user feedback and candidate images. The state tracker aggregates dialog history, while the candidate generator selects the next image to present to the user to maximize retrieval performance.
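The interaction among these three components can be sketched as a simple loop. The fusion-by-addition encoder, moving-average state tracker, and dot-product candidate scoring below are simplified stand-ins (the paper uses learned networks, e.g. a recurrent state tracker); all names and dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embeddings: 100 candidate images, 32-dim features.
image_feats = rng.normal(size=(100, 32))

def encode_response(image_feat, text_feat):
    # Response encoder: joint visual-semantic representation.
    # Simple addition stands in for the paper's learned fusion.
    return image_feat + text_feat

def track_state(state, response_emb, decay=0.7):
    # State tracker: aggregate dialog history. An exponential moving
    # average is a rough stand-in for the paper's recurrent tracker.
    return decay * state + (1 - decay) * response_emb

def generate_candidate(state):
    # Candidate generator: show the image most similar to the state.
    return int(np.argmax(image_feats @ state))

state = np.zeros(32)
for turn in range(3):
    shown = generate_candidate(state)
    text_feat = rng.normal(size=32)  # simulated embedding of user feedback
    state = track_state(state, encode_response(image_feats[shown], text_feat))
    print(f"turn {turn}: showing image {shown}")
```

Each turn, the user's feedback about the shown image is folded into the state, so the next candidate reflects the accumulated dialog history rather than the last utterance alone.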

Results and Implications

Empirical evaluations demonstrate the system’s efficacy in an interactive footwear retrieval task, showing that it consistently outperforms traditional attribute-based methods across various metrics. Notably, the framework achieves superior accuracy compared to supervised learning baselines and iterative attribute feedback approaches. The results highlight the advantage of incorporating natural language in retrieval dialogs, leading to a communication interface that is not only more flexible but also significantly more effective.

One of the key implications of this research is its potential applicability across different types of visual media beyond image retrieval, such as video or graphical content. The introduction of natural language feedback could transform how retrieval systems interact with vast datasets, aligning them more closely with human communication patterns and potentially improving the retrieval experience.

Future Directions

Future work, as suggested by the authors, could involve extending this dialog-based retrieval framework to include other modalities and incorporating external semantic information, such as textual metadata associated with images. This could further enhance the natural language interface and broaden the system's applicability. Moreover, refining the user simulator to incorporate dialog history more effectively could lead to even more responsive and contextually aware dialog agents.

Overall, this paper makes a substantial contribution to interactive image retrieval by pioneering the use of fully expressive natural language feedback within a reinforcement learning framework. This approach represents a significant step toward more nuanced, human-like interactions in AI-driven retrieval systems.

Authors (6)
  1. Xiaoxiao Guo (38 papers)
  2. Hui Wu (54 papers)
  3. Yu Cheng (354 papers)
  4. Steven Rennie (6 papers)
  5. Gerald Tesauro (29 papers)
  6. Rogerio Schmidt Feris (2 papers)
Citations (193)