- The paper presents ComposeAE, which uses a complex space representation and novel rotational symmetry constraints to combine image and text features.
- It leverages BERT-based text embeddings and convolutional mapping to capture both global context and local image details in query modifications.
- Empirical results on benchmarks like Fashion200k and MIT-States demonstrate up to a 30.12% improvement in Recall@10 over previous methods.
An Evaluation of Compositional Learning of Image-Text Query for Image Retrieval
The paper presents a novel model, ComposeAE, designed for image retrieval via compositional learning of multi-modal queries that combine an image with text. The task requires the retrieval system not only to account for the visual content of the query image but also to apply the modifications specified in the accompanying text. This setting is directly relevant to applications such as e-commerce, where a customer may wish to find variations of a product based on a reference image plus additional specifications.
ComposeAE leverages an autoencoder architecture underpinned by deep metric learning principles. By constructing a common complex space for both image and text features, ComposeAE introduces a method to capture the nuanced relationship encoded within a multi-modal query. The paper introduces several key technical innovations that distinguish ComposeAE from previous methods such as TIRG.
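As a point of reference, the deep metric learning objective typically used in this line of work (TIRG and its successors) can be sketched as a batch-wise softmax retrieval loss. The snippet below is an illustrative PyTorch version under that assumption, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def batch_retrieval_loss(composed: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Batch-wise softmax retrieval loss: each composed query should be most
    similar to its own target image among all targets in the batch.
    Illustrative deep-metric-learning objective, not the paper's exact loss."""
    composed = F.normalize(composed, dim=-1)   # L2-normalize composed query features
    target = F.normalize(target, dim=-1)       # L2-normalize target image features
    logits = composed @ target.t()             # (B, B) cosine similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)  # matching pairs on the diagonal
    return F.cross_entropy(logits, labels)

# Usage: loss = batch_retrieval_loss(composed_feat, target_feat) on a batch of (query, target) pairs.
```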
Technical Innovations and Methodology
- Complex Space Representation: The paper posits that the target image features can be viewed as a rotation of the query image features in a complex space. Under this formulation, the transformation conveyed by the textual query is expressed as angular displacements, with the rotation angles specified by the text embeddings (see the sketch following this list).
- Rotational Symmetry Constraint: The authors propose a novel symmetry constraint during optimization: applying the complex conjugate of the text-specified rotation to the target image features should recover the query image features. This mirrors the compositional symmetry often found in linguistic transformations and acts as an implicit regularizer promoting robust feature composition.
- Utilization of BERT for Text Embeddings: Departing from the LSTM-based text encoders of earlier work, ComposeAE employs BERT embeddings, on the hypothesis that BERT's contextualized language representations capture complex and nuanced natural language queries better than LSTMs (see the text-encoding sketch after this list).
- Convolutional Mapping: The model includes a convolutional mapping stage that integrates local and global feature interactions, which is particularly beneficial for capturing spatial dependencies when the requested modifications concern localized image details.
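To make the first two points concrete, here is a minimal PyTorch sketch of a complex-space rotation composition and the accompanying symmetry regularizer. The angle mapping (`angle_head`), the use of `torch.polar`, and the MSE reconstruction term are illustrative assumptions; the paper's actual mappings and loss weighting may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComplexRotationComposer(nn.Module):
    """Sketch of a complex-space composition: image features are rotated by
    per-dimension angles predicted from the text embedding. Layer sizes and
    the angle mapping are illustrative, not the paper's exact choices."""

    def __init__(self, img_dim: int, text_dim: int):
        super().__init__()
        self.angle_head = nn.Linear(text_dim, img_dim)  # text embedding -> rotation angles

    def forward(self, img_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        theta = self.angle_head(text_feat)                     # (B, D) rotation angles
        rotation = torch.polar(torch.ones_like(theta), theta)  # unit complex numbers e^{i*theta}
        img_complex = img_feat.to(torch.complex64)             # treat image features as complex
        return img_complex * rotation                          # rotated query = composed features

def rotational_symmetry_loss(composer: ComplexRotationComposer,
                             query_img: torch.Tensor,
                             target_img: torch.Tensor,
                             text_feat: torch.Tensor) -> torch.Tensor:
    """Symmetry regularizer: rotating the target features by the complex
    conjugate (the inverse rotation) should recover the query features."""
    theta = composer.angle_head(text_feat)
    conj_rotation = torch.polar(torch.ones_like(theta), -theta)  # e^{-i*theta}
    reconstructed_query = target_img.to(torch.complex64) * conj_rotation
    return F.mse_loss(torch.view_as_real(reconstructed_query),
                      torch.view_as_real(query_img.to(torch.complex64)))
```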
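For the text side, sentence-level BERT features can be obtained, for example, by mean-pooling the final hidden states. The snippet below uses the Hugging Face `transformers` library as one common option; this tooling and the pooling choice are assumptions, not necessarily the paper's exact setup.

```python
import torch
from transformers import BertModel, BertTokenizer

# Encode modification texts into fixed-size embeddings by mean-pooling
# BERT's final hidden states (one common pooling choice).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def encode_text(sentences: list[str]) -> torch.Tensor:
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = bert(**batch).last_hidden_state          # (B, T, 768) token representations
    mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1), ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

text_feat = encode_text(["replace the blue dress with a red one"])
print(text_feat.shape)  # torch.Size([1, 768])
```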
Results and Baseline Comparisons
The empirical evaluation of ComposeAE is performed on three benchmark datasets: MIT-States, Fashion200k, and Fashion IQ. ComposeAE outperforms existing methods, including the then state-of-the-art TIRG, by margins of 30.12% on Fashion200k and 11.13% on MIT-States in terms of Recall@10, highlighting its efficacy in compositional query learning. The composed architecture and rotational symmetry constraint appear to bolster retrieval particularly on datasets characterized by complex and verbose query modifications, such as Fashion IQ.
Practical and Theoretical Implications
The proposed approach enhances the understanding of how complex image-text relationships can be modeled through joint representations. From a practical standpoint, ComposeAE offers significant improvements for intelligent systems that must understand and respond to complex multi-modal queries, increasing their utility in consumer-facing applications such as e-commerce, where customers' descriptions often accompany visual input.
On a theoretical level, this work furthers the discourse on embedding techniques at the intersection of visual perception and linguistic processing. Future work may explore complex-domain embedding techniques further and extend them across different modalities and tasks.
Conclusion
In summary, this paper presents a compelling advancement in image retrieval by addressing the compositional learning of image-text queries through an autoencoder-based architecture and a complex-space representation. The rotational symmetry constraint as a novel regularization mechanism, together with BERT-based text embeddings, marks a significant step beyond previous methodologies. The findings point to further research on aligning modalities in joint embedding spaces, with implications for numerous AI applications in retrieval systems and beyond.