- The paper introduces a weakly-supervised conditional embedding method that boosts Recall@1 by 6% over traditional models.
- It presents the LAION-RVS-Fashion dataset with 272K fashion products and 842K images, supporting evaluations with over 2M distractors.
- The method conditions a Vision Transformer on user-provided input tokens, enabling precise, user-specific retrieval in complex image scenarios.
Weakly-Supervised Conditional Embedding for Referred Visual Search
The paper introduces a novel approach to image similarity search in the fashion domain, where defining similarity is inherently ambiguous. The proposed task, termed Referred Visual Search (RVS), lets users specify the kind of similarity they want, resolving the ambiguity that arises when several notions of similarity (e.g., color, style, pattern) coexist in fashion image retrieval.
Contributions
Two principal contributions are offered: the introduction of a new dataset, LAION-RVS-Fashion (LRVS-F), and a method for learning conditional embeddings through weakly-supervised training.
- LAION-RVS-Fashion Dataset: Extracted from LAION-5B, this dataset contains 272K fashion products and 842K images, and includes a test set with over 2M distractors. Its design encourages the extraction of specific object features and enables robust evaluation against extensive galleries.
- Weakly-Supervised Conditional Embedding: The proposed method conditions a Vision Transformer on additional input tokens representing the user's specification (a minimal sketch follows below). It outperforms approaches based on explicit detection and segmentation, improving Recall@1 (R@1) by 6% over models employing classical attention and filtering techniques.
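To make the conditioning mechanism concrete, the following sketch shows one plausible way to append a learned condition token to a ViT's input sequence. It assumes a timm-style encoder exposing `patch_embed`, `cls_token`, `blocks`, and `norm`, omits positional embeddings and other details for brevity, and uses illustrative names rather than the authors' exact implementation.

```python
# Minimal sketch of conditioning a ViT with an extra input token (PyTorch).
# Names, dimensions, and the condition vocabulary are illustrative assumptions.
import torch
import torch.nn as nn


class ConditionalViT(nn.Module):
    def __init__(self, vit_encoder, embed_dim=768, num_conditions=10):
        super().__init__()
        self.patch_embed = vit_encoder.patch_embed   # image -> patch tokens
        self.cls_token = vit_encoder.cls_token       # learned [CLS] token
        self.blocks = vit_encoder.blocks              # transformer blocks
        self.norm = vit_encoder.norm
        # One learned embedding per condition (e.g., "upper body", "bags", ...).
        self.cond_embed = nn.Embedding(num_conditions, embed_dim)

    def forward(self, images, condition_ids):
        x = self.patch_embed(images)                        # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)      # (B, 1, D)
        cond = self.cond_embed(condition_ids).unsqueeze(1)  # (B, 1, D)
        # Concatenate [CLS], patch tokens, and the conditioning token so that
        # self-attention can modulate the representation by the user's request.
        # (Positional embeddings are omitted here for brevity.)
        x = torch.cat([cls, x, cond], dim=1)
        for blk in self.blocks:
            x = blk(x)
        x = self.norm(x)
        return nn.functional.normalize(x[:, 0], dim=-1)     # conditional embedding
```

Because the condition token attends to every patch token (and vice versa), the [CLS] output can emphasize whichever item matches the user's request without any explicit detection or segmentation step.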
Experimental Evaluation
The experimental results confirm the method's efficacy, particularly for complex images containing multiple distracting items. The trained models show substantial improvements on conditional retrieval tasks, retaining their recall advantage over the baselines even when facing 2.5 times as many distractors. Qualitative retrieval samples illustrate the model's capacity to dynamically adapt its embeddings to user-defined conditions.
When compared to existing approaches such as ASEN and baseline Vision Transformer models, the proposed CondViT shows significantly higher performance. Notably, the model remains robust even as the gallery size and number of distractors increase, suggesting its potential applicability in real-world large-scale retrieval systems.
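As a rough illustration of how recall is measured against such large galleries, the snippet below computes Recall@1 for a batch of query embeddings whose targets are mixed with distractor embeddings; the function and tensor names are assumptions for illustration, not the paper's evaluation code.

```python
# Hypothetical sketch of Recall@1 against a gallery padded with distractors.
# Shapes and variable names are assumptions, not the paper's evaluation code.
import torch


def recall_at_1(query_emb, target_emb, distractor_emb):
    """query_emb: (Q, D); target_emb: (Q, D), where target_emb[i] matches query i;
    distractor_emb: (M, D). All embeddings are assumed L2-normalized."""
    gallery = torch.cat([target_emb, distractor_emb], dim=0)        # (Q + M, D)
    sims = query_emb @ gallery.T                                     # cosine similarities
    top1 = sims.argmax(dim=1)                                        # index of best match
    # Target for query i sits at row i of the gallery by construction.
    correct = top1 == torch.arange(query_emb.size(0), device=top1.device)
    return correct.float().mean().item()
```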
Implications and Future Directions
The implications of this research extend both practically and theoretically, within fashion image retrieval and beyond. Practically, incorporating user input directly into the retrieval model, without pre-defined categories or reliance on segmentation, streamlines the user experience and improves the accuracy of product searches. Theoretically, conditional embeddings enrich large-scale retrieval by demonstrating that user context and conditions can be integrated directly into the representation.
Future directions for this line of research could involve expanding the scope of conditional embeddings to other modalities or integrating more sophisticated user queries. Additionally, exploring more diverse datasets beyond fashion could enhance the generalizability of the method.
In conclusion, this research offers a compelling advancement in the field of visual search, contributing both a valuable dataset and an innovative method for achieving highly nuanced, condition-specific retrievals. The potential applications of these findings and methodologies may impact various sectors requiring precise image retrieval functionalities.