Overview of Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-Modal Pretraining
The paper addresses instance-level product retrieval in a multi-modal E-commerce setting, a problem on which existing image-level retrieval methods fall short. Its primary contributions are Product1M, a large-scale dataset designed to reflect real-world product retrieval scenarios, and CAPTURE, a cross-modal pretraining model for more effective multi-modal retrieval.
Key Contributions
1. Introduction of the Product1M Dataset:
Product1M is a large-scale dataset for instance-level retrieval in the cosmetics domain, comprising over one million image-caption pairs with substantial diversity. It contains two types of samples, single-product and multi-product, and reflects real-world complexities such as fine-grained categories, diverse product combinations, and weak (fuzzy) image-text correspondences, all of which make retrieval challenging; a hypothetical record shape is sketched below.
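To make the setting concrete, the following is a minimal sketch of what a Product1M-style training record might look like. The field names and values are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical shape of one Product1M-style training sample.
# Field names are assumptions for illustration, not the dataset's real schema.
sample = {
    "image_path": "images/000123.jpg",  # product image, possibly showing several products
    "caption": "Brand hydrating set: toner, essence, and cream",  # may name multiple instances
    "sample_type": "multi-product",     # or "single-product"
}
```

Note that only the caption supervises training: there are no bounding boxes or per-instance labels, which is what makes the setting weakly supervised.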
2. Weakly Supervised Retrieval Setting:
The paper formulates a practical setting in which multi-modal instance-level product retrieval is learned under weak supervision: training data carry only image-level captions rather than instance-level labels. Unlike traditional image-level retrieval, this setting requires extracting fine-grained, instance-level features from large quantities of weakly annotated data while capturing subtle attribute and category distinctions between products.
3. Development of the CAPTURE Model:
CAPTURE, a Cross-modal contrAstive Product Transformer, is proposed to enable instance-level product retrieval in multi-modal settings. This hybrid-stream transformer learns cross-modal features in a self-supervised manner, combining masked multi-modal modeling tasks with a cross-modal contrastive loss that aligns image and text features; a sketch of such a loss follows.
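As an illustration of the contrastive component, below is a minimal PyTorch sketch of a symmetric cross-modal contrastive (InfoNCE-style) loss over a batch of matched image-caption embeddings. The function name and temperature value are assumptions; this is a generic formulation, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (batch, dim) embeddings of matched image-caption pairs.
    Hypothetical sketch; 0.07 is a common default temperature, not from the paper."""
    img_emb = F.normalize(img_emb, dim=-1)          # unit-normalize both modalities
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature    # cosine similarities, scaled
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs on the diagonal
    # Symmetric objective: image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

The loss pulls each image embedding toward its own caption and pushes it away from the other captions in the batch, which is one standard way to align the two feature spaces.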
Experimental Insights
Experiments show that CAPTURE outperforms existing cross-modal self-supervised pretraining methods, improving precision and recall across retrieval configurations and reinforcing its applicability to real-world multi-modal retrieval. In particular, its performance in zero-shot retrieval scenarios suggests it can adapt to new product categories without explicit annotations.
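As a rough illustration of how such rank-based metrics are computed, the sketch below evaluates precision and recall at rank k for a single query; the function and variable names are hypothetical, and the paper's exact protocol (e.g., its mAP/mAR variants) may differ.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """ranked_ids: retrieved product IDs in rank order; relevant_ids: ground-truth set."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / k, hits / max(len(relevant_ids), 1)

# Example: 2 of the top-3 retrieved products are correct, out of 4 relevant ones.
p, r = precision_recall_at_k(["a", "x", "b"], {"a", "b", "c", "d"}, k=3)
# p == 2/3, r == 2/4
```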
Implications and Future Directions
The contributions of this paper have several significant implications:
- Practical Application: The methodologies and insights gained from this research offer practical solutions for E-commerce platforms, enabling more accurate product retrieval systems that can efficiently handle multi-modal data and weak annotations.
- Theoretical Advancements: This work extends cross-modal retrieval research by addressing instance-level challenges, highlighting the importance of fine-grained multi-modal feature alignment for product retrieval.
- Dataset Utility: Product1M serves as a valuable resource for continued exploration in both academic and commercial domains, providing a realistic benchmark for developing robust retrieval algorithms tailored to the intricacies of authentic E-commerce data.
Future research may focus on improving detection accuracy within CAPTURE, exploring augmentation techniques for multi-product detection, or applying transfer learning to extend CAPTURE to domains beyond cosmetics.
In conclusion, this paper substantially advances instance-level retrieval in multi-modal, weakly supervised settings and is likely to spur further innovation in E-commerce product retrieval systems and beyond.