- The paper introduces MRSE, a multi-modality retrieval system that integrates text, images, and user preferences using lightweight mixture-of-experts modules.
- The paper employs a two-stage architecture for feature extraction and fusion, achieving over 18.9% improvement in offline relevance and a 3.7% gain in online metrics.
- The paper demonstrates that combining softmax cross-entropy with hard-negative triplet loss enhances model robustness and retrieval performance for large-scale e-commerce.
An Efficient Multi-Modality Retrieval System (MRSE) for Large Scale E-commerce
Introduction
The paper under discussion proposes MRSE, an advanced Multi-modality Retrieval System for E-commerce, designed to address the limitations of traditional Embedding-based Retrieval Systems (ERS), which rely primarily on textual features. The system integrates multiple modalities (text, images, and user preferences) through lightweight mixture-of-experts (LMoE) modules, providing a robust solution for high-quality item recall on large-scale e-commerce platforms.
Key Components and Methodology
The MRSE architecture consists of two main stages: multi-modality extraction and multi-modality fusion.
- Multi-modality Extraction (Stage 1):
- VBert: Integrates visual and textual features using a combination of a pre-trained ViT-L model and a lightweight BERT.
- FtAtt: Employs FastText for text feature extraction and an attention mechanism to capture token-level significance.
- Light-BERT: Utilizes a two-layer BERT model for extracting deeper semantic features from text.
- Multi-modality Fusion (Stage 2):
- Implements a Deep Structured Semantic Model (DSSM) to integrate the multi-modal representations derived from LMoE modules into a unified embedding space.
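The two-stage flow can be pictured as a gated combination of per-expert embeddings. The sketch below is an illustrative assumption, not the paper's implementation: the softmax gate and weighted sum stand in for the LMoE routing, and the actual Stage-2 fusion uses a DSSM-style tower rather than this simple projection.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_experts(expert_embs, gate_logits):
    """Stage-2 fusion sketch: weight the per-expert embeddings
    (e.g. VBert, FtAtt, Light-BERT outputs) by a softmax gate and
    normalize into one unified embedding space.
    Gating scheme and weighting are assumptions for illustration."""
    w = softmax(gate_logits)                      # one weight per expert
    fused = sum(wi * e for wi, e in zip(w, expert_embs))
    return fused / np.linalg.norm(fused)          # unit-norm unified embedding

# Toy example: three 4-d expert outputs for a single item.
rng = np.random.default_rng(0)
experts = [rng.standard_normal(4) for _ in range(3)]
emb = fuse_experts(experts, np.array([0.5, 1.0, 0.2]))
```

Normalizing the fused vector keeps item and query embeddings comparable by cosine similarity, which is the usual setup for DSSM-style retrieval.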
Hybrid Loss Function
A novel hybrid loss function is introduced to enhance the model's robustness:
- Softmax Cross-Entropy Loss: Uses mixed negative sampling to accelerate model convergence.
- Hard-Negative Triplet Loss: Focuses on distinguishing between relevant and hard-negative samples, ensuring better alignment of multi-modal features.
- The combination of these loss functions is shown to improve overall model performance significantly.
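The two terms above can be sketched in a few lines. This is a minimal illustration under assumed conventions (dot-product similarity, temperature `tau`, margin and `alpha` weighting chosen arbitrarily); the paper's exact formulation and hyperparameters may differ.

```python
import numpy as np

def softmax_ce(query, pos, negs, tau=0.1):
    """Softmax cross-entropy over one positive and a set of mixed
    negatives (e.g. in-batch plus sampled); drives fast convergence."""
    logits = np.array([query @ pos] + [query @ n for n in negs]) / tau
    logits -= logits.max()                         # numerical stability
    return -logits[0] + np.log(np.exp(logits).sum())

def hard_neg_triplet(query, pos, hard_neg, margin=0.2):
    """Triplet term on a mined hard negative: the positive must score
    at least `margin` higher than the hard negative."""
    return max(0.0, margin - query @ pos + query @ hard_neg)

def hybrid_loss(query, pos, negs, hard_neg, alpha=1.0):
    # alpha balances the two terms; the weighting is an assumption.
    return softmax_ce(query, pos, negs) + alpha * hard_neg_triplet(query, pos, hard_neg)

# Toy example with unit vectors.
q = np.array([1.0, 0.0])
pos, hn = np.array([1.0, 0.0]), np.array([0.0, 1.0])
negs = [np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
loss = hybrid_loss(q, pos, negs, hn)
```

Intuitively, the cross-entropy term shapes the global embedding space quickly, while the triplet term concentrates gradient on the hard cases near the decision boundary.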
Experimental Results
Extensive experiments on a large-scale industrial dataset from Shopee demonstrate substantial performance gains:
- MRSE achieves over 18.9% improvement in offline relevance and a 3.7% gain in online core metrics (including IMP, CTR, and GMV) compared to Shopee’s state-of-the-art uni-modality product understanding system.
- The system is validated through offline metrics such as Recall@Kwr, Recall@Kur, and Rele@Kwp, and further confirmed via online A/B testing.
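For reference, the generic Recall@K computation underlying these offline metrics looks as follows. This is the standard definition only; the paper's Recall@Kwr, Recall@Kur, and Rele@Kwp variants apply query- and relevance-specific weighting not modeled here.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k retrieved
    list. Standard definition, used here purely for illustration."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Toy check: 2 of the 3 relevant items appear in the top-4 results.
r = recall_at_k(["a", "b", "c", "d"], ["a", "d", "x"], k=4)
```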
Implications and Future Directions
The findings indicate that MRSE effectively integrates various modalities to capture complex user intentions, thus enhancing the relevance and diversity of retrieved items. The "Divide and Conquer" strategy ensures that the system adapts to individual user preferences, contributing to a better user experience and increased business efficiency.
In terms of theoretical implications, this work highlights the importance of leveraging multi-modal data and hybrid training objectives to improve retrieval systems. Practically, the deployment of MRSE as a base model at Shopee sets a precedent for other e-commerce platforms aiming to enhance their search systems.
Future research directions include incorporating long-term user behavior and temporal data to further refine retrieval capabilities, thereby ensuring sustained improvement of e-commerce search systems.
Conclusion
MRSE represents a significant advancement in the field of multi-modality retrieval systems for e-commerce, addressing key limitations of existing ERS models. Its innovative architecture and hybrid loss function demonstrate substantial improvements in item recall and search performance, highlighting the importance of integrating diverse modalities and user behaviors in modern e-commerce platforms.