- The paper introduces MRSE, a multi-modality retrieval system that integrates text, images, and user preferences using lightweight mixture-of-experts modules.
- The paper employs a two-stage architecture for feature extraction and fusion, achieving over 18.9% improvement in offline relevance and a 3.7% gain in online metrics.
- The paper demonstrates that combining softmax cross-entropy with hard-negative triplet loss enhances model robustness and retrieval performance for large-scale e-commerce.
An Efficient Multi-Modality Retrieval System (MRSE) for Large Scale E-commerce
Introduction
The paper under discussion proposes MRSE, an advanced Multi-modality Retrieval System for E-commerce, designed to address the limitations of traditional Embedding-based Retrieval Systems (ERS), which rely primarily on textual features. The system integrates multiple modalities (text, images, and user preferences) through lightweight mixture-of-experts (LMoE) modules, providing a robust solution for high-quality item recall on large-scale e-commerce platforms.
Key Components and Methodology
The MRSE architecture consists of two main stages: multi-modality extraction and multi-modality fusion.
- Multi-modality Extraction (Stage 1):
- VBert: Integrates visual and textual features using a combination of a pre-trained ViT-L model and a lightweight BERT.
- FtAtt: Employs FastText for text feature extraction and an attention mechanism to capture token-level significance.
- Light-BERT: Utilizes a two-layer BERT model for extracting deeper semantic features from text.
- Multi-modality Fusion (Stage 2):
- Implements a Deep Structured Semantic Model (DSSM) to integrate the multi-modal representations derived from LMoE modules into a unified embedding space.
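The two-stage flow can be pictured as a gated combination of per-expert embeddings. The sketch below is an illustrative assumption, not the paper's implementation: the softmax gate and weighted sum stand in for the LMoE routing, and the actual Stage-2 fusion uses a DSSM-style tower rather than this simple projection.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_experts(expert_embs, gate_logits):
    """Stage-2 fusion sketch: weight the per-expert embeddings
    (e.g. VBert, FtAtt, Light-BERT outputs) by a softmax gate and
    normalize into one unified embedding space.
    Gating scheme and weighting are assumptions for illustration."""
    w = softmax(gate_logits)                      # one weight per expert
    fused = sum(wi * e for wi, e in zip(w, expert_embs))
    return fused / np.linalg.norm(fused)          # unit-norm unified embedding

# Toy example: three 4-d expert outputs for a single item.
rng = np.random.default_rng(0)
experts = [rng.standard_normal(4) for _ in range(3)]
emb = fuse_experts(experts, np.array([0.5, 1.0, 0.2]))
```

Normalizing the fused vector keeps item and query embeddings comparable by cosine similarity, which is the usual setup for DSSM-style retrieval.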
Hybrid Loss Function
A novel hybrid loss function is introduced to enhance the model's robustness:
- Softmax Cross-Entropy Loss: Uses mixed negative sampling to accelerate model convergence.
- Hard-Negative Triplet Loss: Focuses on distinguishing between relevant and hard-negative samples, ensuring better alignment of multi-modal features.
- The combination of these loss functions is shown to improve overall model performance significantly.
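The two terms above can be sketched in a few lines. This is a minimal illustration under assumed conventions (dot-product similarity, temperature `tau`, margin and `alpha` weighting chosen arbitrarily); the paper's exact formulation and hyperparameters may differ.

```python
import numpy as np

def softmax_ce(query, pos, negs, tau=0.1):
    """Softmax cross-entropy over one positive and a set of mixed
    negatives (e.g. in-batch plus sampled); drives fast convergence."""
    logits = np.array([query @ pos] + [query @ n for n in negs]) / tau
    logits -= logits.max()                         # numerical stability
    return -logits[0] + np.log(np.exp(logits).sum())

def hard_neg_triplet(query, pos, hard_neg, margin=0.2):
    """Triplet term on a mined hard negative: the positive must score
    at least `margin` higher than the hard negative."""
    return max(0.0, margin - query @ pos + query @ hard_neg)

def hybrid_loss(query, pos, negs, hard_neg, alpha=1.0):
    # alpha balances the two terms; the weighting is an assumption.
    return softmax_ce(query, pos, negs) + alpha * hard_neg_triplet(query, pos, hard_neg)

# Toy example with unit vectors.
q = np.array([1.0, 0.0])
pos, hn = np.array([1.0, 0.0]), np.array([0.0, 1.0])
negs = [np.array([0.0, 1.0]), np.array([-1.0, 0.0])]
loss = hybrid_loss(q, pos, negs, hn)
```

Intuitively, the cross-entropy term shapes the global embedding space quickly, while the triplet term concentrates gradient on the hard cases near the decision boundary.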
Experimental Results
Extensive experiments on a large-scale industrial dataset from Shopee demonstrate substantial performance gains:
- MRSE achieves over 18.9% improvement in offline relevance and a 3.7% gain in online core metrics (including IMP, CTR, and GMV) compared to Shopee’s state-of-the-art uni-modality product understanding system.
- The system is validated through offline metrics such as Recall@Kwr, Recall@Kur, and Rele@Kwp, and further confirmed via online A/B testing.
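For reference, the generic Recall@K computation underlying these offline metrics looks as follows. This is the standard definition only; the paper's Recall@Kwr, Recall@Kur, and Rele@Kwp variants apply query- and relevance-specific weighting not modeled here.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k retrieved
    list. Standard definition, used here purely for illustration."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Toy check: 2 of the 3 relevant items appear in the top-4 results.
r = recall_at_k(["a", "b", "c", "d"], ["a", "d", "x"], k=4)
```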
Implications and Future Directions
The findings indicate that MRSE effectively integrates various modalities to capture complex user intentions, thus enhancing the relevance and diversity of retrieved items. The "Divide and Conquer" strategy ensures that the system adapts to individual user preferences, contributing to a better user experience and increased business efficiency.
In terms of theoretical implications, this work highlights the importance of leveraging multi-modal data and hybrid training objectives to improve retrieval systems. Practically, the deployment of MRSE as a base model at Shopee sets a precedent for other e-commerce platforms aiming to enhance their search systems.
Future research directions include incorporating long-term user behavior and temporal data to further refine retrieval capabilities, thereby ensuring sustained improvement of e-commerce search systems.
Conclusion
MRSE represents a significant advancement in the field of multi-modality retrieval systems for e-commerce, addressing key limitations of existing ERS models. Its innovative architecture and hybrid loss function demonstrate substantial improvements in item recall and search performance, highlighting the importance of integrating diverse modalities and user behaviors in modern e-commerce platforms.