Embedding-based Product Retrieval in Taobao Search

Published 17 Jun 2021 in cs.IR | (2106.09297v1)

Abstract: Nowadays, the product search service of e-commerce platforms has become a vital shopping channel in people's life. The retrieval phase of products determines the search system's quality and gradually attracts researchers' attention. Retrieving the most relevant products from a large-scale corpus while preserving personalized user characteristics remains an open question. Recent approaches in this domain have mainly focused on embedding-based retrieval (EBR) systems. However, after a long period of practice on Taobao, we find that the performance of the EBR system is dramatically degraded due to its: (1) low relevance with a given query and (2) discrepancy between the training and inference phases. Therefore, we propose a novel and practical embedding-based product retrieval model, named Multi-Grained Deep Semantic Product Retrieval (MGDSPR). Specifically, we first identify the inconsistency between the training and inference stages, and then use the softmax cross-entropy loss as the training objective, which achieves better performance and faster convergence. Two efficient methods are further proposed to improve retrieval relevance, including smoothing noisy training data and generating relevance-improving hard negative samples without requiring extra knowledge and training procedures. We evaluate MGDSPR on Taobao Product Search with significant metrics gains observed in offline experiments and online A/B tests. MGDSPR has been successfully deployed to the existing multi-channel retrieval system in Taobao Search. We also introduce the online deployment scheme and share practical lessons of our retrieval system to contribute to the community.

Summary

The paper introduces MGDSPR, a dual-tower model that leverages softmax cross-entropy loss to align training with inference for enhanced product search.
It employs multi-grained semantic pooling in the user tower and a simplified item representation to improve relevance and mitigate noisy training data.
Offline experiments show a 2.5% improvement in Recall@1000 and a 13.3% increase in the good rate over a strong baseline, and online A/B tests show gains in GMV in Taobao Search.

Embedding-based Product Retrieval in Taobao Search: An Academic Summary

In "Embedding-based Product Retrieval in Taobao Search," researchers from Alibaba Group describe an embedding-based retrieval (EBR) system built for product search on an e-commerce platform. The study addresses challenges faced by Taobao Search, such as keeping retrieved products relevant to the query while scaling retrieval to a very large product corpus.

The researchers introduce a model named Multi-Grained Deep Semantic Product Retrieval (MGDSPR). It targets two limitations observed in the existing EBR system: low relevance to the query and a discrepancy between the training and inference phases. Whereas other approaches often rely on hinge loss, which compares a positive item only against individual negatives, MGDSPR is trained with softmax cross-entropy loss, which scores the positive item against a whole set of candidates at once. This gives the model the global comparison ability it needs at inference time, when a query is matched against the full corpus, and it also leads to faster convergence.
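The contrast between the two objectives can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes inner-product scoring and a small explicit candidate list, and the function names (`softmax_cross_entropy`, `pairwise_hinge`) and the margin value are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax_cross_entropy(query_vec, item_vecs, positive_index):
    """Negative log-probability of the positive item among ALL candidates.

    Every candidate contributes to the normalizer, so the positive item is
    compared against the whole candidate set at once.
    """
    scores = [dot(query_vec, v) for v in item_vecs]
    max_score = max(scores)  # subtract the max for numerical stability
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_norm - scores[positive_index]


def pairwise_hinge(query_vec, pos_vec, neg_vec, margin=1.0):
    """Hinge loss on a single (positive, negative) pair: a local comparison."""
    return max(0.0, margin - dot(query_vec, pos_vec) + dot(query_vec, neg_vec))


if __name__ == "__main__":
    query = [1.0, 0.0]
    items = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy embeddings
    print(softmax_cross_entropy(query, items, positive_index=0))
    print(pairwise_hinge(query, items[0], items[1]))
```

The hinge loss becomes zero as soon as one pair is separated by the margin, whereas the softmax loss keeps accounting for every candidate in the set, which is closer to how a query is ranked against many items at inference time.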

Model Architecture and Components

MGDSPR is a two-tower model with a user tower and an item tower. The user tower contains a Multi-Grained Semantic unit that represents the search query at several granularities, which compensates for the limited semantic information in short queries. The unit builds these representations using pooling strategies, historical data, and attention mechanisms.
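As a rough illustration of how pooling and attention could be combined in such a unit, here is a minimal Python sketch. It is an assumption-laden toy, not the paper's architecture: the choice of unigram and bigram granularities, the dot-product attention over history vectors, the final concatenation, and all function names are hypothetical.

```python
import math


def ngrams(tokens, n):
    """All contiguous n-grams of a token list (one 'granularity' of the query)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mean_pool(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def attention_pool(query_vec, history_vecs):
    """Weighted sum of history vectors, weighted by softmax of dot products."""
    scores = [sum(q * h for q, h in zip(query_vec, hv)) for hv in history_vecs]
    max_score = max(scores)
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(history_vecs[0])
    return [sum(w * hv[d] for w, hv in zip(weights, history_vecs)) for d in range(dim)]


def multi_grained_representation(unigram_vecs, bigram_vecs, history_vecs):
    """Concatenate pooled query views with a query-attended history summary."""
    unigram_view = mean_pool(unigram_vecs)
    bigram_view = mean_pool(bigram_vecs)
    history_view = attention_pool(unigram_view, history_vecs)
    return unigram_view + bigram_view + history_view
```

In this sketch each granularity yields one fixed-size vector regardless of query length, which is the main reason pooling is a natural fit for short, variable-length queries.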

The item tower builds a compact representation of each product, primarily from its item ID and title. The representation is deliberately simple: the authors observe empirically that complex sequence models such as LSTMs are less effective on short, keyword-stacked titles.

Two innovative strategies are presented to enhance product relevance in retrievals:

Smoothing Noisy Training Data: By incorporating a temperature parameter into the softmax function, the model reduces overfitting to noisy click data, thus increasing relevance.
Generating Relevance-improving Hard Negative Samples: This method interpolates between positive examples and hard-negative samples in the embedding space, creating challenging scenarios that refine the model’s discriminative power without extra labeling or training costs.
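Both strategies can be sketched in a few lines of Python. The sketch below is illustrative only and makes several assumptions that the summary does not confirm: hard negatives are picked as the negatives with the highest inner product with the query, the interpolation is linear with a mixing weight `alpha`, and all function names are hypothetical.

```python
import math


def softmax_with_temperature(scores, temperature=1.0):
    """Softmax over scores divided by a temperature.

    A temperature above 1 flattens the distribution, so a single (possibly
    noisy) clicked item is fit less sharply.
    """
    scaled = [s / temperature for s in scores]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hardest_negatives(query_vec, negative_vecs, n):
    """Assumed selection rule: the n negatives scoring highest for the query."""
    def score(vec):
        return sum(q * v for q, v in zip(query_vec, vec))
    return sorted(negative_vecs, key=score, reverse=True)[:n]


def mix_hard_negative(pos_vec, hard_neg_vec, alpha):
    """Linear interpolation in embedding space.

    alpha = 1.0 returns the positive, alpha = 0.0 returns the hard negative;
    values in between give a synthetic negative that sits closer to the positive.
    """
    return [alpha * p + (1.0 - alpha) * n for p, n in zip(pos_vec, hard_neg_vec)]
```

Because the mixed vectors are built from embeddings already computed in the batch, this kind of negative generation needs no additional labels, which matches the summary's point about avoiding extra labeling or training costs.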

Deployment and System Integration

MGDSPR is deployed in Taobao’s production environment as part of a multi-channel retrieval system that handles millions of items. It is paired with approximate nearest neighbor (ANN) search so that retrieval remains efficient at this scale. In addition, a relevance control module enforces exact term matching on the retrieved results, complementing the fuzzy matching of embedding retrieval with the precision needed for a good user experience.
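The interplay between embedding retrieval and relevance control can be sketched as a two-step pipeline. The Python below is a simplified stand-in, not the production design: an exhaustive inner-product scan replaces the ANN index, the relevance control is reduced to requiring that a set of query terms appear in the item title, and the item dictionary layout and function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_by_inner_product(query_vec, items, k):
    """Exhaustive top-k search; a real system would use an ANN index instead."""
    ranked = sorted(items, key=lambda item: dot(query_vec, item["vec"]), reverse=True)
    return ranked[:k]


def relevance_control(candidates, required_terms):
    """Keep only candidates whose title contains every required query term."""
    return [c for c in candidates if required_terms <= c["title_terms"]]


def retrieve(query_vec, required_terms, items, k):
    """Embedding retrieval followed by exact term filtering; returns item ids."""
    candidates = top_k_by_inner_product(query_vec, items, k)
    return [c["id"] for c in relevance_control(candidates, required_terms)]
```

Filtering after retrieval means fewer than `k` items may be returned; in this sketch that is the price of guaranteeing that every surviving item matches the required terms.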

Experimental Insights and Implications

MGDSPR is evaluated in offline experiments on large-scale datasets and in online A/B tests. Offline, it outperforms a strong baseline (a variant of the DNN proposed by Covington et al.) on both recall and relevance, with a 2.5% improvement in Recall@1000 and a 13.3% improvement in the good rate of retrieved items. Online A/B tests further show improvements in Gross Merchandise Volume (GMV) and in the number of relevant transaction items displayed to users.
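For readers unfamiliar with the recall metric, the following Python sketch shows one common way to compute Recall@K for a single query. It assumes the target set is the set of items the user actually interacted with; the exact definition used in the paper may differ, and the function name is hypothetical.

```python
def recall_at_k(retrieved_ids, target_ids, k):
    """Fraction of target items that appear among the top-k retrieved items."""
    if not target_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for target in target_ids if target in top_k)
    return hits / len(target_ids)


def mean_recall_at_k(results, k):
    """Average Recall@K over (retrieved_ids, target_ids) pairs, one per query."""
    if not results:
        return 0.0
    return sum(recall_at_k(r, t, k) for r, t in results) / len(results)
```

With K = 1000, as in Recall@1000, the metric measures whether the retrieval stage surfaces the target items at all, leaving their final ordering to later ranking stages.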

Implications and Future Work

MGDSPR’s design and deployment show that semantic depth and operational efficiency can be combined in a personalized product retrieval system. Future research could extend the framework to finer-grained user behavior patterns, for example by using real-time context or by combining it with more advanced user understanding models to further improve personalization and relevance.

In conclusion, by addressing practical challenges in e-commerce search, this research improves the usefulness of EBR systems in production and contributes new methods and deployment lessons to the information retrieval community.