- The paper introduces the Doc2Token technique, employing a T5-based seq2seq model to predict novel tokens and address vocabulary gaps in e-commerce search.
- It demonstrates higher novel ROUGE (nROUGE) scores and shorter training and inference times than the prior Doc2Query approach by removing redundancy from the prediction targets.
- Real-world deployment on Walmart.com yielded a revenue lift of 0.28%, showcasing its practical impact on improving search performance.
Overview of "Doc2Token: Bridging Vocabulary Gap by Predicting Missing Tokens for E-commerce Search"
The paper "Doc2Token: Bridging Vocabulary Gap by Predicting Missing Tokens for E-commerce Search" explores a novel approach to mitigating the vocabulary mismatch issue that hampers effective e-commerce search. This vocabulary gap arises due to the inherent discrepancies between the terms customers use in their search queries and the keywords embedded in product descriptions by sellers. This misalignment often results in suboptimal search outcomes, where relevant products remain unindexed under the search terms used by customers.
Lexical vs. Semantic Retrieval Approaches
The authors contextualize their work within the broader field of information retrieval, where query expansion and document expansion techniques have been extensively explored. While dense (semantic) retrieval models have made significant strides in capturing the underlying context and meaning of queries and documents, lexical retrieval remains an essential component because of its interpretability and scalability. Unlike previous work such as Doc2Query, which predicts potential queries for each document, the authors propose an alternative called Doc2Token: rather than generating full queries, it predicts only the missing tokens relevant to a document and uses them to improve retrieval.
Methodology and Novel Contributions
The central innovation of Doc2Token lies in its precise focus on identifying and predicting "novel tokens" that are not already present in a product's metadata. The authors employ a seq2seq generative model, specifically leveraging the T5 transformer model, to predict these tokens efficiently. Key contributions include:
- Doc2Token Technique: The proposed method improves on Doc2Query by exclusively predicting novel tokens rather than potentially redundant queries. This targeted approach reduces inefficiencies in both training and inference phases.
- Novel ROUGE Score: A new metric, the "novel ROUGE score" (nROUGE), is introduced to evaluate how well the novel tokens are predicted. It adapts the traditional ROUGE score so that only novel tokens, those absent from the product's existing text, are scored, giving a more faithful measure of performance on this task (see the sketch after this list).
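As a concrete illustration of both ideas, the following is a minimal Python sketch of how novel-token training targets and a unigram nROUGE score could be computed. The tokenizer, the deduplication, and the example texts are assumptions made for illustration, not the paper's exact implementation.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed normalization)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def novel_tokens(query, product_text):
    """Tokens of an engaged query that are absent from the product's own text.
    These deduplicated tokens form the seq2seq target that Doc2Token learns to predict."""
    doc_vocab = set(tokenize(product_text))
    seen, targets = set(), []
    for tok in tokenize(query):
        if tok not in doc_vocab and tok not in seen:
            seen.add(tok)
            targets.append(tok)
    return targets

def nrouge(predicted, reference_novel):
    """Unigram precision, recall, and F1 computed only over novel tokens."""
    pred, ref = Counter(predicted), Counter(reference_novel)
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: "lightweight" already appears in the product text, so only "stroller" is novel.
product = "Graco lightweight baby pram with reclining seat"
query = "lightweight stroller"
target = novel_tokens(query, product)                 # ['stroller']
print(nrouge(["stroller", "pushchair"], target))      # precision 0.5, recall 1.0, f1 ~0.67
```

Presumably the targets gathered from all of a product's engaged queries would be pooled and deduplicated before training, which is consistent with the redundancy reduction the authors report.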
Experimental Results and Analysis
The authors conducted extensive experiments to validate the proposed method. Key findings include:
- Superior nROUGE Scores: Doc2Token significantly outperforms Doc2Query in terms of nROUGE precision, recall, and F1 scores, demonstrating its effectiveness in predicting diverse and relevant tokens.
- Efficiency Gains: The method shows notable improvements in reducing training and inference times as compared to Doc2Query. This efficiency is attributed to the reduced data redundancy and streamlined model design focused on novel token prediction.
- Online Deployment and Revenue Impact: The method was deployed in production, leading to a statistically significant revenue lift of 0.28% in an A/B test on Walmart.com, underscoring its practical value (a sketch of how predicted tokens could feed the indexed product text follows this list).
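To illustrate how such a model might plug into indexing, here is a minimal sketch using the Hugging Face transformers T5 API. The checkpoint name (`t5-base` stands in for a fine-tuned model), the "predict missing tokens:" prompt, and the decoding settings are all assumptions rather than the production configuration described in the paper.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# "t5-base" is a placeholder; in practice this would be a checkpoint fine-tuned on
# (product text -> novel tokens) pairs such as those constructed in the earlier sketch.
MODEL_NAME = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def predict_novel_tokens(product_text, max_new_tokens=16):
    """Generate candidate missing tokens for one product (assumed prompt format)."""
    inputs = tokenizer(
        "predict missing tokens: " + product_text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # Keep only tokens that are genuinely novel relative to the product text.
    existing = set(product_text.lower().split())
    return [tok for tok in decoded.lower().split() if tok not in existing]

def expand_document(product_text):
    """Append predicted novel tokens to the text that is indexed for lexical retrieval."""
    return product_text + " " + " ".join(predict_novel_tokens(product_text))

print(expand_document("Graco lightweight baby pram with reclining seat"))
```

In a deployment like the one reported, expansion would typically run offline at indexing time, so the predicted tokens add no query-time latency; only the indexed text grows by a handful of tokens per product.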
Implications and Future Directions
The results indicate that focusing on token-level rather than query-level expansion can measurably improve search relevance in e-commerce settings. This has practical significance because it directly affects customer engagement and satisfaction by surfacing more relevant products during searches.
From a theoretical standpoint, the paper contributes to the evolving understanding of how generative models can be applied in information retrieval tasks beyond traditional query or document expansion. The introduction of the nROUGE metric also sets a foundation for future research to evaluate the novelty and relevance of generated content more precisely.
Looking ahead, opportunities for further research include:
- Enhancement with LLMs: Incorporating more advanced LLMs with stronger contextual understanding and linguistic capabilities could further improve the precision and recall of novel-token prediction.
- Cross-lingual Capabilities: Expanding the model to handle multilingual search queries and product descriptions better, thereby improving accessibility and user experience across diverse linguistic markets.
- Adaptive Learning Methods: Implementing adaptive learning techniques to continuously improve the model based on real-time search data and evolving customer behavior patterns.
Conclusion
The Doc2Token method represents a significant step forward in addressing the vocabulary gap in e-commerce search engines. By zeroing in on novel token prediction, it ensures more efficient and effective document expansion, ultimately driving better search results and higher customer engagement. The promising results from both offline evaluations and real-world deployment highlight its potential for broader adoption within the industry, and it opens new vistas for leveraging generative models in practical information retrieval contexts.
References
Li, K., Lin, J., & Lee, T. (2024). Doc2Token: Bridging Vocabulary Gap by Predicting Missing Tokens for E-commerce Search. eCom’24: ACM SIGIR Workshop on eCommerce, July 18, 2024, Washington, D.C., USA.
Nogueira, R., Yang, W., Cho, K., & Lin, J. (2019). Document expansion by query prediction. arXiv preprint arXiv:1904.08375.
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out.
Raffel, C., Shazeer, N., Roberts, A., et al. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research.
Shahi, G., Navin, R., & Roy, B. (2016). Apache Lucene-based full-text search architecture for the unified information management of personnel records. Merit Research Journal of Information & Computer Science and Technology.
Apache Software Foundation. (n.d.). Apache Lucene - Overview. Retrieved from https://lucene.apache.org/core/