
Doc2Token: Bridging Vocabulary Gap by Predicting Missing Tokens for E-commerce Search (2406.19647v1)

Published 28 Jun 2024 in cs.IR

Abstract: Addressing the "vocabulary mismatch" issue in information retrieval is a central challenge for e-commerce search engines, because product pages often miss important keywords that customers search for. Doc2Query[1] is a popular document-expansion technique that predicts search queries for a document and includes the predicted queries with the document for retrieval. However, this approach can be inefficient for e-commerce search, because the predicted query tokens are often already present in the document. In this paper, we propose Doc2Token, a technique that predicts relevant tokens (instead of queries) that are missing from the document and includes these tokens in the document for retrieval. For the task of predicting missing tokens, we introduce a new metric, "novel ROUGE score". Doc2Token is demonstrated to be superior to Doc2Query in terms of novel ROUGE score and diversity of predictions. Doc2Token also exhibits efficiency gains by reducing both training and inference times. We deployed the feature to production and observed significant revenue gain in an online A/B test, and launched the feature to full traffic on Walmart.com. [1] R. Nogueira, W. Yang, J. Lin, K. Cho, Document expansion by query prediction, arXiv preprint arXiv:1904.08375 (2019)


Summary

  • The paper introduces the Doc2Token technique, employing a T5-based seq2seq model to predict novel tokens and address vocabulary gaps in e-commerce search.
  • It demonstrates superior nROUGE scores and efficiency gains over previous methods by reducing redundancy in token prediction.
  • Real-world deployment on Walmart.com yielded a revenue lift of 0.28%, showcasing its practical impact on improving search performance.

Overview of "Doc2Token: Bridging Vocabulary Gap by Predicting Missing Tokens for E-commerce Search"

The paper "Doc2Token: Bridging Vocabulary Gap by Predicting Missing Tokens for E-commerce Search" explores a novel approach to mitigating the vocabulary mismatch that hampers e-commerce search. This gap arises from discrepancies between the terms customers type into the search box and the keywords sellers include in product descriptions. The misalignment often yields suboptimal results: relevant products are never retrieved for the terms customers actually use.
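The mismatch can be made concrete with a toy exact-match check (an illustration, not code from the paper): a product described as "running shoes" is simply invisible to a lexical engine when the customer searches for "sneakers".

```python
def lexical_match(query: str, document: str) -> bool:
    """True if every query token appears in the document (exact-match retrieval)."""
    doc_tokens = set(document.lower().split())
    return all(tok in doc_tokens for tok in query.lower().split())

product = "lightweight running shoes with breathable mesh upper"
print(lexical_match("running shoes", product))  # True: both tokens present
print(lexical_match("sneakers", product))       # False: vocabulary mismatch
```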

Lexical vs. Semantic Retrieval Approaches

The authors situate their work within the broader field of information retrieval, where both query expansion and document expansion have been extensively explored. While dense, semantics-based retrieval models have made significant strides in capturing the underlying meaning of queries and documents, lexical retrieval remains an essential component because of its interpretability and scalability. Unlike prior work such as Doc2Query, which predicts potential search queries for a document, the authors propose Doc2Token, which predicts only the relevant tokens that are missing from the document and adds them to it for retrieval.
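The contrast between the two expansion strategies can be sketched as follows. The predicted queries and tokens here are placeholder strings; in the paper they come from a fine-tuned T5 model, and the exact filtering and indexing details follow the paper rather than this simplification.

```python
def expand_doc2query(document: str, predicted_queries: list[str]) -> str:
    # Doc2Query: append whole predicted queries, even when their tokens
    # already appear in the document (redundant for lexical matching).
    return document + " " + " ".join(predicted_queries)

def expand_doc2token(document: str, predicted_tokens: list[str]) -> str:
    # Doc2Token: append only tokens not already present in the document.
    doc_tokens = set(document.lower().split())
    novel = [t for t in predicted_tokens if t.lower() not in doc_tokens]
    return document + " " + " ".join(novel)

doc = "lightweight running shoes breathable mesh"
print(expand_doc2query(doc, ["running shoes for men", "breathable sneakers"]))
print(expand_doc2token(doc, ["sneakers", "running", "trainers"]))
# only "sneakers" and "trainers" are appended; "running" is already in the doc
```

The second function makes the efficiency argument visible: nothing redundant reaches the index, so the expanded field stays short.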

Methodology and Novel Contributions

The central innovation of Doc2Token lies in its precise focus on identifying and predicting "novel tokens" that are not already present in a product's metadata. The authors employ a seq2seq generative model, specifically leveraging the T5 transformer model, to predict these tokens efficiently. Key contributions include:

  1. Doc2Token Technique: The proposed method improves on Doc2Query by exclusively predicting novel tokens rather than potentially redundant queries. This targeted approach reduces inefficiencies in both training and inference phases.
  2. Novel ROUGE Score: A new metric, the "novel ROUGE score" (nROUGE), is introduced to evaluate the effectiveness of predicting novel tokens. It adapts the traditional ROUGE score to account only for the novel aspects, thus providing a more accurate measure of performance in this task.
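A minimal unigram reading of the nROUGE idea is sketched below, under the assumption that the reference set is the set of query tokens absent from the document; the paper's exact formulation (tokenization, stemming, n-gram variants) should be taken from the paper itself.

```python
def novel_rouge_1(predicted: list[str], query: list[str], doc: list[str]):
    """Unigram precision/recall/F1 against *novel* reference tokens:
    query tokens that do not appear in the document."""
    doc_set = set(doc)
    reference = {t for t in query if t not in doc_set}  # novel tokens only
    pred_set = set(predicted)
    overlap = len(pred_set & reference)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

doc = "lightweight running shoes breathable mesh".split()
query = "mens sneakers running".split()
# reference novel tokens = {"mens", "sneakers"}; "running" is in the doc
p, r, f = novel_rouge_1(["sneakers", "trainers"], query, doc)
print(p, r, f)  # 0.5 0.5 0.5
```

Restricting the reference set to novel tokens is what keeps the metric from rewarding predictions the document already covers.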

Experimental Results and Analysis

The authors conducted extensive experiments to validate the proposed method. Key findings include:

  • Superior nROUGE Scores: Doc2Token significantly outperforms Doc2Query in terms of nROUGE precision, recall, and F1 scores, demonstrating its effectiveness in predicting diverse and relevant tokens.
  • Efficiency Gains: The method shows notable improvements in reducing training and inference times as compared to Doc2Query. This efficiency is attributed to the reduced data redundancy and streamlined model design focused on novel token prediction.
  • Online Deployment and Revenue Impact: The method was deployed in production, leading to a statistically significant revenue lift of 0.28% in an A/B test on Walmart.com, underscoring its practical value.

Implications and Future Directions

The results indicate that focusing on token-level rather than query-level expansion can markedly improve search relevance in e-commerce settings. This has practical significance: surfacing more relevant products during search directly improves customer engagement and satisfaction.

From a theoretical standpoint, the paper contributes to the evolving understanding of how generative models can be applied in information retrieval tasks beyond traditional query or document expansion. The introduction of the nROUGE metric also sets a foundation for future research to evaluate the novelty and relevance of generated content more precisely.

Looking ahead, opportunities for further research include:

  • Enhancement with LLMs: Incorporating more advanced LLMs that exhibit greater contextual understanding and linguistic capabilities could further enhance the precision and recall of novel tokens.
  • Cross-lingual Capabilities: Expanding the model to handle multilingual search queries and product descriptions better, thereby improving accessibility and user experience across diverse linguistic markets.
  • Adaptive Learning Methods: Implementing adaptive learning techniques to continuously improve the model based on real-time search data and evolving customer behavior patterns.

Conclusion

The Doc2Token method represents a significant step forward in addressing the vocabulary gap in e-commerce search engines. By zeroing in on novel token prediction, it ensures more efficient and effective document expansion, ultimately driving better search results and higher customer engagement. The promising results from both offline evaluations and real-world deployment highlight its potential for broader adoption within the industry, and it opens new vistas for leveraging generative models in practical information retrieval contexts.

References

Li, K., Lin, J., & Lee, T. (2022). Doc2Token: Bridging Vocabulary Gap by Predicting Missing Tokens for E-commerce Search. eCom’24: ACM SIGIR Workshop on eCommerce, July 18, 2024, Washington, D.C., USA.

Nogueira, R., Yang, W., Lin, J., & Cho, K. (2019). Document expansion by query prediction. arXiv preprint arXiv:1904.08375.

Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out.

Raffel, C., Shazeer, N., Roberts, A., et al. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research.

Shahi, G., Navin, R., & Roy, B. (2016). Apache Lucene-based full-text search architecture for the unified information management of personnel records. Merit Research Journal of Information & Computer Science and Technology.

Apache Lucene. (n.d.). Apache Lucene - Overview. Retrieved from https://lucene.apache.org/core/
