An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation (1607.05368v1)

Published 19 Jul 2016 in cs.CL

Abstract: Recently, Le and Mikolov (2014) proposed doc2vec as an extension to word2vec (Mikolov et al., 2013a) to learn document-level embeddings. Despite promising results in the original paper, others have struggled to reproduce those results. This paper presents a rigorous empirical evaluation of doc2vec over two tasks. We compare doc2vec to two baselines and two state-of-the-art document embedding methodologies. We found that doc2vec performs robustly when using models trained on large external corpora, and can be further improved by using pre-trained word embeddings. We also provide recommendations on hyper-parameter settings for general purpose applications, and release source code to induce document embeddings using our trained doc2vec models.

Citations (629)

Summary

  • The paper demonstrates that the simpler dbow variant consistently outperforms dmpv in document embedding tasks.
  • The paper reveals that doc2vec, when trained on large external corpora and with pre-trained embeddings, outperforms baseline models like word2vec-averaging.
  • The paper highlights that meticulous hyper-parameter tuning, especially of the sub-sampling threshold, significantly improves model performance.

An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation

The paper by Lau and Baldwin presents a critical empirical evaluation of the doc2vec model, which was introduced as an extension of the word2vec algorithm to generate document-level embeddings. The authors address a significant challenge: the difficulty others have had in reproducing the encouraging results reported in the original doc2vec paper.

Study Objectives and Methodology

The authors set out to clarify several key aspects of doc2vec: its effectiveness across diverse tasks, the relative performance of its two variants (dbow and dmpv), and the benefits of hyper-parameter optimization and pre-trained word embeddings. Their evaluation spans two NLP tasks: duplicate question detection in web forums (Q-Dup) and Semantic Textual Similarity (STS).
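Both tasks ultimately reduce to scoring pairs of documents by the similarity of their embeddings. A minimal sketch of that comparison step, using cosine similarity over toy vectors (in practice the vectors would come from a trained doc2vec model; the values below are illustrative, not from the paper):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy document embeddings standing in for inferred doc2vec vectors.
doc_a = [0.2, 0.7, -0.1]
doc_b = [0.25, 0.6, -0.05]
print(cosine_similarity(doc_a, doc_b))  # close to 1.0 for similar documents
```

For Q-Dup, pairs above a similarity threshold would be flagged as duplicates; for STS, the raw similarity score is correlated against human judgments.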

Key Findings

The paper reveals the following insights:

  1. Model Performance: dbow consistently outperforms dmpv across most tasks, challenging prior claims about the superiority of dmpv. This is evident despite dbow being a simpler model that disregards word order.
  2. Embedding Efficacy: doc2vec, especially when trained on large external corpora, surpasses simpler baseline methods such as word2vec-averaging and n-gram models. The model is notably effective with longer documents.
  3. Hyper-parameter Relevance: The sensitivity of doc2vec to hyper-parameter settings, particularly the sub-sampling threshold, was highlighted. The paper provides optimized settings for varying tasks, significantly enhancing model performance.
  4. External Corpora Training: doc2vec maintains strong performance with external corpora such as Wikipedia and AP News, showing its adaptability as an off-the-shelf model.
  5. Pre-trained Word Embeddings: Incorporating pre-trained embeddings within dbow enhances document representation, potentially accelerating model convergence and improving performance.

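The sub-sampling threshold singled out in finding 3 controls how aggressively very frequent words are dropped during training. A small sketch of the standard word2vec-style formulation, P(discard) = 1 - sqrt(t/f), where f is a word's relative corpus frequency and t is the threshold (the exact formula used internally by a given implementation may differ slightly; this is the version from the original word2vec description):

```python
import math

def discard_probability(word_freq, threshold=1e-5):
    """Probability of dropping a token during training under
    frequent-word sub-sampling: p = 1 - sqrt(t / f), floored at 0.
    word_freq is the word's relative frequency in the corpus."""
    if word_freq <= threshold:
        return 0.0
    return 1.0 - math.sqrt(threshold / word_freq)

# A very frequent word (~5% of all tokens, like "the") is almost always
# dropped, while a rare word is always kept.
print(discard_probability(0.05))  # close to 1
print(discard_probability(1e-6))  # 0.0
```

Because a lower threshold discards more of the high-frequency words, tuning it changes the effective training data substantially, which is consistent with the paper's observation that this setting matters more than most others.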
Implications and Future Directions

The findings underscore the robustness and versatility of doc2vec in learning meaningful document representations. The empirical evidence challenges previous assumptions about its limitations compared to alternative methods like skip-thought and paragram-phrase embeddings. Furthermore, the improvement observed with pre-trained embeddings could inspire further exploration into hybrid training techniques for more refined embeddings.

From a theoretical perspective, these insights contribute to a deeper understanding of embedding models' behavior, particularly the importance of initialization and data scale. Practically, the release of source code and trained models by the authors facilitates broader adoption and benchmarking consistency.

Future research may explore extending doc2vec's applicability to other languages and domains, as well as experimenting with ensemble methods that combine multiple embedding strategies to further enhance document representation capabilities. The interplay between embedding approaches and end-task architectures also presents fertile ground for investigation, potentially leading to more contextually aware and dynamic representations in the evolving field of natural language processing.