How to Generate a Good Word Embedding? (1507.05523v1)

Published 20 Jul 2015 in cs.CL

Abstract: We analyze three critical components of word embedding training: the model, the corpus, and the training parameters. We systematize existing neural-network-based word embedding algorithms and compare them using the same corpus. We evaluate each word embedding in three ways: analyzing its semantic properties, using it as a feature for supervised tasks, and using it to initialize neural networks. We also provide several simple guidelines for training word embeddings. First, we discover that corpus domain is more important than corpus size; we recommend choosing a corpus in a suitable domain for the desired task, after which a larger corpus yields better results. Second, we find that faster models provide sufficient performance in most cases, and more complex models can be used if the training corpus is sufficiently large. Third, the early stopping metric for iterating should rely on the development set of the desired task rather than the validation loss of the embedding training.

Citations (340)

Summary

  • The paper shows that models predicting target words achieve superior semantic performance compared to joint context scoring approaches.
  • It demonstrates that selecting a domain-specific corpus is more impactful than merely increasing corpus size for effective embeddings.
  • The analysis emphasizes that well-chosen training parameters, such as early stopping based on development-set performance and task-appropriate dimensionality, enhance embedding quality.

Analysis of Word Embedding Generation: Models, Corpus, and Training Parameters

The paper provides a comprehensive exploration of how word embeddings are generated, concentrating on the critical components that affect their training: the model architecture, the choice of corpus, and the training parameters. This analysis fills a gap in the field by offering a systematic comparison of different word embedding models under consistent evaluation conditions on a common corpus.

Model Evaluation and Insights

The paper compares several widely known models for generating word embeddings, including the SCBOW, Order, LBL, NNLM, and C&W approaches. These models were assessed on their ability to capture semantic properties, their efficacy as features in supervised tasks, and their utility as initializations for neural networks. A key observation concerns the trade-off between model complexity and training corpus size: simpler models perform sufficiently well on smaller corpora, while more complex models require larger corpora to demonstrate their full potential.

Moreover, the analysis reveals that models predicting target words tend to outperform those that score the joint representation of a target word and its context, particularly in tasks evaluating semantic properties. For instance, models that rely on the distributional hypothesis to predict target words outperform the C&W model, which places the target word alongside its context in the input layer and produces only a ranking score.
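
To make this architectural distinction concrete, the following is a minimal PyTorch sketch (illustrative only, not the paper's implementation; class names and dimensions are assumptions) contrasting a CBOW-style model, which predicts the target word from its averaged context, with a C&W-style model, which embeds the target word in the input window and outputs a single ranking score.

```python
import torch
import torch.nn as nn

class PredictTarget(nn.Module):
    """CBOW-style: average the context embeddings, then predict the target word."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size)   # distribution over the vocabulary

    def forward(self, context_ids):              # context_ids: (batch, window)
        h = self.emb(context_ids).mean(dim=1)    # (batch, dim)
        return self.out(h)                       # logits for the target word

class JointScore(nn.Module):
    """C&W-style: the target word sits in the input layer next to its context;
    the network emits one score, trained to rank true windows above windows
    whose center word has been corrupted."""
    def __init__(self, vocab_size, dim, window):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.score = nn.Sequential(
            nn.Linear(window * dim, dim),
            nn.Tanh(),
            nn.Linear(dim, 1),
        )

    def forward(self, window_ids):                # window_ids: (batch, window), target included
        h = self.emb(window_ids).flatten(1)       # concatenate, rather than average
        return self.score(h)                      # (batch, 1) plausibility score
```

In the first family the output layer defines a distribution over target words, which the paper associates with stronger semantic properties; the second family only learns to rank genuine windows above corrupted ones.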

Corpus Selection and Impact

One of the paper's significant conclusions is the importance of domain-specific corpus selection. While larger corpora generally yield better results, selecting a corpus from a domain that aligns closely with the target task is crucial: the paper finds that corpus domain matters more than corpus size alone. Consequently, a larger corpus helps only after the most appropriate domain has been chosen for the intended task. For example, the paper notes that semantic tasks benefit considerably more from an information-rich source such as Wikipedia than from the narrower news text of the New York Times.
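
As a practical illustration of that guideline, here is a minimal sketch using gensim's Word2Vec (the corpus file, preprocessing, and parameter values are assumptions for illustration, not the paper's setup):

```python
from gensim.models import Word2Vec

def load_sentences(path):
    """Yield tokenized sentences; the file name below is hypothetical."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.lower().split()

# Per the paper's guideline: pick the in-domain corpus first, and only then
# grow it, rather than defaulting to the largest corpus available.
in_domain = list(load_sentences("in_domain_corpus.txt"))  # hypothetical file

model = Word2Vec(
    sentences=in_domain,
    vector_size=100,  # modest dimensionality suffices when used as features
    window=5,
    min_count=5,
    sg=0,             # CBOW: a fast model is adequate in most cases
    epochs=5,
)
model.wv.save("in_domain.kv")  # persist the vectors for downstream tasks
```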

Training Parameters Consideration

The authors also delve into the practicalities of training parameters, emphasizing stopping criteria and dimensionality choices. They argue convincingly against using the embedding model's own validation loss as an early stopping metric, proposing instead the performance on a development set for the target task as a more reliable signal for halting training. The discussion of dimensionality is equally instructive: larger dimensions are advantageous for semantic evaluations, while a lower dimensionality suffices for tasks where embeddings serve as features or initializations.
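
A minimal sketch of that stopping rule (the `train_one_epoch`, `state`, and `evaluate` methods are hypothetical placeholders, not a real library API): checkpoint on the downstream dev-set score rather than on the embedding model's own validation loss.

```python
def train_with_task_early_stopping(model, corpus, dev_task,
                                   max_epochs=50, patience=3):
    """Stop when the downstream task's dev score stalls, not when the
    embedding model's own validation loss does."""
    best_score, best_state, stale = float("-inf"), None, 0
    for epoch in range(max_epochs):
        model.train_one_epoch(corpus)                 # hypothetical training step
        score = dev_task.evaluate(model.embeddings)   # e.g. dev-set accuracy
        if score > best_score:
            best_score, best_state, stale = score, model.state(), 0
        else:
            stale += 1
            if stale >= patience:                     # dev performance has plateaued
                break
    return best_state, best_score
```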

Broader Implications and Future Directions

This research carries both practical and theoretical implications. For practitioners, the guidance on matching model complexity to corpus characteristics and task requirements is directly actionable. For theorists, the findings about the effect of corpus domain on embedding quality invite further inquiry into the interaction between data characteristics and model performance.

Looking toward future developments in AI, understanding and refining the interplay between model design, data selection, and training methodology remains essential for advancing natural language processing. As the landscape of computational models continues to evolve, the principles detailed in this paper are likely to remain foundational guidance for future approaches to word embedding generation.