- The paper reviews prediction-based and count-based models, detailing their methodologies and key differences.
- The paper demonstrates how word embeddings capture both syntactic and semantic relationships through numerical vectors.
- The paper discusses the impact on NLP tasks and explores future directions like handling rare words and developing multilingual embeddings.
Understanding Word Embeddings in NLP
Introduction
Representing words in a way that computers can efficiently use is crucial for NLP. One effective approach is to use vectors, which let us apply mathematical operations and plug word representations into various Machine Learning (ML) techniques. Words represented as vectors are not just numbers: they encode syntactic and semantic information, enabling applications such as sentiment analysis and language modeling.
In this article, we'll explore the strategies for building word embeddings, categorizing them into prediction-based and count-based models.
The Vector Space Model and Statistical Language Modeling
The Vector Space Model
The Vector Space Model (VSM) transformed the way we handle text data. Introduced by Gerard Salton, it represents documents as vectors, where each dimension corresponds to a term. This method supports operations like calculating similarities between documents using measures such as the inner product.
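As an illustration, here is a minimal VSM sketch, assuming a tiny made-up vocabulary and raw term counts; real systems typically weight terms, for example with tf-idf:

```python
# A minimal VSM sketch: two documents over the toy vocabulary ["cat", "dog", "mat"].
# The counts are illustrative; real systems usually weight terms (e.g., tf-idf).
import numpy as np

doc_a = np.array([2.0, 0.0, 1.0])   # "cat" x2, "mat" x1
doc_b = np.array([1.0, 1.0, 0.0])   # "cat" x1, "dog" x1

inner = doc_a @ doc_b                                             # inner-product similarity
cosine = inner / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))  # length-normalized similarity
print(inner, round(cosine, 3))
```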
Statistical Language Modeling
Language models predict the probability of a word given the preceding words, a task essential for applications like speech recognition. These models often rely on n-grams, sequences of n words, to predict the next word. Although building full probabilistic models over large vocabularies is computationally intensive, techniques such as smoothing and recurrent neural networks (RNNs) have improved efficiency and accuracy.
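To make the n-gram idea concrete, here is a hedged sketch of a bigram model estimated by maximum likelihood from a toy corpus; the corpus and tokenization are purely illustrative, and no smoothing is applied:

```python
# Bigram language model sketch: maximum-likelihood estimates from a toy corpus.
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
unigrams = Counter(tokens)                     # counts of single words
bigrams = Counter(zip(tokens, tokens[1:]))     # counts of adjacent word pairs

def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "cat"))   # 2/3 in this toy corpus
```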
Types of Word Embeddings
Word embeddings are fixed-length vectors representing words. We categorize them into two main types based on the methods used to generate them: prediction-based models and count-based models.
Prediction-based Models
These models derive embeddings from neural network LLMs, leveraging word context to predict the next word. Here's a brief history of some key advancements:
- Bengio et al. (2003): Introduced embeddings as a by-product of Neural Network Language Models (NNLMs).
- Collobert and Weston (2008): Focused on deriving embeddings by training a model on unsupervised and supervised data.
- Mikolov et al. (2013): Developed the Continuous Bag-of-Words (CBOW) and Skip-Gram (SG) models, which predict the target word from context or vice versa.
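To illustrate the difference between the two objectives, here is a minimal sketch of how training pairs can be formed from a context window; the sentence and window size are illustrative, and the actual neural network training is omitted:

```python
# Sketch of training-pair construction for Skip-Gram and CBOW (window size is illustrative).
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))   # Skip-Gram: target predicts each context word
    return pairs

def cbow_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, target))             # CBOW: the context window predicts the target
    return pairs

tokens = "the cat sat on the mat".split()
print(skipgram_pairs(tokens)[:4])
print(cbow_pairs(tokens)[:2])
```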
Crucially, Mikolov et al. discovered that these embeddings encode surprising semantic regularities that can be recovered with simple vector arithmetic, such as "king - man + woman ≈ queen."
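This kind of analogy can be queried directly with the Word2Vec implementation in Gensim, as in the hedged sketch below; the toy corpus is far too small to yield meaningful vectors and is only there to make the snippet runnable:

```python
# Sketch of training a Skip-Gram model and querying an analogy with Gensim (assumed installed).
from gensim.models import Word2Vec

sentences = [["king", "queen", "man", "woman", "royal"],
             ["man", "woman", "boy", "girl"]]       # toy corpus, purely illustrative
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 selects Skip-Gram

# "king - man + woman" expressed as a similarity query over the learned vectors
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```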
Count-based Models
These models use global word-context co-occurrence statistics to generate embeddings:
- Latent Semantic Analysis (LSA): Applies Singular Value Decomposition (SVD) to a term-document matrix (see the sketch below).
- GloVe (Pennington et al., 2014): A log-linear model leveraging co-occurrence ratios to encode semantic information.
Count-based models focus on global statistics, as opposed to the local context used in prediction-based models.
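As a concrete example of the count-based approach, here is a minimal LSA sketch, assuming scikit-learn is available; the corpus and the number of latent dimensions are illustrative:

```python
# A minimal LSA sketch: truncated SVD over a (documents x terms) count matrix.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets"]
X = CountVectorizer().fit_transform(docs)   # sparse documents-by-terms count matrix
svd = TruncatedSVD(n_components=2)          # keep 2 latent dimensions
doc_vectors = svd.fit_transform(X)          # low-dimensional document representations
term_vectors = svd.components_.T            # corresponding term representations
print(doc_vectors.shape, term_vectors.shape)
```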
Implications and Future Directions
Word embeddings have significantly impacted various NLP tasks, from chunking and parsing to sentiment analysis and question answering. Toolkits like Word2Vec and GloVe have made these embeddings accessible, enabling more complex and accurate NLP applications.
Looking ahead, there's potential for further improvements:
- Handling Rare Words: Enhancements to better handle low-frequency words could broaden the applicability of embeddings.
- Multilingual Embeddings: Developing embeddings that work across multiple languages can benefit global NLP applications.
- Contextual Embeddings: Incorporating more contextual information could lead to more nuanced understanding of words.
In summary, word embeddings have become an essential tool in NLP, thanks to both prediction-based and count-based approaches, each with its own strengths. As AI research continues to evolve, we can expect even more sophisticated and versatile word embeddings in the future.