Evolution of Semantic Similarity -- A Survey (2004.13820v2)

Published 19 Apr 2020 in cs.CL and cs.IR

Abstract: Estimating the semantic similarity between text data is one of the challenging and open research problems in the field of NLP. The versatility of natural language makes it difficult to define rule-based methods for determining semantic similarity measures. In order to address this issue, various semantic similarity methods have been proposed over the years. This survey article traces the evolution of such methods, categorizing them based on their underlying principles as knowledge-based, corpus-based, deep neural network-based methods, and hybrid methods. Discussing the strengths and weaknesses of each method, this survey provides a comprehensive view of existing systems in place, for new researchers to experiment and develop innovative ideas to address the issue of semantic similarity.

Citations (238)

Summary

  • The paper offers a comprehensive analysis of semantic similarity approaches, covering knowledge-based, corpus-based, deep neural network, and hybrid methods.
  • It systematically reviews techniques ranging from traditional kernel-based models to transformer architectures that capture contextual meaning.
  • The study highlights challenges such as computational costs and interpretability, while outlining future directions for domain-specific enhancements.

Evolution of Semantic Similarity - A Survey

The paper "Evolution of Semantic Similarity - A Survey" by Dhivya Chandrasekaran and Vijay Mago offers an extensive exploration of the methodologies employed in determining semantic similarity within the domain of NLP. The authors systematically chart the historical and technical evolution of semantic similarity techniques, categorizing them into knowledge-based, corpus-based, deep neural network-based, and hybrid methods.

The survey analyzes these methodologies in depth, starting with traditional NLP techniques such as kernel-based approaches and advancing to contemporary transformer-based models. It groups semantic similarity methods by their fundamental principles: knowledge-based approaches that leverage ontologies, corpus-based methods that rely on statistical measures, and more recent neural network-based strategies.

Methodologies Addressed

  1. Knowledge-Based Methods: These methods rely on structured knowledge sources such as WordNet, Wikipedia, and BabelNet to compute the semantic similarity between terms, exploiting the taxonomic relationships and information content encoded in these resources. Key techniques include edge-counting measures based on path lengths, feature-based measures that use gloss overlaps, and information content-based computations; a minimal WordNet sketch follows this list. While these methods benefit from robust semantic grounding, they are limited by the coverage and specificity of their underlying lexical resources.
  2. Corpus-Based Methods: These methodologies derive semantic similarity by analyzing large textual corpora, building on the distributional hypothesis that similar words appear in similar contexts. Notable innovations include Latent Semantic Analysis (LSA) and the Word2Vec and GloVe embedding models, which map words to dense real-valued vectors; similarity then reduces to vector geometry, as in the cosine-similarity sketch after this list. This category also covers kernel-based models and dependency-parsing techniques, which exploit syntactic structure for similarity calculation.
  3. Deep Neural Network-Based Methods: With advances in deep learning, architectures such as Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and transformers (e.g., BERT, XLNet) now play a crucial role. These networks learn embeddings from vast corpora, capturing intricate semantic nuances at scale and significantly improving the modeling of semantic similarity; the same cosine comparison shown after this list applies to the vectors they produce. The attention mechanism in transformers marks a particular leap in capturing context-dependent word representations.
  4. Hybrid Methods: The paper observes an increasing trend in hybrid techniques that integrate knowledge-based and corpus-based strategies. These methods aim to combine the interpretability and semantic depth of knowledge-based techniques with the broad applicability of corpus-based models. They often utilize ensemble models to optimize performance across diverse linguistic tasks.
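
To make the knowledge-based family concrete, here is a minimal sketch of an edge-counting measure over WordNet using NLTK. The word senses are picked by hand purely for illustration (a real system would need word-sense disambiguation), and the script assumes the WordNet data has already been downloaded.

```python
# Minimal sketch: path-based (edge-counting) similarity over WordNet via NLTK.
# Assumes: pip install nltk, then nltk.download("wordnet") once beforehand.
from nltk.corpus import wordnet as wn

# First-listed senses, chosen by hand for illustration only; a real system
# would need word-sense disambiguation to select the right synset.
car = wn.synset("car.n.01")
bus = wn.synset("bus.n.01")
flower = wn.synset("flower.n.01")

# path_similarity scores in (0, 1]: the inverse of the shortest hypernym-path
# length between two synsets in the WordNet taxonomy.
print(f"car vs bus:    {car.path_similarity(bus):.3f}")     # nearby in the taxonomy
print(f"car vs flower: {car.path_similarity(flower):.3f}")  # far apart

# Wu-Palmer similarity, another classic knowledge-based measure, weighs the
# depth of the two synsets' least common subsumer.
print(f"car vs bus (Wu-Palmer): {car.wup_similarity(bus):.3f}")
```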

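The corpus-based and neural families both ultimately reduce similarity to geometry in a vector space: cosine similarity between embeddings, whether those come from Word2Vec, GloVe, or a transformer encoder. The sketch below uses tiny made-up vectors purely to show the computation; the numbers carry no meaning beyond illustration.

```python
# Minimal sketch: cosine similarity between word embeddings.
# The vectors are made-up examples; in practice they would come from a
# trained model such as Word2Vec, GloVe, or a transformer encoder.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-dimensional embeddings, for illustration only.
embeddings = {
    "king":  np.array([0.80, 0.10, 0.70, 0.20]),
    "queen": np.array([0.75, 0.15, 0.80, 0.10]),
    "apple": np.array([0.10, 0.90, 0.05, 0.60]),
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high: related words
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low: unrelated words
```

The same computation underlies transformer-based sentence similarity: encode two sentences into vectors with a BERT-style encoder, then compare the resulting embeddings with cosine similarity.
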
Implications and Future Directions

The survey not only reviews existing approaches but also identifies challenges such as the computational demands of deep neural models and the difficulty of interpreting them. Hybrid models exemplify an ongoing attempt to leverage the strengths of multiple techniques while addressing their weaknesses.

The implications of this research domain are profound, impacting a wide range of applications from information retrieval and text summarization to machine translation and question answering systems. It suggests that further advancements might be achieved by developing more effective multi-sense embeddings and improving model efficiency without sacrificing performance.

The paper concludes by advocating continued exploration of domain-specific embeddings and the construction of ideal corpora that can enhance model adaptability and semantic accuracy across languages and contexts. By providing such a detailed overview, the survey serves as a valuable resource for researchers aiming to contribute novel insights or enhancements to the field of semantic similarity in NLP.