SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation
The paper "SimLex-999: Evaluating Semantic Models With (Genuine) Similarity Estimation" presents a novel dataset designed to improve the evaluation of distributional semantic models. Authored by Felix Hill, Roi Reichart, and Anna Korhonen, the dataset addresses a critical distinction in semantics: genuine similarity versus association. While previous evaluation metrics like WordSim-353 (WS-353) and MEN often conflated these two notions, SimLex-999 focuses on similarity, making it a valuable tool for validating models intended for applications that require precise similarity metrics.
Key Contributions
The key contributions of SimLex-999 are as follows:
- Explicit Focus on Similarity: Unlike WS-353 and MEN, which include pairs that are related but not similar (e.g., "coffee" and "cup"), SimLex-999 discriminates between similarity and mere association. This matters because association-based benchmarks effectively penalize a model for correctly separating the two relations; SimLex-999 rewards that distinction instead.
- Diverse Concept Types: SimLex-999 covers a balanced range of nouns, verbs, and adjectives, spanning both concrete and abstract concepts. This lexical diversity lets it probe the distinct semantic behavior of different word classes.
- Performance Analysis Beyond Human Agreement Ceilings: State-of-the-art distributional semantic models perform well below the human inter-annotator agreement ceiling on SimLex-999, leaving significant headroom for improvement. On prior datasets, by contrast, model performance has already approached or even exceeded the level of human agreement. The evaluation protocol itself is sketched below.
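To make the evaluation protocol concrete, here is a minimal sketch of the standard procedure: score each word pair with the model's cosine similarity and compute Spearman's rank correlation against the gold ratings. It assumes the published SimLex-999.txt file (tab-separated, with word1, word2, and SimLex999 columns) and a plain-text vector file in `word v1 v2 ...` format; the file paths and the loading format are placeholder assumptions, not details from the paper.

```python
import csv
import numpy as np
from scipy.stats import spearmanr

def load_vectors(path):
    """Load word vectors from a plain-text file with `word v1 v2 ...` lines."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
    return vectors

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_simlex(vectors, simlex_path="SimLex-999.txt"):
    """Spearman rho between model cosine scores and SimLex-999 gold ratings.
    Pairs with out-of-vocabulary words are skipped, as is common practice."""
    model_scores, gold_scores = [], []
    with open(simlex_path, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            w1, w2 = row["word1"], row["word2"]
            if w1 in vectors and w2 in vectors:
                model_scores.append(cosine(vectors[w1], vectors[w2]))
                gold_scores.append(float(row["SimLex999"]))
    if len(model_scores) < 2:
        return float("nan"), len(model_scores)
    rho, _ = spearmanr(model_scores, gold_scores)
    return rho, len(model_scores)
```

Reporting the number of covered pairs alongside rho matters, since skipping out-of-vocabulary pairs can silently inflate a model's apparent score.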
Implications for Distributional Semantic Models
The SimLex-999 dataset provides a robust framework for evaluating the nuanced task of similarity estimation, allowing researchers to diagnose model weaknesses more precisely. For instance, evaluation on SimLex-999 reveals that current models align more closely with association than with genuine similarity, suggesting an over-reliance on raw corpus co-occurrence statistics at the expense of deeper semantic relations; the diagnostic sketch below makes this testable.
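One way to surface this bias, assuming your copy of SimLex-999.txt includes the published Assoc(USF) column (USF free-association strength), is to correlate model scores against both the similarity ratings and the association strengths. If the association correlation is the higher of the two, the model is tracking relatedness rather than similarity. This sketch reuses cosine from the snippet above; the column names are taken from the distributed file and should be checked against your copy.

```python
import csv
from scipy.stats import spearmanr

def similarity_vs_association(vectors, simlex_path="SimLex-999.txt"):
    """Compare how well a model's cosine scores track genuine similarity
    (SimLex999 column) versus free association (Assoc(USF) column)."""
    model, sim_gold, assoc_gold = [], [], []
    with open(simlex_path, encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            w1, w2 = row["word1"], row["word2"]
            if w1 in vectors and w2 in vectors:
                model.append(cosine(vectors[w1], vectors[w2]))
                sim_gold.append(float(row["SimLex999"]))
                assoc_gold.append(float(row["Assoc(USF)"]))
    rho_sim, _ = spearmanr(model, sim_gold)
    rho_assoc, _ = spearmanr(model, assoc_gold)
    # rho_assoc > rho_sim suggests the model reflects co-occurrence-driven
    # association more strongly than true similarity.
    return rho_sim, rho_assoc
```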
Additionally, experiments show that models trained on dependency-parsed input, such as those of Levy and Goldberg, are better able to separate similarity from association, suggesting that syntactic information sharpens semantic representations. By contrast, adjusting the context window size yields mixed results, indicating that window size matters less for modeling similarity than previously thought; a small probe of the window-size effect is sketched below.
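The window-size observation is easy to probe with off-the-shelf tools. The sketch below is illustrative only: it assumes gensim version 4 or later and substitutes gensim's tiny common_texts corpus for real training data, so the resulting numbers are meaningless until a real corpus is swapped in. Dependency-based contexts in the style of Levy and Goldberg require a parsed corpus and a separate tool such as their word2vecf, which this snippet does not attempt.

```python
from gensim.models import Word2Vec
from gensim.test.utils import common_texts  # toy corpus; swap in real data

for window in (2, 5, 10):
    # Train a skip-gram model, varying only the linear context window.
    model = Word2Vec(common_texts, vector_size=100, window=window,
                     min_count=1, sg=1, epochs=5)
    # model.wv supports `word in ...` and `...[word]`, so it can be passed
    # straight to evaluate_simlex from the earlier sketch.
    rho, n = evaluate_simlex(model.wv)
    print(f"window={window}: Spearman rho={rho} over {n} covered pairs")
```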
Practical and Theoretical Implications
Practically, the insights derived from evaluating models on SimLex-999 can inform a range of NLP applications, from ontology generation to machine translation, where the precision of semantic similarity is paramount. The theoretical implications are equally significant, prompting a rethinking of how models can incorporate richer, more conceptually grounded information.
SimLex-999 also underscores the potential of multi-modal learning approaches that integrate perceptual data alongside text to build fuller representations of concrete concepts. Such advances could push the boundaries of representation learning, yielding systems better suited to natural language understanding and related tasks.
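As a toy illustration of the multi-modal idea, one simple approach is to concatenate length-normalized linguistic and visual feature vectors per word. Both the fusion scheme and the weighting parameter here are assumptions for illustration, not the paper's method.

```python
import numpy as np

def fuse(text_vec, image_vec, alpha=0.5):
    """Concatenate L2-normalized text and visual feature vectors.
    `alpha` weights the linguistic versus perceptual channel; both the
    concatenation scheme and the weighting are illustrative assumptions."""
    t = text_vec / np.linalg.norm(text_vec)
    v = image_vec / np.linalg.norm(image_vec)
    return np.concatenate([alpha * t, (1 - alpha) * v])
```

Fused vectors can then be scored with evaluate_simlex from the first sketch, making it possible to test whether perceptual input improves similarity estimates for concrete word pairs.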
Future Directions
Future research inspired by SimLex-999 may explore various advancements:
- Enhanced Dependency Parsing: Further refining models to utilize dependency-parsed inputs could provide a clearer separation between similarity and association.
- Multi-Modal Learning Models: Integrating visual and other perceptual data in semantic model training could be vital for improving the representation of concrete concepts.
- Beyond Similarity: Investigating other complex dimensions of semantics, such as intensionality, polarity, and subjectivity, could broaden the scope of model evaluation and improvement.
In conclusion, SimLex-999 represents a significant step forward in the semantic evaluation landscape. It provides a clearer, more rigorous framework for assessing and driving progress in distributional semantics, encouraging the development of next-generation models that more accurately mirror human judgments of semantic similarity.