
ConceptNet 5.5: An Open Multilingual Graph of General Knowledge (1612.03975v2)

Published 12 Dec 2016 in cs.CL

Abstract: Machine learning about language can be improved by supplying it with specific knowledge and sources of external information. We present here a new version of the linked open data resource ConceptNet that is particularly well suited to be used with modern NLP techniques such as word embeddings. ConceptNet is a knowledge graph that connects words and phrases of natural language with labeled edges. Its knowledge is collected from many sources that include expert-created resources, crowd-sourcing, and games with a purpose. It is designed to represent the general knowledge involved in understanding language, improving natural language applications by allowing the application to better understand the meanings behind the words people use. When ConceptNet is combined with word embeddings acquired from distributional semantics (such as word2vec), it provides applications with understanding that they would not acquire from distributional semantics alone, nor from narrower resources such as WordNet or DBPedia. We demonstrate this with state-of-the-art results on intrinsic evaluations of word relatedness that translate into improvements on applications of word vectors, including solving SAT-style analogies.

Citations (2,709)

Summary

  • The paper introduces a hybrid semantic model that merges structured knowledge graphs with distributional word embeddings.
  • It integrates multilingual data from 83+ languages and diverse sources to build a comprehensive resource for NLP.
  • It demonstrates improved performance with high correlations on word relatedness tests and competitive SAT analogy accuracy.

ConceptNet 5.5: An Open Multilingual Graph of General Knowledge

In the paper titled "ConceptNet 5.5: An Open Multilingual Graph of General Knowledge," the authors Robyn Speer, Joshua Chin, and Catherine Havasi present a new version of ConceptNet, an extensive knowledge graph designed to improve language understanding in NLP applications through robust, multilingual data. This essay provides an academic overview of the paper, elucidating its contributions, key findings, and implications for the field of NLP.

ConceptNet serves as a comprehensive knowledge graph that links words and phrases of natural language using labeled edges, thereby representing general knowledge in a structured format. The main novelty in ConceptNet 5.5 is its integration with modern NLP technologies, specifically word embeddings like word2vec. This hybrid approach, combining structured knowledge from ConceptNet with distributional semantics, yields a semantic space more effective than what can be achieved through either distributional semantics or narrower resources like WordNet or DBPedia alone.
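The core data structure is simple: weighted, labeled edges between natural-language terms. A minimal sketch of such a representation is below; the relation names mirror those used in the paper, but the specific edges and weights are invented for illustration and are not taken from ConceptNet itself.

```python
from collections import defaultdict

# ConceptNet-style assertions: (start term, relation, end term, weight).
# These example edges are hypothetical, not drawn from the real graph.
edges = [
    ("dog", "IsA", "animal", 2.0),
    ("dog", "CapableOf", "bark", 1.0),
    ("leash", "UsedFor", "walking a dog", 1.0),
]

# Index outgoing edges by start term for quick neighbor lookup.
index = defaultdict(list)
for start, rel, end, weight in edges:
    index[start].append((rel, end, weight))

for rel, end, weight in index["dog"]:
    print(f"dog --{rel} ({weight})--> {end}")
```

Even this toy index illustrates why a labeled graph complements distributional vectors: the edge label says *how* two terms relate, not merely *that* they co-occur.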

Core Contributions

The authors delineate several significant advancements in ConceptNet 5.5:

  1. Multilingual and Multisource Integration: ConceptNet 5.5 merges lexical and world knowledge from diverse sources, encompassing crowd-sourced projects, expert-created data, and "games with a purpose." It supports over 83 languages, with 21 million edges and 8 million nodes, making it a robust resource for multilingual applications.
  2. Hybrid Semantic Space: ConceptNet 5.5 utilizes a method called "retrofitting" to integrate distributional word embeddings with the knowledge graph. This hybrid space, named ConceptNet Numberbatch, surpasses other systems in evaluations of word relatedness and enhances performance in tasks like solving SAT analogies.
  3. Standardized Term Representation: ConceptNet 5.5 redefines term representation by avoiding lemmatization and instead linking inflections with a FormOf relation. This approach allows for more flexible and comprehensive connections between terms in various forms.
  4. Expanded Relation Types: The knowledge graph includes 36 selected relations such as IsA, UsedFor, and CapableOf, designed to capture the nuances of language independently of source language. This structured representation facilitates the alignment of diverse knowledge sources.
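The retrofitting step in contribution 2 can be sketched as follows. This is a minimal version of the general retrofitting idea (iteratively nudging each word's vector toward its graph neighbors while staying close to its original distributional vector); the toy vectors and toy graph below stand in for real embeddings and ConceptNet edges, and the `alpha`/`beta` weights are illustrative defaults, not the paper's settings.

```python
import numpy as np

def retrofit(embeddings, graph, iterations=10, alpha=1.0, beta=1.0):
    """Pull each vector toward its knowledge-graph neighbors.

    embeddings: dict word -> np.ndarray (distributional vectors)
    graph: dict word -> list of neighbor words (knowledge-graph edges)
    """
    new_vecs = {w: v.copy() for w, v in embeddings.items()}
    for _ in range(iterations):
        for word, neighbors in graph.items():
            nbrs = [n for n in neighbors if n in new_vecs]
            if word not in embeddings or not nbrs:
                continue
            # Weighted average of the original vector and neighbor vectors.
            total = beta * embeddings[word]
            for n in nbrs:
                total = total + alpha * new_vecs[n]
            new_vecs[word] = total / (beta + alpha * len(nbrs))
    return new_vecs

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy example: the graph says "puppy" and "dog" are related, so their
# retrofitted vectors move closer together; "car" is untouched.
vecs = {"dog": np.array([1.0, 0.0]),
        "puppy": np.array([0.0, 1.0]),
        "car": np.array([-1.0, 0.0])}
graph = {"puppy": ["dog"], "dog": ["puppy"]}
fitted = retrofit(vecs, graph)
print(cos(fitted["puppy"], fitted["dog"]) > cos(vecs["puppy"], vecs["dog"]))  # True
```

The same mechanism scales to millions of edges: terms connected in ConceptNet end up closer in the hybrid space than distributional statistics alone would place them.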

Numerical Results

The empirical efficacy of ConceptNet Numberbatch is demonstrated through several evaluations:

  • Intrinsic Evaluations:
    • On the MEN-3000 word relatedness test, ConceptNet Numberbatch achieved a Spearman correlation of 0.866, indicating a high alignment with human judgment on word pair relatedness.
    • Similarly, it scored 0.810 on the MTurk-771 test and 0.601 on the Rare Words test, outperforming other standalone systems.
  • SAT Analogies:
    • ConceptNet Numberbatch achieved an accuracy of 56.1% on the SAT-style analogy questions, tying with or surpassing previous best systems like Turney's Latent Relational Analysis (LRA).
  • Story Cloze Test:
    • It scored 59.4%, indicating that embedding-based models can capture enough context to marginally outperform simple baselines in choosing plausible story endings.
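The intrinsic evaluations above follow a standard protocol: score each word pair by the cosine similarity of its vectors, then compute the Spearman rank correlation against human relatedness ratings. A self-contained sketch is below; the tiny embedding table and ratings are made up for illustration, whereas the real evaluations use datasets such as MEN-3000 and MTurk-771.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank(values):
    # Assign ranks 1..n, averaging ranks for tied values.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    # Spearman correlation = Pearson correlation of the ranks.
    rx, ry = rank(xs), rank(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical 2-d embeddings and human ratings for three word pairs.
emb = {"cat": [0.9, 0.1], "dog": [0.7, 0.3], "car": [0.1, 0.9], "bus": [0.2, 0.8]}
pairs = [("cat", "dog"), ("car", "bus"), ("cat", "bus")]
human = [8.5, 9.0, 1.0]
model = [cosine(emb[a], emb[b]) for a, b in pairs]
print(spearman(model, human))  # 1.0 here, since both orderings agree
```

Only the *ordering* of pair scores matters for Spearman correlation, which is why it is the standard metric for relatedness benchmarks: embeddings and human ratings live on different scales.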

Implications and Future Directions

The key implication of this research is the demonstrated benefit of integrating structured knowledge graphs with distributional semantics. This hybrid approach not only improves the intrinsic quality of word embeddings but also enhances their applicability to complex semantic tasks. Practically, applications ranging from search engines to conversational agents can benefit from the enriched contextual understanding provided by such embeddings.

Future directions could involve deeper exploration of relation-specific embeddings and additional optimization techniques for faster convergence in graph embeddings. Moreover, extending these methods to accommodate more languages and domain-specific knowledge graphs could broaden the applicability of ConceptNet further.

Conclusion

ConceptNet 5.5 marks a significant progression in the integration of structured knowledge with NLP. By offering a multilingual, extensively sourced knowledge graph and hybrid word embeddings, it facilitates a robust understanding of language that aligns closely with human cognition. This work serves as a foundational step towards more sophisticated and semantically aware NLP applications.

Overall, the advancements presented in ConceptNet 5.5 underscore the importance of combining different semantic sources to enhance machine understanding of language, setting a precedent for future developments in the field.