
Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases (2009.11564v2)

Published 24 Sep 2020 in cs.AI and cs.DB

Abstract: Equipping machines with comprehensive knowledge of the world's entities and their relationships has been a long-standing goal of AI. Over the last decade, large-scale knowledge bases, also known as knowledge graphs, have been automatically constructed from web contents and text sources, and have become a key asset for search engines. This machine knowledge can be harnessed to semantically interpret textual phrases in news, social media and web tables, and contributes to question answering, natural language processing and data analytics. This article surveys fundamental concepts and practical methods for creating and curating large knowledge bases. It covers models and methods for discovering and canonicalizing entities and their semantic types and organizing them into clean taxonomies. On top of this, the article discusses the automatic extraction of entity-centric properties. To support the long-term life-cycle and the quality assurance of machine knowledge, the article presents methods for constructing open schemas and for knowledge curation. Case studies on academic projects and industrial knowledge graphs complement the survey of concepts and methods.

Citations (118)

Summary

  • The paper presents automated construction techniques that extract, canonicalize, and organize information from diverse web and text sources.
  • It demonstrates leveraging pretrained language models and generative approaches to effectively capture and validate both factual and commonsense knowledge.
  • The research emphasizes robust schema management and alignment methods to integrate multiple KBs, enhancing performance in NLP and AI tasks.

Machine Knowledge involves the creation and curation of comprehensive knowledge bases (KBs) or knowledge graphs (KGs) that represent factual and relational information about the world. These curated repositories play a critical role in various applications, including search engines, NLP, and AI.

Creation of Knowledge Bases

  1. Automated Construction: Large-scale knowledge bases can be constructed automatically from web content and text sources. The process involves discovering and canonicalizing entities, defining their relationships, and organizing them into clean taxonomies (2009.11564). Automated methods reduce the manual effort required while ensuring scalability and broad coverage.
  2. Leveraging Pretrained Language Models: Recent work suggests that pretrained language models such as BERT can function as unsupervised knowledge bases. These models store relational and factual knowledge within their parameters, enabling them to answer queries posed in "fill-in-the-blank" (cloze) format quite effectively (Petroni et al., 2019). This approach bypasses the need for structured schemas and extensive human annotation, though its outputs may still require validation against structured KBs for accuracy.
  3. Generative Approaches: Models such as COMET use generative techniques to construct KBs by producing natural language descriptions of commonsense knowledge. COMET has shown promising results in creating knowledge entries that are highly rated by humans in terms of quality (Bosselut et al., 2019).
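The discover-then-canonicalize step from the construction pipeline above can be illustrated with a minimal sketch. All names and alias entries here are hypothetical toy data; a production system would back the alias table with a full KB and add disambiguation against context.

```python
import re

def normalize(mention: str) -> str:
    """Lowercase a mention and strip punctuation and surrounding whitespace."""
    return re.sub(r"[^\w\s]", "", mention).strip().lower()

def canonicalize(mention: str, alias_table: dict):
    """Map a textual mention to a canonical KB entity id, or None if unknown."""
    return alias_table.get(normalize(mention))

# Toy alias table; a real KB would hold millions of surface forms.
ALIASES = {
    "nyc": "New_York_City",
    "new york": "New_York_City",
    "big apple": "New_York_City",
    "ny city": "New_York_City",
}

print(canonicalize("NYC!", ALIASES))       # New_York_City
print(canonicalize("The Hague", ALIASES))  # None (mention not in the table)
```

Real canonicalization pipelines additionally score candidate entities using context (surrounding words, entity popularity, type constraints) rather than relying on exact alias lookup alone.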

Curation and Maintenance

  1. Schema and Taxonomy Management: Proper maintenance of knowledge bases involves constructing open schemas and ensuring quality through continuous curation. This includes enhancing the schema to incorporate emerging entities and relationships, and the canonicalization of new data entries (2009.11564).
  2. Zero-Shot Learning: The SPIRES method employs zero-shot learning to extract information from unstructured text without needing prior training data. By using LLMs for recursive prompt interrogation, SPIRES allows for flexible and precise knowledge extraction that aligns with specified schemas (Caufield et al., 2023).
  3. Alignment and Integration: The Simple Greedy Matching (SiGMa) algorithm tackles the challenge of aligning multiple large-scale knowledge bases by leveraging structural information and similarity measures. Efficient alignment ensures that complementary information across different sources is unified effectively (Lacoste-Julien et al., 2012).
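The greedy matching idea behind alignment can be sketched as follows. This toy version matches entities across two KBs one-to-one by descending name similarity only; SiGMa proper also folds in structural agreement between already-matched neighbors, which is omitted here for brevity. All entity ids and names are hypothetical.

```python
from difflib import SequenceMatcher

def name_sim(a: str, b: str) -> float:
    """String similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def greedy_align(kb1: dict, kb2: dict, threshold: float = 0.8) -> dict:
    """Greedily build a one-to-one mapping {id_in_kb1: id_in_kb2}.

    kb1, kb2 map entity ids to display names. Candidate pairs are
    processed in order of decreasing similarity; each entity is
    matched at most once, and pairs below the threshold are dropped.
    """
    candidates = sorted(
        ((name_sim(n1, n2), e1, e2)
         for e1, n1 in kb1.items()
         for e2, n2 in kb2.items()),
        reverse=True,
    )
    matched, used1, used2 = {}, set(), set()
    for score, e1, e2 in candidates:
        if score < threshold:
            break
        if e1 not in used1 and e2 not in used2:
            matched[e1] = e2
            used1.add(e1)
            used2.add(e2)
    return matched

kb_a = {"Q60": "New York City", "Q90": "Paris"}
kb_b = {"e1": "Paris", "e2": "new york city"}
print(greedy_align(kb_a, kb_b))  # {'Q90': 'e1', 'Q60': 'e2'}
```

The one-to-one constraint is what makes the greedy strategy behave sensibly at scale: once an entity is claimed by a high-confidence match, weaker candidate pairs involving it are skipped rather than re-scored.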

Applications and Implications

  1. Enhancing NLP and Machine Reading: Integrating knowledge bases into machine learning models boosts performance in tasks like entity and event extraction. For instance, the KBLSTM model enhances recurrent neural networks by using continuous representations from knowledge bases, leading to improved accuracy in machine reading tasks (Yang et al., 2019).
  2. Commonsense and Temporal Reasoning: Advances in knowledge graph research, especially in commonsense and temporal reasoning, are crucial for building more sophisticated AI systems that mimic human cognition (Ji et al., 2020).
  3. Interpretable Models: Developing interpretable models for knowledge base completion, such as ITransF, allows easy interpretation of learned associations, thus providing transparency and trustworthiness in AI systems (Xie et al., 2017).
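Knowledge base completion, mentioned in the last item, typically scores candidate triples with learned embeddings. The sketch below uses the simpler translational scoring function from TransE rather than ITransF's shared sparse-attention matrices: a triple (h, r, t) is plausible when the head embedding plus the relation embedding lands near the tail embedding. The 2-d embedding values are hypothetical, chosen by hand for illustration rather than learned.

```python
import math

def transe_score(h, r, t) -> float:
    """TransE-style plausibility: lower ||h + r - t|| means more plausible."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Toy hand-picked 2-d embeddings (a real model learns these from training triples).
emb = {
    "Paris": (1.0, 0.0),
    "Berlin": (0.0, 0.0),
    "France": (1.0, 1.0),
    "capital_of": (0.0, 1.0),
}

good = transe_score(emb["Paris"], emb["capital_of"], emb["France"])   # 0.0
bad = transe_score(emb["Berlin"], emb["capital_of"], emb["France"])   # 1.0
print(good < bad)  # True: (Paris, capital_of, France) ranks as more plausible
```

Interpretable variants such as ITransF replace the single translation vector per relation with sparse combinations of shared concept matrices, so the learned associations between relations can be read off the sparse attention weights.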

Overall, significant progress has been made in both the creation and curation of comprehensive knowledge bases, propelled by advances in AI and NLP. These efforts yield a better-organized body of machine knowledge capable of supporting diverse and complex applications.
