AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora

Published 29 May 2025 in cs.CL and cs.AI | (2505.23628v3)

Abstract: We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages LLMs to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 95% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in LLMs.

Authors (20)

Summary

The paper presents a novel framework that autonomously constructs knowledge graphs through dynamic schema induction using large language models.
It extracts entities and events from web-scale corpora and conceptualizes them into flexible schemas without relying on predefined structures.
The resulting ATLAS family of knowledge graphs, with over 900 million nodes and 5.9 billion edges, outperforms state-of-the-art baselines on multi-hop QA tasks, and the induced schemas reach 95% semantic alignment with human-crafted schemas.

Introduction

The transformation of unstructured data into structured, machine-readable formats is one of the key challenges in the field of artificial intelligence. Knowledge Graphs (KGs) have established themselves as an invaluable tool for organizing such data, but traditional approaches to KG construction often rely on predefined schemas, which limits scalability and adaptability. The "AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora" paper presents a novel approach to overcome these limitations by introducing a framework that autonomously constructs knowledge graphs without the need for predefined schemas. AutoSchemaKG exploits the capabilities of LLMs to extract entities, events, and their interrelationships directly from text, creating dynamic and flexible schemas.

AutoSchemaKG Framework

AutoSchemaKG is designed to automate the KG construction process through four major stages:

Input Processing: During this phase, documents are filtered, segmented, and batched to prepare them for further analysis and extraction. This phase ensures that the data is organized and manageable for processing by the LLMs.
Triple Extraction: In this stage, entities and events are extracted from the processed text, and the relationships between them are characterized by prompting LLMs. This captures the underlying semantic structure of the corpus and produces the foundational triples for the knowledge graph.
Schema Induction: The extracted elements undergo conceptualization into abstract categories, which are not bound by any predefined schemas. This allows the constructed knowledge graph to dynamically adjust and represent complex relationships without the constraints of static ontologies.
Knowledge Graph Construction: Finally, the extracted triples and induced schemas are integrated into the ATLAS knowledge graph, which consists of entity nodes, event nodes, and concept nodes connected by relation edges.
Figure 1: The AutoSchemaKG pipeline for autonomous knowledge graph construction.
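
The four stages above can be sketched as a small pipeline. This is only an illustrative toy, not the authors' implementation: the LLM call in the extraction stage is replaced by a trivial parser for text of the form `HEAD | RELATION | TAIL`, the concept lookup is hard-coded, and all names (`Triple`, `segment`, `extract_triples`, `induce_schema`, `build_graph`, `toy_conceptualize`, the `is_a` relation label) are assumptions introduced for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str


def segment(documents, max_chars=200):
    """Stage 1: drop empty documents and split the rest into fixed-size chunks."""
    chunks = []
    for doc in documents:
        text = doc.strip()
        if not text:
            continue
        chunks.extend(text[i:i + max_chars] for i in range(0, len(text), max_chars))
    return chunks


def extract_triples(chunk):
    """Stage 2 stand-in: a real system would prompt an LLM here.

    This toy version only parses sentences written as 'HEAD | RELATION | TAIL'.
    """
    triples = []
    for sentence in chunk.split("."):
        parts = [p.strip() for p in sentence.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append(Triple(*parts))
    return triples


def induce_schema(triples, conceptualize):
    """Stage 3: map every node to a list of abstract concepts."""
    nodes = {t.head for t in triples} | {t.tail for t in triples}
    return {node: conceptualize(node) for node in sorted(nodes)}


def build_graph(triples, schema):
    """Stage 4: merge instance triples and concept links into one adjacency map."""
    graph = defaultdict(list)
    for t in triples:
        graph[t.head].append((t.relation, t.tail))
    for node, concepts in schema.items():
        for concept in concepts:
            graph[node].append(("is_a", concept))  # hypothetical relation label
    return dict(graph)


def toy_conceptualize(node):
    """Hard-coded stand-in for LLM-based conceptualization."""
    lookup = {"Marie Curie": ["scientist"], "Nobel Prize": ["award"]}
    return lookup.get(node, ["thing"])


def run_pipeline(documents):
    triples = [t for chunk in segment(documents) for t in extract_triples(chunk)]
    schema = induce_schema(triples, toy_conceptualize)
    return build_graph(triples, schema)


graph = run_pipeline(["Marie Curie | won | Nobel Prize.", "   "])
```

In this sketch the schema is not supplied up front: the concept nodes appear only after extraction, which is the property the framework relies on.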

Conceptualization and Event Modeling

Although traditional KGs tend to focus solely on entities and their static properties, AutoSchemaKG distinguishes itself by also modeling events as primary semantic units. This approach acknowledges the dynamic nature of real-world information and aims to represent temporal relationships and causal structures that entity-centric graphs fail to capture.
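
As a rough illustration of what treating events as first-class nodes looks like, the sketch below stores entities and events in the same graph and labels each edge by the kinds of nodes it connects. The class names, relation labels, and example sentences are assumptions chosen for this example rather than structures taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class NodeKind(Enum):
    ENTITY = "entity"
    EVENT = "event"


@dataclass(frozen=True)
class Node:
    text: str
    kind: NodeKind


@dataclass(frozen=True)
class Edge:
    head: Node
    relation: str
    tail: Node


def edge_type(edge):
    """Classify an edge by the kinds of its endpoints, e.g. 'entity-event'."""
    return f"{edge.head.kind.value}-{edge.tail.kind.value}"


def events_following(edges, event):
    """Return events directly reachable from `event` via event-event edges."""
    return [
        e.tail for e in edges
        if e.head == event and edge_type(e) == "event-event"
    ]


curie = Node("Marie Curie", NodeKind.ENTITY)
radium = Node("radium", NodeKind.ENTITY)
discovery = Node("Marie Curie discovered radium", NodeKind.EVENT)
award = Node("Marie Curie received the Nobel Prize in Chemistry", NodeKind.EVENT)

edges = [
    Edge(curie, "discovered", radium),        # entity-entity: a static fact
    Edge(curie, "participated in", discovery),  # entity-event: who was involved
    Edge(discovery, "led to", award),         # event-event: a causal link
]

edge_types = [edge_type(e) for e in edges]
```

The third edge is the kind of information a purely entity-centric triple store has no natural place for: it relates two occurrences to each other rather than two things.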

The conceptualization process in AutoSchemaKG uses LLMs to map specific instances to abstract concepts, which supports reasoning and zero-shot inference across domains. Because many instances can share a concept, this abstraction yields a multi-level knowledge representation and helps mitigate sparsity in the graph.

Figure 2: An example of how event nodes (green) enrich the context of knowledge representation beyond simple triples.
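
To make the conceptualization step described above more concrete, the following sketch shows one way shared concepts can connect instances that have no direct edge between them. In AutoSchemaKG the concepts are generated by an LLM; here they are hard-coded, and the function names and example data are assumptions made for illustration.

```python
from collections import defaultdict


def build_concept_index(instance_concepts):
    """Invert an instance -> concepts map into a concept -> instances map."""
    concept_index = defaultdict(set)
    for instance, concepts in instance_concepts.items():
        for concept in concepts:
            concept_index[concept].add(instance)
    return dict(concept_index)


def related_by_concept(node, instance_concepts, concept_index):
    """Find other instances that share at least one concept with `node`."""
    related = set()
    for concept in instance_concepts.get(node, []):
        related |= concept_index.get(concept, set())
    related.discard(node)
    return related


# Hypothetical output of a conceptualization step (hard-coded for this sketch).
instance_concepts = {
    "Marie Curie": ["scientist", "person"],
    "Alan Turing": ["scientist", "person"],
    "radium": ["chemical element"],
}

concept_index = build_concept_index(instance_concepts)
neighbours = related_by_concept("Marie Curie", instance_concepts, concept_index)
```

Here "Marie Curie" and "Alan Turing" share no instance-level triple, yet the concept layer links them, which is the sense in which conceptualization can reduce sparsity.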

Experimental Results and Applications

The ATLAS family of knowledge graphs, constructed using the AutoSchemaKG framework, contains over 900 million nodes and 5.9 billion edges. In evaluation, the framework outperforms state-of-the-art baselines on multi-hop question answering tasks and improves LLM factuality. Moreover, the induced schemas reach 95% semantic alignment with human-crafted schemas without any manual intervention, which supports the quality of the automatically induced concepts.

Limitations and Future Directions

Despite these advancements, the computational cost of constructing large-scale KGs remains a significant consideration, requiring extensive GPU resources and optimization strategies. Biases inherited from the underlying LLMs could also affect the quality of the KG, especially in niche or highly specialized domains. Future research may address these limitations by developing more efficient schema adaptation mechanisms and exploring models that better capture domain-specific knowledge.

Conclusion

AutoSchemaKG represents a significant move towards fully automated, adaptable knowledge graph construction. By eliminating the constraints of predefined schemas and leveraging the strengths of LLMs, this framework is poised to broaden the application scope of KGs, furnishing a robust tool for enhanced AI reasoning across varied and dynamic domains.
