OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System (2412.20005v2)

Published 28 Dec 2024 in cs.CL, cs.AI, cs.DB, cs.IR, and cs.LG

Abstract: We introduce OneKE, a dockerized schema-guided knowledge extraction system, which can extract knowledge from the Web and raw PDF Books, and support various domains (science, news, etc.). Specifically, we design OneKE with multiple agents and a configure knowledge base. Different agents perform their respective roles, enabling support for various extraction scenarios. The configure knowledge base facilitates schema configuration, error case debugging and correction, further improving the performance. Empirical evaluations on benchmark datasets demonstrate OneKE's efficacy, while case studies further elucidate its adaptability to diverse tasks across multiple domains, highlighting its potential for broad applications. We have open-sourced the Code at https://github.com/zjunlp/OneKE and released a Video.


Summary

The paper introduces OneKE, a system that uses a schema-guided multi-agent architecture with integrated LLMs for efficient knowledge extraction.
It leverages specialized agents to perform schema analysis, data extraction, and error reflection across various formats like HTML and PDF.
Evaluation on benchmark datasets demonstrates significant performance improvements, showcasing its applicability in science, news, and beyond.

Overview of OneKE: A Schema-Guided LLM Agent-based Knowledge Extraction System

The research paper introduces OneKE, a sophisticated knowledge extraction system developed to handle a diverse array of data sources and adapt to various schemas. This system is particularly notable for its integration of LLMs within a structured, dockerized environment, designed to enhance both flexibility and reliability in knowledge extraction tasks across multiple domains such as science and news.

System Architecture and Key Components

OneKE employs a multi-agent architecture in which each agent fulfills a distinct role in the extraction process, supported by a shared knowledge base. The primary components are:

Schema Agent: This agent is responsible for schema analysis and generation, utilizing LLMs to preprocess various real-world data formats like HTML and PDF. It either selects predefined schemas from a repository or uses LLMs to deduce schemas dynamically when none are provided.
Extraction Agent: Upon receiving schemas, this agent extracts knowledge utilizing multiple LLMs, including open-source models like LLaMA and proprietary models like GPT-4. It enhances performance by learning from similar cases retrieved from a 'Case Repository.'
Reflection Agent: This component is tasked with error recognition and correction, essential for maintaining the accuracy of extracted information. By accessing previously recorded erroneous cases and reflective analyses, it iteratively optimizes the extraction results.
Configure Knowledge Base: This supports the other agents by storing schemas and past extraction cases, which are leveraged for both knowledge extraction and error correction.
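The division of labor among these components can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not OneKE's actual code: the names KnowledgeBase, schema_agent, extraction_agent, reflection_agent, and run_pipeline are hypothetical, the prompts are placeholders, and the LLM is modeled as any function that maps a prompt string to a reply string.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stand-in for any LLM backend: prompt in, text out.
LLM = Callable[[str], str]


@dataclass
class KnowledgeBase:
    """Illustrative knowledge base: stored schemas plus past cases."""
    schemas: dict[str, list[str]] = field(default_factory=dict)
    good_cases: list[tuple[str, str]] = field(default_factory=list)  # (text, answer)
    bad_cases: list[tuple[str, str]] = field(default_factory=list)   # (text, reflection)


def schema_agent(task: str, kb: KnowledgeBase, llm: LLM) -> list[str]:
    # Prefer a predefined schema; otherwise ask the LLM to deduce one.
    if task in kb.schemas:
        return kb.schemas[task]
    reply = llm(f"List the fields to extract for task: {task}")
    return [name.strip() for name in reply.split(",") if name.strip()]


def extraction_agent(text: str, schema: list[str], kb: KnowledgeBase, llm: LLM) -> str:
    # Stored correct cases are shown to the model as worked examples.
    examples = "\n".join(f"{t} -> {a}" for t, a in kb.good_cases)
    return llm(f"Schema: {schema}\nExamples:\n{examples}\nText: {text}")


def reflection_agent(text: str, draft: str, kb: KnowledgeBase, llm: LLM) -> str:
    # With no recorded error cases there is nothing to reflect on.
    if not kb.bad_cases:
        return draft
    notes = "\n".join(note for _, note in kb.bad_cases)
    return llm(f"Past mistakes:\n{notes}\nText: {text}\nDraft: {draft}\nCorrected:")


def run_pipeline(task: str, text: str, kb: KnowledgeBase, llm: LLM) -> str:
    schema = schema_agent(task, kb, llm)
    draft = extraction_agent(text, schema, kb, llm)
    return reflection_agent(text, draft, kb, llm)
```

Passing the LLM as a plain callable keeps the sketch independent of any particular model, which mirrors the summary's point that the Extraction Agent can sit on top of either open-source or proprietary models.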

Evaluation and Empirical Results

OneKE was evaluated on benchmark datasets: CrossNER for named entity recognition (NER) and NYT-11-HRL for relation extraction (RE). The system showed significant improvements in performance metrics, especially when case retrieval was applied in complex schema scenarios. The authors observe that reusing previously successful reasoning paths from stored cases improved extraction accuracy, particularly on more intricate tasks such as relation extraction.
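To make the idea of case retrieval concrete, the sketch below ranks stored cases by similarity to a new input and returns the closest ones for use as examples. The similarity measure here is plain token-overlap (Jaccard), chosen only because it needs no external libraries; this summary does not say which retrieval method OneKE uses, and the function names jaccard and retrieve_cases are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two texts, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def retrieve_cases(query: str, cases: list[tuple[str, str]], k: int = 2) -> list[tuple[str, str]]:
    """Return the k stored (text, answer) cases most similar to the query."""
    ranked = sorted(cases, key=lambda case: jaccard(query, case[0]), reverse=True)
    return ranked[:k]


# Illustrative stored cases; the texts and answers are made up for this example.
stored = [
    ("Apple acquired Beats in 2014", "(Apple, acquired, Beats)"),
    ("The cat sat on the mat", "no relation"),
    ("Google acquired YouTube in 2006", "(Google, acquired, YouTube)"),
]
top = retrieve_cases("Microsoft acquired GitHub in 2018", stored, k=2)
```

The retrieved pairs would then be placed in the extraction prompt as worked examples, so that the model sees reasoning that succeeded on inputs resembling the current one.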

Implications and Practical Applications

OneKE has practical applications in several settings. In web news extraction, for instance, it supports content parsing and sentiment monitoring, which are important for timely risk assessment. For long-form text such as book chapters, it can extract structured knowledge to support downstream analytics and comprehension tasks.

Future Prospects and Developments

The authors outline plans for the long-term maintenance and expansion of OneKE, including the integration of domain-specific knowledge from additional fields and advancements in the processing of diverse document formats. Such developments are anticipated to further augment the system's applicability and extend its influence across a broader range of knowledge extraction scenarios.

In summary, OneKE advances knowledge extraction through a schema-guided, multi-agent architecture built on LLMs. Its adaptability, error correction, and schema generalization make it a useful tool for researchers and practitioners working across domain-specific applications.


Authors (13)

GitHub

GitHub - zjunlp/OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System. (4 stars)
