OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System (2412.20005v2)

Published 28 Dec 2024 in cs.CL, cs.AI, cs.DB, cs.IR, and cs.LG

Abstract: We introduce OneKE, a dockerized schema-guided knowledge extraction system, which can extract knowledge from the Web and raw PDF Books, and support various domains (science, news, etc.). Specifically, we design OneKE with multiple agents and a configure knowledge base. Different agents perform their respective roles, enabling support for various extraction scenarios. The configure knowledge base facilitates schema configuration, error case debugging and correction, further improving the performance. Empirical evaluations on benchmark datasets demonstrate OneKE's efficacy, while case studies further elucidate its adaptability to diverse tasks across multiple domains, highlighting its potential for broad applications. We have open-sourced the Code at https://github.com/zjunlp/OneKE and released a Video at

.

Summary

  • The paper introduces OneKE, a system that uses a schema-guided multi-agent architecture with integrated LLMs for efficient knowledge extraction.
  • It leverages specialized agents to perform schema analysis, data extraction, and error reflection across various formats like HTML and PDF.
  • Evaluation on benchmark datasets demonstrates significant performance improvements, showcasing its applicability in science, news, and beyond.

Overview of OneKE: A Schema-Guided LLM Agent-based Knowledge Extraction System

The paper introduces OneKE, a knowledge extraction system built to handle diverse data sources and adapt to varying schemas. It integrates LLMs within a structured, dockerized environment, aiming for both flexibility and reliability in knowledge extraction across domains such as science and news.

System Architecture and Key Components

OneKE employs a multi-agent architecture in which each agent fulfills a distinct role in the extraction process, backed by a shared knowledge base. The primary components are:

  1. Schema Agent: This agent is responsible for schema analysis and generation, utilizing LLMs to preprocess various real-world data formats like HTML and PDF. It either selects predefined schemas from a repository or uses LLMs to deduce schemas dynamically when none are provided.
  2. Extraction Agent: Upon receiving schemas, this agent extracts knowledge utilizing multiple LLMs, including open-source models like LLaMA and proprietary models like GPT-4. It enhances performance by learning from similar cases retrieved from a 'Case Repository.'
  3. Reflection Agent: This component is tasked with error recognition and correction, essential for maintaining the accuracy of extracted information. By accessing previously recorded erroneous cases and reflective analyses, it iteratively optimizes the extraction results.
  4. Configure Knowledge Base: This supports the other agents by storing schemas and past extraction cases, which are leveraged for both knowledge extraction and error correction.
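The division of labor described above can be sketched in a few lines of Python. This is only an illustrative sketch, not OneKE's actual code: the names `KnowledgeBase`, `schema_agent`, `extraction_agent`, `reflection_agent`, and `run_pipeline` are hypothetical, the prompts are invented, and the LLM is replaced by a plain callable (`fake_llm`) so the example runs without any model. The "retry empty fields" rule in the reflection step is likewise an assumption standing in for the paper's error-correction logic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: prompt string in, text out.
LLM = Callable[[str], str]


@dataclass
class KnowledgeBase:
    """Toy stand-in for the Configure Knowledge Base (schemas plus error cases)."""
    schemas: Dict[str, List[str]] = field(default_factory=dict)
    bad_cases: List[str] = field(default_factory=list)


def schema_agent(task: str, kb: KnowledgeBase, llm: LLM) -> List[str]:
    # Prefer a predefined schema; otherwise ask the model to deduce one.
    if task in kb.schemas:
        return kb.schemas[task]
    reply = llm(f"List field names for task: {task}")
    return [name.strip() for name in reply.split(",") if name.strip()]


def extraction_agent(text: str, schema: List[str], llm: LLM) -> Dict[str, str]:
    # One model call per schema field (an assumption made for simplicity).
    return {f: llm(f"Extract '{f}' from: {text}").strip() for f in schema}


def reflection_agent(text: str, draft: Dict[str, str],
                     kb: KnowledgeBase, llm: LLM) -> Dict[str, str]:
    # Assumed error rule: an empty field counts as a failure. Record it, then retry.
    fixed = dict(draft)
    for f, value in draft.items():
        if not value:
            kb.bad_cases.append(f)
            fixed[f] = llm(f"Retry extracting '{f}' from: {text}").strip()
    return fixed


def run_pipeline(text: str, task: str, kb: KnowledgeBase, llm: LLM) -> Dict[str, str]:
    schema = schema_agent(task, kb, llm)
    draft = extraction_agent(text, schema, llm)
    return reflection_agent(text, draft, kb, llm)


def fake_llm(prompt: str) -> str:
    # Deterministic fake model so the sketch is runnable offline.
    if prompt.startswith("List field names"):
        return "person, organization"
    if prompt.startswith("Retry"):
        return "Acme"
    if "'person'" in prompt:
        return "Ada"
    return ""  # simulates a missed field on the first pass


kb = KnowledgeBase()
result = run_pipeline("Ada works at Acme.", "ner", kb, fake_llm)
print(result)
print(kb.bad_cases)
```

Passing the model in as a callable keeps each agent testable in isolation and mirrors the summary's point that the extraction step can sit on top of different open-source or proprietary models.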

Evaluation and Empirical Results

OneKE was evaluated on benchmark datasets such as CrossNER for named entity recognition (NER) and NYT-11-HRL for relation extraction (RE). The system showed significant improvements in performance metrics, especially when case retrieval was applied in complex schema scenarios. The authors report that reusing previously successful reasoning paths from stored cases improved extraction accuracy, particularly on more intricate tasks such as relation extraction.
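To make the idea of case retrieval concrete, here is a minimal sketch of retrieving similar stored cases and placing them in a prompt as worked examples. The summary does not say how OneKE measures similarity, so the token-overlap (Jaccard) score used here is an assumption chosen for simplicity, and `jaccard`, `retrieve_cases`, `build_prompt`, and the toy repository are all hypothetical names and data.

```python
from typing import List, Tuple

# A stored case: (input text, previously correct extraction).
Case = Tuple[str, str]


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; an assumed stand-in for the real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_cases(query: str, repo: List[Case], k: int = 2) -> List[Case]:
    # Rank stored cases by similarity to the query text and keep the top k.
    ranked = sorted(repo, key=lambda case: jaccard(query, case[0]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, cases: List[Case]) -> str:
    # Retrieved cases become few-shot examples ahead of the new input.
    shots = "\n".join(f"Text: {text}\nAnswer: {answer}" for text, answer in cases)
    return f"{shots}\nText: {query}\nAnswer:"


repo = [
    ("Paris is the capital of France", "capital_of(Paris, France)"),
    ("The cat sat on the mat", "none"),
    ("Berlin is the capital of Germany", "capital_of(Berlin, Germany)"),
]
query = "Rome is the capital of Italy"
print(build_prompt(query, retrieve_cases(query, repo, k=2)))
```

In this toy repository the two "capital of" cases outrank the unrelated sentence, so the prompt shows the model two relation-extraction examples that share the target schema before it sees the new text.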

Implications and Practical Applications

OneKE has practical applications in several settings. In web news extraction, for instance, it supports content parsing and sentiment monitoring, both of which matter for timely risk assessment. For long-form sources such as book chapters, it can extract structured knowledge to support downstream analytics and comprehension tasks.

Future Prospects and Developments

The authors outline plans for the long-term maintenance and expansion of OneKE, including the integration of domain-specific knowledge from additional fields and advancements in the processing of diverse document formats. Such developments are anticipated to further augment the system's applicability and extend its influence across a broader range of knowledge extraction scenarios.

In summary, OneKE is a schema-guided, multi-agent knowledge extraction system built on LLMs. Its adaptability, error correction, and schema generalization make it a useful tool for researchers and practitioners working on domain-specific extraction tasks.
