
DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing (2410.12189v2)

Published 16 Oct 2024 in cs.DB and cs.AI

Abstract: Analyzing unstructured data has been a persistent challenge in data processing. LLMs have shown promise in this regard, leading to recent proposals for declarative frameworks for LLM-powered processing of unstructured data. However, these frameworks focus on reducing cost when executing user-specified operations using LLMs, rather than improving accuracy, executing most operations as-is (in a single LLM call). This is problematic for complex tasks and data, where LLM outputs for user-defined operations are often inaccurate, even with optimized prompts. For example, an LLM may struggle to identify all instances of specific clauses, like force majeure or indemnification, in lengthy legal documents, requiring decomposition of the data, the task, or both. We present DocETL, a system that optimizes complex document processing pipelines, while accounting for LLM shortcomings. DocETL offers a declarative interface for users to define such pipelines and uses an agent-based approach to automatically optimize them, leveraging novel agent-based rewrites (that we call rewrite directives), as well as an optimization and evaluation framework. We introduce (i) logical rewriting of pipelines, tailored for LLM-based tasks, (ii) an agent-guided plan evaluation mechanism that synthesizes and orchestrates task-specific validation prompts, and (iii) an optimization algorithm that efficiently finds promising plans, considering the latencies of agent-based plan generation and evaluation. Our evaluation on four different unstructured document analysis tasks demonstrates that DocETL finds plans with outputs that are 25 to 80% more accurate than well-engineered baselines, addressing a critical gap in unstructured data analysis. DocETL is open-source at docetl.org, and as of November 2024, has amassed over 1.3k GitHub Stars, with users spanning a variety of domains.

Summary

  • The paper introduces an agentic rewriting system that dynamically refines query pipelines for complex document processing.
  • It employs 13 novel rewrite directives and a validation framework to decompose tasks and boost output quality.
  • Evaluations on four unstructured document analysis tasks, including legal and governmental data, find plans whose outputs are 25 to 80% more accurate than well-engineered baselines.

Analysis of "DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing"

The paper presents DocETL, a system for LLM-powered processing of unstructured documents. It provides a declarative framework for defining processing pipelines and optimizes those pipelines for accuracy rather than cost alone, targeting tasks that are too complex for a single LLM call to handle reliably.

System Overview

DocETL offers a declarative interface that allows users to define document processing pipelines. Pipelines are built from LLM-based operators such as map, reduce, split, and resolve, each of which performs a distinct transformation. Notably, the resolve operator performs entity resolution, which keeps references to the same entity consistent across documents.
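To make the operator vocabulary concrete, here is a minimal sketch of a declarative map, resolve, reduce pipeline in plain Python. It is not DocETL's actual interface: the function names, the `PIPELINE` structure, and the sample documents are all hypothetical, and ordinary string functions stand in for the LLM calls a real operator would make.

```python
from collections import defaultdict

# Hypothetical stand-ins for LLM calls; a real system would prompt a model here.
def extract_officer(doc):
    # "map": derive a new field from each document independently.
    return {"officer": doc["text"].split(":")[0].strip(), "text": doc["text"]}

def canonical_name(name):
    # "resolve": collapse spelling variants that refer to the same entity.
    return " ".join(name.lower().replace(".", "").split())

def summarize(group):
    # "reduce": aggregate all records that share a key.
    return {"officer": group[0]["officer"], "n_reports": len(group)}

# The pipeline is declared as data: an ordered list of (operator, function).
PIPELINE = [
    ("map", extract_officer),
    ("resolve", canonical_name),
    ("reduce", summarize),
]

def run(pipeline, docs):
    records = docs
    for op, fn in pipeline:
        if op == "map":
            records = [fn(r) for r in records]
        elif op == "resolve":
            records = [{**r, "officer": fn(r["officer"])} for r in records]
        elif op == "reduce":
            groups = defaultdict(list)
            for r in records:
                groups[r["officer"]].append(r)
            records = [fn(g) for g in groups.values()]
        else:
            raise ValueError(f"unknown operator: {op}")
    return records

docs = [
    {"text": "J. Smith: complaint filed in March"},
    {"text": "j smith: second complaint filed in May"},
    {"text": "A. Jones: commendation"},
]
results = run(PIPELINE, docs)
print(results)
```

Without the resolve step, "J. Smith" and "j smith" would land in separate groups and the reduce step would report them as two different officers, which is the consistency problem entity resolution is meant to prevent.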

Novel Contributions

DocETL distinguishes itself by focusing not only on cost reduction but also on improving the accuracy and quality of LLM outputs. To address LLM limitations on lengthy and complex documents, the authors introduce 13 novel rewrite directives designed for logical decomposition. These directives guide the synthesis of optimized plans for performing semantic projections more effectively, breaking down complex tasks into more manageable sub-tasks.

Key contributions include:

  • Agent-Driven Rewriting: LLM agents rewrite pipeline operations according to the rewrite directives, tailoring the resulting plan to each task and dataset.
  • Validation Framework: Agentic validation mechanisms are employed to ensure output quality, leveraging custom prompts to evaluate the successful execution of given tasks.
  • Opportunistic Sub-plan Optimization: By recursively applying rewrite directives, the system selectively optimizes complex operations into simpler, more effective components, focusing efforts where the most impact can be achieved.
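The interplay between rewriting and validation can be illustrated with a toy example. The sketch below assumes a stub "LLM" that returns only the first matching sentence per call (a stand-in for incomplete recall on long inputs), rewrites a single-call plan into split, map, reduce, and then picks the plan that a simple validator scores higher. All names here are hypothetical, and the keyword-counting validator is a deliberately crude substitute for the agent-synthesized validation prompts the paper describes.

```python
KEYWORD = "indemnification"

def llm_extract(text, keyword=KEYWORD):
    # Stub LLM (assumption): finds only the first matching sentence per call,
    # imitating a model that misses later instances in a long input.
    for sentence in text.split(". "):
        if keyword in sentence:
            return [sentence.strip(". ")]
    return []

def plan_single_call(doc):
    # Original plan: one call over the whole document.
    return llm_extract(doc)

def plan_split_map_reduce(doc):
    # Rewritten plan: split into sections, map over each, reduce by concatenation.
    sections = doc.split("\n\n")
    extracted = []
    for section in sections:
        extracted.extend(llm_extract(section))
    return extracted

def validate(output, keyword=KEYWORD):
    # Toy validator (assumption): counts distinct extracted clauses that
    # actually mention the keyword.
    return len({s for s in output if keyword in s})

doc = (
    "Section 1. The supplier owes indemnification for defects. Fees are due monthly.\n\n"
    "Section 2. Delivery occurs within ten days.\n\n"
    "Section 3. The buyer owes indemnification for misuse."
)

plans = {
    "single_call": plan_single_call,
    "split_map_reduce": plan_split_map_reduce,
}
outputs = {name: plan(doc) for name, plan in plans.items()}
scores = {name: validate(out) for name, out in outputs.items()}
best_name = max(scores, key=scores.get)
print(best_name, outputs[best_name])
```

The single-call plan recovers one clause while the decomposed plan recovers both, so the validator selects the rewritten plan. A real optimizer would generate many candidate rewrites and would also have to weigh the cost and latency of evaluating each one.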

Evaluation

The authors evaluate DocETL empirically across multiple document processing tasks and report that its optimized plans outperform the baselines. For instance, in police misconduct analysis, the system produced outputs 1.34 to 4.6 times higher in quality than the baselines it was compared against. These results suggest that decomposing pipelines pays off on complex unstructured data.

Implications and Future Directions

DocETL shows that optimizing LLM pipelines for accuracy, not just cost, is practical. The implications are most significant in fields that require rigorous document analysis, such as legal, medical, and governmental datasets.

The theoretical advancements offered by DocETL pave the way for further exploration into adaptive LLM-powered systems. Future work might explore integrating more advanced model architectures to handle increasingly intricate data structures, as well as expanding task-specific optimizations to bolster the system's versatility.

Conclusion

DocETL contributes a way to apply LLMs to document processing tasks whose scale and complexity defeat single-call approaches. Its declarative interface, combined with agent-driven rewriting and validation, offers a foundation for future research on LLM-powered document processing.
