DocETL: Agentic Document Processing Framework
- DocETL is a declarative, agentic framework for processing unstructured documents using a YAML DSL and LLM-powered operators.
- Its agent-based rewriting and logical–physical decomposition optimize pipeline execution for output accuracy rather than cost.
- Validation agents empirically select the best plans, enhancing extraction recall, precision, and data completeness in diverse domains.
DocETL is an agentic, optimization-driven framework for declaratively specifying, rewriting, and executing complex unstructured document processing pipelines, most prominently applied to LLM-augmented document extraction, transformation, and querying. The system is characterized by agent-based logical and physical pipeline rewriting, adaptive orchestration of LLM calls (with purpose-specific prompt and evaluation synthesis), and automatic plan selection guided by in situ accuracy metrics rather than cost minimization. This approach marks a significant shift from cost-centric LLM orchestration to accuracy-centric multimodal document analytics, enabling robust processing of corpora in domains such as legal, biomedical, and governmental records.
1. Declarative Pipeline Specification and System Structure
DocETL uses a YAML-based DSL that lets users specify document analytics workflows as pipelines of high-level operators. Supported operator types include both LLM-powered primitives (map, reduce, filter, resolve, equijoin) and auxiliary structuring steps (split, gather, unnest). Each operator expresses a transformation or extraction intent, which is compiled into LLM calls carrying explicit, operator-local prompts.
The system compiles these specifications into executable plans, decomposing complex documents into tractable “chunks” and orchestrating processing, aggregation, and join logic across the document’s structure.
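The sketch below illustrates what such a declarative specification can look like, expressed as YAML parsed from Python. The specific key names (datasets, operations, prompt, output_schema, pipeline) are illustrative assumptions in the spirit of the DSL described here, not the exact DocETL schema.

```python
# Schematic pipeline specification in the spirit of the YAML DSL described above.
# Key names are illustrative assumptions, not the exact DocETL schema.
import yaml  # pip install pyyaml

PIPELINE_SPEC = """
datasets:
  contracts:
    type: file
    path: contracts.json          # hypothetical input: one JSON record per document

operations:
  - name: extract_clauses
    type: map                     # LLM-powered map over each document
    prompt: |
      Extract every indemnification clause from the contract below.
      Return a JSON list of clause texts.
      Contract: {{ input.text }}
    output_schema:
      clauses: list[str]

  - name: summarize_clauses
    type: reduce                  # aggregate clause lists per contract
    reduce_key: contract_id
    prompt: |
      Summarize the indemnification obligations implied by these clauses:
      {{ inputs.clauses }}
    output_schema:
      summary: str

pipeline:
  steps:
    - name: clause_analysis
      input: contracts
      operations: [extract_clauses, summarize_clauses]
"""

spec = yaml.safe_load(PIPELINE_SPEC)
print([op["name"] for op in spec["operations"]])  # ['extract_clauses', 'summarize_clauses']
```

Because the pipeline is pure data in this style, an optimizer can rewrite the operator list without touching any user code.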
DocETL's core architecture includes two interacting agentic subsystems:
- Generation agents: responsible for automatically rewriting pipelines according to logical and physical rewrite directives. These agents expand, decompose, and optimize stages of the pipeline by generating alternative execution plans.
- Validation agents: synthesize custom validator prompts and coordinate LLM-based evaluation of sampled pipeline outputs, generating plan quality signals that guide search and pruning.
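A minimal sketch of these two roles as interfaces follows; the method names (propose_rewrites, synthesize_validator, score_outputs) are hypothetical stand-ins rather than DocETL's actual internal API.

```python
# Illustrative interfaces for the two agentic subsystems; names are
# hypothetical and do not reflect DocETL's internal API.
from typing import Any, Protocol

Plan = list[dict[str, Any]]  # a pipeline plan as an ordered list of operator configs


class GenerationAgent(Protocol):
    def propose_rewrites(self, plan: Plan) -> list[Plan]:
        """Apply logical/physical rewrite directives to produce candidate plans."""
        ...


class ValidationAgent(Protocol):
    def synthesize_validator(self, task_description: str) -> str:
        """Draft a task-specific validator prompt for judging sampled outputs."""
        ...

    def score_outputs(self, validator_prompt: str, outputs: list[dict[str, Any]]) -> float:
        """Use an LLM judge to turn sampled outputs into a plan-quality signal."""
        ...
```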
2. Agent-Based Logical and Physical Pipeline Rewriting
DocETL introduces an agent-driven approach to query plan optimization. Key to this is the deployment of a suite of “rewrite directives”, which represent high-level patterns (abstract rules) for decomposing, reorganizing, or augmenting pipeline fragments. Thirteen such directives are currently documented.
Logical rewriting examples include:
- Decomposition: Expanding a single map over a long document into a sequence such as Split → Gather → Map → Unnest → Reduce. This is essential when document length or complexity exceeds LLM attention window or degrades prompt fidelity.
- Gleaning directive: Map ⟹ Map → (Map_v → Map_i){≤k}, where after an initial mapping, a validator map (Map_v) provides feedback and an improvement map (Map_i) applies iterative refinements (up to k rounds) to raise output quality; a sketch of this loop appears after this list.
- Duplicate key resolution: Rewrite patterns are invoked to standardize identifiers, e.g., to canonicalize entity mentions across document chunks.
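As referenced above, the following is a minimal sketch of the gleaning loop, assuming a generic llm callable (any chat-completion function) rather than a specific DocETL helper: an initial map is followed by up to k validator–improver rounds.

```python
# Minimal sketch of the gleaning rewrite: Map -> (Map_v -> Map_i) up to k rounds.
# `llm` is an assumed stand-in for any chat-completion call; not DocETL's API.
from typing import Callable


def gleaning_map(document: str, task_prompt: str, llm: Callable[[str], str], k: int = 2) -> str:
    # Initial map over the document.
    output = llm(f"{task_prompt}\n\nDocument:\n{document}")

    for _ in range(k):
        # Map_v: a validator map critiques the current output.
        feedback = llm(
            "You are validating an extraction. List concrete problems, "
            "or reply 'OK' if the output is complete and faithful.\n\n"
            f"Task: {task_prompt}\nOutput: {output}\nDocument:\n{document}"
        )
        if feedback.strip().upper().startswith("OK"):
            break  # validator is satisfied; stop early
        # Map_i: an improvement map revises the output using the feedback.
        output = llm(
            f"Task: {task_prompt}\nPrevious output: {output}\n"
            f"Validator feedback: {feedback}\nDocument:\n{document}\n"
            "Produce an improved output."
        )
    return output
```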
Physical rewrite directives tune chunk size, batch size, and the ordering of unnest and aggregation steps, and are selected based on empirical plan performance.
Agents do not commit to a single transformation; instead, they enumerate and synthesize alternative logical–physical combinations, each evaluated downstream.
3. Plan Optimization and Evaluation Framework
Distinct from the cost-oriented optimization of typical LLM pipelines, DocETL’s plan selection is driven by output accuracy and quality metrics. The optimization process has three stages:
- Candidate enumeration: The system traverses the initial pipeline, applying rewrite directives and generating candidate sub-plans recursively.
- Evaluation via validation agents: For each candidate plan, a validator agent crafts a task-specific prompt to elicit (from an LLM) a score, pairwise comparison, or graded assessment of sampled outputs.
- Selection and replacement: The system aggregates evaluation results, selects the empirically best-performing plan, and substitutes it into the overall pipeline.
This opportunistic, empirical approach is reminiscent of the Cascades framework but focuses on maximizing plan output quality (e.g., F1, recall, error reduction), sometimes at the expense of cost or latency.
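The following compressed sketch illustrates the three-stage loop, with hypothetical callables standing in for the generation and validation agents; it is a schematic of the selection logic, not DocETL's optimizer code.

```python
# Sketch of accuracy-driven plan selection: enumerate rewrites, score sampled
# outputs with an LLM validator, keep the best plan. Helper names are hypothetical.
from typing import Any, Callable

Plan = list[dict[str, Any]]


def optimize_operator(
    base_plan: Plan,
    sample_docs: list[str],
    propose_rewrites: Callable[[Plan], list[Plan]],     # generation agent
    execute: Callable[[Plan, list[str]], list[dict]],   # runs a plan on sampled documents
    score_outputs: Callable[[list[dict]], float],       # validation agent (LLM judge)
) -> Plan:
    # 1. Candidate enumeration: the original plan plus agent-generated rewrites.
    candidates = [base_plan] + propose_rewrites(base_plan)

    # 2. Evaluation: run each candidate on sampled documents and score its outputs.
    scored = []
    for plan in candidates:
        outputs = execute(plan, sample_docs)
        scored.append((score_outputs(outputs), plan))

    # 3. Selection: keep the empirically best-scoring plan (accuracy-first,
    #    regardless of whether it is also the cheapest).
    best_score, best_plan = max(scored, key=lambda pair: pair[0])
    return best_plan
```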
4. Logical Rewriting and LLM-Oriented Decomposition Techniques
A foundational element of DocETL's accuracy gains is its systematic decomposition of LLM-intensive operators. For instance, a document-spanning extraction task is logically rewritten as Split → Gather → Map → Unnest → Reduce.
Each phase targets distinct weaknesses of LLMs: context window overflow (Split), loss of relevant context (Gather), prompt complexity (Map), handling of list-structured outputs (Unnest), and need for aggregation (Reduce).
These decompositions are not heuristic defaults; they are agent-generated, plan-space alternatives subject to empirical evaluation, enabling DocETL to localize prompt errors, reduce hallucinations, and improve recall or precision where monolithic LLM calls routinely fail.
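A minimal sketch of this decomposition over a single long document is shown below, using naive fixed-size chunking and a generic llm callable as simplifying assumptions; it is illustrative, not DocETL's chunking logic.

```python
# Sketch of the Split -> Gather -> Map -> Unnest -> Reduce decomposition for one
# long document. Chunking and prompts are simplified assumptions.
from typing import Callable


def decomposed_extract(document: str, llm: Callable[[str], str],
                       chunk_size: int = 4000, context_chars: int = 500) -> str:
    # Split: break the document into chunks that fit the model's context window.
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]

    per_chunk_items: list[list[str]] = []
    for idx, chunk in enumerate(chunks):
        # Gather: prepend a slice of the preceding chunk so local references resolve.
        context = chunks[idx - 1][-context_chars:] if idx > 0 else ""
        # Map: run the extraction prompt on the contextualized chunk.
        raw = llm(
            "Extract all clauses relevant to the task as a newline-separated list.\n\n"
            f"Preceding context:\n{context}\n\nChunk:\n{chunk}"
        )
        per_chunk_items.append([line for line in raw.splitlines() if line.strip()])

    # Unnest: flatten the per-chunk lists into one stream of items.
    items = [item for chunk_items in per_chunk_items for item in chunk_items]

    # Reduce: aggregate the flattened items into a single document-level answer.
    return llm("Deduplicate and summarize these extracted items:\n" + "\n".join(items))
```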
5. Accuracy Improvements and Performance Benchmarks
DocETL’s evaluation framework demonstrates consistent output quality improvements over previous systems. Across multiple benchmarks:
- Legal contract analysis (CUAD): ~25% F1 and up to 54% recall improvement over baseline LLM extraction and prior frameworks (LOTUS, Palimpzest).
- Video game review mining: Reduction in LLM hallucination rate by ~33% (relative), and a 34% gain in temporal extraction ordering (Kendall’s Tau).
- Biomedical adverse drug event classification: 33–50% improvement in rank-precision (RP@5, RP@10) compared with LOTUS.
- Police misconduct records: up to 1.34× higher output accuracy, with data completeness rising from 67% to 92%.
These improvements derive from agentic pipeline rewriting and targeted validation, confirming the significance of the framework’s approach to accuracy-centric optimization.
6. Representative Applications and Domain Coverage
DocETL’s operator and rewriting primitives are domain-agnostic and have been evaluated in tasks including:
- Contractual clause extraction in long legal documents.
- Entity and key event resolution (entity deduplication, date and location normalization) in law enforcement and governmental records.
- Biomedical literature mining for adverse event extraction and multi-label classification.
- Structured aggregation of multi-chunk scientific, real estate, or consumer review documents.
The system supports both conventional extraction (e.g., span collection) and higher-order operations (duplicate resolution, aggregation, annotation), with declarative expressiveness and domain-adaptive LLM prompt handling.
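As an illustration of the higher-order resolve behavior, the sketch below clusters entity mentions by comparing only pairs that share a cheap blocking key; the same_entity predicate stands in for an LLM comparison prompt, and the routine as a whole is illustrative rather than DocETL's resolve implementation.

```python
# Illustrative entity resolution with blocking: only mention pairs that share a
# cheap blocking key are sent to the (assumed) LLM comparator `same_entity`.
from itertools import combinations
from typing import Callable


def resolve_entities(mentions: list[str],
                     same_entity: Callable[[str, str], bool]) -> list[set[str]]:
    # Blocking: group mentions by a crude key (first token, lowercased) so that
    # pairwise comparisons stay within small blocks. Assumes non-empty mentions.
    blocks: dict[str, list[str]] = {}
    for m in mentions:
        blocks.setdefault(m.split()[0].lower(), []).append(m)

    # Cluster matched pairs within each block.
    clusters: list[set[str]] = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if same_entity(a, b):  # e.g. an LLM prompt asking "same entity?"
                hit_a = next((c for c in clusters if a in c), None)
                hit_b = next((c for c in clusters if b in c), None)
                if hit_a and hit_b and hit_a is not hit_b:
                    hit_a |= hit_b
                    clusters.remove(hit_b)
                elif hit_a:
                    hit_a.add(b)
                elif hit_b:
                    hit_b.add(a)
                else:
                    clusters.append({a, b})

    # Mentions never matched become singleton clusters.
    matched = set().union(*clusters) if clusters else set()
    clusters.extend({m} for m in mentions if m not in matched)
    return clusters
```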
7. Future Research Directions
Planned advances for DocETL include:
- Richer declarative language constructs (advanced joins, filtering, projection synthesis directives).
- Further agentic integration of human-in-the-loop signals, enabling custom validator prompt engineering and override.
- Cost–accuracy modeling and physical parameter optimization (symbolic chunk sizing, adaptive batching).
- Broader scalability via refined blocking strategies in resolve/join operators.
- Integration of provenance tracking and interactive visual pipeline construction tools.
- Expanding LLM model support, including open-source alternatives to further reduce cost, and tighter coupling with domain-adaptive validation modules.
A plausible implication is that, as declarative agentic frameworks (including Palimpzest and LOTUS) proliferate, DocETL's agentic rewriting and validation-driven optimization model will become increasingly significant for accuracy-critical document processing in research and enterprise environments.
Table: Core Innovations and Features of DocETL
| Feature | Description | Benefit |
|---|---|---|
| Agent-based rewrite directives | Abstract, flexible plan expansion/generation mechanisms | Empirical adaptation to LLM weaknesses |
| Validation-driven plan selection | LLM-powered output quality scoring via validator agents | Prioritizes accuracy in pipeline search |
| Modular, declarative YAML DSL | Pipelines composed of map, split, gather, join, and reduce operators | End-user transparency and flexibility |
| Logical–physical decomposition | Chunking, context gathering, iterative improvement | Reduces prompt error, mitigates context loss |
| Domain-agnostic operator set | Supports law, biomedicine, government, review mining | Broad application spectrum |
References
- DocETL: Agentic Query Rewriting and Evaluation for Complex Document Processing (Shankar et al., 16 Oct 2024)
- PalimpChat: Declarative and Interactive AI analytics (Liu et al., 5 Feb 2025)