
DoclingAgent: Tool-Augmented Information Extraction

Updated 19 April 2026
  • DoclingAgent is a stateful, tool-augmented system designed for extracting key-value pairs and answering queries from diverse, semi-structured regulatory documents.
  • It employs a dynamic planner–executor–responder control loop alongside a modular tool registry to adapt extraction processes based on document modality and prevent execution loops.
  • The system reaches 100% JSON validity and improves key match and value similarity over LLM-only baselines, while enforcing safeguards that keep extraction outputs traceable and verified.

DoclingAgent is a stateful, tool-augmented information extraction (IE) system designed to address the challenges of extracting structured data from highly variable, multilingual, and semi-structured regulatory documents, with an initial focus on EU Declaration of Performance (DoP) PDFs. It integrates a planner–executor–responder control loop with a modular tool registry, enabling adaptive orchestration across a wide diversity of document layouts, languages, and modalities. The architecture is engineered for robust, traceable extraction of key-value pairs (KVP) and question answering (QA), particularly emphasizing execution safeguards and verification to prevent hallucination and premature output (Colakoglu et al., 15 Sep 2025).

1. Architectural Overview

At its core, DoclingAgent employs a triadic control structure consisting of planner, executor, and responder modules. These communicate exclusively through two central data constructs:

  • AgentState: A Pydantic-style mutable object encapsulating all contextual state, including user_intent (either kvp_extraction or question_answering), document_metadata (covering source path, modality flags, and language codes), document_text (from direct extraction or OCR), outputs (extracted_kvps, qa_answers, verification_results), and full reasoning/tool histories.
  • AgentStatus: A finite-state control flag with possible values in $\{\mathrm{PLAN}, \mathrm{NEED\_TOOL}, \mathrm{RESPOND}, \mathrm{SUCCESS}, \mathrm{END}\}$, governing which module is currently active.
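The two constructs above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses standard-library dataclasses in place of Pydantic to stay self-contained, and the field types, defaults, and the `to_json` helper are assumptions; only the field and status names come from the description.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class AgentStatus(str, Enum):
    """Control flag naming which module acts next."""
    PLAN = "PLAN"
    NEED_TOOL = "NEED_TOOL"
    RESPOND = "RESPOND"
    SUCCESS = "SUCCESS"
    END = "END"


@dataclass
class AgentState:
    """Mutable context shared by planner, executor, and responder."""
    # "kvp_extraction", "question_answering", or None while still ambiguous.
    user_intent: Optional[str] = None
    # Source path, modality flags, language codes (layout is assumed).
    document_metadata: Dict[str, Any] = field(default_factory=dict)
    # Text from direct extraction or OCR.
    document_text: str = ""
    extracted_kvps: Dict[str, Any] = field(default_factory=dict)
    qa_answers: List[str] = field(default_factory=list)
    verification_results: Dict[str, Any] = field(default_factory=dict)
    reasoning_history: List[str] = field(default_factory=list)
    tool_history: List[Dict[str, Any]] = field(default_factory=list)

    def to_json(self) -> str:
        # Hypothetical helper: serialize the full state for audit logs.
        return json.dumps(asdict(self), ensure_ascii=False)


state = AgentState(user_intent="kvp_extraction",
                   document_metadata={"path": "example.pdf", "language": "de"})
snapshot = state.to_json()
```

Serializing the whole state at every step is what makes each decision inspectable after the fact.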

User requests comprise a PDF and a specified extraction or QA task. The system serializes all inter-module communication as JSON, ensuring complete auditability and supporting downstream inspection and debugging. This architecture supports full separation of concerns: the planner decides what action to take; the executor runs tools and updates state; and the responder finalizes outputs only when verification criteria are met (Colakoglu et al., 15 Sep 2025).
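The separation of concerns described here can be illustrated with a small dispatch loop. The sketch below is hypothetical: the planner, executor, and responder are stubs, the transition rules between statuses are assumed rather than taken from the source, and `max_steps` is an added safety bound.

```python
import json


def planner(state):
    # Stub: request text extraction once, then hand over to the responder.
    if not state["document_text"]:
        action = {"tool": "extract_text_direct",
                  "inputs": {"path": state["path"]}}
        return "NEED_TOOL", action
    return "RESPOND", None


def executor(state, action):
    # Stub: a real executor would look the tool up and run it.
    state["document_text"] = "placeholder text"
    state["tool_history"].append(action)


def responder(state):
    # Stub: finalize only when there is something to report.
    return "SUCCESS" if state["document_text"] else "PLAN"


def run(state, max_steps=10):
    status, action, trace = "PLAN", None, []
    for _ in range(max_steps):
        if status == "PLAN":
            status, action = planner(state)
        elif status == "NEED_TOOL":
            executor(state, action)
            status, action = "PLAN", None
        elif status == "RESPOND":
            status = responder(state)
        else:  # SUCCESS or END: nothing left to do
            break
        # Every transition is logged as JSON for later inspection.
        trace.append(json.dumps({"status": status, "action": action}))
    return status, trace


final_status, trace = run(
    {"path": "example.pdf", "document_text": "", "tool_history": []}
)
```

Because only the loop changes `status`, each module can be tested in isolation against a serialized state.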

2. Intent Inference and Document Modality Adaptation

Upon initiation, the planner enters the PLAN state. If the user’s intent is ambiguous in AgentState, the planner invokes the LLM-based “classify_intent” tool, prompting GPT-4o to infer whether the task is KVP extraction or QA. The pipeline next evaluates document modality with the “check_if_scanned” tool, which inspects PDF metadata and attempts direct text extraction. If the output is empty or contains more than 30% non-printable characters, the file is marked as scanned.

Depending on modality, the planner dynamically chooses either “extract_text_direct” (for digital PDFs) or “extract_text_ocr” (for scanned PDFs), ensuring downstream operations automatically adapt to the inherent format of each input. This avoids brittleness typical of static pipelines or monolithic LLM approaches (Colakoglu et al., 15 Sep 2025).
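The scanned-document heuristic and the resulting tool choice can be sketched as below. This is an illustration under stated assumptions: the source gives only the "empty or more than 30% non-printable" rule, so the definition of non-printable (Python's `str.isprintable`, with newlines, carriage returns, and tabs treated as acceptable), the treatment of whitespace-only text as empty, and the helper names are all hypothetical. The PDF metadata inspection is omitted.

```python
def non_printable_ratio(text: str) -> float:
    """Fraction of characters that are neither printable nor common whitespace."""
    if not text:
        return 1.0
    bad = sum(1 for ch in text if not (ch.isprintable() or ch in "\n\r\t"))
    return bad / len(text)


def looks_scanned(extracted_text: str, threshold: float = 0.30) -> bool:
    """Apply the rule: empty output, or more than 30% non-printable characters."""
    if not extracted_text.strip():
        return True
    return non_printable_ratio(extracted_text) > threshold


def choose_extractor(extracted_text: str) -> str:
    """Map the modality decision to the tool names used in the text."""
    if looks_scanned(extracted_text):
        return "extract_text_ocr"
    return "extract_text_direct"
```

The input to `choose_extractor` is whatever a direct text-extraction attempt returned; a digital PDF yields readable text, while a scanned one typically yields nothing or garbage bytes.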

3. Dynamic Tool Orchestration and Planning Logic

Atomic capabilities (translation, key-value decoding, verification, etc.) are managed via a uniform Tool abstraction. Every tool specifies:

  • name (e.g., extract_text_ocr)
  • intent (e.g., text_extraction, qa_answering)
  • compatible_pdf_types (e.g., ["scanned"], ["digital"])
  • input/output schemas (Pydantic classes)
  • optional pre/postprocessing callbacks

A global ToolRegistry, loaded at startup, unifies classic utilities (OCR, PDF parsing) with LLM-based modules (language identification, answering, verification). Each planning step filters candidates with

candidates = ToolRegistry.compatible_with(state, current_intent)

and constructs a prompt for GPT-4o, providing summaries of AgentState, each tool’s description, and explicit decision rules (e.g., avoid OCR extraction if the file is digital). GPT-4o is run at temperature=0 to make its decisions as repeatable as possible, and returns a JSON object naming the next tool and its inputs, or signals that no further tool call is needed.
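A possible shape for the Tool abstraction and the candidate filter is sketched below. It is an assumption-laden illustration: input/output schemas and callbacks are left out, the state is a plain dict with a hypothetical `pdf_type` metadata key, and the registry is used as an instance rather than through the class-level call shown above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Tool:
    name: str
    intent: str
    compatible_pdf_types: Tuple[str, ...]
    run: Callable[[dict], dict]  # stand-in for the real tool body


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def compatible_with(self, state: dict, current_intent: str) -> List[Tool]:
        """Keep tools that match both the current intent and the PDF type."""
        pdf_type = state["document_metadata"]["pdf_type"]
        return [
            tool for tool in self._tools.values()
            if tool.intent == current_intent
            and pdf_type in tool.compatible_pdf_types
        ]


registry = ToolRegistry()
registry.register(Tool("extract_text_direct", "text_extraction",
                       ("digital",), lambda inputs: {"text": ""}))
registry.register(Tool("extract_text_ocr", "text_extraction",
                       ("scanned",), lambda inputs: {"text": ""}))

state = {"document_metadata": {"pdf_type": "digital"}}
candidates = registry.compatible_with(state, "text_extraction")
candidate_names = [tool.name for tool in candidates]
```

Filtering before prompting keeps the planner's choice set small, so the decision rules in the prompt only need to disambiguate among tools that are already applicable.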

Tool selection intentionally avoids formal cost functions or POMDP frameworks; loop prevention and fallback are encoded directly via rules in the prompt. The planner refuses to call the same tool with identical inputs, enforcing progression and preventing execution loops (Colakoglu et al., 15 Sep 2025).

4. Execution Safeguards and Output Verification

Upon entering NEED_TOOL, the executor locates the specified tool, normalizes the input to its schema, and executes the underlying process (OCR, parser, LLM, etc.). Postprocessing ensures consistent output formats. Each invocation records input, output, and outcome to tool_history, updating AgentState accordingly.

For loop prevention, the executor hashes the last tool invocation and input, making this history visible to the next planning prompt. If a redundant call is detected (tool + input identical to last), the planner transitions directly to RESPOND with an explicit reasoning message indicating loop avoidance.
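The redundancy check can be sketched as a fingerprint comparison. The source states only that the last invocation and its input are hashed; the choice of SHA-256 over a key-sorted JSON payload, and the `tool`/`inputs` layout of history entries, are assumptions made for illustration.

```python
import hashlib
import json


def invocation_fingerprint(tool_name: str, inputs: dict) -> str:
    """Stable hash of a tool call; key order in `inputs` does not matter."""
    payload = json.dumps({"tool": tool_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_redundant(tool_history: list, tool_name: str, inputs: dict) -> bool:
    """True if the proposed call repeats the most recent call exactly."""
    if not tool_history:
        return False
    last = tool_history[-1]
    previous = invocation_fingerprint(last["tool"], last["inputs"])
    return previous == invocation_fingerprint(tool_name, inputs)


history = [{"tool": "extract_text_ocr",
            "inputs": {"path": "example.pdf", "lang": "de"}}]
repeat = is_redundant(history, "extract_text_ocr",
                      {"lang": "de", "path": "example.pdf"})
fresh = is_redundant(history, "extract_text_ocr",
                     {"lang": "en", "path": "example.pdf"})
```

When `is_redundant` returns true, the planner would move to RESPOND with a reasoning message instead of issuing the call again.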

The responder finalizes output only when either (a) for KVP extraction, 100% key coverage has been verified and verification_result indicates success; or (b) for QA, the answer is populated and verify_extraction confirms it is grounded in document_text. Otherwise, the responder issues a fallback notification and returns to PLAN, protecting against premature or hallucinated outputs. This guarded progression enforces both correctness and stability, improving over minimal-prompt or LLM-only baselines (Colakoglu et al., 15 Sep 2025).
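The two release conditions can be written as a single gate. The following is a sketch, assuming a dict-based state in which `verification_results` carries a boolean `success` flag and `expected_keys` lists the schema's required keys; neither layout is specified in the source.

```python
def ready_to_respond(state: dict, expected_keys: list) -> bool:
    """Check the release condition for the current intent."""
    verified = state.get("verification_results", {}).get("success") is True
    intent = state.get("user_intent")
    if intent == "kvp_extraction":
        extracted = state.get("extracted_kvps", {})
        # 100% key coverage: every expected key must be present.
        full_coverage = all(key in extracted for key in expected_keys)
        return full_coverage and verified
    if intent == "question_answering":
        # An answer must exist and be confirmed as grounded in the text.
        return bool(state.get("qa_answers")) and verified
    return False


def responder(state: dict, expected_keys: list) -> str:
    """Return the next status: finalize, or fall back to planning."""
    return "SUCCESS" if ready_to_respond(state, expected_keys) else "PLAN"


partial = {"user_intent": "kvp_extraction",
           "extracted_kvps": {"product_type": "screed"},
           "verification_results": {"success": True}}
next_status = responder(partial, ["product_type", "manufacturer"])
```

In this example one expected key is missing, so the gate sends control back to PLAN even though verification itself succeeded.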

5. Evaluation Metrics and Empirical Performance

Evaluation employs a curated DoP dataset comprising 52 PDFs (96% German, 85% digital text, remainder scanned), human-annotated for 12 top-level KVPs (with nested "Declared Performance" and "Signature" subtrees), yielding 4,230 KVP instances and 4,192 QA pairs (downsampled to 1,440 for evaluation).

Metrics include:

  • JSON validity per $(d, \ell)$ pair:

$$\text{valid}_{d,\ell} = \begin{cases} 1 & \text{if the prediction parses as JSON} \\ 0 & \text{otherwise} \end{cases}$$

  • Key Match Ratio:

$$\text{KeyMatch}_{d,\ell} = \frac{|K_{\mathrm{gt}} \cap K_{\mathrm{pred}}|}{|K_{\mathrm{gt}}|}$$

  • Value similarity metrics: exact match (EM), BLEU, and ROUGE-L as detailed in the source:

$$\mathrm{EM}(v_{\mathrm{gt}}, v_{\mathrm{pred}}) = \begin{cases} 1 & \text{if } v_{\mathrm{gt}} = v_{\mathrm{pred}} \\ 0 & \text{otherwise} \end{cases}$$
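The three formulas above translate directly into code. The sketch below treats keys as a flat collection of strings, which is a simplifying assumption: the dataset also contains nested subtrees, and how key paths are flattened for comparison is not described here. BLEU and ROUGE-L are omitted.

```python
import json


def json_valid(prediction: str) -> int:
    """1 if the prediction parses as JSON, 0 otherwise."""
    try:
        json.loads(prediction)
    except (json.JSONDecodeError, TypeError):
        return 0
    return 1


def key_match_ratio(gt_keys, pred_keys) -> float:
    """|K_gt ∩ K_pred| / |K_gt| for one document."""
    gt = set(gt_keys)
    if not gt:
        raise ValueError("ground-truth key set must not be empty")
    return len(gt & set(pred_keys)) / len(gt)


def exact_match(v_gt: str, v_pred: str) -> int:
    """1 only if the two values are literally identical."""
    return int(v_gt == v_pred)


ratio = key_match_ratio(["a", "b", "c", "d"], ["a", "c", "x"])
```

In the example, two of the four ground-truth keys are recovered, so the ratio is 0.5; the extra predicted key "x" does not affect it, because the denominator counts only ground-truth keys.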

DoclingAgent attains 100% JSON validity, KeyMatchRatio = 1.0 on fixed schema and approximately 0.56 on open schema, far exceeding minimal-prompt baselines. For value similarity, it improves over GPT-4o(T+S) by +70% EM, +69% BLEU, and +57% ROUGE-L in English, with similar improvements in German. QA exact match scores are 0.323 (en) / 0.494 (de) on flat questions and 0.370 (en) / 0.455 (de) on nested questions, outperforming all baselines in cross-lingual contexts. Remaining errors concentrate in open-schema fields and deeply nested key structures, traceable to minor mismatches in key-paths and structure (Colakoglu et al., 15 Sep 2025).

6. Limitations and Extensions

Several limitations are documented. Overly strict literal matching penalizes semantically correct variants such as orthographic or unit-format deviations; the authors recommend integrating fuzzy, schema-aware matching to improve recall. Current planning logic, although rich in explicit rules, is static; future work may leverage learned cost functions or confidence-driven self-reflection to proactively optimize tool selection. QA on nested document structures is notably challenging; incorporation of additional visual layout signals (such as table boundaries or relative field positioning) appears necessary for substantial gains.
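To make the first limitation concrete, the sketch below shows one way a tolerant value comparison could look. It is purely illustrative and is not the authors' proposal: the normalization rules (case folding, whitespace removal, decimal comma to decimal point), the use of `difflib.SequenceMatcher`, and the 0.9 threshold are all assumptions, and it covers only the fuzzy part, not schema awareness.

```python
import re
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Remove formatting differences that do not change the meaning."""
    v = value.casefold().strip()
    v = re.sub(r"\s+", "", v)              # "12 mm" -> "12mm"
    v = re.sub(r"(?<=\d),(?=\d)", ".", v)  # "0,5" -> "0.5" (assumes decimal comma)
    return v


def fuzzy_match(v_gt: str, v_pred: str, threshold: float = 0.9) -> bool:
    """Accept values that are equal after normalization or nearly so."""
    a, b = normalize(v_gt), normalize(v_pred)
    if a == b:
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold


unit_variant = fuzzy_match("12 mm", "12mm")
different_class = fuzzy_match("Class A1", "Class E")
```

A literal comparison rejects "12 mm" against "12mm", while this comparison accepts it and still rejects values that differ in content, such as two distinct classification codes. The decimal-comma rule would misread thousands separators, so a real matcher would need per-field rules.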

The underlying framework—the tool registry, stateful planning loop, and dynamic modality adaptation—is not limited to DoP documents. It is directly portable to other regulated domains such as medical device certificates or financial disclosures. By introducing new schema templates, domain-specific parsing tools, or intent-classification prompts, the same agentic system can be rapidly bootstrapped for a wide spectrum of document-centric LLM use cases requiring structured, verifiable extraction (Colakoglu et al., 15 Sep 2025).

7. Context and Significance within Information Extraction

DoclingAgent demonstrates that robust IE in regulated domains benefits from explicit, auditable state tracking, dynamic multimodal adaptation, and modular tool orchestration—contrasting with monolithic LLM approaches that suffer from hallucination and limited adaptability to structural document diversity. Its architecture establishes that agentic, stateful systems can offer substantial improvements in validity, coverage, and cross-lingual accuracy, particularly for semi-structured, multilingual corpora. This suggests a promising blueprint for building resilient, verifiable IE pipelines in high-stakes domains where output correctness and auditability are paramount (Colakoglu et al., 15 Sep 2025).

References (1)
