Fine-Grained Evidence Extraction

Updated 3 December 2025
  • Fine-grained evidence extraction is a technique that precisely identifies and segments minimal evidence units from text and visual data to support claims.
  • It leverages detailed annotation schemas and advanced models like encoder–decoder LLMs and hypergraph neural networks to ensure high accuracy and transparency.
  • Applications include fact-checking, machine reading comprehension, and visual reasoning, improving traceability and decision-making across various domains.

Fine-grained evidence extraction is the process of identifying, segmenting, and attributing the minimal, salient units within input data—whether text or visual—that directly support or refute specific claims, answer questions, or form the basis for accurate classification and reasoning. Unlike coarse-grained extraction, which often targets broad entities or relations using generic instructions, fine-grained methods delineate atomic spans, phrases, bounding boxes, or micro-level evidence that align precisely with defined semantic or task-specific criteria. This granular approach is critical in domains requiring high traceability and transparency such as fact-checking, machine reading comprehension (MRC), scientific information extraction, and visual reasoning.

1. Foundations and Definitions

Fine-grained evidence extraction encompasses both linguistic and visual modalities. In text-centric tasks, evidence refers to contiguous token spans $\{s_1, \ldots, s_M\}$ within a document $d$ that support or refute a claim $c$, as defined in (Jarolím et al., 26 Nov 2025). The extraction target is not just “entities” but the exact text fragments—frequently at sub-phrase, phrase, or sentence level—minimally sufficient for justification. For images, evidence is defined as spatial regions (often pixel-level or bounding boxes $B = [x_{\min}, y_{\min}, x_{\max}, y_{\max}]$) summing to a tiny fraction of the overall visual field, with VER-Bench reporting a mean area ratio of $0.25\%$ for visual clues (Qiang et al., 6 Aug 2025).
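
These definitions can be made concrete with minimal container types. The following Python sketch is purely illustrative (the class and field names are not drawn from the cited papers); it represents a textual evidence span and a bounding-box visual clue along with its area ratio.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TextEvidence:
    """A contiguous evidence span within document d that supports or refutes claim c."""
    doc_id: str
    start: int  # character offset where the span begins (inclusive)
    end: int    # character offset where the span ends (exclusive)
    text: str   # the verbatim substring d[start:end]

@dataclass
class VisualEvidence:
    """A bounding-box clue B = [x_min, y_min, x_max, y_max] within an image."""
    image_id: str
    box: Tuple[float, float, float, float]

    def area_ratio(self, img_width: float, img_height: float) -> float:
        """Fraction of the image occupied by the clue (VER-Bench reports ~0.25% on average)."""
        x_min, y_min, x_max, y_max = self.box
        return ((x_max - x_min) * (y_max - y_min)) / (img_width * img_height)
```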

The scope of fine granularity varies by application:

  • Information Extraction (IE): Each information type—e.g., entity mention spans, event triggers, aspect–opinion pairs—is treated as a standalone task with dedicated extraction rules and output schemas (Gao et al., 2023).
  • Factual Verification: Atomic claims are first generated via LLM prompting and then checked for supported/refuted micro-spans in retrieval-based architectures (Boonsanong et al., 19 Mar 2025, Jarolím et al., 26 Nov 2025).
  • Reading Comprehension: MUGEN splits evidence into phrase, fragment, sentence, and passage levels; fine-grained evidence corresponds to high-correlation noun/verb phrases (Zhao et al., 2023).
  • Scholarly Extraction: GSAP-ERE annotates 10 entity types and 18 relation categories, spanning model architectures, training/evaluation relationships, and data provenance (Otto et al., 12 Nov 2025).

2. Task Design: Schemas, Instructions, and Annotation

Fine-grained evidence extraction frameworks are built on meticulously designed schemas and instruction sets:

  • Augmented Instructions (IE): Each information type—e.g., Person entity, Transaction event—is paired with four instruction components: a concise task description, precise extraction rules (possibly in LaTeX notation), output format templates (e.g., JSON arrays), and in-context examples (Gao et al., 2023); a toy template is sketched after this list. The rules may formalize constraints such as:

\mathrm{Trigger} = \arg\max_{s \subseteq X} \mathbb{I}[\mathrm{label}(s) = \text{EventTrigger}]

  • Annotation Protocols: GSAP-ERE (Otto et al., 12 Nov 2025) mandates manual curation for 63K entity mentions and 35K relation instances across 100 ML papers, ensuring high inter-annotator agreement (NER exact match macro-F1 = 0.82, RE+ micro-F1 up to 80.1% for Data Properties).
  • Schema Extensions: In biomedical PICO NER, fine-grained attributes such as arm-specific sample sizes, age, eligibility, and outcome measure types are annotated. Revised schemas merge or split categories to optimize extraction granularity (Chen et al., 26 Dec 2024).
  • Fact-Checking Annotation: For claim-document pairs, evidence spans are independently highlighted by annotators under guidelines specifying minimally sufficient, contiguous subsequences (Jarolím et al., 26 Nov 2025).
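
As a toy illustration of the augmented-instruction design referenced above, the sketch below pairs one information type with the four components (task description, extraction rules, output format, in-context examples) and assembles them into a prompt. The field names and example content are hypothetical, not the exact schema used by Gao et al. (2023).

```python
# Hypothetical augmented-instruction record for one information type (a "Person" entity).
person_instruction = {
    "task_description": "Extract every mention of a Person entity from the input text.",
    "extraction_rules": [
        "A mention is the shortest contiguous span naming an individual human.",
        "Titles (e.g., 'Dr.') are included only when attached to a name.",
    ],
    "output_format": '{"entity_type": "Person", "mentions": ["<span 1>", "<span 2>"]}',
    "in_context_examples": [
        {
            "input": "Dr. Ada Lovelace met Charles Babbage in London.",
            "output": '{"entity_type": "Person", "mentions": ["Dr. Ada Lovelace", "Charles Babbage"]}',
        }
    ],
}

def build_prompt(instruction: dict, text: str) -> str:
    """Assemble the four instruction components plus the target text into one prompt string."""
    rules = "\n".join(f"- {r}" for r in instruction["extraction_rules"])
    examples = "\n".join(
        f"Input: {ex['input']}\nOutput: {ex['output']}"
        for ex in instruction["in_context_examples"]
    )
    return (
        f"{instruction['task_description']}\n"
        f"Rules:\n{rules}\n"
        f"Output format: {instruction['output_format']}\n"
        f"Examples:\n{examples}\n"
        f"Input: {text}\nOutput:"
    )

# prompt = build_prompt(person_instruction, "Grace Hopper joined the team in 1944.")
```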

3. Model Architectures and Extraction Pipelines

Approaches to fine-grained evidence extraction vary by the underlying model and modality:

  • Encoder–Decoder LLMs: Models such as T5 and FLAN-T5 excel in generalizing to unseen types and instructions in fine-grained IE benchmarks, outperforming BLOOM and LLaMA variants in exact span-extraction tasks (Gao et al., 2023). ChatGPT demonstrates notable adaptability to novel forms of extraction.
  • Supervised PLMs: GSAP-ERE’s joint modeling via HGERE leverages a hypergraph neural network (HGNN) for simultaneous NER and relation extraction, optimizing a joint loss:

L = \alpha\, L_{\mathrm{NER}} + \beta\, L_{\mathrm{RE}}

Performance metrics: NER F1 = 80.6%, RE F1 = 54.0%, with significant gains over LLM-prompted methods (RE+ F1 as low as 10.1% for Qwen 2.5 72B) (Otto et al., 12 Nov 2025).

  • Semi-supervised Entity Recognition: FinePICO uses BiomedBERT with iterative pseudo-labeling (confidence-based, class-adaptive, or GPT-validated), merging unlabeled data for self-training (Chen et al., 26 Dec 2024).
  • Evidence Alignment for Fact-Checking: Extraction proceeds via basic instruct-prompting demanding verbatim substring extraction, with post-processing (Hungarian matching, stop-word removal) to reconcile model outputs with ground truth (Jarolím et al., 26 Nov 2025).
  • Visual Evidence Inference: MUGEN splits passages into coarse fragment evidence, then further divides into high-correlation phrases via ALBERT embeddings, with gating ($s_i > \theta\, s_{\max}$, default $\theta = 0.8$) and fusion into final decision vectors (Zhao et al., 2023). Guided Zoom applies CAM/Grad-CAM to produce saliency maps, grounding evidence in image patches most responsible for class prediction (Bargal et al., 2018).
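
The phrase-gating step can be illustrated with a short sketch. Assuming pre-computed phrase and query embeddings (e.g., from ALBERT), the cosine-similarity scoring below is a simplified stand-in for MUGEN's scoring function; only the thresholding rule $s_i > \theta\, s_{\max}$ with default $\theta = 0.8$ follows the description above.

```python
import numpy as np

def select_fine_grained_phrases(
    phrase_vecs: np.ndarray,  # (N, d) embeddings of candidate noun/verb phrases
    query_vec: np.ndarray,    # (d,) embedding of the question/option representation
    theta: float = 0.8,       # gating threshold relative to the best-scoring phrase
) -> list:
    """Keep phrases whose score exceeds theta * s_max (simplified MUGEN-style gating)."""
    # cosine similarity between each candidate phrase and the query
    scores = phrase_vecs @ query_vec / (
        np.linalg.norm(phrase_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12
    )
    s_max = scores.max()
    return [i for i, s in enumerate(scores) if s > theta * s_max]
```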

4. Quantitative Evaluation and Metrics

Fine-grained extraction quality is evaluated using span-level, token-level, or region-level metrics:

  • Token-Level F1: Precision and recall calculated over predicted and annotated token sets after alignment, common in fact-checking datasets (Jarolím et al., 26 Nov 2025):

\text{F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}

  • Hungarian Matching: Optimal pairing of extracted and annotated spans to maximize aggregate F1; a minimal sketch combining token-level F1 with Hungarian matching appears after this list.
  • Micro-Averaged F1: Used in GSAP-ERE across NER and RE tasks; exact and partial span overlap metrics provided (Otto et al., 12 Nov 2025).
  • Visual Evidence Benchmarks: VER-Bench aggregates accuracy via four GPT-judged axes—Answer Correctness (AC), Clue Coverage (CC), Reasoning Quality (RQ), Evidence-Answer Relevance (ER)—with area ratios quantifying fine-grained clue saliency (mean 0.25%) (Qiang et al., 6 Aug 2025).
  • Empirical Impact: In MUGEN, phrase evidence integration increases average accuracy by 0.4–0.7 points over fragment-only baselines (Zhao et al., 2023). Guided Zoom’s Top-1 accuracy rises by 1.6–3.1 pp over strong ResNet baselines in fine-grained visual classification (Bargal et al., 2018).
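
A minimal sketch of the token-level F1 and Hungarian-matching evaluation is shown below, using SciPy's linear_sum_assignment; whitespace tokenization and the absence of stop-word removal are simplifying assumptions, not the exact protocol of the cited work.

```python
from scipy.optimize import linear_sum_assignment

def token_f1(pred_tokens: set, gold_tokens: set) -> float:
    """Token-level F1 between one predicted span and one annotated span."""
    overlap = len(pred_tokens & gold_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def hungarian_span_f1(pred_spans: list, gold_spans: list) -> float:
    """Optimally pair predicted and annotated spans to maximize aggregate token-level F1."""
    if not pred_spans or not gold_spans:
        return 0.0
    # pairwise F1 matrix, negated because linear_sum_assignment minimizes total cost
    cost = [
        [-token_f1(set(p.lower().split()), set(g.lower().split())) for g in gold_spans]
        for p in pred_spans
    ]
    rows, cols = linear_sum_assignment(cost)
    total = sum(-cost[r][c] for r, c in zip(rows, cols))
    # average over the larger set so unmatched spans count as zero
    return total / max(len(pred_spans), len(gold_spans))
```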

5. Error Analysis and Model Insights

Consistent observations emerge regarding limitations of LLMs and extraction pipelines:

  • Verbatim Fidelity: LLMs often paraphrase or hallucinate evidence rather than copying exact substrings, leading to invalid outputs (up to 61.8% in mixtral:8×7B) (Jarolím et al., 26 Nov 2025); a simple verbatim-substring check is sketched after this list.
  • Boundary Detection: Errors are attributed to ambiguous span boundaries, misclassification, or missing low-frequency tags in NER tasks (Chen et al., 26 Dec 2024).
  • Model Scale Effects: Extraction fidelity plateaus or even worsens at large parameter counts (>120 B), underscoring the importance of architecture and explicit span-tuning over sheer size (Jarolím et al., 26 Nov 2025).
  • Visual Reasoning Gaps: Open-source MLLMs perform up to 15 points worse than closed-source on VER-Bench; clue coverage is tightly coupled to answer correctness, with small object localization remaining a challenge (Qiang et al., 6 Aug 2025).
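
The verbatim-fidelity failure mode above can be detected with a simple validity check. The sketch below tests whether an extracted span is a whitespace-normalized substring of the source document; any stricter or looser normalization (casefolding, punctuation stripping) is a design choice, not prescribed by the cited work.

```python
import re

def _normalize(s: str) -> str:
    """Collapse runs of whitespace so line breaks do not break matching."""
    return re.sub(r"\s+", " ", s).strip()

def is_verbatim(evidence: str, document: str) -> bool:
    """Return True if the extracted evidence span is copied verbatim from the document."""
    return _normalize(evidence) in _normalize(document)

# Paraphrased or hallucinated spans fail this check and would be counted as
# invalid outputs in the fidelity analysis described above.
```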

6. Application Domains and Use Cases

Fine-grained extraction techniques underpin multiple downstream and research applications:

  • Fact-Checking: Alignment of claims to minimal evidence spans ensures traceability in automated fact verification pipelines. Structured decoding and prompt engineering are key to ensuring evidence fidelity (Jarolím et al., 26 Nov 2025, Boonsanong et al., 19 Mar 2025).
  • Scientific Knowledge Graphs: GSAP-ERE facilitates mining of model–dataset–method–result triples, enabling reproducibility monitoring, leaderboard curation, and document-grounded QA within ML research (Otto et al., 12 Nov 2025).
  • Machine Reading Comprehension: Multi-grain fusion (passage, sentence, fragment, phrase) underpins state-of-the-art performance in multi-choice MRC settings (Zhao et al., 2023).
  • Medical Evidence Synthesis: Fine-grained PICO NER supports clinical trial analysis and meta-paper curation, especially under low-resource annotation regimes (Chen et al., 26 Dec 2024).
  • Visual Understanding: Benchmarks like VER-Bench and Guided Zoom clarify evidence-based reasoning, patch-specific classification, and region-level attribution within visual tasks (Qiang et al., 6 Aug 2025, Bargal et al., 2018).

7. Future Directions and Recommendations

Research identifies several promising avenues:

  • Machine-Readable Instruction Formats: Integration of formal rule encodings (regex, finite-state constraints) with hash-maps or schema-based outputs could systematize fine-grained extraction (Gao et al., 2023); a toy regex-encoded rule is sketched after this list.
  • Pointer/Grammar-Constrained Decoding: To address LLM hallucination and non-verbatim errors, designing outputs that are strict substrings via pointer networks or grammar constraints is advised (Jarolím et al., 26 Nov 2025).
  • Hybrid Training Regimes: Combining few gradient-based fine-tuning steps with in-context learning may boost adaptability in truly novel tasks (Gao et al., 2023).
  • Contextual Section Classification: For biomedical NER, leveraging sentence-section classifiers and expanding context windows can resolve ambiguities and reduce incorrect span tagging (Chen et al., 26 Dec 2024).
  • End-to-End Visual Reasoning: Fusion of region proposal with LLM attention and hybrid evaluation (IoU + reasoning criteria) may elevate fine-grained object and clue extraction (Qiang et al., 6 Aug 2025).
  • Data Augmentation: Continued scaling of annotated samples yields approximately linear F1 gains in fine-grained NER tasks (Chen et al., 26 Dec 2024).
  • Generalization: Schema and extraction paradigms are portable to domains outside ML and biomedicine, including chemical, legal, and social science corpora (Otto et al., 12 Nov 2025).
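
To make the machine-readable rule format suggestion concrete, the toy sketch below pairs one hypothetical information type with a regex constraint and an output schema. The field names and the rule itself are invented for illustration and do not reproduce any proposal from the cited papers.

```python
import re

# Hypothetical schema entry: a regex-encoded rule for extracting arm sample sizes.
ARM_SIZE_RULE = {
    "type": "arm_sample_size",
    "pattern": re.compile(r"\bn\s*=\s*(\d{1,6})\b"),
    "output_schema": {"type": "arm_sample_size", "value": "<int>"},
}

def apply_rule(rule: dict, text: str) -> list:
    """Apply a regex-encoded extraction rule and emit schema-conformant records."""
    return [
        {"type": rule["type"], "value": int(m.group(1))}
        for m in rule["pattern"].finditer(text)
    ]

# apply_rule(ARM_SIZE_RULE, "The treatment arm (n = 42) and control arm (n=40) ...")
# -> [{'type': 'arm_sample_size', 'value': 42}, {'type': 'arm_sample_size', 'value': 40}]
```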

Fine-grained evidence extraction represents a pivotal step toward robust, transparent, and context-aware information extraction and decision-making across linguistically and visually rich domains, with ongoing research emphasizing the synergy of explicit instruction, architectural choices, and domain-specific curation.
