Claimify: LLM-driven Claim Extraction
- Claimify is an LLM-driven framework for automated extraction and verification of factual claims, emphasizing precise handling of ambiguities and decontextualization.
- It employs a four-stage pipeline (sentence splitting, selection, disambiguation, and decomposition) and reports 99% entailment and 93.5% sentence-level F1.
- The system integrates seamlessly into broader fact-checking solutions, enabling end-to-end automated processing and replicable evaluation via configurable LLM prompts.
Claimify is an LLM-driven framework for the extraction and verification of factual claims across diverse document genres, with specialized focus on precision, ambiguity handling, and explainability. Its design philosophy emphasizes end-to-end automation, modularity, evaluation rigor, and robustness in real-world fact-checking pipelines. Claimify has been operationalized both as a standalone extraction system and as a component in broader claim-verification solutions, serving as a benchmark for accuracy and decontextualization in claim processing (Metropolitansky et al., 15 Feb 2025).
1. Core Architecture and Workflow
Claimify implements a four-stage pipeline:
- Sentence Splitting & Contextualization: Given a question and an LLM-generated multi-paragraph answer, the answer is split into sentences using tokenization (NLTK with domain-specific post-processing). For each sentence, a configurable window of preceding and following sentences is assembled as context for subsequent analysis.
- Selection (Verifiable Content Detection): For each context-enriched sentence, an LLM is prompted to categorize it as non-verifiable, partially verifiable (with rewriting), or fully verifiable. Rewriting removes content that is not independently checkable, producing a minimal, fact-focused version of the sentence.
- Disambiguation (Ambiguity Handling): Each candidate sentence undergoes explicit disambiguation, where the LLM identifies and attempts to resolve referential and structural ambiguities (e.g., pronominal references, vague time expressions, scope ambiguities). Sentences with unresolved ambiguities are excluded, enforcing a high-confidence extraction regime.
- Decomposition (Claim Generation): Disambiguated sentences are further decomposed into atomic, standalone factual claims. Claimify supports context inference using bracketed notation for externally supplied information, yielding fully decontextualized claims.
The pipeline is fully automated via programmatic orchestration of LLM prompts; all hyperparameters (e.g., window sizes, temperature) are transparently documented (Metropolitansky et al., 15 Feb 2025).
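The four stages can be sketched as a small orchestration loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the regex splitter stands in for NLTK with post-processing, the window sizes are arbitrary, the question is omitted for brevity, and `toy_select`, `toy_disambiguate`, and `toy_decompose` are hypothetical placeholders for the three LLM prompts.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ContextualizedSentence:
    sentence: str
    preceding: List[str]  # up to `before` earlier sentences
    following: List[str]  # up to `after` later sentences


def split_sentences(text: str) -> List[str]:
    """Naive regex splitter; an assumed stand-in for NLTK plus post-processing."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def contextualize(sentences: List[str], before: int = 2,
                  after: int = 1) -> List[ContextualizedSentence]:
    """Attach a window of neighbouring sentences to each sentence."""
    return [
        ContextualizedSentence(
            s, sentences[max(0, i - before):i], sentences[i + 1:i + 1 + after]
        )
        for i, s in enumerate(sentences)
    ]


def extract_claims(
    answer: str,
    select: Callable[[ContextualizedSentence], Optional[str]],
    disambiguate: Callable[[str, ContextualizedSentence], Optional[str]],
    decompose: Callable[[str, ContextualizedSentence], List[str]],
    before: int = 2,
    after: int = 1,
) -> List[str]:
    """Run selection, disambiguation, and decomposition per sentence.

    A stage returns None to drop the sentence from further processing.
    """
    claims: List[str] = []
    for item in contextualize(split_sentences(answer), before, after):
        selected = select(item)
        if selected is None:          # no verifiable content
            continue
        resolved = disambiguate(selected, item)
        if resolved is None:          # unresolved ambiguity
            continue
        claims.extend(decompose(resolved, item))
    return claims


# Hypothetical rule-based stand-ins for the LLM calls.
def toy_select(item: ContextualizedSentence) -> Optional[str]:
    return None if item.sentence.startswith("I think") else item.sentence


def toy_disambiguate(sentence: str, item: ContextualizedSentence) -> Optional[str]:
    return None if sentence.startswith("It ") else sentence


def toy_decompose(sentence: str, item: ContextualizedSentence) -> List[str]:
    return [part.strip() for part in sentence.rstrip(".").split(" and ")]


answer = ("Paris is in France and Berlin is in Germany. "
          "I think that is nice. It grew quickly.")
claims = extract_claims(answer, toy_select, toy_disambiguate, toy_decompose)
```

In this toy run the opinion sentence is dropped at selection, the sentence with the unresolved pronoun is dropped at disambiguation, and the remaining sentence is decomposed into two claims.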
2. Evaluation Metrics and Formal Framework
Claimify is evaluated along three principal axes: entailment, coverage, and decontextualization, with each operationalized through precise metrics.
- Entailment: The fraction of extracted claims $c$ in the claim set $C$ that the source text $S$ entails, as assessed by an LLM/NLI model:
$$\text{Entailment} = \frac{|\{c \in C : S \models c\}|}{|C|}$$
Claimify achieves 99% entailment, matching the top-performing baseline (VeriScore) (Metropolitansky et al., 15 Feb 2025).
- Coverage:
- Sentence-level: Treated as binary classification (claim present/absent per sentence). Standard metrics (accuracy, precision, recall, F1) apply.
- Element-level: Sentences are decomposed into atomic elements, each independently labeled as verifiable or unverifiable by an LLM. Coverage is the proportion of verifiable elements covered in the output claims, again reported with accuracy and F1 metrics.
- Claimify yields 93.5% sentence-level F1 and 91.9% element-level F1, significantly outperforming prior approaches (p < 0.001) (Metropolitansky et al., 15 Feb 2025).
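- Decontextualization: For each claim $c$, the maximally decontextualized version $c_{\max}$ is constructed, and the two are compared via search-based evidence retrieval (using both Google and Bing). Seven outcome types capture whether the evidence sets for $c$ and $c_{\max}$ yield desirable verification properties. The key metric is the share of claims with a desirable outcome:
$$\text{DecontextScore} = \frac{\#\{\text{claims with a desirable outcome}\}}{\#\{\text{claims evaluated}\}}$$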
Claimify attains 76.7% and 72.5% desirable outcomes on Google and Bing, respectively, exceeding all baselines (Metropolitansky et al., 15 Feb 2025).
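The three metric families reduce to simple ratios once the per-claim and per-sentence judgments are available. The sketch below assumes those judgments have already been produced (by an LLM or NLI model) and are passed in as plain Python values; the function names and the outcome labels are hypothetical and do not come from the paper.

```python
from typing import Hashable, Sequence, Set, Tuple


def entailment_rate(entailed: Sequence[bool]) -> float:
    """Fraction of claims judged to be entailed by the source text."""
    if not entailed:
        raise ValueError("no claims to score")
    return sum(entailed) / len(entailed)


def precision_recall_f1(gold: Sequence[bool],
                        pred: Sequence[bool]) -> Tuple[float, float, float]:
    """Binary precision, recall, and F1 (e.g. claim present/absent per sentence)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def decontext_score(outcomes: Sequence[Hashable], desirable: Set[Hashable]) -> float:
    """Share of claims whose outcome type belongs to the desirable set."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(o in desirable for o in outcomes) / len(outcomes)


# Toy judgments, made up for illustration only.
gold = [True, True, False, True]
pred = [True, False, False, True]
p, r, f1 = precision_recall_f1(gold, pred)
rate = entailment_rate([True, True, True, False])
score = decontext_score(["A", "B", "C", "A"], desirable={"A", "B"})
```

The same `precision_recall_f1` helper applies to element-level coverage if each entry represents one atomic element rather than one sentence.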
3. Ambiguity Handling and Confidence Mechanisms
A distinguishing attribute of Claimify is its explicit ambiguity management. The system operationalizes confidence through the following processes:
- Ambiguity Types: Detects referential ambiguities (e.g., third-person pronouns, implicit actors), structural ambiguities (e.g., coordination scope), and distinguishes authorial inference from stated fact.
- Resolution Criterion: The LLM is instructed to resolve ambiguities using only the current context window, and to keep an interpretation only when it is the consensus reading of that context.
- Confidence Control: If any ambiguity is deemed unresolvable, the respective sentence is dropped from further claim generation. Only contextually unambiguous, directly interpretable content propagates.
This conservative filtering ensures high trustworthiness in the extracted claims and minimizes propagation of potentially misleading or unclear information (Metropolitansky et al., 15 Feb 2025).
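The drop-on-unresolved rule can be illustrated with a small gating function. This is a sketch under assumptions: the `Ambiguity` and `DisambiguationResult` structures are hypothetical representations of what the disambiguation prompt might return, and the string substitution is a simplification of the LLM's rewriting.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Ambiguity:
    kind: str                   # e.g. "referential" or "structural"
    span: str                   # the ambiguous text in the sentence
    resolution: Optional[str]   # None when the context window does not settle it


@dataclass
class DisambiguationResult:
    sentence: str
    ambiguities: List[Ambiguity] = field(default_factory=list)


def gate(result: DisambiguationResult) -> Optional[str]:
    """Return the disambiguated sentence, or None if any ambiguity is unresolved."""
    text = result.sentence
    for amb in result.ambiguities:
        if amb.resolution is None:
            return None  # conservative: drop the whole sentence
        text = text.replace(amb.span, amb.resolution, 1)
    return text


resolved = gate(DisambiguationResult(
    "She founded the company in 1998.",
    [Ambiguity("referential", "She", "Jane Doe")],
))
dropped = gate(DisambiguationResult(
    "They raised prices recently.",
    [Ambiguity("referential", "They", None)],
))
```

A single unresolved ambiguity removes the sentence entirely, which trades some coverage for higher confidence in the claims that do pass through.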
4. Comparative Performance and Quantitative Analysis
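Claimify matches or outperforms state-of-the-art claim extraction baselines on all principal metrics; it ties VeriScore on entailment and leads on every coverage and decontextualization measure: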
| Method | Entailment (%) | Sentence Acc. (%) | Sentence F1 (%) | Element Acc. (%) | Element F1 (%) | Google DecontextScore (%) | Bing DecontextScore (%) |
|---|---|---|---|---|---|---|---|
| Claimify | 99 | 91.8 | 93.5 | 87.9 | 91.9 | 76.7 | 72.5 |
| VeriScore | 99 | 85.2 | 90.2 | 75.2 | 85.4 | 68.4 | 70.2 |
| DnD | 97 | 84.0 | 88.5 | 76.9 | 86.3 | 64.1 | 66.8 |
| SAFE | 94 | 82.5 | 87.0 | 73.8 | 84.0 | 61.5 | 63.2 |
Claimify’s advantage is especially pronounced in both coverage and decontextualization, critical metrics for robust end-to-end extraction (Metropolitansky et al., 15 Feb 2025).
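To make the size of the gaps concrete, the margin over the strongest baseline can be computed per metric directly from the table. The snippet below only restates the table's figures; the dictionary keys and the helper name are illustrative choices.

```python
# Values copied from the comparison table (all in percent).
results = {
    "Claimify":  {"entailment": 99.0, "sentence_acc": 91.8, "sentence_f1": 93.5,
                  "element_acc": 87.9, "element_f1": 91.9, "google": 76.7, "bing": 72.5},
    "VeriScore": {"entailment": 99.0, "sentence_acc": 85.2, "sentence_f1": 90.2,
                  "element_acc": 75.2, "element_f1": 85.4, "google": 68.4, "bing": 70.2},
    "DnD":       {"entailment": 97.0, "sentence_acc": 84.0, "sentence_f1": 88.5,
                  "element_acc": 76.9, "element_f1": 86.3, "google": 64.1, "bing": 66.8},
    "SAFE":      {"entailment": 94.0, "sentence_acc": 82.5, "sentence_f1": 87.0,
                  "element_acc": 73.8, "element_f1": 84.0, "google": 61.5, "bing": 63.2},
}


def margin_over_best_baseline(metric: str, system: str = "Claimify") -> float:
    """Percentage-point gap between `system` and the best other method."""
    best = max(scores[metric] for name, scores in results.items() if name != system)
    return round(results[system][metric] - best, 1)


margins = {metric: margin_over_best_baseline(metric) for metric in results["Claimify"]}
```

The largest margin is on element-level accuracy (11.0 points over DnD), while entailment shows no gap because Claimify and VeriScore both reach 99%.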
5. Automation, Scalability, and Replicability
The Claimify pipeline is intrinsically automated:
- All extraction, ambiguity adjudication, and decomposition steps are driven by LLM-based prompts with no human intervention during inference.
- Evaluation scoring (coverage, decontextualization) is itself automated via LLM judgments, greatly reducing annotation overhead.
- All prompts, pre-processing, and hyperparameters are fully documented, providing high replicability. Implementations relying on APIs such as GPT-4o are directly reproducible by external researchers (Metropolitansky et al., 15 Feb 2025).
- Context window parameters, ambiguity thresholds, and bracketed context inferences are user-configurable for task-dependent optimization.
This design supports batch and large-scale processing of long-form content and enables integration with upstream and downstream components (e.g., claim verification, evidence retrieval) in complex fact-checking pipelines.
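One way to keep such settings explicit and reproducible is a small immutable configuration object that is serialized alongside the results. This is a hypothetical sketch: the class name, field names, and default values are placeholders chosen for illustration and are not the settings documented in the paper (only the mention of GPT-4o comes from the text above).

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ExtractionConfig:
    model: str = "gpt-4o"                 # model named in the text; any LLM API could be used
    temperature: float = 0.0              # placeholder default
    preceding_sentences: int = 5          # placeholder window size
    following_sentences: int = 5          # placeholder window size
    allow_bracketed_context: bool = True  # permit bracketed context inferences

    def __post_init__(self) -> None:
        if self.preceding_sentences < 0 or self.following_sentences < 0:
            raise ValueError("window sizes must be non-negative")
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")


base = ExtractionConfig()
# Task-specific variant: a narrower context window, everything else unchanged.
narrow = replace(base, preceding_sentences=1, following_sentences=0)
# Serialize so a run can be logged and repeated with identical settings.
serialized = json.dumps(asdict(narrow), sort_keys=True)
restored = ExtractionConfig(**json.loads(serialized))
```

Freezing the dataclass prevents settings from changing mid-run, and storing the serialized form with each batch makes it possible to reconstruct the exact configuration later.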
6. Integration in End-to-End Fact-Checking and Related Ecosystems
Claimify’s high-precision claim extraction has been leveraged as a foundational module in broader claim verification architectures:
- ExClaim: Employs a modular pipeline—unsupervised rationale extraction (via BART-large-cnn), QA-framed verdict classification (T5), and natural language explanation assembly—achieving macro-F1 = 0.93 on Politifact evidence. Best practices recommend maintaining strict modularity, abstractive rationalization, and explainability auditing using XAI methods (e.g., SHAP, NLI audits) (Gurrapu et al., 2023).
- FinDVer: For financial claim verification over hybrid-content documents, robust extraction, chain-of-thought reasoning, and RAG-based retrieval are critical. Claimify’s modular design parallels the recommended approach for high accuracy in expert-domain claim evaluation; coverage, accuracy, and explanation quality remain primary benchmarks (Zhao et al., 2024).
- Multilingual Retrieval and Filtration: In multilingual environments with massive fact-check archives, Claimify (in adapted form) is integrated into retrieval, filtration (LLM-based), summarization, and veracity reasoning pipelines—demonstrating S@10 ≈ 0.84 and macro F1 ≈ 63% for three-way veracity classification (Vykopal et al., 29 Apr 2025).
Overall, Claimify’s architecture and evaluation protocol inform the development of explainable, scalable, and precise claim-processing solutions across a spectrum of domains and data modalities.
7. Practical Implications and Future Directions
The rigorous, conservative extraction regime adopted by Claimify mitigates error propagation in automated fact-checking workflows, a critical consideration when verifying LLM-generated content or high-stakes domain narratives. The integration of ambiguity resolution, strict decontextualization, and performance metrics sets a benchmark for the field.
Further research directions include extending Claimify to handle more complex claim structures (e.g., nested argumentative relations), adapting to broader language and document typologies, and integrating more granular user controls for context windowing and confidence gating. Cross-task synergy—combining Claimify’s extraction with best-in-class verification and explanation modules—remains a key avenue for future innovation (Metropolitansky et al., 15 Feb 2025, Gurrapu et al., 2023, Zhao et al., 2024, Vykopal et al., 29 Apr 2025).