MultiVerS: Scientific Claim Verification
- MultiVerS is a multitask neural architecture for scientific claim verification that jointly assigns document-level labels and selects sentence-level rationales.
- It leverages full-context encoding with the Longformer and global attention to accurately capture cross-sentence relationships and domain-specific cues.
- The model supports weak supervision and robust domain adaptation, yielding significant performance gains in zero-shot, few-shot, and fully supervised settings.
MultiVerS is a multitask neural architecture for scientific claim verification, designed to simultaneously predict document-level fact-checking labels ("Support", "Refute", "Not Enough Information") and select sentence-level rationales supporting those decisions, using a unified encoding of scientific claims and the full context of candidate abstracts. It addresses the limitations of prior extract-then-label pipelines by employing joint modeling, full-document context, and support for weak supervision (instances with only document-level labels), facilitating robust domain adaptation critical for scientific fact-checking (Wadden et al., 2021).
1. Scientific Claim Verification: Problem Overview
The scientific claim verification task requires a system to determine whether a published scientific document supports, contradicts, or lacks information about a given claim, and to identify the specific sentences ("rationales") that justify the prediction. Traditional extract-then-label systems first select candidate evidence sentences (often losing relevant contextual cues such as acronyms, coreferences, or section-level qualifiers) and then assign a label in a separate step. This approach requires full sentence-level supervision, which is expensive and often unavailable in biomedical or newly emerging domains.
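To make the task's inputs and outputs concrete, the hypothetical dataclass below sketches a single claim–abstract instance with a document-level label and optional sentence-level rationale indices. The field names and example values are illustrative, not drawn from any particular dataset's schema.

```python
# Illustrative sketch (not any dataset's actual schema): one claim-abstract
# instance for scientific claim verification.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ClaimAbstractExample:
    claim: str                       # scientific claim to verify
    title: str                       # title of the candidate abstract
    sentences: List[str]             # abstract split into sentences
    label: str                       # "SUPPORT", "REFUTE", or "NEI" (document-level)
    rationale_ids: Optional[List[int]] = None   # indices of rationale sentences;
                                                # None when only weak supervision exists

example = ClaimAbstractExample(
    claim="Vitamin D supplementation reduces the risk of respiratory infection.",
    title="Vitamin D to prevent acute respiratory tract infections",
    sentences=[
        "We performed a meta-analysis of randomized controlled trials.",
        "Vitamin D supplementation reduced the risk of acute respiratory infection.",
    ],
    label="SUPPORT",
    rationale_ids=[1],               # the second sentence justifies the label
)
```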
MultiVerS introduces a model and training paradigm that jointly addresses rationale selection and claim labeling, allowing weakly supervised learning from document-level labels alone. This is achieved through a multitask neural architecture with a shared context-aware encoder.
2. Model Architecture and Multitask Learning
- Input Representation: For each claim–abstract pair, the input sequence concatenates the claim, abstract title, and abstract sentences, with a </s> separator token closing each segment (shown concretely in the sketch below this list).
- Encoder: The Longformer serves as the encoder, accommodating the long contexts typical of scientific abstracts. Global attention is assigned to the <s> token and every </s> separator token, ensuring access to both the claim and the broader document context.
- Joint Prediction Heads:
  - Abstract Label: A softmax classification head over the <s> token embedding outputs the claim verification label (Support/Refute/NEI).
  - Rationale Selection: Each sentence's separator token feeds a binary classification head, producing a softmax score per sentence that predicts its rationale status.
- Loss Function: The loss combines both tasks, $\mathcal{L} = \mathcal{L}_{\text{label}} + \lambda_{\text{rationale}} \, \mathcal{L}_{\text{rationale}}$, where $\lambda_{\text{rationale}}$ is a task-weighting hyperparameter (set to 15).
- Inference Strategy: If the predicted label is NEI, no rationales are assigned. Otherwise, all sentences with softmax above 0.5 are selected as rationales; if none exist above threshold, the top-scoring sentence is assigned as the rationale.
A key aspect is that rationale selection is conditioned on the document-level prediction: rationales are only reported when the label is Support or Refute, which keeps the selected evidence aligned with the label and mitigates the risk that rationale sentences alone provide insufficient context.
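To make the architecture above concrete, here is a minimal, self-contained sketch of a MultiVerS-style multitask model in PyTorch using the Hugging Face Longformer. It is not the authors' released implementation: the base checkpoint, single-layer heads, exact separator layout, label ordering, and example inputs are simplifying assumptions; only the λ = 15 task weight and the 0.5 rationale threshold come from the description above.

```python
# Hedged sketch of a MultiVerS-style encoder with two prediction heads.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, LongformerModel

MODEL_NAME = "allenai/longformer-base-4096"   # assumed checkpoint for brevity
LABELS = ["SUPPORT", "NEI", "REFUTE"]         # illustrative label ordering
LAMBDA_RATIONALE = 15.0                       # task-weighting hyperparameter

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

class MultiVerSStyleModel(nn.Module):
    def __init__(self, model_name=MODEL_NAME, num_labels=3):
        super().__init__()
        self.encoder = LongformerModel.from_pretrained(model_name)
        hidden_size = self.encoder.config.hidden_size
        self.label_head = nn.Linear(hidden_size, num_labels)   # abstract-level label
        self.rationale_head = nn.Linear(hidden_size, 2)        # per-sentence rationale yes/no

    def forward(self, input_ids, attention_mask, global_attention_mask, sent_sep_positions):
        hidden = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            global_attention_mask=global_attention_mask,
        ).last_hidden_state                                    # (1, seq_len, hidden)
        label_logits = self.label_head(hidden[:, 0, :])        # <s> token embedding
        sent_reprs = hidden[0, sent_sep_positions, :]          # one </s> token per sentence
        rationale_logits = self.rationale_head(sent_reprs)
        return label_logits, rationale_logits

def encode(claim, title, sentences):
    """Concatenate claim, title, and sentences, closing each segment with </s>."""
    cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
    ids = [cls_id] + tokenizer.encode(claim, add_special_tokens=False) + [sep_id]
    ids += tokenizer.encode(title, add_special_tokens=False) + [sep_id]
    sent_sep_positions = []
    for sent in sentences:
        ids += tokenizer.encode(sent, add_special_tokens=False)
        sent_sep_positions.append(len(ids))                    # index of this sentence's </s>
        ids.append(sep_id)
    input_ids = torch.tensor([ids])
    attention_mask = torch.ones_like(input_ids)
    global_attention_mask = torch.zeros_like(input_ids)        # global attention on <s> and </s>
    global_attention_mask[(input_ids == cls_id) | (input_ids == sep_id)] = 1
    return input_ids, attention_mask, global_attention_mask, torch.tensor(sent_sep_positions)

model = MultiVerSStyleModel()
inputs = encode(
    "Vitamin D supplementation reduces respiratory infections.",        # illustrative claim
    "Vitamin D and acute respiratory tract infections",                 # illustrative title
    ["We meta-analyzed randomized controlled trials.",
     "Supplementation reduced the risk of acute respiratory infection."],
)
label_logits, rationale_logits = model(*inputs)

# Multitask training loss: L = L_label + lambda_rationale * L_rationale.
label_target = torch.tensor([0])                          # e.g. SUPPORT
rationale_target = torch.tensor([0, 1])                   # second sentence is the rationale
loss = F.cross_entropy(label_logits, label_target) \
     + LAMBDA_RATIONALE * F.cross_entropy(rationale_logits, rationale_target)

# Inference rule described above: no rationales for NEI; otherwise keep sentences
# with rationale probability > 0.5, falling back to the single top-scoring sentence.
label = LABELS[label_logits.argmax(dim=-1).item()]
rationale_probs = rationale_logits.softmax(dim=-1)[:, 1]
if label == "NEI":
    rationales = []
else:
    above = (rationale_probs > 0.5).nonzero(as_tuple=True)[0].tolist()
    rationales = above or [rationale_probs.argmax().item()]
```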
3. Incorporation of Full-Document Context
Unlike extract-then-label baselines—which operate on individual sentences—MultiVerS encodes the entire abstract and claim together. This allows the model to resolve coreference, disambiguate acronyms, and effectively handle scientific modifiers or qualifiers expressed over sentence boundaries. Empirically, this design is highlighted as critical in challenging scenarios where the rationale alone fails to capture all necessary context.
4. Weak Supervision and Domain Adaptation
MultiVerS can train under both fully supervised (document and rationale annotations present) and weakly supervised (only document-level labels) regimes. For weak supervision, the model ignores the rationale loss for those instances. This enables effective use of:
- Datasets with only abstract-level labels: e.g., PubMedQA (claims formed from paper titles with abstract-level support labels but no sentence-level rationales), EvidenceInference (study conclusions mapped to claims), and FEVER (out-of-domain, general fact verification).
During training, no heuristic or noisy rationale assignment is needed for weakly labeled data; the rationale loss is simply set to zero for those instances (see the sketch below). In contrast, baseline systems must rely on heuristics (e.g., Sentence-BERT similarity) to obtain rationale labels when none are annotated, which introduces considerable noise and impedes domain adaptation.
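As a rough sketch of how this masking can be implemented (assuming the multitask loss from Section 2; the helper name and signature are hypothetical, not the released code), the rationale term is simply dropped whenever an instance carries no sentence-level annotations:

```python
# Hedged sketch: drop the rationale loss term for weakly supervised instances
# that carry only a document-level label, so they still train the label head.
import torch
import torch.nn.functional as F

LAMBDA_RATIONALE = 15.0   # same task weight as in the fully supervised loss

def multivers_style_loss(label_logits, label_target,
                         rationale_logits=None, rationale_target=None):
    """rationale_target is None for weakly supervised (document-label-only) instances."""
    loss = F.cross_entropy(label_logits, label_target)
    if rationale_target is not None:          # fully supervised instance
        loss = loss + LAMBDA_RATIONALE * F.cross_entropy(rationale_logits, rationale_target)
    return loss

# Weakly supervised example: only the abstract-level label is available.
weak_loss = multivers_style_loss(torch.randn(1, 3), torch.tensor([1]))
# Fully supervised example: sentence-level rationale labels are also available.
full_loss = multivers_style_loss(torch.randn(1, 3), torch.tensor([0]),
                                 torch.randn(4, 2), torch.tensor([0, 1, 0, 0]))
```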
5. Experimental Results and Comparative Performance
Experiments are conducted on three core datasets:
- SciFact: Expert-written atomic biomedical claims derived from citation sentences in research papers.
- HealthVer: Complex COVID-related claims derived from snippets retrieved for TREC-COVID queries.
- COVIDFact: Claims scraped from a COVID-19 subreddit and verified against evidence from the CORD-19 corpus; refuted claims are generated by automatic negation, which adds noise.
Pretraining and weak supervision use FEVER, PubMedQA, and EvidenceInference.
Summary of main results:
- MultiVerS consistently outperforms VerT5erini (a two-model T5-3B extract-then-label pipeline) and ParagraphJoint (a joint model that is less effective under weak supervision) on both abstract-level and sentence-level metrics.
- Zero-shot domain adaptation: Average +26% F1 improvement (abstract and sentence) over baselines when no in-domain labeled data is available.
- Few-shot domain adaptation: +14% improvement with only 45 in-domain samples.
- Full supervision: +11% average improvement.
- Human agreement analysis: MultiVerS sometimes exceeds inter-annotator agreement on sentence-level rationale selection, though less so on the document-level label, suggesting it can approach or exceed human-level evidence identification in difficult settings.
6. Ablations and Analysis
Ablation studies reveal:
- Domain-specific pretraining is crucial: Combining general-domain fact verification data (FEVER) with scientific in-domain data is essential; omitting the in-domain data causes a 65% relative performance drop in the zero-shot setting.
- Pipeline ablation: The joint multitask strategy (rationale conditioned on label) outperforms extract-then-label pipelines, especially in low-resource or cross-domain settings where heuristic rationale selection fails.
- Context dependence: Full-context encoding is superior for claims requiring cross-sentence reasoning. Baseline pipelines underperform when sentence context alone does not suffice.
7. Significance and Practical Implications
- Scalable domain adaptation: The explicit support for weak supervision allows rapid extension to new scientific domains with little labeled data.
- Robustness to annotation sparsity: Joint modeling avoids the need for labor-intensive sentence-level rationale annotation.
- Practical deployment: MultiVerS's performance is stable even as annotation cost or in-domain data volume sharply decrease, making it suited for rapidly emerging research fields or high-value biomedical verification tasks.
| Setting | In-domain training data | Average relative F1 gain over the best baseline |
|---|---|---|
| Zero-shot | None | +26% |
| Few-shot | 45 labeled examples | +14% |
| Full supervision | Full training set | +11% |
8. Technical Summary Table
| Module | Description | Function |
|---|---|---|
| Input Encoder | Longformer with global attention on <s> and separator tokens | Joint encoding of claim + abstract |
| Doc Label Head | 3-class softmax over the <s> embedding | Support/Refute/NEI prediction |
| Sentence Head | Softmax over each sentence separator token | Rationale selection |
| Loss | Multitask: $\mathcal{L} = \mathcal{L}_{\text{label}} + \lambda_{\text{rationale}} \, \mathcal{L}_{\text{rationale}}$, with $\lambda_{\text{rationale}} = 15$ | Joint training of both heads |
| Weak supervision | Rationale loss set to zero when rationale annotations are unavailable | Efficient domain adaptation |
9. Conclusion
MultiVerS defines the state-of-the-art for scientific claim verification and rationalization under both fully and weakly supervised settings. Its architecture—joint multitask learning over a document-contextualized encoder—enables consistent advances in accuracy, transferability, and efficiency across a range of scientific and biomedical fact-checking benchmarks. The system demonstrates that a unified, full-context approach is essential for robust claim verification in practice, particularly in data- and resource-constrained scientific domains (Wadden et al., 2021).