- The paper introduces COVERBENCH as a novel benchmark that challenges language models with complex, multi-step claim verification across diverse domains.
- The methodology standardizes varied datasets into a unified schema and employs manual vetting to ensure low label noise and high quality.
- Experimental results reveal that state-of-the-art models score below 65 Macro-F1, emphasizing substantial room for improvement in claim verification.
COVERBENCH: A Challenging Benchmark for Complex Claim Verification
Introduction
The paper "A Challenging Benchmark for Complex Claim Verification" introduces a new benchmark designed to evaluate the performance of LMs in verifying the correctness of complex claims. The benchmark, known as COVERBENCH, is aimed at testing LMs' ability to handle intricate reasoning tasks across various domains. The key motivation behind this paper is to advance the development of models that not only perform complex reasoning but also accurately verify the correctness of their generated outputs, specifically claims.
Background and Significance
The paper builds on prior work that has focused on evaluating the correctness of LM outputs. Specifically, tasks such as fact-checking and natural language inference (NLI) have been central to this domain. The authors argue that verifying complex claims, which involves multiple steps of reasoning and domain-specific knowledge, is a unique challenge that necessitates a dedicated benchmark. This benchmark is designed to include diverse types of reasoning and is vetted for quality to ensure low levels of label noise.
Construction of COVERBENCH
Variety and Complexity
COVERBENCH is constructed from nine datasets spanning domains such as finance, Wikipedia, biomedicine, and law. The chosen datasets exhibit different sources of complexity, including reasoning over structured data, long-context reasoning, quantitative reasoning, domain expertise, and multi-step reasoning. The goal is to provide a diversified evaluation environment that robustly challenges current LMs.
Benchmark Scope and Sources of Complexity
The authors carefully selected these datasets to cover a wide range of reasoning tasks. For instance, FinQA and MultiHiertt from the finance domain require multi-step quantitative reasoning, while HybridQA and Feverous combine textual and tabular evidence and therefore demand multi-hop reasoning and quantitative analysis. This curation ensures that the benchmark is comprehensive and covers diverse reasoning scenarios, as summarized in the sketch below.
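To make this curation concrete, the sketch below tags each source dataset with a domain and its main sources of complexity in a small Python mapping. The per-dataset tags are illustrative assumptions based on the descriptions above, not the paper's official taxonomy.

```python
# Illustrative mapping of COVERBENCH source datasets to domains and
# sources of complexity. Tags are assumptions for illustration only;
# see the paper for the authoritative per-dataset breakdown.
DATASET_COMPLEXITY = {
    "FinQA":       {"domain": "finance",       "complexity": ["tables", "quantitative", "multi-step"]},
    "MultiHiertt": {"domain": "finance",       "complexity": ["hierarchical tables", "quantitative", "multi-step"]},
    "HybridQA":    {"domain": "Wikipedia",     "complexity": ["tables + text", "multi-hop"]},
    "Feverous":    {"domain": "Wikipedia",     "complexity": ["tables + text", "multi-hop"]},
    "TabFact":     {"domain": "Wikipedia",     "complexity": ["tables"]},
    "TACT":        {"domain": "Wikipedia",     "complexity": ["tables", "quantitative aggregation"]},
    "QRData":      {"domain": "data analysis", "complexity": ["quantitative", "statistical reasoning"]},
    "ContractNLI": {"domain": "legal",         "complexity": ["long context", "domain expertise"]},
    "PubMedQA":    {"domain": "biomedical",    "complexity": ["domain expertise"]},
}

def datasets_with(tag: str) -> list[str]:
    """Return the source datasets whose complexity tags include `tag`."""
    return [name for name, meta in DATASET_COMPLEXITY.items()
            if tag in meta["complexity"]]

print(datasets_with("quantitative"))  # e.g., ['FinQA', 'MultiHiertt', 'QRData']
```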
Methodology
Conversion and Sampling
To standardize the evaluation format, the authors converted all tasks into a unified schema comprising declarative claims, metadata on reasoning types, and standardized tabular representations. Difficult negative examples were generated with the help of models: incorrect model-generated answers, identified by comparison against the gold answers, were turned into false claims so that the negatives resemble realistic model errors. A schematic sketch of such a format follows.
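As a minimal sketch of what such a unified schema might look like, the code below defines a hypothetical example record and one simple way to standardize tabular context by rendering tables as Markdown. The field names, the `CoverBenchExample` class, and the rendering choice are assumptions, not the benchmark's actual release format.

```python
from dataclasses import dataclass, field

@dataclass
class CoverBenchExample:
    """Hypothetical unified record for one verification instance.

    Field names are illustrative; the released data may use a different
    serialization and different keys.
    """
    claim: str                  # declarative claim to be verified
    context: str                # evidence: text and/or a table rendered as Markdown
    label: bool                 # True = supported by the context, False = not supported
    source_dataset: str         # e.g., "FinQA", "Feverous"
    reasoning_types: list[str] = field(default_factory=list)  # e.g., ["tables", "multi-step"]

def table_to_markdown(header: list[str], rows: list[list[object]]) -> str:
    """Render a table as Markdown, one simple way to standardize tabular context."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in rows]
    return "\n".join(lines)

# Toy instance (values invented for illustration): 100M -> 112M is 12% growth,
# so the claim is supported and the label is True.
example = CoverBenchExample(
    claim="Revenue grew by more than 10% between 2019 and 2020.",
    context=table_to_markdown(["Year", "Revenue"], [[2019, "100M"], [2020, "112M"]]),
    label=True,
    source_dataset="FinQA",
    reasoning_types=["tables", "quantitative"],
)
```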
Manual Vetting and Selection
Manual inspection was employed to validate the data and ensure robustness against noise and data contamination. This step was crucial to maintain the integrity of the benchmark and avoid biases introduced by incorrect labeling. The authors discarded examples from datasets that did not meet their quality criteria during this phase.
Experimental Results
The experiments demonstrate that COVERBENCH is indeed challenging for current state-of-the-art models. Competitive LMs such as Llama-2-70b and Qwen2-72B-Instruct leave a significant performance gap, with the best models achieving Macro-F1 scores below 65. This indicates substantial headroom for improvement on complex claim verification tasks.
The authors report a variety of baseline results using 0-shot and 0-shot Chain-of-Thought (CoT) prompting. The results indicate that, despite extensive prompt engineering, current models struggle to perform significantly above the random baseline, which underscores the benchmark's difficulty and the need for better strategies for complex claim verification; a sketch of such a baseline setup follows.
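One way to reproduce this kind of baseline is sketched below: two hypothetical prompt templates (plain 0-shot and 0-shot CoT), a simple answer parser, and Macro-F1 computed with scikit-learn. The prompt wording, the answer-parsing heuristic, and the `call_model` function are assumptions for illustration, not the paper's exact setup.

```python
from sklearn.metrics import f1_score

# Hypothetical prompt templates; the paper's exact wording may differ.
ZERO_SHOT = (
    "Context:\n{context}\n\n"
    "Claim: {claim}\n"
    "Is the claim supported by the context? Answer: true or false."
)
ZERO_SHOT_COT = (
    "Context:\n{context}\n\n"
    "Claim: {claim}\n"
    "Reason step by step about whether the claim is supported by the context, "
    "then finish with 'Answer: true' or 'Answer: false'."
)

def parse_label(output: str) -> bool:
    """Crude heuristic: read the text after the last 'Answer:' and look for 'true'."""
    return "true" in output.lower().rsplit("answer:", 1)[-1]

def evaluate(examples, call_model, template=ZERO_SHOT) -> float:
    """Run a prompting baseline and return Macro-F1 as a percentage.

    `call_model` is an assumed text-in/text-out function wrapping whichever LM
    is being evaluated; `examples` follow the schema sketched earlier.
    """
    preds = [parse_label(call_model(template.format(claim=ex.claim, context=ex.context)))
             for ex in examples]
    gold = [ex.label for ex in examples]
    return 100 * f1_score(gold, preds, average="macro")
```

Since Macro-F1 averages the F1 of the supported and not-supported classes, a balanced random guesser lands near 50, which is why best scores below 65 leave substantial headroom.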
Implications and Future Work
The introduction of COVERBENCH has significant implications for both theoretical and practical advances in AI. From a theoretical perspective, it probes the limits of what LMs can achieve in complex reasoning tasks. Practically, it provides a robust evaluation framework that can guide the development of more accurate and reliable LMs, especially in high-stakes domains like finance and healthcare.
Future Directions
Future developments could include exploring specialized LMs tailored for specific domains and enhancing models' reasoning capabilities with improved training techniques. Additionally, as data contamination remains a persistent challenge, developing more secure and reliable mechanisms for contamination detection and prevention will be crucial.
Conclusion
COVERBENCH is a challenging benchmark for evaluating how well LMs verify the correctness of complex claims. Its diverse and demanding nature highlights the current limitations of existing models and sets the stage for future advances in the field. The systematic approach the authors took in constructing and validating the benchmark positions it as a useful tool for researchers and developers aiming to build more robust AI systems.
Limitations
The authors acknowledge the limitations regarding the use of off-the-shelf LMs and the potential for data contamination. They emphasize the need for specialized models for specific domains and call for further research into effective techniques for managing data integrity and avoiding contamination in future evaluations.
References
The paper provides extensive references to related work and datasets, including FinQA, QRData, TabFact, MultiHiertt, HybridQA, ContractNLI, PubMedQA, TACT, and Feverous, establishing a solid foundation for the presented research.