CoverBench: A Challenging Benchmark for Complex Claim Verification (2408.03325v2)

Published 6 Aug 2024 in cs.CL

Abstract: There is a growing line of research on verifying the correctness of LLMs' outputs. At the same time, LMs are being used to tackle complex queries that require reasoning. We introduce CoverBench, a challenging benchmark focused on verifying LM outputs in complex reasoning settings. Datasets that can be used for this purpose are often designed for other complex reasoning tasks (e.g., QA) targeting specific use-cases (e.g., financial tables), requiring transformations, negative sampling and selection of hard examples to collect such a benchmark. CoverBench provides a diversified evaluation for complex claim verification in a variety of domains, types of reasoning, relatively long inputs, and a variety of standardizations, such as multiple representations for tables where available, and a consistent schema. We manually vet the data for quality to ensure low levels of label noise. Finally, we report a variety of competitive baseline results to show CoverBench is challenging and has very significant headroom. The data is available at https://huggingface.co/datasets/google/coverbench .

Summary

  • The paper introduces COVERBENCH as a novel benchmark that challenges language models with complex, multi-step claim verification across diverse domains.
  • The methodology standardizes varied datasets into a unified schema and employs manual vetting to ensure low label noise and high quality.
  • Experimental results reveal that state-of-the-art models score below 65 Macro-F1, emphasizing substantial room for improvement in claim verification.

CoverBench: A Challenging Benchmark for Complex Claim Verification

Introduction

The paper "A Challenging Benchmark for Complex Claim Verification" introduces a new benchmark designed to evaluate the performance of LMs in verifying the correctness of complex claims. The benchmark, known as COVERBENCH, is aimed at testing LMs' ability to handle intricate reasoning tasks across various domains. The key motivation behind this paper is to advance the development of models that not only perform complex reasoning but also accurately verify the correctness of their generated outputs, specifically claims.

Background and Significance

The paper builds on prior work that has focused on evaluating the correctness of LM outputs. Specifically, tasks such as fact-checking and natural language inference (NLI) have been central to this domain. The authors argue that verifying complex claims, which involves multiple steps of reasoning and domain-specific knowledge, is a unique challenge that necessitates a dedicated benchmark. This benchmark is designed to include diverse types of reasoning and is vetted for quality to ensure low levels of label noise.

Construction of COVERBENCH

Variety and Complexity

COVERBENCH is constructed using a combination of nine datasets from different domains, including finance, Wikipedia, biomedical, legal, and others. The datasets chosen exhibit various sources of complexity such as structured data reasoning, long-context reasoning, quantitative reasoning, domain expertise, and multi-step reasoning. The goal is to provide a diversified evaluation environment that can robustly challenge current LMs.
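
The released benchmark can be inspected directly from the Hugging Face link in the abstract; the following is a minimal sketch that loads the data and prints whatever splits and fields it actually ships with (no field names are assumed here).

```python
# Minimal sketch: load the released CoverBench data and inspect its schema.
# Only the dataset path (from the abstract) is taken as given; split names
# and field names are printed rather than assumed.
from datasets import load_dataset

ds = load_dataset("google/coverbench")

for split_name, split in ds.items():
    print(split_name, len(split))   # split name and row count
    print(split.features)           # the unified schema's fields and types
    print(split[0])                 # one full example
    break
```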

Benchmark Scope and Sources of Complexity

The authors meticulously selected these datasets to cover a wide range of reasoning tasks. For instance, FinQA and MultiHiertt from the finance domain involve multi-step quantitative reasoning. Similarly, datasets like HybridQA and Feverous combine textual and tabular data, requiring multi-hop reasoning and quantitative analysis. This careful curation ensures that the benchmark is comprehensive and covers diverse reasoning scenarios.
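
As a toy illustration of the kind of multi-step quantitative reasoning such claims require, consider checking a growth claim against a small table; the table and claim below are invented for exposition and are not taken from any of the source datasets.

```python
# Toy illustration (not drawn from any CoverBench source dataset) of multi-step
# quantitative claim verification: read two table cells, compute a derived
# quantity, and compare it against the threshold stated in the claim.
table = {"revenue": {"2022": 120.0, "2023": 150.0}}  # made-up numbers

claim = "Revenue grew by more than 20% from 2022 to 2023."

growth = (table["revenue"]["2023"] - table["revenue"]["2022"]) / table["revenue"]["2022"]
label = growth > 0.20   # 0.25 > 0.20 -> the claim is supported
print(label)            # True
```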

Methodology

Conversion and Sampling

To standardize the evaluation format, the authors converted all tasks into a unified schema comprising declarative claims, metadata on reasoning types, and standardized tabular representations. Hard negative examples were obtained by comparing model-generated answers against the gold answers and keeping the cases where the model was wrong, so the resulting false claims are representative of real-world model errors.
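
As a rough illustration of what such a unified schema and negative-selection step could look like, here is a hedged sketch; the field names and the `select_hard_negative` helper are invented for exposition and are not the authors' actual conversion pipeline.

```python
# Illustrative sketch of a unified claim-verification schema and of selecting
# hard negatives from wrong model answers. Field names and the helper below
# are assumptions for exposition, not the authors' exact pipeline.
from dataclasses import dataclass, field

@dataclass
class ClaimExample:
    context: str                  # long input: passage and/or table
    claim: str                    # declarative claim to verify
    label: bool                   # True = supported by the context
    reasoning_types: list = field(default_factory=list)        # e.g. ["quantitative", "multi-hop"]
    table_representations: dict = field(default_factory=dict)  # e.g. {"markdown": ..., "html": ...}

def select_hard_negative(question: str, gold_answer: str, model_answer: str):
    """Keep a model answer only if it disagrees with gold; the wrong answer is
    then rewritten into a declarative (false) claim."""
    if model_answer.strip().lower() == gold_answer.strip().lower():
        return None  # model was right -> not usable as a negative
    # Naive declarativization for illustration; the paper's conversion is more careful.
    return f"The answer to '{question}' is {model_answer}."
```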

Manual Vetting and Selection

Manual inspection was employed to validate the data and ensure robustness against noise and data contamination. This step was crucial to maintain the integrity of the benchmark and avoid biases introduced by incorrect labeling. The authors discarded examples from datasets that did not meet their quality criteria during this phase.

Experimental Results

The experiments demonstrate that COVERBENCH is challenging for current state-of-the-art models. Competitive LMs such as Llama-2-70b and Qwen2-72B-Instruct leave large performance gaps, with even the best models scoring below 65 Macro-F1. This indicates substantial headroom for improvement on complex claim verification tasks.

Baseline Performance

The authors report a variety of baseline performances using 0-shot and 0-shot Chain-of-Thought (CoT) prompting techniques. The results indicate that, despite extensive prompt engineering, current models struggle to perform significantly above the random baseline. This showcases the benchmark's difficulty and the need for better strategies in complex claim verification.
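
To make this evaluation setup concrete, here is a minimal sketch of a 0-shot verification prompt and Macro-F1 scoring, assuming scikit-learn; the prompt wording, the 'true'/'false' parsing rule, and the `call_model` stub are placeholders rather than the paper's exact prompts or harness.

```python
# Sketch of a 0-shot claim-verification baseline and Macro-F1 scoring.
# The prompt text, the 'true'/'false' parsing rule, and call_model() are
# placeholders; they are not the paper's exact prompts or evaluation harness.
from sklearn.metrics import f1_score

PROMPT = (
    "Context:\n{context}\n\n"
    "Claim: {claim}\n\n"
    "Is the claim fully supported by the context? Answer 'true' or 'false'."
)

def call_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LM client here")

def predict(example: dict) -> bool:
    generation = call_model(PROMPT.format(**example))
    return "true" in generation.lower()

def macro_f1(gold: list, pred: list) -> float:
    # Macro-F1 averages the per-class F1 of 'true' and 'false' equally, so a
    # degenerate always-'true' predictor cannot score well.
    return f1_score(gold, pred, average="macro")
```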

Implications and Future Work

The introduction of COVERBENCH has significant implications for both theoretical and practical advancements in AI. From a theoretical perspective, it pushes the boundaries of what LMs can achieve in complex reasoning tasks. Practically, it provides a robust evaluation framework that can guide the development of more accurate and reliable LMs, especially in high-stakes domains like finance and healthcare.

Future Directions

Future developments could include exploring specialized LMs tailored for specific domains and enhancing models' reasoning capabilities with improved training techniques. Additionally, as data contamination remains a persistent challenge, developing more secure and reliable mechanisms for contamination detection and prevention will be crucial.

Conclusion

COVERBENCH stands as a substantial benchmark designed to evaluate the performance of LMs in verifying the correctness of complex claims. Its diverse and challenging nature highlights the current limitations of existing models and sets the stage for future advancements in the field. The systematic approach taken by the authors in constructing and validating this benchmark showcases its potential to serve as a critical tool for researchers and developers aiming to build more robust AI systems.

Limitations

The authors acknowledge the limitations regarding the use of off-the-shelf LMs and the potential for data contamination. They emphasize the need for specialized models for specific domains and call for further research into effective techniques for managing data integrity and avoiding contamination in future evaluations.

References

The paper provides extensive references to related work and datasets, including FinQA, QRData, TabFact, MultiHiertt, HybridQA, ContractNLI, PubMedQA, TACT, and Feverous, establishing a solid foundation for the presented research.
