Papers
Topics
Authors
Recent
2000 character limit reached

SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models (2503.00137v1)

Published 28 Feb 2025 in cs.CL

Abstract: Typical evaluations of LLMs report a single metric per dataset, often representing the model's best-case performance under carefully selected settings. Unfortunately, this approach overlooks model robustness and reliability in real-world applications. For instance, simple paraphrasing of prompts on the MMLU-Pro dataset causes accuracy fluctuations of up to 10\%, while reordering answer choices in the AGIEval dataset results in accuracy differences of up to 6.1\%. While some studies discuss issues with LLM robustness, there is no unified or centralized framework for evaluating the robustness of LLMs. To address this gap and consolidate existing research on model robustness, we present SCORE ($\mathbf{S}$ystematic $\mathbf{CO}$nsistency and $\mathbf{R}$obustness $\mathbf{E}$valuation), a comprehensive framework for non-adversarial evaluation of LLMs. The SCORE framework evaluates models by repeatedly testing them on the same benchmarks in various setups to give a realistic estimate of their accuracy and consistency. We release the code publicly and start an LLM robustness leaderboard to facilitate further development and research.

Summary

We haven't generated a summary for this paper yet.

Slide Deck Streamline Icon: https://streamlinehq.com

Whiteboard

Dice Question Streamline Icon: https://streamlinehq.com

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Lightbulb Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.