Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models (2411.01281v3)

Published 2 Nov 2024 in cs.CL and cs.AI

Abstract: Most existing benchmarking approaches for evaluating the output quality of LLMs rely on comparing LLM responses to predefined references. Such methods, based on static datasets, quickly become outdated as LLM capabilities and use cases evolve. In this work, we introduce VARCO Arena, a novel, cost-effective, and robust benchmarking approach that leverages a single-elimination tournament structure to minimize the number of required comparisons while eliminating the need for static references or costly human annotations. We validate our approach through two experiments: (i) a simulation study that examines its robustness under various conditions, and (ii) an empirical evaluation using publicly available benchmark prompts. In both experiments, VARCO Arena consistently outperforms current LLM benchmarking practices, achieving stronger correlations with human-established Elo ratings. Our results demonstrate that VARCO Arena not only produces reliable LLM rankings but also provides a scalable, adaptable solution for qualitative evaluation across diverse, customized use cases.
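
To make the tournament mechanics concrete, here is a minimal Python sketch of the core idea; it is not the authors' implementation. The `judge` stub stands in for an LLM-as-judge call, the random bracket seeding and the ranking by raw match wins are simplifying assumptions, and the paper's Elo-style rating step is omitted.

```python
import random
from collections import Counter


def judge(prompt: str, response_a: str, response_b: str) -> bool:
    """Hypothetical pairwise judge: returns True if response_a is better.

    In VARCO Arena this role is played by an LLM judge; the call is
    stubbed out here because the judging prompt is implementation-specific.
    """
    raise NotImplementedError("plug in an LLM-as-judge call")


def run_bracket(prompt, responses, judge_fn, wins):
    """Play one single-elimination bracket over per-model responses.

    `responses` maps model name -> response text. Each match eliminates
    exactly one model, so n models need only n - 1 judge calls per prompt.
    Match wins are accumulated into the `wins` counter.
    """
    bracket = list(responses)
    random.shuffle(bracket)  # random seeding; assumed here for illustration
    while len(bracket) > 1:
        next_round = []
        if len(bracket) % 2 == 1:       # odd bracket: last model gets a bye
            next_round.append(bracket.pop())
        for a, b in zip(bracket[0::2], bracket[1::2]):
            winner = a if judge_fn(prompt, responses[a], responses[b]) else b
            wins[winner] += 1
            next_round.append(winner)
        bracket = next_round            # winners advance to the next round


def rank_models(outputs, judge_fn=judge):
    """Rank models by total match wins across all prompts.

    `outputs` maps prompt -> {model name: response}. The paper converts
    match outcomes into Elo-style ratings; sorting by raw win counts is
    a simplification used here for brevity.
    """
    wins = Counter({m: 0 for rs in outputs.values() for m in rs})
    for prompt, responses in outputs.items():
        run_bracket(prompt, responses, judge_fn, wins)
    return [model for model, _ in wins.most_common()]
```

Because every match eliminates exactly one model, a single-elimination bracket ranks n models on one prompt with only n - 1 judge calls, and aggregating matches across many prompts smooths out the luck of any single bracket draw.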
