Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators (2404.04475v1)

Published 6 Apr 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: LLM-based auto-annotators have become a key component of the LLM development process due to their cost-effectiveness and scalability compared to human-based evaluation. However, these auto-annotators can introduce complex biases that are hard to remove. Even simple, known confounders such as preference for longer outputs remain in existing automated evaluation metrics. We propose a simple regression analysis approach for controlling biases in auto-evaluations. As a real case study, we focus on reducing the length bias of AlpacaEval, a fast and affordable benchmark for chat LLMs that uses LLMs to estimate response quality. Despite being highly correlated with human preferences, AlpacaEval is known to favor models that generate longer outputs. We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?". To achieve this, we first fit a generalized linear model to predict the biased output of interest (auto-annotator preferences) based on the mediators we want to control for (length difference) and other relevant features. We then obtain length-controlled preferences by predicting preferences while conditioning the GLM with a zero difference in lengths. Length-controlling not only improves the robustness of the metric to manipulations in model verbosity, we also find that it increases the Spearman correlation with LMSYS' Chatbot Arena from 0.94 to 0.98. We release the code and leaderboard at https://tatsu-lab.github.io/alpaca_eval/ .

Debiasing Automated Evaluation Metrics for LLMs with a Focus on Length Bias

Introduction

The development and refinement of NLP systems hinge on reliable and cost-effective evaluation methodologies. Reference-free evaluation methods that leverage LLMs have gained traction because they align well with human annotator preferences, but they are susceptible to spurious correlations such as output length. This paper addresses the challenge of debiasing automated evaluation metrics through a regression-analysis approach, with a case study of AlpacaEval, a prominent benchmark for chat LLMs. The paper introduces a length-controlled version of AlpacaEval, demonstrating both a reduction in length bias and an improvement in correlation with human preferences as measured against LMSYS' Chatbot Arena.

Context and Problem Statement

Reference-free evaluation metrics, particularly those that use LLMs as judges such as AlpacaEval, are often swayed by spurious correlates of quality, compromising the accuracy of model evaluations. Despite its high correlation with human judgments, AlpacaEval displays a marked length bias: its verdicts shift with the verbosity of model outputs. The core problem is that existing automated metrics fail to isolate content quality from such confounders, motivating a debiasing method that improves both the reliability and the robustness of these metrics.
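
As a rough illustration of how such a bias can surface in practice, one could split head-to-head annotations by which side produced the longer output and compare win rates. This is a minimal sketch under an assumed, hypothetical schema: an `annotations` DataFrame with a binary `preference` column (1 if the auto-annotator preferred the model over the baseline) and a `len_diff` column (model output length minus baseline output length).

```python
import pandas as pd

def length_bias_gap(annotations: pd.DataFrame) -> float:
    """Difference in auto-annotator win rate between cases where the model's
    output is longer vs. shorter than the baseline's (hypothetical schema)."""
    longer = annotations.loc[annotations["len_diff"] > 0, "preference"].mean()
    shorter = annotations.loc[annotations["len_diff"] < 0, "preference"].mean()
    # A large positive gap suggests the annotator rewards verbosity, not quality alone.
    return longer - shorter
```

A persistently positive gap across models would be consistent with the length bias described above, though by itself it does not separate verbosity from genuine quality differences.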

Methodology: Length-Controlled AlpacaEval

The proposed solution debiases AlpacaEval by applying regression-based adjustments that control for length. Conceptualizing spurious correlates as mediators in a causal relationship between model outputs and evaluation scores, the paper fits a generalized linear model (GLM) to predict the auto-annotator's preference from the length difference between the model's and the baseline's outputs, alongside other relevant features, and then predicts preferences while conditioning the GLM on a zero length difference. This approach addresses the identified bias while preserving desirable properties of an evaluation metric, such as interpretability, symmetry, and the identity property. The length-controlled AlpacaEval thus reflects model performance more accurately, free of the skew introduced by output-length disparities.
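
The counterfactual conditioning step can be sketched with a small logistic GLM. This is a minimal illustration of the idea, not the paper's exact parameterization (which also includes per-model terms and instruction-level features); the `annotations` DataFrame and its columns follow the same hypothetical schema as above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def length_controlled_winrate(annotations: pd.DataFrame) -> float:
    # Bounded, standardized length-difference feature (one simple choice; the
    # exact transform used by the paper may differ).
    len_term = np.tanh(annotations["len_diff"] / annotations["len_diff"].std())
    X = pd.DataFrame({"intercept": 1.0, "len_diff": len_term})
    y = annotations["preference"]

    # Fit a logistic GLM: P(model preferred) ~ intercept + length-difference term.
    glm = sm.GLM(y, X, family=sm.families.Binomial()).fit()

    # Counterfactual query: what would the preference be if both outputs had the
    # same length? Predict with the length-difference term set to zero.
    X_counterfactual = X.assign(len_diff=0.0)
    return float(glm.predict(X_counterfactual).mean())
```

Averaging the counterfactual predictions gives a length-controlled win rate against the baseline; because the adjustment is simply a prediction from the fitted GLM, the metric remains interpretable as a win rate.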

Findings and Contributions

The implementation of the length-controlled AlpacaEval yielded several key insights:

  • Reduction in Length Gameability: The debiased metric substantially reduces the gains a model can obtain simply by being prompted to produce longer outputs, yielding a more stable and reliable evaluation that prioritizes content quality over verbosity.
  • Enhanced Correlation with Human Preferences: The paper reports an increase in Spearman correlation with Chatbot Arena from 0.94 to 0.98, signifying closer alignment with human judgments and, per the paper, the highest reported correlation among comparable automatic benchmarks (a rank-correlation sketch follows this list).
  • Increased Robustness: The adjusted metric proved to be more resistant to manipulations aimed at exploiting the length bias, further affirming its efficacy as a reliable evaluation tool.
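
For context on the correlation figures, agreement with Chatbot Arena is an ordinary rank correlation between two leaderboards. A minimal sketch with placeholder scores (the model names and numbers below are illustrative, not values from the paper):

```python
from scipy.stats import spearmanr

# Illustrative placeholder leaderboards (not real values).
arena_elo = {"model_a": 1250, "model_b": 1180, "model_c": 1100, "model_d": 1020}
lc_alpacaeval = {"model_a": 0.55, "model_b": 0.47, "model_c": 0.38, "model_d": 0.24}

models = sorted(arena_elo)
rho, _ = spearmanr([arena_elo[m] for m in models],
                   [lc_alpacaeval[m] for m in models])
print(f"Spearman correlation with Chatbot Arena: {rho:.2f}")
```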

Implications and Future Directions

The research underscores the importance of addressing biases in automated evaluation metrics for NLP systems, spotlighting length bias as a pervasive issue that can significantly skew evaluations. By presenting a scalable and reproducible method for debiasing such metrics, the paper opens avenues for more nuanced and accurate evaluations of LLMs. Future investigations could extend this debiasing approach to other known biases, potentially revolutionizing the way model performances are gauged and interpreted in varied contexts.

Conclusion

This work contributes significantly to the field of NLP by providing a practical solution to mitigate the length bias in AlpacaEval, thereby enhancing the accuracy and reliability of automated evaluation metrics. The approach is not only beneficial for the ongoing development and testing of LLMs but also sets a precedent for addressing other types of biases inherent in automated evaluation systems.

Authors (4)
  1. Yann Dubois
  2. Balázs Galambosi
  3. Percy Liang
  4. Tatsunori B. Hashimoto
Citations (200)