Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Abstract: LLM-based auto-annotators have become a key component of the LLM development process due to their cost-effectiveness and scalability compared to human-based evaluation. However, these auto-annotators can introduce biases that are hard to remove. Even simple, known confounders such as a preference for longer outputs remain in existing automated evaluation metrics. We propose a simple regression-based approach for controlling biases in auto-evaluations. As a real case study, we focus on reducing the length bias of AlpacaEval, a fast and affordable benchmark for instruction-tuned LLMs that uses LLMs to estimate response quality. Despite being highly correlated with human preferences, AlpacaEval is known to favor models that generate longer outputs. We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: "What would the preference be if the model's and baseline's outputs had the same length?" To achieve this, we first fit a generalized linear model to predict the biased auto-annotator's preferences based on the mediators we want to control for (length difference) and other relevant features. We then obtain length-controlled preferences by predicting preferences while conditioning the GLM on a zero difference in lengths. Length-controlling not only makes the metric more robust to manipulations of model verbosity; we also find that it increases the Spearman correlation with LMSYS Chatbot Arena from 0.94 to 0.98.
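As a rough illustration of the procedure described in the abstract, the sketch below fits a logistic GLM to per-example auto-annotator preferences using the standardized length difference between the model's and the baseline's outputs, then re-predicts preferences with that mediator set to zero and averages them into a length-controlled win rate. The DataFrame column names, the use of scikit-learn, and the choice to include only the length difference as a regressor are illustrative assumptions; the actual length-controlled AlpacaEval specification also conditions on other relevant features (e.g., which model produced the output) and may differ in its exact parameterization.

```python
# Minimal sketch of length-controlled preference estimation.
# Assumes a pandas DataFrame `df` with (hypothetical) columns:
#   "preference"   - auto-annotator preference for the model over the baseline (0/1)
#   "len_model"    - length of the model's output (e.g., number of characters)
#   "len_baseline" - length of the baseline's output
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def length_controlled_winrate(df: pd.DataFrame) -> float:
    # Mediator we want to control for: the standardized length difference.
    len_diff = (df["len_model"] - df["len_baseline"]).to_numpy(dtype=float)
    len_diff = (len_diff - len_diff.mean()) / (len_diff.std() + 1e-8)

    X = len_diff.reshape(-1, 1)            # regressor: length difference only
    y = df["preference"].to_numpy()        # observed (length-biased) preferences

    # Fit a logistic GLM predicting the auto-annotator's preference.
    glm = LogisticRegression()
    glm.fit(X, y)

    # Counterfactual query: predicted preference at zero length difference.
    X_zero = np.zeros_like(X)
    controlled = glm.predict_proba(X_zero)[:, 1]

    # Average the counterfactual preferences into a length-controlled win rate.
    return float(controlled.mean())
```

The key design choice this sketch tries to convey is that the bias is removed at prediction time rather than by reweighting or filtering examples: the GLM is fit on the biased annotations as-is, and the length mediator is zeroed out only when querying the fitted model.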