Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models (2402.13887v2)

Published 21 Feb 2024 in cs.CL

Abstract: LLMs have demonstrated remarkable capabilities across various applications, fundamentally reshaping the landscape of NLP research. However, recent evaluation frameworks often rely on the output probabilities of LLMs for predictions, primarily due to computational constraints, diverging from real-world LLM usage scenarios. While widely employed, the efficacy of these probability-based evaluation strategies remains an open research question. This study scrutinizes the validity of such probability-based evaluation methods within the context of using LLMs for Multiple Choice Questions (MCQs), highlighting their inherent limitations. Our empirical investigation reveals that the prevalent probability-based evaluation method inadequately aligns with generation-based prediction. The outcomes of our study can enhance the understanding of LLM evaluation methodologies and provide insights for future research in this domain.

Beyond Probabilities: A Critical Examination of Evaluation Methods for LLMs

Introduction

As the field of NLP continues to expand, LLMs have taken center stage due to their unprecedented capabilities across a myriad of applications. The scale of these models, which often comprise billions to trillions of parameters, has introduced novel challenges for their evaluation. Traditional evaluation frameworks predominantly rely on probability-based methods to gauge LLM performance, especially in predictive tasks. These methods typically select the answer with the highest output probability from the LLM when it is confronted with Multiple Choice Questions (MCQs). This paper critically assesses how well such evaluation practices reflect the true capabilities of LLMs, particularly in scenarios that mimic real-world applications.
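
To make the probability-based protocol concrete, the sketch below scores each MCQ option by the log-likelihood a causal language model assigns to it and takes the argmax as the prediction. This is a minimal illustration rather than the paper's exact setup: the model name, prompt template, and example question are placeholder assumptions.

```python
# A minimal sketch of probability-based MCQ evaluation (not the paper's exact
# protocol): each option is scored by the total log-probability the model
# assigns to its tokens after the prompt; the argmax option is the prediction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM from the Hub works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_logprob(prompt: str, option: str) -> float:
    """Summed log-probability of the option tokens, conditioned on the prompt.

    Assumes the prompt's token ids are a prefix of the tokenized prompt+option;
    real evaluation harnesses handle tokenization boundaries more carefully.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits            # (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for i in range(prompt_ids.shape[1], full_ids.shape[1]):
        # The token at position i is predicted by the logits at position i - 1.
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

question = "What is the capital of France?"
options = {"A": "Berlin", "B": "Paris", "C": "Rome", "D": "Madrid"}
prompt = f"{question}\nAnswer:"
scores = {label: option_logprob(prompt, " " + text) for label, text in options.items()}
prob_based_prediction = max(scores, key=scores.get)
print(scores, prob_based_prediction)  # probability-based prediction, e.g. "B"
```

In contrast, real users rarely interact with a model this way; they read the text it generates, which is exactly the gap the paper examines.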

Evaluation Misalignment

Our investigation into current LLM evaluation methodologies exposes a significant gap between traditional probability-based evaluation strategies and the generative way LLMs are used in practical settings. Notably, these evaluation frameworks often fail to capture generative prediction, which constitutes a substantial portion of LLM use cases. Extensive experiments with LLMs of varied sizes across prominent benchmarks such as MMLU, TruthfulQA, and Belebele show a clear disconnect between the outcomes of probability-based methods and generation-based predictions. Even when the two approaches yield the same predictions, the accuracy and alignment of the evaluated models' performance remain inconsistent. This disparity raises critical concerns about the reliability of conventional benchmarks that rely on probability-based evaluation of LLMs.
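
As a toy illustration of the kind of comparison described here, the snippet below contrasts the accuracy of probability-based and generation-based predictions and measures how often the two methods agree on the same question. The prediction lists are fabricated placeholders, not results from the paper.

```python
# Toy alignment check between the two evaluation modes. The three lists are
# fabricated placeholders standing in for real pipeline outputs.
gold       = ["B", "C", "A", "D", "B"]
prob_preds = ["B", "C", "B", "D", "A"]  # argmax over option probabilities
gen_preds  = ["B", "A", "A", "D", "B"]  # option parsed from free-form generations

def accuracy(preds, labels):
    return sum(p == g for p, g in zip(preds, labels)) / len(labels)

# Alignment: fraction of questions where both methods pick the same option,
# regardless of whether that option is correct.
alignment = sum(p == q for p, q in zip(prob_preds, gen_preds)) / len(gold)

print(f"probability-based accuracy: {accuracy(prob_preds, gold):.2f}")
print(f"generation-based accuracy:  {accuracy(gen_preds, gold):.2f}")
print(f"alignment rate:             {alignment:.2f}")
```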

Challenges in Current Evaluation Practices

The existing evaluation methodologies face several challenges, including:

  • Scalability and Reproducibility: Human evaluations, though considered the gold standard, are not scalable and pose significant challenges in ensuring reproducibility and consistency across evaluators.
  • Dependency on Restricted Responses: MCQ-based evaluations limit LLMs to a constrained set of responses, potentially misrepresenting a model's generative capabilities in unrestricted, user-facing scenarios.
  • Discrepancy with Human Preferences: Our findings suggest that MCQ benchmarks may not accurately reflect human preferences, particularly in open-ended or creative tasks.

This critique underscores the urgent need for a paradigm shift in evaluating LLM capabilities, moving beyond probabilities and towards methods that encapsulate the generative and contextual richness of LLM outputs.
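
For contrast with the probability-based sketch earlier, generation-based prediction lets the model answer in free text and then recovers an option label from the output. The snippet below, which reuses the model and tokenizer from the first sketch, is only an approximation: the greedy decoding settings and the regex-based answer extraction are assumptions, and practical harnesses use more robust parsing.

```python
# Generation-based prediction for the same MCQ setup, reusing `model` and
# `tokenizer` from the probability-based sketch above. Decoding settings and
# the regex heuristic are illustrative assumptions.
import re
from typing import Optional

import torch

def generation_based_prediction(prompt: str) -> Optional[str]:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    # Decode only the newly generated continuation, not the echoed prompt.
    continuation = tokenizer.decode(
        output_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    match = re.search(r"\b([ABCD])\b", continuation)
    return match.group(1) if match else None  # None when no option can be parsed

mcq_prompt = (
    "What is the capital of France?\n"
    "A. Berlin\nB. Paris\nC. Rome\nD. Madrid\n"
    "Answer with the letter of the correct option.\nAnswer:"
)
print(generation_based_prediction(mcq_prompt))
```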

The Path Forward

In light of these findings, we advocate for a comprehensive reevaluation of LLM benchmarks and suggest the following directions for future research:

  • Development of Holistic Evaluation Frameworks: Future efforts should focus on crafting evaluation protocols that extend beyond traditional benchmarks to encompass a more diverse array of LLM capabilities, including free-text generation and contextual understanding.
  • Emphasis on Slow Research: By prioritizing a deeper understanding of LLM development over leaderboard-chasing, we can foster more robust and meaningful advancements in the field.
  • Alignment with Real-World Applications: Evaluation methods should strive to reflect the practical utility of LLMs, ensuring that advancements translate effectively into tangible benefits in real-world scenarios.

Conclusion

Our critical examination of current evaluation methods for LLMs reveals a fundamental misalignment with the practical utility of these models. This discord underscores the necessity for the development of more nuanced and reflective evaluation frameworks that account for the diverse and generative nature of LLM applications. As the field progresses, embracing these recommendations will be paramount in accurately charting the course of LLM advancements and their implications for both theoretical research and practical applications.

Authors
  1. Chenyang Lyu
  2. Minghao Wu
  3. Alham Fikri Aji