Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks (2405.01719v2)

Published 2 May 2024 in cs.LG

Abstract: We examine multi-task benchmarks in machine learning through the lens of social choice theory. We draw an analogy between benchmarks and electoral systems, where models are candidates and tasks are voters. This suggests a distinction between cardinal and ordinal benchmark systems: the former aggregate numerical scores into one model ranking; the latter aggregate per-task rankings of the models. We apply Arrow's impossibility theorem to ordinal benchmarks to highlight the inherent limitations of ordinal systems, particularly their sensitivity to the inclusion of irrelevant models. Inspired by Arrow's theorem, we empirically demonstrate a strong trade-off between diversity and sensitivity to irrelevant changes in existing multi-task benchmarks. Our result is based on new quantitative measures of diversity and sensitivity that we introduce. Sensitivity quantifies the impact that irrelevant changes to tasks have on a benchmark. Diversity captures the degree of disagreement in model rankings across tasks. We develop efficient approximation algorithms for both measures, as exact computation is computationally challenging. Through extensive experiments on seven cardinal benchmarks and eleven ordinal benchmarks, we demonstrate a clear trade-off between diversity and stability: the more diverse a multi-task benchmark, the more sensitive it is to trivial changes. Additionally, we show that the aggregated rankings of existing benchmarks are highly unstable under irrelevant changes. The code and data are available at https://socialfoundations.github.io/benchbench/.
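
To make the two measures concrete, below is a minimal illustrative sketch of how one might proxy diversity and sensitivity for a cardinal benchmark given a model-by-task score matrix. It is not the paper's exact definitions and does not use the benchbench API; the function names, the use of average pairwise Kendall-tau distance as a diversity proxy, and the monotone-rescaling perturbation behind the sensitivity probe are all assumptions made for illustration.

```python
# Illustrative sketch only (not the paper's definitions or the benchbench API):
# crude proxies for "diversity" and "sensitivity" of a multi-task benchmark.
# Assumptions: scores is an (n_models, n_tasks) matrix of per-task scores in [0, 1],
# higher is better, and the cardinal aggregate is the plain mean over tasks.
from itertools import combinations

import numpy as np
from scipy.stats import kendalltau


def diversity(scores: np.ndarray) -> float:
    """Mean pairwise Kendall-tau distance between task-wise model rankings.
    0 means every task ranks the models identically; values near 1 mean
    strong disagreement across tasks."""
    n_tasks = scores.shape[1]
    dists = []
    for i, j in combinations(range(n_tasks), 2):
        tau, _ = kendalltau(scores[:, i], scores[:, j])  # rank correlation of two tasks
        dists.append((1.0 - tau) / 2.0)                  # map [-1, 1] to a distance in [0, 1]
    return float(np.mean(dists)) if dists else 0.0


def sensitivity_probe(scores: np.ndarray, n_trials: int = 200, seed: int = 0) -> float:
    """Crude instability probe: apply a random monotone rescaling to one task
    (an "irrelevant" change that preserves every per-task ranking) and measure
    how much the mean-score aggregate ranking shifts, averaged over trials."""
    rng = np.random.default_rng(seed)
    base_rank = np.argsort(np.argsort(-scores.mean(axis=1)))  # rank of each model, 0 = best
    shifts = []
    for _ in range(n_trials):
        perturbed = scores.copy()
        t = rng.integers(scores.shape[1])
        perturbed[:, t] **= rng.uniform(0.2, 5.0)             # monotone on [0, 1], rank-preserving
        new_rank = np.argsort(np.argsort(-perturbed.mean(axis=1)))
        tau, _ = kendalltau(base_rank, new_rank)
        shifts.append((1.0 - tau) / 2.0)
    return float(np.mean(shifts))


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    toy_scores = rng.random((8, 5))  # 8 models, 5 tasks of synthetic data
    print("diversity  :", round(diversity(toy_scores), 3))
    print("sensitivity:", round(sensitivity_probe(toy_scores), 3))
```

The paper defines both measures precisely and develops efficient approximation algorithms for them; for those, see the linked benchbench repository rather than this toy probe.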
