Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach (2403.15250v2)

Published 22 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Amidst the rapid evolution of LLMs, evaluation has become increasingly important for understanding and advancing these models. Evaluations have revealed that factors such as scale, training type, and architecture profoundly affect LLM performance. However, the extent and nature of these effects remain debated, because most assessments have been restricted to a limited number of models and data points. The effects of these factors on performance scores can be clarified more effectively through a statistical lens. Our study undertakes a thorough re-examination of these LLMs, targeting inadequacies in current evaluation methods. Leveraging a uniform evaluation framework and an expansive dataset of evaluation results, we introduce a comprehensive statistical methodology that applies ANOVA, Tukey HSD tests, GAMMs, and clustering techniques, offering a robust and transparent approach to interpreting LLM performance data. Contrary to prevailing findings, our results challenge assumptions about emergent abilities and about the influence of particular training types and architectures. These findings offer new perspectives on the characteristics, intrinsic nature, and developmental trajectories of LLMs. By providing straightforward and reliable methods for scrutinizing and reassessing LLM performance data, this study contributes a nuanced perspective on LLM efficiency and potential.
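
To make the methodology concrete, here is a minimal, hypothetical sketch (not taken from the paper) of the kind of analysis the abstract describes: a one-way ANOVA over benchmark scores grouped by training type, followed by a Tukey HSD post-hoc comparison. All data values, group labels, and column names are illustrative placeholders.

    # Hypothetical sketch of the statistical pipeline named in the abstract:
    # one-way ANOVA over benchmark scores grouped by training type, followed
    # by a Tukey HSD post-hoc test. The data below are made-up placeholders.
    import pandas as pd
    from scipy import stats
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # Illustrative evaluation results: one row per (model, benchmark) score.
    df = pd.DataFrame({
        "training_type": ["pretrained"] * 6 + ["instruction_tuned"] * 6 + ["rlhf"] * 6,
        "score": [41.2, 44.0, 39.5, 42.8, 40.1, 43.3,
                  48.6, 51.2, 47.9, 50.4, 49.1, 52.0,
                  47.5, 50.8, 49.0, 48.2, 51.1, 49.7],
    })

    # One-way ANOVA: do mean scores differ across training types?
    groups = [g["score"].to_numpy() for _, g in df.groupby("training_type")]
    f_stat, p_value = stats.f_oneway(*groups)
    print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

    # Tukey HSD: which pairs of training types differ, controlling the
    # family-wise error rate at alpha = 0.05?
    tukey = pairwise_tukeyhsd(endog=df["score"], groups=df["training_type"], alpha=0.05)
    print(tukey.summary())

A GAMM fit (for example, score modeled as a smooth function of parameter count with per-model-family effects) and a clustering step would follow the same pattern on the same table of results, but are omitted here for brevity.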
