
Rethinking Generative Large Language Model Evaluation for Semantic Comprehension (2403.07872v1)

Published 12 Mar 2024 in cs.CL

Abstract: Despite their sophisticated capabilities, LLMs encounter a major hurdle in effective assessment. This paper first revisits the prevalent evaluation method, multiple-choice question answering (MCQA), which allows for straightforward accuracy measurement. Through a comprehensive evaluation of 24 models across 11 benchmarks, we highlight several potential drawbacks of MCQA; for instance, the inconsistency between the MCQA evaluation and the generation of open-ended responses in practical scenarios. In response, we introduce an RWQ-Elo rating system, engaging 24 LLMs such as GPT-4, GPT-3.5, Google-Gemini-Pro and LLaMA-1/-2, in a two-player competitive format, with GPT-4 serving as the judge. Each LLM receives an Elo rating thereafter. This system is designed to mirror real-world usage, and for this purpose, we have compiled a new benchmark called "Real-world questions" (RWQ), comprising 20,772 authentic user inquiries. Additionally, we thoroughly analyze the characteristics of our system and compare it with prior leaderboards like AlpacaEval and MT-Bench. Our analysis reveals the stability of our RWQ-Elo system, the feasibility of registering new models, and its potential to reshape LLM leaderboards.
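
The RWQ-Elo system assigns each model a rating that is updated after every judged two-player comparison. The sketch below illustrates the standard Elo update that underlies such a rating scheme; the K-factor of 32, the initial rating of 1000, and the tie handling are illustrative assumptions, not the paper's exact RWQ-Elo configuration.

```python
# Minimal sketch of a standard Elo update for pairwise LLM comparisons.
# K-factor, initial rating, and draw handling are assumptions for illustration.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update two ratings after one match.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie
    (e.g. the outcome of one judged comparison on a single question).
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: two models start at 1000; model A wins a judged comparison.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], score_a=1.0
)
print(ratings)  # model_a gains exactly the points model_b loses (zero-sum update)
```

Because each update is zero-sum and depends only on the current ratings and the judged outcome, new models can be registered and rated by playing them against already-rated models, which is one property the paper analyzes for its leaderboard.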

Authors (3)
  1. Fangyun Wei (53 papers)
  2. Xi Chen (1035 papers)
  3. Lin Luo (27 papers)