UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs (2404.07584v3)

Published 11 Apr 2024 in cs.CL

Abstract: Evaluation is pivotal for refining LLMs, pinpointing their capabilities, and guiding enhancements. The rapid development of LLMs calls for a lightweight and easy-to-use framework for swift evaluation deployment. However, given the many implementation details involved, developing a comprehensive evaluation platform is no easy task. Existing platforms are often complex and poorly modularized, hindering seamless incorporation into research workflows. This paper introduces UltraEval, a user-friendly evaluation framework characterized by its lightweight nature, comprehensiveness, modularity, and efficiency. We identify and reimplement three core components of model evaluation (models, data, and metrics). The resulting composability allows for the free combination of different models, tasks, prompts, benchmarks, and metrics within a unified evaluation workflow. Additionally, UltraEval supports diverse models owing to a unified HTTP service and provides sufficient inference acceleration. UltraEval is now publicly available to researchers.
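
The abstract's core design point, that models, data, and metrics are independent components which can be freely recombined, with models reached through a unified HTTP service, can be illustrated with a minimal sketch. The names below (Example, HTTPModel, exact_match, run_eval) and the JSON request/response contract are assumptions made for illustration only; they are not UltraEval's actual API.

```python
# Illustrative sketch of a modular evaluation workflow: a model behind an HTTP
# endpoint, a dataset, a prompt template, and a metric combined in one run.
# All names and the JSON contract are hypothetical, not UltraEval's real interface.
import json
import urllib.request
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Example:
    prompt: str
    reference: str


class HTTPModel:
    """Queries any model served behind an HTTP endpoint (assumed JSON contract)."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def generate(self, prompt: str) -> str:
        payload = json.dumps({"prompt": prompt}).encode("utf-8")
        req = urllib.request.Request(
            self.endpoint, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["text"]


def exact_match(prediction: str, reference: str) -> float:
    """One example metric; any callable with this signature can be swapped in."""
    return float(prediction.strip() == reference.strip())


def run_eval(
    model: HTTPModel,
    data: Iterable[Example],
    metric: Callable[[str, str], float],
    prompt_template: str = "{prompt}",
) -> float:
    """Combine a model, a dataset, a prompt template, and a metric into one run."""
    scores = [
        metric(model.generate(prompt_template.format(prompt=ex.prompt)), ex.reference)
        for ex in data
    ]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Hypothetical local endpoint; any model exposed over HTTP could be plugged in.
    model = HTTPModel("http://localhost:8000/generate")
    data = [Example(prompt="2 + 2 =", reference="4")]
    print(run_eval(model, data, exact_match, prompt_template="Q: {prompt}\nA:"))
```

Because the model is only reachable through a generic HTTP call and the metric is just a callable, swapping any one of the three components (model, data, metric) leaves the other two untouched, which is the composability the paper emphasizes.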

Authors (10)
  1. Chaoqun He (5 papers)
  2. Renjie Luo (7 papers)
  3. Shengding Hu (34 papers)
  4. Yuanqian Zhao (3 papers)
  5. Jie Zhou (687 papers)
  6. Hanghao Wu (2 papers)
  7. Jiajie Zhang (30 papers)
  8. Xu Han (270 papers)
  9. Zhiyuan Liu (433 papers)
  10. Maosong Sun (337 papers)
Citations (8)

