FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models (2404.06003v1)

Published 9 Apr 2024 in cs.CL and cs.AI

Abstract: The rapid development of LLM evaluation methodologies and datasets has led to a profound challenge: integrating state-of-the-art evaluation techniques cost-effectively while ensuring reliability, reproducibility, and efficiency. Currently, there is a notable absence of a unified and adaptable framework that seamlessly integrates various evaluation approaches. Moreover, the reliability of evaluation findings is often questionable due to potential data contamination, and evaluation efficiency is commonly overlooked despite the substantial costs associated with LLM inference. In response to these challenges, we introduce FreeEval, a modular and scalable framework crafted to enable trustworthy and efficient automatic evaluations of LLMs. Firstly, FreeEval's unified abstractions simplify the integration and improve the transparency of diverse evaluation methodologies, encompassing dynamic evaluations that demand sophisticated LLM interactions. Secondly, the framework integrates meta-evaluation techniques like human evaluation and data contamination detection, which, along with the platform's dynamic evaluation modules, enhance the fairness of evaluation outcomes. Lastly, FreeEval is designed with high-performance infrastructure, including distributed computation and caching strategies, enabling extensive evaluations across multi-node, multi-GPU clusters for both open-source and proprietary LLMs.
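
To make the abstract's description of a modular, cache-aware evaluation pipeline more concrete, here is a minimal Python sketch of that general pattern: pluggable evaluation steps driven through a shared inference cache so repeated prompts are not re-sent to the model. This is an illustrative assumption of how such a framework could be organized, not FreeEval's actual API; all names (InferenceCache, EvalStep, run_pipeline, fake_infer) are hypothetical.

```python
# Hypothetical sketch of a modular, cache-aware evaluation pipeline in the
# spirit described by the abstract. Names are illustrative assumptions,
# not FreeEval's actual API.
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List


class InferenceCache:
    """Caches model outputs keyed by (model name, prompt) so repeated
    prompts do not trigger repeated LLM inference."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get_or_compute(self, model: str, prompt: str,
                       infer: Callable[[str, str], str]) -> str:
        key = self._key(model, prompt)
        if key not in self._store:
            self._store[key] = infer(model, prompt)
        return self._store[key]


@dataclass
class EvalStep:
    """One pluggable stage: builds a prompt per example and scores the
    model's response against the example's reference answer."""
    name: str
    build_prompt: Callable[[dict], str]
    score: Callable[[str, dict], float]


def run_pipeline(model: str, dataset: List[dict], steps: List[EvalStep],
                 infer: Callable[[str, str], str],
                 cache: InferenceCache) -> Dict[str, float]:
    """Runs every step over the dataset and returns the mean score per step."""
    results: Dict[str, float] = {}
    for step in steps:
        scores = []
        for example in dataset:
            prompt = step.build_prompt(example)
            response = cache.get_or_compute(model, prompt, infer)
            scores.append(step.score(response, example))
        results[step.name] = sum(scores) / max(len(scores), 1)
    return results


if __name__ == "__main__":
    # Toy backend standing in for a real LLM endpoint.
    def fake_infer(model: str, prompt: str) -> str:
        return "4" if "2 + 2" in prompt else "unknown"

    dataset = [{"question": "What is 2 + 2?", "answer": "4"}]
    exact_match = EvalStep(
        name="exact_match",
        build_prompt=lambda ex: f"Answer briefly: {ex['question']}",
        score=lambda resp, ex: float(resp.strip() == ex["answer"]),
    )
    print(run_pipeline("toy-model", dataset, [exact_match],
                       fake_infer, InferenceCache()))
```

In a full framework of the kind the abstract describes, such a cache would presumably be persisted and shared across distributed workers, which is where multi-node, multi-GPU infrastructure becomes relevant.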

Authors (9)
  1. Zhuohao Yu (15 papers)
  2. Chang Gao (54 papers)
  3. Wenjin Yao (3 papers)
  4. Yidong Wang (43 papers)
  5. Zhengran Zeng (9 papers)
  6. Wei Ye (110 papers)
  7. Jindong Wang (150 papers)
  8. Yue Zhang (618 papers)
  9. Shikun Zhang (82 papers)