TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning (2402.13125v2)

Published 20 Feb 2024 in cs.CL and cs.AI

Abstract: Recently, numerous new benchmarks have been established to evaluate the performance of LLMs, either by computing a holistic score or by employing another LLM as a judge. However, these approaches suffer from data leakage, owing to the open access of the benchmarks, and from an inflexible evaluation process. To address these issues, we introduce $\textbf{TreeEval}$, a benchmark-free evaluation method for LLMs that lets a high-performance LLM host an irreproducible evaluation session, essentially avoiding data leakage. Moreover, this LLM acts as an examiner, raising a series of questions under a topic with a tree-planning strategy that takes the current evaluation status into account when deciding which question to generate next, ensuring the completeness and efficiency of the evaluation process. We evaluate $6$ models of different parameter sizes, including $7$B, $13$B, and $33$B, and achieve the highest correlation coefficient with AlpacaEval2.0 using only around $45$ questions. We also conduct further analysis to demonstrate the robustness and reliability of TreeEval. Our code is available at https://github.com/Ashura5/TreeEval.
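
The evaluation loop described in the abstract (an examiner LLM that plans a tree of questions, consults a judge, and descends into subtopics only where the comparison is still undecided) can be pictured with a short sketch. The following is a minimal, hypothetical Python illustration under assumed interfaces; names such as `TopicNode`, `tree_eval`, `examiner`, `judge`, and `tie_threshold` are placeholders, not the authors' implementation, which lives in the linked repository.

```python
# Hypothetical sketch of a TreeEval-style, tree-planned evaluation session.
# An examiner LLM expands a topic tree: each node yields a question, two
# candidate models answer, and a judge LLM compares the answers. Nodes whose
# comparison is inconclusive are expanded into narrower subtopics; the session
# stops once a question budget is exhausted.

from dataclasses import dataclass, field
from typing import Callable, List

# Placeholder LLM call signatures -- assumptions for illustration only.
GenerateFn = Callable[[str], str]           # prompt -> text
JudgeFn = Callable[[str, str, str], float]  # question, answer_a, answer_b -> score in [-1, 1]

@dataclass
class TopicNode:
    topic: str
    children: List["TopicNode"] = field(default_factory=list)

def tree_eval(root: TopicNode,
              examiner: GenerateFn,
              model_a: GenerateFn,
              model_b: GenerateFn,
              judge: JudgeFn,
              max_questions: int = 45,
              tie_threshold: float = 0.2) -> float:
    """Return an aggregate preference score for model_a over model_b."""
    scores: List[float] = []
    frontier = [root]
    while frontier and len(scores) < max_questions:
        node = frontier.pop(0)
        # The examiner raises a question under the current topic.
        question = examiner(f"Ask one probing question about: {node.topic}")
        answer_a, answer_b = model_a(question), model_b(question)
        score = judge(question, answer_a, answer_b)
        scores.append(score)
        # If the judge cannot separate the models, descend into subtopics
        # proposed by the examiner (planning the next questions as a tree).
        if abs(score) < tie_threshold:
            subtopics = examiner(
                f"List two narrower subtopics of: {node.topic}"
            ).splitlines()
            frontier.extend(TopicNode(t.strip()) for t in subtopics if t.strip())
    return sum(scores) / max(len(scores), 1)
```

In this sketch, a near-tie at a node is the signal to expand it into narrower subtopics, mirroring the idea that the current evaluation status, rather than a fixed benchmark, determines which question is asked next.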

References (45)
  1. 01.AI. 2023. Yi.
  2. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
  3. Benchmarking foundation models with language-model-as-an-examiner. arXiv preprint arXiv:2306.04181.
  4. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109.
  5. ChatGPT’s one-year anniversary: Are open-source large language models catching up? arXiv preprint arXiv:2311.16989.
  6. Evaluating hallucinations in Chinese large language models. arXiv preprint arXiv:2310.03368.
  7. InstructEval: Towards holistic evaluation of instruction-tuned large language models. arXiv preprint arXiv:2306.04757.
  8. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
  9. Do LLMs understand social knowledge? Evaluating the sociability of large language models with SocKET benchmark. arXiv preprint arXiv:2305.14938.
  10. Hugo Touvron et al. 2023a. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  11. Rohan Anil et al. 2023b. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.
  12. GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.
  13. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
  14. Mistral 7B. arXiv preprint arXiv:2310.06825.
  15. M. G. Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1-2):81–93.
  16. Generative judge for evaluating alignment. arXiv preprint arXiv:2310.05470.
  17. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.
  18. G-Eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634.
  19. Jieyi Long. 2023. Large language model guided tree-of-thought. arXiv preprint arXiv:2305.08291.
  20. OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  21. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
  22. Branch-solve-merge improves large language model evaluation and generation. arXiv preprint arXiv:2310.15123.
  23. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10776–10787, Singapore. Association for Computational Linguistics.
  24. C. Spearman. 1904. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101.
  25. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.
  26. Stanford Alpaca: An instruction-following LLaMA model. GitHub repository.
  27. Gemini Team. 2023a. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  28. Xwin-LM Team. 2023b. Xwin-LM.
  29. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  30. The alignment handbook. https://github.com/huggingface/alignment-handbook.
  31. Zephyr: Direct distillation of LM alignment. arXiv preprint arXiv:2310.16944.
  32. Secrets of RLHF in large language models part II: Reward modeling. arXiv preprint arXiv:2401.06080.
  33. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926.
  34. GPT-NER: Named entity recognition via large language models. arXiv preprint arXiv:2304.10428.
  35. PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization. arXiv preprint arXiv:2306.05087.
  36. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
  37. Rethinking benchmark and contamination for language models with rephrased samples. arXiv preprint arXiv:2311.04850.
  38. Retrieve anything to augment large language models. arXiv preprint arXiv:2310.07554.
  39. Wider and deeper LLM networks are fairer LLM evaluators. arXiv preprint arXiv:2308.01862.
  40. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.
  41. Secrets of RLHF in large language models part I: PPO. arXiv preprint arXiv:2307.04964.
  42. AGIEval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364.
  43. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.
  44. Don’t make your LLM an evaluation benchmark cheater. arXiv preprint arXiv:2311.01964.
  45. JudgeLM: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631.