TreeEval: Benchmark-Free Evaluation of Large Language Models through Tree Planning (2402.13125v2)
Abstract: Recently, numerous benchmarks have been established to evaluate the performance of LLMs, either by computing a holistic score or by employing another LLM as a judge. However, these approaches suffer from data leakage, due to the open access of the benchmarks, and from an inflexible evaluation process. To address these issues, we introduce $\textbf{TreeEval}$, a benchmark-free evaluation method for LLMs that lets a high-performance LLM host an irreproducible evaluation session, thereby essentially avoiding data leakage. This LLM acts as an examiner, raising a series of questions under a topic with a tree-planning strategy that considers the current evaluation status when deciding the next question to generate, ensuring the completeness and efficiency of the evaluation process. We evaluate $6$ models of different parameter sizes, including $7$B, $13$B, and $33$B, and achieve the highest correlation coefficient with AlpacaEval2.0 using only around $45$ questions. We also conduct further analysis to show the robustness and reliability of TreeEval. Our code is available at https://github.com/Ashura5/TreeEval.
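The abstract describes the tree-planning loop only at a high level. The sketch below is one minimal, plausible reading of it, not the authors' implementation: the examiner, answerer, and judge calls (`examiner_fn`, `answer_a`, `answer_b`, `judge_fn`), the inconclusiveness `margin`, the depth cap, and the two-way subtopic split are all placeholder assumptions.

```python
# Minimal sketch of a TreeEval-style evaluation loop (assumptions noted above).
# A node expands into subtopic children only when the judge cannot clearly
# separate the two models on that node's question, so the tree grows where
# the comparison is hard and stays shallow elsewhere.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    topic: str
    depth: int = 0
    children: List["Node"] = field(default_factory=list)


def tree_eval(root_topic: str,
              examiner_fn: Callable[[str], str],          # topic -> question
              answer_a: Callable[[str], str],             # question -> model A answer
              answer_b: Callable[[str], str],             # question -> model B answer
              judge_fn: Callable[[str, str, str], float], # (question, ans_a, ans_b) -> score in [-1, 1]
              max_depth: int = 3,
              margin: float = 0.5) -> float:
    """Return an aggregate preference score for model A over model B."""
    scores: List[float] = []
    stack = [Node(root_topic)]
    while stack:
        node = stack.pop()
        question = examiner_fn(node.topic)
        score = judge_fn(question, answer_a(question), answer_b(question))
        scores.append(score)
        if abs(score) < margin and node.depth < max_depth:
            # Inconclusive verdict: drill down into finer-grained subtopics.
            # (Placeholder split; the paper would have the examiner propose these.)
            for i in range(2):
                node.children.append(Node(f"{node.topic} / subtopic {i}", node.depth + 1))
            stack.extend(node.children)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy usage with stub callables; real use would wrap an examiner LLM,
    # the two models under comparison, and a judge LLM.
    print(tree_eval("world history",
                    examiner_fn=lambda t: f"Explain a key fact about {t}.",
                    answer_a=lambda q: "answer from model A",
                    answer_b=lambda q: "answer from model B",
                    judge_fn=lambda q, a, b: 0.7))
```

Under this reading, the number of questions asked is not fixed in advance: a decisive judge verdict prunes a branch immediately, which is consistent with the paper's report of needing only around 45 questions per comparison.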