xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation (2405.11874v3)
Abstract: The continuous advancement of LLMs has brought increasing attention to the critical issue of developing fair and reliable methods for evaluating their performance. In particular, the emergence of cheating phenomena, such as test set leakage and prompt format overfitting, poses significant challenges to the reliable evaluation of LLMs. Because evaluation frameworks commonly use Regular Expressions (RegEx) for answer extraction, models may adjust their responses to fit formats that RegEx can easily handle. Even so, RegEx-based key answer extraction modules frequently suffer from extraction errors. Furthermore, recent studies that propose fine-tuned LLMs as judge models for automated evaluation face challenges in generalization ability and fairness. This paper comprehensively analyzes the entire LLM evaluation chain and demonstrates that optimizing the key answer extraction module improves extraction accuracy and enhances evaluation reliability. Our findings suggest that improving the key answer extraction module can lead to higher judgment accuracy and improved evaluation efficiency compared to judge models. To address these issues, we propose xFinder, a novel evaluator for answer extraction and matching in LLM evaluation. As part of this process, we create a specialized dataset, the Key Answer Finder (KAF) dataset, to ensure effective model training and evaluation. Generalization tests and real-world evaluations show that the smallest xFinder model, with only 500 million parameters, achieves an average extraction accuracy of 93.42%. In contrast, RegEx accuracy in the best evaluation framework is 74.38%. The final judgment accuracy of xFinder reaches 97.61%, outperforming existing evaluation frameworks and judge models.
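To make the abstract's point about RegEx fragility concrete, below is a minimal Python sketch of the kind of key-answer extractor many evaluation harnesses rely on. The function name, patterns, and example responses are illustrative assumptions, not code from xFinder or from any framework cited by the paper: it handles templated answers but fails on a free-form response whose intended option is obvious to a reader, which is the failure mode a learned extraction-and-matching model like xFinder targets.

```python
import re
from typing import Optional


def regex_extract_choice(response: str) -> Optional[str]:
    """Extract a multiple-choice key answer (A-D) from a model response.

    Illustrative RegEx extractor in the spirit of common evaluation
    harnesses; NOT the actual code of any framework cited in the paper.
    """
    # Case 1: responses that follow an explicit "answer is X" template.
    m = re.search(r"answer\s*(?:is)?\s*[:(]?\s*([A-D])\b", response, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Case 2: responses that lead with a bare option letter, e.g. "(B) ...".
    m = re.match(r"\s*\(?([A-D])\)?\b", response)
    return m.group(1).upper() if m else None


if __name__ == "__main__":
    # A templated response is extracted correctly ...
    print(regex_extract_choice("The answer is (B)."))  # -> B
    # ... but a semantically clear, free-form response yields no extraction,
    # i.e. an extraction error rather than a model error.
    print(regex_extract_choice("Paris, because option B names the capital of France."))  # -> None
```

Hardening such patterns case by case is exactly the brittleness the paper argues against; the alternative it proposes is to treat extraction and matching as a learned task trained on the KAF dataset.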