An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4 (2403.02839v3)
Abstract: Recently, there has been a growing trend of using LLMs to evaluate the quality of other LLMs. Many studies employ proprietary closed-source models, especially GPT-4, as the evaluator, while other works fine-tune judge models based on open-source LLMs. Although these fine-tuned judge models are claimed to achieve evaluation capability comparable to GPT-4, we conduct an empirical study of judge models and find that, while they reach high performance on in-domain test sets, even surpassing GPT-4, they underperform GPT-4 across several dimensions, including generalizability, fairness, aspect-specific evaluation, and scalability. We further show that a fine-tuned judge model inherently operates as a task-specific classifier, which accounts for these limitations. Finally, we introduce an integrated method that leverages GPT-4 to compensate for these limitations and improve the fine-tuned judges. Experimental results show that our method achieves accuracy on par with GPT-4 at only 50% of the API expense.
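The integrated method is described only at a high level here. Below is a minimal sketch of one way such a GPT-4-assisted pipeline could be realized, assuming the fine-tuned judge exposes a confidence score and GPT-4 is queried only for low-confidence cases; the callable signatures, function names, and the 0.8 threshold are illustrative assumptions, not necessarily the paper's exact implementation.

```python
# Hedged sketch of a confidence-gated cascade between a local fine-tuned judge
# and GPT-4. The routing rule, threshold, and callable signatures below are
# assumptions for illustration only.
from dataclasses import dataclass
from typing import Callable, Tuple

# A local judge returns (verdict, confidence); a verdict is e.g. "A", "B", or "tie".
LocalJudge = Callable[[str, str, str], Tuple[str, float]]
Gpt4Judge = Callable[[str, str, str], str]


@dataclass
class CascadeResult:
    verdict: str
    used_gpt4: bool


def cascade_judge(
    instruction: str,
    answer_a: str,
    answer_b: str,
    local_judge: LocalJudge,
    gpt4_judge: Gpt4Judge,
    threshold: float = 0.8,  # hypothetical confidence cutoff
) -> CascadeResult:
    """Query the cheap fine-tuned judge first; escalate to GPT-4 only when the
    local judge is not confident enough. If roughly half of the comparisons are
    escalated, the GPT-4 API expense is roughly halved."""
    verdict, confidence = local_judge(instruction, answer_a, answer_b)
    if confidence >= threshold:
        return CascadeResult(verdict=verdict, used_gpt4=False)
    return CascadeResult(
        verdict=gpt4_judge(instruction, answer_a, answer_b),
        used_gpt4=True,
    )
```

In this sketch, `local_judge` and `gpt4_judge` are placeholders for whatever inference wrappers one uses (e.g., a Hugging Face model call and an OpenAI API call); the cost saving comes entirely from how often the escalation branch is taken.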
- Hui Huang
- Yingqi Qu
- Jing Liu
- Muyun Yang
- Tiejun Zhao
- Xingyuan Bu
- Hongli Zhou
- Bing Xu