An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4 (2403.02839v3)

Published 5 Mar 2024 in cs.CL

Abstract: Recently, there has been a growing trend of utilizing LLMs to evaluate the quality of other LLMs. Many studies have employed proprietary closed-source models, especially GPT-4, as the evaluator. Alternatively, other works have fine-tuned judge models based on open-source LLMs as the evaluator. While the fine-tuned judge models are claimed to achieve evaluation capability comparable to GPT-4, in this work we conduct an empirical study of judge models. Our findings indicate that although the fine-tuned judge models achieve high performance on in-domain test sets, even surpassing GPT-4, they underperform GPT-4 across several dimensions, including generalizability, fairness, aspect-specific evaluation, and scalability. We also reveal that the fine-tuned judge model inherently operates as a task-specific classifier, which imposes these limitations. Finally, we introduce an integrated method that leverages GPT-4 to compensate for these limitations and improve the fine-tuned judges. Experimental results show that our method achieves accuracy on par with GPT-4 at only 50% of the API expense.
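
The abstract does not spell out how GPT-4 is integrated with the fine-tuned judge, but one natural way to realize "GPT-4 compensating for the fine-tuned judge at half the API cost" is a confidence-gated cascade: score each comparison with the cheap local judge first, and escalate only low-confidence cases to GPT-4. The sketch below illustrates that idea under stated assumptions; `local_judge`, `gpt4_judge`, and the 0.8 threshold are hypothetical placeholders, not the paper's verbatim method.

```python
# Hypothetical sketch of a confidence-gated judge cascade (an assumption,
# not the paper's published implementation): route easy pairwise
# comparisons to a cheap fine-tuned judge and escalate only uncertain
# ones to GPT-4, which can roughly halve API spend.

def local_judge(instruction: str, answer_a: str, answer_b: str):
    """Placeholder for a fine-tuned open-source judge model. Assumed to
    return a verdict in {"A", "B", "tie"} plus a confidence in [0, 1],
    e.g. the softmax probability of the predicted verdict token."""
    raise NotImplementedError  # assumed: plug in your fine-tuned judge

def gpt4_judge(instruction: str, answer_a: str, answer_b: str) -> str:
    """Placeholder for a GPT-4 API call with a pairwise-judging prompt."""
    raise NotImplementedError  # assumed: plug in the proprietary API call

def cascade_judge(instruction: str, answer_a: str, answer_b: str,
                  threshold: float = 0.8) -> str:
    """Return a verdict, paying for GPT-4 only on hard cases."""
    verdict, confidence = local_judge(instruction, answer_a, answer_b)
    if confidence >= threshold:
        return verdict  # trust the cheap local judge on confident cases
    return gpt4_judge(instruction, answer_a, answer_b)  # escalate the rest
```

The fraction of calls that reach GPT-4 depends on the threshold, so the 50% cost figure reported in the abstract would correspond to a particular operating point on the confidence-vs-cost trade-off.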

Authors (8)
  1. Hui Huang (159 papers)
  2. Yingqi Qu (11 papers)
  3. Jing Liu (525 papers)
  4. Muyun Yang (21 papers)
  5. Tiejun Zhao (70 papers)
  6. Xingyuan Bu (24 papers)
  7. Hongli Zhou (4 papers)
  8. Bing Xu (66 papers)