An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4

Published 5 Mar 2024 in cs.CL (arXiv:2403.02839v4)

Abstract: Recently, there has been a growing trend of utilizing LLMs to evaluate the quality of other LLMs, and many studies have fine-tuned judge models based on open-source LLMs for this purpose. While these fine-tuned judge models are claimed to achieve evaluation capability comparable to GPT-4, in this work we conduct an empirical study of LLM-as-a-Judge. Our findings indicate that although the fine-tuned judge models achieve high performance on in-domain test sets, even surpassing GPT-4, they underperform GPT-4 across several dimensions, including generalizability, fairness, and adaptability. We also reveal that the fine-tuned judge model inherently operates as a task-specific classifier, which imposes these limitations.
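
To make the setup concrete, below is a minimal sketch of the pairwise LLM-as-a-Judge protocol the abstract refers to. The prompt wording and the `query_model` callable are illustrative assumptions, not taken from the paper; any chat-completion backend (a fine-tuned judge model or GPT-4) could stand in for it.

```python
# Minimal sketch of pairwise LLM-as-a-Judge evaluation (illustrative only).
# `query_model` is a hypothetical callable mapping a prompt string to the
# model's text output; it is not an API defined by this paper.

JUDGE_PROMPT = """You are an impartial judge. Given an instruction and two
candidate responses, decide which response is better.

Instruction: {instruction}

Response A: {response_a}

Response B: {response_b}

Answer with exactly one of: "A", "B", or "Tie"."""


def judge_pair(query_model, instruction: str, response_a: str, response_b: str) -> str:
    """Ask a judge model to compare two responses to the same instruction."""
    prompt = JUDGE_PROMPT.format(
        instruction=instruction, response_a=response_a, response_b=response_b
    )
    verdict = query_model(prompt).strip()
    # A task-specific classifier (the paper's characterization of fine-tuned
    # judges) only ever emits one of these fixed labels, whereas a general
    # judge like GPT-4 can also adapt to other output formats (e.g. scores).
    return verdict if verdict in {"A", "B", "Tie"} else "Tie"


if __name__ == "__main__":
    # Toy demo with a dummy "judge" that always prefers response A.
    dummy_judge = lambda prompt: "A"
    print(judge_pair(dummy_judge, "Summarize this text.",
                     "A short, faithful summary.", "An unrelated reply."))
```

A fine-tuned judge is trained to emit exactly this kind of fixed-label verdict, which is why the paper argues it behaves like a task-specific classifier rather than a general evaluator.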
