An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4 (2403.02839v3)
Abstract: Recently, there has been a growing trend of using LLMs to evaluate the quality of other LLMs. Many studies employ proprietary closed-source models, especially GPT-4, as the evaluator, while other works fine-tune judge models based on open-source LLMs. Although these fine-tuned judge models are claimed to achieve evaluation capability comparable to GPT-4, we conduct an empirical study of judge models and find that, while they reach high performance on in-domain test sets, even surpassing GPT-4, they underperform GPT-4 across several dimensions, including generalizability, fairness, aspect-specific evaluation, and scalability. We further show that a fine-tuned judge model inherently operates as a task-specific classifier, which accounts for these limitations. Finally, we introduce an integrated method that leverages GPT-4 to compensate for these limitations and improve the fine-tuned judges. Experimental results show that our method achieves accuracy on par with GPT-4 at only 50% of the API expense.
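The integrated method is described only at a high level here. Below is a minimal sketch of one way such a GPT-4-assisted pipeline could be realized, assuming the fine-tuned judge exposes a confidence score and GPT-4 is queried only for low-confidence cases; the callable signatures, function names, and the 0.8 threshold are illustrative assumptions, not necessarily the paper's exact implementation.

```python
# Hedged sketch of a confidence-gated cascade between a local fine-tuned judge
# and GPT-4. The routing rule, threshold, and callable signatures below are
# assumptions for illustration only.
from dataclasses import dataclass
from typing import Callable, Tuple

# A local judge returns (verdict, confidence); a verdict is e.g. "A", "B", or "tie".
LocalJudge = Callable[[str, str, str], Tuple[str, float]]
Gpt4Judge = Callable[[str, str, str], str]


@dataclass
class CascadeResult:
    verdict: str
    used_gpt4: bool


def cascade_judge(
    instruction: str,
    answer_a: str,
    answer_b: str,
    local_judge: LocalJudge,
    gpt4_judge: Gpt4Judge,
    threshold: float = 0.8,  # hypothetical confidence cutoff
) -> CascadeResult:
    """Query the cheap fine-tuned judge first; escalate to GPT-4 only when the
    local judge is not confident enough. If roughly half of the comparisons are
    escalated, the GPT-4 API expense is roughly halved."""
    verdict, confidence = local_judge(instruction, answer_a, answer_b)
    if confidence >= threshold:
        return CascadeResult(verdict=verdict, used_gpt4=False)
    return CascadeResult(
        verdict=gpt4_judge(instruction, answer_a, answer_b),
        used_gpt4=True,
    )
```

In this sketch, `local_judge` and `gpt4_judge` are placeholders for whatever inference wrappers one uses (e.g., a Hugging Face model call and an OpenAI API call); the cost saving comes entirely from how often the escalation branch is taken.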
- Hui Huang
- Yingqi Qu
- Jing Liu
- Muyun Yang
- Tiejun Zhao
- Xingyuan Bu
- Hongli Zhou
- Bing Xu