JudgeLM: Evaluating LLMs as Scalable Judges
The paper "JudgeLM: Fine-tuned LLMs are Scalable Judges" introduces a novel approach for evaluating LLMs using fine-tuned models designated as JudgeLM. The core motivation of this work is the inadequacy of existing benchmarks and metrics in comprehensively assessing LLMs, particularly in open-ended scenarios. This paper offers a methodological advancement by proposing fine-tuning LLMs to function as scalable evaluation entities or 'judges,' thereby resolving limitations associated with traditional evaluation metrics.
Comprehensive Dataset and Benchmark
The authors curate a comprehensive, large-scale, and high-quality dataset for fine-tuning models to act as judges. The dataset comprises task seeds, answers generated by LLMs, and judgments produced by GPT-4. The paper also proposes a new benchmark tailored to assessing the efficacy of these fine-tuned judges. Together, the dataset and benchmark enable systematic evaluation of JudgeLM's capabilities.
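As a concrete illustration, the sketch below shows what one judge-training record and its corresponding fine-tuning example might look like. The field names and prompt wording are assumptions for illustration, not the paper's exact schema.

    # A minimal sketch of one judge-training record and how it could be turned
    # into a (prompt, target) pair. Field names are illustrative assumptions.
    import json

    record = {
        "seed_task": "Explain why the sky appears blue.",
        "answer_1": "The sky is blue because of Rayleigh scattering ...",
        "answer_2": "Blue light bounces off the ocean and colors the sky ...",
        "gpt4_judgment": {
            "score_1": 9,
            "score_2": 3,
            "reasoning": "Answer 1 gives the correct physical mechanism ...",
        },
    }

    def build_training_example(rec: dict) -> dict:
        """Turn a raw record into a (prompt, target) pair for fine-tuning a judge."""
        prompt = (
            "You are a judge. Score both answers from 1 to 10 and explain.\n\n"
            f"Question: {rec['seed_task']}\n\n"
            f"Answer 1: {rec['answer_1']}\n\n"
            f"Answer 2: {rec['answer_2']}\n\nJudgment:"
        )
        target = json.dumps(rec["gpt4_judgment"])
        return {"prompt": prompt, "target": target}

    print(build_training_example(record)["prompt"])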
Model Scalability and Analysis
JudgeLM models are trained at three scales: 7B, 13B, and 33B parameters, enabling a study of how judging capability scales with model size. The paper provides a thorough analysis of the models' behaviors and accuracies across these scales. Special emphasis is placed on the key biases that arise when fine-tuning LLMs as judges: position bias, knowledge bias, and format bias. Understanding these biases is essential for improving judgment accuracy.
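To make the position-bias notion concrete, the sketch below probes a judge by swapping the order of the two answers and measuring how often its verdict stays with the same underlying answer. The judge interface and the toy judge are assumptions for illustration, not the paper's code.

    # A minimal position-bias probe, assuming a judge(question, answer_a, answer_b)
    # callable that returns 1 or 2 for the preferred answer. The toy judge below
    # always prefers whichever answer appears first, i.e. maximal position bias.
    def toy_judge(question: str, answer_a: str, answer_b: str) -> int:
        return 1  # placeholder standing in for a real JudgeLM call

    def position_consistency(judge, pairs) -> float:
        """Fraction of pairs where the judge prefers the same underlying answer
        regardless of the order in which the two answers are presented."""
        consistent = 0
        for question, ans1, ans2 in pairs:
            first = judge(question, ans1, ans2)
            swapped = judge(question, ans2, ans1)
            if (first == 1 and swapped == 2) or (first == 2 and swapped == 1):
                consistent += 1
        return consistent / len(pairs)

    pairs = [("Q1", "good answer", "bad answer"), ("Q2", "weak answer", "strong answer")]
    print(position_consistency(toy_judge, pairs))  # 0.0 for the always-first toy judge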
Techniques to Mitigate Bias
To tackle these biases, the authors introduce swap augmentation, reference support, and reference drop: swap augmentation counters position bias by training on both answer orderings, reference support counters knowledge bias by supplying a reference answer, and reference drop counters format bias by randomly omitting that reference during training. These techniques are shown to significantly improve the precision and reliability of the JudgeLM models.
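The sketch below illustrates how swap augmentation and reference drop could be applied at the data level. Only the technique names come from the paper; the record format and drop probability are assumptions for illustration.

    # A minimal data-level sketch of swap augmentation and reference drop.
    import random

    def swap_augment(rec: dict) -> dict:
        """Create a mirrored example with the answer order (and scores) swapped,
        so the judge cannot rely on answer position. In practice the reasoning
        text would also need its 'Answer 1'/'Answer 2' references swapped."""
        return {
            "seed_task": rec["seed_task"],
            "answer_1": rec["answer_2"],
            "answer_2": rec["answer_1"],
            "gpt4_judgment": {
                "score_1": rec["gpt4_judgment"]["score_2"],
                "score_2": rec["gpt4_judgment"]["score_1"],
                "reasoning": rec["gpt4_judgment"]["reasoning"],
            },
        }

    def reference_drop(rec: dict, drop_prob: float = 0.5) -> dict:
        """Randomly remove the reference answer so the judge learns to work
        both with and without reference support."""
        out = dict(rec)
        if random.random() < drop_prob:
            out.pop("reference_answer", None)
        return out

    rec = {
        "seed_task": "Q", "answer_1": "A", "answer_2": "B",
        "reference_answer": "gold answer",
        "gpt4_judgment": {"score_1": 8, "score_2": 5, "reasoning": "..."},
    }
    print(swap_augment(rec)["gpt4_judgment"])  # scores swapped along with the answers
    print("reference_answer" in reference_drop(rec, drop_prob=1.0))  # False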
Performance and Efficiency
JudgeLM achieves state-of-the-art performance on both the existing PandaLM benchmark and the newly proposed benchmark. JudgeLM-7B is notably efficient, judging 5,000 samples in only 3 minutes on 8 A100 GPUs. This efficiency underscores the practical applicability of JudgeLM in real-world scenarios where computational resources are a limiting factor.
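As a rough sanity check on that throughput figure, 5,000 samples in 3 minutes works out to about 5,000 / 180 s ≈ 28 judgments per second in aggregate, or roughly 3.5 judgments per second per GPU if the work splits evenly across the 8 devices.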
Additionally, JudgeLM's agreement with its teacher judge (GPT-4) exceeds 90%, surpassing the roughly 82% human-to-human agreement reported in the MT-Bench paper. This high level of concordance suggests that JudgeLM can provide more consistent, and potentially more reliable, judgments than human annotators, at least in a controlled evaluation context.
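A simplified way to compute such an agreement rate is sketched below, assuming each verdict is one of "answer_1", "answer_2", or "tie"; the paper's exact metric may handle ties differently.

    # A simplified agreement-rate computation between judge and teacher verdicts.
    def agreement(judge_verdicts, teacher_verdicts) -> float:
        matches = sum(j == t for j, t in zip(judge_verdicts, teacher_verdicts))
        return matches / len(teacher_verdicts)

    judge = ["answer_1", "tie", "answer_2", "answer_1"]
    teacher = ["answer_1", "answer_2", "answer_2", "answer_1"]
    print(f"agreement = {agreement(judge, teacher):.0%}")  # 75%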
Extended Capabilities
JudgeLM is not limited to pairwise evaluation; it also demonstrates proficiency in judging single answers, multimodal models, multiple answers at once, and multi-turn chat. These extended capabilities indicate JudgeLM's potential versatility across diverse LLM-assessment settings.
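As one example of these extended modes, the sketch below builds a single-answer grading prompt with an optional reference answer. The wording is illustrative and not the paper's actual template.

    # A minimal single-answer grading prompt, one of the extended judging modes.
    from typing import Optional

    def single_answer_prompt(question: str, answer: str, reference: Optional[str] = None) -> str:
        parts = [
            "You are a judge. Rate the answer from 1 to 10 and briefly justify the score.",
            f"Question: {question}",
            f"Answer: {answer}",
        ]
        if reference is not None:
            parts.append(f"Reference answer: {reference}")
        parts.append("Judgment:")
        return "\n\n".join(parts)

    print(single_answer_prompt("What is 2 + 2?", "4", reference="4"))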
Implications and Future Directions
The implications of this research are significant for both theoretical exploration and practical applications of AI. The paper's approach provides a robust framework for LLM evaluation that could inform future developments in both model training and evaluation strategies.
Possible future research directions include further exploration of bias mitigation techniques and the refinement of fine-tuning methodologies to expand the range and depth of judgments that JudgeLM can perform. Moreover, the application of such scalable judges in various domains—such as legal, academic, and technical fields—warrants further investigation to maximize their potential impact.
In conclusion, the JudgeLM framework offers a comprehensive and scalable solution to the challenges facing LLM evaluation, with substantial implications for the future of AI research and application.