
Generative Judge for Evaluating Alignment (2310.05470v2)

Published 9 Oct 2023 in cs.CL and cs.AI

Abstract: The rapid development of LLMs has substantially expanded the range of tasks they can address. In the field of NLP, researchers have shifted their focus from conventional NLP tasks (e.g., sequence tagging and parsing) towards tasks that revolve around aligning with human needs (e.g., brainstorming and email writing). This shift in task distribution imposes new requirements on evaluating these aligned models regarding generality (i.e., assessing performance across diverse scenarios), flexibility (i.e., examining under different protocols), and interpretability (i.e., scrutinizing models with explanations). In this paper, we propose a generative judge with 13B parameters, Auto-J, designed to address these challenges. Our model is trained on user queries and LLM-generated responses under massive real-world scenarios and accommodates diverse evaluation protocols (e.g., pairwise response comparison and single-response evaluation) with well-structured natural language critiques. To demonstrate the efficacy of our approach, we construct a new testbed covering 58 different scenarios. Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models, by a large margin. We also provide detailed analysis and case studies to further reveal the potential of our method and make a variety of resources public at https://github.com/GAIR-NLP/auto-j.

Evaluation of Generative Alignment Models: A Review of 'Generative Judge for Evaluating Alignment'

The paper "Generative Judge for Evaluating Alignment" introduces Auto-J, a generative evaluation model developed to address the emerging challenges in assessing the alignment of LLMs with human needs. This research is motivated by the shift in NLP tasks, moving from traditional activities like sequence tagging to those that align more closely with human-centric tasks such as brainstorming and email composition. This paradigm shift necessitates novel evaluation methodologies focusing on generality, flexibility, and interpretability.

Methodological Innovations

Auto-J is a generative model with 13B parameters designed to function across a multitude of real-world scenarios, providing evaluations via pairwise comparison and single-response assessment. Its methodology is distinctive in three respects:

  1. Scenario and Criteria Definition: The authors define 58 distinct scenarios, accompanied by 332 evaluation criteria, designed to capture a comprehensive dataset of real-world queries and responses. This approach ensures that the model's evaluation process is informed by domain-specific knowledge, allowing it to address both content and format aspects relevant to different tasks.
  2. Training with Real-World Data: Leveraging existing datasets such as Chatbot Arena Conversations and MT-Bench, Auto-J is trained on a rich blend of queries and model-generated responses across these scenarios. GPT-4-generated evaluation judgments serve as a quality benchmark during training, underpinning a robust supervision structure.
  3. Unified Evaluation Approach: By supporting both pairwise and single-response protocols, Auto-J boasts a high degree of flexibility. The model avoids explicit scenario criteria in its input to learn these contextual cues implicitly, thereby enhancing generality.
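The unified evaluation approach above can be sketched in code. The prompt templates and verdict format below are illustrative stand-ins, not the paper's actual templates; the point is that the same judge serves both protocols, and that no scenario-specific criteria are injected into the input.

```python
# Hypothetical sketch of the two evaluation protocols a generative judge
# like Auto-J supports. Templates and parsing are illustrative only.

PAIRWISE_TEMPLATE = (
    "[Query]: {query}\n"
    "[Response 1]: {resp1}\n"
    "[Response 2]: {resp2}\n"
    "Compare the two responses and end with a verdict line "
    "'Verdict: Response 1' or 'Verdict: Response 2'."
)

SINGLE_TEMPLATE = (
    "[Query]: {query}\n"
    "[Response]: {resp}\n"
    "Critique the response, then end with 'Rating: <1-10>'."
)

def build_pairwise_prompt(query: str, resp1: str, resp2: str) -> str:
    # No scenario criteria are added to the input; the judge is expected
    # to infer the relevant criteria from the query itself.
    return PAIRWISE_TEMPLATE.format(query=query, resp1=resp1, resp2=resp2)

def parse_pairwise_verdict(generation: str) -> int:
    """Return 1 or 2 depending on which response the judge preferred."""
    last_line = generation.strip().splitlines()[-1]
    if "Response 1" in last_line:
        return 1
    if "Response 2" in last_line:
        return 2
    raise ValueError(f"no verdict found in: {last_line!r}")
```

In this sketch the structured critique precedes the final verdict line, mirroring the paper's emphasis on natural-language explanations accompanying each judgment.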

Empirical Evaluation

Auto-J excels in empirical evaluations, outperforming both open-source and proprietary models in pairwise response assessments across all 58 scenarios. Its consistency is notably high, matching GPT-4's stability even when the order of candidate responses is varied. Auto-J's win rate against other models, as judged by both GPT-4 and human experts, demonstrates its superior ability to critique responses with specificity and informativeness.
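The consistency claim refers to robustness against position bias: a judge should prefer the same underlying response when the presentation order is swapped. A minimal sketch of such a check, with a toy stand-in judge (not the paper's evaluation code):

```python
# Illustrative position-bias check. `judge` is any callable
# (query, resp_a, resp_b) -> 1 or 2; `toy_judge` is a stand-in that
# simply prefers the longer response.

def toy_judge(query: str, resp_a: str, resp_b: str) -> int:
    return 1 if len(resp_a) >= len(resp_b) else 2

def consistency_rate(judge, pairs) -> float:
    """Fraction of pairs where the judge picks the same underlying
    response regardless of presentation order."""
    consistent = 0
    for query, r1, r2 in pairs:
        first = judge(query, r1, r2)   # r1 shown first
        second = judge(query, r2, r1)  # order swapped
        # The same underlying winner means the verdicts flip with the order.
        if (first, second) in {(1, 2), (2, 1)}:
            consistent += 1
    return consistent / len(pairs)
```

A judge with severe position bias (e.g. one that always answers "Response 1") would score near zero on this metric.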

In practical terms, the model's ratings are effective in driving response selection under the Best-of-N protocol, producing outputs that receive higher GPT-4 ratings. Auto-J's capacity to provide well-structured critiques enhances the reliability and transparency of its ratings, encouraging a feedback loop for model refinement.
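Best-of-N selection itself is simple: sample N candidate responses, score each with the single-response judge, and keep the highest-rated one. A minimal sketch, where `rate` stands in for the judge's single-response rating (the toy scorer below is an assumption for illustration):

```python
# Minimal Best-of-N sketch: score each candidate with a single-response
# judge and return the best one along with its score.

def best_of_n(query, candidates, rate):
    scores = [rate(query, c) for c in candidates]
    best_idx = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_idx], scores[best_idx]

# Toy rating function (word count) standing in for a real judge's rating.
toy_rate = lambda query, resp: len(resp.split())
```

In the paper's setting, replacing `toy_rate` with Auto-J's single-response rating is what yields candidates with higher downstream GPT-4 scores.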

Theoretical and Practical Implications

The implications of this research extend beyond evaluation metrics. The architecture and training methodology of Auto-J point to a future where evaluative models are not only reactive but integrally connected to model training. This generative model can be a cornerstone in developing AI systems that are deeply aligned with user-specific goals and contextual nuances.

From a theoretical standpoint, Auto-J presents a compelling case for integrating generative critique directly into the evaluation process, significantly boosting both reliability and interpretability. Practically, the release of Auto-J and its attendant resources offers the research community a new toolkit for probing the alignment of LLMs with nuanced human-centric tasks.

Conclusion

This work not only advances the methodology for evaluating AI alignment but also sets a new benchmark for flexibility and depth in evaluative metrics. With Auto-J, the authors contribute a scalable and robust framework that is both resource- and performance-efficient, laying foundational groundwork for future advancements in AI alignment evaluation. The open-source nature of Auto-J and its dataset further underscores its potential as a valuable asset for ongoing research in AI model alignment.

Authors (6)
  1. Junlong Li (22 papers)
  2. Shichao Sun (15 papers)
  3. Weizhe Yuan (25 papers)
  4. Run-Ze Fan (9 papers)
  5. Hai Zhao (227 papers)
  6. Pengfei Liu (191 papers)