
Generative Judge for Evaluating Alignment (2310.05470v2)

Published 9 Oct 2023 in cs.CL and cs.AI

Abstract: The rapid development of LLMs has substantially expanded the range of tasks they can address. In the field of NLP, researchers have shifted their focus from conventional NLP tasks (e.g., sequence tagging and parsing) towards tasks that revolve around aligning with human needs (e.g., brainstorming and email writing). This shift in task distribution imposes new requirements on evaluating these aligned models regarding generality (i.e., assessing performance across diverse scenarios), flexibility (i.e., examining under different protocols), and interpretability (i.e., scrutinizing models with explanations). In this paper, we propose a generative judge with 13B parameters, Auto-J, designed to address these challenges. Our model is trained on user queries and LLM-generated responses under massive real-world scenarios and accommodates diverse evaluation protocols (e.g., pairwise response comparison and single-response evaluation) with well-structured natural language critiques. To demonstrate the efficacy of our approach, we construct a new testbed covering 58 different scenarios. Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models, by a large margin. We also provide detailed analysis and case studies to further reveal the potential of our method and make a variety of resources public at https://github.com/GAIR-NLP/auto-j.

Evaluation of Generative Alignment Models: A Review of 'Generative Judge for Evaluating Alignment'

The paper "Generative Judge for Evaluating Alignment" introduces Auto-J, a generative evaluation model developed to address the emerging challenges in assessing the alignment of LLMs with human needs. This research is motivated by the shift in NLP tasks, moving from traditional activities like sequence tagging to those that align more closely with human-centric tasks such as brainstorming and email composition. This paradigm shift necessitates novel evaluation methodologies focusing on generality, flexibility, and interpretability.

Methodological Innovations

Auto-J is a generative model with 13B parameters designed to function across a multitude of real-world scenarios, providing evaluations via pairwise comparison and single-response assessment. Its methodology is distinctive in three respects:

  1. Scenario and Criteria Definition: The authors define 58 distinct scenarios, accompanied by 332 evaluation criteria, designed to capture a comprehensive dataset of real-world queries and responses. This approach ensures that the model's evaluation process is informed by domain-specific knowledge, allowing it to address both content and format aspects relevant to different tasks.
  2. Training with Real-World Data: Leveraging existing datasets such as Chatbot Arena Conversations and MT-Bench, Auto-J is trained on a rich blend of queries and model-generated responses across these scenarios. GPT-4-generated evaluation judgments serve as a quality benchmark during training, underpinning a robust supervision structure.
  3. Unified Evaluation Approach: By supporting both pairwise and single-response protocols, Auto-J boasts a high degree of flexibility. The model avoids explicit scenario criteria in its input to learn these contextual cues implicitly, thereby enhancing generality.
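The unified evaluation approach above can be sketched in code. The prompt templates and verdict format below are illustrative stand-ins, not the paper's actual templates; the point is that the same judge serves both protocols, and that no scenario-specific criteria are injected into the input.

```python
# Hypothetical sketch of the two evaluation protocols a generative judge
# like Auto-J supports. Templates and parsing are illustrative only.

PAIRWISE_TEMPLATE = (
    "[Query]: {query}\n"
    "[Response 1]: {resp1}\n"
    "[Response 2]: {resp2}\n"
    "Compare the two responses and end with a verdict line "
    "'Verdict: Response 1' or 'Verdict: Response 2'."
)

SINGLE_TEMPLATE = (
    "[Query]: {query}\n"
    "[Response]: {resp}\n"
    "Critique the response, then end with 'Rating: <1-10>'."
)

def build_pairwise_prompt(query: str, resp1: str, resp2: str) -> str:
    # No scenario criteria are added to the input; the judge is expected
    # to infer the relevant criteria from the query itself.
    return PAIRWISE_TEMPLATE.format(query=query, resp1=resp1, resp2=resp2)

def parse_pairwise_verdict(generation: str) -> int:
    """Return 1 or 2 depending on which response the judge preferred."""
    last_line = generation.strip().splitlines()[-1]
    if "Response 1" in last_line:
        return 1
    if "Response 2" in last_line:
        return 2
    raise ValueError(f"no verdict found in: {last_line!r}")
```

In this sketch the structured critique precedes the final verdict line, mirroring the paper's emphasis on natural-language explanations accompanying each judgment.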

Empirical Evaluation

Auto-J excels in empirical evaluations, outperforming both open-source and proprietary models in pairwise response assessments across all 58 scenarios. Its consistency is notably high, matching GPT-4's stability even when the order of candidate responses is varied. Auto-J's win rate against other models, as judged by both GPT-4 and human experts, demonstrates its superior ability to critique responses with specificity and informativeness.
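The consistency claim refers to robustness against position bias: a judge should prefer the same underlying response when the presentation order is swapped. A minimal sketch of such a check, with a toy stand-in judge (not the paper's evaluation code):

```python
# Illustrative position-bias check. `judge` is any callable
# (query, resp_a, resp_b) -> 1 or 2; `toy_judge` is a stand-in that
# simply prefers the longer response.

def toy_judge(query: str, resp_a: str, resp_b: str) -> int:
    return 1 if len(resp_a) >= len(resp_b) else 2

def consistency_rate(judge, pairs) -> float:
    """Fraction of pairs where the judge picks the same underlying
    response regardless of presentation order."""
    consistent = 0
    for query, r1, r2 in pairs:
        first = judge(query, r1, r2)   # r1 shown first
        second = judge(query, r2, r1)  # order swapped
        # The same underlying winner means the verdicts flip with the order.
        if (first, second) in {(1, 2), (2, 1)}:
            consistent += 1
    return consistent / len(pairs)
```

A judge with severe position bias (e.g. one that always answers "Response 1") would score near zero on this metric.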

In practical terms, the model's ratings are effective in driving response selection under the Best-of-N protocol, producing outputs that receive higher GPT-4 ratings. Auto-J's capacity to provide well-structured critiques enhances the reliability and transparency of its ratings, encouraging a feedback loop for model refinement.
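Best-of-N selection itself is simple: sample N candidate responses, score each with the single-response judge, and keep the highest-rated one. A minimal sketch, where `rate` stands in for the judge's single-response rating (the toy scorer below is an assumption for illustration):

```python
# Minimal Best-of-N sketch: score each candidate with a single-response
# judge and return the best one along with its score.

def best_of_n(query, candidates, rate):
    scores = [rate(query, c) for c in candidates]
    best_idx = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_idx], scores[best_idx]

# Toy rating function (word count) standing in for a real judge's rating.
toy_rate = lambda query, resp: len(resp.split())
```

In the paper's setting, replacing `toy_rate` with Auto-J's single-response rating is what yields candidates with higher downstream GPT-4 scores.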

Theoretical and Practical Implications

The implications of this research extend beyond evaluation metrics. The architecture and training methodology of Auto-J point to a future where evaluative models are not only reactive but integrally connected to model training. This generative model can be a cornerstone in developing AI systems that are deeply aligned with user-specific goals and contextual nuances.

From a theoretical standpoint, Auto-J presents a compelling case for integrating generative critique directly into the evaluation process, significantly boosting both reliability and interpretability. Practically, the release of Auto-J and its attendant resources offers the research community a new toolkit for probing the alignment of LLMs with nuanced human-centric tasks.

Conclusion

This work not only advances the methodology for evaluating AI alignment but also sets a new benchmark for flexibility and depth in evaluative metrics. With Auto-J, the authors contribute a scalable and robust framework that is both resource- and performance-efficient, laying foundational groundwork for future advancements in AI alignment evaluation. The open-source nature of Auto-J and its dataset further underscores its potential as a valuable asset for ongoing research in AI model alignment.

Authors (6)
  1. Junlong Li (22 papers)
  2. Shichao Sun (15 papers)
  3. Weizhe Yuan (25 papers)
  4. Run-Ze Fan (9 papers)
  5. Hai Zhao (227 papers)
  6. Pengfei Liu (191 papers)