Generative Judge for Evaluating Alignment

Published Oct 9, 2023 in cs.CL and cs.AI


The rapid development of Large Language Models (LLMs) has substantially expanded the range of tasks they can address. In Natural Language Processing (NLP), research focus has shifted from conventional tasks (e.g., sequence tagging and parsing) toward tasks centered on aligning with human needs (e.g., brainstorming and email writing). This shift in task distribution imposes new requirements on evaluating aligned models: generality (assessing performance across diverse scenarios), flexibility (evaluating under different protocols), and interpretability (scrutinizing models with explanations). In this paper, we propose Auto-J, a generative judge with 13B parameters designed to address these challenges. The model is trained on user queries and LLM-generated responses drawn from a large set of real-world scenarios, and it supports diverse evaluation protocols (e.g., pairwise response comparison and single-response evaluation) with well-structured natural language critiques. To demonstrate the efficacy of the approach, we construct a new testbed covering 58 different scenarios. Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models, by a large margin. We also provide detailed analysis and case studies to further reveal the potential of the method, and we release a variety of resources at https://github.com/GAIR-NLP/auto-j.
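The abstract names two evaluation protocols the judge supports: pairwise response comparison and single-response evaluation. The sketch below illustrates how prompts for the two protocols might be constructed; the wording of the templates is a placeholder for illustration only — the official templates and usage instructions are in the GAIR-NLP/auto-j repository.

```python
# Illustrative sketch of the two evaluation protocols described in the
# abstract. These template strings are hypothetical, not Auto-J's
# official prompt format (see https://github.com/GAIR-NLP/auto-j).

def build_pairwise_prompt(query: str, response_a: str, response_b: str) -> str:
    """Pairwise response comparison: ask the judge which of two responses is better."""
    return (
        "You are assessing two responses to a user query.\n"
        f"[Query]: {query}\n"
        f"[Response 1]: {response_a}\n"
        f"[Response 2]: {response_b}\n"
        "Compare the two responses, explain your reasoning, "
        "then state which response is better."
    )

def build_single_prompt(query: str, response: str) -> str:
    """Single-response evaluation: ask the judge to critique and rate one response."""
    return (
        "You are assessing a response to a user query.\n"
        f"[Query]: {query}\n"
        f"[Response]: {response}\n"
        "Write a structured critique of the response, "
        "then give an overall rating from 1 to 10."
    )
```

Either prompt would then be fed to the judge model as ordinary generation input, and the natural-language critique (plus verdict or rating) parsed from its output.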


