Emergent Mind


Aligning large language models (LLMs) with human preferences is crucial for enhancing their utility in terms of helpfulness, truthfulness, safety, harmlessness, and interestingness. Existing methods for achieving this alignment often involve employing reinforcement learning from human feedback (RLHF) to fine-tune LLMs based on human labels assessing the relative quality of model responses. Nevertheless, RLHF is susceptible to instability during fine-tuning and presents challenges in implementation. Drawing inspiration from the emerging field of representation engineering (RepE), this study aims to identify relevant representations for high-level human preferences embedded in patterns of activity within an LLM, and achieve precise control of model behavior by transforming its representations. This novel approach, denoted as Representation Alignment from Human Feedback (RAHF), proves to be effective, computationally efficient, and easy to implement. Extensive experiments demonstrate the efficacy of RAHF in not only capturing but also manipulating representations to align with a broad spectrum of human preferences or values, rather than being confined to a singular concept or function (e.g. honesty or bias). RAHF's versatility in accommodating diverse human preferences shows its potential for advancing LLM performance.
Figure: Normalized comparative win rates of RAHF-SCIT, RAHF-DualLLMs, and other methods at various sampling temperatures.


  • The paper introduces a new method, Representation Alignment from Human Feedback (RAHF), as an alternative to RLHF for aligning LLMs with human preferences, focusing on internal representations rather than outputs.

  • RAHF includes two variants: one uses a single instruction-tuned LLM, and the other employs dual LLMs, one for preferred and one for dispreferred outcomes.

  • Comparative experiments demonstrate RAHF, and particularly RAHF-DualLLMs, outperforming RL-free methods in aligning with human values according to various evaluation metrics.

  • Human evaluations provided more nuanced judgments compared to automated metrics, often resulting in ties rather than binary decisions.

  • The study suggests RAHF as an effective alternative to resource-intensive RL methods, paving the way for future development in controllable LLMs aligned with human ethical and preference frameworks.


Aligning LLMs with human preferences remains a significant challenge. Existing methods often rely on Reinforcement Learning from Human Feedback (RLHF), but such techniques are complex to implement and computationally expensive. The paper by Wenhao Liu et al., from the School of Computer Science at Fudan University, introduces a method that aims to streamline this process. This method, Representation Alignment from Human Feedback (RAHF), sidesteps these RLHF challenges by focusing on the representations within the LLM and altering them to reflect human preferences more accurately.

Method and Related Work

This study identifies and manipulates latent representations in the LLM that correspond to human preferences. By altering these representations rather than directly optimizing model outputs, RAHF aims to capture human preferences that extend beyond a single concept to a broad spectrum of values. This approach contrasts with RLHF techniques, which fine-tune models only on human-labeled data and rankings to elicit the desired response and may therefore oversimplify how human preferences are expressed.

The authors describe two methods within RAHF: the first uses a single instruction-tuned LLM capable of generating both preferred and dispreferred responses; the second uses dual LLMs, one optimized for preferred outcomes and one for dispreferred outcomes. Both paradigms allow the LLM's behavior to be adjusted through manipulation of its underlying representations, avoiding direct exposure to potentially biased or noisy data.
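The idea of contrasting preferred and dispreferred activity patterns can be illustrated with a small sketch. This is not the authors' implementation: the summary does not say how the representations are extracted or transformed, so the difference-of-means direction, the additive `steer` step, the `alpha` strength, and the toy activation values below are all assumptions chosen for illustration.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*vectors)]


def preference_direction(preferred, dispreferred):
    """Assumed recipe: mean preferred activation minus mean dispreferred one."""
    mu_pos = mean_vector(preferred)
    mu_neg = mean_vector(dispreferred)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def steer(hidden, direction, alpha=1.0):
    """Shift one hidden state along the direction; alpha is a hypothetical strength."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional activations (made-up numbers, not taken from the paper).
preferred_acts = [[1.0, 2.0, 0.0], [3.0, 2.0, 2.0]]
dispreferred_acts = [[0.0, 1.0, 1.0], [2.0, 1.0, 3.0]]

direction = preference_direction(preferred_acts, dispreferred_acts)
steered = steer([0.5, 0.5, 0.5], direction, alpha=0.5)
print(direction)
print(steered)
```

In the single-LLM variant, both sets of activations would come from one model prompted in two ways, while in the dual-LLM variant they would come from the two separately optimized models; the contrast step itself would look the same under this sketch's assumptions.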

Experiments and Results

In a series of rigorous experiments, RAHF's effectiveness is validated against various competing methods under several metrics, including human evaluations and automated assessments like reward model scores and GPT-4 evaluations. The results show RAHF, especially the RAHF-DualLLMs variant, consistently surpassing RL-free approaches across both reward models. Notably, while DPO achieves higher rewards at higher temperatures, the instability associated with high-temperature sampling leads to a preference for results obtained at lower temperatures, where RAHF-DualLLMs exhibits solid performance.
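To see why high-temperature sampling is less stable, it helps to look at how temperature reshapes a next-token distribution. The sketch below is a generic temperature-scaled softmax, not code from the paper; the logits are hypothetical values picked only for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.0]

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature rises, probability mass moves from the top-scoring token toward the others, so repeated samples vary more from run to run. That variability is one reason results at lower temperatures are easier to compare across methods.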

The study also describes extensive human evaluations that corroborate the superiority of RAHF methods over RL-free techniques, suggesting better alignment with complex human preferences. Notably, human participants gave 'tie' judgments more often than GPT-4, whose decisions were mostly binary, indicating nuances in human assessment that automated metrics may not capture.
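A minimal sketch of how pairwise judgments could be tallied into win, tie, and loss rates is shown below. The function name `outcome_rates` and both label lists are hypothetical; the lists are invented to show the bookkeeping and do not reproduce any result from the paper.

```python
from collections import Counter

LABELS = ("win", "tie", "loss")


def outcome_rates(judgments):
    """Return the fraction of win, tie, and loss labels in a list of judgments."""
    if not judgments:
        raise ValueError("judgments must not be empty")
    unknown = set(judgments) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(judgments)
    return {label: counts[label] / len(judgments) for label in LABELS}


# Invented labels for eight comparisons (illustrative only).
human_judgments = ["win", "tie", "tie", "loss", "win", "tie", "win", "tie"]
binary_judgments = ["win", "win", "loss", "loss", "win", "win", "win", "loss"]

human_rates = outcome_rates(human_judgments)
binary_rates = outcome_rates(binary_judgments)
print(human_rates)
print(binary_rates)
```

Keeping ties as their own category, rather than forcing each comparison into a win or a loss, preserves the distinction the paragraph above draws between human raters and a binary automated judge.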


This research moves away from the resource-heavy reinforcement learning techniques conventionally used to align LLMs with human preferences. RAHF instead relies on representation engineering to improve how well the model's behavior aligns with human values. The paper presents RAHF as a compelling alternative to existing RL-based and RL-free fine-tuning methods, one that could support future work on controllable LLMs. The study therefore marks a step towards better understanding and control of LLM outputs in a manner more attuned to human ethical and preference frameworks.


