Introduction
In the field of large language models (LLMs), achieving alignment with human preferences remains a significant challenge. Existing methods typically rely on Reinforcement Learning from Human Feedback (RLHF), but such techniques bring considerable complexity and computational cost. The paper by Wenhao Liu et al., from the School of Computer Science at Fudan University, introduces a method that aims to streamline this process. The method, Representation Alignment from Human Feedback (RAHF), sidesteps the challenges of RLHF by focusing on the representations within the LLM and altering them so that the model more accurately reflects human preferences.
Method and Related Work
The paper takes an innovative path by identifying and manipulating the latent representations in the LLM that correspond to human preferences. By altering these representations rather than directly shaping the model's outputs, RAHF captures an understanding of human preferences that extends beyond single concepts to a broad spectrum of values and preferences. This contrasts with RLHF techniques, which fine-tune models solely on human-labeled data and rankings to elicit the desired responses and can oversimplify how human preferences are expressed.
The authors describe two instantiations of RAHF: the first uses a single instruction-tuned LLM capable of generating both preferred and dispreferred responses; the second uses two LLMs, one optimized for preferred outcomes and one for dispreferred outcomes. Both paradigms allow the LLM's behavior to be adjusted by manipulating its underlying representations, avoiding direct exposure to potentially biased or noisy data.
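The paper's exact training procedure is not reproduced here. As a rough illustration of the representation-manipulation idea, the sketch below estimates a "preference direction" from paired preferred and dispreferred responses and nudges a LLaMA-style model's hidden states along that direction at inference time. The model name, layer index, steering strength, and example pairs are all assumptions for illustration, not the authors' configuration.

```python
# A minimal sketch of representation steering, assuming a LLaMA-style model
# from Hugging Face Transformers. This is NOT the authors' exact RAHF training
# procedure; it only illustrates nudging hidden representations toward a
# "preferred" behavior. Model name, layer index, steering strength, and the
# paired examples are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumption: any LLaMA-style causal LM
LAYER = 15                               # assumption: a mid-depth decoder layer
ALPHA = 4.0                              # assumption: steering strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_hidden(texts, layer):
    """Average hidden state at `layer`, over tokens and examples."""
    vecs = []
    for text in texts:
        ids = tok(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        vecs.append(out.hidden_states[layer].mean(dim=1))  # (1, hidden_size)
    return torch.cat(vecs).mean(dim=0)                     # (hidden_size,)

# Hypothetical paired data: preferred vs. dispreferred answers to the same prompt.
preferred = ["Q: How do I stay safe online?\nA: Use strong, unique passwords and enable 2FA."]
dispreferred = ["Q: How do I stay safe online?\nA: Just reuse one easy password everywhere."]

# "Preference direction": how activations differ between the two behaviors.
direction = mean_hidden(preferred, LAYER) - mean_hidden(dispreferred, LAYER)

def steer(module, inputs, output):
    # Decoder layers may return a tuple whose first element is the hidden state.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
try:
    ids = tok("Q: How do I stay safe online?\nA:", return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always restore the unsteered model
```

In contrast to this inference-time hook, RAHF as described in the paper fine-tunes the model itself so that its representations reflect the desired preferences; the sketch only conveys the underlying intuition.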
Experiments and Results
In a series of experiments, RAHF's effectiveness is validated against several competing methods under multiple metrics, including human evaluations and automated assessments such as reward-model scores and GPT-4 judgments. The results show that RAHF, especially the RAHF-DualLLMs variant, consistently surpasses other RL-free approaches under both reward models. Notably, while DPO achieves higher rewards at higher sampling temperatures, the instability of high-temperature sampling leads the authors to favor results obtained at lower temperatures, where RAHF-DualLLMs performs strongly.
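As a hedged sketch of this kind of automatic evaluation (not the paper's exact setup), the snippet below samples completions at several temperatures and scores them with an off-the-shelf reward model. The policy model, reward model, and prompt are illustrative assumptions and are not necessarily those used in the paper.

```python
# Illustrative temperature sweep: generate responses at different sampling
# temperatures and score them with a publicly available reward model.
# Model names and prompts are assumptions, not the paper's configuration.
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

POLICY = "meta-llama/Llama-2-7b-hf"                        # assumption
REWARD = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumption

gen_tok = AutoTokenizer.from_pretrained(POLICY)
gen_model = AutoModelForCausalLM.from_pretrained(POLICY)
rm_tok = AutoTokenizer.from_pretrained(REWARD)
rm = AutoModelForSequenceClassification.from_pretrained(REWARD)

def reward_score(prompt: str, response: str) -> float:
    """Scalar reward the reward model assigns to a (prompt, response) pair."""
    inputs = rm_tok(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return rm(**inputs).logits[0].item()

prompts = ["How can I make my code easier to maintain?"]  # illustrative prompt set
for temperature in (0.25, 0.75, 1.0):                     # temperature sweep
    scores = []
    for prompt in prompts:
        ids = gen_tok(prompt, return_tensors="pt")
        out = gen_model.generate(**ids, do_sample=True, temperature=temperature,
                                 max_new_tokens=128)
        # Strip the prompt tokens so only the model's continuation is scored.
        response = gen_tok.decode(out[0][ids["input_ids"].shape[1]:],
                                  skip_special_tokens=True)
        scores.append(reward_score(prompt, response))
    print(f"temperature={temperature}: mean reward = {sum(scores) / len(scores):.3f}")
```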
The paper also describes extensive human evaluations that corroborate the advantage of RAHF over other RL-free techniques, suggesting better alignment with complex human preferences. Interestingly, human annotators offered 'tie' judgments more often than GPT-4's largely binary decisions, pointing to nuances in human assessment that automated metrics may not capture.
Conclusion
The research charts a path away from the resource-heavy reinforcement learning techniques conventionally used to align LLMs with human preferences. RAHF instead focuses on representation engineering to improve the model's alignment with human values. Extending beyond the capabilities of existing RL-based and RL-free fine-tuning methods, RAHF serves as a compelling alternative and may catalyze future work on controllable LLMs. The paper therefore marks a significant step toward better understanding and control of LLM outputs in a manner more attuned to human ethics and preferences.
Authors
- Wenhao Liu
- Xiaohua Wang
- Muling Wu
- Tianlong Li
- Changze Lv
- Zixuan Ling
- Jianhao Zhu
- Cenyuan Zhang
- Xiaoqing Zheng
- Xuanjing Huang