Multi-Objective Reinforcement Learning from AI Feedback (2406.07295v2)

Published 11 Jun 2024 in cs.LG

Abstract: This paper presents Multi-Objective Reinforcement Learning from AI Feedback (MORLAIF), a novel approach to improving the alignment and performance of LLMs trained using reinforcement learning from AI feedback (RLAIF). In contrast to standard approaches that train a single preference model to represent all human preferences, MORLAIF decomposes this task into multiple simpler principles, such as toxicity, factuality, and sycophancy. Separate preference models are trained for each principle using feedback from GPT-3.5-Turbo. These preference model scores are then combined using different scalarization functions to provide a reward signal for Proximal Policy Optimization (PPO) training of the target LLM. Our experiments indicate that MORLAIF outperforms the standard RLAIF baselines and that MORLAIF can be used to align larger LLMs using smaller ones. Surprisingly, the choice of scalarization function does not appear to significantly impact the results.
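The reward construction the abstract describes can be sketched in a few lines: score a response under each principle-specific preference model, then scalarize the per-principle scores into a single reward for PPO. The sketch below is a minimal illustration under assumed names; the preference models are stubs and the two scalarization functions (weighted sum and worst-case) are common choices, not confirmed details of the paper's implementation.

```python
from typing import Callable, Dict

# Hypothetical principle-specific preference models: each maps a candidate
# response to a scalar score, higher meaning better on that principle
# (e.g. low toxicity, high factuality, low sycophancy).
PreferenceModel = Callable[[str], float]

def scalarize_weighted_sum(scores: Dict[str, float],
                           weights: Dict[str, float]) -> float:
    """Linear scalarization: reward = sum_i w_i * s_i."""
    return sum(weights[name] * s for name, s in scores.items())

def scalarize_worst_case(scores: Dict[str, float]) -> float:
    """Max-min scalarization: reward is the lowest principle score."""
    return min(scores.values())

def morlaif_reward(response: str,
                   models: Dict[str, PreferenceModel],
                   weights: Dict[str, float]) -> float:
    """Score a response under every principle model, then scalarize
    the scores into the single reward signal used for PPO training."""
    scores = {name: model(response) for name, model in models.items()}
    return scalarize_weighted_sum(scores, weights)

# Usage with stub models standing in for the trained preference models:
if __name__ == "__main__":
    models = {
        "toxicity": lambda r: 0.9,    # stub score: response is non-toxic
        "factuality": lambda r: 0.7,
        "sycophancy": lambda r: 0.8,
    }
    weights = {"toxicity": 0.4, "factuality": 0.4, "sycophancy": 0.2}
    print(morlaif_reward("some model response", models, weights))
```

Consistent with the abstract's finding that the choice of scalarization function does not significantly affect results, swapping `scalarize_weighted_sum` for `scalarize_worst_case` above would be expected to yield similar downstream behavior.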

Authors (1)
  1. Marcus Williams