The Power of Active Multi-Task Learning in Reinforcement Learning from Human Feedback (2405.11226v1)
Abstract: Reinforcement learning from human feedback (RLHF) has contributed to performance improvements in LLMs. To reduce its reliance on large amounts of human-labeled data, a successful approach is multi-task representation learning, which learns a high-quality, low-dimensional representation from a wide range of source tasks. In this paper, we formulate RLHF as a contextual dueling bandit problem and assume a common linear representation across tasks. We demonstrate that the sample complexity of the source tasks in multi-task RLHF can be reduced by considering task relevance and allocating larger sample sizes to the more relevant source tasks. We further propose an algorithm that estimates task relevance from a small amount of additional data and then learns a policy. We prove that, to achieve an $\varepsilon$-optimal policy, the sample complexity of the source tasks can be significantly reduced compared to uniform sampling. Moreover, thanks to representation learning, the sample complexity of the target task is only linear in the dimension of the latent space.
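The two ingredients of the abstract can be sketched concretely: a linear reward acting through a shared low-dimensional representation, a Bradley-Terry preference model that turns reward differences into comparison probabilities (the standard dueling-bandit feedback model), and a relevance-proportional sample allocation over source tasks. This is a minimal illustrative sketch, not the paper's algorithm; the matrix `B`, the parameter `theta`, and the `allocate` rule are hypothetical stand-ins for the quantities the paper estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 20, 3  # ambient feature dimension, latent dimension (k << d)

# Shared linear representation B (orthonormal columns) and a task-specific
# low-dimensional reward parameter theta -- both hypothetical placeholders.
B = np.linalg.qr(rng.normal(size=(d, k)))[0]
theta = rng.normal(size=k)

def reward(x):
    """Linear reward through the shared representation: r(x) = theta^T B^T x."""
    return theta @ (B.T @ x)

def pref_prob(x1, x2):
    """Bradley-Terry probability that action x1 is preferred over x2:
    sigma(r(x1) - r(x2))."""
    return 1.0 / (1.0 + np.exp(-(reward(x1) - reward(x2))))

def allocate(relevance, budget):
    """Split a total sample budget across source tasks in proportion to
    (estimated) task relevance -- a simple non-uniform scheme standing in
    for the paper's allocation rule."""
    w = np.asarray(relevance, dtype=float)
    w = w / w.sum()
    return np.round(budget * w).astype(int)
```

For example, `allocate([0.5, 0.3, 0.2], 1000)` gives `[500, 300, 200]`, whereas uniform sampling would spend the budget equally regardless of how informative each source task is for the target.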
Authors: Ruitao Chen, Liwei Wang