Conditions on Preference Relations that Guarantee the Existence of Optimal Policies (2311.01990v2)
Abstract: Learning from Preferential Feedback (LfPF) plays an essential role in training LLMs, as well as certain types of interactive learning agents. However, a substantial gap exists between the theory and application of LfPF algorithms. Current results guaranteeing the existence of optimal policies in LfPF problems assume that both the preferences and transition dynamics are determined by a Markov Decision Process. We introduce the Direct Preference Process, a new framework for analyzing LfPF problems in partially-observable, non-Markovian environments. Within this framework, we establish conditions that guarantee the existence of optimal policies by considering the ordinal structure of the preferences. We show that a decision-making problem can have optimal policies -- that are characterized by recursive optimality equations -- even when no reward function can express the learning goal. These findings underline the need to explore preference-based learning strategies which do not assume that preferences are generated by reward.
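For orientation only (this equation is standard textbook material, not taken from the paper), the reward-based recursive characterization that the abstract's preference-based "recursive optimality equations" generalize is the Bellman optimality equation for a Markov Decision Process with reward $r$, discount $\gamma$, and transition kernel $P$:

$$
V^{*}(s) \;=\; \max_{a \in \mathcal{A}} \Big[\, r(s,a) \;+\; \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V^{*}(s') \,\Big].
$$

When the learning goal is specified only by an ordinal preference relation over outcomes rather than by a reward function, the maximization over expected returns above has no direct counterpart; the paper's contribution is to give conditions on that preference relation under which an analogous recursive characterization of optimal policies still exists.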