Provable Interactive Learning with Hindsight Instruction Feedback (2404.09123v1)

Published 14 Apr 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: We study interactive learning in a setting where the agent has to generate a response (e.g., an action or trajectory) given a context and an instruction. In contrast to typical approaches that train the system using reward or expert supervision on the response, we study learning with hindsight instruction, where a teacher provides the instruction that is most suitable for the agent's generated response. This hindsight labeling of the instruction is often easier to provide than expert supervision of the optimal response, which may require expert knowledge or be impractical to elicit. We initiate the theoretical analysis of interactive learning with hindsight labeling. We first provide a lower bound showing that, in general, the regret of any algorithm must scale with the size of the agent's response space. We then study a specialized setting where the underlying instruction-response distribution can be decomposed as a low-rank matrix. We introduce an algorithm called LORIL for this setting and show that its regret scales as $\sqrt{T}$, where $T$ is the number of rounds; the regret depends on the intrinsic rank but not on the size of the agent's response space. We provide experiments in two domains showing that LORIL outperforms baselines even when the low-rank assumption is violated.
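
To make the low-rank condition concrete, here is an illustrative sketch; the notation is ours and may differ from the paper's. The assumption is that the teacher's hindsight distribution factorizes as $\mathbb{P}(x \mid y) = \langle \phi^*(x), \mu^*(y) \rangle$ for some rank-$d$ embeddings $\phi^*$ of instructions $x$ and $\mu^*$ of responses $y$. Under such a factorization, the learner only needs to resolve uncertainty in a $d$-dimensional space rather than over all responses, which is why a regret bound of the form $\widetilde{O}(\sqrt{d\,T})$ can avoid any dependence on the size of the response space.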

