
Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies (2402.12673v1)

Published 20 Feb 2024 in cs.LG

Abstract: In light of the burgeoning success of reinforcement learning (RL) in diverse real-world applications, considerable focus has been directed towards ensuring RL policies are robust to adversarial attacks during test time. Current approaches largely revolve around solving a minimax problem to prepare for potential worst-case scenarios. While effective against strong attacks, these methods often compromise performance in the absence of attacks or the presence of only weak attacks. To address this, we study policy robustness under the well-accepted state-adversarial attack model, extending our focus beyond only worst-case attacks. We first formalize this task at test time as a regret minimization problem and establish its intrinsic hardness in achieving sublinear regret when the baseline policy is from a general continuous policy class, $\Pi$. This finding prompts us to \textit{refine} the baseline policy class $\Pi$ prior to test time, aiming for efficient adaptation within a finite policy class $\tilde{\Pi}$, which can resort to an adversarial bandit subroutine. In light of the importance of a small, finite $\tilde{\Pi}$, we propose a novel training-time algorithm to iteratively discover \textit{non-dominated policies}, forming a near-optimal and minimal $\tilde{\Pi}$, thereby ensuring both robustness and test-time efficiency. Empirical validation on MuJoCo corroborates the superiority of our approach in terms of natural and robust performance, as well as adaptability to various attack scenarios.
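
The abstract's test-time step reduces defense over the finite refined class $\tilde{\Pi}$ to an adversarial bandit problem. Below is a minimal, hedged sketch of that reduction using an EXP3-style selector, together with the non-dominance criterion the training-time step relies on. The names `run_episode`, `max_return`, and the per-attack `returns` vectors are hypothetical stand-ins; this illustrates the general technique, not the authors' exact algorithm.

```python
import numpy as np

def is_non_dominated(candidate, others):
    """Non-dominance check over per-attack return vectors (hypothetical
    helper): `candidate` survives if no other policy does at least as well
    against every attack and strictly better against some attack."""
    return not any(
        np.all(o >= candidate) and np.any(o > candidate) for o in others
    )

def exp3_select_policy(policies, run_episode, num_rounds, max_return):
    """EXP3-style adversarial bandit over a finite policy class.

    policies    -- candidate policies forming the refined class
    run_episode -- callable(policy) -> episodic return under the
                   (possibly adversarial) test-time environment
    max_return  -- known upper bound used to scale returns into [0, 1]
    """
    k = len(policies)
    lr = np.sqrt(2.0 * np.log(k) / (num_rounds * k))  # standard EXP3 rate
    log_weights = np.zeros(k)
    for _ in range(num_rounds):
        # sampling distribution: exponential weights (softmax)
        probs = np.exp(log_weights - log_weights.max())
        probs /= probs.sum()
        i = np.random.choice(k, p=probs)
        reward = np.clip(run_episode(policies[i]) / max_return, 0.0, 1.0)
        # importance-weighted loss estimate for the pulled arm only
        log_weights[i] -= lr * (1.0 - reward) / probs[i]
    return policies[int(np.argmax(log_weights))]
```

Against an adaptive attacker, an EXP3-style selector attains $O(\sqrt{Tk\log k})$ regret relative to the best fixed policy in the class, which is why the paper emphasizes keeping $\tilde{\Pi}$ small and non-dominated.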

Authors (5)
  1. Xiangyu Liu
  2. Chenghao Deng
  3. Yanchao Sun
  4. Yongyuan Liang
  5. Furong Huang
