Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback (2404.10776v1)

Published 16 Apr 2024 in cs.LG

Abstract: Learning from human feedback plays an important role in aligning generative models such as large language models (LLMs). However, the effectiveness of this approach can be undermined by adversaries who intentionally provide misleading preferences to steer the output in an undesirable or harmful direction. To tackle this challenge, we study a specific model within this problem domain: contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary. We propose an algorithm, robust contextual dueling bandit (RCDB), based on uncertainty-weighted maximum likelihood estimation. Our algorithm achieves an $\tilde O(d\sqrt{T}+dC)$ regret bound, where $T$ is the number of rounds, $d$ is the dimension of the context, and $0 \le C \le T$ is the total number of adversarially flipped preference labels. We also prove a lower bound showing that our regret bound is nearly optimal, both with adversarial feedback and without it ($C=0$). Additionally, we conduct experiments to evaluate the proposed algorithm against various types of adversarial feedback. The results demonstrate its superiority over state-of-the-art dueling bandit algorithms in the presence of adversarial feedback.
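
The estimator described in the abstract is straightforward to prototype. The sketch below shows uncertainty-weighted maximum likelihood estimation for a linear preference model $P(a \succ b \mid x) = \sigma(\langle \theta, \phi(x,a) - \phi(x,b) \rangle)$: each round's label is down-weighted when its feature direction is still uncertain, which caps how much any single (possibly flipped) label can move the estimate. The specific weight rule min(1, alpha/sigma), the cap alpha, and the plain gradient solver are illustrative assumptions for this sketch, not the paper's exact construction.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def uncertainty_weight(z, Sigma_inv, alpha=1.0):
    """Illustrative uncertainty weight for one round (hypothetical rule).

    z         : (d,) feature difference phi(x, a) - phi(x, b)
    Sigma_inv : (d, d) inverse of a (regularized, weighted) Gram matrix
    Capping the weight at 1 bounds the influence of any single label.
    """
    sigma = np.sqrt(z @ Sigma_inv @ z)  # elliptical norm ||z||_{Sigma^{-1}}
    return min(1.0, alpha / sigma) if sigma > 1e-12 else 1.0

def weighted_mle(Z, y, w, lam=1.0, lr=0.1, iters=500):
    """Fit theta by gradient descent on the weighted, regularized
    logistic loss (a simple stand-in for the paper's MLE step).

    Z : (t, d) per-round feature differences
    y : (t,)   observed preference labels in {0, 1}; an adversary may
               have flipped up to C of them
    w : (t,)   per-round uncertainty weights in (0, 1]
    """
    theta = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = sigmoid(Z @ theta)                 # predicted win probabilities
        grad = Z.T @ (w * (p - y)) + lam * theta
        theta -= lr * grad / len(y)
    return theta
```

Intuitively, a flipped label is only damaging in rounds where the learner is still uncertain about the relevant feature direction, and those are exactly the rounds the weights shrink; this is roughly how the corruption cost is confined to the additive $dC$ term in the regret bound rather than multiplying the $d\sqrt{T}$ term.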

Authors (3)
  1. Qiwei Di
  2. Jiafan He
  3. Quanquan Gu
