
Soft Condorcet Optimization for Ranking of General Agents (2411.00119v4)

Published 31 Oct 2024 in cs.MA and cs.LG

Abstract: Driving progress of AI models and agents requires comparing their performance on standardized benchmarks; for general agents, individual performances must be aggregated across a potentially wide variety of different tasks. In this paper, we describe a novel ranking scheme inspired by social choice frameworks, called Soft Condorcet Optimization (SCO), to compute the optimal ranking of agents: the one that makes the fewest mistakes in predicting the agent comparisons in the evaluation data. This optimal ranking is the maximum likelihood estimate when evaluation data (which we view as votes) are interpreted as noisy samples from a ground truth ranking, a solution to Condorcet's original voting system criteria. SCO ratings are maximal for Condorcet winners when they exist, which we show is not necessarily true for the classical rating system Elo. We propose three optimization algorithms to compute SCO ratings and evaluate their empirical performance. When serving as an approximation to the Kemeny-Young voting method, SCO rankings are on average 0 to 0.043 away from the optimal ranking in normalized Kendall-tau distance across 865 preference profiles from the PrefLib open ranking archive. In a simulated noisy tournament setting, SCO achieves accurate approximations to the ground truth ranking and the best among several baselines when 59% or more of the preference data is missing. Finally, SCO ranking provides the best approximation to the optimal ranking, measured on held-out test sets, in a problem containing 52,958 human players across 31,049 games of the classic seven-player game of Diplomacy.
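The core idea of the abstract, ranking agents by minimizing a smooth count of mispredicted pairwise comparisons, can be illustrated with a small sketch. This is not the paper's implementation: the learning rate, temperature, and gradient-descent loop below are assumptions chosen for the toy example, and the paper proposes three specific optimization algorithms not reproduced here. The sketch fits one rating per agent by gradient descent on a sigmoidal surrogate of the ranking-mistake count, so each vote `(winner, loser)` contributes a "soft mistake" that shrinks as the winner's rating rises above the loser's:

```python
import numpy as np

def sco_style_ratings(votes, n_agents, tau=1.0, lr=0.5, steps=2000, seed=0):
    """Fit agent ratings by minimizing a soft (sigmoidal) count of pairwise
    ranking mistakes, in the spirit of SCO.  Each vote is a (winner, loser)
    pair; its loss sigma((r_loser - r_winner) / tau) smoothly approximates
    a counted mistake, so the total loss is a differentiable surrogate for
    the Kendall-tau distance the optimal ranking minimizes."""
    rng = np.random.default_rng(seed)
    r = rng.normal(scale=0.01, size=n_agents)
    winners = np.array([w for w, _ in votes])
    losers = np.array([l for _, l in votes])
    for _ in range(steps):
        # Per-vote soft mistake: sigma(x) = 1 / (1 + e^{-x}).
        s = 1.0 / (1.0 + np.exp(-(r[losers] - r[winners]) / tau))
        g = s * (1.0 - s) / tau           # derivative of sigma w.r.t. its argument
        grad = np.zeros(n_agents)
        np.add.at(grad, losers, g)        # raising a loser's rating raises the loss
        np.add.at(grad, winners, -g)      # raising a winner's rating lowers it
        r -= lr * grad
        r -= r.mean()                     # ratings are translation-invariant
    return r

# Toy tournament: agent 0 beats 1 and 2, agent 1 beats 2.
votes = [(0, 1), (0, 2), (1, 2), (0, 1)]
ratings = sco_style_ratings(votes, n_agents=3)
ranking = list(np.argsort(-ratings))      # best agent first
```

On this toy profile the recovered order is 0, 1, 2, matching the unanimous preferences; with noisy or incomplete votes the same loss trades off the inconsistent pairs, which is the regime the paper's experiments study.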

