Determine causes of Elo discrepancy against humans vs. bots on Lichess

Determine which factors cause the observed discrepancy between the Lichess blitz Elo ratings that the 270M-parameter transformer action-value policy achieves when playing exclusively against human opponents and when playing against bots. In particular, rigorously evaluate how much each of the following contributes to the difference: resignation behavior, rating-pool miscalibration between humans and bots, and bots exploiting the model's occasional tactical mistakes differently than humans do.

Background

The authors report a substantial difference between the Lichess blitz Elo achieved by their strongest model when playing exclusively against humans versus when playing primarily against bots. They note that the reasons for this discrepancy are not fully understood and offer several hypotheses that could explain it.

Understanding and quantifying the causes of this discrepancy would clarify how evaluation conditions (human vs. bot opponents), resignation behavior, rating-pool calibration, and opponent-specific exploitation of weaknesses affect measured playing strength, thereby informing more robust evaluation protocols for search-free chess agents.
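
To make the rating-pool calibration concern concrete, the following minimal sketch uses the standard Elo expected-score formula to compute a performance rating from a set of game results. It is an illustration under stated assumptions, not the authors' evaluation code: the function names, opponent ratings, and score are hypothetical, and it ignores the details of how Lichess actually updates ratings. It shows that if one pool's listed ratings are uniformly offset relative to the other's, the same game results yield a measured rating shifted by that offset.

```python
def expected_score(rating, opponent_rating):
    """Standard Elo expected score of a player against one opponent."""
    return 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))


def performance_rating(opponent_ratings, total_score, lo=0.0, hi=4000.0):
    """Rating whose summed expected score equals the observed total score.

    Solved by bisection; assumes 0 < total_score < len(opponent_ratings)
    and that the answer lies between lo and hi.
    """
    for _ in range(100):
        mid = (lo + hi) / 2.0
        predicted = sum(expected_score(mid, r) for r in opponent_ratings)
        if predicted < total_score:
            lo = mid  # candidate rating too low to explain the score
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical numbers for illustration only, not results from the paper.
opponents = [2200.0, 2300.0, 2400.0, 2500.0]
score = 2.5  # e.g. two wins, one draw, one loss

base = performance_rating(opponents, score)

# Same results, but every opponent's listed rating is 100 points higher,
# as if that pool's rating scale were offset relative to the other pool's.
offset = 100.0
shifted = performance_rating([r + offset for r in opponents], score)

print(round(base, 1), round(shifted, 1), round(shifted - base, 1))
```

Because the expected score depends only on rating differences, the gap between the two performance ratings equals the assumed offset. Under this simplified model, pool miscalibration alone could therefore produce an Elo gap without any difference in actual playing strength; separating it from the resignation and tactical-exploitation hypotheses would require additional evidence.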

References

While the precise reasons are not entirely clear, we have three plausible hypotheses: (i) humans tend to resign when our bot has overwhelming win percentage but many bots do not (meaning that the previously described problem gets amplified when playing against bots); (ii) humans on Lichess rarely play against bots, meaning that the two player pools (humans and bots) are hard to compare and Elo ratings between pools may be miscalibrated (Justaz, 2023); and (iii) based on preliminary (but thorough) anecdotal analysis by a chess NM, our models make the occasional tactical mistake which may be penalized qualitatively differently (and more severely) by other bots compared to humans (see some of this analysis in the paper's tactics and playing-style analysis sections).

Amortized Planning with Large-Scale Transformers: A Case Study on Chess (2402.04494 - Ruoss et al., 7 Feb 2024) in Section 6 (Discussion), paragraph "Elo: Humans vs. bots"