Estimate the fraction of apathetic votes on Chatbot Arena

Determine the fraction r of apathetic users on Chatbot Arena who submit random or low-quality preference votes, in order to quantify how prevalent apathetic voting is in the platform’s human preference dataset and to assess its impact on leaderboard reliability.

Background

The paper investigates the reliability of open, community-driven platforms like Chatbot Arena, which collect pairwise preference judgments to rank LLMs. One hypothesized source of poor-quality annotations is apathetic voting, where unincentivized users submit random or low-quality votes. The authors simulate the effect of injecting random labels into the dataset and show that when even 10% of votes are apathetic, model ranks can shift by up to 5 places.
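The kind of simulation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the (model_a, model_b, winner) vote format, and the use of raw win rate as the ranking rule are all assumptions made here for brevity. Chatbot Arena's leaderboard is based on a fitted rating model rather than raw win rates, and ties are ignored in this sketch.

```python
import random

def inject_apathetic_votes(votes, r, seed=0):
    """Replace a fraction r of votes with a uniformly random winner.

    votes: list of (model_a, model_b, winner) tuples (assumed format).
    A replaced vote may happen to match the original label.
    """
    rng = random.Random(seed)
    n_random = round(r * len(votes))
    chosen = set(rng.sample(range(len(votes)), n_random))
    noisy = []
    for i, (a, b, winner) in enumerate(votes):
        if i in chosen:
            winner = rng.choice((a, b))  # apathetic vote: coin flip
        noisy.append((a, b, winner))
    return noisy

def rank_by_win_rate(votes):
    """Rank models by win rate, best first (simplified ranking rule)."""
    wins, games = {}, {}
    for a, b, winner in votes:
        for model in (a, b):
            games[model] = games.get(model, 0) + 1
            wins.setdefault(model, 0)
        wins[winner] += 1
    # Break win-rate ties by model name so the ranking is deterministic.
    return sorted(games, key=lambda m: (-wins[m] / games[m], m))

def max_rank_shift(before, after):
    """Largest change in position of any model between two rankings."""
    position = {model: i for i, model in enumerate(after)}
    return max(abs(i - position[model]) for i, model in enumerate(before))
```

Comparing rank_by_win_rate(votes) against rank_by_win_rate(inject_apathetic_votes(votes, 0.10)) with max_rank_shift gives one rough measure of how sensitive a ranking is to a given fraction r of random votes.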

Critically, the authors note that no existing studies characterize typical user incentives or behaviors on such platforms, and they state explicitly that they therefore cannot estimate the fraction r of apathetic users. Without an estimate of r, it is difficult to calibrate quality-control mechanisms or to judge how far leaderboard rankings might be distorted by apathetic voting.

References

Note that there are no existing studies characterizing the incentives or behaviors of an average user on open platforms like Chatbot Arena. Therefore, we have no way of estimating the fraction r of apathetic.

Challenges in Trustworthy Human Evaluation of Chatbots (2412.04363 - Zhao et al., 2024) in Section 3.1 (Apathetic Voting), Results paragraph after Table 1