Selection effects from baseline filtering and incentives
Determine how two practices affect the distribution of human baseline success times used to measure task length for HCAST, RE-Bench, and SWAA tasks: incentives that lead baseliners to give up early, and the restriction of task duration estimates to successful baseline runs only.
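To make the concern concrete, here is a minimal simulation sketch of how keeping only successful runs can shift a duration estimate. Everything in it is an illustrative assumption rather than something taken from the paper: the lognormal shape of the latent completion times, the independent lognormal "give-up" threshold, the parameter values, and the names `simulate_selection` and `geometric_mean`. A run counts as a success only if the baseliner would have finished before giving up, so long completion times are disproportionately dropped.

```python
import math
import random


def simulate_selection(n_runs=20000, median_minutes=60.0, sigma=1.0,
                       giveup_median_minutes=120.0, giveup_sigma=0.5, seed=0):
    """Simulate baseline runs in which a baseliner may quit before finishing.

    Assumption (hypothetical model): each run has a latent completion time and
    an independent give-up threshold, both lognormal. A run is observed as a
    success only when the latent time does not exceed the threshold.
    """
    rng = random.Random(seed)
    latent, observed = [], []
    for _ in range(n_runs):
        t = rng.lognormvariate(math.log(median_minutes), sigma)
        g = rng.lognormvariate(math.log(giveup_median_minutes), giveup_sigma)
        latent.append(t)
        if t <= g:  # baseliner finishes before giving up
            observed.append(t)
    return latent, observed


def geometric_mean(xs):
    """Geometric mean, a common summary for right-skewed durations."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


latent, observed = simulate_selection()
result = {
    "success_rate": len(observed) / len(latent),
    "latent_geomean": geometric_mean(latent),
    "observed_geomean": geometric_mean(observed),
}
print(result)
```

Under these assumptions the geometric mean over successful runs falls below the geometric mean of the latent completion times, because slow runs are more likely to be abandoned. The size and even the direction of the effect in real baselines depends on how giving up relates to task difficulty and baseliner skill, which is exactly what remains undetermined.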
References
From manual reviews of baseline attempts, we also observe that humans sometimes simply give up even when the task seems within their capabilities, and it is unclear what selection effects are produced on the distribution of success times as a result.
— Measuring AI Ability to Complete Long Tasks
(2503.14499 - Kwa et al., 18 Mar 2025) in Section “Limitations and future work,” subparagraph “More rigorous human baselining”