Selection effects from baseline filtering and incentives

Determine the selection effects on the distribution of human baseline success times that arise from two choices made when measuring task length for HCAST, RE-Bench, and SWAA tasks: incentivizing baseliners to give up early, and computing task duration estimates from successful baseline runs only.

Background

The paper’s methodology estimates task difficulty using the geometric mean of the durations of successful human baseline runs. Baseliners were encouraged to give up on tasks they might not complete quickly, and only successful attempts were used to compute task length. The authors note that this selection may bias task duration estimates, particularly for longer tasks where human success rates are low.

They further discuss in the appendix that conditioning on success likely biases duration estimates downward, which in turn may lead to underestimating model performance and the pace of improvement. However, the specific selection effects produced by this incentive scheme and filtering choice remain uncharacterized.
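To make the direction of the suspected bias concrete, the following is a minimal toy simulation, not the paper's method. It assumes (hypothetically) that the time a baseliner would need on a task is log-normally distributed and that every baseliner gives up at a fixed cutoff; the function name `simulate_baselines` and all parameter values are illustrative choices, not figures from the paper. Under those assumptions, it compares the geometric mean over all latent completion times with the geometric mean over successful runs only.

```python
import math
import random
import statistics


def simulate_baselines(n_runs, median_minutes, sigma, give_up_after, seed=0):
    """Toy model (an assumption, not the paper's setup).

    Each run has a latent completion time drawn from a log-normal
    distribution. A run counts as a success only if the latent time
    is at most `give_up_after`; otherwise the baseliner gives up.
    """
    rng = random.Random(seed)
    mu = math.log(median_minutes)  # the log-normal median is exp(mu)
    latent = [rng.lognormvariate(mu, sigma) for _ in range(n_runs)]
    successes = [t for t in latent if t <= give_up_after]
    return latent, successes


# Illustrative parameters only.
latent, successes = simulate_baselines(
    n_runs=10_000, median_minutes=120.0, sigma=1.0, give_up_after=180.0
)

latent_gm = statistics.geometric_mean(latent)
success_gm = statistics.geometric_mean(successes)
success_rate = len(successes) / len(latent)

print(f"success rate:                 {success_rate:.2f}")
print(f"geometric mean, all runs:     {latent_gm:.1f} min")
print(f"geometric mean, successes:    {success_gm:.1f} min")
```

In this toy model the success-only geometric mean is always below the latent one, because the discarded runs are exactly the slowest ones. Real baseliners do not follow a single hard cutoff, and the quoted observation that people sometimes give up on tasks within their capabilities means the actual selection mechanism may differ, which is why the effect remains an open question.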

References

From manual reviews of baseline attempts, we also observe that humans sometimes simply give up even when the task seems within their capabilities, and it is unclear what selection effects are produced on the distribution of success times as a result.

— Measuring AI Ability to Complete Long Tasks (2503.14499 - Kwa et al., 18 Mar 2025) in Section “Limitations and future work,” subparagraph “More rigorous human baselining”
