Unknown variance of human performance on Vending-Bench
Measure and characterize the variance of human performance on Vending-Bench through multiple human runs to enable rigorous comparison with model variability.
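As a minimal sketch of why a single human run blocks this comparison: the unbiased sample variance is only defined for two or more runs. The helper names below (sample_variance, variance_ratio) and all scores are hypothetical placeholders for illustration, not values or methods from the paper.

```python
import statistics
from typing import Optional, Sequence


def sample_variance(scores: Sequence[float]) -> Optional[float]:
    """Unbiased sample variance, or None when fewer than two runs exist."""
    if len(scores) < 2:
        # A single run gives no information about spread.
        return None
    return statistics.variance(scores)


def variance_ratio(human: Sequence[float], model: Sequence[float]) -> Optional[float]:
    """Ratio of human to model sample variance, or None if it is undefined."""
    human_var = sample_variance(human)
    model_var = sample_variance(model)
    if human_var is None or model_var is None or model_var == 0:
        return None
    return human_var / model_var


# Hypothetical placeholder scores, NOT results from the benchmark.
single_human_run = [500.0]
several_human_runs = [100.0, 200.0, 300.0]
model_runs = [150.0, 250.0]

print(variance_ratio(single_human_run, model_runs))    # undefined with one human run
print(variance_ratio(several_human_runs, model_runs))  # defined once runs are repeated
```

With only a handful of runs the estimate remains noisy, so any such ratio would need a confidence interval or a formal test before it supports conclusions.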
References
We only have a single sample for the human baseline and therefore cannot compare variances.
— Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents
(Backlund et al., 20 Feb 2025), in "Comparison to human baseline" (Subsection 3.6)