Unknown variance of human performance on Vending-Bench

Measure and characterize the variance of human performance on Vending-Bench through multiple human runs to enable rigorous comparison with model variability.
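As a minimal sketch of why multiple human runs are needed, the snippet below computes the unbiased sample variance, which is undefined for a single observation. The function name `sample_variance` and all scores are hypothetical placeholders chosen for illustration; they are not figures from the paper, and the benchmark's actual scoring metric is not assumed here.

```python
import statistics


def sample_variance(values):
    """Return the unbiased sample variance (n - 1 denominator).

    Returns None when fewer than two runs are available, because the
    variance cannot be estimated from a single observation.
    """
    if len(values) < 2:
        return None
    return statistics.variance(values)


# Hypothetical placeholder scores, NOT results from Vending-Bench.
human_runs = [100.0]                # a single human run
model_runs = [90.0, 110.0, 130.0]   # several model runs

human_var = sample_variance(human_runs)
model_var = sample_variance(model_runs)

if human_var is None:
    print("Human variance is undefined: only one run is available.")
else:
    print(f"Human variance: {human_var}")
print(f"Model variance: {model_var}")
```

With at least two human runs, the same function yields an estimate that could be set against the model's variance, although a formal comparison would also need a suitable statistical test and enough runs to support it.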

References

We only have a single sample for the human baseline and therefore cannot compare variances.

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents (Backlund et al., 20 Feb 2025), "Comparison to human baseline" (Subsection 3.6)