Turning video games into effective LLM benchmarks

Determine whether existing video game environments can be transformed into effective, reliable benchmarks for evaluating large language models, such that benchmark performance is discriminative and robust despite brittle vision perception, prompt sensitivity, and potential training-data contamination.

Background

The paper argues that directly evaluating LLMs by placing them into popular video games yields poor and noisy performance due to brittle vision perception, prompt sensitivity, and potential data contamination. As a result, scores often cluster near random baselines, making it difficult to assess model abilities reliably.

This motivates a broader inquiry into whether and how game environments themselves can be adapted, scaffolded, or standardized to serve as rigorous and effective benchmarks for LLMs. The authors subsequently propose lmgame-Bench as one such approach, but the general question of turning games into dependable evaluations is posed explicitly as open.
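To make the notion of scaffolding concrete, below is a minimal, hypothetical sketch of wrapping a game in a benchmark harness: game state is serialized to text (sidestepping brittle vision perception), a single fixed prompt template is used (reducing prompt sensitivity), and episodes are scored deterministically. The GuessNumber toy game, the PROMPT template, and the query_llm placeholder are illustrative assumptions, not components of lmgame-Bench.

```python
# Hypothetical sketch of a game-to-benchmark harness (not the paper's design).
import random
from dataclasses import dataclass, field


@dataclass
class GuessNumber:
    """Toy stand-in for a game environment with observe/step semantics."""
    target: int = field(default_factory=lambda: random.randint(1, 100))
    turns: int = 0

    def observe(self) -> str:
        # Text observation instead of pixels: avoids brittle vision perception.
        return f"Guess an integer between 1 and 100. Turn {self.turns + 1}."

    def step(self, action: str) -> tuple[str, float, bool]:
        # Returns (feedback text, reward, done); episode capped at 10 turns.
        self.turns += 1
        try:
            guess = int(action.strip())
        except ValueError:
            return "Invalid move.", 0.0, self.turns >= 10
        if guess == self.target:
            return "Correct!", 1.0, True
        hint = "higher" if guess < self.target else "lower"
        return f"Try {hint}.", 0.0, self.turns >= 10


# Single fixed prompt template, shared across models, to reduce prompt sensitivity.
PROMPT = "You are playing a game. State:\n{state}\nReply with only your next move."


def query_llm(prompt: str) -> str:
    # Placeholder: substitute a real chat-completion client here.
    return str(random.randint(1, 100))


def run_episode(env: GuessNumber) -> float:
    """Run one episode and return its deterministic score."""
    score, done = 0.0, False
    feedback = env.observe()
    while not done:
        move = query_llm(PROMPT.format(state=feedback))
        feedback, reward, done = env.step(move)
        score += reward
    return score


if __name__ == "__main__":
    scores = [run_episode(GuessNumber()) for _ in range(20)]
    print(f"Mean score over 20 episodes: {sum(scores) / len(scores):.2f}")
```

A harness of this shape also makes contamination checks easier, since the serialized states and prompts are fixed artifacts that can be audited or perturbed independently of the underlying game.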

References

"This leaves an open question: can we turn games into more effective benchmarks for evaluating LLMs?"

lmgame-Bench: How Good are LLMs at Playing Games? (Hu et al., 21 May 2025, arXiv:2505.15146), Section 1 (Introduction)