Turning video games into effective LLM benchmarks
Determine whether existing video game environments can be turned into effective, reliable benchmarks for evaluating large language models, such that measured performance is both discriminative and robust despite brittle visual perception, prompt sensitivity, and potential training-data contamination.
References
This leaves an open question: can we turn games into more effective benchmarks for evaluating LLMs?
— lmgame-Bench: How Good are LLMs at Playing Games?
(Hu et al., 21 May 2025, arXiv:2505.15146), Section 1 (Introduction)