Effect of training‑data contamination on LLM gaming performance

Ascertain the extent to which pretraining-data contamination, specifically exposure to video-game assets and solutions, affects large language model performance on video-game evaluation tasks, and determine whether high scores reflect memorization of contaminated content rather than genuine perception, reasoning, and planning.

Background

Because lmgame-Bench reuses well-known games, many of their visual and textual assets may already appear in model pretraining corpora. If a model has seen these assets or scripts, its evaluation scores could reflect recall rather than skill, undermining the benchmark's validity.

The authors acknowledge this uncertainty explicitly and later probe contamination in both the vision setting (Super Mario Bros.) and the text setting (Ace Attorney), proposing mitigation strategies; even so, the broader problem of quantifying contamination and attributing performance shifts to it remains open.
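One way to make "quantifying performance shifts due to contamination" concrete is to compare an agent's scores on original game content against semantically equivalent but perturbed variants (e.g., re-skinned sprites or paraphrased dialogue); the score gap is a rough memorization signal. The sketch below is illustrative only and is not part of lmgame-Bench: the `run_agent` callable, the level identifiers, and the perturbation scheme are all hypothetical placeholders.

```python
# Hypothetical contamination probe: compare mean scores on original levels
# against perturbed-but-equivalent variants. A large positive gap suggests
# the original score leans on memorized assets rather than genuine play skill.
# `run_agent` and the level names are placeholders, not lmgame-Bench APIs.

from statistics import mean
from typing import Callable, Sequence


def contamination_gap(
    run_agent: Callable[[str], float],   # returns a normalized score in [0, 1]
    original_levels: Sequence[str],
    perturbed_levels: Sequence[str],
) -> float:
    """Return mean(original) - mean(perturbed); positive values hint at memorization."""
    original = mean(run_agent(level) for level in original_levels)
    perturbed = mean(run_agent(level) for level in perturbed_levels)
    return original - perturbed


if __name__ == "__main__":
    # Toy stand-in agent: pretend the model scores higher on a familiar level.
    fake_scores = {"mario_1_1": 0.9, "mario_1_1_reskinned": 0.6}
    gap = contamination_gap(
        run_agent=lambda level: fake_scores[level],
        original_levels=["mario_1_1"],
        perturbed_levels=["mario_1_1_reskinned"],
    )
    print(f"Score gap attributable to possible contamination: {gap:.2f}")
```

A gap near zero does not prove the absence of contamination, and a positive gap can also reflect genuine difficulty differences introduced by the perturbation, so any such probe needs perturbations verified to preserve task difficulty.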

References

"It's also unclear the effect of data contamination on gaming performance since the models might have seen numerous gaming assets during pre-training."

lmgame-Bench: How Good are LLMs at Playing Games? (Hu et al., arXiv:2505.15146, 21 May 2025), Section 1 (Introduction)