Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents
The paper, "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents," offers a comprehensive overview of the Arcade Learning Environment (ALE), an evaluation platform designed for testing general artificially intelligent agents across a diverse range of Atari 2600 games. This platform encourages the development of agents that need to operate effectively in a variety of environments without relying on game-specific information.
Key Contributions
The authors address several methodological discrepancies that have emerged as the ALE has become a popular benchmark in the AI community. They argue for more uniform evaluation methods that enable accurate comparisons of learning algorithms, advocating consistent protocol choices such as episode termination conditions, how hyperparameters are set, and how the amount of training data is measured and reported. They also introduce an updated version of the ALE that injects stochasticity through what they term "sticky actions."
Simultaneously, the paper reflects on the progress made in the field and outlines open problems that persist, such as:
- Representation Learning: While DQN and similar algorithms demonstrate that it is possible to learn representations along with control policies, these approaches are sample-inefficient and performance gains are not uniformly significant across all games.
- Exploration Challenges: Some games, like "Montezuma's Revenge" and "Private Eye," remain challenging due to sparse reward structures that demand enhanced exploration strategies.
- Model Learning and Planning: Current approaches to model learning face difficulties with accuracy over extended time frames, as demonstrated by compounding prediction errors in learned models.
- Transfer Learning: Despite attempts at applying transfer learning in the ALE, achieving significant gains remains difficult because dynamics differ substantially from game to game.
- Off-Policy Learning: This area remains comparatively underexplored, with open challenges around the stability and convergence of off-policy algorithms when combined with function approximation.
Evaluation and Methodology
The introduction of sticky actions injects stochasticity into the environment: at each step, with some probability, the environment repeats the agent's previous action instead of the newly selected one. This mitigates the risk of agents exploiting the Atari 2600's deterministic dynamics by memorizing action sequences rather than learning a genuine policy, and evaluation therefore measures robustness to these perturbations, encouraging adaptable, closed-loop behavior. The paper also recommends reporting performance at several points during training (such as 10, 50, 100, and 200 million frames) to provide insight into learning dynamics, including stability and sample efficiency.
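As a concrete illustration, the following is a minimal sketch of the sticky-actions idea written as an environment wrapper. It is not the ALE's own implementation (the updated ALE exposes the behavior through a configuration option rather than a wrapper), and the class name, the Gym-style reset/step interface, and the default stickiness value here should be read as assumptions.

```python
import random

class StickyActionEnv:
    """Sketch of sticky actions: with probability `stickiness`, the environment
    repeats the agent's previous action instead of the one just selected."""

    def __init__(self, env, stickiness=0.25):  # 0.25 is the value suggested in the paper
        self.env = env                         # assumed to expose reset() and step(action)
        self.stickiness = stickiness
        self.last_action = 0                   # start from a no-op

    def reset(self):
        self.last_action = 0
        return self.env.reset()

    def step(self, action):
        # Occasionally ignore the new action and repeat the previous one,
        # which breaks open-loop replay of memorized action sequences.
        if random.random() < self.stickiness:
            action = self.last_action
        self.last_action = action
        return self.env.step(action)
```

Because the repeated action is the previously executed one, fixed action sequences can no longer be replayed verbatim; an agent must condition on its observations to recover from the perturbations.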
Benchmark results presented in the paper use representative algorithms, DQN and Sarsa(λ) with linear Blob-PROST features, providing baselines against which future work can be compared under the revised protocol. These results also highlight how learning performance varies across games and further validate the proposed methodological refinements.
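To make the linear baseline concrete, here is a minimal sketch of Sarsa(λ) with linear function approximation and accumulating eligibility traces, the kind of learning rule used with Blob-PROST features. The `featurize` callable, the environment's `(observation, reward, done)` step signature, and the hyperparameter values are illustrative assumptions, and the Blob-PROST feature construction itself is not shown.

```python
import numpy as np

def linear_sarsa_lambda(env, featurize, num_actions, episodes=100,
                        alpha=0.01, gamma=0.99, lam=0.9, epsilon=0.01):
    """Sketch of Sarsa(lambda) with linear function approximation."""
    # `featurize` is assumed to map an observation to a 1-D numpy feature vector.
    num_features = featurize(env.reset()).shape[0]
    w = np.zeros((num_actions, num_features))   # one weight vector per action

    def q(phi):
        return w @ phi                           # action values for feature vector phi

    def epsilon_greedy(phi):
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(q(phi)))

    for _ in range(episodes):
        z = np.zeros_like(w)                     # eligibility traces
        phi = featurize(env.reset())
        a = epsilon_greedy(phi)
        done = False
        while not done:
            obs, r, done = env.step(a)           # assumed step() return signature
            phi_next = featurize(obs)
            a_next = epsilon_greedy(phi_next)
            td_error = r - q(phi)[a]
            if not done:
                td_error += gamma * q(phi_next)[a_next]
            z *= gamma * lam                     # decay all traces
            z[a] += phi                          # accumulate trace for the taken action
            w += alpha * td_error * z            # TD(lambda) update
            phi, a = phi_next, a_next
    return w
```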
Implications and Future Directions
From a practical standpoint, the methodological recommendations and the enhanced ALE aim to raise the standard of agent evaluation and ultimately lead to more reliable comparisons. The authors encourage the community to adopt these benchmarks and methodologies to ensure consistency and to facilitate meaningful progress in AI research. The introduction of game "flavors" (alternative game modes and difficulty settings) opens new avenues for research in transfer learning and related areas.
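As a sketch of how the flavor mechanism might be used, the snippet below enumerates the mode/difficulty combinations of a single game and activates each in turn. The method names follow the ale-py `ALEInterface` bindings but should be treated as assumptions here, and the ROM path is illustrative.

```python
from ale_py import ALEInterface

ale = ALEInterface()
ale.setFloat("repeat_action_probability", 0.25)   # sticky actions, as recommended
ale.loadROM("freeway.bin")                        # illustrative ROM path

# Each (mode, difficulty) pair is a flavor: a related but distinct variant of the
# game, which is what makes flavors a natural testbed for transfer learning.
for mode in ale.getAvailableModes():
    for difficulty in ale.getAvailableDifficulties():
        ale.setMode(mode)
        ale.setDifficulty(difficulty)
        ale.reset_game()
        # ... train or evaluate an agent on this flavor ...
```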
In conclusion, while substantial advances have been made, the outlined open problems indicate significant potential for continued research. As new approaches are developed and tested against these updated benchmarks, they should yield valuable insights into building more robust, flexible, and general AI systems.