DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards (2304.10770v2)

Published 21 Apr 2023 in cs.LG, cs.AI, cs.IT, and math.IT

Abstract: Exploration is a fundamental aspect of reinforcement learning (RL), and its effectiveness is a deciding factor in the performance of RL algorithms, especially when facing sparse extrinsic rewards. Recent studies have shown the effectiveness of encouraging exploration with intrinsic rewards estimated from novelties in observations. However, there is a gap between the novelty of an observation and an exploration, as both the stochasticity in the environment and the agent's behavior may affect the observation. To evaluate exploratory behaviors accurately, we propose DEIR, a novel method in which we theoretically derive an intrinsic reward with a conditional mutual information term that principally scales with the novelty contributed by agent explorations, and then implement the reward with a discriminative forward model. Extensive experiments on both standard and advanced exploration tasks in MiniGrid show that DEIR quickly learns a better policy than the baselines. Our evaluations on ProcGen demonstrate both the generalization capability and the general applicability of our intrinsic reward. Our source code is available at https://github.com/swan-utokyo/deir.

Authors (4)
  1. Shanchuan Wan (1 paper)
  2. Yujin Tang (31 papers)
  3. Yingtao Tian (32 papers)
  4. Tomoyuki Kaneko (5 papers)
Citations (3)

Summary

  • The paper introduces DEIR, a method that leverages conditional mutual information to distinguish agent-induced novelty from stochastic environmental factors.
  • It uses a discriminative forward model trained with contrastive learning and optimizes the policy with PPO in both MiniGrid and ProcGen environments.
  • Experimental results demonstrate DEIR's superior performance over existing approaches in sparse-reward settings with noisy and high-dimensional observations.

Overview of DEIR: Efficient and Robust Exploration through Discriminative-Model-Based Episodic Intrinsic Rewards

The paper presents DEIR (Discriminative-Model-Based Episodic Intrinsic Reward), a novel approach to enhancing exploration within reinforcement learning (RL) frameworks, with a focus on environments characterized by sparse extrinsic rewards. DEIR generates intrinsic rewards in a way that is theoretically grounded in conditional mutual information, aiming to distinguish novelty introduced by the environment's stochasticity from novelty produced by the agent's own exploratory actions. The method is evaluated against prominent existing intrinsic reward mechanisms, including ICM, RND, NGU, and NovelD.

Methodology

DEIR centers on intrinsic reward generation that scales observation novelty by a conditional mutual information term. This term ties observed novelty to the agent's actions, filtering out novelty that arises from environmental stochasticity. The formulation leverages a discriminative forward model inspired by contrastive learning, giving the model the ability to distinguish genuine trajectories from fabricated ones.
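As a rough illustration of this discriminative component, the sketch below trains a binary classifier to separate genuine transitions from fakes whose next observation is swapped for a negative sample (e.g., drawn from a queue of recent observations). The plain MLP encoder, tensor shapes, and unweighted loss terms are simplifying assumptions for brevity, not the authors' implementation, which combines CNN, RNN, and MLP components to handle partial observability.

```python
# Illustrative sketch of a discriminative forward model trained contrastively.
# All names, shapes, and the plain-MLP architecture are assumptions for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Discriminator(nn.Module):
    """Scores how likely (obs, action, next_obs) is a genuine transition."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim + n_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, action, next_obs):
        # One-hot encode the discrete action and concatenate with both observations.
        a = F.one_hot(action, self.n_actions).float()
        return self.net(torch.cat([obs, a, next_obs], dim=-1)).squeeze(-1)


def discriminator_loss(disc, obs, action, next_obs, negative_next_obs):
    """Contrastive-style objective: genuine next observations are labeled 1,
    negatives drawn from a queue of recent observations are labeled 0."""
    pos_logits = disc(obs, action, next_obs)
    neg_logits = disc(obs, action, negative_next_obs)
    pos_loss = F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits))
    neg_loss = F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits))
    return pos_loss + neg_loss
```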

  1. Intrinsic Reward Design: The intrinsic reward is derived by maximizing the distance between current and past observations, scaled by the conditional mutual information between this distance and the agent's actions, so that the resulting distance metric captures genuinely novel experiences attributable to the agent's own behavior.
  2. Discriminative Model: A key component of DEIR, the discriminative model estimates whether a given trajectory segment is genuine or fake, sharpening the agent's ability to identify and pursue novel exploration paths. This is achieved through a network architecture combining CNNs, RNNs, and MLPs, tailored to environments with partial observability. A queue of recent observations is maintained to supply diverse and challenging negative samples for model training.
  3. Training Regimen: The discriminative model and the agent's policy are trained concurrently using PPO (see the sketch after this list). The experimental setup involves extensive evaluations across procedurally generated environments such as MiniGrid and ProcGen, in both standard and more challenging "advanced" variations.
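For concreteness, here is a minimal sketch of how an episodic, distance-based intrinsic reward of this general kind can be computed from rollout data. The helper name, the nearest-neighbor distance rule, the scaling coefficient `beta`, and the shared encoder `embed_fn` are illustrative assumptions, not the paper's exact derivation or implementation.

```python
# Illustrative episodic novelty bonus: reward each step by the distance of its
# observation embedding to the closest embedding seen earlier in the same
# episode, so revisited states earn little bonus. This is a simplification of
# the distance-based reward idea, not the authors' exact formula.
import torch


@torch.no_grad()
def episodic_novelty_bonus(embed_fn, episode_obs: torch.Tensor) -> torch.Tensor:
    """episode_obs: (T, obs_dim) observations from one episode, in order.
    embed_fn: an encoder assumed to be shared with the discriminative model.
    Returns a (T,) tensor of intrinsic rewards (the first step gets 0)."""
    z = embed_fn(episode_obs)                     # (T, d) embeddings
    bonuses = torch.zeros(z.shape[0], device=z.device)
    for t in range(1, z.shape[0]):
        dists = torch.norm(z[t] - z[:t], dim=-1)  # distances to all earlier steps
        bonuses[t] = dists.min()                  # novelty = distance to nearest past obs
    return bonuses


# During training (sketch): r_total = r_extrinsic + beta * episodic_novelty_bonus(...),
# PPO optimizes the policy on r_total, and the discriminator is updated on the same
# rollouts, with negative samples drawn from a queue of recent observations.
```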

Experimental Evaluation

DEIR’s performance is evaluated through extensive experiments on standard and advanced MiniGrid scenarios spanning various levels of complexity and environmental challenge. Additionally, DEIR's generalization capability is tested in ProcGen environments, which present higher-dimensional observation spaces and procedurally generated challenges.

  • In MiniGrid, DEIR consistently outperforms existing methods in tasks with reduced observation spaces, noisy input data, and invisible obstacles. The discriminative model's ability to scale novelty with action-related relevance proves advantageous in isolating meaningful exploration rewards.
  • In ProcGen, DEIR demonstrates its versatility by performing on par with or better than existing methods, confirming both the validity and the adaptability of the designed intrinsic reward mechanism in more complex observation spaces.

Implications and Future Work

DEIR’s contribution lies in its effective decoupling of stochastic environmental features from legitimate exploratory agent actions, thereby fostering more efficient policy learning. Practically, this can improve exploration in RL applications with sparse rewards, particularly where manual dense reward shaping is impractical.

Theoretically, DEIR deepens the understanding of exploration in RL by reinforcing the importance of conditional mutual information in intrinsic reward design. The experiments highlight the potential of combining discriminative models with intrinsic motivation strategies.

Future work could extend DEIR to continuous action spaces and cooperative multi-agent environments. Integrating DEIR with lifelong learning frameworks is another promising avenue for investigating whether decoupling stochastic and agent-induced novelty enhances sustained learning and policy optimization in dynamic environments. Additionally, adapting the discriminative model to retain efficiency in real-time decision-making and in environments with temporal constraints would be a valuable follow-up.
