Curiosity-driven Exploration by Self-supervised Prediction
This paper addresses reinforcement learning (RL) in environments where external rewards are sparse or absent altogether. The authors propose a novel approach that uses curiosity as an intrinsic reward signal to guide exploratory behavior. Traditionally, RL relies on external rewards to drive policy updates, but in many real-world scenarios these rewards are too sparse to learn from directly, so an alternative mechanism is needed for acquiring effective behaviors. The authors tackle this problem by defining curiosity in terms of prediction error within a self-supervised framework.
Methodology
The proposed method formulates curiosity as the agent's error in predicting the consequences of its own actions. Crucially, this prediction is made not in raw pixel space but in a feature space learned by a self-supervised inverse dynamics model. The feature space captures the aspects of the environment that the agent's actions can influence (or that can influence the agent) while ignoring the rest. By avoiding direct pixel prediction, the approach sidesteps key challenges of high-dimensional continuous state spaces and remains robust to environmental complexity and stochasticity.
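Concretely, if φ(s) denotes the learned feature encoding of a state and φ̂(s_{t+1}) the agent's prediction of the next state's features, the intrinsic reward at time t is, in the paper's notation, a scaled prediction error:

$$ r^{i}_{t} \;=\; \frac{\eta}{2}\,\big\lVert \hat{\phi}(s_{t+1}) - \phi(s_{t+1}) \big\rVert_2^2 $$

where η > 0 scales the curiosity reward relative to any extrinsic reward.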
The primary components of the methodology include:
- Learned Feature Space: Using self-supervised learning, a neural network is trained to predict the agent's action given the current and next states (an inverse dynamics model). The feature embedding this network learns abstracts away aspects of the environment that are irrelevant to the agent.
- Forward Dynamics Model: Another model is trained to predict the next state in the learned feature space using the current state and the chosen action. The intrinsic reward signal is derived from the prediction error of this forward model, effectively guiding the agent's curiosity-driven exploration.
- Policy Optimization: An asynchronous advantage actor-critic (A3C) reinforcement learning algorithm is employed to optimize the policy, using the sum of the intrinsic curiosity reward and any infrequent extrinsic reward. A code sketch of these components follows this list.
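The sketch below illustrates how the three components fit together, written in PyTorch. It is a minimal illustration under stated assumptions, not the paper's exact architecture: the fully connected encoder, layer sizes, and the weights `beta` and `eta` are placeholders (the paper uses convolutional encoders over pixels).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ICM(nn.Module):
    """Minimal intrinsic curiosity module: feature encoder + inverse and forward models."""

    def __init__(self, obs_dim: int, n_actions: int, feat_dim: int = 64,
                 beta: float = 0.2, eta: float = 0.01):
        super().__init__()
        self.n_actions, self.beta, self.eta = n_actions, beta, eta
        # phi(s): learned feature encoder shared by both models.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        # Inverse model: predicts a_t from (phi(s_t), phi(s_{t+1})).
        self.inverse = nn.Sequential(
            nn.Linear(2 * feat_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))
        # Forward model: predicts phi(s_{t+1}) from (phi(s_t), a_t).
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + n_actions, 128), nn.ReLU(), nn.Linear(128, feat_dim))

    def forward(self, obs, next_obs, action):
        phi, phi_next = self.encoder(obs), self.encoder(next_obs)
        a_onehot = F.one_hot(action, self.n_actions).float()

        # Inverse-dynamics loss: predicting the action taken shapes the feature space.
        logits = self.inverse(torch.cat([phi, phi_next], dim=-1))
        inverse_loss = F.cross_entropy(logits, action)

        # Forward-model error in feature space is the curiosity signal.
        # phi_next is detached here so the forward loss cannot collapse the
        # features -- a common implementation choice, assumed for this sketch.
        phi_next_pred = self.forward_model(torch.cat([phi, a_onehot], dim=-1))
        forward_err = 0.5 * (phi_next_pred - phi_next.detach()).pow(2).sum(dim=-1)

        intrinsic_reward = self.eta * forward_err.detach()  # reward handed to the RL agent
        icm_loss = (1.0 - self.beta) * inverse_loss + self.beta * forward_err.mean()
        return intrinsic_reward, icm_loss


# Usage sketch: intrinsic and (possibly sparse) extrinsic rewards are summed
# before being passed to the policy-gradient learner (A3C in the paper).
icm = ICM(obs_dim=16, n_actions=4)
obs, next_obs = torch.randn(8, 16), torch.randn(8, 16)
action = torch.randint(0, 4, (8,))
r_int, icm_loss = icm(obs, next_obs, action)
extrinsic = torch.zeros(8)  # zero on most steps in the sparse-reward setting
total_reward = r_int + extrinsic
```

Here `beta` trades off the inverse and forward losses when training the curiosity module, while `eta` scales the intrinsic reward; both values above are illustrative defaults rather than the paper's tuned settings.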
Experimental Setup and Results
The authors evaluated their approach in two distinct environments: VizDoom, a 3D navigation task, and Super Mario Bros., a side-scrolling game. They considered three settings:
- Sparse Extrinsic Reward: In VizDoom, the agent must navigate a complex 3D environment to reach a distant goal that provides the only reward. The curiosity-driven agent outperformed the baseline A3C, navigating efficiently to the goal even with very sparse rewards and demonstrating stronger exploration.
- No Extrinsic Reward: The method was also evaluated with no environmental reward at all, relying solely on the intrinsic curiosity signal. The agent still learned to explore effectively, covering significant portions of the environment and discovering useful behaviors, such as avoiding obstacles in Mario, without any explicit rewards.
- Generalization to Novel Scenarios: The approach was tested on its ability to transfer learned exploratory behavior to new, unseen environments. In VizDoom, an agent pre-trained purely on curiosity-driven exploration performed better when fine-tuned on novel maps with different textures. Similarly, in Mario, an agent that transferred knowledge from Level-1 to subsequent levels outperformed one trained from scratch.
Implications and Future Directions
The implications of this work are manifold:
- Scalability: The proposed method scales effectively to environments where traditional pixel-based prediction fails, making it applicable in real-world RL scenarios with complex, high-dimensional observations.
- Robustness: By focusing on regions of the environment that can affect the agent, the method is robust to nuisance factors, ensuring stable learning trajectories even in the presence of environmental stochasticity.
- Unsupervised Skill Discovery: The ability to learn useful exploratory behaviors without explicit rewards suggests potential for unsupervised skill discovery, paving the way for autonomous agents capable of pre-training in generic environments before task-specific fine-tuning.
Future research could explore the integration of this curiosity module with hierarchical RL frameworks, enhancing performance on more complex tasks by leveraging pre-learned behaviors as building blocks. Another promising direction is the application to transfer learning scenarios, where the agent can generalize its learned exploration strategies to entirely new domains, potentially reducing the need for extensive environmental sampling in each new task. Long-term, this line of work moves towards achieving autonomous agents that can independently explore and adapt to a wide array of real-world applications.