Variational Curriculum Reinforcement Learning for Unsupervised Discovery of Skills (2310.19424v1)
Abstract: Mutual information-based reinforcement learning (RL) has been proposed as a promising framework for retrieving complex skills autonomously without a task-oriented reward function through mutual information (MI) maximization or variational empowerment. However, learning complex skills is still challenging, due to the fact that the order of training skills can largely affect sample efficiency. Inspired by this, we recast variational empowerment as curriculum learning in goal-conditioned RL with an intrinsic reward function, which we name Variational Curriculum RL (VCRL). From this perspective, we propose a novel approach to unsupervised skill discovery based on information theory, called Value Uncertainty Variational Curriculum (VUVC). We prove that, under regularity conditions, VUVC accelerates the increase of entropy in the visited states compared to the uniform curriculum. We validate the effectiveness of our approach on complex navigation and robotic manipulation tasks in terms of sample efficiency and state coverage speed. We also demonstrate that the skills discovered by our method successfully complete a real-world robot navigation task in a zero-shot setup and that incorporating these skills with a global planner further increases the performance.
Summary
- The paper introduces the Variational Curriculum Reinforcement Learning (VCRL) framework and the Value Uncertainty Variational Curriculum (VUVC) algorithm for unsupervised skill discovery in RL, improving sample efficiency and state space coverage.
- VCRL unifies mutual information-based skill discovery methods by framing them as curriculum learning within goal-conditioned RL, while VUVC generates this curriculum by prioritizing goals based on value function uncertainty and visited state density.
- Experimental validation demonstrates VUVC's superior sample efficiency and exploration across various simulated tasks and shows successful zero-shot transfer of learned skills to a real-world robot.
The paper "Variational Curriculum Reinforcement Learning for Unsupervised Discovery of Skills" (2310.19424) introduces a framework and a specific algorithm for unsupervised skill discovery in reinforcement learning, aiming to improve sample efficiency and state space coverage by leveraging principles from information theory and curriculum learning.
The VCRL Framework: Unifying MI-based Skill Discovery and Curriculum GCRL
The work proposes the Variational Curriculum Reinforcement Learning (VCRL) framework, which recasts unsupervised skill discovery methods based on mutual information (MI) maximization as a form of curriculum learning within goal-conditioned reinforcement learning (GCRL). The starting point is the maximization of mutual information between a latent variable Z (representing skills or goals, denoted as g) and the resulting state S: I(S;Z). In practice, a variational lower bound on this MI objective is optimized:
$$I(S;Z) \;\geq\; \mathbb{E}_{p(g,s)}\big[\log q_\lambda(g \mid s)\big] \;-\; \mathbb{E}_{p(g)}\big[\log p(g)\big] \;=\; \mathcal{F}(\theta, \lambda)$$
where p(g,s)=p(g)ρπ(s∣g) with ρπ(s∣g) being the state distribution induced by the policy πθ conditioned on goal g, p(g) is the prior distribution over goals, and qλ(g∣s) is a variational approximation (discriminator) to the posterior p(g∣s).
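For completeness, this bound follows from the standard variational (Barber-Agakov) argument: inserting the variational posterior qλ(g∣s) leaves a non-negative KL term that can be dropped. A brief derivation under the definitions above:

```latex
\begin{aligned}
I(S;Z) &= \mathbb{E}_{p(g,s)}\big[\log p(g \mid s) - \log p(g)\big] \\
       &= \mathbb{E}_{p(g,s)}\big[\log q_\lambda(g \mid s)\big]
          + \mathbb{E}_{p(s)}\big[ D_{\mathrm{KL}}\!\big( p(g \mid s) \,\|\, q_\lambda(g \mid s) \big) \big]
          - \mathbb{E}_{p(g)}\big[\log p(g)\big] \\
       &\geq \mathbb{E}_{p(g,s)}\big[\log q_\lambda(g \mid s)\big]
          - \mathbb{E}_{p(g)}\big[\log p(g)\big]
        \;=\; \mathcal{F}(\theta, \lambda).
\end{aligned}
```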
VCRL establishes a connection by noting that maximizing the policy-dependent term Ep(g,s)[logqλ(g∣s)] with respect to the policy parameters θ is equivalent to a GCRL objective where logqλ(g∣s) serves as an intrinsic reward function rg(s,a)=logqλ(g∣s′). The choice of the discriminator qλ(g∣s) determines the shape of this intrinsic reward (e.g., Gaussian assumptions can lead to L2 distance rewards). Crucially, VCRL interprets the goal prior distribution p(g) not as a fixed target, but as the curriculum distribution. By carefully selecting or adapting p(g) over time, the learning process can be guided for improved efficiency.
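As a concrete illustration, here is a minimal sketch of such an intrinsic reward for the Gaussian-discriminator case mentioned above (and detailed in the implementation section below). The function name and arguments are illustrative, not the authors' code, and the additive normalization constant of the Gaussian log-likelihood is dropped.

```python
import numpy as np

def intrinsic_reward(next_state, goal, phi=lambda s: s, sigma=1.0):
    """Intrinsic reward r_g(s, a) = log q_lambda(g | s').

    With a fixed Gaussian discriminator q(g | s') = N(g | phi(s'), sigma^2 I),
    the log-likelihood reduces (up to an additive constant) to a scaled
    negative squared L2 distance between the achieved and commanded goal.
    phi maps states to goal space (identity here; a learned encoder for vision).
    """
    achieved = phi(np.asarray(next_state, dtype=np.float64))
    target = np.asarray(goal, dtype=np.float64)
    return -np.sum((achieved - target) ** 2) / (2.0 * sigma ** 2)
```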
This framework allows for classifying existing MI-based methods based on their choices for qλ(g∣s) and p(g). For instance, methods like RIG sample goals g from the density of visited states $p_t^{\mathrm{visited}}(g)$, while Skew-Fit uses $p(g) \propto p_t^{\mathrm{visited}}(g)^{\alpha}$ with α<0 to prioritize less-visited states. EDL uses a pre-learned exploration distribution. VCRL provides a unified perspective for analyzing and designing such curriculum strategies.
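To make the shared structure concrete, the sketch below shows how two of these instances would weight candidate goals under the VCRL view. It is a hedged illustration: `visited_density_fn` is an assumed interface to a density model fit on the replay buffer, not code from the paper.

```python
import numpy as np

def goal_sampling_weights(candidate_goals, visited_density_fn,
                          method="skewfit", alpha=-1.0):
    """Normalized curriculum p(g) over candidate goals for two VCRL instances."""
    density = np.array([visited_density_fn(g) for g in candidate_goals])
    if method == "rig":          # RIG: sample in proportion to the visited-state density
        weights = density
    elif method == "skewfit":    # Skew-Fit: skew toward rarely visited goals (alpha < 0)
        weights = density ** alpha
    else:
        raise ValueError(f"unknown method: {method}")
    return weights / weights.sum()
```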
VUVC: Curriculum Generation via Value Uncertainty and State Density
Within the VCRL framework, the paper introduces a novel curriculum strategy called Value Uncertainty Variational Curriculum (VUVC). The core idea is to prioritize goals that offer the highest potential for learning, balancing exploration of novel states with exploitation of learnable regions. VUVC achieves this by defining the curriculum distribution p(g) as:
$p_t^{\mathrm{VUVC}}(g) \propto U(g)\, p_t^{\mathrm{visited}}(g)^{\alpha}$
where:
- U(g) is a measure of epistemic uncertainty about the value of pursuing goal g from the initial state s0. It is estimated using an ensemble of K value functions {Vψk}k=1K:
$U(g) = \mathrm{Var}_{k}\big[\,V_{\psi_k}(s_0, g)\,\big]$
Intuitively, high variance suggests disagreement among ensemble members, indicating high uncertainty and potentially significant learning opportunities. Proposition 1 provides a theoretical link, suggesting that maximizing U(g) relates to maximizing a lower bound on I(Vψ(s0,g);ψ∣s0,g), implying that goals with high uncertainty are informative about the value function parameters.
- $p_t^{\mathrm{visited}}(g)$ is the density of achieved goals (or visited states mapped to the goal space) up to time t. This is typically estimated using a density model (e.g., a β-VAE) trained on data from the replay buffer.
- α∈[−1,0) is a hyperparameter inherited from Skew-Fit, controlling the emphasis on exploring low-density regions. α=0 recovers uniform sampling over visited states, while α=−1 strongly prioritizes the least visited states.
By combining uncertainty U(g) and skewed density $p_t^{\mathrm{visited}}(g)^{\alpha}$, VUVC dynamically generates goals that are both in less-explored regions (due to the density term) and where the agent's value estimates are uncertain (due to the uncertainty term). Empirical analysis (Fig. 5) suggests U(g) tends to be higher for goals of intermediate difficulty – neither trivially easy nor currently unreachable – guiding the agent towards the frontier of its capabilities.
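The two ingredients can be combined directly when scoring candidate goals. Below is a minimal sketch under stated assumptions: `value_ensemble` is a list of K goal-conditioned value networks taking a concatenated (state, goal) batch, and `density_model` exposes a `log_prob` method — an assumed interface for the β-VAE-style density estimator, not the authors' API.

```python
import torch

def value_uncertainty(value_ensemble, s0, goals):
    """U(g) = Var_k[ V_{psi_k}(s_0, g) ] over a K-member value ensemble.

    s0: tensor of shape (1, state_dim); goals: tensor of shape (N, goal_dim).
    Each ensemble member maps a concatenated (state, goal) batch to values (N, 1).
    """
    with torch.no_grad():
        inputs = torch.cat([s0.expand(goals.shape[0], -1), goals], dim=-1)
        values = torch.stack([v(inputs) for v in value_ensemble], dim=0)  # (K, N, 1)
    return values.var(dim=0).squeeze(-1)                                  # (N,)

def vuvc_weights(value_ensemble, density_model, s0, goals, alpha=-1.0):
    """Normalized VUVC weights  U(g) * p_visited(g)^alpha  over candidate goals."""
    u = value_uncertainty(value_ensemble, s0, goals)
    with torch.no_grad():
        p = density_model.log_prob(goals).exp()   # assumed density-model interface
    w = u * p.pow(alpha)
    return w / w.sum()
```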
Theoretical Justification for Accelerated Entropy Increase
The paper provides theoretical arguments supporting VUVC's ability to accelerate learning, specifically in terms of increasing the entropy of the visited state distribution, a common measure of exploration quality.
- Proposition 2: Under the assumption that the policy π is optimal for the current intrinsic reward and that the uncertainty U(g) is negatively correlated with the log-density $\log p_t^{\mathrm{visited}}(g)$ (an empirically observed trend), VUVC increases the expected entropy of visited states faster than a uniform curriculum. This leverages the intuition that prioritizing uncertain (often less visited) goals accelerates coverage.
- Proposition 3: This result is extended to suboptimal policies. If VUVC prioritizes "informative" goals (goals g where the policy π(⋅∣g) leads to states s′ with lower density $p_t^{\mathrm{visited}}(s')$) over "uninformative" ones (where π(⋅∣g) leads to high-density states), then VUVC again accelerates the increase in visited state entropy compared to uniform sampling. The mechanism relies on the correlation between high uncertainty U(g) and the informativeness of goal g.
These propositions suggest that VUVC's combination of uncertainty and density skewing provides a principled mechanism for efficient exploration compared to simpler strategies like uniform or purely density-based sampling.
Implementation Details
Implementing VUVC involves integrating several components within an off-policy GCRL framework (e.g., based on HER with SAC or TD3); a condensed training-loop sketch tying these components together follows the list:
- Goal-Conditioned Policy and Value Functions: A policy πθ(a∣s,g) and an ensemble of K Q-value functions {Qψk(s,a,g)}k=1K (and corresponding value functions Vψk(s,g)=Ea∼πθ(⋅∣s,g)[Qψk(s,a,g)]). The ensemble can be implemented using techniques like randomized prior functions or bootstrapping data.
- Discriminator/Intrinsic Reward: A discriminator qλ(g∣s) is trained concurrently to approximate p(g∣s). The intrinsic reward is rg(s′)=logqλ(g∣s′). Depending on the choice of qλ, this might involve a separate neural network. The paper utilizes a fixed Gaussian discriminator q(g∣s)=N(g∣ϕ(s),σ2I) where ϕ(s) is the state-to-goal mapping (often identity or a learned encoder for vision-based tasks), simplifying the reward to an L2 distance in goal space.
- Density Model: A generative model, such as a β-VAE, is trained on achieved goals (or encoded states) stored in the replay buffer to estimate $p_t^{\mathrm{visited}}(g)$.
- Curriculum Sampling: At the beginning of each episode (or periodically), a batch of candidate goals is sampled (e.g., uniformly from the goal space or from the replay buffer). For each candidate goal g, uncertainty U(g) is computed using the value function ensemble, and density $p_t^{\mathrm{visited}}(g)$ is queried from the density model. The goal g for the next episode is sampled according to $p_t^{\mathrm{VUVC}}(g) \propto U(g)\, p_t^{\mathrm{visited}}(g)^{\alpha}$.
- Training Loop: The process involves alternating steps of:
- Collecting experience using the policy πθ with goals sampled from $p_t^{\mathrm{VUVC}}(g)$.
- Updating the policy πθ and Q-function ensembles {Qψk} using an off-policy algorithm (e.g., SAC) with HER and the intrinsic reward rg(s′).
- Updating the density model using data from the replay buffer.
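The sketch below condenses this loop under stated assumptions: `env`, `agent`, `replay_buffer`, and `density_model` are generic placeholder interfaces (none of these names come from the paper), and it reuses the hypothetical `intrinsic_reward` and `vuvc_weights` helpers sketched earlier.

```python
import numpy as np
import torch

def train_vuvc(env, agent, value_ensemble, density_model, replay_buffer,
               num_episodes, num_candidates=1024, alpha=-1.0, refit_every=10):
    """Condensed VUVC loop; all component interfaces are assumed placeholders."""
    for episode in range(num_episodes):
        s0 = env.reset()

        # 1. Curriculum sampling: score candidates by U(g) * p_visited(g)^alpha.
        candidates = replay_buffer.sample_achieved_goals(num_candidates)
        w = vuvc_weights(
            value_ensemble, density_model,
            torch.as_tensor(s0, dtype=torch.float32).unsqueeze(0),
            torch.as_tensor(candidates, dtype=torch.float32),
            alpha,
        ).cpu().numpy().astype(np.float64)
        goal = candidates[np.random.choice(len(candidates), p=w / w.sum())]

        # 2. Roll out the goal-conditioned policy, rewarding log q_lambda(g | s').
        state, done = s0, False
        while not done:
            action = agent.act(state, goal)
            next_state, _, done, _ = env.step(action)  # environment reward is ignored
            replay_buffer.add(state, action,
                              intrinsic_reward(next_state, goal), next_state, goal)
            state = next_state

        # 3. Off-policy update (e.g., SAC) with HER relabeling on the Q-ensemble.
        agent.update(replay_buffer.sample_her_batch())

        # 4. Periodically refit the density model on recently achieved goals.
        if episode % refit_every == 0:
            density_model.fit(replay_buffer.sample_achieved_goals(4096))
```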
Computational overhead mainly comes from maintaining the ensemble of Q-functions and training the density model. The uncertainty calculation and goal sampling add relatively minor costs per episode.
Experimental Validation
VUVC was evaluated on a range of challenging continuous control tasks:
- 2D Navigation: PointMaze environments of varying complexity.
- Robotic Manipulation (State-based): FetchPush, FetchPickAndPlace, FetchSlide with modified goal distributions.
- Robotic Manipulation (Vision-based): SawyerDoorHook, SawyerPickup, SawyerPush using latent representations from a pre-trained VAE.
- Robot Navigation (Sim-to-Real): Simulated Husky A200 environment and subsequent zero-shot transfer to a physical Husky robot.
Baselines: HER (standard GCRL), RIG, GoalGAN, DIAYN, EDL, Skew-Fit.
Key Findings:
- Sample Efficiency: VUVC demonstrated significantly improved sample efficiency compared to all baselines across most environments, achieving higher success rates faster (Figure 4). The improvement was particularly noticeable in complex mazes (PointMazeSquareLarge) and manipulation tasks.
- State Coverage: Qualitative visualizations of visited states/achieved goals (Figure 6, Appendix Figs. A.1-A.5) showed that VUVC explores the state space more rapidly and broadly compared to methods like Skew-Fit and GoalGAN. The curriculum distribution $p_t^{\mathrm{VUVC}}(g)$ was observed to adapt over time, initially focusing on uncertain regions and gradually spreading out.
- Real-World Transfer: Skills learned by VUVC in simulation (goal-reaching policies for the Husky robot) successfully transferred zero-shot to the real world, enabling navigation to goals up to 22m away. Combining the learned VUVC skill (local policy) with a global A* planner further improved performance, allowing the robot to reach distant goals (31m) more efficiently by leveraging the skill for local navigation between waypoints. This highlights the practical utility of the learned skills as primitives for hierarchical planning.
Practical Implications and Applications
The VCRL framework and the VUVC algorithm offer several practical benefits for applying RL in robotics and other domains:
- Reduced Supervision: Enables the autonomous discovery of a diverse set of skills without task-specific reward engineering, reducing reliance on expert knowledge and manual curriculum design.
- Improved Sample Efficiency: Faster learning translates to reduced interaction time, crucial for real-world robotic applications where data collection is expensive or time-consuming.
- Enhanced Exploration: The principled combination of uncertainty and density promotes more systematic exploration, potentially leading to the discovery of a wider range of useful behaviors.
- Hierarchical Control: The learned goal-conditioned policies naturally serve as temporally extended actions or skills for higher-level planners or hierarchical RL agents, facilitating the solution of complex, long-horizon tasks as demonstrated in the real-world navigation experiment.
- Adaptability: A robot equipped with a repertoire of autonomously discovered skills can potentially adapt more quickly to new downstream tasks by composing or fine-tuning existing skills.
Conclusion
The VCRL paper provides a unifying perspective on MI-based unsupervised skill discovery, framing it as curriculum learning in GCRL. The proposed VUVC algorithm leverages value function uncertainty and state density to create an adaptive curriculum that demonstrably accelerates learning and improves state space coverage in complex navigation and manipulation tasks. Its effectiveness, supported by theoretical arguments and validated through extensive simulations and real-world robot experiments, makes it a promising approach for developing more autonomous and capable agents.
Related Papers
- Outcome-directed Reinforcement Learning by Uncertainty & Temporal Distance-Aware Curriculum Goal Generation (2023)
- INFOrmation Prioritization through EmPOWERment in Visual Model-Based RL (2022)
- Behavior Contrastive Learning for Unsupervised Skill Discovery (2023)
- Variational Empowerment as Representation Learning for Goal-Based Reinforcement Learning (2021)
- Constrained Ensemble Exploration for Unsupervised Skill Discovery (2024)