
Bottom-Up Skill Discovery

Updated 9 March 2026
  • Bottom-Up Skill Discovery is defined as an unsupervised RL paradigm where agents autonomously discover diverse skills through exploration and intrinsic objectives.
  • Key methodologies include mutual information maximization, distance- and density-based objectives, and structured state factorization to improve skill diversity and coverage.
  • Empirical evaluations demonstrate that these approaches boost sample efficiency and downstream task performance via hierarchical and incrementally discovered skill libraries.

Bottom-up skill discovery is a paradigm within unsupervised reinforcement learning (RL) and imitation learning in which an agent autonomously acquires a repertoire of diverse, reusable behaviors (“skills”) by acting in its environment with minimal or no task-specific guidance. Unlike top-down approaches, which require humans to specify task decomposition or skill libraries, bottom-up methodologies enable skills to emerge incrementally or hierarchically through exploration, guided objective optimization, or structure-aware discovery mechanisms. This article presents the core methodologies, mathematical formalisms, representative algorithms, empirical findings, and current research trajectories associated with bottom-up unsupervised skill discovery.

1. Mathematical Foundations and Objectives

The standard mathematical formalism for bottom-up skill discovery is built on maximizing the mutual information (MI) between a skill variable $Z$ and some measurement of the agent's behavior, such as states $S$, state transitions, or successor representations. The canonical objective is

$$I(Z;S) = H(Z) - H(Z \mid S) = H(S) - H(S \mid Z)$$

which rewards acquiring a set of $k$ policies $\{\pi(\cdot \mid s, z) : z = 1, \dots, k\}$ whose induced state distributions $\rho_\pi(s \mid z)$ are easily distinguishable and collectively cover a broad region of the environment (Campos et al., 2020, Nieto et al., 2021, Park et al., 2022).
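As a concrete instance of this objective, DIAYN-style methods train a discriminator $q(z \mid s)$ and reward each skill for visiting states from which the discriminator can recover $z$. A minimal sketch of the variational reward, assuming the discriminator logits come from some learned network (here they are simply an input):

```python
import numpy as np

def log_softmax(logits):
    logits = logits - np.max(logits)               # numerical stability
    return logits - np.log(np.sum(np.exp(logits)))

def mi_intrinsic_reward(discriminator_logits, z, num_skills):
    # Variational lower bound on I(Z;S): log q(z|s) - log p(z),
    # with p(z) uniform over the k skills. The reward is positive
    # exactly when skill z is identified better than chance.
    return log_softmax(discriminator_logits)[z] + np.log(num_skills)
```

A state that is uninformative about the skill (uniform logits) yields zero reward, which is one way to see why MI objectives can settle for merely distinguishable rather than far-reaching skills.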

However, MI-based objectives often yield “static” or trivial skills due to invariance properties, leading to the development of alternative formulations:

  • Distance- or direction-maximizing objectives: Encourage dynamic, far-reaching skills by maximizing expected movement in a learnable, often Lipschitz-constrained representation, e.g., $J_\mathrm{LSD}(\pi, f) = \mathbb{E}_{z,\tau}[(f(s_T) - f(s_0))^\top z]$ with a Lipschitz constraint on $f$ (Park et al., 2022, Hosseini et al., 2 Feb 2026).
  • State-density deviation: Encourage each skill $z$ to explore states not visited by other skills, e.g., by maximizing the deviation of $d^\pi_z(s)$ from $\sum_{z' \neq z} d^\pi_{z'}(s)$; implemented via variational autoencoders or contrastive estimators (Xiao et al., 17 Jun 2025).
  • Empowerment and state-marginal covering: Direct computation or ELBO-based optimization of state-covering policies via latent-variable generative models pre-fit to coverage distributions $p(s)$, decoupling skill assignment from policy reachability (Campos et al., 2020).
  • Regret-based curriculum: Adaptive, adversarial games between policy and skill generator, driving discovery along directions in skill space where the agent's value function is still improving; this ensures continual expansion and avoids redundancy: $\min_{\theta_1} \max_{\theta_2} \mathbb{E}_{z \sim P^k_z}[V_{\pi_{\theta_1}^k}(s_0 \mid z) - V_{\pi_{\theta_1}^{k-1}}(s_0 \mid z)]$ (Zhang et al., 26 Jun 2025).
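The distance-maximizing family admits an especially simple per-step reward. The sketch below is illustrative rather than any paper's exact implementation; it shows the LSD-style projected-displacement reward and checks numerically that it telescopes to the trajectory-level objective $(f(s_T) - f(s_0))^\top z$:

```python
import numpy as np

def lsd_step_reward(f_next, f_cur, z):
    # Reward the displacement of the (assumed Lipschitz-constrained)
    # representation f, projected onto the skill direction z.
    return float((f_next - f_cur) @ z)

# Toy check: per-step rewards telescope to the trajectory objective.
rng = np.random.default_rng(0)
feats = rng.normal(size=(10, 4))       # f(s_0), ..., f(s_9) for one rollout
z = rng.normal(size=4)
episode_return = sum(
    lsd_step_reward(feats[t + 1], feats[t], z) for t in range(len(feats) - 1)
)
```

Without the Lipschitz constraint on $f$, this objective is unbounded: the representation could inflate distances arbitrarily instead of the policy actually traveling farther.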

2. Structured State Factorization and Compositionality

In environments with natural factorization (multibody robotics, multi-object scenes), representing the state as $S = S_1 \times \dots \times S_N$ enables more fine-grained and efficient skill discovery (Hosseini et al., 2 Feb 2026, Wang et al., 2024). Each latent skill vector $z = (z^1, \dots, z^N)$ can be aligned to specific factors, and exploration can be adaptively weighted per factor by observing the novelty or coverage of each component.

  • Structured Unsupervised Skill Discovery (SUSD) implements factor-specific embedding functions $\phi_i : S_i \rightarrow \mathbb{R}^D$, defines factorized intrinsic rewards, and adaptively focuses on under-covered factors via a learned density model $q_e(s' \mid s)$ (Hosseini et al., 2 Feb 2026). This ensures disentangled skills capable of controlling specific entities and facilitates downstream hierarchical RL.
  • Skill Discovery from Local Dependencies (SkiLD) targets not just independent factor coverage but also the interaction graph (e.g., which entities affect each other during transitions). Skills are indexed by dependency matrices $g \in \{0,1\}^{N \times (N+1)}$, capturing minimal causally-inducible subgraphs. Coverage of these graphs is directly linked to downstream manipulation success (Wang et al., 2024).

Compared to MI-only approaches, structured factorization yields a combinatorial reduction in the effective search space and unlocks skills that are semantically richer and more directly applicable in complex, compositional environments. Empirical results demonstrate SUSD's $2$–$3\times$ improvement in worst-factor coverage and dramatically lower sample complexity on compositional HRL benchmarks (Hosseini et al., 2 Feb 2026); SkiLD outperforms state-coverage methods on long-horizon multi-object tasks, being the only method with nonzero success on the hardest benchmarks (Wang et al., 2024).
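The adaptive per-factor weighting idea can be sketched as a softmax over coverage deficits. This is a hypothetical illustration, not SUSD's exact rule: the coverage estimates would come from the learned density model, which is stubbed out as a plain input here.

```python
import numpy as np

def factor_exploration_weights(coverage, temperature=0.5):
    # coverage[i]: estimated fraction of factor S_i already covered
    # (in a SUSD-style method, derived from a learned density model).
    # Lower coverage -> larger deficit -> more exploration weight.
    deficit = 1.0 - np.asarray(coverage, dtype=float)
    logits = deficit / temperature
    logits -= logits.max()                 # numerical stability
    w = np.exp(logits)
    return w / w.sum()
```

The temperature controls how aggressively exploration concentrates on the single worst-covered factor versus spreading across all under-covered ones.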

3. Algorithmic Pipelines and Hierarchical Organization

A majority of bottom-up skill discovery frameworks adopt one of the following organizational schemes:

  • Simultaneous skill learning: All skills are discovered in parallel (e.g., fixed-size latents), usually with MI or state-density differentiation objectives (Park et al., 2022, Nieto et al., 2021, Xiao et al., 17 Jun 2025).
  • Incremental/sequential skill learning: Skills are acquired one after another, with previously mastered policies frozen to prevent catastrophic forgetting; each new skill is driven to cover previously unmodeled or altered environment regions (Shafiullah et al., 2022).
  • Hierarchical or tree-structured policies: Skill libraries are grown by curriculums such as tree policies (ELSIM), hierarchical selection over a topological cluster graph (DisTop), or multi-layered controllers that select skill/subgoal/atomic action (Aubret et al., 2020, Gehring et al., 2021, Du et al., 23 May 2025).
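The incremental variant can be captured in a few lines. The hooks below (`coverage_gap`, `train_new_skill`) are hypothetical placeholders for whatever objective and learner a given method uses; the essential property is that existing skills are never updated:

```python
def grow_skill_library(library, coverage_gap, train_new_skill, rounds):
    # Sequential/incremental discovery: prior skills stay frozen, so
    # there is no catastrophic forgetting; each new skill is trained
    # against whatever the current library leaves uncovered.
    for _ in range(rounds):
        gap = coverage_gap(library)          # uncovered/altered regions
        library = library + [train_new_skill(gap)]
    return library
```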

A representative, algorithm-agnostic pipeline for skill discovery and hierarchical reuse is:

| Stage | Key Operations / Methods |
| --- | --- |
| Skill initialization | Random, VQ-VAE, iterative growth, or LLM-based task synthesis |
| Skill optimization | MI/distance/density-deviation objectives, dual updates over embedding/skill |
| Coverage monitoring | Per-skill state-factor coverage, topological clusters, or interaction graphs |
| Downstream composition | High-level policy (SAC, RL, LLM selection, clustering) over skill library |
| Continual/evolving | Skill library is grown (sequential/incremental) or refined adaptively |

Hierarchical execution often involves a high-level controller selecting a skill or subgoal every $L$ steps, passing control to the low-level policy $\pi_\theta(a \mid s, z)$, which is typically frozen after discovery (Hosseini et al., 2 Feb 2026, Aubret et al., 2020, Gehring et al., 2021).
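This execution scheme reduces to a short control loop. The toy chain environment and the lambda policies below are illustrative stand-ins for a trained high-level controller and a frozen skill-conditioned policy:

```python
class ChainEnv:
    """Toy 1-D chain, used only to exercise the control loop."""
    def __init__(self, goal):
        self.goal = goal
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s += a
        return self.s, self.s >= self.goal   # (state, done)

def hierarchical_rollout(env, high_policy, low_policy, horizon, L):
    s = env.reset()
    z = None
    for t in range(horizon):
        if t % L == 0:
            z = high_policy(s)       # select skill/subgoal every L steps
        a = low_policy(s, z)         # frozen low-level policy pi(a|s,z)
        s, done = env.step(a)
        if done:
            break
    return s
```

Because the low-level policy is frozen, only the high-level selection is trained for the downstream task, which is the main source of the sample-efficiency gains reported for hierarchical reuse.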

4. Bottom-Up Skill Discovery Beyond RL: Language and Imitation

Recent work extends bottom-up paradigms to LLM-driven and demonstration-based settings:

  • LLM-Driven Autonomous Skill Bootstrapping: An LLM proposes atomic tasks, codes success/reward functions, and orchestrates RL policy training; successful policies are verified by VLMs and aggregated without any prior skill library (Zhao et al., 2024). This yields a growing, reliable skill set starting from zero primitives, supporting further skill proposals and compositional, top-down planning as needed.
  • Unsegmented demonstration parsing: Hierarchical task structure is recovered from unsegmented demonstrations via clustering, skills are identified from recurring patterns, and meta-controllers compose these skills for long-horizon tasks. Skills extracted from multi-task demonstrations boost average task success by 8% relative to per-task extraction (Zhu et al., 2021).
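Both settings share a verify-then-commit loop: propose a candidate skill, train it, and add it to the library only if an external check passes. A schematic version in which `propose_task`, `train_policy`, and `verify` are hypothetical stubs standing in for the LLM, RL, and VLM components respectively:

```python
def bootstrap_skill_library(propose_task, train_policy, verify, rounds):
    # Skills accumulate from zero primitives; later proposals can
    # condition on the current library, enabling compositional
    # bootstrapping of increasingly complex behaviors.
    library = []
    for _ in range(rounds):
        task = propose_task(library)        # e.g. an LLM task proposal
        policy = train_policy(task)         # e.g. RL on an LLM-written reward
        if verify(policy, task):            # e.g. a VLM success check
            library.append((task, policy))
    return library
```

The verification gate is what keeps the growing library reliable: unverified policies are simply discarded rather than contaminating later proposals.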

This line of work supports emergent libraries with semantically meaningful, verifiable skills applicable in open-ended, human-in-the-loop or sim2real scenarios.

5. Empirical Evaluation and Comparative Performance

Skill discovery methods are typically evaluated on state coverage, skill diversity and distinguishability, sample efficiency, and downstream (including zero-shot) task performance.

Empirical findings indicate that bottom-up, structure-aware and curriculum-driven methods achieve state-of-the-art performance in both coverage and sample efficiency compared to mutual-information or count-based baselines. Regret-aware skill generators, for example, improve zero-shot task success by up to 15% in high-dimensional settings (Zhang et al., 26 Jun 2025), while language-driven bootstrapping achieves robust, semantically meaningful behavior with full vision-language verification protocols (Zhao et al., 2024).

6. Challenges, Limitations, and Open Directions

While bottom-up skill discovery has advanced substantially, key open research problems include:

  • Representation constraints: Methods with state-metric or feature requirements (e.g., Lipschitz conditioning, state factorization) may suffer in high-dimensional observation spaces without structured priors (Park et al., 2022, Hosseini et al., 2 Feb 2026).
  • Skill redundancy and scalability: Uniform skill sampling can produce excessive overlap among skill policies, especially in non-factorized or large latent spaces; curriculum or regret-aware selection mitigates but does not eliminate this (Zhang et al., 26 Jun 2025).
  • Skill functionalization and parameterization: Most skill libraries consist of flat, fixed policies; automatic task/parameter slot inference for reusable, generalized skill functions remains a major unsolved challenge (Du et al., 23 May 2025, Zhao et al., 2024).
  • Continual adaptation: Parameter footprint grows linearly with the number of independent skill modules. Selective parameter sharing, modularization, or memory-efficient extensions are needed for lifelong deployment (Shafiullah et al., 2022, Aubret et al., 2020).
  • Evaluation and benchmarking: Open-ended domains (e.g., games, real robotics) lack reproducible, standardized reset protocols, complicating benchmarking and skill success evaluation (Du et al., 23 May 2025, Zhao et al., 2024).
  • Sim2real transfer: While spectral decomposition methods (Ma et al., 2024) and robust verification address some gaps, systematic sim2real skill transfer and adaptation continue to be active directions.

7. Directions for Future Research

Advances in bottom-up skill discovery are converging with large-scale unsupervised representation learning, language-driven robotics, open-ended game agents, and lifelong RL. Promising directions include:

  • Tightly coupled factorization and compositional architectures, enabling efficient manipulation and interaction-centric behaviors in rich domains (Hosseini et al., 2 Feb 2026, Wang et al., 2024).
  • Regret- or curriculum-guided self-pacing, with learnable skill generators and adaptive skill mixture policies (Zhang et al., 26 Jun 2025).
  • Language-grounded, autonomous skill construction leveraging LLM generation of tasks, success/reward functionals, and compositional skill-chaining for open vocabulary primitives with RL and imitation (Zhao et al., 2024).
  • Enhanced downstream composition, reuse, and abstraction, possibly through multiplicative or modular compositional policies (Jansonnie et al., 2024).
  • Robust sim2real pipelines, including spectral skill basis expansions and orthogonalization to handle dynamic mismatches (Ma et al., 2024).
  • Hybrid implicit–explicit objectives, combining MI/density maximizing with directed, graph- or task-conditioned constraints.
  • Open-ended agent ecosystems with asynchronous skill sharing, consensus, and decentralized conflict resolution (Du et al., 23 May 2025).

The rapid evolution of bottom-up skill discovery continues to redefine the practical and theoretical limits of autonomous lifelong learning, with deep implications for embodied AI, robotics, and open-world agent design.
