Periodic Skill Discovery (PSD) in RL
- Periodic Skill Discovery (PSD) is an unsupervised RL framework that encodes periodic behaviors using latent circular constraints to induce temporal regularity.
- PSD leverages geometric constraints on encoder outputs to synthesize multi-timescale skills, eliminating the need for manual reward design in robotics applications.
- Empirical evaluations demonstrate that PSD produces diverse, periodic skills with superior performance on classic and pixel-based RL benchmarks compared to prior methods.
Periodic Skill Discovery (PSD) is an unsupervised skill discovery framework for reinforcement learning (RL) environments that explicitly targets the identification and synthesis of periodic behaviors. Periodicity—repetition of behavior with fixed or variable periods—is central to many robotics domains, especially those involving locomotion, where skill diversity across temporal scales is critical. PSD achieves this by constraining skill representations to lie on latent circles with period-specific diameters, enabling policies to synthesize and exploit temporally-structured skills without manual reward engineering or supervision. PSD demonstrates superior performance and skill diversity on both classic and pixel-based RL benchmarks, and can be combined with alternative skill discovery methods to yield multi-dimensional behavior spaces.
1. Motivation and Background
Unsupervised skill discovery methods in RL (such as DIAYN, DADS, CSD, and METRA) optimize objectives that maximize mutual information between states and skills, or maximize distance traveled in a learned latent space; the resulting skills are typically diverse in their endpoints or movement speed but are rarely periodic or multi-temporal. Such approaches do not model periodicity directly, even though it is a fundamental property of many robotic motions, including walking, swimming, and hopping.
Traditional robotics solutions to periodic skill creation rely on manually-designed central pattern generators or phase-variable reward shaping, often requiring domain expertise and offline data. PSD departs from these approaches by introducing a latent circle that directly encodes periodic behaviors as a geometric inductive bias via temporal distance metrics, and does so in a manner compatible with both state and pixel-based observations.
2. Mathematical Formulation
In PSD, the environment is modeled as an MDP without external reward. A policy $\pi(a \mid s, L)$ is conditioned on a discrete period variable $L$, which dictates the desired period $2L$ of the agent’s behavior.
Circular Encoder Construction
PSD employs an encoder $\phi : \mathcal{S} \to \mathbb{R}^d$ such that, for a trajectory $(s_0, s_1, \dots)$ rolled out under period variable $L$, the embeddings $\phi(s_t)$ lie on the circumference of a circle of diameter $D_L$. For tuples $(s_t, s_{t+1}, s_{t+L})$ sampled from the replay buffer $\mathcal{D}$, constraints of the following form are enforced:
$$\big\|\phi(s_{t+L}) - \phi(s_t)\big\|_2 = D_L, \qquad \big\|\phi(s_{t+1}) - \phi(s_t)\big\|_2 = d^*_L,$$
where $d^*_L$ is the optimal single-step chord length defined in the next section. The first constraint guarantees that a $2L$-step trajectory closes the latent circle (states half a period apart are diametrically opposite), while the second ensures equal angular progress per step. Explicit normalization onto a circle in $\mathbb{R}^2$ is optional; PSD relies on the constraints to sculpt circular structure within $\mathbb{R}^d$.
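The sketch below (not the authors' released code) makes these targets concrete: for a period variable $L$ and an assumed diameter $D_L$ (`diameter`, defaulting to 1.0 for illustration), it computes the target antipodal distance and chord length and measures how far a latent trajectory deviates from them; the function names are hypothetical.

```python
import math
import torch

def circle_targets(L: int, diameter: float = 1.0):
    """Target distances for a latent circle of diameter D_L and period 2L."""
    d_star = diameter * math.sin(math.pi / (2 * L))  # chord subtending central angle pi/L
    return diameter, d_star

def constraint_residuals(phi: torch.Tensor, L: int, diameter: float = 1.0):
    """phi: (T, d) latent trajectory; returns residuals of both constraints."""
    target_half, target_step = circle_targets(L, diameter)
    half = (phi[L:] - phi[:-L]).norm(dim=-1)   # pairs L steps apart should span a diameter
    step = (phi[1:] - phi[:-1]).norm(dim=-1)   # consecutive pairs should span one chord
    return (half - target_half).abs(), (step - target_step).abs()
```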
3. Temporal Distance, Intrinsic Reward, and Objective
Periodicity is incentivized by maximizing an intrinsic reward based on a temporal-distance metric. The optimal single-step chord length for a circle of diameter $D_L$ traversed over a period of $2L$ steps is
$$d^*_L = D_L \sin\!\left(\frac{\pi}{2L}\right).$$
Deviation from $d^*_L$ at each step yields the error
$$e_t = \Big|\, \big\|\phi(s_{t+1}) - \phi(s_t)\big\|_2 - d^*_L \,\Big|.$$
The intrinsic reward for PSD is then
$$r^{\mathrm{PSD}}_t = -\alpha\, e_t,$$
where $\alpha$ is a scale parameter.
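A minimal sketch of this reward, assuming the deviation is penalized as an absolute difference scaled by $\alpha$ (the paper's exact functional form may differ); the function name and default values are illustrative.

```python
import math
import torch

def psd_intrinsic_reward(phi_s: torch.Tensor, phi_next: torch.Tensor, L: int,
                         diameter: float = 1.0, alpha: float = 1.0) -> torch.Tensor:
    """phi_s, phi_next: (batch, d) embeddings of s_t and s_{t+1} under period variable L."""
    d_star = diameter * math.sin(math.pi / (2 * L))   # ideal chord length d*_L
    step_len = (phi_next - phi_s).norm(dim=-1)        # realized latent step length
    return -alpha * (step_len - d_star).abs()         # maximal when each step is one chord
```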
Encoder and Policy Objective
The encoder is optimized in expectation over tuples $(s_t, s_{t+1}, s_{t+L})$ sampled from $\mathcal{D}$, subject to the distance constraints above; an additional regularization term centers the latent circle at the origin. Lagrangian relaxation introduces a slack $\epsilon$ and multipliers $\lambda$, yielding an unconstrained objective in which constraint violations beyond $\epsilon$ incur penalties weighted by $\lambda$. The policy is trained via soft actor-critic (SAC), maximizing the expected discounted intrinsic reward
$$\mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t}\, r^{\mathrm{PSD}}_t\right].$$
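A sketch of how such a relaxed encoder loss could be assembled, under explicit assumptions: both distance constraints are penalized only beyond the slack $\epsilon$, weighted by learnable multipliers, plus a small centering term. The names, the `log_lambda_*` parameterization of the multipliers, and the default coefficients are assumptions rather than the paper's exact loss.

```python
import math
import torch

def encoder_constraint_loss(phi_t, phi_t1, phi_tL, L,
                            log_lambda_step, log_lambda_half,
                            diameter=1.0, eps=1e-3, center_coef=1e-2):
    """phi_t, phi_t1, phi_tL: (batch, d) embeddings of s_t, s_{t+1}, s_{t+L}."""
    d_star = diameter * math.sin(math.pi / (2 * L))
    step_err = ((phi_t1 - phi_t).norm(dim=-1) - d_star).abs()
    half_err = ((phi_tL - phi_t).norm(dim=-1) - diameter).abs()
    # Violations beyond the slack eps are penalized, weighted by exponentiated
    # multipliers; the multipliers themselves would be updated by dual ascent.
    penalty = (log_lambda_step.exp() * (step_err - eps).clamp(min=0).mean()
               + log_lambda_half.exp() * (half_err - eps).clamp(min=0).mean())
    centering = center_coef * phi_t.mean(dim=0).pow(2).sum()  # keep the circle centered at the origin
    return penalty + centering
```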
4. Algorithmic Instantiation and Network Architecture
Training proceeds in epochs over episodes. For each episode, the period variable $L$ is sampled uniformly, trajectories are collected by rolling out the period-conditioned policy $\pi(\cdot \mid s, L)$, and transitions are stored in the replay buffer $\mathcal{D}$. Each epoch includes (a minimal sketch follows this list):
- Encoder update via gradient ascent on the relaxed encoder objective, using minibatches sampled from $\mathcal{D}$.
- Intrinsic reward computation for all new transitions.
- SAC update for the policy and the two Q-network critics given the intrinsic rewards.
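The following high-level sketch shows one epoch in this form. Every object (`env`, `policy`, `encoder`, `sac`, `buffer`, `intrinsic_reward_fn`) is a hypothetical stand-in with an assumed interface, and the period set is illustrative; the episode and gradient-step counts follow Section 7.

```python
import random

def train_epoch(env, policy, encoder, sac, buffer, intrinsic_reward_fn,
                period_set=(4, 8, 16, 32), episodes=8, grad_steps=64):
    # 1) Collection: each episode is conditioned on a uniformly sampled period variable L.
    for _ in range(episodes):
        L = random.choice(period_set)
        obs, done = env.reset(), False
        while not done:
            action = policy.act(obs, L)
            next_obs, done = env.step(action)            # no extrinsic reward is used
            buffer.add(obs, action, next_obs, L, done)
            obs = next_obs
    # 2) Updates: alternate encoder and SAC steps on sampled minibatches.
    for _ in range(grad_steps):
        encoder.update(buffer.sample())                   # constrained circular-embedding loss
        batch = buffer.sample()
        rewards = intrinsic_reward_fn(encoder, batch)     # chord-deviation reward (Section 3)
        sac.update(batch, rewards)                        # policy + twin Q-critic update
```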
Architecture details for state-based tasks:
- Encoder $\phi$: MLP with two hidden layers of 1024 ReLU units each; the output is the circular latent embedding.
- Policy $\pi$: same MLP structure; the period variable $L$ is provided either as an embedding or as a sinusoidal positional encoding concatenated with the observation (a minimal sketch of these networks follows this list).
- Critics: Two networks with same structure as policy.
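A minimal PyTorch sketch of these state-based networks, assuming the 1024-unit two-hidden-layer MLPs and the 8-dimensional positional encoding listed in Section 7; the latent dimension default, the encoding frequencies, and all class and function names are assumptions.

```python
import math
import torch
import torch.nn as nn

def period_encoding(L: int, dim: int = 8) -> torch.Tensor:
    """Sinusoidal positional encoding of the discrete period variable L."""
    i = torch.arange(dim // 2, dtype=torch.float32)
    freqs = torch.exp(-math.log(10000.0) * 2 * i / dim)
    return torch.cat([torch.sin(L * freqs), torch.cos(L * freqs)])

def mlp(in_dim: int, out_dim: int, hidden: int = 1024) -> nn.Sequential:
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class Encoder(nn.Module):
    """State encoder phi; latent_dim=2 is an assumed circular latent dimension."""
    def __init__(self, obs_dim: int, latent_dim: int = 2):
        super().__init__()
        self.net = mlp(obs_dim, latent_dim)
    def forward(self, obs):
        return self.net(obs)

class PolicyTrunk(nn.Module):
    """Policy trunk: observation concatenated with the period encoding of L.
    The SAC head (action mean / log-std) is omitted from this sketch."""
    def __init__(self, obs_dim: int, out_dim: int, pe_dim: int = 8):
        super().__init__()
        self.pe_dim = pe_dim
        self.net = mlp(obs_dim + pe_dim, out_dim)
    def forward(self, obs, L: int):
        pe = period_encoding(L, self.pe_dim).to(obs).expand(obs.shape[0], -1)
        return self.net(torch.cat([obs, pe], dim=-1))
```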
For pixel-based tasks:
- Observation: stack of three RGB frames, randomly cropped during training.
- Encoder $\phi$: four-layer CNN (32–32–64–64 filters, stride 2), followed by a two-layer MLP.
- Losses and intrinsic reward computation are the same as in the state-based variant (a minimal sketch of the pixel encoder follows this list).
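A minimal PyTorch sketch of the pixel encoder, following the listed filter counts and stride; the kernel size, MLP width, and latent dimension are assumptions where the text does not specify them.

```python
import torch
import torch.nn as nn

class PixelEncoder(nn.Module):
    """CNN encoder for a stack of three RGB frames (9 input channels)."""
    def __init__(self, latent_dim: int = 2, in_channels: int = 9, kernel: int = 3):
        super().__init__()
        channels = [in_channels, 32, 32, 64, 64]
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=kernel, stride=2), nn.ReLU()]
        self.conv = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.LazyLinear(1024), nn.ReLU(),    # two-layer MLP head
                                  nn.Linear(1024, latent_dim))
    def forward(self, frames):      # frames: (batch, 9, H, W), randomly cropped beforehand
        return self.head(self.conv(frames).flatten(start_dim=1))
```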
5. Empirical Results and Skill Diversity
PSD produces skills exhibiting frequency diversity well beyond prior frameworks. Fast Fourier Transform (FFT) analyses indicate that PSD’s latent trajectories cover a broad frequency spectrum, in contrast to DIAYN (static/low-frequency) and METRA/CSD (narrowband/high-frequency).
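As an illustration of this kind of diagnostic, the generic snippet below estimates the dominant frequency of each latent coordinate of a rollout; it is an assumed analysis utility, not the paper's evaluation code.

```python
import numpy as np

def dominant_frequency(latent_traj: np.ndarray, dt: float = 1.0) -> np.ndarray:
    """latent_traj: (T, d) latent trajectory; returns one dominant frequency per dimension."""
    centered = latent_traj - latent_traj.mean(axis=0)    # remove the DC component
    spectrum = np.abs(np.fft.rfft(centered, axis=0))     # magnitude spectrum per dimension
    freqs = np.fft.rfftfreq(centered.shape[0], d=dt)     # frequencies in cycles per step
    return freqs[spectrum.argmax(axis=0)]
```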
On downstream RL tasks (hurdle and friction variants for HalfCheetah and Walker2D), PSD policies outperform competing methods in average episodic return (mean ± std across 10 seeds):
| Environment | DIAYN | DADS | CSD | METRA | PSD |
|---|---|---|---|---|---|
| HalfCheetah-hurdle | 0.6±0.5 | 0.9±0.3 | 0.8±0.6 | 1.9±0.8 | 3.8±2.0 |
| Walker2D-hurdle | 2.6±0.5 | 1.9±0.3 | 4.1±1.3 | 3.1±0.5 | 5.4±1.4 |
| HalfCheetah-friction | 13.2±3.4 | 12.4±2.9 | 12.5±3.8 | 30.1±13.1 | 43.4±19.1 |
| Walker2D-friction | 4.6±1.2 | 1.6±0.1 | 5.3±0.3 | 5.2±1.6 | 8.7±1.7 |
PSD skills enable coordinated, multi-timescale behavior as evidenced by smoother transitions and higher returns in the presence of obstacles or varying friction.
Empirical ablations indicate that adaptive sampling of the period variable $L$ expands the range of feasible skill periods, with empirical step-size and period histograms matching theoretical values to within 5% relative error (see Figures 1 and 2 in Park et al., 5 Nov 2025).
6. Integration with METRA and Multi-Dimensional Skills
METRA optimizes movement direction via a latent vector $z$, with encoder and policy constrained to (at most) unit step size in latent space. PSD and METRA can be integrated by cross-conditioning each encoder on the other’s skill variable, forming joint encoders $\phi_{\mathrm{PSD}}$ and $\phi_{\mathrm{METRA}}$ and training with the summed reward $r^{\mathrm{PSD}} + r^{\mathrm{METRA}}$. This produces a skill space in which $z$ controls movement direction and $L$ controls period, both learned in a fully unsupervised fashion. FFT and t-SNE visualizations in (Park et al., 5 Nov 2025) demonstrate extensive coverage of both skill dimensions.
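A minimal sketch of how the summed reward could be computed per transition, assuming the METRA term is the latent-step projection onto $z$ and the PSD term is the chord-deviation penalty from Section 3; the function name and defaults are illustrative.

```python
import numpy as np

def combined_intrinsic_reward(phi_psd_t, phi_psd_t1, phi_metra_t, phi_metra_t1,
                              z, L, diameter=1.0, alpha=1.0):
    """phi_* are latent embeddings of s_t / s_{t+1}; z is a unit direction vector."""
    # METRA-style reward: progress of the latent step projected onto the direction z.
    r_metra = float(np.dot(phi_metra_t1 - phi_metra_t, z))
    # PSD-style reward: negative deviation from the ideal chord length for period 2L.
    d_star = diameter * np.sin(np.pi / (2 * L))
    r_psd = -alpha * abs(np.linalg.norm(phi_psd_t1 - phi_psd_t) - d_star)
    return r_psd + r_metra
```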
7. Implementation and Reproducibility
PSD experiments run on single NVIDIA A6000 GPUs, with ~24 hours required per task. Hyperparameters include:
- Encoder/policy/critic learning rate:
- Discount factor:
- SAC target smoothing: 0.995
- Automated entropy tuning
- Replay buffer: transitions (state), (pixel)
- Minibatch size: 1024 (encoder), 256 (policy/critics), 512/256 (pixel variant)
- Circular latent dimension :
- Positional-encoding dimension: 8 (state), 128 (pixel)
- Loss slack , centering weight
- Lagrange multipliers (Ant, HC) or 10 (others)
- Intrinsic reward coefficient
- Episodes per epoch: 8; gradient steps: 64
- Policy hidden layers: 2 × 1024 units
- Downstream PPO (high level): LR actor , critic , clip , batch size 256, 80 updates/epoch
All environment wrappers, code, and random seeds are scheduled for public release at https://jonghaepark.github.io/psd/.
8. Significance and Implications
PSD demonstrates that geometric constraints are sufficient to induce temporal regularity and diversity in unsupervised skill discovery, outperforming prior methods on tasks that intrinsically require periodic coordination. The framework decouples period assignment from task supervision, and is compatible with both low- and high-dimensional observations. A plausible implication is that circular latent structure can serve as a domain-agnostic inductive bias in RL, facilitating unsupervised curriculum learning, modular skill libraries, and downstream transfer in robotics. The approach generalizes beyond locomotion and can be composed with other skill discovery objectives for multidimensional skill synthesis.
9. Common Misconceptions and Limitations
PSD does not require hand-crafted rewards or manual periodic priors, nor does it rely on domain-specific knowledge; periodicity is induced solely via circular latent constraints. However, the framework presupposes that temporally repeatable structure is a meaningful objective, and tasks lacking temporal periodicity may not benefit from PSD’s geometric bias. Furthermore, period length is treated as a discrete variable, so precise control over period granularity may require fine parameter tuning or adaptive expansion. Scalability to very high-dimensional state or action spaces is conditional on encoder and policy capacity; results in (Park et al., 5 Nov 2025) indicate robustness up to stacked RGB pixel observations.
Future directions include generalization to quasiperiodic or aperiodic skill discovery, hierarchical composition with non-periodic skills, and integration with skill segmentation approaches (cf. SKID RAW (Tanneberg et al., 2021)) for temporal abstraction.