
Periodic Skill Discovery (PSD) in RL

Updated 12 November 2025
  • Periodic Skill Discovery (PSD) is an unsupervised RL framework that encodes periodic behaviors using latent circular constraints to induce temporal regularity.
  • PSD leverages geometric constraints on encoder outputs to synthesize multi-timescale skills, eliminating the need for manual reward design in robotics applications.
  • Empirical evaluations demonstrate that PSD produces diverse, periodic skills with superior performance on classic and pixel-based RL benchmarks compared to prior methods.

Periodic Skill Discovery (PSD) is an unsupervised skill discovery framework for reinforcement learning (RL) environments that explicitly targets the identification and synthesis of periodic behaviors. Periodicity—repetition of behavior with fixed or variable periods—is central to many robotics domains, especially those involving locomotion, where skill diversity across temporal scales is critical. PSD achieves this by constraining skill representations to lie on latent circles with period-specific diameters, enabling policies to synthesize and exploit temporally-structured skills without manual reward engineering or supervision. PSD demonstrates superior performance and skill diversity on both classic and pixel-based RL benchmarks, and can be combined with alternative skill discovery methods to yield multi-dimensional behavior spaces.

1. Motivation and Background

Unsupervised skill discovery methods in RL (such as DIAYN, DADS, CSD, METRA) optimize objectives that maximize mutual information between states and skills or maximize distance traveled in a learned latent space, typically resulting in skills that are diverse in endpoints or in movement speed but rarely periodic or multi-temporal. Such approaches do not model periodicity directly, even though it is a fundamental property of many robotic motions, including walking, swimming, and hopping.

Traditional robotics solutions to periodic skill creation rely on manually-designed central pattern generators or phase-variable reward shaping, often requiring domain expertise and offline data. PSD departs from these approaches by introducing a latent circle that directly encodes periodic behaviors as a geometric inductive bias via temporal distance metrics, and does so in a manner compatible with both state and pixel-based observations.

2. Mathematical Formulation

In PSD, the environment is modeled as an MDP $\mathcal M = (\mathcal S, \mathcal A, \mathcal P)$ without external reward. A policy $\pi(a \mid s, L)$ is conditioned on a discrete period variable $L \in \{L_{\min}, \ldots, L_{\max}\}$, dictating the desired period $2L$ of the agent's behavior.

Circular Encoder Construction

PSD employs an encoder $\phi: \mathcal S \times \mathbb N \to \mathbb R^d$ such that, for a trajectory $\{s_t\}$, the embeddings $\phi_L(s_t)$ lie on the circumference of a circle of diameter $L$. For tuples $(L, s_t, s_{t+1}, s_{t+L})$ sampled from the replay buffer $\mathcal D$, the following constraints are enforced:

$$\|\phi_L(s_{t+L})-\phi_L(s_t)\|_2 \leq L, \qquad \|\phi_L(s_{t+1})-\phi_L(s_t)\|_2 \leq L\sin\!\left(\frac{\pi}{2L}\right)$$

The first constraint guarantees that a $2L$-step trajectory closes the latent circle, while the second ensures equal angular progress per step. Explicit normalization to $S^1$ (the circle in $\mathbb R^2$) is optional; PSD relies on the constraints to sculpt $\phi_L$ within $\mathbb R^d$.
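As a quick numerical sanity check on this geometry (not taken from the paper), the following numpy snippet places $2L$ equally spaced points on a circle of diameter $L$ and verifies that both bounds are met with equality:

```python
import numpy as np

L = 8                                       # period parameter; the trajectory closes the circle in 2L steps
radius = L / 2.0                            # circle of diameter L
angles = np.pi * np.arange(2 * L + 1) / L   # equal angular progress of pi/L per step
points = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Single-step chord length equals L * sin(pi / (2L)) at every step.
step_dists = np.linalg.norm(points[1:] - points[:-1], axis=1)
print(np.allclose(step_dists, L * np.sin(np.pi / (2 * L))))   # True

# Points L steps apart are diametrically opposite, so their distance saturates the bound L.
half_period_dists = np.linalg.norm(points[L:] - points[:-L], axis=1)
print(np.allclose(half_period_dists, L))                      # True
```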

3. Temporal Distance, Intrinsic Reward, and Objective

Periodicity is incentivized by maximizing an intrinsic reward based on a temporal-distance metric. For a circle of diameter $L$ traversed in $2L$ equal steps, the optimal single-step chord length is

$$\ell_L = L\sin\!\left(\frac{\pi}{2L}\right)$$

The deviation from $\ell_L$ at each step $t$ is

$$\Delta_t = \|\phi_L(s_{t+1}) - \phi_L(s_t)\|_2 - \ell_L$$

The intrinsic reward for PSD is then

$$r_{\mathrm{PSD}}(s_t, s_{t+1}, L) = \exp\!\left(-\kappa\, \Delta_t^2\right)$$

where $\kappa > 0$ is a scale parameter.
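A direct translation of this reward into code might look as follows (a minimal sketch; the function name and interface are illustrative, not taken from the released implementation):

```python
import numpy as np

def psd_intrinsic_reward(phi_t, phi_t1, L, kappa=10.0):
    """Hypothetical helper computing r_PSD for a single transition.

    phi_t, phi_t1: embeddings phi_L(s_t) and phi_L(s_{t+1}), shape [d].
    L: period parameter; kappa: reward scale (kappa = 10 in the reported experiments).
    """
    ell_L = L * np.sin(np.pi / (2 * L))              # ideal single-step chord length
    delta = np.linalg.norm(phi_t1 - phi_t) - ell_L   # deviation from the ideal chord
    return np.exp(-kappa * delta ** 2)               # equals 1 when the step length is exactly ell_L
```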

Encoder and Policy Objective

The encoder $\phi$ is optimized in expectation over sampled tuples:

$$\max_{\phi}\;\mathbb{E}_{(L,\, s_t,\, s_{t+L}) \sim \mathcal D}\Big[\|\phi_L(s_{t+L})-\phi_L(s_t)\|_2 \;-\; k\,\|\phi_L(s_{t+L}) + \phi_L(s_t)\|_2\Big]$$

subject to the distance constraints above; the second term, weighted by $k$, keeps the latent circle centered at the origin. Lagrangian relaxation with slack $\epsilon$ and multipliers $\lambda_1, \lambda_2$ yields the objective $\mathcal J_{\mathrm{PSD},\phi}$, in which constraint violations incur penalties:

$$\mathcal J_{\mathrm{PSD},\phi} = \mathbb{E}_{\mathcal D}\Big[\cdots \;+\; \lambda_1 \min\!\big(\epsilon,\; L - \|\phi_L(s_{t+L})-\phi_L(s_t)\|_2\big) \;+\; \lambda_2 \min\!\big(\epsilon,\; L\sin\!\big(\tfrac{\pi}{2L}\big) - \|\phi_L(s_{t+1})-\phi_L(s_t)\|_2\big)\Big]$$

The policy $\pi$ is trained with soft actor-critic (SAC) to maximize the expected discounted intrinsic reward:

$$\max_\pi\; \mathbb{E}\Big[\sum_{t=0}^{T-1}\gamma^t\, r_{\mathrm{PSD}}(s_t, s_{t+1}, L)\Big]$$
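A PyTorch-style sketch of the encoder loss (negated so it can be minimized) is given below; `phi`, the batch layout, and the fixed multipliers are illustrative assumptions, not the official implementation:

```python
import math
import torch

def psd_encoder_loss(phi, s_t, s_t1, s_tL, L, k=0.5, eps=1e-5, lam1=5.0, lam2=5.0):
    """Illustrative encoder loss (negative of J_PSD,phi); not the official implementation.

    phi: callable mapping (state batch, period batch) -> latent embeddings [B, d].
    s_t, s_t1, s_tL: states at steps t, t+1, and t+L; L: float tensor of periods, shape [B].
    """
    z_t, z_t1, z_tL = phi(s_t, L), phi(s_t1, L), phi(s_tL, L)
    d_half = torch.norm(z_tL - z_t, dim=-1)           # latent distance across half a period
    d_step = torch.norm(z_t1 - z_t, dim=-1)           # single-step latent distance
    center = torch.norm(z_tL + z_t, dim=-1)           # keeps the latent circle centered at the origin
    ell_L = L * torch.sin(math.pi / (2 * L))          # ideal single-step chord length

    objective = (
        d_half - k * center
        + lam1 * torch.clamp(L - d_half, max=eps)      # min(eps, L - d_half): penalizes d_half > L
        + lam2 * torch.clamp(ell_L - d_step, max=eps)  # min(eps, ell_L - d_step): penalizes d_step > ell_L
    )
    return -objective.mean()                           # minimize the negative to maximize J_PSD,phi
```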

4. Algorithmic Instantiation and Network Architecture

Training proceeds in epochs over episodes. For each episode, $L$ is sampled uniformly, trajectories are collected by rolling out $\pi(a \mid s, L)$, and the transitions are stored in $\mathcal D$. Each epoch then includes the following steps (a schematic training loop follows the list):

  • Encoder $\phi$ update via gradient ascent on $\mathcal J_{\mathrm{PSD},\phi}$ using minibatches of tuples $(L, s_t, s_{t+1}, s_{t+L})$.
  • Intrinsic reward $r_{\mathrm{PSD}}$ computation for all new transitions.
  • SAC update for the policy $\pi$ and two Q-network critics using $r_{\mathrm{PSD}}$.
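The loop below is a schematic rendering of this procedure in Python-like pseudocode; `env`, `policy`, `replay_buffer`, `sac`, `phi`, `encoder_opt`, and `intrinsic_reward` (a batched analogue of the reward helper above) are assumed names, and `psd_encoder_loss` is the sketch from the previous section:

```python
import random

for epoch in range(num_epochs):
    # 1) Collect episodes, sampling the period L uniformly for each episode.
    for _ in range(episodes_per_epoch):                     # 8 per epoch in the paper
        L = random.randint(L_min, L_max)
        s, done = env.reset(), False
        while not done:
            a = policy.sample(s, L)                         # roll out pi(a | s, L)
            s_next, done = env.step(a)
            replay_buffer.add(s, a, s_next, L)
            s = s_next

    for _ in range(gradient_steps):                         # 64 per epoch in the paper
        # 2) Encoder update by gradient ascent on J_PSD,phi (here: descent on its negative).
        L_b, s_t, s_t1, s_tL = replay_buffer.sample_period_tuples(batch_size=1024)
        loss = psd_encoder_loss(phi, s_t, s_t1, s_tL, L_b)
        encoder_opt.zero_grad(); loss.backward(); encoder_opt.step()

        # 3) Relabel sampled transitions with r_PSD and take a SAC step on pi and both critics.
        batch = replay_buffer.sample_transitions(batch_size=256)
        rewards = intrinsic_reward(phi(batch.s, batch.L), phi(batch.s_next, batch.L), batch.L)
        sac.update(policy, critics, batch, rewards)
```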

Architecture details for state-based tasks (an illustrative PyTorch sketch follows this list):

  • Encoder $\phi$: MLP with two hidden layers of 1024 ReLU units each, output dimension $d \in \{3, 6\}$.
  • Policy $\pi$: same MLP structure; the input is either the $\phi$ embedding or $s$ concatenated with a sinusoidal positional encoding of $L$ ($D = 8$).
  • Critics: two networks with the same structure as the policy.
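A minimal PyTorch sketch of the state-based encoder is shown below; the specific frequency schedule of the period encoding is an assumption for illustration, since only its dimension is specified above:

```python
import torch
import torch.nn as nn

def period_encoding(L, dim=8):
    """Sinusoidal positional encoding of the period L (D = 8 for state-based tasks).
    The frequency schedule used here is an illustrative assumption."""
    freqs = 2.0 ** torch.arange(dim // 2, dtype=torch.float32)          # [dim/2]
    angles = L.float().unsqueeze(-1) / freqs                            # [B, dim/2]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)    # [B, dim]

class PSDEncoder(nn.Module):
    """Illustrative state-based encoder phi_L(s): two hidden layers of 1024 ReLU units."""

    def __init__(self, state_dim, latent_dim=3, pe_dim=8):
        super().__init__()
        self.pe_dim = pe_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim + pe_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, latent_dim),
        )

    def forward(self, s, L):
        # Condition on the period via its positional encoding, concatenated to the state.
        return self.net(torch.cat([s, period_encoding(L, self.pe_dim)], dim=-1))
```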

For pixel-based tasks:

  • Observation: stack of three RGB frames at $90\times90$, randomly cropped to $84\times84$.
  • Encoder $\phi$: four-layer CNN (32–32–64–64 filters, $3\times3$ kernels, stride 2), followed by a two-layer MLP.
  • Losses and intrinsic reward computation are as in the state-based variant.

5. Empirical Results and Skill Diversity

PSD produces skills exhibiting frequency diversity well beyond prior frameworks. Fast Fourier Transform (FFT) analyses indicate that PSD’s latent trajectories cover a broad frequency spectrum, in contrast to DIAYN (static/low-frequency) and METRA/CSD (narrowband/high-frequency).

On downstream RL tasks (hurdle and friction variants for HalfCheetah and Walker2D), PSD policies outperform competing methods in average episodic return (mean ± std across 10 seeds):

| Environment | DIAYN | DADS | CSD | METRA | PSD |
|---|---|---|---|---|---|
| HalfCheetah-hurdle | 0.6 ± 0.5 | 0.9 ± 0.3 | 0.8 ± 0.6 | 1.9 ± 0.8 | 3.8 ± 2.0 |
| Walker2D-hurdle | 2.6 ± 0.5 | 1.9 ± 0.3 | 4.1 ± 1.3 | 3.1 ± 0.5 | 5.4 ± 1.4 |
| HalfCheetah-friction | 13.2 ± 3.4 | 12.4 ± 2.9 | 12.5 ± 3.8 | 30.1 ± 13.1 | 43.4 ± 19.1 |
| Walker2D-friction | 4.6 ± 1.2 | 1.6 ± 0.1 | 5.3 ± 0.3 | 5.2 ± 1.6 | 8.7 ± 1.7 |

PSD skills enable coordinated, multi-timescale behavior as evidenced by smoother transitions and higher returns in the presence of obstacles or varying friction.

Empirical ablations indicate that adaptive sampling of $L_{\min}, L_{\max}$ expands feasible skill periods, with empirical step-size and period histograms matching theoretical values to within 5% relative error (see Figure 1 and Figure 2 in (Park et al., 5 Nov 2025)).

6. Integration with METRA and Multi-Dimensional Skills

METRA optimizes movement direction via a latent skill vector $z$, with the encoder constrained to unit step size in latent space. PSD and METRA can be integrated by cross-conditioning each encoder on the other's skill variable, forming joint encoders $\phi_L(s, z)$ and $\phi_m(s, L)$ and training with the summed reward $r_{\mathrm{PSD}} + r_{\mathrm{METRA}}$ (a minimal sketch follows). This produces a skill space in which $z$ controls movement direction and $L$ controls the period, both learned in a fully unsupervised fashion. FFT and t-SNE visualizations in (Park et al., 5 Nov 2025) demonstrate extensive coverage of both skill dimensions.
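A compact sketch of this combined reward is given below, assuming cross-conditioned encoders `phi_psd` and `phi_metra` and a METRA-style directional reward $(\phi(s_{t+1})-\phi(s_t))^\top z$; the function name and argument layout are illustrative:

```python
import math
import torch

def combined_intrinsic_reward(phi_psd, phi_metra, s_t, s_t1, z, L, kappa=10.0):
    """Sketch of the summed PSD + METRA reward; encoder and argument names are assumptions."""
    # PSD term: closeness of the latent step length to the ideal chord for period 2L.
    ell_L = L * torch.sin(math.pi / (2 * L))
    delta = torch.norm(phi_psd(s_t1, z, L) - phi_psd(s_t, z, L), dim=-1) - ell_L
    r_psd = torch.exp(-kappa * delta ** 2)
    # METRA-style term: latent displacement projected onto the skill direction z.
    r_metra = ((phi_metra(s_t1, z, L) - phi_metra(s_t, z, L)) * z).sum(dim=-1)
    return r_psd + r_metra
```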

7. Implementation and Reproducibility

PSD experiments run on a single NVIDIA A6000 GPU, with roughly 24 hours of training required per task. Exact hyperparameters are listed below (a consolidated configuration sketch follows the list):

  • Encoder/policy/critic learning rate: $10^{-4}$
  • Discount factor: $\gamma = 0.99$
  • SAC target smoothing: 0.995
  • Automated entropy tuning
  • Replay buffer: $5\times10^5$ transitions (state), $3\times10^4$ (pixel)
  • Minibatch size: 1024 (encoder), 256 (policy/critics); 512/256 in the pixel variant
  • Circular latent dimension $d$: $\{3, 6\}$
  • Positional-encoding dimension $D$: 8 (state), 128 (pixel)
  • Loss slack $\epsilon = 10^{-5}$, centering weight $k = 0.5$
  • Lagrange multipliers $\lambda_1 = \lambda_2 = 5$ (Ant, HalfCheetah) or 10 (other environments)
  • Intrinsic reward coefficient $\kappa = 10$
  • Episodes per epoch: 8; gradient steps: 64
  • Policy hidden layers: 2 × 1024 units
  • Downstream PPO (high level): actor LR $3\times10^{-4}$, critic LR $10^{-3}$, clip $\epsilon = 0.2$, batch size 256, 80 updates per epoch
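For convenience, the state-based settings above can be gathered into a single configuration dictionary; the key names are illustrative and not taken from the released code:

```python
# Illustrative consolidation of the state-based hyperparameters listed above;
# key names are not from the paper's codebase.
PSD_STATE_CONFIG = {
    "learning_rate": 1e-4,            # encoder, policy, and critic
    "gamma": 0.99,                    # discount factor
    "target_smoothing": 0.995,        # SAC target-network coefficient
    "auto_entropy_tuning": True,
    "replay_buffer_size": 500_000,    # 5e5 transitions (state-based)
    "encoder_batch_size": 1024,
    "sac_batch_size": 256,
    "latent_dim": 3,                  # circular latent dimension d (3 or 6)
    "period_encoding_dim": 8,         # positional-encoding dimension D
    "eps_slack": 1e-5,
    "center_weight_k": 0.5,
    "lambda1": 5.0,                   # 5 for Ant / HalfCheetah, 10 otherwise
    "lambda2": 5.0,
    "kappa": 10.0,                    # intrinsic reward coefficient
    "episodes_per_epoch": 8,
    "gradient_steps_per_epoch": 64,
}
```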

All environment wrappers, code, and random seeds are scheduled for public release at https://jonghaepark.github.io/psd/.

8. Significance and Implications

PSD demonstrates that geometric constraints are sufficient to induce temporal regularity and diversity in unsupervised skill discovery, outperforming prior methods on tasks that intrinsically require periodic coordination. The framework decouples period assignment from task supervision, and is compatible with both low- and high-dimensional observations. A plausible implication is that circular latent structure can serve as a domain-agnostic inductive bias in RL, facilitating unsupervised curriculum learning, modular skill libraries, and downstream transfer in robotics. The approach generalizes beyond locomotion and can be composed with other skill discovery objectives for multidimensional skill synthesis.

9. Common Misconceptions and Limitations

PSD does not require hand-crafted rewards or manual periodic priors, nor does it rely on domain-specific knowledge; periodicity is induced solely via the circular latent constraints. However, the framework presupposes that temporally repeatable structure is a meaningful objective, and tasks without temporal periodicity may not benefit from PSD's geometric bias. Furthermore, the period length $L$ is treated as a discrete variable, so precise control over period granularity may require fine parameter tuning or adaptive expansion. Scalability to very high-dimensional state or action spaces depends on encoder and policy capacity; results in (Park et al., 5 Nov 2025) indicate robustness up to $90\times90$ RGB pixel stacks.

Future directions include generalization to quasiperiodic or aperiodic skill discovery, hierarchical composition with non-periodic skills, and integration with skill segmentation approaches (cf. SKID RAW (Tanneberg et al., 2021)) for temporal abstraction.
