
Dynamic Contrastive Skill Learning

Updated 2 June 2026
  • Dynamic Contrastive Skill Learning (DCSL) is a framework that uses state-transition embeddings and contrastive learning to discover and represent skills in offline RL.
  • It employs a similarity function to cluster semantically related skill segments, ensuring flexible and coherent behavior representations.
  • DCSL dynamically adjusts skill durations through adaptive relabeling, improving performance in long-horizon, sparse reward, and noisy-data environments.

Dynamic Contrastive Skill Learning (DCSL) is a framework for skill discovery and representation in offline reinforcement learning (RL) that integrates state-transition-based skill embeddings, contrastive skill similarity, and adaptive skill-length adjustment. DCSL is designed to resolve limitations of prior skill learning methods—including failure to cluster semantically similar behaviors and rigidity in fixed skill segment lengths—by leveraging contrastive learning and dynamic segmentation. This approach enables flexible skill extraction from complex or noisy demonstrations and improves downstream RL performance on long-horizon, sparse-reward, and noisy-data tasks (Choi et al., 21 Apr 2025).

1. State-Transition Based Skill Representation

DCSL redefines skill primitives as latent vectors summarizing temporally coherent state transitions instead of fixed-length action blocks. Given an offline dataset $D = \{\tau_i\}_{i=1}^N$ of trajectories $\tau_i = \{(s_t, a_t)\}_{t=1}^T$ where $s_t \in \mathcal{S},\ a_t \in \mathcal{A}$, a skill is a segment starting at time $t$ with (potentially variable) length $H_t$, represented as $z \in \mathcal{Z}$ and capturing state transitions $(s_t \rightarrow s_{t+1}, \dots, s_{t+H_t-1} \rightarrow s_{t+H_t})$.

The embedding process selects four key states per candidate segment: the start $s_t$, two random intermediates $s_{t+a}$ and $s_{t+b}$, and the end $s_{t+H_t}$. An LSTM-based encoder maps this four-state sequence to a skill embedding $z$. This summarization is regularized through a combination of behavior-cloning and prior-matching objectives. The objective involves a skill-conditioned action decoder, a prior over skill embeddings, and a learned skill prior conditioned on the start state.
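The key-state selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `select_key_states`, the strict ordering $0 < a < b < H_t$ of the two intermediate offsets, and the toy trajectory are all assumptions made for the example.

```python
import random

def select_key_states(states, t, length, rng=random):
    """Return [s_t, s_{t+a}, s_{t+b}, s_{t+length}] with 0 < a < b < length.

    The ordering of the two offsets is an assumption of this sketch.
    """
    if length < 3:
        raise ValueError("need length >= 3 to fit two distinct interior offsets")
    if t + length >= len(states):
        raise ValueError("segment runs past the end of the trajectory")
    # Draw two distinct interior offsets and put them in temporal order.
    a, b = sorted(rng.sample(range(1, length), 2))
    return [states[t], states[t + a], states[t + b], states[t + length]]

# Toy trajectory whose "states" are simply their own time indices.
trajectory = list(range(20))
print(select_key_states(trajectory, t=2, length=10, rng=random.Random(0)))
```

The four returned states would then be fed, in order, to whatever sequence encoder produces the skill embedding.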

2. Contrastive Skill Similarity Learning

DCSL introduces an explicit contrastive similarity mechanism to cluster semantically similar skill segments. The similarity function is built from two multi-layer perceptrons that map into a shared feature space. Its inputs are the segment start $s_t$, the skill embedding $z$, and a potential successor state.

For each segment, positive pairs combine the segment's start state and skill embedding with a state at a large offset later in the same segment. Negative states are sampled from other skill segments. The contrastive loss is a binary classification loss that passes the similarity scores through the logistic sigmoid. This encourages high similarity for a skill's own successor states and low similarity for states from other segments.
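A binary contrastive loss of this kind can be sketched in a few lines. The sketch assumes the similarity is an inner product between two feature vectors (one for the start state and skill embedding, one for the candidate state); that choice, the averaging over terms, and every name below are assumptions for illustration rather than the paper's exact formulation.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def contrastive_bce(anchor, positive, negatives):
    """Binary contrastive loss for one anchor feature vector.

    anchor    : features of (start state, skill embedding)   [assumed form]
    positive  : features of a later state from the same segment
    negatives : features of states drawn from other segments
    """
    # Positive pair: push sigmoid(similarity) toward 1.
    loss = -log_sigmoid(dot(anchor, positive))
    # Negative pairs: log(1 - sigmoid(x)) equals log_sigmoid(-x).
    loss -= sum(log_sigmoid(-dot(anchor, n)) for n in negatives)
    return loss / (1 + len(negatives))

anchor = [1.0, 0.5]
aligned = contrastive_bce(anchor, [2.0, 1.0], [[-2.0, -1.0]])
swapped = contrastive_bce(anchor, [-2.0, -1.0], [[2.0, 1.0]])
print(aligned < swapped)
```

The loss is small when the positive feature aligns with the anchor and the negatives point away from it, and large when the roles are swapped.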

3. Dynamic Skill Length Adjustment

Skill length is determined dynamically from the contrastive similarity function. For a candidate start state $s_t$ and its skill embedding $z$, the procedure steps forward along the trajectory, testing the similarity of each successive state against a chosen threshold until the test is violated. The resulting length is the number of steps taken before the violation, clamped to a bounded interval. This relabeling procedure is applied to the dataset periodically during training. The final skill boundaries adaptively reflect the duration over which the skill embedding remains semantically coherent, as judged by the learned similarity.
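The relabeling rule can be sketched as a forward scan with clamping. This sketch assumes that a state counts as coherent when its similarity is at or above the threshold; the names `relabel_length`, `h_min`, `h_max`, and the toy similarity function are hypothetical and chosen only for the example.

```python
def relabel_length(similarity, states, t, z, threshold, h_min, h_max):
    """Count consecutive steps after t whose similarity stays >= threshold,
    then clamp the count to [h_min, h_max].

    The caller is expected to ensure that t + h_min fits in the trajectory.
    """
    length = 0
    last = min(len(states) - 1, t + h_max)
    for k in range(t + 1, last + 1):
        if similarity(states[t], z, states[k]) < threshold:
            break  # first violation ends the skill
        length += 1
    return max(h_min, min(h_max, length))

def toy_similarity(s, z, s_next):
    # Toy stand-in for the learned function: similarity decays linearly
    # with distance from the start state; z plays the role of a "reach".
    return 1.0 - abs(s_next - s) / z

toy_states = list(range(10))
print(relabel_length(toy_similarity, toy_states, t=0, z=8.0,
                     threshold=0.5, h_min=2, h_max=6))
```

Raising the threshold shortens the discovered skill until the lower bound takes over, and lowering it lengthens the skill until the upper bound does.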

4. Model Training Objective and Algorithm

The overall objective is a weighted sum of the embedding loss, the contrastive loss, and a terminal-state predictor loss (the target loss). The target loss encourages the terminal state of a skill, predicted using the embedding, to align with the observed trajectory outcome via learned encoders and decoders.

Training proceeds by iteratively sampling minibatches, computing all losses, updating all network parameters via Adam (batch size 256), and periodically running the skill-length relabeling procedure. Key hyperparameters include the learning rate, the initial skill length, the skill embedding dimension, the skill-length bounds, and the loss weights.
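The training schedule described above can be sketched as a generic loop. Everything here is a placeholder: the callables, the loss names, and the weights are hypothetical stand-ins, and no real optimizer or network is involved. The sketch shows only the control flow of a weighted loss at each step and relabeling at a fixed interval.

```python
def train(num_steps, relabel_every, sample_batch, losses, weights, update, relabel):
    """Run a weighted-loss update each step and relabel every `relabel_every` steps."""
    history = []
    for step in range(1, num_steps + 1):
        batch = sample_batch()
        terms = losses(batch)  # dict: loss name -> scalar value
        total = sum(weights[name] * value for name, value in terms.items())
        update(total)          # stand-in for an optimizer step (e.g. Adam)
        history.append(total)
        if step % relabel_every == 0:
            relabel()          # stand-in for skill-length relabeling
    return history

# Stub components so the loop can run end to end.
relabel_calls = []
history = train(
    num_steps=10,
    relabel_every=4,
    sample_batch=lambda: None,
    losses=lambda batch: {"embed": 1.0, "contrastive": 2.0, "target": 0.5},
    weights={"embed": 1.0, "contrastive": 0.5, "target": 2.0},
    update=lambda total: None,
    relabel=lambda: relabel_calls.append(1),
)
print(len(history), len(relabel_calls))
```

In a real implementation the stubs would be replaced by minibatch sampling from the offline dataset, the three loss computations, and a gradient step.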

5. Empirical Evaluation and Comparison

DCSL is evaluated across benchmark tasks:

  • AntMaze (D4RL medium-diverse, large-diverse): Long-horizon navigation with sparse rewards.
  • Kitchen (D4RL mixed-v0): A complex manipulation task with multiple subtasks.
  • Meta-World Pick-and-Place: Three settings—expert (ME), medium-replay (MR), full replay (RP, with noise).

Baselines include Behavioral Cloning (BC), Conservative Q-Learning (CQL), CQL+Off-DADS, CQL+OPAL, SPiRL, and SkiMo variants (SkiMo-SAC, SkiMo-CEM).

Downstream Task Performance (Success Rate)

| Environment | BC | CQL | CQL+Off-DADS | CQL+OPAL | Ours-SAC |
|---|---|---|---|---|---|
| AntMaze-M | 0.0 | 53.7±6.1 | 59.6±2.9 | 81.1±3.1 | 68.0±36.9 |
| AntMaze-L | 0.0 | 14.9±3.2 | – | 70.3±2.9 | 73.7±5.9 |
| Kitchen | 47.5 | 52.4±2.5 | – | 69.3±2.7 | 94.7±1.5 |

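DCSL (Ours-SAC) attains the highest success rate on AntMaze-L (73.7 vs. 70.3 for CQL+OPAL) and on Kitchen, where the margin over the best baseline is largest (94.7 vs. 69.3 for CQL+OPAL). On AntMaze-M it trails CQL+OPAL (68.0 vs. 81.1) and shows high variance (±36.9).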

Sample Efficiency (Timesteps to Success)

| Environment | SPiRL | SkiMo-CEM | SkiMo-SAC | Ours-CEM | Ours-SAC |
|---|---|---|---|---|---|
| AntMaze-M | 988.5±19.8 | 311.2±95.7 | 833.7±288 | 1000±0 | 453.6±144 |
| AntMaze-L | 990.2±19.5 | 993.5±13.9 | 881.5±165 | 1000±0 | 672.2±72.9 |
| Kitchen | 276.6±5.9 | 205.8±29.0 | 251.3±23.7 | 262.0±20.1 | 165.1±4.4 |
| PP (ME) | 87.8±65.0 | 54.1±21.3 | 58.0±6.8 | 76.0±15.8 | 80.1±13.7 |
| PP (MR) | 138.0±63.2 | 184.8±24.0 | 87.3±57.6 | 62.9±5.8 | 56.1±5.1 |
| PP (RP) | 130.6±69.4 | 193.2±13.5 | 200.0±0.0 | 85.1±22.3 | 64.4±16.3 |

Ablations indicate that removing either the contrastive similarity loss or dynamic relabeling degrades robustness, particularly on noisy data.

Skill-space visualizations in AntMaze reveal more diversified and semantically meaningful skill clusters under DCSL compared to prior fixed-length skill VAEs (SPiRL, SkiMo), which tend to collapse into a small repertoire of repetitive patterns. Skill-length distributions inferred by DCSL reflect task structure, with variable-length skills adapting to the environment.

6. Distinctive Methodological Contributions

DCSL advances skill discovery and representation by:

  • Skill Embedding via State Transitions: Encoding skills not as raw action blocks but as latent representations abstracting multi-step state transitions, thus centering the semantic context of behavior.
  • Contrastive Similarity-Based Clustering: Employing a learned similarity function to cluster and differentiate skill segments semantically, informed by contrastive penalties.
  • Dynamic Skill-Length Relabeling: Regularly re-evaluating segment boundaries based on the learned similarity, yielding context-sensitive skill durations that better align with underlying behavioral motifs.

These innovations enable extraction of more flexible, generalizable, and data-driven skill libraries suitable for hierarchical RL settings and complex offline datasets.

7. Context, Limitations, and Implications

DCSL directly addresses limitations in existing fixed-length or VAE-based skill learning by introducing similarity-aware clustering and adaptive temporal abstraction. Results demonstrate improved flexibility on tasks with long horizons, sparse rewards, and imitation from diverse or noisy demonstrations.

A plausible implication is that DCSL's adaptive mechanism could further benefit multi-task RL or transfer settings where skill distributions and durations are highly variable. However, the approach depends on a well-calibrated similarity function and a well-chosen threshold. A large mismatch between the contrastive supervision and the actual task semantics could degrade the coherence of the discovered skills.

Further extensions could include end-to-end integration with downstream RL, meta-learning of similarity functions, or explicit incorporation of extrinsic task structure. The empirical and methodological contributions of DCSL position it as an advance in unsupervised skill discovery and trajectory abstraction (Choi et al., 21 Apr 2025).
