Dynamic Contrastive Skill Learning

Updated 2 June 2026

Dynamic Contrastive Skill Learning (DCSL) is a framework that uses state-transition embeddings and contrastive learning to discover and represent skills in offline RL.
It employs a similarity function to cluster semantically related skill segments, ensuring flexible and coherent behavior representations.
DCSL dynamically adjusts skill durations through adaptive relabeling, improving performance in long-horizon, sparse-reward, and noisy-data environments.

Dynamic Contrastive Skill Learning (DCSL) is a framework for skill discovery and representation in offline reinforcement learning (RL) that integrates state-transition-based skill embeddings, contrastive skill similarity, and adaptive skill-length adjustment. DCSL is designed to resolve limitations of prior skill learning methods—including failure to cluster semantically similar behaviors and rigidity in fixed skill segment lengths—by leveraging contrastive learning and dynamic segmentation. This approach enables flexible skill extraction from complex or noisy demonstrations and improves downstream RL performance on long-horizon, sparse-reward, and noisy-data tasks (Choi et al., 21 Apr 2025).

1. State-Transition Based Skill Representation

DCSL redefines skill primitives as latent vectors summarizing temporally coherent state transitions instead of fixed-length action blocks. Given an offline dataset $D = \{\tau_i\}_{i=1}^N$ of trajectories $\tau_i = \{(s_t, a_t)\}_{t=1}^T$ where $s_t \in \mathcal{S},\ a_t \in \mathcal{A}$, a skill is a segment starting at time $t$ with (potentially variable) length $H_t$, represented as $z \in \mathcal{Z}$ and capturing state transitions $(s_t \rightarrow s_{t+1}, \dots, s_{t+H_t-1} \rightarrow s_{t+H_t})$.

The embedding process selects four key states per candidate segment: the start $s_t$, two random intermediates $s_{t+a}$ and $s_{t+b}$, and the end $s_{t+H_t}$, which together form the sequence $(s_t, s_{t+a}, s_{t+b}, s_{t+H_t})$. An LSTM-based encoder $q$ maps this sequence to a skill embedding $z$ (where $0 < a < b < H_t$). This summarization is regularized through a combination of behavior cloning and prior-matching objectives:

$\mathcal{L}_{\text{emb}} = \mathbb{E}\Big[ -\sum_{k=0}^{H_t-1} \log \pi(a_{t+k} \mid s_{t+k}, z) + \beta\, D_{\mathrm{KL}}\big(q(z \mid s_t, s_{t+a}, s_{t+b}, s_{t+H_t}) \,\|\, p(z)\big) + D_{\mathrm{KL}}\big(q(z \mid s_t, s_{t+a}, s_{t+b}, s_{t+H_t}) \,\|\, p_a(z \mid s_t)\big) \Big]$

where $\pi(a \mid s, z)$ is the skill-conditioned action decoder, $p(z)$ is a prior, and $p_a(z \mid s_t)$ is a learned skill-prior conditioned on the start state.
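The key-state selection step can be illustrated with a minimal sketch. The function name `select_key_states` and its arguments are hypothetical, and uniform sampling of the two intermediate offsets is an assumption; the source only states that the intermediates are random.

```python
import random


def select_key_states(states, t, length, rng=None):
    """Return (start, mid_a, mid_b, end) for the segment states[t .. t+length].

    Assumes the two intermediates are drawn uniformly without replacement
    from the strict interior of the segment (an illustrative choice).
    """
    if rng is None:
        rng = random.Random()
    if length < 3:
        raise ValueError("need length >= 3 to fit two distinct intermediates")
    if t + length >= len(states):
        raise ValueError("segment runs past the end of the trajectory")
    # Two distinct offsets strictly between the start (0) and the end (length).
    a, b = sorted(rng.sample(range(1, length), 2))
    return states[t], states[t + a], states[t + b], states[t + length]


# Toy trajectory whose "states" are just their own time indices.
trajectory = list(range(20))
keys = select_key_states(trajectory, t=2, length=10, rng=random.Random(0))
```

In a full model, the four returned states would be fed in order to the sequence encoder to produce the skill embedding.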

2. Contrastive Skill Similarity Learning

DCSL introduces an explicit contrastive similarity mechanism to cluster semantically similar skill segments. The similarity function is formulated as

$f(s_t, z, s') = \phi(s_t, z)^\top \psi(s')$

where $\phi$ and $\psi$ are multi-layer perceptrons mapping to a shared $d$-dimensional feature space, with $s_t$ as the segment start, $z$ the skill embedding, and $s'$ a potential successor state.

For each segment, positive pairs $((s_t, z), s_{t+k})$ are constructed where $k$ is a large offset within $[1, H_t]$. Negative states $s^-$ are sampled from other skill segments ($z^- \neq z$). The contrastive (binary) loss is

$\mathcal{L}_{\text{con}} = -\mathbb{E}\big[ \log \sigma\big(f(s_t, z, s_{t+k})\big) + \log\big(1 - \sigma\big(f(s_t, z, s^-)\big)\big) \big]$

where $\sigma$ is the logistic sigmoid. This encourages high similarity for a skill’s own successor states and low similarity for states from other segments.
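As a rough illustration of a binary contrastive objective of this kind, the sketch below scores precomputed feature vectors with an inner product and applies a sigmoid cross-entropy. The inner-product form, the averaging over negatives, and all function names are assumptions made for the example; the MLP feature extractors are taken as given and not implemented here.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def similarity(anchor_feat, next_feat):
    """Inner-product score between (start state, skill) features and successor features."""
    return dot(anchor_feat, next_feat)


def binary_contrastive_loss(anchor_feat, pos_feat, neg_feats):
    """-log sigma(f+) minus the mean over negatives of log(1 - sigma(f-))."""
    pos_term = -log_sigmoid(similarity(anchor_feat, pos_feat))
    # log(1 - sigma(x)) equals log(sigma(-x)).
    neg_term = -sum(log_sigmoid(-similarity(anchor_feat, n)) for n in neg_feats) / len(neg_feats)
    return pos_term + neg_term


anchor = [1.0, 0.0]
aligned = binary_contrastive_loss(anchor, [2.0, 0.0], [[-2.0, 0.0]])
misaligned = binary_contrastive_loss(anchor, [-2.0, 0.0], [[2.0, 0.0]])
```

Here `aligned` is smaller than `misaligned`, since the loss rewards a high score for the segment's own successor state and a low score for a state from another segment.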

3. Dynamic Skill Length Adjustment

Skill length is dynamically determined based on the contrastive similarity function. For a candidate start state $s_t$ and its skill embedding $z$, the procedure increments the offset $k$ forward along the trajectory, testing $f(s_t, z, s_{t+k}) \geq \epsilon$ for a chosen threshold $\epsilon$ until violation, where $f$ denotes the learned similarity function. The resulting length is

$H_t = \max\{\, k : f(s_t, z, s_{t+j}) \geq \epsilon \ \text{for all}\ 1 \leq j \leq k \,\}$

with $H_t$ clamped to the interval $[H_{\min}, H_{\max}]$. This relabeling procedure is applied to the dataset periodically, once every fixed number of training steps. The final skill boundaries adaptively reflect the duration over which the skill embedding remains semantically coherent, as judged by the learned similarity.
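The forward scan with clamping can be sketched as follows. The helper `relabel_length`, the callable `sim_fn(start, index)`, and the handling of segments near the end of a trajectory are illustrative assumptions, not details taken from the source.

```python
def relabel_length(sim_fn, start, traj_len, threshold, h_min, h_max):
    """Scan forward from `start` while similarity stays at or above `threshold`.

    sim_fn(start, index) is a hypothetical stand-in for the learned similarity
    between the (start state, skill) pair and the state at `index`.
    The result is clamped to [h_min, h_max], and additionally truncated so the
    segment never runs past the last state of the trajectory (an assumption).
    """
    remaining = traj_len - 1 - start
    limit = min(h_max, remaining)
    k = 0
    while k < limit and sim_fn(start, start + k + 1) >= threshold:
        k += 1
    return min(max(k, h_min), limit)


def toy_similarity(start, index):
    # Toy score that decays linearly with temporal distance.
    return 1.0 - 0.1 * (index - start)


length = relabel_length(toy_similarity, start=0, traj_len=50,
                        threshold=0.45, h_min=2, h_max=10)
```

With the toy score above, the scan stops at the first offset whose similarity drops below the threshold, so the relabeled length tracks how long the skill stays coherent.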

4. Model Training Objective and Algorithm

The overall objective is a weighted sum of the embedding loss, the contrastive loss, and a terminal-state predictor loss:

$\mathcal{L} = \lambda_{\text{emb}}\, \mathcal{L}_{\text{emb}} + \lambda_{\text{con}}\, \mathcal{L}_{\text{con}} + \lambda_{\text{target}}\, \mathcal{L}_{\text{target}}$

where $\mathcal{L}_{\text{emb}}$, $\mathcal{L}_{\text{con}}$, and $\mathcal{L}_{\text{target}}$ are the embedding, contrastive, and target losses, and the $\lambda$ coefficients are their weights. The target loss $\mathcal{L}_{\text{target}}$ encourages the terminal state of a skill, predicted using the embedding, to align with the observed trajectory outcome via learned encoders and decoders.

Training proceeds by iteratively sampling minibatches, computing all losses, updating all network parameters via Adam (batch size 256), and periodically running the skill length relabeling procedure. Key hyperparameters include the Adam learning rate, the initial skill length, the skill embedding dimension, the skill-length bounds, and the loss weights.
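The overall schedule (weighted loss, parameter update, periodic relabeling) might be organized roughly as below. All names, the callback structure, and the numeric values are hypothetical placeholders; a real implementation would compute the losses with neural networks and take Adam steps.

```python
def total_loss(losses, weights):
    """Weighted sum over named scalar loss terms."""
    return sum(weights[name] * value for name, value in losses.items())


def train(num_steps, relabel_every, compute_losses, apply_update, relabel, weights):
    """Generic loop: one update per step, relabeling every `relabel_every` steps."""
    history = []
    for step in range(1, num_steps + 1):
        losses = compute_losses(step)      # dict of scalar losses for one minibatch
        loss = total_loss(losses, weights)
        apply_update(loss)                 # stands in for one optimizer step
        history.append(loss)
        if step % relabel_every == 0:
            relabel()                      # re-run skill-length relabeling on the dataset
    return history


# Toy callbacks that only count how often they are invoked.
calls = {"update": 0, "relabel": 0}


def fake_losses(step):
    return {"emb": 1.0, "con": 2.0, "target": 4.0}


def fake_update(loss):
    calls["update"] += 1


def fake_relabel():
    calls["relabel"] += 1


history = train(10, 5, fake_losses, fake_update, fake_relabel,
                {"emb": 1.0, "con": 0.5, "target": 0.25})
```

Passing the loss computation, update, and relabeling as callbacks keeps the schedule itself independent of any particular network architecture or optimizer.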

5. Empirical Evaluation and Comparison

DCSL is evaluated across benchmark tasks:

AntMaze (D4RL medium-diverse, large-diverse): Long-horizon navigation with sparse rewards.
Kitchen (D4RL mixed-v0): A complex manipulation task with multiple subtasks.
Meta-World Pick-and-Place: Three settings—expert (ME), medium-replay (MR), full replay (RP, with noise).

Baselines include Behavioral Cloning (BC), Conservative Q-Learning (CQL), CQL+Off-DADS, CQL+OPAL, SPiRL, and SkiMo variants (SkiMo-SAC, SkiMo-CEM).

Downstream Task Performance (Success Rate)

Environment	BC	CQL	CQL+Off-DADS	CQL+OPAL	DCSL-SAC
AntMaze-M	0.0	53.7±6.1	59.6±2.9	81.1±3.1	68.0±36.9
AntMaze-L	0.0	14.9±3.2	–	70.3±2.9	73.7±5.9
Kitchen	47.5	52.4±2.5	–	69.3±2.7	94.7±1.5

DCSL provides comparable or superior task completion, particularly in the Kitchen task where it significantly outperforms all baselines.

Sample Efficiency (Timesteps to Success)

Environment	SPiRL	SkiMo-CEM	SkiMo-SAC	DCSL-CEM	DCSL-SAC
AntMaze-M	988.5±19.8	311.2±95.7	833.7±288	1000±0	453.6±144
AntMaze-L	990.2±19.5	993.5±13.9	881.5±165	1000±0	672.2±72.9
Kitchen	276.6±5.9	205.8±29.0	251.3±23.7	262.0±20.1	165.1±4.4
PP (ME)	87.8±65.0	54.1±21.3	58.0±6.8	76.0±15.8	80.1±13.7
PP (MR)	138.0±63.2	184.8±24.0	87.3±57.6	62.9±5.8	56.1±5.1
PP (RP)	130.6±69.4	193.2±13.5	200.0±0.0	85.1±22.3	64.4±16.3

Ablations indicate that removing either the contrastive similarity loss or dynamic relabeling degrades robustness, particularly on noisy data.

Skill-space visualizations in AntMaze reveal more diversified and semantically meaningful skill clusters under DCSL compared to prior fixed-length skill VAEs (SPiRL, SkiMo), which tend to collapse into a small repertoire of repetitive patterns. Skill-length distributions inferred by DCSL reflect task structure, with variable-length skills adapting to the environment.

6. Distinctive Methodological Contributions

DCSL advances skill discovery and representation by:

Skill Embedding via State Transitions: Encoding skills not as raw action blocks but as latent representations abstracting multi-step state transitions, thus centering the semantic context of behavior.
Contrastive Similarity-Based Clustering: Employing a learned similarity function $f(s_t, z, s')$ to cluster and differentiate skill segments semantically, informed by contrastive penalties.
Dynamic Skill-Length Relabeling: Regularly re-evaluating segment boundaries based on the learned similarity, yielding context-sensitive skill durations that better align with underlying behavioral motifs.

These innovations enable extraction of more flexible, generalizable, and data-driven skill libraries suitable for hierarchical RL settings and complex offline datasets.

7. Context, Limitations, and Implications

DCSL directly addresses limitations in existing fixed-length or VAE-based skill learning by introducing similarity-aware clustering and adaptive temporal abstraction. Results demonstrate improved flexibility on tasks with long horizons, sparse rewards, and imitation from diverse or noisy demonstrations.

A plausible implication is that DCSL’s adaptive mechanism could further benefit multi-task RL or transfer settings where skill distributions and durations are highly variable. However, the approach depends on well-calibrated similarity functions and thresholding. Excessive mismatch between contrastive supervision and actual task semantics could affect discovered skill coherence.

Further extensions could include end-to-end integration with downstream RL, meta-learning of similarity functions, or explicit incorporation of extrinsic task structure. The empirical and methodological contributions of DCSL position it as an advance in unsupervised skill discovery and trajectory abstraction (Choi et al., 21 Apr 2025).

References

Choi et al. (2025). Dynamic Contrastive Skill Learning with State-Transition Based Skill Clustering and Dynamic Length Adjustment.
