Adversarial Skill Compositional Training

Updated 3 July 2026

ASCoT is a family of frameworks that combines adversarial min–max optimization with skill compositionality to improve robustness in complex RL and LLM tasks.
It uses skill chaining in RL and adversarial prompt synthesis in LLMs to handle combinatorially many compositions of skills.
Reported results show state-of-the-art performance in robotic manipulation and LLM jailbreak defense.

Adversarial Skill Compositional Training (ASCoT) designates a class of frameworks and algorithms for achieving robust compositional generalization in both reinforcement learning (RL) and LLM domains. The central principle is to train policies or models not merely to master isolated skills or attack-defense instances, but to perform well (in the robustness or alignment sense) under adversarial—often combinatorially complex—compositions of skills or input primitives. ASCoT frameworks unify adversarially-motivated min–max training, skill/task compositionality, and scalable implementation mechanics grounded in recent theoretical and empirical advances (Lee et al., 2021, Dabas et al., 24 Oct 2025, Jothimurugan et al., 2023).

1. Core Theoretical Frameworks

Three primary instantiations of ASCoT have been developed in the literature, each addressing distinct challenges:

Adversarial Skill Chaining for RL: In long-horizon manipulation, the challenge is to sequence $K$ pre-trained or learned subskills $\pi^1,\dotsc,\pi^K$ such that the terminal-state distribution $\beta_{i-1}$ of skill $i-1$ lies within the initiation set $\mathcal{I}_i$ of skill $i$ for all $i=2,\dotsc,K$. Naïve methods expand initiation sets, which compounds the difficulty and makes the set of reachable states grow rapidly. ASCoT (notably the T-STAR algorithm) regularizes the terminal states to ensure $\beta_{i} \approx \mathcal{I}_{i+1}$ via adversarial losses, enforcing a compositional interface between skills (Lee et al., 2021).
Robust Subtask Learning (Adversarial Subtask Games): Here, a collection of subtask policies $\Pi = \{\pi_\sigma : \sigma\in \Sigma\}$ is trained in a multi-task MDP against an adversary that selects the worst-case ordering of subtasks. The design objective is to maximize $\inf_{\tau} J(\Pi; \tau)$, where $\tau$ ranges over subtask sequences, which reduces the problem to a two-player zero-sum Markov game (Jothimurugan et al., 2023).
Compositional Adversarial Robustness in LLM Alignment: In the context of LLM jailbreak defense, most new adversarial attacks are compositional recombinations of a finite “dictionary” of skill primitives. ASCoT here refers to fine-tuning on adversarial examples synthesized by composing several detectable primitives, thereby ensuring robustness to unseen skill recombinations (Dabas et al., 24 Oct 2025).

2. Formal Objectives and Min–Max Training

ASCoT instantiates compositional training via explicit min–max or adversarial optimization objectives tailored to the application domain.

RL/Skill Chaining: The core reward for each skill $i$ combines the environment reward $R^{\text{env}}_i$, a GAIL-style imitation term for subtask mastery, and a terminal-state regularization bonus:

$$R_i(s_t, a_t, s_{t+1}) = \lambda_1 R^{\text{env}}_i(s_t, a_t, s_{t+1}) + \lambda_2 R^{\text{GAIL}}_i(s_t, a_t) + \lambda_3 R^{\text{TSR}}_i(s_{t+1})$$

The GAIL adversarial imitation loss is enforced by a discriminator $D^i_\phi$; terminal-state regularization uses a discriminator $D^i_\omega$ trained to distinguish successful starts from previous skill terminations, directly incentivizing terminal distributions to match initiation sets (Lee et al., 2021).
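
As a minimal sketch of how the three reward terms might be combined per transition, the snippet below uses hypothetical weight names (`w_env`, `w_gail`, `w_tsr`) and assumes both discriminators output a probability in (0, 1). The `-log(1 - D)` imitation reward and the use of the raw initiation-set discriminator output as the terminal bonus are illustrative assumptions, not details confirmed by the text above.

```python
import math


def gail_reward(d_gail: float, eps: float = 1e-8) -> float:
    """Imitation reward from a discriminator output in (0, 1).

    Uses -log(1 - D), one common GAIL reward form (an assumption here).
    """
    return -math.log(max(1.0 - d_gail, eps))


def combined_reward(r_env: float, d_gail: float, d_init_next: float,
                    is_terminal: bool, w_env: float = 1.0,
                    w_gail: float = 1.0, w_tsr: float = 1.0) -> float:
    """Weighted sum of environment, imitation, and terminal-state terms."""
    # The terminal-state bonus is paid only on the skill's final transition.
    r_tsr = d_init_next if is_terminal else 0.0
    return w_env * r_env + w_gail * gail_reward(d_gail) + w_tsr * r_tsr


mid_step = combined_reward(1.0, 0.5, 0.8, is_terminal=False)
last_step = combined_reward(1.0, 0.5, 0.8, is_terminal=True)
print(mid_step, last_step)
```

The two calls differ only in the terminal flag, so the gap between them is the terminal-state bonus: a skill is rewarded more when it ends in a state the next skill's discriminator recognizes as a good start.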

Worst-case Subtask Sequencing: The policies are optimized for their value under adversarially chosen sequences:

$$\Pi^{*} \in \arg\max_{\Pi} \, \inf_{\tau} J(\Pi; \tau)$$

Each transition between subtasks is chosen adversarially, and Bellman-style adversarial equations capture the worst-case reasoning (Jothimurugan et al., 2023).
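
To make the worst-case reasoning concrete, here is a small sketch that evaluates a fixed set of subtask policies against the worst ordering by backward induction. It assumes a simplified objective (the probability that every subtask in a sequence succeeds) and a hypothetical table `success[p][n]` giving the chance that subtask `n` succeeds when started where `p` ended; neither is taken from the cited work.

```python
def worst_case_value(success, horizon):
    """Worst-case completion probability over subtask sequences.

    success[p][n]: probability that subtask n succeeds when started from
    the termination of subtask p (hypothetical numbers for illustration).
    Returns a dict mapping the previously finished subtask to the minimum,
    over all sequences of `horizon` further subtasks, of the product of
    per-step success probabilities.
    """
    subtasks = list(success)
    value = {s: 1.0 for s in subtasks}  # nothing left to run
    for _ in range(horizon):
        # The adversary picks the next subtask that minimizes the value.
        value = {p: min(success[p][n] * value[n] for n in subtasks)
                 for p in subtasks}
    return value


success = {
    "a": {"a": 0.9, "b": 0.8},
    "b": {"a": 0.5, "b": 1.0},
}
values = worst_case_value(success, horizon=2)
print(values)
```

In this toy table the weak link is starting subtask `a` after `b`, and the adversary routes every two-step sequence through it.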

Skill-Space Adversarial Training (LLMs): Given a learned dictionary $\mathcal{D}$ of primitives, adversarial prompts are generated by composing random subsets of primitives with harmful instructions. The loss is the ordinary negative log-likelihood on these synthetic compositions, so the LLM must learn to refuse combinations that were not seen in training but are systematically covered through the combinatorial space of skills (Dabas et al., 24 Oct 2025):

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(x, y)}\left[\log p_\theta(y \mid x)\right]$$

where $x$ is a composed adversarial prompt and $y$ is its target response.
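
The data-synthesis step can be sketched as sampling a random subset of primitives and pairing the composed prompt with a refusal target. The primitive names, the bracketed template, and the refusal string below are all hypothetical placeholders; the source describes composition through LLM prompt engineering rather than string templates.

```python
import random


def compose_adversarial_example(instruction, primitives, k, rng,
                                refusal="I can't help with that."):
    """Pair a skill-composed prompt with a refusal target (illustrative)."""
    chosen = rng.sample(primitives, k)  # k distinct primitives
    # Hypothetical template standing in for LLM-based composition.
    prompt = " ".join(f"[{name}]" for name in chosen) + " " + instruction
    return {"prompt": prompt, "target": refusal, "skills": chosen}


primitives = ["role_play", "encoding", "hypothetical_framing", "payload_split"]
example = compose_adversarial_example(
    "<harmful request placeholder>", primitives, k=2, rng=random.Random(0)
)
print(example["prompt"])
```

Each such pair would then contribute one term to the negative log-likelihood of the target given the prompt.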

3. Algorithms and Training Procedures

ASCoT implementations leverage adversarial pipeline design, RL, and dictionary learning.

T-STAR (RL, Long-Horizon Manipulation):
- Initial pre-training of skill policies via GAIL and RL in isolation.
- Iterative alternating training:
  - For subtask $i$, roll out from a mixture of environment initial states and prior-skill terminal states.
  - Update the GAIL discriminators and the initiation-set discriminators.
  - Apply an RL update to the policy that incorporates all reward terms.
- Terminal-state buffers maintain compactness of $\beta_i$ (Lee et al., 2021).
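
The rollout step in the list above, starting from a mixture of environment states and prior-skill terminal states, might look like the following sketch. The class name `TerminalStateBuffer`, the mixing probability `p_prev`, and the first-in-first-out eviction are assumptions made for illustration.

```python
import random
from collections import deque


class TerminalStateBuffer:
    """Fixed-capacity store of states where a skill terminated."""

    def __init__(self, capacity: int):
        self.states = deque(maxlen=capacity)  # oldest states are evicted first

    def add(self, state):
        self.states.append(state)


def sample_initial_state(env_reset, prev_buffer, p_prev, rng):
    """Start from a prior-skill terminal state with probability p_prev."""
    if prev_buffer.states and rng.random() < p_prev:
        return rng.choice(list(prev_buffer.states))
    return env_reset()  # fall back to the environment's own initial state


rng = random.Random(0)
buffer = TerminalStateBuffer(capacity=2)
for s in ("t0", "t1", "t2"):
    buffer.add(s)
start = sample_initial_state(lambda: "env0", buffer, p_prev=1.0, rng=rng)
print(start)
```

Training skill $i$ from states where skill $i-1$ actually ended exposes it to the start distribution it will face when chained.
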
Rosac/Arosac (Compositional RL):
- Data collection alternates skill execution.
- At each subtask termination, select next subtask adversarially (greedy/minimum-value or with MCTS).
- Q-function and policy networks are updated via Soft Actor-Critic; the asynchronous variant trains multiple subtasks in parallel with periodic cross-subtask value estimation.
- Pseudocode specifies buffer management, value backups, and adversarial sampling (Jothimurugan et al., 2023).
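
The greedy (minimum-value) adversary mentioned in the list above reduces to an argmin over value estimates at the termination state. In this sketch, `value_fn` is a hypothetical callable standing in for the learned value networks, and the subtask names are made up.

```python
def select_next_subtask(state, subtasks, value_fn):
    """Greedy adversary: choose the subtask with the lowest estimated value."""
    return min(subtasks, key=lambda sigma: value_fn(state, sigma))


# Hypothetical value estimates for three subtasks at one termination state.
estimates = {"open_door": 0.9, "climb_stairs": 0.4, "cross_room": 0.7}
choice = select_next_subtask(
    "s_term", list(estimates), lambda s, sigma: estimates[sigma]
)
print(choice)
```

An MCTS adversary would replace this one-step argmin with a lookahead search over several future subtask choices.
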
ASCoT (LLM Alignment, Jailbreak Robustness):
- Primitives extracted via LLM-aided annotation and dictionary learning with K-SVD and redundant skill pruning.
- Harmful initial queries are composed (at depth 1–5) with sampled skill primitives via LLM prompt engineering.
- Training batches interleave compositional adversarial, benign, and over-refusal calibration examples.
- Model fine-tuned with LoRA adapters on the merged dataset, using cross-entropy loss (Dabas et al., 24 Oct 2025).
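
A minimal sketch of the batch interleaving step follows. The mixing ratios and pool contents are hypothetical, since the text above does not state the proportions used, and sampling with replacement is a simplification.

```python
import random


def mixed_batch(pools, ratios, batch_size, rng):
    """Draw one batch that mixes several example pools at given ratios."""
    counts = [round(r * batch_size) for r in ratios]
    counts[0] += batch_size - sum(counts)  # absorb rounding drift in pool 0
    batch = []
    for pool, n in zip(pools, counts):
        batch.extend(rng.choices(pool, k=n))  # sample with replacement
    rng.shuffle(batch)
    return batch


pools = [
    [("adversarial", i) for i in range(100)],
    [("benign", i) for i in range(100)],
    [("calibration", i) for i in range(100)],
]
batch = mixed_batch(pools, ratios=(0.5, 0.3, 0.2), batch_size=10,
                    rng=random.Random(0))
print([source for source, _ in batch])
```

Keeping benign and over-refusal calibration examples in every batch is what lets the refusal behavior be learned without pushing the model toward refusing everything.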

4. Empirical Results and Comparative Performance

ASCoT variants demonstrate empirical state-of-the-art performance relative to baselines in their domains.

Task and Environment Benchmarks:
- RL/Robotics: IKEA furniture assembly, Rooms navigation, and F1/10 car racing benchmarks.
- LLM Defense: StrongReject suite (LLM jailbreaks), with direct requests, unseen attack transfer (AutoDAN-Turbo), and implicit reference settings.
Quantitative Results:

| Method | table_lack | chair_ingolf |
|---|---|---|
| BC | 0.03 ± 0.00 | 0.04 ± 0.01 |
| PPO | 0.09 ± 0.11 | 0.14 ± 0.03 |
| GAIL | 0.00 ± 0.00 | 0.00 ± 0.00 |
| GAIL + PPO | 0.21 ± 0.11 | 0.22 ± 0.08 |
| SPiRL | 0.05 ± 0.00 | 0.03 ± 0.00 |
| Policy Sequencing | 0.63 ± 0.28 | 0.77 ± 0.12 |
| T-STAR (ASCoT) | 0.90 ± 0.07 | 0.89 ± 0.04 |

Skill Chaining: T-STAR is reported as the first model-free RL method to solve these long-horizon assembly tasks, with success rates of at least 87% and 0.90 progress, outperforming prior skill-chaining and imitation learning methods (Lee et al., 2021).
Compositional RL: ASCoT (Rosac/Arosac) matches or exceeds alternative multi-agent and unsupervised adversarial curriculum baselines, showing greater robustness and data efficiency as measured by subtask completion under both random and MCTS adversaries (Jothimurugan et al., 2023).
LLM Robustness: ASCoT consistently lowers harmfulness on unseen adversarial attacks (harmfulness ≈ 0.10–0.12 vs. 0.14–0.43 for other methods) while maintaining low over-refusal rates (ORR ≈ 0.05), competitive with or surpassing leading closed-source and open-source models (Dabas et al., 24 Oct 2025).

5. Theoretical Properties and Insights

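Chaining Condition via Distributional Regularization: In the limit of discriminator capacity, the optimal initiation-set discriminator for skill $i$ is

$$D^{*}_{\omega}(s) = \frac{p_{\mathcal{I}_i}(s)}{p_{\mathcal{I}_i}(s) + p_{\beta_{i-1}}(s)}$$

where $p_{\mathcal{I}_i}$ and $p_{\beta_{i-1}}$ are the densities of successful start states for skill $i$ and of terminal states of skill $i-1$, and maximizing the terminal-state regularization encourages $\beta_{i-1} \approx \mathcal{I}_i$ (Jensen–Shannon divergence minimization), formalizing modular compositionality (Lee et al., 2021).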
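
The link to the Jensen–Shannon divergence can be spelled out with the standard GAN argument. The derivation below assumes the initiation-set discriminator is trained with a binary cross-entropy objective, which is an assumption about the loss form rather than a detail stated above; $p_{\mathcal{I}_i}$ and $p_{\beta_{i-1}}$ denote the densities of successful start states of skill $i$ and of terminal states of skill $i-1$.

```latex
\begin{align}
V(D) &= \mathbb{E}_{s \sim p_{\mathcal{I}_i}}\left[\log D(s)\right]
      + \mathbb{E}_{s \sim p_{\beta_{i-1}}}\left[\log\bigl(1 - D(s)\bigr)\right], \\
D^{*}(s) &= \arg\max_{D} V(D)
      = \frac{p_{\mathcal{I}_i}(s)}{p_{\mathcal{I}_i}(s) + p_{\beta_{i-1}}(s)}, \\
V(D^{*}) &= -\log 4
      + 2\,\mathrm{JS}\!\left(p_{\mathcal{I}_i} \,\middle\|\, p_{\beta_{i-1}}\right).
\end{align}
```

Because the Jensen–Shannon divergence is zero only when the two distributions coincide, a policy that shifts its terminal states so the optimal discriminator can no longer separate them from valid start states is driving $V(D^{*})$ toward its minimum of $-\log 4$.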

Skill-Space Coverage vs. Data Scale: In LLM settings, robustness is shown empirically to depend not only on the amount of adversarial data but also on the breadth of primitive skill combinations covered. Robustness improves monotonically as more skill primitives are included in training (the coverage dividend), with diminishing returns as the dictionary approaches completeness (Dabas et al., 24 Oct 2025).
Game-Theoretic Guarantees: The reduction of adversarial subtask sequencing to a two-player stagewise Markov game provides a foundation for convergence analysis—both for policy/value iteration in tabular settings and in continuous spaces via approximate RL (Jothimurugan et al., 2023).
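
For the skill-space coverage point, a quick count shows how fast the space of compositions grows with dictionary size. The sketch treats a composition as an unordered subset of 1 to 5 primitives, matching the composition depths used in training; ignoring order and the choice of base instruction is a simplifying assumption.

```python
from math import comb


def num_compositions(num_primitives: int, max_depth: int = 5) -> int:
    """Count unordered subsets of primitives of size 1..max_depth."""
    return sum(comb(num_primitives, k) for k in range(1, max_depth + 1))


for m in (10, 20, 40):
    print(m, num_compositions(m))
```

Doubling the dictionary from 10 to 20 primitives raises the count from 637 to 21699, which illustrates why enumerating attacks one by one scales poorly while covering the primitives themselves does not.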

6. Limitations and Prospective Extensions

Current Constraints:
- Fixed set of subtasks/skills or dictionary primitives is assumed; automatic discovery remains open.
- Adversarial training can be unstable, especially with high weighting of regularization terms (e.g., the terminal-state regularization weight $\lambda_3$ in T-STAR).
- Most instantiations require access to per-skill success indicators, expert demonstrations, or explicit harmful prompts.
- Simulated environments assume automatic resets and precise subgoal definitions.
Potential Extensions:
- Automated, unsupervised option or primitive discovery—e.g., via hierarchical or information-theoretic methods.
- Online or continual augmentation of skill dictionaries to address drift in adversarial tactics (for LLMs).
- Higher-order compositionality: joint training of high-level planners atop ASCoT-trained skills or primitives; non-linear skill interactions.
- Transferring methods to real-world/multi-modal domains, vision-based observation models, or interactive/multi-turn dialogue settings.
- Reducing dependence on demonstrations or labeled data via large-scale offline RL or unsupervised exploration (Dabas et al., 24 Oct 2025, Jothimurugan et al., 2023, Lee et al., 2021).

7. Relevance and Impact

ASCoT frameworks provide a scalable and general paradigm for robust skill composition across agentic and language domains. By framing robustness, adaptability, and alignment as problems of adversarial compositional generalization, these methods address the brittleness of prior imitation and task-decomposition approaches. The empirical and theoretical results summarized above indicate that adversarial skill compositionality matters for performance under distributional shift, novel task sequencing, and generalization to unseen adversarial attacks. ASCoT methods report state-of-the-art results in robotic manipulation, continuous-control RL, and LLM jailbreak defense, and continue to motivate research in compositionality, min–max optimization, and automated skill discovery (Lee et al., 2021, Dabas et al., 24 Oct 2025, Jothimurugan et al., 2023).

References

Lee et al. (2021). Adversarial Skill Chaining for Long-Horizon Robot Manipulation via Terminal State Regularization.

Jothimurugan et al. (2023). Robust Subtask Learning for Compositional Generalization.

Dabas et al. (2025). Adversarial Déjà Vu: Jailbreak Dictionary Learning for Stronger Generalization to Unseen Attacks.
