Compositional Skill Acquisition in LLMs
- Compositional skill acquisition is the process by which LLMs synthesize basic, isolated abilities into coherent strategies for complex, out-of-distribution tasks.
- Empirical evaluations reveal that while LLMs excel at isolated skills, they experience steep performance drops—often to 20–40% accuracy—when integrating these skills.
- Advanced approaches like modular prompting, data augmentation, and reinforcement learning are showing promise in enhancing systematic generalization in diverse applications.
Compositional skill acquisition in LLMs refers to the process by which models learn to combine individual skills, reasoning steps, or modular components—often learned in isolation—into novel, coherent strategies for solving more complex, out-of-distribution problems. Compositionality, as a core property of human cognition, enables infinite productivity from finite elements and is considered an essential prerequisite for systematic generalization and high-level reasoning in NLP, robotics, program synthesis, vision-language reasoning, and mathematics. Recent research has empirically, theoretically, and algorithmically evaluated the degree to which LLMs (and multimodal extensions) exhibit compositionality, provided methodologies for compositional dataset construction and prompting, analyzed gaps and deficiencies, and introduced frameworks for compositional skill training, inference, and evaluation.
1. Definitions and Conceptual Axes
Compositional skill acquisition refers to a model’s ability to synthesize previously acquired “atomic” skills (such as basic string transformations, arithmetic reasoning, visual grounding, or symbolic manipulations) into solutions for previously unseen compositions. Formally, if skills $f$ and $g$ are individually known to the model, the compositional task is to solve their composition $g \circ f$ (apply $f$, then $g$) even if $g \circ f$ itself was never trained on directly (Yuan et al., 29 Sep 2025); a minimal worked sketch follows the list below. Axes along which compositional acquisition is evaluated include:
- Systematic compositionality: The systematic reuse and recombination of known primitives or skills for novel configurations (Zhao et al., 5 May 2024).
- Skill composition: The chaining or functional composition of modular procedures or reasoning steps (e.g., combining geometric and combinatorial reasoning) (Sun et al., 23 Jun 2025).
- Compositional generalization: Performance on out-of-distribution tasks that require integrated application of multiple subskills, not just simple in-distribution pattern matching (Fu et al., 8 Oct 2024, Sun et al., 23 Jun 2025).
- Compositional meta-skills: Higher-order rules that govern how skills are composed beyond mere rote execution of observed patterns (Zhao et al., 29 Sep 2024).
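As a minimal, illustrative sketch of the $g \circ f$ formalization above (the skills, prompts, and split are hypothetical toy examples, not drawn from the cited benchmarks):

```python
# Two "atomic" string skills and their held-out composition.
# The model sees each skill in isolation during training; the composed
# task only appears at evaluation time (out-of-distribution).

def skill_reverse(s: str) -> str:        # atomic skill f
    return s[::-1]

def skill_uppercase(s: str) -> str:      # atomic skill g
    return s.upper()

def composed_target(s: str) -> str:      # g∘f, never shown during training
    return skill_uppercase(skill_reverse(s))

train_examples = [
    ("reverse: hello", skill_reverse("hello")),        # f alone
    ("uppercase: hello", skill_uppercase("hello")),    # g alone
]

eval_examples = [
    ("uppercase the reverse of: hello", composed_target("hello")),  # "OLLEH"
]
```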
2. Empirical Evaluation and Benchmarking
Empirical studies consistently find a pronounced gap between in-domain generalization (on tasks similar to training data) and true compositional generalization. Key methodologies and metrics include:
- Systematic synthetic tasks: Controlled frameworks such as SKILL-MIX (for language skills) and MathTrap (injecting logical traps into math problems) probe a model’s ability to integrate or transfer skills beyond memorized cases (Zhao et al., 29 Sep 2024, Zhao et al., 5 May 2024).
- Graph-based and visual reasoning: CGGC (Compositional Generalization Challenge for Graph-based Commonsense Reasoning) requires models to verbalize novel tuples of relations in a reasoning graph, measuring the ability to compose seen primitives into new relational configurations (Fu et al., 8 Oct 2024).
- Compositional translation and code synthesis: Multi-stage tasks that decompose complex inputs (e.g., sentences, program specs) into segments or subtasks, and then measure model accuracy and transfer in recomposing those pieces (Zebaze et al., 6 Mar 2025, Khan et al., 12 Mar 2025, Almorsi et al., 11 Jan 2025).
Results converge on the finding that standard LLMs—even when highly capable in base tasks—often exhibit significant performance degradation when confronting novel compositional cases:
- For complex reasoning that combines multiple newly-mixed skills, top-tier LLMs experience sharp performance drops (in some cases to 20–40% accuracy or below) despite near-perfect results for each skill in isolation (Xu et al., 22 Jul 2024, Fu et al., 8 Oct 2024).
- Skill composition performance depends critically on whether the component tasks act on non-overlapping supports in the embedding space (separable vs. sequential tasks) (Xu et al., 22 Jul 2024). A minimal sketch of how the resulting compositionality gap can be quantified follows this list.
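A minimal sketch of quantifying the gap between isolated-skill and composed-skill performance, assuming a generic `model(prompt) -> str` callable and exact-match scoring (the cited benchmarks define their own, more elaborate protocols):

```python
from typing import Callable, List, Tuple

def accuracy(model: Callable[[str], str], tasks: List[Tuple[str, str]]) -> float:
    """Exact-match accuracy of `model` over (prompt, target) pairs."""
    if not tasks:
        return 0.0
    return sum(model(p).strip() == t for p, t in tasks) / len(tasks)

def compositionality_gap(model: Callable[[str], str],
                         atomic_tasks: List[Tuple[str, str]],
                         composed_tasks: List[Tuple[str, str]]) -> float:
    """Drop from accuracy on isolated skills to accuracy on held-out
    compositions of those skills; larger values mean weaker composition."""
    return accuracy(model, atomic_tasks) - accuracy(model, composed_tasks)
```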
3. Algorithmic and Training Approaches
A diverse array of strategies has been tested to improve compositional skill acquisition, including:
- Prompting strategies: Skills-in-Context (SKiC) prompting provides modular descriptions and exemplars for atomic and composite skills in a single prompt; this has produced near-perfect systematic generalization in some domains and leverages latent skill activation in pretrained LLMs (Chen et al., 2023). A minimal prompt-assembly sketch follows this list.
- Data augmentation: Techniques such as Component Substitution (CompSub) and Learning Component Substitution (LCS) generate multi-grained, recombined training examples by swapping components across examples, thereby introducing a compositional inductive bias and implicitly regularizing the hypothesis class (Li et al., 28 Feb 2025).
- Reward-based learning/RL: Reinforcement learning post-training, especially with carefully designed rewards for compositional tasks, is shown to teach genuinely new skills (rather than just reweighting patterns), enabling models to learn unseen compositions and transfer these meta-skills to new domains (Yuan et al., 29 Sep 2025, Huang et al., 18 Dec 2024, 2505.19406). RL objectives can explicitly reward correct multi-step composition without exposing intermediate traces.
- Curriculum and meta-learning: Ordering demonstration examples from “easy-to-hard” by compositional structure (e.g., transitive to common target in a reasoning graph) improves generalization (Fu et al., 8 Oct 2024). Curriculum scaffolding and meta-reasoning controllers are proposed as future research trajectories (Sun et al., 23 Jun 2025).
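A minimal sketch of the Skills-in-Context idea referenced above: skill descriptions and a worked composition exemplar are packed into a single prompt. The skill descriptions and template here are illustrative assumptions, not the exact prompts of Chen et al. (2023):

```python
# Illustrative SKiC-style prompt assembly: atomic skill modules plus one
# composition exemplar in a single prompt (template is hypothetical).

SKILLS = {
    "reverse": "Reverse the characters of the input string. Example: 'abc' -> 'cba'",
    "uppercase": "Convert the input string to upper case. Example: 'abc' -> 'ABC'",
}

COMPOSITION_EXEMPLAR = (
    "Task: uppercase the reverse of 'abc'.\n"
    "Step 1 (reverse): 'abc' -> 'cba'\n"
    "Step 2 (uppercase): 'cba' -> 'CBA'\n"
    "Answer: CBA"
)

def build_skic_prompt(query: str) -> str:
    """Assemble skill descriptions, a composition exemplar, and the query."""
    skill_block = "\n".join(f"- {name}: {desc}" for name, desc in SKILLS.items())
    return (
        "You can use the following basic skills:\n"
        f"{skill_block}\n\n"
        "Worked composition example:\n"
        f"{COMPOSITION_EXEMPLAR}\n\n"
        f"Task: {query}\nAnswer:"
    )

print(build_skic_prompt("uppercase the reverse of 'hello'"))
```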
Table: Representative Approaches
| Approach | Key Mechanism | Empirical Effect |
|---|---|---|
| Skills-in-Context (SKiC) | Prompted skill modules + composition exemplars | Near-perfect systematic generalization (Chen et al., 2023) |
| CompSub / LCS | Multi-grained compositional augmentation | Gains of up to 66.5% on SCAN (Li et al., 28 Feb 2025) |
| RL post-training | Reward for composition correctness | Generalization to unseen skill compositions (Yuan et al., 29 Sep 2025); boosts compositional reasoning in VLMs (2505.19406) |
| Easy-to-hard curriculum | Demonstrations ordered by composition difficulty | Improved CGGC performance (Fu et al., 8 Oct 2024) |
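To make the component-substitution row above concrete, here is a minimal sketch of swapping aligned components across examples to synthesize recombined training pairs. The SCAN-like data and single-span alignment are simplifications; CompSub/LCS operate on multi-grained components and, in LCS, learn which substitutions to apply:

```python
# Toy SCAN-like pairs, each with one marked "component" span.
examples = [
    {"command": "jump twice", "action": "JUMP JUMP", "component": ("jump", "JUMP")},
    {"command": "walk and run", "action": "WALK RUN", "component": ("run", "RUN")},
    {"command": "look left", "action": "LTURN LOOK", "component": ("look", "LOOK")},
]

def substitute_component(example: dict, donor: dict) -> dict:
    """Swap the marked component of `example` with the donor's component,
    yielding a recombined (command, action) pair for augmentation."""
    old_cmd, old_act = example["component"]
    new_cmd, new_act = donor["component"]
    return {
        "command": example["command"].replace(old_cmd, new_cmd),
        "action": example["action"].replace(old_act, new_act),
        "component": (new_cmd, new_act),
    }

augmented = substitute_component(examples[0], examples[2])
print(augmented)  # {'command': 'look twice', 'action': 'LOOK LOOK', ...}
```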
4. Theoretical Perspectives and Analysis
Theoretical work has sought to formalize conditions under which compositionality emerges or fails in LLMs:
- Embedding “confined support”: Compositional generalization is successful primarily when the component tasks act on disjoint subspaces of the input embedding. This is formally captured as increased compositional accuracy when the supports of the component tasks in embedding space are (near-)disjoint (Xu et al., 22 Jul 2024).
- Second-order loss approximation: Composability of modules is theoretically justified when fine-tuned parameter updates (task vectors) remain close to the pretraining initialization (the pretraining basin). Mathematically, for parameter displacements $\Delta_i = \theta_i - \theta_0$ and a convex combination of modules $\bar{\Delta} = \sum_i \alpha_i \Delta_i$, the task loss $L$ admits the second-order expansion $L(\theta_0 + \bar{\Delta}) \approx L(\theta_0) + \nabla L(\theta_0)^{\top}\bar{\Delta} + \tfrac{1}{2}\bar{\Delta}^{\top}\nabla^2 L(\theta_0)\,\bar{\Delta}$, establishing that quadratic loss curvature around $\theta_0$ supports modular composition (Porrello et al., 25 May 2024). A minimal task-vector composition sketch follows this list.
- Data augmentation as regularization: Theoretical results show that span substitution augmentation (CompSub) is tantamount to adding a term encouraging invariance to context for exchanged components, thereby explicitly promoting compositional structure in the learned representations (Li et al., 28 Feb 2025).
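A minimal sketch of the module-composition operation that the second-order view above justifies: merging task vectors by a convex combination around the shared pretrained initialization. PyTorch tensors are used for convenience, and the uniform weighting is an assumption, not the cited paper's exact procedure:

```python
import torch

def task_vector(finetuned: dict, pretrained: dict) -> dict:
    """Delta_i = theta_i - theta_0, per parameter tensor."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def compose_modules(pretrained: dict, deltas: list, alphas: list) -> dict:
    """theta = theta_0 + sum_i alpha_i * Delta_i, with alphas a convex combination."""
    assert abs(sum(alphas) - 1.0) < 1e-6
    merged = {k: v.clone() for k, v in pretrained.items()}
    for delta, alpha in zip(deltas, alphas):
        for k in merged:
            merged[k] += alpha * delta[k]
    return merged

# Toy usage with single-tensor "models" fine-tuned on two different skills.
theta0 = {"w": torch.tensor([1.0, 1.0])}
theta1 = {"w": torch.tensor([2.0, 1.0])}   # fine-tuned on skill 1
theta2 = {"w": torch.tensor([1.0, 3.0])}   # fine-tuned on skill 2
merged = compose_modules(
    theta0,
    [task_vector(theta1, theta0), task_vector(theta2, theta0)],
    [0.5, 0.5],
)
print(merged["w"])  # tensor([1.5000, 2.0000])
```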
5. Modalities Beyond Language: Robotics, Vision, and Program Synthesis
Compositional acquisition is an active research area across multiple domains:
- Robotic skill composition: LLMs are used as high-level planners to recursively decompose natural language task instructions into subtasks, autonomously write code snippets for task success conditions, and facilitate data labeling for multi-task visuo-motor policies (Ha et al., 2023); a minimal decomposition sketch follows this list. The resulting modularity enables robust, retry-capable, and reusable skill acquisition.
- Programs and planning: LLM-guided program synthesis and plan generation frameworks use agentic decomposition, bottom-up composition, and process mining to discover, store, and retrieve modular “skills” as process models, thus making program generation more interpretable and compositional (Khan et al., 12 Mar 2025, Almorsi et al., 11 Jan 2025, Redis et al., 14 Oct 2024).
- Visual reasoning and vision-language modeling: Controller-type LLMs coordinate abstract spatial/temporal routines and visual tools. Automated generation of in-context exemplars, use of task abstraction, and explicit rewards for cross-modal grounding boost performance on compositional visual tasks (Stanić et al., 3 Jan 2024, 2505.19406).
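A minimal sketch of the LLM-as-planner decomposition pattern shared by the robotics and program-synthesis items above: an instruction is recursively split into subtasks until each leaf matches a stored skill. The `call_llm` helper, prompt wording, and skill library are all assumptions for illustration, not any cited system's API:

```python
# Hypothetical LLM-as-planner decomposition into a reusable skill library.
SKILL_LIBRARY = {"pick", "place", "open_drawer", "close_drawer"}

def call_llm(prompt: str) -> str:
    """Stand-in for any chat-completion client; wire up a real one here."""
    raise NotImplementedError

def decompose(instruction: str, depth: int = 0, max_depth: int = 3) -> list:
    """Recursively split an instruction until every leaf is a known skill
    (or a depth limit is reached), returning an ordered plan of skill calls."""
    if instruction in SKILL_LIBRARY or depth >= max_depth:
        return [instruction]
    reply = call_llm(
        "Split the task into an ordered, comma-separated list of subtasks.\n"
        f"Known skills: {sorted(SKILL_LIBRARY)}\nTask: {instruction}"
    )
    subtasks = [s.strip() for s in reply.split(",") if s.strip()]
    plan = []
    for sub in subtasks:
        plan.extend(decompose(sub, depth + 1, max_depth))
    return plan
```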
6. Open Issues and Limitations
Despite recent algorithmic advances, critical challenges and deficiencies remain:
- Compositionality gap: LLMs and VLMs often fail to exhibit spontaneous, systematic composition of skills in out-of-distribution settings, a gap quantified by performance drops as task complexity increases or when primitive skills must be integrated beyond memorized templates (Zhao et al., 5 May 2024, Sun et al., 23 Jun 2025).
- Scaling and instruction-tuning tradeoffs: Scaling up model parameters improves compositionality in some domains; however, instruction-tuning can degrade compositional behavior even as it increases alignment and in-domain performance (Dhar et al., 18 Jul 2024).
- Task structure sensitivity: Compositional performance strongly depends on whether subskills are separable or require sequential/entangled reasoning. Models excel when tasks decompose into non-interfering segments, but fail to compose sequential, entangled, or structurally aligned sub-skills (Xu et al., 22 Jul 2024, Sun et al., 23 Jun 2025).
- Transfer and meta-compositionality: RL-based compositional training enables transfer of a composition meta-skill across domains when atomic skills are shared, whereas next-token prediction objectives fail to produce such transfer (Yuan et al., 29 Sep 2025). A minimal sketch of a composition-correctness reward follows this list.
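As a minimal illustration of the reward design discussed in Sections 3 and 6, rewarding only final-answer correctness on a composed task, with no supervision of intermediate steps; the exact-match verifier below is an assumed simplification, not the cited works' actual objective:

```python
def composition_reward(model_output: str, target: str) -> float:
    """Score only the final composed answer (last non-empty line);
    intermediate reasoning steps are neither required nor rewarded."""
    stripped = model_output.strip()
    final_answer = stripped.splitlines()[-1].strip() if stripped else ""
    return 1.0 if final_answer == target.strip() else 0.0

# The scalar reward feeds a standard RL post-training loop (e.g., policy
# gradient over sampled completions); only composition correctness is scored.
print(composition_reward("Step 1: ...\nStep 2: ...\nOLLEH", "OLLEH"))  # 1.0
```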
7. Implications, Applications, and Research Directions
Enhanced compositional skill acquisition in LLMs underpins advancements in robust NLP, code generation, cross-modal reasoning, scientific discovery, robotics, and agentic AI systems. Promoting compositionality involves targeted fine-tuning, curriculum staging, data augmentation, RL in composition-specific regimes, and the deployment of structured prompting and modular architectures.
Future directions include:
- Dynamic and meta-learning curricula for progressive skill integration (Sun et al., 23 Jun 2025);
- Data-efficient “skill-rich” synthetic augmentation for flexible meta-skill induction (Zhao et al., 29 Sep 2024);
- Integration of verifiable intermediate rewards and explicit grounding in multimodal systems (2505.19406);
- Theoretical exploration of disentangled representations, task arithmetic, and regularization bounds for compositional learning (Porrello et al., 25 May 2024, Li et al., 28 Feb 2025).
Progress in these areas will determine the capacity of future LLMs to match or exceed the systematic and creative composition capabilities of human cognition.