STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs

Published 20 Apr 2026 in cs.CL and cs.AI | (2604.18177v2)

Abstract: Benchmarks are often used as a standard to understand LLM capabilities in different domains. However, aggregate benchmark scores provide limited insight into compositional skill gaps of LLMs and how to improve them. To make these weaknesses visible, we propose the Scaffolded Task Design (STaD) framework. STaD generates controlled variations of benchmark tasks based on the concept of scaffolding, which introduces structured, incremental support in a step-by-step manner. Rather than inspecting failures individually, this approach enables systematic and scalable probing of model behavior by identifying the specific reasoning skill compositions they lack. Treating the LLM as a black box, our experiments on six models of varying sizes reveal multiple failure points in three reasoning benchmarks and highlight each model's unique and distinct skill gaps.


Authors (5)

Summary

The paper introduces STaD, a methodology that decomposes complex tasks into sequential sub-tasks to diagnose compositional reasoning failures in LLMs.
It employs a three-role framework with teacher, judge, and target models to generate scaffolded variants and quantify the minimal guidance needed for task solvability.
Empirical results demonstrate that meaningful scaffolding nearly doubles success rates and exposes critical skill-specific bottlenecks overlooked by aggregate scores.

Scaffolded Task Design (STaD): Systematic Identification of Compositional Skill Gaps in LLMs

Motivation and Framework

Benchmarking LLMs via aggregate scores obscures systematic reasoning deficiencies and compositional failure modes. Conventional evaluations fail to capture how competence in isolated skills translates to multi-step ability. STaD proposes a principled methodology combining decomposition and scaffolding: tasks are partitioned into sequential sub-tasks, enabling controlled injection of intermediate answers at each stage. Models are then probed for the minimum scaffolding they require to solve those tasks, exposing bottlenecks in skill composition and situated reasoning.

The operational pipeline is instantiated as three LLM roles: a teacher model (generates decompositions, intermediate answers, and scaffolded variants), a judge model (evaluates correctness via consistency adjudication), and a target model (subject to compositional diagnosis). Scaffolded variations are generated for each benchmark problem, with controlled provision of intermediate sub-task solutions, rigorously verified for consistency to prevent decomposition-induced bias. Sub-tasks are further abstracted to skill clusters via semantic embedding and agglomerative clustering, enabling compositional skill-level analysis.
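The controlled provision of intermediate sub-task solutions can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the `SubTask` type, the `scaffolded_variants` function, and the prompt wording are hypothetical, and the sketch assumes `steps` holds only intermediate sub-tasks, so the final answer is never revealed. The toy problem is in the spirit of the egg example in Figure 1, with made-up values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubTask:
    """One intermediate step of a decomposition (hypothetical structure)."""
    question: str
    answer: str


def scaffolded_variants(problem: str, steps: list[SubTask]) -> list[str]:
    """Return one prompt per scaffolding level 0..len(steps).

    Level k reveals the answers to the first k intermediate sub-tasks.
    Level 0 is the original problem with no hints.
    """
    variants = []
    for k in range(len(steps) + 1):
        hints = [
            f"Step {i}: {s.question} -> {s.answer}"
            for i, s in enumerate(steps[:k], start=1)
        ]
        if hints:
            prompt = problem + "\nGiven intermediate results:\n" + "\n".join(hints)
        else:
            prompt = problem
        variants.append(prompt)
    return variants


# Toy problem with made-up values (not taken from the paper's data).
problem = (
    "16 eggs are collected. 3 are eaten and 4 are used for baking. "
    "The rest sell for $2 each. How much is earned?"
)
steps = [
    SubTask("How many eggs are used?", "7"),
    SubTask("How many eggs remain for sale?", "9"),
]
variants = scaffolded_variants(problem, steps)
```

Each entry of `variants` would be sent to the target model in turn, with the judge model deciding whether the response is correct; those calls are omitted here because the source does not specify their interface.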

Figure 1: Scaffolded tasks for identifying compositional skill gap s2 (Subtraction: Eggs remaining for sale).

Experimental Setup and Dataset Construction

STaD is empirically applied to ToT Arithmetic (Fatemi et al., 2024), GSM8K (Cobbe et al., 2021), and Math-Hard (Hendrycks et al., 2021), benchmarks requiring complex arithmetic, temporal, and algebraic reasoning. Scaffolded datasets are filtered for teacher-consistent decomposition, yielding high step coverage (>87%) and low redundancy (<8%) across domains. Cluster-driven skill abstraction is validated with exhaustive granularity ablations; the selected granularity aligns with interpretable, non-redundant skill representations, confirmed by direct cluster-to-skill-mapping visualizations.

Figure 2: Other-category coverage across (M, N) configurations for each dataset, showing improved granularity reduces catch-all assignment.

Quantitative and Bottleneck Analysis

STaD's scaffolded evaluation delineates situated skill failures. Baseline model accuracy on GSM8K is 78.4–93.6%, but drops substantially on ToT and Math-Hard (18–46%). Under scaffolding, accuracy rises sharply; this reflects not an improvement in the model itself but conditional competence unlocked by intermediate support. An ablation confirms that the gains stem from meaningful scaffolding rather than superficial question restructuring: replacing intermediate values with placeholders drops accuracy back to baseline.
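The placeholder ablation can be pictured with a small sketch. The names `build_prompt` and `PLACEHOLDER`, the mask token, and the prompt layout are all assumptions made for illustration; the source states only that intermediate values are replaced with placeholders.

```python
PLACEHOLDER = "[VALUE]"  # hypothetical mask token; the source does not specify one


def build_prompt(problem, steps, k, mask_values=False):
    """Append the first k (question, answer) hints to the problem.

    With mask_values=True the hint structure is kept, but every
    intermediate answer is replaced by PLACEHOLDER.
    """
    lines = [problem]
    for i, (question, answer) in enumerate(steps[:k], start=1):
        shown = PLACEHOLDER if mask_values else answer
        lines.append(f"Step {i}: {question} -> {shown}")
    return "\n".join(lines)


# Toy data with made-up values.
problem = (
    "16 eggs are collected. 3 are eaten and 4 are used for baking. "
    "The rest sell for $2 each. How much is earned?"
)
steps = [("How many eggs are used?", "7"), ("How many eggs remain?", "9")]

scaffolded = build_prompt(problem, steps, k=2)
control = build_prompt(problem, steps, k=2, mask_values=True)
```

The two prompts share the same structure and differ only in whether the intermediate values are present, which is what lets the control separate the effect of the hints from the effect of restructuring the question.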

Figure 3: Original vs. scaffolded performance across benchmarks, establishing the diagnostic power of scaffolding for latent competence.

Skill-level analysis reveals pronounced weaknesses overlooked by aggregate scores. For ToT Arithmetic, bottleneck frequencies are highest in Overlap/intersection, Natural-language parsing, Calendar arithmetic, and Discrete slot counting. In Math-Hard, “Translating word problems into algebra” emerges as the dominant bottleneck. GSM8K bottlenecks are less frequent, but extracting quantitative information and sequential tracking expose lower resilience.

The minimum scaffolding level $k$ quantifies the guidance required for task solvability: $k=0$ denotes an independent solution, $k>0$ a task solvable with partial scaffolding, and $k=-1$ a task that is intractable even with full support. Combinatorial analysis of recurrent skill sets exposes both universal and model-specific compositional weaknesses. For example, Overlapping time + Discrete slots + JSON remains challenging for all models, consistently requiring multiple scaffolded hints. However, the same combinations yield divergent performance between models: Qwen variants handle Unit Conversion + Arithmetic combinations with lower $k$, while Llama and Granite models display substantially higher intractability rates.
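A minimal sketch of how $k$ and a per-skill-set summary might be computed follows. The function names and the record format are hypothetical, and the sketch assumes $k$ is the smallest scaffolding level at which the judged response is correct; the source does not spell out the exact procedure.

```python
from collections import defaultdict
from typing import Sequence


def min_scaffold_level(solved_at_level: Sequence[bool]) -> int:
    """solved_at_level[k] is True if the target solved the variant with k hints.

    Returns the smallest such k (0 means solved unaided),
    or -1 if no level, including full scaffolding, was solved.
    """
    for k, solved in enumerate(solved_at_level):
        if solved:
            return k
    return -1


def summarize_by_skill_set(records):
    """Group (skills, k) records by skill set.

    Returns {frozenset(skills): (mean k over solved cases or None,
                                 fraction of cases with k == -1)}.
    """
    grouped = defaultdict(list)
    for skills, k in records:
        grouped[frozenset(skills)].append(k)
    summary = {}
    for skills, ks in grouped.items():
        solved = [k for k in ks if k >= 0]
        mean_k = sum(solved) / len(solved) if solved else None
        summary[skills] = (mean_k, ks.count(-1) / len(ks))
    return summary


# Made-up records for illustration only.
records = [
    ({"unit conversion", "arithmetic"}, 1),
    ({"arithmetic", "unit conversion"}, 3),
    ({"unit conversion", "arithmetic"}, -1),
]
summary = summarize_by_skill_set(records)
```

Comparing these summaries across target models is one way to surface the pattern described above, where a skill combination has a low mean $k$ for one model and a high intractability rate for another.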

Figure 4: Frequency of compositional-skill bottlenecks in three benchmarks, differentiating models by where their reasoning fails first.

Qualitative Implications and Theoretical Discussion

STaD enables precise diagnosis of situated reasoning: equivalent aggregate scores can mask distinctly different bottlenecks and compositional deficits. Skill-level probes systematically inflate competence estimates; when skills are required in context, performance degrades, pointing to brittle integration and sequencing rather than missing primitives.

Failures predominantly arise from skill interactions, not isolated deficits. The statistical analysis demonstrates that compositional support (scaffolding multiple upstream skills together) nearly doubles success rates for complex tasks compared to isolated scaffolding. The intractable cases under full scaffolding suggest unsolved challenges in intermediate synthesis or decompositional planning, underscoring that compositional failures propagate nonlinearly across task structure.

STaD's complementary approach to individual skill testing and situated scaffolding provides actionable diagnosis: targeted synthetic generation at the skill-combination level, compositional bottleneck reporting protocols, and explicit training on skill coordination. Because scaffolding distinguishes learnable cases from fundamentally intractable ones, it reveals priorities for intervention and curriculum construction.

Figure 5: Model performance distribution across ToT Arithmetic, partitioned by skill bottleneck frequency.

Limitations and Future Directions

Reliance on teacher model quality constrains decomposition consistency. Robustness checks show high cross-teacher agreement, and STaD is agnostic to teacher choice and adaptable to ensemble or human-in-the-loop schemes. The approach is best suited to multi-step reasoning on structured benchmarks; its efficacy decreases for open-ended generation or low structural overlap, as in some Math-Hard and GSM8K tasks. Filtering for teacher-consistent scaffolding introduces selection bias, confirmed to be minimal for most datasets but more pronounced in Math-Hard.

Generalizing STaD to broader domains (complex tool-calling, instruction following, or multi-agent planning) remains a direct extension; its scaffolded competence diagnostic is portable across any domain with well-defined intermediate structure.

Conclusion

STaD provides a methodologically rigorous framework for diagnosing compositional skill gaps in LLMs, enabling fine-grained analysis of situated competence beyond aggregate scores. This systematic scaffolding protocol reveals model- and skill-specific bottlenecks, identifies learnable versus intractable cases, and prescribes targeted evaluation and training strategies for robust multi-step reasoning. All code and datasets are released to facilitate reproducibility and extension (2604.18177).
