Compositional Reasoning Progression (CRP)

Updated 6 January 2026
  • Compositional Reasoning Progression (CRP) is the systematic decomposition of complex tasks into simpler sub-problems followed by their recombination to form coherent solutions.
  • CRP frameworks operationalize this process through decomposition, independent sub-task resolution, and strategic integration, leveraging layered neural and symbolic architectures.
  • Evaluation of CRP employs metrics like AUC and F1 to measure performance decay and gains as task complexity increases, guiding model comparisons and improvements.

Compositional Reasoning Progression (CRP) encompasses the set of frameworks, methods, and theoretical perspectives that formalize, operationalize, and evaluate how complex reasoning emerges through the systematic decomposition and recombination of simpler sub-problems. CRP addresses both model capacity (how and whether models effectively solve incrementally more complex tasks through step-wise composition) and evaluation methodology (how this compositional capability is measured, visualized, and interpreted), with applications spanning LLMs, vision-language models (VLMs), program verification, and analogical reasoning. This article synthesizes foundational definitions, architectural principles, benchmark methodologies, empirical findings, and mechanistic insights from major CRP studies.

1. Formal Definitions and Core Principles

Compositional reasoning is defined as the ability to partition a complex problem into simpler components, solve each independently, and integrate the solutions to yield a final answer. CRP refers specifically to the staged or progressive acquisition, demonstration, or measurement of this ability as task complexity, composition depth, or input transformations increase (Shi et al., 8 Feb 2025; Fu et al., 2023; Liu, 20 Oct 2025; Wu et al., 2020).

Operational criteria for CRP include the following (a minimal pipeline sketch appears after the list):

  • Decomposition: Breaking complex prompts or queries down into atomic sub-tasks.
  • Subproblem Solution: Independently solving decomposed subtasks—e.g., decoding ciphers, extracting entities, solving math problems.
  • Integration/Summarization: Reconstructing a final response from sub-answers, often involving re-encoding, aggregation, or selecting among several competing outputs.
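
Read together, these criteria amount to a decompose → solve → integrate pipeline. The following is a minimal sketch of that control flow, assuming decomposition, sub-task solving, and integration are available as callables; all names are illustrative placeholders, not an API from any of the cited papers.

```python
# Minimal decompose -> solve -> integrate loop. All names here are
# illustrative placeholders, not an API from the cited papers.
from typing import Callable, List

def solve_compositionally(
    prompt: str,
    decompose: Callable[[str], List[str]],   # split prompt into atomic sub-tasks
    solve_subtask: Callable[[str], str],     # e.g., decode a cipher, extract an entity
    integrate: Callable[[List[str]], str],   # aggregate sub-answers into one response
) -> str:
    subtasks = decompose(prompt)
    sub_answers = [solve_subtask(t) for t in subtasks]
    return integrate(sub_answers)
```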

CRP is instantiated in both algorithmic and neural contexts. In symbolic and formal-logic settings (e.g., program verification), CRP refers to local compositional inference and the modular combination of proof obligations (Le, 25 Aug 2025). In neural models, it tracks an architecture’s capacity to generalize beyond memorized examples, leveraging layer specialization and modular architectures (Liu, 20 Oct 2025; Fu et al., 2023).

2. Mathematical and Algorithmic Formalizations

Formalization of CRP centers on layered tasks, transforms, and proof criteria:

  • Task Transformation: For a benchmark $B = \{x_1, \ldots, x_N\}$, CRP benchmarks apply a sequence of $m$ rules $r_1, \ldots, r_m$ through function composition:

$$x' = r_m \circ r_{m-1} \circ \cdots \circ r_1(x)$$

Here, $r_i$ may represent encryption, reformatting, or answer transformation (Shi et al., 8 Feb 2025).
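
Concretely, the level-$m$ transformation is a left-to-right fold over the rule list. The sketch below uses two toy rules (string reversal and a Caesar shift) as stand-ins; they are illustrative, not the actual transformations of any benchmark cited here.

```python
# Sketch of x' = r_m ∘ ... ∘ r_1(x) as a left-to-right fold.
# The concrete rules are toy stand-ins for benchmark transformations.
from functools import reduce

def reverse(x: str) -> str:
    return x[::-1]

def caesar(x: str, shift: int = 3) -> str:
    # Shift lowercase letters; leave other characters untouched.
    return "".join(
        chr((ord(c) - 97 + shift) % 26 + 97) if c.islower() else c
        for c in x
    )

def compose(rules, x):
    # Applies r_1 first, then r_2, ..., r_m.
    return reduce(lambda acc, rule: rule(acc), rules, x)

print(compose([reverse, caesar], "compositional reasoning"))
```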

  • Metric Aggregation: CRP is quantified via accuracy curves as a function of transformation “level” $k$:

$$\mathrm{AUC} = \int_{k_{\min}}^{k_{\max}} f(k) \, dk \approx \sum_{i=1}^{N-1} (k_{i+1} - k_i)\, \frac{f(k_i) + f(k_{i+1})}{2}$$

where $f(k)$ is model accuracy at level $k$ (Shi et al., 8 Feb 2025).
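
A minimal sketch of the trapezoidal computation; the accuracy values are synthetic, for illustration only:

```python
# Trapezoidal AUC over transformation levels, per the formula above.
# The accuracy values are synthetic, for illustration only.
import numpy as np

levels = np.array([0, 1, 2, 3, 4])                    # transformation level k
accuracy = np.array([0.95, 0.80, 0.55, 0.30, 0.15])   # f(k), synthetic

auc = np.trapz(accuracy, levels)  # sum_i (k_{i+1}-k_i) * (f(k_i)+f(k_{i+1}))/2
print(auc)
```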

  • Neural Representation and Modularization: Architectures employ masked self-attention, dynamic module routing, and head-function vectorizations to enforce sub-task specialization. Each module (e.g., attention head) operates with locally computed masks based on functional similarity:

$$\alpha_{i,j} = \cos(\mathrm{rep}_{h_i}, \mathbf{f}_j)$$

$$m_{h_i}(t) = \exp(1 - \max_j \alpha_{i,j})$$

(Fu et al., 2023).
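
A sketch of this masking computation, assuming the head representation $\mathrm{rep}_{h_i}$ and the candidate function vectors $\mathbf{f}_j$ are given as arrays; how those vectors are produced is specific to the cited architecture:

```python
# Cosine-similarity head mask per the formulas above. How rep_h and
# the function vectors f_j are computed is architecture-specific.
import numpy as np

def head_mask(rep_h: np.ndarray, function_vecs: np.ndarray) -> float:
    # alpha_j = cos(rep_h, f_j) for each candidate function vector f_j
    sims = function_vecs @ rep_h / (
        np.linalg.norm(function_vecs, axis=1) * np.linalg.norm(rep_h)
    )
    # m_h = exp(1 - max_j alpha_j)
    return float(np.exp(1.0 - sims.max()))
```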

  • Analogical Progression: In structured analogical reasoning, progression is formalized as arithmetic relationships between attribute embeddings across triplets,

$$R_{\mathrm{prog}}(a_1, a_2, a_3) = \text{True} \iff a_2 - a_1 \approx a_3 - a_2$$

with neural modules tasked to compute progression scores over such triples (Wu et al., 2020).
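
In embedding space, this relation reduces to checking that consecutive differences match within a tolerance; the tolerance below is an illustrative choice, not a value from the paper:

```python
# Arithmetic-progression check over attribute embeddings:
# True iff a2 - a1 ≈ a3 - a2. Tolerance is an illustrative choice.
import numpy as np

def is_progression(a1: np.ndarray, a2: np.ndarray, a3: np.ndarray,
                   tol: float = 1e-2) -> bool:
    return bool(np.linalg.norm((a2 - a1) - (a3 - a2)) < tol)
```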

3. Benchmarks and Evaluation Methodologies

CRP requires challenging benchmarks that systematically vary compositional demands and stratify task complexity:

| Benchmark | Domain | Subtasks / CRP axes | Notable metrics | Reference |
|---|---|---|---|---|
| CryptoBench | LLMs | Cipher decoding, multi-hop inference, answer reformatting | AUC over encryption levels, exact match | (Shi et al., 8 Feb 2025) |
| NarrativeTrack | VLMs | Entity persistence, context changes, ambiguity resolution | Macro-averaged accuracy for binary, multiple-choice, and ordering questions | (Ha et al., 3 Jan 2026) |
| SCL | Analogical reasoning | Object, attribute, relationship factorization | Progression scores, cross-entropy loss | (Wu et al., 2020) |
| MORSE | Structured proofs | Modular decomposition of inference types | Steps-F1, Intermediates-F1 | (Fu et al., 2023) |

Benchmarks apply transformations such as encrypted prompts (e.g., emoji/Morse alphabets), multi-modal fusions, synthetic progression rules, or stepwise expansion of task “depth” (e.g., proof tree height, sequence complexity, entity chain length) to induce and measure CRP. Metrics such as AUC or F1 capture performance decay or gains as compositional difficulty increases.

4. Mechanistic and Representational Insights

CRP exposes interpretable pathways and modular structures in neural models:

  • Layer specialization: Principal component and clustering analyses in transformers reveal that compositional generalization is supported by the emergence of modular, hierarchically specialized layers. Early layers encode raw positional or lexical statistics, intermediate layers realize task structure or decomposition, and late layers perform integration and summarization (Liu, 20 Oct 2025).
  • Logit lens and neuron activation: In LLMs, probability mass attributed to decoding tokens peaks in early layers, sub-task solutions peak in mid-layers, and summarization peaks in late layers, aligning with a “decode → solve → summarize” circuit (Shi et al., 8 Feb 2025). Distinct macro-stages are identifiable through activation tracking and token projection (a minimal logit-lens sketch follows this list).
  • Module–inference type alignment: Modularized self-attention heads cluster by functional specialization, and ablations or routing failures yield head↔rule or module↔sub-task mappings, confirming emergent compositionality in neural architectures (Fu et al., 2023).
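
Since the logit lens recurs throughout these analyses, a minimal sketch may help. It assumes a GPT-2-style Hugging Face model (attribute names such as transformer.ln_f differ across architectures) and is a generic illustration, not the instrumentation used in the cited papers.

```python
# Minimal logit-lens sketch: project each layer's hidden state through
# the final layer norm and unembedding to see which token that layer
# "currently" predicts. Assumes a GPT-2-style Hugging Face model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("Decode the cipher, then answer:", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    # Logit lens: final norm + unembedding applied to the last position
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(layer, repr(tok.decode([logits.argmax().item()])))
```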

A plausible implication is that such modular and specialized structures parallel symbolic system decompositions, providing a bridge between hand-crafted and learned compositional reasoning.

5. Empirical Findings and Model Comparisons

CRP-centered evaluation has revealed persistent gaps as well as drivers of compositional ability:

  • Closed- vs. open-source LLMs: Closed-source models (e.g., o1, GPT-4o) systematically outperform open-source LLMs (AUC ≈ 3–4.1 vs. 0.4–2.5) as compositional complexity increases; open-source models suffer steeper accuracy drops under increasing transformation levels (Shi et al., 8 Feb 2025).
  • General-purpose vs. video-specific MLLMs: Generalist models excel in perceptual and static grounding, while video-specific variants better maintain temporal coherence but are susceptible to context hallucination (Ha et al., 3 Jan 2026).
  • Compositional modularity: Architectures incorporating explicit module selection, dynamic masking, and multi-stage curricula (e.g., MORSE, SCL) demonstrate enhanced systematicity, productivity (deeper/longer chains), and zero-shot generalization to unseen attribute-relation compositions (Fu et al., 2023; Wu et al., 2020).
  • Empirical behavioral laws: Model accuracy decays approximately exponentially with composition depth (e.g., branching factor in PCFGs, proof tree depth) and exhibits sigmoidal gains as the number of in-context examples increases; transfer tasks require significantly more context than in-distribution generalization (Liu, 20 Oct 2025). An illustrative curve fit follows this list.
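
The decay law can be illustrated with a simple curve fit. The data points below are synthetic; only the functional form (exponential in depth) comes from the text above.

```python
# Illustrative fit of exponential accuracy decay in composition depth.
# Data points are synthetic; only the functional form is from the text.
import numpy as np
from scipy.optimize import curve_fit

def exp_decay(depth, a, b):
    return a * np.exp(-b * depth)

depths = np.array([1, 2, 3, 4, 5, 6])
acc = np.array([0.92, 0.71, 0.52, 0.38, 0.27, 0.20])  # synthetic

(a, b), _ = curve_fit(exp_decay, depths, acc, p0=(1.0, 0.3))
print(f"fitted decay rate b ≈ {b:.2f}")
```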

6. Applications in Program Verification and Logic

Beyond neural models, CRP is foundational for scalable, compositional verification in formal methods:

  • Concurrent separation logic with permissions: By encoding compositional proof obligations as local operations on heap “regions” through strong ($\ast$) and weak ($\star$) separating conjunctions, and by using inference rules to automate frame discovery, CRP enables thread- and function-modular proofs with reduced manual annotation (Le, 25 Aug 2025).
  • Entailment and framing: FrInfer procedures synthesize residual heap assertions required for compositional verification, allowing the automatic composition and later restoration of modular proof elements.
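
For orientation, the textbook frame rule that such frame-inference procedures automate is shown below; this is the standard separation-logic statement, not the specific rule system of (Le, 25 Aug 2025).

```latex
% Standard frame rule of separation logic: a local proof about C
% extends to a larger heap, provided C modifies no variable free in R.
\[
  \frac{\{P\}\ C\ \{Q\}}{\{P \ast R\}\ C\ \{Q \ast R\}}
  \qquad \mathrm{mod}(C) \cap \mathrm{fv}(R) = \emptyset
\]
```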

Key advances include a dramatic reduction in required manual frame annotations (from 50% to 0%) and verified scalability on concurrent benchmarks.

7. Open Challenges and Future Directions

Current CRP research surfaces several open challenges and priorities for future work:

  • Data and supervision: Rich, explicit compositional chains remain costly to annotate; scalable synthetic data generation and self-verifying pipelines are highlighted as needed advances (Ke et al., 24 Aug 2025).
  • Dynamic curricula and supervision: Explicit layer-wise and sub-task supervision may strengthen internal compositional structure (Shi et al., 8 Feb 2025; Fu et al., 2023).
  • Multimodal and agentic expansion: CRP frameworks are being extended to multi-modal and agentic settings (images, video, simulation) where models must maintain and manipulate working memory, world models, and latent simulations (Ke et al., 24 Aug 2025; Ha et al., 3 Jan 2026).
  • Evaluation and interpretability: Most benchmarks measure only final answers; step-wise faithfulness, causal coherence, and transcript-level correctness are identified as underexplored evaluation axes (Ke et al., 24 Aug 2025).

Research suggests that integrating explicit world models, dynamic task decomposition, and richer compositional supervision will be essential for achieving robust, generalizable, and interpretable compositional reasoning in both neural and formal systems.
