Compositionality & Multi-Hop Failures

Updated 10 March 2026

Compositionality and multi-hop failures refer to the challenge of building complex outputs from simpler learned units, with performance degrading as reasoning hops increase.
Empirical analyses show a persistent accuracy gap between single-hop and multi-hop tasks, revealing issues like shortcut exploitation, recognition bottlenecks, and overthinking.
Mitigation strategies including chain-of-thought prompting, external knowledge injection, and back attention have demonstrated improvements in bridging the compositionality gap.

Compositionality is the ability to build complex outputs by combining simpler learned units or functions, and multi-hop reasoning is a core paradigm for compositionality in language, vision, and mathematical domains. However, modern neural models, especially transformer-based LLMs and vision-LLMs, routinely struggle to maintain compositional integrity as the number of reasoning “hops” increases. These multi-hop failures manifest as a persistent performance gap between single-hop and compositional tasks, a failure to propagate local updates in knowledge across compositions, and a brittleness to adversarial prompt manipulations.

1. Formal Definitions: Compositionality and Multi-hop Reasoning

The canonical formalization defines a k-hop compositional question as one whose answer can be expressed as a function $A=g(r_1, r_2, ..., r_k)$, where each intermediate result $r_i=h_i(Q, P_i)$ is obtained by applying a local function $h_i$ to sub-question $Q$ and supporting evidence $P_i$ (Min et al., 2019, Yadav et al., 6 Aug 2025). Multi-hop question answering (QA) or inference requires chaining these intermediate results, often across multiple documents or steps. The “gold path” is the exact sequence of reasoning hops needed to answer the question, and compositionality requires both correct retrieval of all required evidence units and correct integration via the composition function $g$ (Yadav et al., 6 Aug 2025).

A fundamental metric used to quantify compositionality in models is the “compositionality gap” (Press et al., 2022):

$\text{Gap} = 1 - \frac{\text{Multi-hop Accuracy}}{\text{Single-hop Accuracy}}$

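Here single-hop accuracy is read as the fraction of compositional questions whose sub-questions are all answered correctly, and multi-hop accuracy as the fraction for which the full question is also answered correctly. Under that reading, the gap is the conditional probability that a model fails to answer the full question despite solving all of its sub-questions.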
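A minimal sketch of how the gap could be computed from per-question results follows. It assumes each evaluation record stores the correctness of every sub-question and of the composed question; the `Record` type and its field names are hypothetical, not taken from Press et al. It also checks that the conditional reading agrees with the ratio formula when "single-hop accuracy" means "all sub-questions solved" and "multi-hop accuracy" means "sub-questions and full question solved".

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One compositional question (hypothetical evaluation record)."""
    sub_correct: list[bool]  # correctness of each sub-question
    full_correct: bool       # correctness of the composed question


def compositionality_gap(records: list[Record]) -> float:
    """P(full question wrong | every sub-question right)."""
    solved = [r for r in records if all(r.sub_correct)]
    if not solved:
        raise ValueError("no question has all sub-questions answered correctly")
    failed = sum(1 for r in solved if not r.full_correct)
    return failed / len(solved)


def gap_from_ratio(records: list[Record]) -> float:
    """Same quantity written as 1 - multi-hop accuracy / single-hop accuracy."""
    n = len(records)
    single = sum(1 for r in records if all(r.sub_correct)) / n
    multi = sum(1 for r in records if all(r.sub_correct) and r.full_correct) / n
    return 1.0 - multi / single


records = [
    Record([True, True], True),
    Record([True, True], True),
    Record([True, True], False),   # sub-questions solved, composition fails
    Record([True, False], False),  # excluded: a sub-question is wrong
]
print(compositionality_gap(records), gap_from_ratio(records))
```

Questions with an incorrectly answered sub-question are excluded from the denominator, so the metric isolates composition failures from plain knowledge gaps.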

Rigorous frameworks have been proposed to analyze the theoretical structure of compositional models. The neuro-symbolic framework of Ram et al. (2024) formalizes a compositional function $f$ as

$f(X) = h \left( g^{\otimes D(X)}(e(x_1), ..., e(x_L)) \right)$

where $D(X)$ is an input-conditional computation DAG specifying the sequence of compositions, $g$ is a shared composition operator, and $e$ is a local encoder. The compositional complexity of $f$ is measured via the Locus of Influence (LoI), which enumerates the length and branching of reasoning chains per input token.
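The structure of such a compositional function can be sketched in a few lines. This is an illustrative toy, not the construction of Ram et al.: it assumes a binary composition operator, represents the DAG as a list of index pairs, and treats $h$ as a readout applied to the last node. The names `compose` and `left_chain` are hypothetical, and the sketch does not compute the LoI.

```python
from typing import Callable, Sequence


def compose(tokens: Sequence[str],
            e: Callable,    # local encoder, applied to each token
            g: Callable,    # shared binary composition operator
            h: Callable,    # readout on the final node (assumed role)
            dag: Callable,  # D(X): maps the input to a list of merge steps
            ):
    """Evaluate h(g applied along D(X) to e(x_1), ..., e(x_L))."""
    nodes = [e(t) for t in tokens]               # leaf nodes 0..L-1
    for left, right in dag(tokens):              # each step merges two nodes
        nodes.append(g(nodes[left], nodes[right]))
    return h(nodes[-1])


def left_chain(tokens: Sequence[str]) -> list[tuple[int, int]]:
    """Input-conditional DAG: a left-to-right chain with len(tokens) - 1 merges."""
    steps, prev, n = [], 0, len(tokens)
    for i in range(1, n):
        steps.append((prev, i))
        prev = n + i - 1                         # index of the node just created
    return steps


result = compose(["2", "3", "4"], e=int, g=lambda a, b: a + b,
                 h=lambda v: v, dag=left_chain)
print(result)
```

Because `dag` receives the input, the routing can change per example, which is the property the formalism singles out; a fixed, input-agnostic list of merge steps would correspond to the dense-routing case.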

2. Empirical Characterization of the Compositionality Gap

Extensive empirical analyses demonstrate that the single-hop accuracy of LLMs and vision-language models like CLIP increases with scale, but their accuracy on multi-hop (compositional) questions lags, and the gap persists across orders of magnitude in model size (Press et al., 2022, Yu et al., 15 Feb 2025, Kudo et al., 2023). For example, in language modeling, single-hop QA accuracy keeps rising for large models (Davinci-002), while multi-hop accuracy plateaus at a markedly lower level on hard composition benchmarks, yielding a persistent compositionality gap (Press et al., 2022). Analogously, seq2seq transformers struggle most with the systematicity dimension in arithmetic reasoning (i.e., recombining known primitives into new compositions); productivity (generalizing to longer chains) and substitutivity (handling novel variable names) are less problematic (Kudo et al., 2023).

In vision-language settings, token-level causal modeling shows that CLIP’s contrastive objective supports pseudo-optimal encoders that are insensitive to SWAP, REPLACE, and ADD operations on compositional structures: a bag-of-words matcher achieves alignment but is brittle to concept reordering or augmentation, leading to failures on adversarial “hard negative” instances (Chen et al., 30 Oct 2025).
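The bag-of-words brittleness can be illustrated with a toy text-only matcher. The sketch below is a stand-in under strong assumptions: it scores two captions by cosine similarity over word counts, it does not involve CLIP or any image encoder, and the `swap`, `replace`, and `add` helpers are hypothetical implementations of the three operations. A literal bag-of-words scorer is exactly invariant only to SWAP; REPLACE and ADD change the score, though a single-token edit leaves most of the overlap intact.

```python
import math
from collections import Counter


def bow_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors (order is ignored)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def swap(caption: str, i: int, j: int) -> str:
    """SWAP: exchange the tokens at positions i and j."""
    t = caption.split()
    t[i], t[j] = t[j], t[i]
    return " ".join(t)


def replace(caption: str, i: int, word: str) -> str:
    """REPLACE: substitute the token at position i."""
    t = caption.split()
    t[i] = word
    return " ".join(t)


def add(caption: str, i: int, word: str) -> str:
    """ADD: insert a new token before position i."""
    t = caption.split()
    t.insert(i, word)
    return " ".join(t)


caption = "a dog chases a cat"
negatives = {
    "SWAP": swap(caption, 1, 4),           # roles of dog and cat are reversed
    "REPLACE": replace(caption, 4, "bird"),
    "ADD": add(caption, 4, "black"),
}
for op, neg in negatives.items():
    print(op, repr(neg), round(bow_similarity(caption, neg), 3))
```

The SWAP negative reverses who chases whom yet contains the same multiset of words, so an order-blind matcher cannot separate it from the original caption.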

3. Psychological and Architectural Sources of Multi-hop Failure

Several mechanistic and theoretical perspectives have been developed to explain the compositionality gap:

Shortcut exploitation: Pretrained LLMs often memorize composite relations as direct shortcuts, bypassing chains of atomic facts. When an update is made to a single-hop component, the model may continue to output the old answer, ignoring the new local knowledge and breaking compositional faithfulness (Ju et al., 2024). These “factual shortcuts” are correlated with high co-occurrence of the start and end entity in pretraining corpora, and can be detected by low overlap between the knowledge neurons used for the composite relation and those used for the constituent hops. A share of multi-hop editing failures in standard benchmarks is attributed to such shortcuts.
Probabilistic recall and extraction: The recall-extract model (Liu et al., 7 Jan 2026) posits that entity recall is a diffuse, probabilistic process carried out by MLP layers, and answer extraction is performed by downstream attention. Multi-hop failures occur both when recall is too diffuse (“recall-level”) and when extraction lacks sufficient sharpening (“extract-level”). Layer-order inversion—when deeper-hop answer entities become decodable before bridge entities—arises and strengthens with hop count, contradicting the hop-aligned computation hypothesis.
Recognition bottleneck and position bias: Empirical probing shows multi-hop accuracy collapses to that of the least “visible” evidence—the “Weakest Link Law” (Zhang et al., 18 Jan 2026). Absolute position in context, rather than local distance, governs reasoning bottlenecks. System-2 “thinking” models with chain-of-thought verification steps overcome position bias, matching two-gold-document upper bounds even in the presence of noisy distractors.
Architectural compositional complexity: The formal complexity C(f) of compositional models grows rapidly with hop count and is highly sensitive to the symbolic routing in the computation DAG (Ram et al., 2024). Dense transformers or RNNs with input-agnostic routing cannot efficiently implement deep, input-specific reasoning chains, resulting in degraded approximation and generalization on multi-hop tasks.
Coverage and overthinking: Multi-hop failures are also diagnosed along three axes: incorrect hop count (“underhopping” or “overhopping”), incomplete coverage (missing one or more requisite evidence pieces), and cognitive inefficiency (“overthinking”: revisiting unneeded entities or making spurious extra hops) (Yadav et al., 6 Aug 2025). For dense inference tasks (MuSiQue), overthinking occurs at a notable rate, and fidelity of the retrieved hop chain drops sharply after two hops.
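The three diagnostic axes in the last item can be made concrete by comparing a model's hop chain with the gold path. The following is a rough sketch under assumptions: hops are represented as evidence identifiers, and the function name, labels, and decision rules are hypothetical simplifications rather than the exact metrics of Yadav et al.

```python
def diagnose_chain(predicted: list[str], gold: list[str]) -> dict:
    """Compare a predicted hop chain with the gold path along three axes."""
    gold_set = set(gold)

    # Axis 1: hop count relative to the gold path.
    if len(predicted) < len(gold):
        hop_count = "underhopping"
    elif len(predicted) > len(gold):
        hop_count = "overhopping"
    else:
        hop_count = "matched"

    # Axis 2: coverage of the required evidence.
    missing = [hop for hop in gold if hop not in predicted]

    # Axis 3: overthinking, i.e. off-path hops or revisited hops.
    seen, extra, revisited = set(), [], []
    for hop in predicted:
        if hop in seen:
            revisited.append(hop)
        elif hop not in gold_set:
            extra.append(hop)
        seen.add(hop)

    return {
        "hop_count": hop_count,
        "full_coverage": not missing,
        "missing": missing,
        "overthinking": bool(extra or revisited),
        "extra": extra,
        "revisited": revisited,
    }


gold_path = ["doc_A", "doc_B", "doc_C"]
print(diagnose_chain(["doc_A", "doc_X", "doc_B"], gold_path))
print(diagnose_chain(["doc_A", "doc_B", "doc_C", "doc_D"], gold_path))
```

The first chain has the right length but still misses `doc_C` and spends a hop on `doc_X`, which shows why hop count alone is not a sufficient check and the axes are reported separately.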

4. Failure Modes and Diagnostics

The concrete breakdowns that comprise multi-hop failures have been systematically enumerated across several recent large-scale diagnostic benchmarks (Gupta et al., 20 May 2025, Yadav et al., 6 Aug 2025). Common failure types include:

Missed final-hop composition: Models chain the first $k-1$ paragraphs/facts but omit the required final-hop integration; this is particularly pronounced at higher hop counts and with longer input contexts (Gupta et al., 20 May 2025).
Entity confusion and coreference drift: The model substitutes a similar or previously mentioned entity in the answer, often due to inadequate tracking across chain steps (Gupta et al., 20 May 2025).
Partial coverage: At least one required document/fact is not retrieved; the answer composition is thus incomplete (Yadav et al., 6 Aug 2025).
Early or trailing irrelevance (“overthinking”): Extra hops are interleaved before or after the gold path, indicating a failure to determine when to stop or backtrack (Yadav et al., 6 Aug 2025).
Recognition vs. synthesis failure: For some QA tasks, performance can be largely restored by explicit attention steering (MFAI) to the gold supports, indicating that recognition, not integration, is often the limiting factor. In others, even perfect recognition leaves a residual gap, exposing synthesis or integration limitations (Zhang et al., 18 Jan 2026).

The prevalence of these errors is task- and model-dependent but increases with reasoning depth and context length. For example, in multi-hop QA over 128k-token novels, the final-hop integration and drift errors account for the majority of failures at 4-hop depth (Gupta et al., 20 May 2025).

5. Theoretical and Causal Perspectives on (Non-)Compositionality

Theoretical analyses formalize how models’ expressivity with respect to compositional functions is tightly controlled by their symbolic routing and depth (Ram et al., 2024). Key observations include:

Sequence models with input-agnostic computation graphs (dense attention, classic RNNs) cannot efficiently or reliably emulate input-specific multi-hop reasoning; approximation error increases with the compositional complexity $C(f)$ required by the task.
Systematic generalization decays with hop count $k$: as the number of reasoning hops (or recombinations) required at test time increases beyond what the models see at training time, generalization error grows, and systematicity failures dominate (Kudo et al., 2023).
In cross-modal setups, block-identifiability theorems guarantee recovery of shared semantic latent spaces only up to invertible heads, not of the precise compositional decomposition; hence composition nonidentifiability persists and is compounded with multiple hop-like operations (Chen et al., 30 Oct 2025).

6. Mitigation Strategies and Architectural Remedies

Several approaches demonstrate empirical or theoretical promise in narrowing the compositionality gap and repairing multi-hop failure modes:

Elicitive prompting: Chain-of-thought (CoT) and self-ask prompting make intermediate reasoning explicit, raising multi-hop accuracy and narrowing the compositionality gap on hard QA datasets (Press et al., 2022). The self-ask approach, in particular, enables the injection of external answer modules.
External knowledge injection: Routing model-generated sub-questions to a search engine or external memory further increases accuracy on compositional QA benchmarks (Press et al., 2022).
Factual shortcut erasure: Identification and ablation of neurons corresponding to composite shortcuts decreases shortcut-induced editing failures and improves propagation of local edits (Ju et al., 2024).
Memory injection: Direct insertion of intermediate facts into attention layers recovers correct multi-hop outputs and boosts the next-token probability of the correct answer in controlled studies (Sakarvadia et al., 2023).
Back attention and layered re-entry: Allowing lower layers to attend to higher-layer hidden states (rather than the standard bottom-up stack) restores missing intermediate representations; retrofitted back attention modules raise multi-hop accuracy in arithmetic and reasoning tasks, at negligible parameter cost (Yu et al., 15 Feb 2025).
Intermediate-hop supervision and stopping criteria: Fine-tuning models to explicitly verify and chain-hop sub-answers at each step, and introducing learned stopping criteria, reduces overhopping and increases reasoning fidelity (Gupta et al., 20 May 2025, Yadav et al., 6 Aug 2025).
Hard-negative mining with iterated compositional operations: In contrastive language-image pretraining, systematic generation of hard negatives using multi-hop permutations (SWAP, REPLACE, ADD) improves compositional robustness of encoders (Chen et al., 30 Oct 2025).
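The combination of elicitive prompting and external knowledge injection described in the first two items can be sketched as a loop that answers one sub-question at a time and feeds each answer into the next. This is a simplified illustration, not the self-ask implementation of Press et al.: the decomposition is supplied by hand instead of being generated by a model, a dictionary stands in for the search engine, and `answer_multihop` and the `{prev}` placeholder are hypothetical names.

```python
from typing import Optional


def answer_multihop(templates: list[str],
                    lookup: dict[str, str]) -> tuple[Optional[str], list[tuple[str, str]]]:
    """Answer a chain of sub-questions, substituting each answer into the next.

    templates: sub-questions in hop order; "{prev}" marks where the previous
               answer is inserted (hand-written decomposition, an assumption).
    lookup:    stand-in for an external search engine or memory.
    """
    trace: list[tuple[str, str]] = []
    prev: Optional[str] = None
    for template in templates:
        question = template.format(prev=prev)
        if question not in lookup:
            return None, trace        # stop: required evidence is missing
        prev = lookup[question]
        trace.append((question, prev))
    return prev, trace


# Toy external memory for the question "Where was the director of Inception born?"
knowledge = {
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
}
answer, trace = answer_multihop(
    ["Who directed Inception?", "Where was {prev} born?"], knowledge)
print(answer)
for question, intermediate in trace:
    print(f"  {question} -> {intermediate}")
```

Returning the trace alongside the answer keeps every intermediate hop inspectable, which is also what intermediate-hop supervision and stopping criteria operate on.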

7. Benchmarks, Open Problems, and Future Directions

Advances in multi-hop QA and compositionality research are increasingly driven by diagnostic datasets that carefully control hop depth, context length, evidence separation, and adversarial distractor configurations. Notable contributions include NovelHopQA for long-context narrative QA (Gupta et al., 20 May 2025), logit-level interpretability and recall-extract frameworks (Liu et al., 7 Jan 2026, Yu et al., 15 Feb 2025), and token-level causal probes for vision-language brittleness (Chen et al., 30 Oct 2025).

Persistent open challenges include:

Closing the systematicity gap: generalizing reliably to unseen recombinations of known primitives (Kudo et al., 2023).
Internalizing stepwise compositional chains, rather than depending on shortcut memorization or prompt scaffolding (Ju et al., 2024).
Dynamic, input-dependent computation graphs or symbolic skeletons that align with multi-hop demands (Ram et al., 2024).
Robust tracking, coreference, and memory across long sequences and multiple reasoning steps (Gupta et al., 20 May 2025).

Development of architectures and training paradigms that directly minimize compositional complexity, foster symbolic modularity, and enforce input-conditioned routing remains critical for achieving reliable, scalable multi-hop reasoning in future models.