Analogical Reasoning in LLMs

Updated 2 December 2025
  • Analogical reasoning in LLMs is defined as the capacity to map deep structural similarities from a source to a target, enabling transfer and robust generalization.
  • Methodologies such as analogical prompting, embedding-based retrieval, and structured problem restatement enable LLMs to extract and apply relational patterns.
  • Empirical benchmarks indicate substantial gains in structural accuracy and generalization, though challenges in robustness and abstraction persist.

Analogical reasoning in LLMs refers to their capacity to encode, retrieve, and apply shared relational structures between disparate problems, domains, or representations. Rather than simple pattern-matching, true analogical reasoning requires mapping structural similarities—often deep or abstract, rather than surface-level—across different contexts, enabling generalization and transfer. Recent research demonstrates that advanced LLMs increasingly exhibit analogical capabilities, with performance now rivaling or surpassing human baselines on a variety of benchmarks, even as core limitations in transfer robustness, mapping fidelity, and abstraction remain unresolved.

1. Conceptual Foundations and Cognitive Formalization

The formal definition of analogical reasoning in LLMs is grounded in classic cognitive theories, notably Gentner’s Structure Mapping Theory, which frames analogy as mapping entities and relations from a source representation $S$ onto a target $T$, preserving relational structure via a bijection $M$ such that key relations are carried over: $M(r(s_1, \dots)) \approx r(M(s_1), \dots)$ (Webb et al., 2022, Lee et al., 25 Nov 2025). In LLMs, this manifests as the model’s internal representations capturing not just pairwise or surface similarities, but systematic correspondences between variables, operations, or causal chains, formalized as:

$$\text{Analogical Mapping:} \qquad M: S \to T$$

where $S$ and $T$ are sets of entities or concepts, and $M$ maximizes the preservation of relations $R_S$ and $R_T$.

Recent frameworks such as MetaLadder extend this to operationalize the retrieval and adaptation of meta-problems $P_\mathrm{meta} = (Q_\mathrm{meta}, \mathrm{CoT}_\mathrm{meta})$ for transfer to a target problem $P_\mathrm{target}$ via induced, not explicit, mappings (Lin et al., 19 Mar 2025). Analogical structure abduction further formalizes this as an injective mapping over concept sets, maximizing relation overlap (Yuan et al., 2023).
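
To make the optimization concrete, the brute-force sketch below searches injective mappings $M: S \to T$ for the one preserving the most relations. The entity and relation sets are toy inventions for illustration; actual structure-mapping engines and LLM-internal mechanisms are far less explicit.

```python
from itertools import permutations

def mapping_score(mapping, relations_s, relations_t):
    """Count source relations (r, s1, s2) whose images (r, M(s1), M(s2))
    also hold in the target; M is chosen to maximize this count."""
    return sum((rel, mapping[a], mapping[b]) in relations_t
               for rel, a, b in relations_s)

def best_mapping(entities_s, entities_t, relations_s, relations_t):
    """Exhaustive search over injective mappings M: S -> T.
    Factorial time; purely illustrative of the objective."""
    best, best_score = None, -1
    for image in permutations(entities_t, len(entities_s)):
        m = dict(zip(entities_s, image))
        score = mapping_score(m, relations_s, relations_t)
        if score > best_score:
            best, best_score = m, score
    return best, best_score

# Toy solar-system -> atom analogy with hypothetical relation triples.
S, T = ["sun", "planet"], ["nucleus", "electron", "photon"]
R_S = {("attracts", "sun", "planet"), ("orbits", "planet", "sun")}
R_T = {("attracts", "nucleus", "electron"), ("orbits", "electron", "nucleus")}
print(best_mapping(S, T, R_S, R_T))
# ({'sun': 'nucleus', 'planet': 'electron'}, 2)
```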

2. Methodological Advances: Prompting and Inductive Pipelines

Analogical reasoning in LLMs is elicited through a diverse array of prompting paradigms:

  • Analogical Prompting: The LLM is prompted to self-generate or retrieve structurally analogous exemplars and their reasoning traces before solving the target problem, integrating chain-of-thought (CoT) with analogical transfer (Yasunaga et al., 2023, Lin et al., 19 Mar 2025).
  • Embedding-based Retrieval: Analogues are dynamically selected from a curated library or training corpus using cosine similarity over encoder embeddings, ensuring topological or semantic proximity to the new query (Wang et al., 9 Jan 2024); a minimal retrieval sketch follows this list.
  • Structured Problem Restating: LLMs are explicitly prompted to restate or paraphrase the target, strengthening internal comprehension and focusing attention on the salient structural features (Lin et al., 19 Mar 2025).
  • Abduction-Deduction Pipelines: Two-stage reasoning, with initial pattern extraction (abduction) from the source and application (deduction) to the target, systematically outperforms direct induction or zero-shot CoT on high-difficulty analogical tasks (Zheng et al., 16 Feb 2025).
  • Thought Propagation/Analogical Planning: The LLM recursively proposes and solves analogues, propagating their reasoning steps and solution strategies to guide the final answer, mitigating error accrual in deep reasoning (Yu et al., 2023).
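
As a minimal sketch of the retrieval step: the `embed` function below is a placeholder for any sentence encoder (e.g., a SentenceTransformer's `encode` method), and the library schema is an assumption rather than a cited implementation.

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_analogues(query, library, embed, k=3):
    """Rank a library of solved problems by embedding similarity to the
    query and return the top-k as candidate analogues."""
    q = embed(query)
    return sorted(library,
                  key=lambda item: cosine_sim(q, embed(item["problem"])),
                  reverse=True)[:k]

# Each library item pairs a problem with its reasoning trace, e.g.
# {"problem": "...", "cot": "..."}; precomputing embeddings is advisable.
```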

These approaches are often deployed in multi-phase, compositional workflows (see pseudocode in (Lin et al., 19 Mar 2025, Yu et al., 2023)), with ablations confirming that tailored, compatible exemplars and explicit structural transfer are critical for gains over baseline prompting; a simplified pipeline sketch follows.
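
The sketch below composes restating, analogue retrieval or self-generation, and chain-of-thought solving into one pipeline. The `llm(prompt) -> str` completion function, phase order, and prompt wording are illustrative assumptions, not the exact templates from the cited papers.

```python
def analogical_solve(problem, llm, retrieve=None):
    """Restate the target, obtain analogous solved exemplars (retrieved
    or self-generated), then solve with chain-of-thought transfer."""
    restated = llm(
        "Restate this problem, naming its key quantities and relations:\n"
        + problem)
    if retrieve is not None:
        exemplars = "\n\n".join(
            f"Q: {e['problem']}\nA: {e['cot']}" for e in retrieve(problem))
    else:
        exemplars = llm(
            "Recall three problems structurally analogous to the following, "
            "each with a worked solution:\n" + problem)
    return llm(
        "Use the analogous examples to solve the target step by step.\n\n"
        f"Analogues:\n{exemplars}\n\nTarget (restated): {restated}\n\n"
        f"Target (original): {problem}")
```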

3. Empirical Performance and Limitations

Meta-analyses across mathematics, scientific analogies, narratives, visual analogies, and strategic decision-making consistently reveal the following performance signatures:

  • Benchmarks and Aggregate Gains: On mathematical reasoning benchmarks (GSM8K, MATH), frameworks such as MetaLadder and analogical prompting achieve up to 10.3% accuracy gain over standard CoT, narrowing the human-LLM gap (Lin et al., 19 Mar 2025, Yasunaga et al., 2023). In downstream applications—structured user demands, materials discovery, and strategic simulations—analogical approaches yield substantial improvements in structural accuracy, F1, diversity, and utility of generated hypotheses (Wang et al., 9 Jan 2024, Guo, 25 Oct 2025, Musker et al., 19 Jun 2024, Puranam et al., 1 May 2025).
  • Scaling and Model-Size Effects: Larger models (LLaMA-70B, GPT-4o) consistently outperform smaller counterparts on analogical benchmarks, especially narrative ones (e.g., story analogies: LLaMA-70B exceeds 0.91 accuracy with enhanced prompts vs. roughly 0.74 for 8B models) (Inani et al., 15 Jul 2025).
  • Fine-grained Human-Model Alignment: While accuracy approaches or exceeds human performance in some settings, item-level or difficulty-wise alignment with human error profiles remains weak—smaller models can sometimes more closely mirror human variability even when less accurate overall (Inani et al., 15 Jul 2025).
  • Failure Modes: Robustness remains the principal limitation. Under systematic variants (permuted alphabets, non-canonical blanks, and paraphrased narratives), LLMs suffer accuracy drops of roughly 20–40 points, whereas humans remain invariant (Lewis et al., 21 Nov 2024); a toy perturbation generator follows this list. LLMs preferentially exploit surface cues, with brittleness manifesting as answer-order effects or overreliance on lexical overlap (Lewis et al., 21 Nov 2024, Sourati et al., 2023).
  • Structural Mapping Gap: LLMs reliably encode relational content in mid-upper layers but frequently fail to apply such relations to novel contexts. These failures trace to insufficient transfer through “link” tokens (e.g., “as”) or suboptimal routing of abstract representations, as revealed by probe and patchscope analyses (Lee et al., 25 Nov 2025). Activation patching can remediate upwards of half of failures, implicating bottlenecks in relation transfer rather than relation extraction.
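
One such perturbation, permuted-alphabet letter-string analogies, can be generated with the toy sketch below. The canonical item and protocol details are simplified assumptions in the spirit of (Lewis et al., 21 Nov 2024), whose actual stimuli are more carefully controlled.

```python
import random
import string

def permuted_item(seed=42):
    """Relabel a canonical letter-string analogy under a random bijection.
    The permuted alphabet order must be stated in the prompt so the
    abstract successor relation is preserved while surface cues change."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    relabel = dict(zip(letters, shuffled))
    canonical = {"source": "abc", "transformed": "abd",
                 "target": "ijk", "answer": "ijl"}
    item = {k: "".join(relabel[c] for c in v) for k, v in canonical.items()}
    item["alphabet"] = "".join(relabel[c] for c in letters)
    return item

# A solver relying on abstract structure scores equally on canonical and
# permuted items; the reported 20-40 point drops indicate surface reliance.
```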

4. Applications: From Scientific Discovery to Decision-Making

Analogical reasoning in LLMs is increasingly operationalized in practical domains:

  • Materials Discovery: Cross-domain analog retrieval identifies design logics from disparate scientific fields (e.g., data-center backbones for robust ion percolation networks), enabling the proposal of battery materials beyond traditional subspace exploration. In-domain template induction via IF–THEN analogical rules fosters interpretable and diverse candidate generation (Guo, 25 Oct 2025).
  • Business and Strategic Reasoning: In source-to-target matching tasks mimicking strategic problem-solving, LLMs serve as high-recall engines, surfacing a breadth of analogies. Precision in matching structural (causal) analogies, however, remains human-dominated, indicating a complementary “division of labor” in organizational workflows (Puranam et al., 1 May 2025).
  • Probabilistic Analogy for Uncertainty: Probabilistic factor-profile alignment and KL-based retrieval of structurally similar scenarios drive analogical reasoning for decision-making under uncertainty, yielding gains in accuracy and decision balance (Hu et al., 2 Oct 2024); a minimal retrieval sketch follows this list.
  • Narrative and Long-Text Analogy: LLMs excel on near, surface-level narrative analogies but struggle with far, system-level mappings, especially when distractors differ only in high-level message or relational structure. Relational CoT and example-based prompting partially ameliorate these gaps (Sourati et al., 2023, Wijesiriwardene et al., 2023).
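
A minimal sketch of the KL-based retrieval step, assuming each scenario is summarized as a discrete probability profile over a shared set of uncertainty factors; the factors and numbers below are invented for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """D_KL(p || q) between discrete factor profiles, with smoothing."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def nearest_scenario(query_profile, library):
    """Return the stored scenario whose profile minimizes KL divergence
    from the query, i.e., the closest structural analogue."""
    return min(library, key=lambda s: kl_divergence(query_profile, s["profile"]))

# Profiles over hypothetical factors: (market risk, supply risk, demand shift).
library = [
    {"name": "credit shock",   "profile": [0.70, 0.10, 0.20]},
    {"name": "supply squeeze", "profile": [0.20, 0.60, 0.20]},
]
print(nearest_scenario([0.60, 0.15, 0.25], library)["name"])  # "credit shock"
```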

5. Model Analysis and Interpretability

Interpretability research reveals the inner mechanics and constraints of LLM analogical reasoning:

  • Layerwise Representation: Attribution and relational information propagate through mid-upper transformer layers, as shown by Patchscopes and activation manipulation. Successful analogical transfer is marked by strong mutual alignment scores (MAS), linking source and target representation via one-to-one structural similarity; failure cases manifest misaligned or degraded MAS (Lee et al., 25 Nov 2025).
  • Probe and Knockout Techniques: Attention knockout and probe-based separability analyses diagnose which layers and heads encode analogical patterns, and quantify their distinctiveness relative to surface-level distractors. Intervention studies demonstrate that swapping in or patching correct relational vectors into key token spans can rescue performance on up to 60% of previously failed analogies (Lee et al., 25 Nov 2025); a simplified patching sketch follows this list.
  • Error Taxonomy and Strategy Characterization: LLMs frequently default to “copy,” “matrix,” or arithmetic heuristics in visual analogies, mirroring young children but distinctly diverging from adult conceptual abstraction (Opiełka et al., 13 Mar 2024). On narrative and strategy tasks, they exhibit a high-recall, low-precision retrieval pattern, whereas humans sparsely select but more reliably align to structural analogs (Puranam et al., 1 May 2025).
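
The flavor of such an intervention can be sketched with a PyTorch forward hook. The layer choice, token span, and donor vector are placeholders, and this simplified patch stands in for the cited papers' more careful protocols.

```python
import torch

def patch_span(layer: torch.nn.Module, token_span, donor: torch.Tensor):
    """Register a forward hook on a transformer block that overwrites the
    hidden states of `token_span` with `donor` (e.g., the relational vector
    captured from a run where the analogy succeeded)."""
    start, end = token_span

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, start:end, :] = donor  # swap in the donor relation vector
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return layer.register_forward_hook(hook)

# Hypothetical usage: capture hidden states at the "as"-token span of a solved
# analogy, patch them into the failed one, re-decode, then handle.remove().
```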

6. Open Challenges and Future Research Directions

Despite marked progress, several foundational challenges persist:

  • Robustness and Transfer: LLMs’ apparent analogical fluency is often fragile, unraveling under controlled perturbations or in far-analogy settings (Lewis et al., 21 Nov 2024). Genuine abstraction, distinguishing deep structure from superficial cues, remains elusive, necessitating new benchmarks and invariance-based evaluation metrics.
  • Mapping Algorithms and Representation Bottlenecks: Applying extracted relations to novel entities remains a core limitation, driven by the lack of explicit “relation routing” pathways or compositional mapping modules in existing architectures (Lee et al., 25 Nov 2025). Explicit neuro-symbolic hybrids and auxiliary objectives for structural consistency are active areas of research (Yuan et al., 2023).
  • Task and Data Curation: Many current analogical benchmarks are limited in scope (e.g., 18 items for story analogies, simple proportional or word-level pairs); larger, more diverse, and structurally challenging datasets (e.g., AnaloBench, ParallelPARC) are needed to advance the field (Sourati et al., 2023, Wijesiriwardene et al., 2023).
  • Prompting Strategy Optimization: Intuitively, optimal analogue retrieval should balance relevance and accuracy; however, studies show that on mathematical tasks the diversity and correctness of demonstrations matter more than semantic relevance, challenging naive analogical assumptions (Qin et al., 19 Apr 2024); a selection sketch follows this list.
  • Interpretability and Human Alignment: Further mechanistic dissection—linking induction heads, attention patterns, and relational vector paths to analogical processing—is required to bridge the gap to human-like generalization (Musker et al., 19 Jun 2024). Benchmarks emphasizing alignment to human error profiles and dynamic response to “adversarial” or “distractor” probes should become a standard part of LLM evaluation (Inani et al., 15 Jul 2025).
  • Scalability and Cost: While analogical prompting and meta-reasoning frameworks offer significant gains, they often incur nontrivial computational cost due to increased token lengths or multiple inferences. Shortcut inference and knowledge distillation remain promising avenues for practical deployment (Lin et al., 19 Mar 2025, Wang et al., 9 Jan 2024).
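
To make the relevance-vs-diversity trade-off concrete, a maximal-marginal-relevance style selector is sketched below; MMR is a standard retrieval heuristic used here for illustration, not the specific method of (Qin et al., 19 Apr 2024). Filtering the candidate pool to demonstrations with verified-correct reasoning chains before selection addresses the correctness finding.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(query_vec, demo_vecs, k=4, lam=0.5):
    """Greedily pick k demonstration indices, trading relevance to the
    query (weight lam) against redundancy with picks so far (1 - lam)."""
    selected, remaining = [], list(range(len(demo_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, demo_vecs[i])
            redundancy = max((cosine(demo_vecs[i], demo_vecs[j])
                              for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected  # indices into the demonstration pool
```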

7. Synthesis and Outlook

Recent advances demonstrate that analogical reasoning in LLMs is both emergent and malleable, showing remarkable progress in pattern extraction, solution transfer, and structured mapping across problem formats. Meta-architectures such as MetaLadder and plug-in planning frameworks deliver consistent, interpretable gains across mathematics, scientific discovery, and narrative reasoning, even as open problems of robustness, compositional mapping, and deep abstraction persist. Addressing these will require new benchmarks, explicit relation-routing mechanisms, and hybrid symbolic–neural approaches, as well as principled evaluation of alignment to human analogical cognition at both aggregate and item-resolved granularity (Lin et al., 19 Mar 2025, Lewis et al., 21 Nov 2024, Lee et al., 25 Nov 2025). The research trajectory suggests a convergence towards LLMs capable of human-comparable analogical competence, with implications for scientific innovation, decision support, and automated generalization.
