Task Analogies in Structured Reasoning

Updated 26 June 2026

Task analogies are structured reasoning tasks that require mapping deep relational correspondences between a source and a target despite superficial differences.
They are applied in cognitive and AI research to evaluate transfer, abstraction, and systematicity across visual, linguistic, multimodal, and system-level domains.
Evaluation methods focus on structural alignment, sound mapping, and resistance to distractors, driving advancements in analogical modeling and benchmark design.

Task analogies are a family of structured reasoning tasks in which an agent is presented with at least two situations—typically a “source” and a “target”—and is required to identify, generate, or select a mapping between them such that deep structural relations are preserved across surface differences. Task analogies constitute a core probe of human and machine capacity for transfer, abstraction, and systematicity, providing a rigorous arena for benchmarking both cognitive theories and AI models. The formal structure of these tasks ranges from the canonical proportional format (A : B :: C : ?) to more complex mappings between multi-entity systems, narratives, or even process descriptions, encompassing both symbolic and sub-symbolic domains.

1. Formal Definition and Theoretical Foundations

Task analogies operationalize the concept of analogical mapping: finding a correspondence between elements, relations, or structures of a source domain and a target domain such that relational alignment is maximized. The most basic instance is the proportional (four-term) analogy $A:B\,::\,C:?$, where the solver must select or generate $D$ such that $C:D$ stands in the same relation as $A:B$. Structure Mapping Theory (SMT) provides the canonical framework: let $S = (C_S, R_S)$ and $T = (C_T, R_T)$ be the source and target, each given by a set of components and a set of relations over them. An analogy is a mapping $M: C_S \rightarrow C_T$ under which a relation $r \in R_S$ counts as preserved when a corresponding $r' \in R_T$ holds over the images under $M$ of all of its arguments (structural alignment); $M$ is chosen so that as many relations as possible are preserved (Barakat et al., 22 May 2026).
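The mapping objective above can be illustrated with a minimal brute-force sketch. It rests on simplifying assumptions that are not taken from the cited work: relations are written as (name, arguments) tuples, a source relation counts as preserved only if a relation with the identical name holds over the mapped arguments, and the mapping is injective. The function name `best_mapping` and the solar-system/atom toy data are hypothetical.

```python
from itertools import permutations

def best_mapping(source_entities, target_entities, source_rels, target_rels):
    """Search all injective mappings M: C_S -> C_T and return the one that
    preserves the most source relations, together with that count.

    Assumption: a relation is preserved only when the same relation name
    holds in the target over the mapped arguments.
    Returns ({}, -1) if the target has fewer entities than the source.
    """
    target_set = set(target_rels)
    best, best_score = {}, -1
    for image in permutations(target_entities, len(source_entities)):
        mapping = dict(zip(source_entities, image))
        score = sum(
            (name, tuple(mapping[a] for a in args)) in target_set
            for name, args in source_rels
        )
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score

# Toy data (illustrative only).
source_entities = ["sun", "planet"]
target_entities = ["nucleus", "electron"]
source_rels = [
    ("attracts", ("sun", "planet")),
    ("revolves_around", ("planet", "sun")),
    ("more_massive", ("sun", "planet")),
]
target_rels = [
    ("attracts", ("nucleus", "electron")),
    ("revolves_around", ("electron", "nucleus")),
    ("more_massive", ("nucleus", "electron")),
]

mapping, preserved = best_mapping(
    source_entities, target_entities, source_rels, target_rels
)
print(mapping, preserved)
```

Exhaustive search is factorial in the number of entities, so it is only practical for small systems; it is shown here because it makes the objective explicit.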

Task analogies generalize to:

Visual (e.g., images, shapes, graphs)
Linguistic (words, sentences, paragraphs)
Multimodal (e.g., mapping scientific processes to visual grids (2505.20672))
System-level (matching ontology components or procedural steps (Yuan et al., 2023, Sultan et al., 2022))

Key metrics include mapping soundness, system accuracy (all sub-mappings correct), and structural alignment scores (Barakat et al., 22 May 2026, Yuan et al., 2023).
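The difference between per-component mapping accuracy and system accuracy can be made concrete with a short sketch. The function names and the toy gold/predicted mappings are hypothetical; the only property taken from the text is that system accuracy credits an item only when all of its sub-mappings are correct.

```python
def mapping_accuracy(pred, gold):
    """Fraction of gold component mappings that the prediction reproduces."""
    return sum(pred.get(k) == v for k, v in gold.items()) / len(gold)

def system_accuracy(preds, golds):
    """Fraction of items whose entire mapping is correct."""
    fully_correct = sum(
        all(p.get(k) == v for k, v in g.items())
        for p, g in zip(preds, golds)
    )
    return fully_correct / len(golds)

# Toy items (illustrative only).
golds = [
    {"aperture": "pupil", "film": "retina", "lens": "lens"},
    {"heart": "pump", "blood": "water"},
]
preds = [
    {"aperture": "pupil", "film": "retina", "lens": "lens"},
    {"heart": "pump", "blood": "pipe"},
]

per_item = [mapping_accuracy(p, g) for p, g in zip(preds, golds)]
print(per_item)                       # [1.0, 0.5]
print(system_accuracy(preds, golds))  # 0.5
```

A single wrong component leaves the second item with partial mapping credit but zero system credit, which is why system accuracy is the stricter of the two measures.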

2. Experimental Paradigms and Task Variants

Task analogies appear in diverse settings:

Proportional analogies: Classic “A is to B as C is to D” questions, found in SAT, Google, and BATS benchmarks; evaluated via multiple-choice selection (Ushio et al., 2021, 0809.0124).
System-level analogies: Mapping between multi-component systems, e.g., “camera” components to “eye” components, scored by full system mapping accuracy (Yuan et al., 2023).
Narrative analogies: Aligning two stories at the level of moral, structure, or event chain, e.g., distinguishing “near” (surface+system) from “far” (system-only) analogies (Sourati et al., 2023, Nagarajah et al., 2022).
Scientific/process analogies: Mapping entities and relations across domains, e.g., blood:heart::water:pump, evaluated via relational similarity of process structure (Sultan et al., 2022).
Complex distractor schemes: Advanced datasets include distractors requiring true mapping of relations, not just attribute or surface cues (e.g., order-swapped processes in ParallelPARC (Sultan et al., 2024)).
Visual analogies: Identify the image that completes the same relational transformation as the other images (e.g., VASR (Bitton et al., 2022); compositional part-based models (Ichien et al., 2021)).

Typical experimental formats include binary choice, multiple-choice, open-ended mapping, or structured extraction. Prompting configurations (zero-shot, few-shot, chain-of-thought) and evaluation settings are tuned to probe relational abstraction, not just lexical or perceptual overlap.
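As a rough illustration of how the multiple-choice format combines with zero-shot and few-shot prompting, the sketch below renders items as text. The item schema, the prompt wording, and the function names are assumptions made for this example and are not drawn from any of the cited benchmarks.

```python
def format_item(stem, options, answer=None):
    """Render one multiple-choice analogy item.

    The answer letter is filled in only when `answer` is given, which is
    how few-shot demonstrations differ from the query item.
    """
    a, b, c = stem
    lines = [f"{a} is to {b} as {c} is to:"]
    lines += [f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options)]
    if answer is None:
        lines.append("Answer:")
    else:
        lines.append(f"Answer: ({chr(65 + options.index(answer))})")
    return "\n".join(lines)

def build_prompt(query, demos=()):
    """Zero-shot when `demos` is empty, few-shot otherwise."""
    blocks = [format_item(*demo) for demo in demos]
    blocks.append(format_item(*query))
    return "\n\n".join(blocks)

# Toy items (illustrative only).
query = (("blood", "heart", "water"), ["pump", "river", "glass"])
demo = (("hand", "glove", "foot"), ["sock", "hat"], "sock")

zero_shot = build_prompt(query)
few_shot = build_prompt(query, demos=[demo])
print(few_shot)
```

Keeping item rendering separate from prompt assembly makes it straightforward to vary one factor, such as the number of demonstrations, while holding the item format fixed.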

3. Modeling Approaches and Evaluation Methodologies

Task analogies have driven the development of a diverse modeling toolkit:

Approach	Structural Features	Empirical Findings
Embedding-based (SBERT, etc.)	Cosine similarity on sentence or concept embeddings	Strong at “self” or near analogies, fail on far/system analogies (Sultan et al., 2022, Sourati et al., 2023)
Compositional models	Explicit difference or relation vectors, e.g., part-based, vector subtractions	Capture main human effects; resist non-relational shortcuts (Ichien et al., 2021)
LLMs	End-to-end or prompt-based analogy solution	Match/exceed human accuracy on easy analogies; struggle with deep system or cross-domain mapping (0809.0124, Inani et al., 15 Jul 2025, Lee et al., 25 Nov 2025)
Supervised classifiers	Pattern-based SVMs or ML ranking using phrasal context	Effective for word pairs, less so for higher-order structure (0809.0124)
Search/abduction pipelines	Beam search over mappings or structure abduction	Robust to paraphrase, achieve high mapping accuracy on system analogies (Yuan et al., 2023, Sultan et al., 2022)
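The contrast between surface similarity and explicit relation vectors in the table can be sketched with toy data. The two-dimensional embeddings below are invented so that the effect is visible; they do not come from SBERT or any other real model, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def offset(x, y):
    """Relation vector from x to y (vector subtraction)."""
    return [b - a for a, b in zip(x, y)]

def solve_by_offset(emb, a, b, c, candidates):
    """Pick D whose offset from C best matches the offset from A to B."""
    rel = offset(emb[a], emb[b])
    return max(candidates, key=lambda d: cosine(rel, offset(emb[c], emb[d])))

def solve_by_surface(emb, c, candidates):
    """Baseline: pick the candidate most similar to C, ignoring A and B."""
    return max(candidates, key=lambda d: cosine(emb[c], emb[d]))

# Invented toy embeddings (illustrative only).
emb = {
    "man": [1.0, 0.0], "woman": [1.0, 1.0],
    "king": [3.0, 0.0], "queen": [3.0, 1.0],
    "prince": [2.5, 0.0], "crown": [3.2, 0.1],
}
candidates = ["queen", "prince", "crown"]

relational = solve_by_offset(emb, "man", "woman", "king", candidates)
surface = solve_by_surface(emb, "king", candidates)
print(relational, surface)  # queen prince
```

In this constructed example the surface baseline picks the candidate closest to "king", while the offset method recovers the candidate that repeats the man-to-woman relation.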

Recent work systematizes evaluation via:

System accuracy and mapping completeness (Yuan et al., 2023)
Human-LM concordance on item-level patterns, not just aggregate accuracy (Inani et al., 15 Jul 2025, Barakat et al., 22 May 2026)
Chain-of-thought and self-hint prompting to surface relational structure (Inani et al., 15 Jul 2025, Sourati et al., 2023)
LLM-as-judge frameworks, cross-validated against human rankings for explanation quality and mapping soundness (Barakat et al., 22 May 2026)
Adversarial or structured distractor regimes to pressure models beyond shortcut strategies (Sultan et al., 2024, Bitton et al., 2022)
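Item-level concordance can be sketched as a correlation between per-item human accuracy and per-item model correctness. Pearson correlation is used here as one plausible choice, not necessarily the statistic used in the cited studies, and all numbers are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)

# Invented data: fraction of humans correct per item, and 0/1 model outcomes.
human_acc = [0.9, 0.8, 0.4, 0.3]
model_a = [1, 1, 0, 0]  # fails on the items humans find hard
model_b = [0, 1, 1, 0]  # same aggregate accuracy, different error pattern

r_a = pearson(human_acc, model_a)
r_b = pearson(human_acc, model_b)
print(round(r_a, 3), round(r_b, 3))
```

Both hypothetical models score 50% overall, yet only the first one errs where humans do, which is the kind of distinction that aggregate accuracy hides.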

4. Empirical Insights and Failure Modes

Despite substantial progress, several empirical regularities distinguish human and model performance in task analogies:

Surface vs. deep structure: Both human novices and LLMs tend to rely on surface cues unless forced to abstract relational structure; far/system analogies depress accuracy sharply (Sourati et al., 2023, Sultan et al., 2022, Nagarajah et al., 2022).
Distractor susceptibility: LLMs often select “hard distractors” that preserve surface or first-order role similarity but break higher-order relational structure; humans are more robust in this regime (Sultan et al., 2024, Bitton et al., 2022).
Structural alignment signatures: Correct analogies exhibit high mutual alignment in neural representations (MAS), whereas failures reflect degraded or misplaced alignment (Lee et al., 25 Nov 2025).
Transfer bottlenecks: LLMs may encode relations but fail to apply them, requiring explicit patching or rerouting of hidden states to achieve transfer (Lee et al., 25 Nov 2025).
Prompt and architecture sensitivity: Subtle changes in prompt schema, permutation of pairs, or distractor placement can degrade performance by up to 50 percentage points in some LLMs, whereas humans are relatively insensitive to such manipulations (Musker et al., 2024, Inani et al., 15 Jul 2025).
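One simple way to quantify sensitivity to distractor placement is to re-score the same items under every ordering of the answer options and report the spread in accuracy. The sketch below assumes a hypothetical `solver(stem, options)` callable that returns the chosen option, and it uses a deliberately position-biased toy solver to show what a nonzero gap looks like.

```python
from itertools import permutations

def accuracy_by_ordering(items, solver):
    """Accuracy under each ordering of the options, applied to all items.

    Assumes every item has the same number of options.
    """
    n = len(items[0]["options"])
    results = {}
    for order in permutations(range(n)):
        correct = 0
        for item in items:
            shuffled = [item["options"][i] for i in order]
            correct += solver(item["stem"], shuffled) == item["answer"]
        results[order] = correct / len(items)
    return results

def sensitivity_gap(results):
    """Difference between best-case and worst-case accuracy."""
    return max(results.values()) - min(results.values())

# Toy items and solvers (illustrative only).
items = [
    {"stem": ("hand", "glove", "foot"),
     "options": ["sock", "hat", "scarf"], "answer": "sock"},
    {"stem": ("bird", "nest", "bee"),
     "options": ["honey", "hive", "wing"], "answer": "hive"},
]
key = {item["stem"]: item["answer"] for item in items}

def first_option_solver(stem, options):
    return options[0]      # always picks whatever is listed first

def oracle_solver(stem, options):
    return key[stem]       # ignores option order entirely

biased_gap = sensitivity_gap(accuracy_by_ordering(items, first_option_solver))
oracle_gap = sensitivity_gap(accuracy_by_ordering(items, oracle_solver))
print(biased_gap, oracle_gap)  # 0.5 0.0
```

Enumerating all orderings is feasible only for a handful of options; with more options, sampling a fixed number of random orderings would be a natural substitute.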

Key deficit regimes include process analogies with entangled causal chains, narrative analogies requiring event reordering or goal alignment, and mathematical/sequential analogies that require compositional generalization.

5. Task Design and Benchmark Construction Principles

Robust analogical task design relies on the following empirical and theoretical principles:

Structural alignment: Select source–target pairs that share a deep governing principle or relation (e.g., force equilibrium in physics (Lin et al., 2016)).
Surface variation: Vary superficial features so that relational mapping is required for success (rope vs. friction in Newtonian problems; door vs. wheel in vision (Ichien et al., 2021)).
Explicit mapping scaffolds: Require a stepwise mapping rather than a bare list of correspondences; model and grade mapping quality (Lin et al., 2016, Barakat et al., 22 May 2026).
Contrastive distractors: Include both “close” (within-domain) and “far” (cross-domain) analogies alongside matched distractors that foil superficial strategies (Sultan et al., 2024, Sourati et al., 2023).
Multi-level annotation: Provide mappings at the component, relation, and (for narratives and systems) system levels; leverage gold explanations for calibration (Yuan et al., 2023, Barakat et al., 22 May 2026).
Evaluation beyond aggregate accuracy: Item-level human–model alignment, ablation of reasoning steps, explanation-quality scoring, and ranking-by-utility all supply richer diagnostic insight (Inani et al., 15 Jul 2025, Barakat et al., 22 May 2026).
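The idea of a distractor that defeats superficial strategies can be illustrated with an order-swapped process description. This is a simplified sketch, not the actual ParallelPARC procedure: it swaps two steps of a process so that the vocabulary is unchanged while the event order is broken. The function names and the example steps are hypothetical.

```python
def order_swapped_distractor(steps, i, j):
    """Return a copy of `steps` with positions i and j exchanged."""
    if i == j or steps[i] == steps[j]:
        raise ValueError("the swap must change the sequence")
    out = list(steps)
    out[i], out[j] = out[j], out[i]
    return out

def lexical_overlap(a, b):
    """Jaccard overlap of the word sets of two step lists (surface measure)."""
    words_a = set(" ".join(a).split())
    words_b = set(" ".join(b).split())
    return len(words_a & words_b) / len(words_a | words_b)

# Toy process description (illustrative only).
steps = [
    "the heart contracts",
    "blood is pushed into the arteries",
    "blood reaches the organs",
]
distractor = order_swapped_distractor(steps, 0, 2)

print(distractor)
print(lexical_overlap(steps, distractor))  # 1.0
```

Because the distractor has full lexical overlap with the original, a solver that relies on word overlap cannot tell them apart; only the order of events distinguishes the two.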

Notable scalable pipelines for analogy generation include ParallelPARC (LLM-in-the-loop analogical paragraph synthesis (Sultan et al., 2024)), modular four-stage generation/evaluation (Teaching Through Analogies (Barakat et al., 22 May 2026)), and open-/closed-domain analogy mining from scientific or narrative corpora (Sultan et al., 2022, Yuan et al., 2023).

6. Implications and Current Frontiers

Task analogies are a stringent testbed for human–machine parity in abstraction, transfer, and relational generalization. While current LLMs and vision-LLMs outperform chance and match the human ceiling on “near” and surface-aligned analogies, across domains they continue to lag in:

Cross-domain (far/system) analogical reasoning, especially in narrative and process settings (Sourati et al., 2023, Sultan et al., 2022)
Explicit mapping and explanation of analogical correspondences, not just answer selection (Barakat et al., 22 May 2026, Yuan et al., 2023)
Robustness to distractors, order permutations, and complex multi-component mappings (Musker et al., 2024, Bitton et al., 2022)
Extraction and articulation of implicit elements in metaphoric and literary analogies (Boisson et al., 2024)

A key emerging practice is modular benchmarking: leveraging structured and multi-level annotations (sub-concept, relation, system), explicit prompt-based scaffolding, and hybrid evaluation that blends LLM judges and human panels. Integrating chain-of-thought prompting, explanation generation, and adversarial distractor construction is recommended for advancing both model capabilities and diagnostic sharpness.

For future research, the central challenges remain the systematic scaling of task analogies across scientific, visual, and narrative domains, robust cross-domain transfer evaluation, and hierarchical analogical mapping. Task analogies thus serve as both a practical diagnostic apparatus and a foundational problem for developing generalizable, structure-sensitive AI systems (Barakat et al., 22 May 2026, Yuan et al., 2023, Sultan et al., 2024, Sourati et al., 2023, Lee et al., 25 Nov 2025).