Visual Analogy Riddle: Reasoning & Security
- Visual analogy riddles are structured visual puzzles that require solvers to infer relational mappings between images through invariant transformations.
- They serve as key benchmarks for evaluating compositional reasoning in vision-language models using both closed-form choice and open-form synthesis protocols.
- Their application extends to security exploits such as object replacement attacks that leverage latent semantic associations to bypass model safety checks.
A visual analogy riddle is a structured visual puzzle that tests the ability to recognize analogical relationships between objects or concepts through images. Such riddles exploit invariances and transformations in visual structure, requiring a solver (human or model) to infer a mapping or equivalence from patterns of similarity and difference. While popular in cognitive psychology and standardized testing, visual analogy riddles have become central testbeds for evaluating generalization, compositional reasoning, and robustness in contemporary computer vision, multimodal learning, and vision-language models (VLMs).
1. Formal Structure of Visual Analogy Riddles
A visual analogy riddle typically presents a set of images or entities in a structured pattern, often in the canonical “A:B::C:?” format. Here, images A and B are related by a specific transformation or abstract relationship, and C is paired with a missing counterpart “?”, which the solver must infer to complete the analogy. Formally, this can be modeled as a mapping function $f$ such that $f(A, B, C) = D$, where the answer $D$ must stand to $C$ in the same relation that links $A$ and $B$.
Key properties include:
- The requirement for relational, not absolute, similarity: solvers must identify how $A$ is to $B$ as $C$ is to the missing answer $D$.
- The prevalence of abstract rule transformations (e.g., rotation, color change, compositional operations).
- The separation between surface similarities and structural analogies.
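One common way to make the relational requirement concrete is an offset heuristic: describe each image by a feature vector, then pick the candidate whose offset from C is closest to the offset from A to B. The sketch below is illustrative only. The two hand-written features (size, rotation in degrees) and the names `offset` and `solve_analogy` are assumptions for this example, not part of any benchmark discussed here.

```python
def offset(src, dst):
    """Element-wise difference dst - src: the 'transformation' between two feature vectors."""
    return [d - s for s, d in zip(src, dst)]

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def solve_analogy(a, b, c, candidates):
    """Return the index of the candidate D whose offset from C best matches B - A."""
    target = offset(a, b)
    costs = [squared_distance(target, offset(c, d)) for d in candidates]
    return min(range(len(candidates)), key=costs.__getitem__)

# Hypothetical features: [size, rotation_degrees]
a = [1.0, 0.0]    # small shape, upright
b = [1.0, 90.0]   # same shape rotated by 90 degrees
c = [2.0, 0.0]    # larger shape, upright
candidates = [
    [1.0, 90.0],  # copy of B: similar on the surface, wrong relation to C
    [2.0, 90.0],  # C rotated by 90 degrees: the structural answer
    [2.0, 0.0],   # copy of C: no transformation applied
]
best = solve_analogy(a, b, c, candidates)
print(best)
```

The first candidate looks exactly like B, yet it loses to the second because only the second preserves the A-to-B transformation when applied to C. This is the surface-versus-structure distinction from the list above.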
2. Core Methodologies: Datasets and Evaluation Protocols
Visual analogy riddles have informed both data design and model evaluation. Early instances appear in the Bongard Problems and Raven’s Progressive Matrices; the latter served as the blueprint for machine learning benchmarks such as RAVEN, while synthetic datasets such as CLEVR probe compositional visual reasoning more broadly.
Evaluation protocols distinguish between:
- Closed-form choice: Given several candidates for “?”, select the one with the correct analogical relation.
- Open-form synthesis: Generate or retrieve a visual answer filling the analogical relationship.
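The two protocols call for different scoring. As a minimal sketch, closed-form choice can be scored as plain accuracy over chosen candidate indices, and the retrieval variant of open-form synthesis can be scored with a hit-at-k check. The function names and the choice of hit-at-k are assumptions for illustration; the benchmarks mentioned here may define their metrics differently.

```python
def closed_form_accuracy(predictions, answers):
    """Fraction of riddles where the chosen candidate index equals the ground-truth index."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def open_form_hit_at_k(ranked_ids, target_id, k):
    """True if the target item appears among the top-k retrieved items for one riddle."""
    return target_id in ranked_ids[:k]

# Hypothetical results for four closed-form riddles (candidate indices)
accuracy = closed_form_accuracy([1, 0, 2, 3], [1, 2, 2, 3])
print(accuracy)

# Hypothetical ranked retrieval result for one open-form riddle
hit = open_form_hit_at_k(["img_7", "img_2", "img_9"], "img_2", k=2)
print(hit)
```

Scoring generated (rather than retrieved) answers needs a separate similarity or human judgment step, which this sketch does not attempt.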
Recent VLMs are evaluated on their ability not only to identify the correct candidate but also to express the inferential process and generalize to novel relations unseen during training. A fundamental finding is that while VLMs exhibit strong pattern-matching abilities, they often struggle with abstract relational mapping, especially when distractors are adversarially constructed (Azulay et al., 1 May 2026).
3. Visual Analogy and Object Replacement
The logic of visual analogy riddles has been leveraged to analyze and conduct object replacement attacks on VLMs. In the context of multimodal jailbreak research, such as the Visual Object Replacement (VOR) attack (Azulay et al., 1 May 2026), the riddle format is intentionally repurposed to circumvent model safety constraints. Here, benign surrogates are visually substituted for prohibited objects, and VLMs are prompted to draw the hidden analogy, reconstructing the latent prohibited referent and acting on it. The analogy exploit leverages the model’s ability to integrate visual context and latent associations, passing compliance checks designed for explicit content.
In this framework, the riddle operates as follows: edited images containing benign objects (e.g., bananas) are paired with “neutralized” prompts that use placeholders (e.g., X₁) and carry explicit instructions to decode the assignment X₁→object-implied-by-image. The model is then expected to solve the visual analogy.
4. Mechanistic Modeling and Interpretability
The processing of visual analogy riddles by deep neural models involves interactions between early-layer appearance-driven activations and mid-to-late-layer semantic associations. Empirical interpretability analyses (Azulay et al., 1 May 2026) on large VLMs (e.g., Qwen3-VL-32B) reveal:
- Early transformer layers are dominated by the literal visual appearance of the benign substitute.
- As processing advances, contextual, analogical associations reawaken the latent semantics (e.g., “banana” → “bomb”) based on scene and prompt structure.
- Refusal-detection mechanisms, guided by appearance-based directions in representation space, can be bypassed; semantic compliance can emerge purely via relational analysis and analogical inference.
This mechanistic dissociation directly explains why visual analogy riddles, when used adversarially, can bypass robust safety tuning in vision-language systems.
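The idea of a "direction in representation space" can be illustrated with a generic layer-wise projection: take one hidden-state vector per layer and measure its scalar projection onto a fixed direction. The sketch below shows only that generic probing technique and is not the procedure used in the cited analysis. The vectors are made-up toy values rather than measurements, and the names `projection` and `layerwise_projection` are assumptions.

```python
import math

def projection(hidden, direction):
    """Scalar projection of a hidden-state vector onto the (normalized) direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return sum(h * d for h, d in zip(hidden, direction)) / norm

def layerwise_projection(hidden_by_layer, direction):
    """Projection of each layer's hidden state onto the same direction."""
    return [projection(h, direction) for h in hidden_by_layer]

# Toy 2-D hidden states for three layers (illustrative values only)
hidden_by_layer = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]
# Hypothetical direction of interest; it is normalized inside projection()
direction = [2.0, 0.0]

profile = layerwise_projection(hidden_by_layer, direction)
print(profile)
```

A profile that changes with depth, as in this toy data, is the kind of pattern such an analysis looks for when asking at which layers a given concept becomes represented.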
5. Applications and Security Implications
Beyond cognitive assessment and robustness benchmarks, visual analogy riddles underpin a class of VLM security exploits. The VOR attack, for example, achieves attack success rates (ASR) that are competitive with or exceed those of textual analogues across multiple frontier models (Azulay et al., 1 May 2026). The table below compares textual and visual analogy-based attack routes for three of the models:
| Model | Textual Replacement ASR (%) | VOR Analogy Attack ASR (%) |
|---|---|---|
| Gemini-3.1-Pro | 19.0 | 45.6 |
| Qwen3-VL-32B | 39.0 | 41.1 |
| Claude-Haiku-4.5 | 8.1 | 4.1 |
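In five of the six models tested (three of which appear in the table above), visual analogy-based attacks matched or outperformed textual baselines; Claude-Haiku-4.5, where the textual baseline scored higher (8.1% versus 4.1%), is the exception. This suggests that analogy riddles are not only theoretically elegant but also operationally potent in modern multimodal AI security contexts.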
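The gap between the two attack routes can be read off the table directly. The short sketch below computes the difference in percentage points for each listed model, using only the values from the table; the function name `asr_gap` is an assumption for this example.

```python
# (model, textual replacement ASR %, VOR analogy attack ASR %) copied from the table above
rows = [
    ("Gemini-3.1-Pro", 19.0, 45.6),
    ("Qwen3-VL-32B", 39.0, 41.1),
    ("Claude-Haiku-4.5", 8.1, 4.1),
]

def asr_gap(rows):
    """Map each model to (VOR ASR - textual ASR) in percentage points, rounded to 1 decimal."""
    return {model: round(visual - textual, 1) for model, textual, visual in rows}

gaps = asr_gap(rows)
for model, gap in gaps.items():
    verdict = "visual higher" if gap > 0 else "textual higher" if gap < 0 else "tie"
    print(f"{model}: {gap:+.1f} pp ({verdict})")
```

For the three listed models the gaps are +26.6, +2.1, and -4.0 percentage points, so the visual advantage is large for Gemini-3.1-Pro, marginal for Qwen3-VL-32B, and reversed for Claude-Haiku-4.5.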
6. Mitigation and Future Directions
Defending against analogy-driven visual attacks requires new evaluation and mitigation strategies. Current defenses relying on input appearance or keyword filtering are insufficient, as the analogy mechanism is fundamentally relational and context-driven.
Effective mitigation combines:
- Input-side novelty detection: recognizing analogical transformations and surrogate mappings.
- Output-side, modality-agnostic classifiers: classifying compliance with harmful intent regardless of content modality (Azulay et al., 1 May 2026).
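The output-side idea can be sketched as a thin wrapper that screens only the model's response, so the check behaves the same whether the request arrived as text or as an image. This is a minimal sketch under stated assumptions: `Request`, `guarded_generate`, `generate`, and `is_harmful` are hypothetical names, and the stand-in classifier below is a placeholder rather than a real safety model.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Request:
    text: str
    image_bytes: Optional[bytes] = None  # set for multimodal requests

def guarded_generate(
    request: Request,
    generate: Callable[[Request], str],
    is_harmful: Callable[[str], bool],
    refusal: str = "I can't help with that.",
) -> str:
    """Run the model, then screen only its output; the input modality is never inspected."""
    response = generate(request)
    return refusal if is_harmful(response) else response

# Stand-ins for illustration only: a fake model and a fake output classifier
def fake_generate(request: Request) -> str:
    return "UNSAFE: decoded instructions" if request.image_bytes else "A harmless answer."

def fake_is_harmful(response: str) -> bool:
    return response.startswith("UNSAFE")

blocked = guarded_generate(Request("Explain X1", b"\x00"), fake_generate, fake_is_harmful)
allowed = guarded_generate(Request("Hello"), fake_generate, fake_is_harmful)
print(blocked)
print(allowed)
```

Because the classifier sees the response rather than the prompt or image, a surrogate object in the input does not change what it evaluates. The quality of the defense then depends entirely on the classifier supplied as `is_harmful`.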
A plausible implication is that robust VLM safety and general-purpose multimodal reasoning will increasingly require models to explicitly represent, reason about, and, potentially, refuse to execute inferred analogies that violate policy, even when the analogy is latent and visually obfuscated.
7. Related Research and Connections
Visual analogy riddles intersect with research in object swapping for personalized visual editing (e.g., SwapAnything (Gu et al., 2024)), security and attack analysis (Azulay et al., 1 May 2026), object insertion in video (e.g., InVi (Saini et al., 2024)), and cognitive benchmarks for abstract relational reasoning. Advances in VLM alignment, object replacement, and masked reasoning are likely to further clarify the role of analogy in both robust reasoning and adversarial manipulation.
Key papers:
- “Jailbreaking Vision-LLMs Through the Visual Modality” (Azulay et al., 1 May 2026)
- “SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing” (Gu et al., 2024)
- “InVi: Object Insertion In Videos Using Off-the-Shelf Diffusion Models” (Saini et al., 2024)