Visual Metonymy in Computational Imaging
- Visual metonymy is a phenomenon where images indirectly evoke abstract concepts through associated visual cues, leveraging core semiotic principles.
- The computational pipeline involves generating representamen cues, chain-of-thought reasoning, and image synthesis to avoid literal depictions while evoking target ideas.
- The ViMET dataset demonstrates this process at scale, revealing a sizable gap between human inference and vision-language models in interpreting metonymic imagery.
Visual metonymy is the phenomenon whereby images communicate indirectly by evoking a target concept through associated visual cues, rather than explicit depiction. Drawing on semiotic theory, especially Charles S. Peirce’s triadic model, visual metonymy involves a representamen (a set of concrete signs or “vehicles”) that stands for an object (the target concept) and prompts the viewer—via their interpretant function—to infer the underlying concept based on visual associations. Recent computational advances allow for the algorithmic generation and evaluation of visual metonymy, revealing a notable performance gap between human and machine reasoning in this domain (Ghosh et al., 25 Jan 2026).
1. Formalization of Visual Metonymy
Grounded in Peirce’s semiotic triad, visual metonymy is characterized as a process involving three roles:
- O (object): The abstract target concept intended for evocation (e.g., “Artist”).
- R (representamen): A set of concrete image cues or “vehicles” (e.g., canvas, paint brush).
- I (interpretant): The mental concept that a viewer infers from R.
The inference sequence is formalized as R → I → O. More formally, for a concept O, a metonymic image x is produced by:
- Selecting a set of representamen cues R = A(O), where A is an association function.
- Composing these cues into a visual scene x.
- Ensuring that the viewer-side mapping yields an interpretant I from R such that I = O.
In practice, images contain only the visual cues R and never literal representations of O, specified in set notation as O ∉ R (Ghosh et al., 25 Jan 2026).
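The Peircean roles and the O ∉ R constraint above can be sketched as a small data structure; a minimal illustration with hypothetical names, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetonymicSpec:
    """Peircean roles for one metonymic image: target object O and representamen cues R."""
    target: str        # O: the abstract concept to evoke, e.g. "Artist"
    cues: frozenset    # R: concrete visual vehicles, e.g. {"canvas", "paint brush"}

    def satisfies_constraint(self) -> bool:
        # Core metonymy constraint: the literal target must not appear among the cues (O ∉ R).
        return self.target.lower() not in {c.lower() for c in self.cues}

spec = MetonymicSpec(target="Artist", cues=frozenset({"canvas", "paint brush", "easel"}))
literal = MetonymicSpec(target="Artist", cues=frozenset({"artist", "canvas"}))
```

Here `spec` satisfies the constraint while `literal` violates it, since the target leaks into the cue set.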
2. Computational Pipeline for Image Generation
A three-stage computational pipeline operationalizes visual metonymy:
- Representamen Generation (O → R): An LLM (Llama 3.1-70B-Instruct) is few-shot prompted with object–representamen examples (temperature=0.9, top-p=0.9) to produce a set of concrete, associated visual cues R.
- Chain-of-Thought Visual Description (O, R → d): Given O and R, another LLM prompt employs chain-of-thought reasoning to generate a naturalistic or stylistic scene description d. Output descriptions must avoid any direct mention of O; if O appears, resampling occurs.
- Image Synthesis (d → x): A text-to-image generator (Stable Diffusion 3.5-Large, 35 inference steps, guidance=7.5) renders the scene description into a visual instance x.
The end-to-end function is the composition of the three stages: x = Synthesize(Describe(GenerateCues(O), O)).
Concepts are filtered by a concreteness score (Brysbaert et al., 2014) and restricted to WordNet supersense classes with at least 60% metonymic judgments in pilot sampling (Ghosh et al., 25 Jan 2026).
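The three-stage pipeline with its leakage-resampling loop can be sketched as follows. The stage functions are stand-ins (toy stubs, not the paper's LLM or diffusion calls); the first description attempt deliberately leaks the concept so the resampling branch is exercised:

```python
def generate_cues(concept):
    """Stage 1 stand-in: in the paper, a few-shot-prompted LLM (Llama 3.1-70B-Instruct)
    proposes concrete cues associated with the target concept O."""
    bank = {"Artist": ["canvas", "paint brush", "easel"]}
    return bank.get(concept, ["generic object"])

def describe_scene(concept, cues, attempt):
    """Stage 2 stand-in: chain-of-thought scene description. The first attempt
    deliberately leaks the concept to exercise the resampling loop below."""
    desc = "a sunlit studio scene with " + ", ".join(cues)
    return desc + (f", home of an {concept.lower()}" if attempt == 0 else "")

def synthesize(description):
    """Stage 3 stand-in: text-to-image rendering (Stable Diffusion 3.5-Large in the paper)."""
    return f"<image rendered from: {description}>"

def metonymic_image(concept, max_resamples=10):
    """End-to-end composition with the paper's leakage check: if the literal
    concept appears in the description, the description is resampled."""
    cues = generate_cues(concept)
    for attempt in range(max_resamples):
        desc = describe_scene(concept, cues, attempt)
        if concept.lower() not in desc.lower():
            return synthesize(desc)
    raise RuntimeError(f"no leak-free description for {concept!r}")
```

Calling `metonymic_image("Artist")` rejects the first (leaky) description and renders the second, which mentions only the cues.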
3. The ViMET Dataset: Construction and Annotation
ViMET, the first visual metonymy benchmark, operationalizes this pipeline at scale. From 2,077 candidate concepts, two images per concept were generated and filtered to retain 1,000 high-metonymy concepts in both naturalistic (realism) and stylistic (abstract art) visual styles, for a total of 2,000 images.
Annotation Process:
- Metonymy Judgment: Three annotators independently assessed whether each image evokes the concept without depicting it literally; inter-annotator reliability was measured by raw agreement and Cohen's κ. The pipeline achieved a metonymy rate of 84.3%, substantially outperforming naïve prompts (41.2%).
- Multiple-Choice Construction: Each image formed the basis for a 4-way multiple-choice question (MCQ) comprising the gold and three distractors, selected by visual similarity (CLIP embeddings) and semantic relatedness (ConceptNet). Synonyms were removed, and distractors were constrained by BERT-cosine similarity and ConceptNet distance (Ghosh et al., 25 Jan 2026).
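The distractor-selection idea can be sketched as a band-pass filter over similarity to the gold concept: candidates must be related enough to be plausible but not near-synonyms. The thresholds and similarity values below are illustrative stand-ins for the paper's CLIP/BERT/ConceptNet criteria:

```python
def pick_distractors(gold, sims, k=3, lo=0.35, hi=0.85):
    """Keep candidates whose similarity to the gold concept falls in [lo, hi]:
    above lo so the distractor is plausible, below hi so it is not a synonym.
    `sims` maps candidate -> similarity to the gold; thresholds are hypothetical."""
    pool = [c for c, s in sims.items() if c != gold and lo <= s <= hi]
    return sorted(pool, key=sims.get, reverse=True)[:k]  # hardest (most similar) first

sims = {"Painter": 0.92,   # near-synonym of "Artist": excluded by the hi cutoff
        "Sculptor": 0.80, "Museum": 0.60, "Chef": 0.40,
        "Galaxy": 0.10}    # unrelated: excluded by the lo cutoff
```

With these toy values, `pick_distractors("Artist", sims)` drops the synonym and the unrelated term, keeping the three plausible-but-distinct candidates.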
4. Model Evaluation and Performance Gaps
Evaluation is based on an MCQ task: given an image x and four candidate concepts, select the concept evoked. Accuracy is measured over 2,000 items.
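The MCQ scoring itself is simple accuracy over the items; a minimal sketch (names hypothetical) that applies equally to human and model predictors:

```python
def mcq_accuracy(items, predict):
    """items: (image, options, gold_index) triples; predict(image, options) returns
    a chosen index. Mirrors the 4-way MCQ protocol used for humans and VLMs."""
    correct = sum(int(predict(img, opts) == gold) for img, opts, gold in items)
    return correct / len(items)

# Toy run: a predictor that always picks option 0 is right exactly when gold is 0.
items = [("img1", ["Artist", "Chef", "Pilot", "Judge"], 0),
         ("img2", ["Chef", "Artist", "Pilot", "Judge"], 1),
         ("img3", ["Artist", "Judge", "Chef", "Pilot"], 0),
         ("img4", ["Pilot", "Chef", "Judge", "Artist"], 3)]
```

On these four toy items the constant predictor scores 0.5, since two of the four gold indices are 0.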
Vision-Language Models (VLMs) Evaluated:
- Llama 3.2 11B, Llama 3.2 90B, Llama 4 Scout
- InternVL3 8B/78B
- Qwen 2.5 7B/72B
- Gemini 2.5 Flash/Pro
Results:
| System | Overall Accuracy (%) | Naturalistic (%) | Stylistic (%) |
|---|---|---|---|
| Human baseline | 86.9 | 85.6 | 88.1 |
| InternVL3 78B | 65.9 | 66.4 | 66.4 |
Association-type breakdown (250-sample analysis):
| Association Type | VLMs (%) | Humans (%) |
|---|---|---|
| Cultural | 66.6 | 88.3 |
| Contextual | 54.5 | 75.2 |
| Symbolic | 76.3 | 92.1 |
A consistent gap of roughly 21 percentage points separates the top-performing models from human raters. VLMs achieve comparable accuracy across realistic and abstract styles but consistently fall short of human associative inference, particularly for contextual cues (Ghosh et al., 25 Jan 2026).
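The per-type human advantage follows directly from the breakdown table above; a quick computation makes the ordering explicit:

```python
# Values from the association-type breakdown table (250-sample analysis).
human = {"Cultural": 88.3, "Contextual": 75.2, "Symbolic": 92.1}
vlm   = {"Cultural": 66.6, "Contextual": 54.5, "Symbolic": 76.3}

# Human-minus-VLM gap in percentage points for each association type.
gaps = {k: round(human[k] - vlm[k], 1) for k in human}
```

The gap is widest for cultural and contextual associations and narrowest for symbolic ones, consistent with the observation that contextual cues are hardest for models.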
5. Error Analysis and Cognitive Challenges
Image Generation Failures:
- LLMs sometimes leak the literal concept or resort to overly generic cues, producing non-metonymic, literal images.
- Concepts lacking strong or culturally salient associations yield incoherent or ambiguous visual cues, reducing metonymy accuracy.
ViMET MCQ Error Types for VLMs:
- Contextual distractors often create ambiguity (e.g., “Academic” versus “Education”).
- VLMs typically rely on surface-level recognition; they struggle to integrate multiple cues to infer one abstract, unifying concept.
The central challenge lies in within-domain associative inference: linking multiple, distributed, often subtle cues in a scene to a singular, abstract concept—a task where object detection and literal captioning approaches are insufficient (Ghosh et al., 25 Jan 2026).
6. Research Directions
Potential avenues for improving computational visual metonymy include:
- Step-wise Evaluation: Human or automatic scoring of representamen generation and scene description phases can quantify bottlenecks and optimize each stage.
- Knowledge-Anchored Representamens: Expanding association sets using larger knowledge graphs and multimodal embeddings can address low-resource or culturally nuanced concepts.
- End-to-End Fine-Tuning: Joint optimization of LLM and diffusion components using metonymy-oriented reward signals (e.g., human feedback, CLIP-based scores) may reduce leakage of explicit content.
- Beyond Static Imagery: Introducing video or sequential imagery could leverage additional temporal context for richer metonymic inference.
- Cultural and Ethical Calibration: Developing bias detection and mitigation protocols is essential to prevent stereotyped or offensive metonymic renderings (Ghosh et al., 25 Jan 2026).
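The metonymy-oriented reward signal mentioned in the fine-tuning bullet could take the shape below: reward alignment between the image and the abstract concept while penalizing similarity to a literal depiction. This is a hypothetical sketch with illustrative weights, not a method from the paper; `cos` stands in for any embedding similarity (e.g. CLIP-based):

```python
import math

def cos(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def metonymy_reward(img_emb, concept_emb, literal_emb, alpha=1.0, beta=2.0):
    """Hypothetical reward: high when the image aligns with the abstract concept,
    low when it resembles a literal depiction of it (leakage). Weights are illustrative."""
    return alpha * cos(img_emb, concept_emb) - beta * cos(img_emb, literal_emb)

# Toy 2-D embeddings: the "concept" and "literal depiction" directions are orthogonal.
concept, literal = (1.0, 0.0), (0.0, 1.0)
metonymic_img, literal_img = (0.9, 0.1), (0.1, 0.9)
```

Under this reward, the metonymic image scores higher than the literal one, which is the gradient direction an end-to-end fine-tuning loop would exploit.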
These lines of investigation seek to bridge the cognitive reasoning gap observed in current multimodal systems, and to extend visual metonymy to more diverse, context-sensitive, and culturally aware applications.