Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint (2505.23759v1)

Published 29 May 2025 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially-dependent cues ("head" over "heels"). We analyze how different VLMs perform, and our findings reveal that while VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and understanding visual metaphors.

This paper investigates the capabilities of current Vision-Language Models (VLMs) in solving rebus puzzles, a task requiring multi-modal abstraction, symbolic reasoning, and lateral thinking. Unlike standard VLM benchmarks focusing on image captioning or visual question answering, rebus puzzles demand a deeper level of cognitive processing to interpret visual and linguistic cues, spatial arrangements, and symbolic substitutions to decode phrases or idioms.

To probe VLM performance on this task, the authors created a hand-generated and annotated dataset of 432 diverse English-language rebus puzzles. Each puzzle is labeled with the correct solution and categorized by the specific cognitive skills needed to solve it. The 11 identified skills include: Absence or Negation (AN), Text Orientation (TO), Quantitative or Mathematical Reasoning (QMR), Visual Metaphors and Cultural References (VMCR), Symbolic Substitution (SS), Font Style/Size (FS), Letter and Word Manipulation (LWM), Phonetics and Wordplay (PW), Spatial and Positional Reasoning (SPR), Image Recognition (IR), and Text Recognition (TR).
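As a concrete illustration of this annotation scheme, the sketch below shows how one puzzle record with its skill tags might be represented. The field names and the example entry are assumptions for illustration, not the authors' released schema.

```python
from dataclasses import dataclass, field

# Hypothetical record layout for one annotated rebus puzzle; the field
# names below are illustrative assumptions, not the paper's released format.
@dataclass
class RebusPuzzle:
    image_path: str                                   # rendered puzzle image
    solution: str                                     # ground-truth phrase or idiom
    skills: list[str] = field(default_factory=list)   # subset of the 11 skill codes

# Example: "head over heels" relies on spatial/positional reasoning (SPR)
# plus text recognition (TR).
example = RebusPuzzle(
    image_path="puzzles/head_over_heels.png",
    solution="head over heels",
    skills=["SPR", "TR"],
)
```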

The paper evaluated a range of contemporary VLMs, including closed-source models such as GPT-4o (2410.21276), o3, o4-mini, Gemini 2.5 Pro, Gemini 2.5 Flash, Claude 3.7 Sonnet, and Claude 3.5 Haiku, as well as open-source models such as Qwen2.5-VL-7B and Phi-4. Performance was measured using two metrics: "Naive Matching" (exact string match) and "LLM-Judged" evaluation (using GPT-4o to semantically compare the predicted and ground-truth answers).
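A minimal sketch of the two scoring schemes, assuming light string normalization for Naive Matching and an illustrative GPT-4o judge prompt (the paper's exact prompts are not reproduced here):

```python
import re
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()

def normalize(text: str) -> str:
    """Lowercase and strip punctuation before comparison."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def naive_match(prediction: str, answer: str) -> bool:
    """Naive Matching: exact string match after light normalization."""
    return normalize(prediction) == normalize(answer)

def llm_judged(prediction: str, answer: str) -> bool:
    """LLM-Judged: ask GPT-4o whether the prediction is semantically equivalent.

    The judge prompt below is an illustrative stand-in, not the paper's prompt.
    """
    prompt = (
        "You are grading a rebus puzzle answer.\n"
        f"Ground truth: {answer}\nPrediction: {prediction}\n"
        "Reply with exactly YES if they mean the same phrase, otherwise NO."
    )
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")
```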

Key findings from the evaluation include:

  • Overall Performance Gap: While some models, particularly closed-source "reasoning" models (o3, o4-mini, gemini-2.5-pro), show moderate performance, they significantly lag behind human expert solvers. Open-source and non-reasoning models perform much worse.
  • Skill-Specific Weaknesses: VLMs exhibit surprising competence in tasks involving direct symbolic manipulation (SS, SPR) and quantitative reasoning (QMR), likely due to training data focus. However, they struggle significantly with abstract reasoning, lateral thinking, and understanding nuanced cues, particularly Absence or Negation (AN) and Visual Metaphors/Cultural References (VMCR). Performance on Phonetics and Wordplay (PW) also lags behind symbolic substitution (SS), suggesting difficulty with phonetic transformations compared to direct symbol-to-meaning mappings. The disparity between Spatial Reasoning (SPR) and Letter and Word Manipulation (LWM) suggests models can understand spatial layout but fail to apply complex linguistic logic to the elements within that layout.
  • Impact of Prompting and Context (a prompt-construction sketch follows this list):
    • In-Context Learning (ICL): Providing a single example did not substantially improve performance for stronger models, suggesting their limitations are not primarily due to misunderstanding the task format but rather fundamental reasoning capacity. Smaller models saw minor gains.
    • Skill-Guided Prompting: Explicitly listing the required cognitive skills in the prompt led to only minor improvements. This suggests an "awareness vs. execution gap," where models might know what kind of reasoning is needed but lack the ability to perform it effectively.
    • Iterative Refinement: Allowing models multiple attempts with feedback resulted in nominal performance gains, indicating some capacity for self-correction but also a clear ceiling on improvement within a few tries.
  • Importance of Visual Input: Evaluating models solely on detailed captions of the images resulted in a significant drop in performance, especially for reasoning models. This suggests that direct visual access and the ability to iteratively examine visual content during the reasoning process are crucial for these models, and relying on a single-pass caption is insufficient.
  • Underlying Vision Capabilities: An analysis of image retrieval performance using contrastive models (CLIP variants (2103.00020), SigLIP (Tschannen et al., 20 Feb 2025), and TULIP (Tang et al., 19 Mar 2025)) showed that while architectural design and scale affect performance, even strong retrieval scores do not guarantee high rebus-solving ability, highlighting that multi-modal reasoning is a distinct challenge beyond feature extraction.
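Returning to the prompting conditions above, the following sketch shows how the skill-guided, in-context, and iterative-refinement setups could be assembled. The prompt wording, the feedback message, and the `model_call` placeholder are illustrative assumptions rather than the authors' exact protocol.

```python
def build_prompt(skills=None, example=None):
    """Assemble a solving prompt; the skill hint and ICL example are optional."""
    parts = ["Solve the rebus puzzle shown in the image. Answer with the phrase only."]
    if skills:  # skill-guided prompting: name the cognitive skills required
        parts.append("Hint: this puzzle requires " + ", ".join(skills) + ".")
    if example:  # in-context learning: one worked example
        parts.append(f"Example: {example['description']} -> {example['answer']}")
    return "\n".join(parts)

def solve_with_refinement(model_call, image, answer, max_tries=3):
    """Iterative refinement: re-prompt with feedback until correct or out of tries.

    `model_call(image, prompt)` stands in for any VLM API call that returns
    the model's answer as a string.
    """
    prompt = build_prompt()
    guess = ""
    for _ in range(max_tries):
        guess = model_call(image, prompt)
        if guess.strip().lower() == answer.strip().lower():  # naive exact match
            return guess, True
        prompt += f"\nYour previous guess '{guess}' was incorrect. Try a different reading."
    return guess, False
```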
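The retrieval analysis in the last bullet can be approximated with the Hugging Face `transformers` CLIP interface, as in the sketch below; the checkpoint and candidate captions are illustrative, and the paper's retrieval protocol may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; the paper also considers SigLIP and TULIP variants.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("puzzles/head_over_heels.png")
candidates = ["head over heels", "heels over head", "upside down", "feet first"]

# Score the image against each candidate caption and pick the best match.
inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # image-text similarity scores

best = candidates[logits_per_image.softmax(dim=-1).argmax().item()]
print("Top retrieved caption:", best)
```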

The paper concludes that current VLMs lack the necessary lateral thinking and nuanced multi-modal abstraction skills for complex visual puzzles like rebuses. Future work needs to focus on improving abstract reasoning, handling negation and visual metaphors, enhancing iterative refinement capabilities, addressing the awareness-execution gap, and better integrating iterative visual processing into the reasoning loop. The findings point towards fundamental limitations in VLM architecture and training data regarding higher-order cognitive functions required for human-like puzzle-solving.

Authors (6)
  1. HeeKyung Lee (3 papers)
  2. Jiaxin Ge (14 papers)
  3. Tsung-Han Wu (29 papers)
  4. Minwoo Kang (11 papers)
  5. Trevor Darrell (324 papers)
  6. David M. Chan (30 papers)