
VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models

Published 29 Mar 2025 in cs.CV (arXiv:2503.23064v2)

Abstract: Large Vision-Language Models (LVLMs) struggle with puzzles, which require precise perception, rule comprehension, and logical reasoning. Assessing and enhancing their performance in this domain is crucial, as it reflects their ability to engage in structured reasoning - an essential skill for real-world problem-solving. However, existing benchmarks primarily evaluate pre-trained models without additional training or fine-tuning, often lack a dedicated focus on reasoning, and fail to establish a systematic evaluation framework. To address these limitations, we introduce VGRP-Bench, a Visual Grid Reasoning Puzzle Benchmark featuring 20 diverse puzzles. VGRP-Bench spans multiple difficulty levels, and includes extensive experiments not only on existing chat LVLMs (e.g., GPT-4o), but also on reasoning LVLMs (e.g., Gemini-Thinking). Our results reveal that even the state-of-the-art LVLMs struggle with these puzzles, highlighting fundamental limitations in their puzzle-solving capabilities. Most importantly, through systematic experiments, we identify and analyze key factors influencing LVLMs' puzzle-solving performance, including the number of clues, grid size, and rule complexity. Furthermore, we explore two Supervised Fine-Tuning (SFT) strategies that can be used in post-training: SFT on solutions (S-SFT) and SFT on synthetic reasoning processes (R-SFT). While both methods significantly improve performance on trained puzzles, they exhibit limited generalization to unseen ones. We will release VGRP-Bench to facilitate further research on LVLMs for complex, real-world problem-solving. Project page: https://yufan-ren.com/subpage/VGRP-Bench/.

Summary

  • The paper presents VGRP-Bench as a novel benchmark evaluating LVLMs on puzzles like Sudoku, Futoshiki, and Thermometers.
  • It systematically measures LVLMs’ visual perception, rule adherence, and logical reasoning under varying puzzle complexities.
  • Post-training methods such as S-SFT and R-SFT improve puzzle-solving but face challenges in generalization and overfitting.

VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models

Introduction

The paper introduces VGRP-Bench, a benchmark designed to evaluate Large Vision-Language Models (LVLMs) on visual grid reasoning puzzles. LVLMs often struggle to solve such puzzles, which test their perception, rule comprehension, and logical reasoning skills. VGRP-Bench includes 20 diverse puzzles and aims to systematically assess these capabilities through a customizable framework that spans various difficulty levels.

Benchmark Overview

VGRP-Bench is structured around grid-based visual reasoning and a taxonomy of puzzle rules and attributes (Figure 1). The benchmark comprises tasks like Sudoku, Futoshiki, and Thermometers, which require logical deduction and constraint satisfaction to complete. It evaluates state-of-the-art LVLMs on perception accuracy, rule adherence, and overall puzzle-solving capability, and it separates perception from reasoning by also providing text versions of each puzzle.

Figure 2: Puzzle-solving rate of state-of-the-art chat LVLMs on easy-level puzzles associated with each rule.
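The row/column constraints underlying puzzles like Sudoku and Futoshiki can be sketched as a minimal validity check. This is an illustrative helper, not code from the paper; the 4×4 grid and function name are assumptions for the example:

```python
def is_valid_solution(grid):
    """Check a filled n x n Latin-square-style grid: every row and
    every column must contain each symbol 1..n exactly once."""
    n = len(grid)
    target = set(range(1, n + 1))
    rows_ok = all(set(row) == target for row in grid)
    cols_ok = all(set(col) == target for col in zip(*grid))
    return rows_ok and cols_ok

# A solved 4x4 grid: every row and column is a permutation of 1..4.
solved = [
    [1, 2, 3, 4],
    [3, 4, 1, 2],
    [2, 1, 4, 3],
    [4, 3, 2, 1],
]
print(is_valid_solution(solved))   # True

broken = [row[:] for row in solved]
broken[0][0] = 2                   # duplicate in row 0 and column 0
print(is_valid_solution(broken))   # False
```

A model's output can be scored by parsing its predicted grid and running such a check, which is the kind of exact constraint satisfaction the benchmark's solving rate measures.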

Experiments and Key Findings

The paper reports extensive experiments in which LVLMs, including closed-source models like GPT-4o and open-source models like Llama 3.2, fail to solve even easy-level puzzles consistently. The benchmark highlights significant limitations in LVLMs' puzzle-solving capabilities, from localizing numbers in Sudoku grids to maintaining a coherent reasoning process.

Figure 3: Off-the-Shelf LVLMs on Level- with CoT.

The research identifies critical factors influencing performance: difficulty level, grid size, number of clues, and rule complexity. Moreover, comparing visual inputs against text-only versions of the same puzzles isolates vision-specific failures in LVLMs.

Figure 4: Results with Different Number of Clues on Level-.
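The number-of-clues factor can be illustrated by masking cells of a known solution: fewer retained givens generally means a harder puzzle. `make_puzzle` is a hypothetical helper for illustration, not the benchmark's actual generator:

```python
import random

def make_puzzle(solution, num_clues, seed=0):
    """Hide cells of a fully solved grid, keeping exactly `num_clues`
    givens; hidden cells become 0 (empty)."""
    n = len(solution)
    cells = [(r, c) for r in range(n) for c in range(n)]
    rng = random.Random(seed)           # deterministic for reproducibility
    keep = set(rng.sample(cells, num_clues))
    return [[solution[r][c] if (r, c) in keep else 0 for c in range(n)]
            for r in range(n)]

solved = [
    [1, 2, 3, 4],
    [3, 4, 1, 2],
    [2, 1, 4, 3],
    [4, 3, 2, 1],
]
puzzle = make_puzzle(solved, num_clues=6)
print(sum(v != 0 for row in puzzle for v in row))  # 6
```

Sweeping `num_clues` over a range while holding the grid fixed gives the kind of controlled difficulty axis the experiments in this section vary.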

Post-Training and Generalization

Two post-training techniques are explored: Solution Supervised Fine-Tuning (S-SFT) and Reasoning Supervised Fine-Tuning (R-SFT). Both significantly improve puzzle-solving at the trained difficulty levels but generalize poorly to unseen puzzles, suggesting a risk of overfitting.

Figure 5: Comparing S-SFT and R-SFT on Level-.
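The difference between the two SFT targets can be sketched as follows. The paper's exact prompt and trace formats are not shown here, so `sft_target` and its string layout are assumptions, a minimal sketch of the idea that S-SFT supervises only the final solution while R-SFT supervises a synthetic reasoning trace plus the solution:

```python
def sft_target(solution_text, reasoning_steps=None):
    """Build a fine-tuning target string.
    S-SFT: the target is the final solution only.
    R-SFT: the target is a step-by-step trace followed by the solution."""
    if reasoning_steps is None:  # S-SFT
        return solution_text
    trace = "\n".join(f"Step {i + 1}: {step}"
                      for i, step in enumerate(reasoning_steps))
    return f"{trace}\nFinal answer:\n{solution_text}"

# S-SFT target: just the grid.
print(sft_target("1 2 3 4"))
# R-SFT target: trace, then the grid.
print(sft_target("1 2 3 4", ["row 1 is missing 3", "place 3 at (1,3)"]))
```

Training on the richer R-SFT targets gives the model intermediate supervision, which is why the paper compares the two strategies head to head.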

Limitations and Future Work

The computational cost of post-training on large models limits current experiments to smaller models. Future work may explore inference-time strategies like Monte Carlo Tree Search to enhance puzzle-solving or integrate Reinforcement Learning with outcome-based reward models.

Conclusion

VGRP-Bench provides a robust evaluation framework for LVLMs on visual reasoning puzzles, highlighting fundamental challenges and performance limitations. It offers insights into LVLMs' real-world problem-solving capabilities and is intended to inspire further research on addressing complex puzzles with AI models.
