- The paper presents a hierarchical framework that evaluates LLM reasoning through three levels: observe, mutate, and imagine.
- It employs automatic symbolic transformations of existing benchmarks to generate novel problem variations and test generalization.
- Results show significant performance drops at higher levels, highlighting challenges in counterfactual reasoning and the need for better evaluation metrics.
Overview of the Paper: "Re-Imagine: Symbolic Benchmark Synthesis for Reasoning Evaluation"
The paper "Re-Imagine: Symbolic Benchmark Synthesis for Reasoning Evaluation" addresses the crucial question of whether the high accuracy of LLMs on existing reasoning benchmarks genuinely reflects reasoning abilities or is a consequence of statistical memorization of training data. It introduces "Re-Imagine," a comprehensive framework that evaluates LLMs' reasoning capabilities through a hierarchical classification inspired by the ladder of causation, and provides a scalable method to generate varied problem sets to rigorously test different levels of reasoning.
The authors propose a three-tiered hierarchy to assess reasoning abilities in LLMs:
- Observe: At this level, models are evaluated on unmodified problems from existing benchmarks, capturing accuracy on items they may have encountered during training or on close variants of them.
- Mutate: This level introduces mutated problems with altered components, such as modified values, added irrelevant information, or renamed variables. The objective is to determine whether models can generalize beyond memorized data while maintaining logical consistency.
- Imagine: Here, the framework tests whether LLMs can incorporate new logical elements into existing problems, challenging them with counterfactuals or additional hypothetical conditions. This is the most demanding level, probing models' ability to adjust their reasoning to new or contradictory information; the sketch after this list illustrates all three levels on a single toy problem.
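To make the hierarchy concrete, here is a minimal, hypothetical sketch that encodes one GSM8K-style word problem in Python and shows what a Level-2 and a Level-3 variant of it could look like. The problem, values, and function names are illustrative assumptions, not taken from the paper.

```python
# Hypothetical illustration (not the paper's code): one GSM8K-style problem
# expressed symbolically, with example Level-2 and Level-3 variants.

def observe(apples: int = 5, price: float = 2.0) -> float:
    """Level-1 (observe): the benchmark problem as-is.
    'Sam buys 5 apples at $2 each. How much does he spend?'"""
    return apples * price

def mutate(apples: int = 7, price: float = 3.5) -> float:
    """Level-2 (mutate): identical logic, new surface values.
    Renaming entities or adding irrelevant facts works the same way."""
    return apples * price

def imagine(apples: int = 5, price: float = 2.0, discount: float = 0.5) -> float:
    """Level-3 (imagine): a new hypothetical condition is injected:
    'Suppose every second apple is sold at half price.'"""
    full_price_apples = (apples + 1) // 2   # 1st, 3rd, 5th, ... apple
    half_price_apples = apples // 2         # 2nd, 4th, ... apple
    return full_price_apples * price + half_price_apples * price * discount

if __name__ == "__main__":
    print(observe())  # 10.0
    print(mutate())   # 24.5
    print(imagine())  # 8.0
```

The key point is that the Level-3 variant requires revising the solution procedure itself, not just re-executing the same procedure with new numbers.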
Methodology
The paper implements a robust, scalable pipeline that performs automatic symbolic transformations of problems from existing benchmarks—such as GSM8K for math, CLadder for causality, and others—into novel variations. Each problem is first translated into a symbolic form (usually a code representation) and then subjected to different transformations. These transformations alter the symbolic representation to produce new problems, ranging from simple value changes to the introduction of counterfactual statements and additional dependencies. The altered problems are then translated back into natural language for LLM evaluation.
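As a rough illustration of the mutate step of such a pipeline, the sketch below assumes a problem has already been translated into a symbolic (code) form, rewrites its numeric literals, and back-translates the result into natural language via a hand-written template. The symbolic form, template, and mutation rule here are illustrative placeholders, not the paper's actual implementation.

```python
# Minimal sketch of a value-mutation step, assuming the benchmark problem has
# already been translated into symbolic (code) form. The template-based
# back-translation stands in for the paper's automated pipeline.
import ast
import random

SYMBOLIC_FORM = """
apples = 5
price = 2
answer = apples * price
"""

TEMPLATE = "Sam buys {apples} apples at ${price} each. How much does he spend?"

def mutate_values(code: str, rng: random.Random) -> str:
    """Level-2 transformation: replace every integer literal with a fresh value
    while leaving the program's logical structure untouched."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            node.value = rng.randint(2, 20)
    return ast.unparse(tree)

def render(code: str) -> tuple[str, int]:
    """Back-translate the symbolic form into a natural-language question and
    obtain its ground-truth answer by executing the mutated code."""
    scope: dict = {}
    exec(code, {}, scope)
    question = TEMPLATE.format(apples=scope["apples"], price=scope["price"])
    return question, scope["answer"]

if __name__ == "__main__":
    rng = random.Random(0)
    variant = mutate_values(SYMBOLIC_FORM, rng)
    question, answer = render(variant)
    print(question)  # a freshly generated problem variant
    print(answer)    # its ground-truth answer, verified by execution
```

One appeal of the symbolic route is that executing the transformed code recovers the ground-truth answer automatically, so every generated variant comes with a verifiable label and no manual re-annotation is needed.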
Experimental Findings
The authors conduct comprehensive evaluations across multiple LLM families, including GPT, Llama, and Phi models. Key findings include:
- Model performance decreases markedly as problems move up the reasoning hierarchy. The drop from Level-1 (observe) to Level-2 (mutate) suggests reliance on memorized data, while Level-3 (imagine) reveals substantial limitations in handling counterfactual reasoning and newly introduced logical constructs.
- For example, models performed notably worse on GSM8K variants with Level-3 (imagine) transformations, showing difficulty integrating new logic into familiar reasoning structures. Similar drops were observed on other benchmarks such as CLadder and CRUXEval.
Implications and Future Directions
The insights gathered emphasize the current limits of LLM reasoning and the need for evaluation metrics that distinguish genuine reasoning from memorization. The proposed methodology provides a scalable approach to such testing, which will be crucial for future LLM evaluation frameworks. Looking ahead, improving LLMs on complex reasoning tasks may involve integrating stronger symbolic reasoning skills and building systems that adapt more readily to new contexts.
Overall, the "Re-Imagine" framework sets a rigorous standard for LLM evaluation, potentially guiding future developments in AI systems aiming for genuine cognitive-like reasoning. As the field advances, this framework can significantly contribute to creating more robust and reliable AI systems across diverse application domains.