Evaluating Active Reasoning Abilities of LLMs through AR-Bench
The paper "From Passive to Active Reasoning: Can LLMs Ask the Right Questions under Incomplete Information?" proposes AR-Bench, a benchmark designed to evaluate the active reasoning capabilities of LLMs. Active reasoning diverges from the traditional passive paradigm: rather than only processing information given up front, the model must interact with an external source to gather the missing evidence it needs. The paper argues that existing benchmarks overlook this ability and uses AR-Bench to expose current shortcomings and to chart directions for improvement.
Objective and Scope
The primary objective of AR-Bench is to rigorously examine the active reasoning abilities of LLMs. The benchmark introduces three distinct tasks:
- Detective Cases (DC): Simulates interrogation scenarios where models must deduce the murderer by interacting with suspects.
- Situation Puzzles (SP): Based on lateral thinking puzzles where models must uncover the truth through yes-or-no questioning.
- Guessing Numbers (GN): Requires models to deduce a four-digit number from feedback on which guessed digits are correct and correctly positioned (a minimal feedback sketch follows below).
All three tasks require the model to formulate questions that elicit the critical missing information, testing its ability to reason under incomplete information.
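To make the GN setup concrete, the sketch below scores a guess against a hidden four-digit number. The Bulls-and-Cows-style feedback format (exact vs. misplaced digits) and the function name are illustrative assumptions; the paper's exact feedback protocol may differ.

```python
from collections import Counter

def gn_feedback(secret: str, guess: str) -> tuple[int, int]:
    """Score a guess against a four-digit secret.

    Returns (exact, partial): `exact` counts digits that are correct and
    correctly positioned; `partial` counts correct digits in the wrong
    position. This Bulls-and-Cows-style format is an assumption, not
    necessarily the paper's exact feedback rule.
    """
    assert len(secret) == len(guess) == 4
    exact = sum(s == g for s, g in zip(secret, guess))
    # Count shared digits regardless of position, then subtract exact matches.
    common = sum((Counter(secret) & Counter(guess)).values())
    return exact, common - exact

# Example: secret 1234, guess 1243 -> 2 exact (1, 2), 2 misplaced (3, 4).
print(gn_feedback("1234", "1243"))  # (2, 2)
```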
Methodological Approach
AR-Bench uses a multi-round interaction between the evaluated model and non-player characters (NPCs) or a judge model. This setup simulates real-world problem solving in which information is not readily available and the model must proactively seek it out. The evaluation includes two kinds of metrics (a minimal sketch of the loop and a scorer follows the list):
- Outcome Metrics: Accuracy in identifying the true murderer in DC, character-level F1 scores in SP, and exact match rates in GN.
- Process Metrics: The quality and strategic effectiveness of the questions the model asks across interaction rounds.
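The following sketch shows how such a multi-round evaluation could be wired together, along with a character-level F1 scorer for SP. The callable names (`ask_question`, `answer_question`, `final_answer`), the round budget, and the exact F1 definition are assumptions for illustration rather than AR-Bench's official harness.

```python
from collections import Counter
from typing import Callable, List, Tuple


def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1 between a predicted and a reference solution (SP).

    Treats each string as a multiset of characters; this is one common
    definition and may differ in detail from AR-Bench's official scorer.
    """
    overlap = sum((Counter(prediction) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def run_episode(
    ask_question: Callable[[List[Tuple[str, str]]], str],
    answer_question: Callable[[str], str],
    final_answer: Callable[[List[Tuple[str, str]]], str],
    max_rounds: int = 20,
) -> str:
    """Generic multi-round active-reasoning loop.

    The three callables stand in for the evaluated LLM, the NPC/judge
    model, and the model's final-answer step; AR-Bench's real prompts
    and round budgets differ and are not reproduced here.
    """
    history: List[Tuple[str, str]] = []
    for _ in range(max_rounds):
        question = ask_question(history)   # model proactively asks
        reply = answer_question(question)  # NPC or judge responds
        history.append((question, reply))
    return final_answer(history)           # scored by the outcome metric
```

In a real run, `ask_question` would wrap the evaluated LLM with the accumulated dialogue, and the returned final answer would be scored with accuracy, `char_f1`, or exact match depending on the task.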
Experimental Results
The paper evaluates a range of state-of-the-art LLMs with multiple reasoning methods, including zero-shot and few-shot prompting, tree-of-thought search, and supervised fine-tuning. Key observations include:
- Significant Performance Gap: LLMs struggle substantially with active reasoning, most starkly in GN, where exact match rates remain low.
- Limited Impact of Model Scaling: Larger models perform better, but scaling alone falls well short of solving the active reasoning challenges.
- Early Gains and Late Plateaus: Early interaction rounds yield notable progress that tapers off in later rounds, suggesting difficulty in maintaining a coherent context and consistently generating high-quality questions.
Implications and Future Research Directions
The findings of AR-Bench underline the critical need for developing new methodologies tailored to active reasoning. Possible directions include:
- Data-Centric Approaches: Creating high-quality datasets focusing on interactive and dynamic question-asking strategies.
- Algorithmic Enhancements: Exploring reinforcement learning techniques that couple questioning strategies with outcome-based rewards to encourage exploration and iterative learning (a minimal reward sketch follows this list).
- Model Evaluation using LLM Judges: Using reliable LLMs as automated judges, which scale better and cost less than human evaluation.
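As one hedged illustration of the reinforcement-learning direction above, the sketch below assigns a terminal, outcome-based reward to every question in a trajectory with a simple discount. The reward shape, discount factor, and function name are assumptions; the paper does not prescribe this scheme.

```python
def outcome_rewards(num_turns: int, solved: bool, gamma: float = 0.95) -> list[float]:
    """Spread a terminal, outcome-based reward over a questioning trajectory.

    A REINFORCE-style sketch: the episode earns 1.0 only if the task is
    solved, and earlier questions receive a discounted share of that
    outcome. The reward shape and discount are illustrative assumptions,
    not the paper's proposal.
    """
    terminal = 1.0 if solved else 0.0
    # Turn t (0-indexed) gets gamma^(num_turns - 1 - t) * terminal.
    return [terminal * gamma ** (num_turns - 1 - t) for t in range(num_turns)]

# Example: a solved 5-turn episode credits later, more decisive questions more.
print([round(r, 2) for r in outcome_rewards(5, solved=True)])  # [0.81, 0.86, 0.9, 0.95, 1.0]
```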
Beyond evaluation, the paper invites researchers to push LLMs toward the dynamic demands of real-world applications, aiming to bridge the gap between today's strong passive reasoning and the requirements of settings governed by incomplete information.