
RBF++: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning (2505.13307v1)

Published 19 May 2025 in cs.CL, cs.AI, and cs.CV

Abstract: Chain-of-Thought (CoT) reasoning has proven effective in enhancing LLMs on complex tasks, spurring research into its underlying mechanisms. However, two primary challenges remain for real-world applications: (1) the lack of quantitative metrics and actionable guidelines for evaluating and optimizing measurable boundaries of CoT capability, and (2) the absence of methods to assess boundaries of unmeasurable CoT capability, such as multimodal perception. To address these gaps, we introduce the Reasoning Boundary Framework++ (RBF++). To tackle the first challenge, we define the reasoning boundary (RB) as the maximum limit of CoT performance. We also propose a combination law for RBs, enabling quantitative analysis and offering actionable guidance across various CoT tasks. For the second challenge, particularly in multimodal scenarios, we introduce a constant assumption, which replaces unmeasurable RBs with scenario-specific constants. Additionally, we propose the reasoning boundary division mechanism, which divides unmeasurable RBs into two sub-boundaries, facilitating the quantification and optimization of both unmeasurable domain knowledge and multimodal perception capabilities. Extensive experiments involving 38 models across 13 tasks validate the feasibility of our framework in cross-modal settings. Additionally, we evaluate 10 CoT strategies, offer insights into optimization and decay from two complementary perspectives, and expand evaluation benchmarks for measuring RBs in LLM reasoning. We hope this work advances the understanding of RBs and optimization strategies in LLMs. Code and data are available at https://github.com/LightChen233/reasoning-boundary.

Summary

  • The paper introduces RBF++, a framework that quantifies and optimizes chain-of-thought reasoning boundaries based on measurable difficulty levels.
  • It proposes a harmonic mean combination law and constant approximations to evaluate both measurable and unmeasurable capabilities across tasks.
  • Empirical results from 38 models and 13 tasks show that prompting strategies like MARP++ significantly enhance multimodal reasoning performance.

The paper "RBF++: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning" (2505.13307) introduces a framework called RBF++ to address key challenges in applying Chain-of-Thought (CoT) reasoning with LLMs in real-world scenarios. Specifically, it tackles the lack of quantitative metrics for evaluating and optimizing CoT capabilities and the difficulty in assessing capabilities that are hard to measure, such as multimodal perception or domain knowledge.

The core concept of RBF++ is the Reasoning Boundary (RB). This is defined as the maximum difficulty level a model can handle on a task while maintaining a target accuracy $K_1$. Practically, the RB $\mathcal{B}_{Acc=K_1}(t \mid m)$ represents a performance limit for a model $m$ on a task $t$ as its difficulty $d$ increases. Task difficulty can be quantified by metrics like the number of reasoning steps or computational complexity.
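
To make the definition concrete, here is a minimal sketch of how one might estimate an RB empirically, assuming accuracy has been measured at a sweep of difficulty levels (the function name and interface are illustrative, not from the paper's released code):

```python
import numpy as np

def estimate_rb(difficulties, accuracies, k1=0.9):
    """Estimate B_{Acc=K1}(t|m): the largest difficulty d at which the
    model's measured accuracy on task t still meets the target K1.

    difficulties, accuracies: parallel arrays from an evaluation sweep,
    e.g. number of reasoning steps vs. accuracy at that depth.
    """
    difficulties = np.asarray(difficulties, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    feasible = difficulties[accuracies >= k1]
    return float(feasible.max()) if feasible.size else 0.0

# Example sweep: accuracy degrades as reasoning depth grows.
print(estimate_rb([1, 2, 3, 4, 5], [0.99, 0.95, 0.91, 0.72, 0.40]))  # 3.0
```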

For tasks requiring multiple capabilities, RBF++ proposes a Combination Law for RBs. This law suggests that the combined RB for multiple tasks $(t_1, \dots, t_n)$ within a model $m$ can be approximated using a harmonic mean of individual task RBs:

$$\mathcal{B}(t_1, t_2, \dots, t_n) \approx \frac{1}{\sum_{i=1}^{n}\frac{1}{\mathcal{B}(t_i)}}$$

where $\mathcal{B}(t_i)$ represents the normalized RB for task $t_i$. This provides a practical method to estimate performance on complex tasks by understanding the model's limits on their constituent sub-tasks.
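
A minimal sketch of the combination law as stated above (illustrative code, not the authors' implementation):

```python
def combined_rb(sub_rbs):
    """Combination law: the combined RB is the reciprocal of the sum of
    reciprocal sub-task RBs, so the weakest capability dominates."""
    return 1.0 / sum(1.0 / b for b in sub_rbs)

# A large calculation RB barely helps if the planning RB is small:
print(combined_rb([8.0, 2.0e5]))  # ~7.9997, pinned near the planning limit
```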

To address unmeasurable CoT capabilities, such as multimodal perception or broad domain knowledge, the framework introduces a Constant Assumption. If certain sub-boundaries are difficult to measure directly but are stable within a scenario, they can be replaced by scenario-specific constants in the combination law. For example, in a multimodal task, the unmeasurable vertical domain RB $\mathcal{B}(v)$ might be approximated by a constant $z_1$ derived from the model's performance on direct answer generation (without CoT):

$$\mathcal{B}(p,o,v) = \frac{1}{\frac{1}{\mathcal{B}(p)}+\frac{1}{\mathcal{B}(o)}+z_1}$$

where $\mathcal{B}(p)$ is the planning RB and $\mathcal{B}(o)$ is the operation RB.

The Reasoning Boundary Division Mechanism allows for finer-grained analysis. A unified RB can be systematically divided into sub-boundaries. For instance, the vertical domain RB $\mathcal{B}(v)$ can be split into a domain knowledge RB $\mathcal{B}_k$ and a multimodal perception RB $\mathcal{B}_{mm}$:

$$\mathcal{B}(v) = \mathcal{B}(k, mm) = \frac{1}{\frac{1}{\mathcal{B}_k} + \frac{1}{\mathcal{B}_{mm}}}$$

If $\mathcal{B}_{mm}$ is treated as a constant $z''$, the combination law extends:

$$\mathcal{B}(p,o,v) = \frac{1}{\frac{1}{\mathcal{B}(p)}+\frac{1}{\mathcal{B}(o)}+\frac{1}{\mathcal{B}_k} + z'}$$

where $z' = 1/z''$. This allows for targeted optimization of specific, even unmeasurable, capabilities.
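
Computationally, both the constant assumption and the division mechanism amount to adding constant terms to the denominator of the combination law. A hedged sketch (hypothetical helper, not from the paper's codebase):

```python
def combined_rb_with_constants(measurable_rbs, reciprocal_constants=()):
    """Combination law with the constant assumption: unmeasurable
    sub-boundaries enter as scenario-specific constants added directly to
    the sum of reciprocals (e.g. z1 for an undivided vertical-domain RB,
    or z' = 1/z'' for multimodal perception after division)."""
    denom = sum(1.0 / b for b in measurable_rbs) + sum(reciprocal_constants)
    return 1.0 / denom

# B(p,o,v): measurable planning and operation RBs plus a constant z1.
b_pov = combined_rb_with_constants([8.0, 2.0e5], reciprocal_constants=[0.02])
print(b_pov)  # the constant z1 further tightens the combined boundary
```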

RBF++ categorizes RBs into three types based on accuracy (a classification sketch follows the list):

  • Completely Feasible Reasoning Boundary (CFRB): Accuracy $\ge 90\%$. Tasks within this boundary are reliably mastered by the model.
  • Partially Feasible Reasoning Boundary (PFRB): Accuracy $10\% < \mathrm{Acc} < 90\%$. Tasks require additional effort like repeated reasoning or external information.
  • Completely Infeasible Reasoning Boundary (CIRB): Accuracy $\le 10\%$. Tasks are generally unsolvable by the model.
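
These thresholds translate directly into a simple classifier; a minimal sketch (illustrative, not from the paper's code):

```python
def categorize_rb(accuracy):
    """Map a measured accuracy onto the three RBF++ categories."""
    if accuracy >= 0.9:
        return "CFRB"  # completely feasible: reliably mastered
    if accuracy > 0.1:
        return "PFRB"  # partially feasible: needs extra effort
    return "CIRB"      # completely infeasible: generally unsolvable

print([categorize_rb(a) for a in (0.95, 0.5, 0.05)])  # ['CFRB', 'PFRB', 'CIRB']
```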

Empirical validation across 38 models and 13 tasks supports these concepts. Experiments show the existence of distinct RBs across tasks like arithmetic, natural language planning, and code planning. The combination law is verified in complex arithmetic, mathematical reasoning, and multi-hop QA. Analysis of their nature reveals that tasks within the CFRB are often solvable zero-shot; PFRB tasks benefit significantly from consensus-building methods like Self-Consistency, while CIRB tasks remain difficult even with such methods. The paper also suggests LLMs have some self-awareness of their RBs, generating synthetic data primarily within their CFRB.

The framework provides insights for optimizing CoT strategies. It explains that methods like Tool Usage (TU) and Program-of-Thought (PoT) improve performance in textual mathematical tasks by effectively enhancing the calculation RB $\mathcal{B}(c)$. TU makes $\mathcal{B}(c)$ approach infinity, while PoT improves the planning RB $\mathcal{B}(p)$ through structured code.
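
To illustrate why tool usage pushes $\mathcal{B}(c)$ toward infinity, consider a sketch in which the model emits an arithmetic expression and a tool evaluates it, so correctness no longer depends on in-context calculation (the evaluator below is a hypothetical illustration, not the paper's tooling):

```python
import ast
import operator

# Restricted arithmetic evaluator standing in for a calculator tool.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def eval_expr(node):
    """Recursively evaluate a parsed arithmetic expression."""
    if isinstance(node, ast.Expression):
        return eval_expr(node.body)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](eval_expr(node.left), eval_expr(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    raise ValueError("unsupported expression")

# The model only plans and emits the expression; the tool computes it exactly.
print(eval_expr(ast.parse("123456 * 789 + 42", mode="eval")))  # 97406826
```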

However, traditional methods like Complex CoT (CCoT) and Least-to-Most (LtM) face challenges, especially in multimodal scenarios. CCoT breaks reasoning into finer steps to ease the calculation RB but can increase planning pressure. LtM decomposes problems into sub-questions to ease local planning but can drastically increase global planning pressure, leading to failure when the model's global planning RB is exceeded.

To address these limitations, the paper introduces Minimum Acceptable Reasoning Paths (MARP) for textual tasks and MARP++ for multimodal tasks. These prompting strategies aim to optimize reasoning by framing problems within acceptable RBs and managing computational and planning loads.

MARP aims to reduce computational and planning burdens by:

  1. Limiting single-step operations (e.g., $\le 5$ basic operations per step, multiplication results $< 1.5\mathrm{e}5$).
  2. Maximizing computation per step within limits to reduce the number of global planning steps.

A potential prompt snippet for MARP might look like this:

```
You need to perform multi-step reasoning, with each step carrying out as many basic operations as possible.
Remember, you can only complete tasks that contain up to 5 basic operations per step, and multiplication operations must be less than 1.5e5. The upper limit of the multiplication operations decreases as the number of operations per step increases.
```

MARP++ extends this to multimodal scenarios, incorporating considerations for unmeasurable capabilities like visual perception and domain knowledge by adding explicit constraints in the prompt:

```
You are required to perform multi-step reasoning, ensuring that each step operates within clearly defined boundaries:
*   Global Planning Boundary: Focus on the overall strategy and high-level goal. You should break down the task into manageable steps (less than 15 steps) within your capabilities but always consider the broader objective to ensure coherence in the approach.
*   Local Step Operation Boundary: In each step, perform as many basic operations as possible, but each step must adhere to a limit of 5 basic operations. Avoid exceeding this boundary to maintain clarity and precision at each stage.
*   Multimodal Perception Boundary: When reasoning, incorporate all available information (text, images, etc. if available) without overstepping the boundaries of what can be processed in one step. Make sure to integrate the relevant modalities effectively within the defined operation limits. If perception is very difficult, please divide it into multiple steps for multimodal perception.
*   Domain-Knowledge Boundary: Utilize your domain knowledge effectively but ensure that each step remains grounded within your expertise. Do not go beyond what is strictly necessary for the current step.
```

The experiments show that MARP++ achieves significant accuracy improvements in multimodal reasoning across various domains compared to other CoT strategies.

The paper also explores the scaling of RBs across different LLMs, including advanced reasoning models like DeepSeek-R1. It finds a strong positive correlation between RB values and performance on benchmarks. It also notes that while advanced models show improvements in CIRB (solving previously infeasible tasks), improvements in CFRB (tasks the model reliably masters) are often less pronounced, suggesting this is a key area for future development in LLMs.

In summary, RBF++ provides a theoretical and empirical framework for understanding and optimizing CoT reasoning boundaries in LLMs. By quantifying RBs, including those related to unmeasurable capabilities through approximations and division, it offers actionable guidance for developing more effective prompting strategies like MARP++, improving LLM performance on complex, real-world tasks across modalities.
