PFRB in LLMs: Measuring Reasoning Boundaries

Updated 7 April 2026

PFRB is a formal concept that delineates an LLM's transition zone between complete feasibility (CFRB) and infeasibility (CIRB) in chain-of-thought reasoning.
The framework employs harmonic-mean laws to combine measurable subtasks and uses constants for unmeasurable components, enabling precise estimation of reasoning limits.
Empirical findings across benchmarks validate PFRB’s utility for targeted model optimization, enhancing understanding of LLM performance under varying task complexities.

A partially feasible reasoning boundary (PFRB) is a technical concept in the quantitative analysis of LLM reasoning capabilities, formalized within the Reasoning Boundary Framework++ (RBF++). PFRB specifies the band of task difficulty within which an LLM’s accuracy transitions between complete feasibility and infeasibility for chain-of-thought (CoT) reasoning. In RBF++, this notion provides a rigorous and actionable partitioning of models’ reasoning limits across both measurable and unmeasurable cognitive dimensions, supporting targeted optimization and theory-grounded benchmarking (Chen et al., 19 May 2025).

1. Mathematical Formalization of Reasoning Boundaries

Let $M$ be a fixed LLM, $T$ a reasoning task, and $d \in \mathbb{D}$ a scalar quantifier of task difficulty (e.g., number of arithmetic steps, plan depth, multi-hop count). For an accuracy threshold $K_1 \in [0,1]$, the reasoning boundary is

$\mathcal{B}_{\text{Acc}=K_1}(T \mid M) \coloneqq \sup \{ d \mid \text{Acc}(T \mid d, M) \ge K_1 \}.$

Here, $\text{Acc}(T \mid d, M)$ is model accuracy for task $T$ at difficulty $d$. Typically, three regions are delineated:

CFRB ($\mathcal{B}_{\text{Acc}\ge0.90}$): completely feasible region (accuracy at least 90%).
CIRB ($\mathcal{B}_{\text{Acc}\le0.10}$): completely infeasible region (accuracy at most 10%).
PFRB ($\mathcal{B}_{0.10<\text{Acc}<0.90}$): partially feasible region, the intermediate band between the CFRB and the CIRB (accuracy strictly between 10% and 90%).

This partition allows precise localization of a model's chain-of-thought capability threshold as a function of task complexity.
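The boundary definition above can be illustrated on a finite grid of tested difficulty levels, where the supremum reduces to a maximum. This is a minimal sketch, not the RBF++ implementation: the function names `reasoning_boundary` and `region` and the accuracy table are hypothetical, and the 0.90 and 0.10 cut-offs are the thresholds used in this section.

```python
from typing import Mapping, Optional


def reasoning_boundary(acc_by_difficulty: Mapping[float, float],
                       threshold: float) -> Optional[float]:
    """Largest tested difficulty whose accuracy is >= threshold.

    On a finite grid the supremum in the definition is just a max.
    Returns None when no tested level reaches the threshold.
    """
    feasible = [d for d, acc in acc_by_difficulty.items() if acc >= threshold]
    return max(feasible) if feasible else None


def region(acc: float, hi: float = 0.90, lo: float = 0.10) -> str:
    """Label one accuracy value as CFRB, PFRB, or CIRB."""
    if acc >= hi:
        return "CFRB"
    if acc <= lo:
        return "CIRB"
    return "PFRB"


# Hypothetical accuracy-vs-difficulty measurements (illustrative only).
acc = {1: 0.99, 2: 0.97, 3: 0.92, 4: 0.71, 5: 0.40, 6: 0.12, 7: 0.05}

cfrb_edge = reasoning_boundary(acc, 0.90)
labels = {d: region(a) for d, a in acc.items()}
print(cfrb_edge, labels)
```

In this toy table, difficulties 4 through 6 fall in the partially feasible band, which is where the later sections focus their optimization effort.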

2. Combination Law for Measurable Subtasks

Complex reasoning tasks are typically decomposed into subtasks $T_1, \dots, T_n$, each exhibiting its own reasoning boundary. RBF++ demonstrates that, under mild independence and smoothness assumptions, the combined RB is governed by a harmonic-mean law. In the normalized case,

[…]

More generally, allowing a per-task scale and offset,

[…]

Empirically, this law accurately predicts RBs in GSM8K multi-step mathematics (90% and 10% contours), HotpotQA multi-hop QA (global planning and entity knowledge RBs), and other tasks (Chen et al., 19 May 2025). The combination law enables quantitative dissection of complex multi-component reasoning and provides actionable compositional guidance.
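To make the idea of a harmonic-style combination concrete, the sketch below assumes the simplest reciprocal-sum form, in which the reciprocal of the combined RB is the sum of `scale / (sub_rb - offset)` over subtasks. This form is an assumption for illustration: the exact normalization and constant factors used in RBF++ may differ, and the names `combined_rb`, `scales`, and `offsets` are hypothetical.

```python
from typing import Optional, Sequence


def combined_rb(sub_rbs: Sequence[float],
                scales: Optional[Sequence[float]] = None,
                offsets: Optional[Sequence[float]] = None) -> float:
    """Combine sub-boundaries with an ASSUMED reciprocal-sum law:

        1 / B_combined = sum_i scale_i / (B_i - offset_i)

    With scales = 1 and offsets = 0 this is the plain reciprocal sum.
    """
    n = len(sub_rbs)
    if n == 0:
        raise ValueError("need at least one sub-boundary")
    scales = [1.0] * n if scales is None else list(scales)
    offsets = [0.0] * n if offsets is None else list(offsets)
    if len(scales) != n or len(offsets) != n:
        raise ValueError("scales and offsets must match sub_rbs in length")

    denominator = 0.0
    for rb, scale, offset in zip(sub_rbs, scales, offsets):
        shifted = rb - offset
        if shifted <= 0:
            raise ValueError("each sub-boundary must exceed its offset")
        denominator += scale / shifted
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 1.0 / denominator


# Hypothetical sub-boundaries for two subtasks (illustrative only).
print(combined_rb([2.0, 6.0]))
print(combined_rb([2.0, 6.0], scales=[2.0, 1.0], offsets=[0.0, 2.0]))
```

Under this assumed form with unit scales and zero offsets, the combined boundary never exceeds the smallest sub-boundary, which matches the intuition that the weakest subtask limits the composite task.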

3. Handling Unmeasurable Reasoning Boundaries: Constant Assumption and Division

In many real-world or multimodal tasks, some sub-boundaries (such as domain knowledge breadth or perception ability) cannot be varied experimentally. RBF++ replaces each such unmeasurable sub-RB with a scenario-specific constant: […]. This constant is computed by evaluating non-CoT direct accuracy for the corresponding sub-domain and solving for the effective RB denominator, which keeps the combination-law machinery applicable when unmeasurable factors are present.

Where such an unmeasurable RB (e.g., vertical-domain reasoning) is still too coarse, RBF++ proposes a division mechanism: […]. For instance, the RB can be decomposed into a domain-knowledge component and a multimodal-perception component: […], with further constants used to fix perception complexity when it is invariant.
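One way to picture the constant assumption is to treat the unmeasurable factor as a fixed additive term in the denominator of a reciprocal-sum combination, fit that term once, and reuse it. The sketch below does exactly that under this stated assumption; it is not the RBF++ fitting procedure, the functions `fit_constant` and `predict_rb` are hypothetical, and the numbers are made up.

```python
def fit_constant(observed_rb: float, measurable_rb: float) -> float:
    """Solve 1/observed = 1/measurable + c for the constant c.

    ASSUMPTION: the unmeasurable sub-RB contributes a fixed additive
    term c to the denominator of a reciprocal-sum combination.
    """
    if observed_rb <= 0 or measurable_rb <= 0:
        raise ValueError("boundaries must be positive")
    c = 1.0 / observed_rb - 1.0 / measurable_rb
    if c < 0:
        raise ValueError("observed RB exceeds the measurable RB; "
                         "the additive-constant assumption does not hold")
    return c


def predict_rb(measurable_rb: float, constant: float) -> float:
    """Predict the combined RB once the constant is fixed."""
    if measurable_rb <= 0:
        raise ValueError("measurable boundary must be positive")
    return 1.0 / (1.0 / measurable_rb + constant)


# Hypothetical numbers: combined RB of 2 observed with a measurable RB of 4.
c = fit_constant(observed_rb=2.0, measurable_rb=4.0)
# Reuse the constant after an intervention doubles the measurable RB.
print(c, predict_rb(8.0, c))
```

The example shows why the constant matters: doubling the measurable boundary raises the predicted combined boundary by much less than a factor of two, because the fixed unmeasurable term still dominates the denominator.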

4. Empirical Findings: PFRB Bandwidth and Model Behavior

Extensive experiments validate the PFRB formulation, using 38 models (27 text LLMs and 5 multimodal LLMs) across 13 benchmarks. Quantitative highlights include:

For multiplication: […], […].
Step-planning RB: […] steps, […] steps.
BigGSM (GPT-3.5-Turbo): CoT […], Tool Usage (TU) […], Program-of-Thought (PoT) […].
In PFRB, self-consistency voting boosts accuracy from […] to […]; in CFRB, zero-shot CoT rationales increase correctness by […] over PFRB/CIRB; in CIRB, ensemble techniques yield no tangible gain (always […]) (Chen et al., 19 May 2025).
Synthetic-CoT prompts localize […] of samples into CFRB, suggesting that models have some awareness of their own RB.
In multimodal contexts (M3CoT), direct-prompt measurable […] and the constant-augmented combination law locate distinct 90%/10% RBs, with a similar three-zone structure.
Open-source models often have […] in CFRB, indicating significant headroom.
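The self-consistency voting mentioned in the findings above is commonly implemented as a majority vote over the final answers of several sampled reasoning chains. The sketch below shows only the voting step; sampling from a model is not shown, the function name `self_consistency_vote` is hypothetical, and the answers are made-up strings.

```python
from collections import Counter
from typing import Sequence


def self_consistency_vote(answers: Sequence[str]) -> str:
    """Return the most frequent final answer among sampled chains.

    Ties are broken in favor of the answer that appeared first,
    which is how Counter.most_common orders equal counts.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    return counts.most_common(1)[0][0]


# Hypothetical final answers extracted from five sampled CoT chains.
sampled = ["42", "41", "42", "40", "42"]
print(self_consistency_vote(sampled))
```

Voting can only help when the correct answer is sampled more often than any single wrong answer, which is consistent with the observation that ensembling gives no tangible gain in the CIRB.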

5. Strategies for Optimizing the Partially Feasible Region

PFRB can be deliberately manipulated by targeting its constituent sub-boundaries:

Measurable boundaries: Tool Usage ([…]), PoT (raises […]), MARP (caps per-step operations).
Domain-knowledge RB: context injection, retrieval, expert-curated exemplars.
Perceptual RB: attention-focused prompting, object cropping, perceptual tool integration.
Optimization in practice: MARP++ (explicit multimodal/perception/knowledge constraints) raises accuracy to […], outperforming both standard MARP ([…]) and baseline CoT (Chen et al., 19 May 2025).

Self-consistency and rational prompt design shift more tasks into CFRB, while over-fragmentation (e.g., excessive least-to-most division, complex-CoT) can degrade performance if demonstrations become too granular.

6. Workflow for PFRB Localization and Improvement

The RBF++ recipe for PFRB assessment and enhancement, as detailed in (Chen et al., 19 May 2025), is:

Identify measurable and unmeasurable subtasks, and their respective difficulty axes.
For measurable branches, empirically estimate each sub-boundary $\mathcal{B}_{\text{Acc}=K_1}(T \mid M)$ by analyzing accuracy vs. difficulty at thresholds $K_1 \in \{0.90, 0.10\}$.
For unmeasurable components, instantiate scenario-specific constants using direct accuracy in non-CoT settings.
Decompose coarse unmeasurable RBs into knowledge and perception components, measuring each where possible or holding the other fixed.
Assemble the full RB using the harmonic-mean forms, including all constants and per-branch measurements.
Apply targeted interventions to raise specific sub-boundaries and contract the PFRB.
Re-evaluate the model, seeking rightward (more difficult) movement of the 90%/10% RB contours and a reduced PFRB gap.

This closed-loop process quantifies LLM CoT performance and advances it beyond the empirical status quo.
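The final re-evaluation step of the recipe can be sketched as a before/after comparison of the PFRB band. This is an illustrative sketch under stated assumptions: the band's lower edge is taken as the largest tested difficulty with accuracy at or above 90%, the upper edge as the smallest tested difficulty with accuracy at or below 10%, and the function names and accuracy tables are hypothetical.

```python
from typing import Mapping, Optional, Tuple

Band = Tuple[Optional[float], Optional[float]]


def pfrb_band(acc_by_difficulty: Mapping[float, float],
              hi: float = 0.90, lo: float = 0.10) -> Band:
    """Return (lower_edge, upper_edge) of the partially feasible band.

    lower_edge: largest tested difficulty with accuracy >= hi.
    upper_edge: smallest tested difficulty with accuracy <= lo.
    Either edge is None if no tested level satisfies its condition.
    """
    feasible = [d for d, a in acc_by_difficulty.items() if a >= hi]
    infeasible = [d for d, a in acc_by_difficulty.items() if a <= lo]
    lower = max(feasible) if feasible else None
    upper = min(infeasible) if infeasible else None
    return lower, upper


def band_width(band: Band) -> Optional[float]:
    """Width of the band, or None if an edge is missing."""
    lower, upper = band
    if lower is None or upper is None:
        return None
    return upper - lower


# Hypothetical measurements before and after a targeted intervention.
before = {1: 0.98, 2: 0.93, 3: 0.70, 4: 0.35, 5: 0.08, 6: 0.02}
after = {1: 0.99, 2: 0.97, 3: 0.95, 4: 0.91, 5: 0.55, 6: 0.09}

print(pfrb_band(before), band_width(pfrb_band(before)))
print(pfrb_band(after), band_width(pfrb_band(after)))
```

In this toy comparison both edges move toward higher difficulty and the band narrows, which is the outcome the workflow is looking for.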

7. Theoretical and Practical Significance

The PFRB, as formalized by RBF++, bridges the gap between largely qualitative assessments of LLM reasoning and rigorous, model-agnostic quantification of cognitive performance ceilings. The framework’s harmonic-mean combination law and its constant and division mechanisms provide a compositional approach to understanding both measurable and unmeasurable task structures. Experimental results establish scaling relationships between reasoning boundaries and benchmark accuracy, supporting the central theoretical insight that PFRBs delimit the regimes of partial capability, and thus the focus of optimization, in real-world modeling. The framework enables both interpretability and actionable model improvement by rendering the boundaries of reasoning competence measurable and mutable (Chen et al., 19 May 2025).


References

Chen et al. (19 May 2025). RBF++: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning.
