Failed-Step Fraction (FSF) in CoT Reasoning

Updated 24 September 2025

FSF is defined as the fraction of failed (abandoned) steps in a process graph, capturing the structural integrity of a reasoning trace.
Empirical studies reveal that lower FSF values are strongly correlated with higher accuracy in chain-of-thought outputs from large language models.
FSF enables targeted interventions, such as candidate selection and branch editing, to improve the overall quality and performance of reasoning systems.

The Failed-Step Fraction (FSF) is a quantitative metric that measures the proportion of failed steps (specifically, steps residing in abandoned or dead-end branches) within a process made up of discrete, structured steps. FSF has recently emerged as a central quality indicator in chain-of-thought (CoT) reasoning for large language models (LLMs), and related notions can also be formalized for graph processes and other algorithmic settings. FSF focuses on the structural integrity of the reasoning or propagation process: a lower FSF indicates fewer failed exploratory efforts and generally higher overall effectiveness.

1. Formal Definition and Computation

The Failed-Step Fraction is defined with respect to a process that can be represented as a directed or rooted graph, where each node corresponds to a discrete step (e.g., logical deduction, propagation, or a partial proof), and the process may branch or revisit previously recorded states.

Given a stepwise traced graph $G$ (for example, a reasoning trace in a chain-of-thought), let $\# \text{failed nodes}$ denote the number of steps that are part of branches ultimately abandoned (failed in the sense that those lines of reasoning do not contribute to the final solution), and $\# \text{all nodes}$ the total number of steps (nodes) in the graph. Then,

$\text{FSF} = \frac{\#\ \text{failed nodes}}{\#\ \text{all nodes}}$

as introduced in "What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT" (Feng et al., 23 Sep 2025).

In practice, FSF is computed in three steps. First, a reasoning graph is extracted from each CoT (for instance, using structural cues or explicit delimiters between branches). Second, failed nodes are identified as those on branches abandoned before completion or known to propagate error. Third, FSF is calculated as the fraction defined above.
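The computation described above can be sketched as follows. This is a minimal illustration, not the authors' extraction pipeline: it assumes the reasoning graph has already been extracted as a tree, and the `Step` class, its `abandoned` flag, and the example trace are hypothetical names introduced here. A node counts as failed if it lies anywhere inside a branch marked as abandoned.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One node of a reasoning tree (hypothetical structure for illustration)."""
    text: str
    abandoned: bool = False  # True if this step is the start of an abandoned branch
    children: list = field(default_factory=list)


def count_nodes(step, inside_failed=False):
    """Return (failed, total) node counts for the subtree rooted at `step`."""
    failed_here = inside_failed or step.abandoned
    failed = 1 if failed_here else 0
    total = 1
    for child in step.children:
        child_failed, child_total = count_nodes(child, failed_here)
        failed += child_failed
        total += child_total
    return failed, total


def failed_step_fraction(root):
    """FSF = (# failed nodes) / (# all nodes)."""
    failed, total = count_nodes(root)
    return failed / total


# Made-up trace: one abandoned branch (2 nodes) and one successful branch (3 nodes).
trace = Step("Read the problem", children=[
    Step("Try brute force", abandoned=True, children=[
        Step("Too many cases, give up"),
    ]),
    Step("Use a symmetry argument", children=[
        Step("Reduce to a single case", children=[
            Step("State the final answer"),
        ]),
    ]),
])

print(failed_step_fraction(trace))  # 2 failed nodes out of 6
```

In this toy trace, 2 of the 6 nodes lie on the abandoned branch, so the FSF is 1/3.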

2. Applications in Chain-of-Thought Reasoning

When applied to chain-of-thought reasoning outputs from large reasoning models (LRMs), FSF operates as a structure-sensitive metric that goes beyond surface-level attributes such as total step count or proportion of review actions.

In systematic evaluations across ten large reasoning models on mathematics and scientific tasks, lower FSF is consistently associated with higher accuracy (Feng et al., 23 Sep 2025).
FSF robustly outperforms CoT length and Review Ratio in conditional correlation analyses for predicting correctness, indicating that the length of a trace and the amount of reviewing it contains are less informative about correctness than the extent of its structural failure.
FSF is computed on a per-CoT basis, using the underlying reasoning structure to count nodes and failed branches.

In summary, FSF provides a process-aware metric for quality assessment in reasoning tasks, targeting the failure rate of exploratory reasoning within a trace.
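One way to picture a conditional correlation analysis is to correlate FSF with correctness within each question, so that question difficulty does not drive the result. The sketch below assumes that "conditional" means conditioning on the question, which the text above does not spell out; the function names and the toy records are hypothetical, and the numbers are made up for illustration only.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation; assumes both inputs have nonzero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def within_question_correlation(records):
    """records: iterable of (question_id, fsf, correct) with correct in {0, 1}.

    Subtracts each question's mean FSF and mean accuracy before correlating,
    so only variation between traces of the same question remains.
    """
    groups = defaultdict(list)
    for question_id, fsf, correct in records:
        groups[question_id].append((fsf, correct))

    xs, ys = [], []
    for rows in groups.values():
        mean_fsf = sum(f for f, _ in rows) / len(rows)
        mean_acc = sum(c for _, c in rows) / len(rows)
        xs.extend(f - mean_fsf for f, _ in rows)
        ys.extend(c - mean_acc for _, c in rows)
    return pearson(xs, ys)


# Made-up records: (question id, FSF of the trace, 1 if the answer was correct).
records = [
    ("q1", 0.10, 1), ("q1", 0.50, 0), ("q1", 0.20, 1),
    ("q2", 0.60, 1), ("q2", 0.90, 0),
]
r = within_question_correlation(records)
print(r)  # negative on this toy data: higher FSF goes with lower correctness
```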

3. Comparison to Alternative Metrics: Length and Review Ratio

Traditional metrics for evaluating reasoning traces—such as total length (number of tokens, characters, or steps) and Review Ratio (fraction of review/checking steps)—are agnostic to the deeper process structure.

| Metric | Definition | Typical Correlation with Accuracy |
|---|---|---|
| CoT Length | Total tokens/steps in CoT trace | Weak or negative |
| Review Ratio | Proportion of CoT devoted to reviewing steps | Negative in most models |
| FSF | Fraction of nodes in failed branches | Strong negative (higher FSF → lower accuracy) |

Empirical studies show that CoT length can conflate verbosity with quality, and high Review Ratio rarely aids accuracy except for specific model outliers. FSF, in contrast, directly captures fault-prone exploration and has been shown to be the single most robust predictor of outcome correctness (Feng et al., 23 Sep 2025).
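To make the difference between the three metrics concrete, the sketch below computes all of them for a single annotated trace. The step annotations (`tokens`, `is_review`, `failed`) are hypothetical fields assumed to come from some upstream labeling pass, and Review Ratio is computed here over steps rather than tokens, which is one of the readings the definitions above allow.

```python
def trace_metrics(steps):
    """Compute length, Review Ratio, and FSF for one annotated trace.

    steps: list of dicts with keys
      'tokens'    (int)  number of tokens in the step
      'is_review' (bool) step re-checks earlier work
      'failed'    (bool) step lies on an abandoned branch
    """
    n_steps = len(steps)
    return {
        "length": sum(s["tokens"] for s in steps),
        "review_ratio": sum(1 for s in steps if s["is_review"]) / n_steps,
        "fsf": sum(1 for s in steps if s["failed"]) / n_steps,
    }


# Made-up trace with four steps.
steps = [
    {"tokens": 10, "is_review": False, "failed": False},
    {"tokens": 20, "is_review": True, "failed": False},
    {"tokens": 30, "is_review": False, "failed": True},
    {"tokens": 40, "is_review": False, "failed": False},
]
metrics = trace_metrics(steps)
print(metrics)
```

Note that length and Review Ratio can be read off a flat list of steps, whereas the `failed` flag presupposes that the branch structure of the trace has already been recovered.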

4. Causal Interventions: FSF-Guided Reasoning Selection and Editing

Two classes of experimental interventions substantiate the causal role of FSF in reasoning quality:

Test-Time Candidate Selection: Generating multiple candidate reasoning traces per input and selecting the CoT with the lowest FSF yields substantial pass@1 accuracy gains (e.g., 5–13% increases on AIME 2025, on par with or exceeding length- or review-based selection).
CoT Editing: Identifying and excising failed branches from a CoT, then continuing reasoning from the edited trace, leads to 8–14% absolute improvements in accuracy. Replacing a removed branch with a short summary also helps, but not as effectively as removing it outright. These effects indicate that failed paths bias subsequent reasoning and that minimizing FSF has a genuine causal benefit for downstream correctness.
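Test-time candidate selection as described above reduces to an argmin over candidate traces. The following sketch assumes each candidate already carries node counts from a graph-extraction step; the dictionary keys and the example candidates are hypothetical.

```python
def fsf(candidate):
    """FSF of one candidate, from precomputed node counts."""
    return candidate["failed_nodes"] / candidate["total_nodes"]


def select_by_fsf(candidates):
    """Return the candidate trace with the lowest FSF (first one wins ties)."""
    return min(candidates, key=fsf)


# Made-up candidates sampled for the same question.
candidates = [
    {"answer": "42", "failed_nodes": 6, "total_nodes": 20},  # FSF 0.30
    {"answer": "17", "failed_nodes": 1, "total_nodes": 25},  # FSF 0.04
    {"answer": "42", "failed_nodes": 4, "total_nodes": 10},  # FSF 0.40
]
best = select_by_fsf(candidates)
print(best["answer"])
```

Unlike majority voting, this rule ignores how often an answer appears and relies only on the structure of each trace.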

These results confirm that FSF measures a critical structural property whose minimization improves the utility of model-generated reasoning traces.
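The editing intervention can be pictured as pruning abandoned subtrees from the reasoning tree and rebuilding the remaining steps into a prefix from which the model continues. The sketch below covers only the pruning and linearization; the `Step` class and its `abandoned` flag are hypothetical, and the continuation step (calling a model on the edited prefix) is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One node of a reasoning tree (hypothetical structure for illustration)."""
    text: str
    abandoned: bool = False  # True if this step is the start of an abandoned branch
    children: list = field(default_factory=list)


def prune_failed(step):
    """Return a copy of the subtree with every abandoned branch removed."""
    kept = [prune_failed(child) for child in step.children if not child.abandoned]
    return Step(step.text, step.abandoned, kept)


def linearize(step):
    """Depth-first list of step texts, e.g. for rebuilding a prompt prefix."""
    texts = [step.text]
    for child in step.children:
        texts.extend(linearize(child))
    return texts


# Made-up trace with one abandoned branch.
root = Step("Problem statement", children=[
    Step("Try induction", abandoned=True, children=[
        Step("Base case fails"),
    ]),
    Step("Use symmetry", children=[
        Step("Answer: 6"),
    ]),
])

edited = linearize(prune_failed(root))
print(edited)  # ['Problem statement', 'Use symmetry', 'Answer: 6']
```

`prune_failed` builds a new tree rather than mutating the original, so the unedited trace stays available for comparison.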

5. Use of FSF in Graph Theoretic and Algorithmic Contexts

The concept of FSF, while formalized for CoT reasoning, connects naturally to propagation-based processes on graphs.

In zero forcing processes on graphs (Abara et al., 2022), a “failed zero forcing set” is a stalled configuration from which propagation cannot force full graph coverage under a specified rule. Here, a natural analog of FSF is the largest fraction of vertices that can be colored blue while the process still remains stalled:

$\text{FSF}(G) = \frac{F(G)}{|G|}$

where $F(G)$ is the failed zero forcing number and $|G|$ is the order of the graph.

High FSF values are characteristic of graphs that resist propagation for many initializations, while low FSF signals easy or rapid spread.
This structural interpretation generalizes: in any process where stepwise propagation may stall or fail, the FSF quantifies the extent of such stalling in a normalized manner.
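For small graphs, the graph-theoretic analog can be computed by brute force. The sketch below assumes the standard zero forcing color change rule (a blue vertex with exactly one white neighbor forces that neighbor to turn blue) and takes $F(G)$ to be the size of the largest vertex set that fails to force the whole graph. The function names are hypothetical, and the search is exponential in the number of vertices, so it is only meant for toy examples.

```python
from itertools import combinations


def closure(adj, blue):
    """Apply the color change rule repeatedly until no vertex can force."""
    blue = set(blue)
    changed = True
    while changed:
        changed = False
        for v in list(blue):
            white = [u for u in adj[v] if u not in blue]
            if len(white) == 1:  # v has exactly one white neighbor: force it
                blue.add(white[0])
                changed = True
    return blue


def failed_zero_forcing_number(adj):
    """Largest size of an initial blue set that does not force the whole graph."""
    n = len(adj)
    if n == 0:
        raise ValueError("graph must have at least one vertex")
    for k in range(n - 1, -1, -1):
        for subset in combinations(adj, k):
            if len(closure(adj, subset)) < n:
                return k


def graph_fsf(adj):
    """FSF(G) = F(G) / |G|."""
    return failed_zero_forcing_number(adj) / len(adj)


# Adjacency lists for the path on 4 vertices and the cycle on 4 vertices.
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
cycle4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}

print(graph_fsf(path4))   # 0.25: only a single interior vertex stalls
print(graph_fsf(cycle4))  # 0.5: two opposite vertices stall
```

On the path, any two blue vertices eventually force everything, so the largest stalled set has size 1; on the cycle, two opposite vertices each see two white neighbors and never force.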

In both reasoning traces and combinatorial propagation, methods for estimating or bounding FSF rely on partitioning the underlying graph, identifying modules or bottlenecks, and systematically characterizing stalled or abandoned branches.

6. Empirical Impact and Implications

The introduction and operationalization of FSF have significant implications for the development, evaluation, and steering of large reasoning models and related graph algorithms.

Model evaluation: FSF serves as an informative and easily computable metric to triage or select reasoning traces likely to produce correct answers.
Model design: The negative relationship between FSF and correctness suggests that encouraging models to minimize failed exploratory branches—by decoupling, trimming, or discouraging errant lines of reasoning—can yield higher quality generations.
Structure-aware scaling: FSF-guided selection and editing endorse a paradigm shift from length-centric or review-centric reasoning toward structure-aware strategies.
Graph process analysis: FSF, as a normalized failure metric, supports deeper analysis of process stalling in combinatorial and networked systems.

A plausible implication is that future model prompting and decoding strategies in automated reasoning and related fields will use FSF or structurally analogous metrics to enforce or prefer minimal-failure, robust inference chains.

7. Summary

FSF, defined rigorously as the fraction of failed steps or nodes in a process graph, has proven to be a robust, causally relevant quality measure for structured stepwise reasoning. Recent work demonstrates that it predicts correctness better than traditional token-level metrics and that it is useful as both an observational and an interventional tool. The metric captures the essential structural signature of failing less, providing a framework for both evaluation and targeted improvement in reasoning systems and process-driven combinatorial algorithms (Feng et al., 23 Sep 2025, Abara et al., 2022).

References

Feng et al. What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT (2025)

Abara et al. Minimum rank and failed zero forcing number of graphs (2022)
