
ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates (2502.06772v2)

Published 10 Feb 2025 in cs.CL, cs.AI, and cs.LG

Abstract: We present that hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space and outperform the mathematical reasoning capabilities of powerful LLMs like OpenAI o1-preview and DeepSeek V3. We train our ReasonFlux-32B model with only 8 GPUs and introduce three innovations: (i) a structured and generic thought template library, containing around 500 high-level thought templates capable of generalizing to similar or relevant reasoning problems; (ii) performing hierarchical reinforcement learning on a sequence of thought templates instead of long CoTs, optimizing a base LLM to plan out an optimal template trajectory for gradually handling complex problems; (iii) a brand new inference scaling system that enables hierarchical LLM reasoning by adaptively scaling thought templates at inference time. With a template trajectory containing more explainable reasoning structures than DeepSeek-R1 and o3-mini, our ReasonFlux-32B significantly advances math reasoning capabilities to state-of-the-art levels. Notably, on the MATH benchmark, it achieves an accuracy of 91.2% and surpasses o1-preview by 6.7%. On the USA Math Olympiad (AIME) benchmark, ReasonFlux-32B solves an average of 56.7% of problems, surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively. Code: https://github.com/Gen-Verse/ReasonFlux

Summary

  • The paper introduces a hierarchical LLM framework that uses a 500-template thought library and reinforcement learning to optimize reasoning trajectories.
  • It employs an inference scaling system that adaptively selects high-level thought templates, achieving 91.2% accuracy on the MATH benchmark.
  • The model, using a 32B-parameter LLM trained on 8 GPUs, outperforms competitors by significant margins on math problem benchmarks.

I understand you're interested in the paper "ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates." The paper presents ReasonFlux, a hierarchical LLM reasoning framework designed to optimize the reasoning search space. This approach reportedly enhances mathematical reasoning capabilities, surpassing those of models like OpenAI's o1-preview and DeepSeek V3. The 32B parameter ReasonFlux model was trained using 8 GPUs and incorporates three main innovations: a structured thought template library, hierarchical reinforcement learning, and an inference scaling system.

The paper claims the following contributions:

  • The creation of a structured thought template library containing approximately 500 high-level, generic thought templates. These templates are designed to generalize to similar reasoning problems.
  • Hierarchical reinforcement learning is performed on a sequence of thought templates instead of raw Chain-of-Thought (CoT) data. This optimizes a base LLM to plan an optimal template trajectory for addressing complex problems gradually.
  • A new inference scaling system that facilitates hierarchical LLM reasoning by scaling thought templates adaptively during inference.

The paper presents numerical results:

  • ReasonFlux-32B achieves a 91.2% accuracy on the MATH benchmark, outperforming o1-preview by 6.7%.
  • On the USA Math Olympiad (AIME) benchmark, ReasonFlux-32B solves 56.7% of problems, exceeding o1-preview and DeepSeek-V3 by 27% and 45%, respectively.

The paper suggests that current LLM reasoning strategies can be categorized into deliberate search methods (e.g., Tree of Thoughts (ToT) and Graph of Thoughts (GoT)) and reward-model-guided methods. It argues that these existing strategies suffer from high computational costs and limited generalization ability due to their reliance on manually designed search strategies and instance/step-level rewards.

To address these limitations, the authors propose ReasonFlux, which employs Retrieval-Augmented Generation (RAG) to automatically retrieve relevant high-level thought templates at inference time and configure optimal thought template trajectories. The paper details the construction of a structured template library with 500 thought templates. Instead of optimizing long CoT trajectories, hierarchical reinforcement learning is performed on high-level thought templates, optimizing a base LLM to learn an optimal thought template trajectory. Finally, the paper introduces an inference scaling system that simplifies the search for reasoning paths and improves reasoning ability on complex problems by selecting an appropriate high-level template for each sub-problem.

The structured thought template library consists of templates $T_i$ containing: $T_{\text{nam}}$ (name), $T_{\text{tag}}$ (tags for retrieval), $T_{\text{des}}$ (description), $T_{\text{sco}}$ (scope), $T_a$ (application steps), and $T_{\text{exa}}$ (examples). The entire library $\mathcal{D}_{\text{temp}}$ is the set of thought templates:

$\mathcal{D}_{\text{temp}} = \{T_1, T_2, \ldots, T_m\}$

where $m$ is the total number of templates.
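
To make the library structure concrete, here is a minimal sketch of how a thought template $T_i$ and the library $\mathcal{D}_{\text{temp}}$ might be represented, together with a simple tag-overlap retrieval helper. The field names and the retrieval scoring are illustrative assumptions, not the paper's released implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ThoughtTemplate:
    """One entry T_i of the structured template library D_temp (fields are illustrative)."""
    name: str                  # T_nam: template name
    tags: list[str]            # T_tag: tags used for retrieval
    description: str           # T_des: underlying principle
    scope: str                 # T_sco: applicable problem scope
    steps: list[str]           # T_a: high-level application steps
    examples: list[str] = field(default_factory=list)  # T_exa: worked examples

class TemplateLibrary:
    """D_temp = {T_1, ..., T_m}; retrieval here is a naive tag-overlap score."""
    def __init__(self, templates: list[ThoughtTemplate]):
        self.templates = templates

    def retrieve(self, query_tags: list[str], k: int = 1) -> list[ThoughtTemplate]:
        # Rank templates by the number of shared tags and return the top-k.
        ranked = sorted(
            self.templates,
            key=lambda t: len(set(t.tags) & set(query_tags)),
            reverse=True,
        )
        return ranked[:k]

# Usage: retrieve a template for a problem tagged with inequality-related concepts.
library = TemplateLibrary([
    ThoughtTemplate(
        name="AM-GM inequality",
        tags=["inequality", "optimization", "symmetric"],
        description="Bound a symmetric sum from below via the geometric mean of its terms.",
        scope="Minimizing symmetric sums of positive reals.",
        steps=["Identify positive terms", "Apply AM-GM", "Check the equality condition"],
    ),
])
print(library.retrieve(["inequality", "minimum"], k=1)[0].name)
```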

The paper describes a hierarchical reinforcement learning process that begins by using the structured template library $\mathcal{D}_{\text{temp}}$ to create a training dataset $\mathcal{D}_{\text{train}}$. The training dataset contains template names $T_{\text{nam}}$, their associated tags $T_{\text{tag}}$, detailed descriptions of their underlying principles $T_{\text{des}}$, and a clear delineation of their applicable scopes $T_{\text{sco}}$, represented as tuples $(T_{\text{nam}}, T_{\text{tag}}, T_{\text{des}}, T_{\text{sco}})$. The process involves fine-tuning a base LLM, denoted $\pi$, on this dataset $\mathcal{D}_{\text{train}}$ according to the optimization objective:

$\mathcal{L}_{\text{struct}} = -\mathbb{E}_{\mathcal{D}_{\text{train}}} \left[ \log \pi(T_{\text{des}}, T_{\text{sco}} \mid T_{\text{nam}}, T_{\text{tag}}) \right]$
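
As a rough sketch, this objective can be computed per tuple by masking the conditioning tokens $(T_{\text{nam}}, T_{\text{tag}})$ out of a causal-LM loss so that only $(T_{\text{des}}, T_{\text{sco}})$ contributes, i.e. $-\log \pi(T_{\text{des}}, T_{\text{sco}} \mid T_{\text{nam}}, T_{\text{tag}})$. The model identifier and prompt format below are assumptions for illustration; the paper fine-tunes a 32B base model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # small stand-in model, purely illustrative
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def struct_loss(name: str, tags: list[str], description: str, scope: str) -> torch.Tensor:
    """-log pi(T_des, T_sco | T_nam, T_tag) for a single template tuple."""
    prompt = f"Template: {name}\nTags: {', '.join(tags)}\n"
    target = f"Description: {description}\nScope: {scope}"
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + target, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore the conditioning tokens in the loss
    return model(input_ids=full_ids, labels=labels).loss

loss = struct_loss(
    "AM-GM inequality",
    ["inequality", "optimization"],
    "Bound a symmetric sum from below via the geometric mean of its terms.",
    "Minimizing symmetric sums of positive reals.",
)
loss.backward()  # an optimizer step would follow in a full training loop
```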

Preference learning on thought template trajectories involves the fine-tuned LLM $\pi_{\text{struct}}$ planning a sequence of high-level thought templates (i.e., a thought template trajectory $\mathbb{T}_{\text{traj}}$) for an input problem $x$, associating each step with the most relevant template from the library. Given an input problem $x$, $\pi_{\text{struct}}$ analyzes and abstracts the problem's conditional information, identifying the core mathematical concepts and relationships involved. Based on this abstract representation, the navigator $\pi_{\text{struct}}$ configures a trajectory $\mathbb{T}_{\text{traj}} = \{s_1, s_2, \ldots, s_n\}$, where each $s_i$ represents a high-level step in the reasoning process, associated with a specific template $T_i$ whose name is retrieved from the library and which can be used to solve that step. Each retrieved template $T_i$ is then instantiated with specific details from the input problem $x$ and provides fine-grained guidance to a separate inference LLM, denoted $\pi_{\text{inf}}$, to solve the problem.
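
The planning-then-instantiation flow might look roughly like the sketch below, which reuses the ThoughtTemplate and TemplateLibrary classes from the earlier sketch. The navigator_generate and inference_generate helpers are hypothetical placeholders for calls to $\pi_{\text{struct}}$ and $\pi_{\text{inf}}$, not interfaces defined by the paper.

```python
def navigator_generate(problem: str) -> list[tuple[str, str]]:
    """pi_struct: abstract the problem and plan a template trajectory.
    Returns (high-level step s_i, name of the template to retrieve) pairs. Placeholder."""
    raise NotImplementedError  # would prompt the fine-tuned navigator LLM

def inference_generate(problem: str, step: str, template: ThoughtTemplate) -> str:
    """pi_inf: instantiate one high-level step into concrete reasoning. Placeholder."""
    raise NotImplementedError  # would prompt the separate inference LLM

def solve(problem: str, library: TemplateLibrary) -> list[str]:
    """Plan T_traj with the navigator, then instantiate each step with pi_inf."""
    trajectory = navigator_generate(problem)  # T_traj = {s_1, ..., s_n}
    instantiated = []
    for step, template_name in trajectory:
        # Look up the retrieved template T_i by name, then ground s_i in problem-specific details.
        template = next(t for t in library.templates if t.name == template_name)
        instantiated.append(inference_generate(problem, step, template))
    return instantiated
```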

To measure the effectiveness and generalization ability of a given trajectory, a set of problems $\mathcal{X}_{\text{sim}}$ similar to the original input problem $x$ (including $x$ itself) is used. The instantiated templates along the trajectory $\mathbb{T}_{\text{traj}}$ guide $\pi_{\text{inf}}$ in solving each problem $x_i \in \mathcal{X}_{\text{sim}}$. The average accuracy achieved by $\pi_{\text{inf}}$ across these problems serves as the trajectory reward $R(\mathbb{T}_{\text{traj}})$. Formally:

$R(\mathbb{T}_{\text{traj}}) = \frac{1}{|\mathcal{X}_{\text{sim}}|} \sum_{x_i \in \mathcal{X}_{\text{sim}}} \text{Acc}(\pi_{\text{inf}}(x_i, \mathbb{T}_{\text{traj}}))$

where $\text{Acc}(\pi_{\text{inf}}(x_i, \mathbb{T}_{\text{traj}}))$ represents the accuracy of $\pi_{\text{inf}}$ in solving problem $x_i$ when guided by the trajectory $\mathbb{T}_{\text{traj}}$.

The loss function for optimizing $\pi_{\text{struct}}$ is:

$\mathcal{L}_{\text{TTR}}(\theta) = -\mathbb{E}_{(x, (\mathbb{T}_{\text{traj}}^+, \mathbb{T}_{\text{traj}}^-)) \sim \mathcal{D}_{\text{pair}}} \left[ \log \sigma \left( \beta \log \frac{\pi_{\theta}(\mathbb{T}_{\text{traj}}^+ \mid x)}{\pi_{\text{sft}}(\mathbb{T}_{\text{traj}}^+ \mid x)} - \beta \log \frac{\pi_{\theta}(\mathbb{T}_{\text{traj}}^- \mid x)}{\pi_{\text{sft}}(\mathbb{T}_{\text{traj}}^- \mid x)} \right) \right]$

where $\mathcal{D}_{\text{pair}}$ is a dataset of optimization pairs, $\mathbb{T}_{\text{traj}}^+$ and $\mathbb{T}_{\text{traj}}^-$ are two trajectories with $R(\mathbb{T}_{\text{traj}}^+) > R(\mathbb{T}_{\text{traj}}^-)$, $\pi_{\theta}$ is the LLM being optimized, and $\pi_{\text{sft}}$ is the supervised fine-tuned LLM.

The inference scaling system leverages automatically planned trajectories and dynamically retrieved thought templates. Given an input problem $x$, ReasonFlux extracts the core mathematical concepts and relationships, represented as $a(x)$, and configures an optimal template trajectory $\mathbb{T}_{\text{traj}}^* = \{s_1^*, s_2^*, \ldots, s_n^*\}$. Each step $s_i^*$ within the trajectory is associated with a template name $T_{\text{nam}}$ and tags $T_{\text{tag}}$ for retrieval. The retrieval process can be represented as:

$T_{\text{rag}} = \text{ReasonFlux}(\{T_{\text{nam}}^i, T_{\text{tag}}^i\}_{i=1}^n, \mathcal{D}_{\text{temp}})$

ReasonFlux instructs $\pi_{\text{inf}}$ to instantiate each step $s_i^*$ with the corresponding template $T_i$ and problem-specific details from $x$, transforming it into a concrete instantiated reasoning step $\hat{s}_i$:

$\hat{s}_i = \pi_{\text{inf}}(x_i, s_i, T_i)$

After obtaining the instantiated step $\hat{s}_i$, it is evaluated and analyzed by ReasonFlux, represented as the process $\delta_i = \text{ReasonFlux}(\mathbb{T}_{\text{traj}}^*, \hat{s}_i)$. Based on this evaluation and analysis, ReasonFlux decides whether to refine the trajectory, potentially adjusting subsequent steps or retrieving alternative templates. This iterative refinement can be expressed as:

$\mathbb{T}_{\text{traj}}^* \leftarrow \text{ReasonFlux}(\mathbb{T}_{\text{traj}}^*, \delta_i)$
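
To ground the reward definition, the following sketch computes $R(\mathbb{T}_{\text{traj}})$ as the average accuracy of $\pi_{\text{inf}}$ over $\mathcal{X}_{\text{sim}}$. It reuses inference_generate and TemplateLibrary from the sketches above; check_answer is a hypothetical verifier, not a component specified by the paper.

```python
def check_answer(final_step: str, reference: str) -> bool:
    """Hypothetical verifier: does the produced final step contain the reference answer?"""
    return reference.strip() in final_step

def trajectory_reward(
    trajectory: list[tuple[str, str]],   # T_traj as (step, template name) pairs
    similar_problems: list[str],         # X_sim, which includes the original problem x
    reference_answers: list[str],
    library: TemplateLibrary,
) -> float:
    """R(T_traj): mean accuracy of pi_inf on X_sim when guided by the trajectory."""
    correct = 0
    for problem, answer in zip(similar_problems, reference_answers):
        steps = [
            inference_generate(problem, step, next(t for t in library.templates if t.name == name))
            for step, name in trajectory
        ]
        correct += int(check_answer(steps[-1], answer))
    return correct / len(similar_problems)
```

The preference objective $\mathcal{L}_{\text{TTR}}$ takes the familiar DPO form once the trajectory log-probabilities under $\pi_{\theta}$ and $\pi_{\text{sft}}$ are available (e.g., by summing token log-likelihoods of each trajectory). A minimal PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def ttr_loss(
    logp_theta_pos: torch.Tensor,  # log pi_theta(T_traj^+ | x)
    logp_theta_neg: torch.Tensor,  # log pi_theta(T_traj^- | x)
    logp_sft_pos: torch.Tensor,    # log pi_sft(T_traj^+ | x)
    logp_sft_neg: torch.Tensor,    # log pi_sft(T_traj^- | x)
    beta: float = 0.1,
) -> torch.Tensor:
    """L_TTR: prefer the higher-reward trajectory relative to the SFT reference policy."""
    margin = beta * (logp_theta_pos - logp_sft_pos) - beta * (logp_theta_neg - logp_sft_neg)
    return -F.logsigmoid(margin).mean()

# Toy batch of two preference pairs (the log-probabilities are made-up numbers).
loss = ttr_loss(
    torch.tensor([-10.2, -8.7]), torch.tensor([-12.5, -11.0]),
    torch.tensor([-11.0, -9.0]), torch.tensor([-12.0, -10.5]),
)
print(loss.item())
```

Finally, the inference-time loop of instantiation, evaluation, and trajectory refinement could be organized as below; evaluate_step and refine_trajectory are hypothetical stand-ins for the navigator's analysis and adjustment calls.

```python
def evaluate_step(trajectory, instantiated_step: str):
    """delta_i = ReasonFlux(T_traj*, s_hat_i): analyze the instantiated step. Placeholder."""
    raise NotImplementedError

def refine_trajectory(trajectory, delta, current_index: int):
    """T_traj* <- ReasonFlux(T_traj*, delta_i): adjust later steps or swap templates. Placeholder."""
    raise NotImplementedError

def scale_inference(problem: str, library: TemplateLibrary) -> list[str]:
    trajectory = navigator_generate(problem)  # T_traj* = {s_1*, ..., s_n*}
    instantiated, i = [], 0
    while i < len(trajectory):
        step, name = trajectory[i]
        template = next(t for t in library.templates if t.name == name)  # retrieve T_i
        s_hat = inference_generate(problem, step, template)              # s_hat_i
        delta = evaluate_step(trajectory, s_hat)                         # delta_i
        trajectory = refine_trajectory(trajectory, delta, i)             # may revise remaining steps
        instantiated.append(s_hat)
        i += 1
    return instantiated
```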
