
ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates (2502.06772v2)

Published 10 Feb 2025 in cs.CL, cs.AI, and cs.LG

Abstract: We present that hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space and outperform the mathematical reasoning capabilities of powerful LLMs like OpenAI o1-preview and DeepSeek V3. We train our ReasonFlux-32B model with only 8 GPUs and introduce three innovations: (i) a structured and generic thought template library, containing around 500 high-level thought templates capable of generalizing to similar or relevant reasoning problems; (ii) performing hierarchical reinforcement learning on a sequence of thought templates instead of long CoTs, optimizing a base LLM to plan out an optimal template trajectory for gradually handling complex problems; (iii) a brand new inference scaling system that enables hierarchical LLM reasoning by adaptively scaling thought templates at inference time. With a template trajectory containing more explainable reasoning structures than DeepSeek-R1 and o3-mini, our ReasonFlux-32B significantly advances math reasoning capabilities to state-of-the-art levels. Notably, on the MATH benchmark, it achieves an accuracy of 91.2% and surpasses o1-preview by 6.7%. On the USA Math Olympiad (AIME) benchmark, ReasonFlux-32B solves an average of 56.7% of problems, surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively. Code: https://github.com/Gen-Verse/ReasonFlux

Summary

  • The paper introduces a hierarchical LLM framework that uses a 500-template thought library and reinforcement learning to optimize reasoning trajectories.
  • It employs an inference scaling system that adaptively selects high-level thought templates, achieving 91.2% accuracy on the MATH benchmark.
  • The model, using a 32B-parameter LLM trained on 8 GPUs, outperforms competitors by significant margins on math problem benchmarks.

I understand you're interested in the paper "ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates." The paper presents ReasonFlux, a hierarchical LLM reasoning framework designed to optimize the reasoning search space. This approach reportedly enhances mathematical reasoning capabilities, surpassing those of models like OpenAI's o1-preview and DeepSeek V3. The 32B parameter ReasonFlux model was trained using 8 GPUs and incorporates three main innovations: a structured thought template library, hierarchical reinforcement learning, and an inference scaling system.

The paper claims the following contributions:

  • The creation of a structured thought template library containing approximately 500 high-level, generic thought templates. These templates are designed to generalize to similar reasoning problems.
  • Hierarchical reinforcement learning is performed on a sequence of thought templates instead of raw Chain-of-Thought (CoT) data. This optimizes a base LLM to plan an optimal template trajectory for addressing complex problems gradually.
  • A new inference scaling system that facilitates hierarchical LLM reasoning by scaling thought templates adaptively during inference.

The paper presents numerical results:

  • ReasonFlux-32B achieves a 91.2% accuracy on the MATH benchmark, outperforming o1-preview by 6.7%.
  • On the USA Math Olympiad (AIME) benchmark, ReasonFlux-32B solves 56.7% of problems, exceeding o1-preview and DeepSeek-V3 by 27% and 45%, respectively.

The paper suggests that current LLM reasoning strategies can be categorized into deliberate search methods (e.g., Tree of Thoughts (ToT) and Graph of Thoughts (GoT)) and reward-model-guided methods. It argues that these existing strategies suffer from high computational costs and limited generalization ability due to their reliance on manually designed search strategies and instance/step-level rewards.

To address these limitations, the authors propose ReasonFlux, which employs Retrieval-Augmented Generation (RAG) to automatically retrieve relevant high-level thought templates at inference time and configure optimal thought template trajectories. The paper details the construction of a structured template library with 500 thought templates. Instead of optimizing long CoT trajectories, hierarchical reinforcement learning is performed on high-level thought templates, optimizing a base LLM to learn an optimal thought template trajectory. Finally, the paper introduces an inference scaling system that simplifies the search for reasoning paths and improves reasoning ability on complex problems by selecting an appropriate high-level template for each sub-problem.

The structured thought template library consists of templates $T_i$ containing: $T_{\text{nam}}$ (name), $T_{\text{tag}}$ (tags for retrieval), $T_{\text{des}}$ (description), $T_{\text{sco}}$ (scope), $T_a$ (application steps), and $T_{\text{exa}}$ (examples). The entire library $\mathcal{D}_{\text{temp}}$ is the set of thought templates:

$\mathcal{D}_{\text{temp}} = \{T_1, T_2, \ldots, T_m\}$

where $m$ is the total number of templates.
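
To make the library structure concrete, here is a minimal sketch of how a thought template $T_i$ and the library $\mathcal{D}_{\text{temp}}$ might be represented, together with a simple tag-overlap retrieval helper. The field names and the retrieval scoring are illustrative assumptions, not the paper's released implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ThoughtTemplate:
    """One entry T_i of the structured template library D_temp (fields are illustrative)."""
    name: str                  # T_nam: template name
    tags: list[str]            # T_tag: tags used for retrieval
    description: str           # T_des: underlying principle
    scope: str                 # T_sco: applicable problem scope
    steps: list[str]           # T_a: high-level application steps
    examples: list[str] = field(default_factory=list)  # T_exa: worked examples

class TemplateLibrary:
    """D_temp = {T_1, ..., T_m}; retrieval here is a naive tag-overlap score."""
    def __init__(self, templates: list[ThoughtTemplate]):
        self.templates = templates

    def retrieve(self, query_tags: list[str], k: int = 1) -> list[ThoughtTemplate]:
        # Rank templates by the number of shared tags and return the top-k.
        ranked = sorted(
            self.templates,
            key=lambda t: len(set(t.tags) & set(query_tags)),
            reverse=True,
        )
        return ranked[:k]

# Usage: retrieve a template for a problem tagged with inequality-related concepts.
library = TemplateLibrary([
    ThoughtTemplate(
        name="AM-GM inequality",
        tags=["inequality", "optimization", "symmetric"],
        description="Bound a symmetric sum from below via the geometric mean of its terms.",
        scope="Minimizing symmetric sums of positive reals.",
        steps=["Identify positive terms", "Apply AM-GM", "Check the equality condition"],
    ),
])
print(library.retrieve(["inequality", "minimum"], k=1)[0].name)
```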

The paper describes a hierarchical reinforcement learning process that begins by using the structured template library $\mathcal{D}_{\text{temp}}$ to create a training dataset $\mathcal{D}_{\text{train}}$. The training dataset contains template names $T_{\text{nam}}$, their associated tags $T_{\text{tag}}$, detailed descriptions of their underlying principles $T_{\text{des}}$, and a clear delineation of their applicable scopes $T_{\text{sco}}$, represented as tuples $(T_{\text{nam}}, T_{\text{tag}}, T_{\text{des}}, T_{\text{sco}})$. The process involves fine-tuning a base LLM, denoted $\pi$, on this dataset $\mathcal{D}_{\text{train}}$ according to the optimization objective:

$\mathcal{L}_{\text{struct}} = -\mathbb{E}_{\mathcal{D}_{\text{train}}} \left[ \log \pi(T_{\text{des}}, T_{\text{sco}} \mid T_{\text{nam}}, T_{\text{tag}}) \right]$
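
As a rough sketch, this objective can be computed per tuple by masking the conditioning tokens $(T_{\text{nam}}, T_{\text{tag}})$ out of a causal-LM loss so that only $(T_{\text{des}}, T_{\text{sco}})$ contributes, i.e. $-\log \pi(T_{\text{des}}, T_{\text{sco}} \mid T_{\text{nam}}, T_{\text{tag}})$. The model identifier and prompt format below are assumptions for illustration; the paper fine-tunes a 32B base model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # small stand-in model, purely illustrative
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def struct_loss(name: str, tags: list[str], description: str, scope: str) -> torch.Tensor:
    """-log pi(T_des, T_sco | T_nam, T_tag) for a single template tuple."""
    prompt = f"Template: {name}\nTags: {', '.join(tags)}\n"
    target = f"Description: {description}\nScope: {scope}"
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + target, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore the conditioning tokens in the loss
    return model(input_ids=full_ids, labels=labels).loss

loss = struct_loss(
    "AM-GM inequality",
    ["inequality", "optimization"],
    "Bound a symmetric sum from below via the geometric mean of its terms.",
    "Minimizing symmetric sums of positive reals.",
)
loss.backward()  # an optimizer step would follow in a full training loop
```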

Preference learning on thought template trajectories involves the fine-tuned LLM $\pi_{\text{struct}}$ planning a sequence of high-level thought templates (i.e., a thought template trajectory $\mathbb{T}_{\text{traj}}$) for an input problem $x$, associating each step with the most relevant template from the library. Given an input problem $x$, $\pi_{\text{struct}}$ analyzes and abstracts the problem's conditional information, identifying the core mathematical concepts and relationships involved. Based on this abstract representation, the navigator $\pi_{\text{struct}}$ configures a trajectory $\mathbb{T}_{\text{traj}} = \{s_1, s_2, \ldots, s_n\}$, where each $s_i$ represents a high-level step in the reasoning process, associated with a specific template $T_i$ whose name is retrieved from the library and which can be used to solve that step. Each retrieved template $T_i$ is then instantiated with specific details from the input problem $x$ and provides fine-grained guidance to a separate inference LLM, denoted $\pi_{\text{inf}}$, to solve the problem.
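
The planning-then-instantiation flow might look roughly like the sketch below, which reuses the ThoughtTemplate and TemplateLibrary classes from the earlier sketch. The navigator_generate and inference_generate helpers are hypothetical placeholders for calls to $\pi_{\text{struct}}$ and $\pi_{\text{inf}}$, not interfaces defined by the paper.

```python
def navigator_generate(problem: str) -> list[tuple[str, str]]:
    """pi_struct: abstract the problem and plan a template trajectory.
    Returns (high-level step s_i, name of the template to retrieve) pairs. Placeholder."""
    raise NotImplementedError  # would prompt the fine-tuned navigator LLM

def inference_generate(problem: str, step: str, template: ThoughtTemplate) -> str:
    """pi_inf: instantiate one high-level step into concrete reasoning. Placeholder."""
    raise NotImplementedError  # would prompt the separate inference LLM

def solve(problem: str, library: TemplateLibrary) -> list[str]:
    """Plan T_traj with the navigator, then instantiate each step with pi_inf."""
    trajectory = navigator_generate(problem)  # T_traj = {s_1, ..., s_n}
    instantiated = []
    for step, template_name in trajectory:
        # Look up the retrieved template T_i by name, then ground s_i in problem-specific details.
        template = next(t for t in library.templates if t.name == template_name)
        instantiated.append(inference_generate(problem, step, template))
    return instantiated
```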

To measure the effectiveness and generalization ability of a given trajectory, a set of problems $\mathcal{X}_{\text{sim}}$ similar to the original input problem $x$ (including $x$ itself) is used. The instantiated templates along the trajectory $\mathbb{T}_{\text{traj}}$ guide $\pi_{\text{inf}}$ in solving each problem $x_i \in \mathcal{X}_{\text{sim}}$. The average accuracy achieved by $\pi_{\text{inf}}$ across these problems serves as the trajectory reward $R(\mathbb{T}_{\text{traj}})$. Formally:

$R(\mathbb{T}_{\text{traj}}) = \frac{1}{|\mathcal{X}_{\text{sim}}|} \sum_{x_i \in \mathcal{X}_{\text{sim}}} \text{Acc}(\pi_{\text{inf}}(x_i, \mathbb{T}_{\text{traj}}))$

where $\text{Acc}(\pi_{\text{inf}}(x_i, \mathbb{T}_{\text{traj}}))$ represents the accuracy of $\pi_{\text{inf}}$ in solving problem $x_i$ when guided by the trajectory $\mathbb{T}_{\text{traj}}$.

The loss function for optimizing $\pi_{\text{struct}}$ is:

$\mathcal{L}_{\text{TTR}}(\theta) = -\mathbb{E}_{(x, (\mathbb{T}_{\text{traj}}^+, \mathbb{T}_{\text{traj}}^-)) \sim \mathcal{D}_{\text{pair}}} \left[ \log \sigma \left( \beta \log \frac{\pi_{\theta}(\mathbb{T}_{\text{traj}}^+ \mid x)}{\pi_{\text{sft}}(\mathbb{T}_{\text{traj}}^+ \mid x)} - \beta \log \frac{\pi_{\theta}(\mathbb{T}_{\text{traj}}^- \mid x)}{\pi_{\text{sft}}(\mathbb{T}_{\text{traj}}^- \mid x)} \right) \right]$

where $\mathcal{D}_{\text{pair}}$ is a dataset of optimization pairs, $\mathbb{T}_{\text{traj}}^+$ and $\mathbb{T}_{\text{traj}}^-$ are two trajectories with $R(\mathbb{T}_{\text{traj}}^+) > R(\mathbb{T}_{\text{traj}}^-)$, $\pi_{\theta}$ is the LLM being optimized, and $\pi_{\text{sft}}$ is the supervised fine-tuned LLM.

The inference scaling system leverages automatically planned trajectories and dynamically retrieved thought templates. Given an input problem $x$, ReasonFlux extracts the core mathematical concepts and relationships, represented as $a(x)$, and configures an optimal template trajectory $\mathbb{T}_{\text{traj}}^* = \{s_1^*, s_2^*, \ldots, s_n^*\}$. Each step $s_i^*$ within the trajectory is associated with a template name $T_{\text{nam}}$ and tags $T_{\text{tag}}$ for retrieval. The retrieval process can be represented as:

$T_{\text{rag}} = \text{ReasonFlux}(\{T_{\text{nam}}^i, T_{\text{tag}}^i\}_{i=1}^n, \mathcal{D}_{\text{temp}})$

ReasonFlux instructs $\pi_{\text{inf}}$ to instantiate each step $s_i^*$ with the corresponding template $T_i$ and problem-specific details from $x$, transforming it into a concrete instantiated reasoning step $\hat{s}_i$:

$\hat{s}_i = \pi_{\text{inf}}(x_i, s_i, T_i)$

After obtaining the instantiated step $\hat{s}_i$, it is evaluated and analyzed by ReasonFlux, represented as the process $\delta_i = \text{ReasonFlux}(\mathbb{T}_{\text{traj}}^*, \hat{s}_i)$. Based on this evaluation and analysis, ReasonFlux decides whether to refine the trajectory, potentially adjusting subsequent steps or retrieving alternative templates. This iterative refinement can be expressed as:

$\mathbb{T}_{\text{traj}}^* \leftarrow \text{ReasonFlux}(\mathbb{T}_{\text{traj}}^*, \delta_i)$
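
To ground the reward definition, the following sketch computes $R(\mathbb{T}_{\text{traj}})$ as the average accuracy of $\pi_{\text{inf}}$ over $\mathcal{X}_{\text{sim}}$. It reuses inference_generate and TemplateLibrary from the sketches above; check_answer is a hypothetical verifier, not a component specified by the paper.

```python
def check_answer(final_step: str, reference: str) -> bool:
    """Hypothetical verifier: does the produced final step contain the reference answer?"""
    return reference.strip() in final_step

def trajectory_reward(
    trajectory: list[tuple[str, str]],   # T_traj as (step, template name) pairs
    similar_problems: list[str],         # X_sim, which includes the original problem x
    reference_answers: list[str],
    library: TemplateLibrary,
) -> float:
    """R(T_traj): mean accuracy of pi_inf on X_sim when guided by the trajectory."""
    correct = 0
    for problem, answer in zip(similar_problems, reference_answers):
        steps = [
            inference_generate(problem, step, next(t for t in library.templates if t.name == name))
            for step, name in trajectory
        ]
        correct += int(check_answer(steps[-1], answer))
    return correct / len(similar_problems)
```

The preference objective $\mathcal{L}_{\text{TTR}}$ takes the familiar DPO form once the trajectory log-probabilities under $\pi_{\theta}$ and $\pi_{\text{sft}}$ are available (e.g., by summing token log-likelihoods of each trajectory). A minimal PyTorch sketch:

```python
import torch
import torch.nn.functional as F

def ttr_loss(
    logp_theta_pos: torch.Tensor,  # log pi_theta(T_traj^+ | x)
    logp_theta_neg: torch.Tensor,  # log pi_theta(T_traj^- | x)
    logp_sft_pos: torch.Tensor,    # log pi_sft(T_traj^+ | x)
    logp_sft_neg: torch.Tensor,    # log pi_sft(T_traj^- | x)
    beta: float = 0.1,
) -> torch.Tensor:
    """L_TTR: prefer the higher-reward trajectory relative to the SFT reference policy."""
    margin = beta * (logp_theta_pos - logp_sft_pos) - beta * (logp_theta_neg - logp_sft_neg)
    return -F.logsigmoid(margin).mean()

# Toy batch of two preference pairs (the log-probabilities are made-up numbers).
loss = ttr_loss(
    torch.tensor([-10.2, -8.7]), torch.tensor([-12.5, -11.0]),
    torch.tensor([-11.0, -9.0]), torch.tensor([-12.0, -10.5]),
)
print(loss.item())
```

Finally, the inference-time loop of instantiation, evaluation, and trajectory refinement could be organized as below; evaluate_step and refine_trajectory are hypothetical stand-ins for the navigator's analysis and adjustment calls.

```python
def evaluate_step(trajectory, instantiated_step: str):
    """delta_i = ReasonFlux(T_traj*, s_hat_i): analyze the instantiated step. Placeholder."""
    raise NotImplementedError

def refine_trajectory(trajectory, delta, current_index: int):
    """T_traj* <- ReasonFlux(T_traj*, delta_i): adjust later steps or swap templates. Placeholder."""
    raise NotImplementedError

def scale_inference(problem: str, library: TemplateLibrary) -> list[str]:
    trajectory = navigator_generate(problem)  # T_traj* = {s_1*, ..., s_n*}
    instantiated, i = [], 0
    while i < len(trajectory):
        step, name = trajectory[i]
        template = next(t for t in library.templates if t.name == name)  # retrieve T_i
        s_hat = inference_generate(problem, step, template)              # s_hat_i
        delta = evaluate_step(trajectory, s_hat)                         # delta_i
        trajectory = refine_trajectory(trajectory, delta, i)             # may revise remaining steps
        instantiated.append(s_hat)
        i += 1
    return instantiated
```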
