
SplitReason: Learning To Offload Reasoning (2504.16379v1)

Published 23 Apr 2025 in cs.CL

Abstract: Reasoning in LLMs tends to produce substantially longer token generation sequences than simpler language modeling tasks. This extended generation length reflects the multi-step, compositional nature of reasoning and is often correlated with higher solution accuracy. From an efficiency perspective, longer token generation exacerbates the inherently sequential and memory-bound decoding phase of LLMs. However, not all parts of this expensive reasoning process are equally difficult to generate. We leverage this observation by offloading only the most challenging parts of the reasoning process to a larger, more capable model, while performing most of the generation with a smaller, more efficient model; furthermore, we teach the smaller model to identify these difficult segments and independently trigger offloading when needed. To enable this behavior, we annotate difficult segments across 18k reasoning traces from the OpenR1-Math-220k chain-of-thought (CoT) dataset. We then apply supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT) to a 1.5B-parameter reasoning model, training it to learn to offload the most challenging parts of its own reasoning process to a larger model. This approach improves AIME24 reasoning accuracy by 24% and 28.3% while offloading 1.35% and 5% of the generated tokens respectively. We open-source our SplitReason model, data, code and logs.

Summary

SplitReason: Efficient Reasoning through Selective Model Offloading

The paper "SplitReason: Learning To Offload Reasoning" introduces an innovative approach aimed at optimizing both the accuracy and efficiency of reasoning tasks performed by LLMs. The central thesis of the research is the concept of selectively offloading complex reasoning segments to a larger, more capable model, while utilizing a smaller model for the majority of token generation. This method is predicated on the observation that certain reasoning segments are inherently more difficult and computationally expensive, thereby warranting the involvement of a larger model.

The research begins by addressing the inefficiency of reasoning tasks: their extended token generation exacerbates the inherently sequential, memory-bound decoding phase of LLMs. The authors propose a two-model arrangement, termed SplitReason, in which a smaller model generates the simpler segments and learns to autonomously trigger offloading of challenging segments to a larger model. To enable this behavior, they apply supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT) to a 1.5B-parameter model, using a dataset of 18k reasoning traces from OpenR1-Math-220k in which the difficult segments have been annotated.
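
One plausible way such annotations become SFT targets is sketched below: difficult spans are wrapped in offload markers so the small model learns when to emit them. The data schema and helper here are assumptions for illustration, not the released dataset's exact format.

```python
# Illustrative sketch: convert an annotated chain-of-thought trace into an SFT
# target by wrapping hard spans in offload markers. Schema is assumed.

OFFLOAD_START, OFFLOAD_END = "<bigmodel>", "</bigmodel>"

def build_sft_target(trace: str, hard_spans: list[tuple[int, int]]) -> str:
    """Insert offload markers around annotated (start, end) character spans."""
    out, cursor = [], 0
    for start, end in sorted(hard_spans):
        out.append(trace[cursor:start])  # easy segment, kept verbatim
        out.append(OFFLOAD_START + trace[start:end] + OFFLOAD_END)  # hard segment
        cursor = end
    out.append(trace[cursor:])
    return "".join(out)

# Example: one hard step inside an otherwise simple trace.
trace = "Compute 2+2=4. Now solve the quartic by substitution... Thus x=3."
target = build_sft_target(trace, hard_spans=[(15, 55)])
# -> 'Compute 2+2=4. <bigmodel>Now solve the quartic by
#     substitution...</bigmodel> Thus x=3.'
```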

Quantitative results show marked accuracy improvements on benchmarks such as AIME24: accuracy improves by 24% and 28.3% while only a minimal fraction (1.35% and 5%, respectively) of the generated tokens is offloaded to the larger model. Because the expensive model touches so few tokens, these accuracy gains come with little additional decoding cost, and the method delivers a substantial end-to-end speedup over running the larger model exclusively.
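
A back-of-envelope calculation makes the cost argument concrete. The 20x per-token cost ratio between the large and small model below is an assumed, illustrative figure, not a measurement from the paper:

```python
# Rough decode-cost estimate under selective offloading.
RATIO = 20.0  # assumed large:small per-token decode cost; illustrative only

def relative_decode_cost(offload_frac: float, ratio: float = RATIO) -> float:
    """Decode cost relative to running the small model alone for every token."""
    return (1.0 - offload_frac) + offload_frac * ratio

for frac in (0.0135, 0.05):  # offload fractions reported in the paper
    cost = relative_decode_cost(frac)
    print(f"offload {frac:.2%}: {cost:.2f}x small-model cost, "
          f"{cost / RATIO:.3f}x large-model cost")
# offload 1.35%: 1.26x small-model cost, 0.063x large-model cost
# offload 5.00%: 1.95x small-model cost, 0.098x large-model cost
```

Under this assumption, even the 5% offload setting stays at roughly a tenth of the cost of decoding every token with the large model, which is the intuition behind the reported speedups.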

The implications of this research are multifaceted. Practically, SplitReason improves the operational efficiency of LLMs, enabling cost-effective deployment, and it opens a path to running complex reasoning in real-world applications where computational resources are constrained. Theoretically, it paves the way for further exploration of decentralized and collaborative model architectures, highlighting the potential of hierarchical model pairings tailored to task difficulty.

Future work spurred by this research may include refining the training protocol to optimize the smaller model's switching behavior, better modeling accuracy under true offloading, and generalizing the approach to varied multi-model configurations across diverse linguistic and non-linguistic tasks.

In conclusion, "SplitReason: Learning To Offload Reasoning" presents a robust framework for improving inference performance on reasoning-heavy tasks through a dynamic model allocation strategy. By demonstrating substantial gains in both accuracy and efficiency, it makes a meaningful contribution, showing how strategic, task-sensitive collaboration between models can improve the effectiveness of LLM inference.
