- The paper introduces a reinforcement learning framework that enables large language models to perform modular, multi-round reasoning, overcoming fixed context limitations.
- It employs an outcome-based reward within a GRPO framework and LoRA-based fine-tuning, reaching pass@1 accuracies of 48.6% on MATH500 and 6.67% on AIME2024.
- The approach demonstrates enhanced sample efficiency, simplicity in reward design, and scalability for complex reasoning tasks.
MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs
The paper "MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs" (2507.02851) introduces a reinforcement learning (RL) framework for training LLMs to perform modular, multi-round reasoning that overcomes the inherent limitations of fixed context size. The authors propose a method—MOTIF—that enables LLMs to reason over extended contexts by decomposing complex reasoning tasks into multiple inference rounds, each producing partial progress, and then aggregating these steps to reach a final answer.
Motivation and Context
Recent work has established a positive correlation between the number of reasoning tokens generated by LLMs and their accuracy on complex tasks, particularly in mathematical and logical reasoning. However, the finite context window of transformer-based LLMs imposes a hard limit on the number of tokens that can be attended to, constraining the depth and breadth of reasoning. While some proprietary models have increased context sizes, and various architectural innovations have been proposed to extend attention span, these approaches either require significant computational resources or introduce additional complexity.
Multi-round inference architectures, such as those in INFTYTHINK and related works, have demonstrated that iterative reasoning—where the model is prompted to make incremental progress over several rounds—can effectively extend the model's reasoning capabilities beyond its native context window. However, prior RL-based approaches for such architectures often rely on process-based or stepwise rewards, require additional supervision, or use dual-model systems, which complicate training and deployment.
Methodology
MOTIF is designed to train LLMs for multi-round, modular reasoning using a simple, outcome-based reward function within the Group Relative Policy Optimization (GRPO) RL framework. The key components of the approach are:
- Multi-Round Inference Protocol: The LLM is prompted to solve a problem in three rounds. In each round, it produces a `<reasoning>` section (detailing its thought process) and an `<answer>` section (summarizing progress). The output from each round is fed as additional context into the next round, and only in the final round is the model expected to produce the boxed final answer (a minimal sketch of this loop appears after the list).
- Outcome-Based Reward: Instead of assigning rewards to intermediate steps, MOTIF evaluates the probability that a first-round response leads to a correct final answer after subsequent rounds. For each first-round output, multiple multi-round trajectories are sampled, and the reward is the average accuracy of the final answers, plus a format reward for adherence to the required output structure.
- Parameter-Efficient Fine-Tuning: The method uses LoRA for efficient adaptation of the Qwen2.5-3B-Instruct model, updating only a small fraction of parameters.
- Sample Efficiency: To ensure a fair comparison, MOTIF is trained with only 15% of the data used for vanilla GRPO, matching the wall-clock training time of the baseline.
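To make the protocol concrete, here is a minimal sketch of the three-round inference loop described in the first bullet. The `<reasoning>`/`<answer>` tag names follow the paper, but the `generate` helper, the prompt wording, and the `extract_tag` parsing are illustrative assumptions rather than the authors' exact implementation.

```python
import re

def extract_tag(text, tag):
    """Pull the contents of <tag>...</tag> from a completion (illustrative parsing)."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()

def multi_round_solve(generate, question, rounds=3):
    """Modular reasoning protocol: each round sees the question plus the previous
    round's <answer> summary; only the last round must produce \\boxed{...}."""
    carried_progress = ""
    for r in range(1, rounds + 1):
        prompt = question if r == 1 else f"{question}\n\nProgress so far:\n{carried_progress}"
        if r == rounds:
            prompt += "\n\nThis is the final round: give the final answer in \\boxed{}."
        completion = generate(prompt)  # any LLM call returning <reasoning>/<answer> text
        carried_progress = extract_tag(completion, "answer")
    return carried_progress  # final-round answer, expected to contain the boxed result
```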
The training pipeline is summarized in the following pseudocode:
```python
def motif_training(model, dataset, rounds=3, m=8, k=4, epochs=1):
    """MOTIF training loop: GRPO with an outcome-based reward on first-round outputs."""
    for epoch in range(epochs):
        for q, a in dataset:
            # Sample m candidate first-round responses for question q.
            first_round_outputs = [model.infer(q, round=1, temp=0.8) for _ in range(m)]
            rewards = []
            for o in first_round_outputs:
                # Estimate how often this first-round output leads to a correct final
                # answer by rolling out k multi-round continuations.
                final_answers = []
                for _ in range(k):
                    context, ans = q, o
                    for r in range(2, rounds + 1):
                        ans = model.infer(context + ans, round=r)
                    final_answers.append(extract_boxed(ans))
                accuracy_reward = sum(fa == a for fa in final_answers) / k
                format_reward = check_format(o)  # adherence to <reasoning>/<answer> structure
                rewards.append(accuracy_reward + format_reward)
            # GRPO policy update over the group of m first-round outputs.
            model.update_grpo(first_round_outputs, rewards)
```
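The `model.update_grpo` call above is left abstract. For reference, the core of a GRPO update is the group-relative advantage: each of the m rewards for a question is normalized against the group's mean and standard deviation before weighting the policy-gradient loss. Below is a minimal sketch of that normalization, following standard GRPO practice rather than code from the paper; the function name and example rewards are illustrative.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: z-score each reward within its group of m samples."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: rewards for m = 4 first-round outputs (accuracy + format terms).
print(group_relative_advantages([1.5, 0.5, 0.25, 1.0]))
# Outputs above the group mean get positive advantages; below-mean outputs get negative ones.
```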
Experimental Results
MOTIF was evaluated on the MATH500 and AIME2024 benchmarks, using pass@1 accuracy as the primary metric. The results are as follows:
| Model | MATH500 | AIME2024 |
| --- | --- | --- |
| Qwen2.5-3B-Instruct (base) | 37.6% | 0.0% |
| GRPO training | 44.8% | 3.33% |
| MOTIF training | 48.6% | 6.67% |
MOTIF achieves an absolute improvement of 3.8 percentage points over vanilla GRPO on MATH500 and 3.3 percentage points on AIME2024, despite using only 15% of the training samples. The training curves indicate that MOTIF reaches higher expected reward more rapidly and with shorter average response lengths per round, suggesting that modularization leads to more focused and efficient reasoning steps.
Implications and Discussion
The primary contribution of MOTIF is the demonstration that outcome-based RL, applied to modular, multi-round inference, can yield significant improvements in reasoning accuracy and sample efficiency for LLMs. The approach eliminates the need for process-level supervision or complex reward shaping, relying solely on the correctness of the final answer and adherence to output format.
Practical implications include:
- Scalability: The modular inference protocol can be applied to any LLM with a limited context window, enabling more complex reasoning without architectural changes.
- Sample Efficiency: The method achieves higher accuracy with fewer training samples, reducing computational cost and data requirements.
- Simplicity of Reward Design: The outcome-based reward is straightforward to implement and avoids reward hacking associated with process-based rewards.
Theoretical implications:
- The results support the hypothesis that decomposing reasoning into modular steps, each optimized for future correctness, can improve both the depth and reliability of LLM reasoning.
- The approach suggests a direction for RL in LLMs where the focus is on optimizing for end-task performance over multi-step trajectories, rather than stepwise correctness.
Future Directions
Potential avenues for further research include:
- Extending MOTIF to more rounds or adaptive round allocation based on task complexity.
- Applying the method to domains beyond mathematics, such as code generation or scientific reasoning.
- Investigating the integration of external memory or retrieval mechanisms to further enhance long-context reasoning.
- Exploring the combination of outcome-based and process-based rewards for tasks where intermediate correctness is also valuable.
MOTIF provides a practical and effective framework for training LLMs to reason modularly, offering both empirical improvements and a foundation for future work in scalable, sample-efficient RL for LLMs.