- The paper introduces a reinforcement learning framework that enables large language models to perform modular, multi-round reasoning, overcoming fixed context limitations.
- It employs an outcome-based reward within a GRPO framework and LoRA-based fine-tuning, reaching pass@1 accuracies of 48.6% on MATH500 and 6.67% on AIME2024.
- The approach demonstrates enhanced sample efficiency, simplicity in reward design, and scalability for complex reasoning tasks.
MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs
The paper "MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs" (2507.02851) introduces a reinforcement learning (RL) framework for training LLMs to perform modular, multi-round reasoning that overcomes the inherent limitations of fixed context size. The authors propose a method—MOTIF—that enables LLMs to reason over extended contexts by decomposing complex reasoning tasks into multiple inference rounds, each producing partial progress, and then aggregating these steps to reach a final answer.
Motivation and Context
Recent work has established a positive correlation between the number of reasoning tokens generated by LLMs and their accuracy on complex tasks, particularly in mathematical and logical reasoning. However, the finite context window of transformer-based LLMs imposes a hard limit on the number of tokens that can be attended to, constraining the depth and breadth of reasoning. While some proprietary models have increased context sizes, and various architectural innovations have been proposed to extend attention span, these approaches either require significant computational resources or introduce additional complexity.
Multi-round inference architectures, such as those in INFTYTHINK and related works, have demonstrated that iterative reasoning—where the model is prompted to make incremental progress over several rounds—can effectively extend the model's reasoning capabilities beyond its native context window. However, prior RL-based approaches for such architectures often rely on process-based or stepwise rewards, require additional supervision, or use dual-model systems, which complicate training and deployment.
Methodology
MOTIF is designed to train LLMs for multi-round, modular reasoning using a simple, outcome-based reward function within the Group Relative Policy Optimization (GRPO) RL framework. The key components of the approach are:
- Multi-Round Inference Protocol: The LLM is prompted to solve a problem in three rounds. In each round, it produces a `<reasoning>` section (detailing its thought process) and an `<answer>` section (summarizing progress). The output from each round is fed as additional context into the next round, and only in the final round is the model expected to produce the boxed final answer (a minimal sketch of this loop appears after the list).
- Outcome-Based Reward: Instead of assigning rewards to intermediate steps, MOTIF evaluates the probability that a first-round response leads to a correct final answer after subsequent rounds. For each first-round output, multiple multi-round trajectories are sampled, and the reward is the average accuracy of the final answers, plus a format reward for adherence to the required output structure.
- Parameter-Efficient Fine-Tuning: The method uses LoRA for efficient adaptation of the Qwen2.5-3B-Instruct model, updating only a small fraction of parameters.
- Sample Efficiency: To ensure a fair comparison, MOTIF is trained with only 15% of the data used for vanilla GRPO, matching the wall-clock training time of the baseline.
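To make the protocol concrete, here is a minimal sketch of the three-round inference loop described in the first bullet. The `<reasoning>`/`<answer>` tag names follow the paper, but the `generate` helper, the prompt wording, and the `extract_tag` parsing are illustrative assumptions rather than the authors' exact implementation.

```python
import re

def extract_tag(text, tag):
    """Pull the contents of <tag>...</tag> from a completion (illustrative parsing)."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()

def multi_round_solve(generate, question, rounds=3):
    """Modular reasoning protocol: each round sees the question plus the previous
    round's <answer> summary; only the last round must produce \\boxed{...}."""
    carried_progress = ""
    for r in range(1, rounds + 1):
        prompt = question if r == 1 else f"{question}\n\nProgress so far:\n{carried_progress}"
        if r == rounds:
            prompt += "\n\nThis is the final round: give the final answer in \\boxed{}."
        completion = generate(prompt)  # any LLM call returning <reasoning>/<answer> text
        carried_progress = extract_tag(completion, "answer")
    return carried_progress  # final-round answer, expected to contain the boxed result
```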
The training pipeline is summarized in the following pseudocode:
```python
def motif_training(model, dataset, rounds=3, m=8, k=4, epochs=1):
    """MOTIF training loop: GRPO with an outcome-based reward on first-round outputs."""
    for epoch in range(epochs):
        for q, a in dataset:
            # Sample m candidate first-round responses for question q.
            first_round_outputs = [model.infer(q, round=1, temp=0.8) for _ in range(m)]
            rewards = []
            for o in first_round_outputs:
                # Estimate how often this first-round output leads to a correct final
                # answer by rolling out k multi-round continuations.
                final_answers = []
                for _ in range(k):
                    context, ans = q, o
                    for r in range(2, rounds + 1):
                        ans = model.infer(context + ans, round=r)
                    final_answers.append(extract_boxed(ans))
                accuracy_reward = sum(fa == a for fa in final_answers) / k
                format_reward = check_format(o)  # adherence to <reasoning>/<answer> structure
                rewards.append(accuracy_reward + format_reward)
            # GRPO policy update over the group of m first-round outputs.
            model.update_grpo(first_round_outputs, rewards)
```
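The `model.update_grpo` call above is left abstract. For reference, the core of a GRPO update is the group-relative advantage: each of the m rewards for a question is normalized against the group's mean and standard deviation before weighting the policy-gradient loss. Below is a minimal sketch of that normalization, following standard GRPO practice rather than code from the paper; the function name and example rewards are illustrative.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: z-score each reward within its group of m samples."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: rewards for m = 4 first-round outputs (accuracy + format terms).
print(group_relative_advantages([1.5, 0.5, 0.25, 1.0]))
# Outputs above the group mean get positive advantages; below-mean outputs get negative ones.
```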
Experimental Results
MOTIF was evaluated on the MATH500 and AIME2024 benchmarks, using pass@1 accuracy as the primary metric. The results are as follows:
| Model | MATH500 | AIME2024 |
| --- | --- | --- |
| Qwen2.5-3B-Instruct (base) | 37.6% | 0.0% |
| GRPO training | 44.8% | 3.33% |
| MOTIF training | 48.6% | 6.67% |
MOTIF achieves an absolute improvement of 3.8 percentage points over vanilla GRPO on MATH500 and 3.3 percentage points on AIME2024, despite using only 15% of the training samples. The training curves indicate that MOTIF reaches higher expected reward more rapidly and with shorter average response lengths per round, suggesting that modularization leads to more focused and efficient reasoning steps.
Implications and Discussion
The primary contribution of MOTIF is the demonstration that outcome-based RL, applied to modular, multi-round inference, can yield significant improvements in reasoning accuracy and sample efficiency for LLMs. The approach eliminates the need for process-level supervision or complex reward shaping, relying solely on the correctness of the final answer and adherence to output format.
Practical implications include:
- Scalability: The modular inference protocol can be applied to any LLM with a limited context window, enabling more complex reasoning without architectural changes.
- Sample Efficiency: The method achieves higher accuracy with fewer training samples, reducing computational cost and data requirements.
- Simplicity of Reward Design: The outcome-based reward is straightforward to implement and avoids reward hacking associated with process-based rewards.
Theoretical implications:
- The results support the hypothesis that decomposing reasoning into modular steps, each optimized for future correctness, can improve both the depth and reliability of LLM reasoning.
- The approach suggests a direction for RL in LLMs where the focus is on optimizing for end-task performance over multi-step trajectories, rather than stepwise correctness.
Future Directions
Potential avenues for further research include:
- Extending MOTIF to more rounds or adaptive round allocation based on task complexity.
- Applying the method to domains beyond mathematics, such as code generation or scientific reasoning.
- Investigating the integration of external memory or retrieval mechanisms to further enhance long-context reasoning.
- Exploring the combination of outcome-based and process-based rewards for tasks where intermediate correctness is also valuable.
MOTIF provides a practical and effective framework for training LLMs to reason modularly, offering both empirical improvements and a foundation for future work in scalable, sample-efficient RL for LLMs.