Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models (2505.14810v2)

Published 20 May 2025 in cs.CL and cs.AI

Abstract: Instruction-following is essential for aligning LLMs with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, their ability to adhere to natural language instructions remains underexplored. In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability, as models that reason more effectively often struggle to comply with user directives. We find that models tuned on distilled long chains-of-thought or trained with reasoning-oriented reinforcement learning often degrade in instruction adherence, especially when generation length increases. Furthermore, we show that even simple interventions can partially recover obedience, though at the cost of reasoning performance. These findings highlight a fundamental tension in current LLM training paradigms and motivate the need for more instruction-aware reasoning models. We release the code and data at https://github.com/TingchenFu/MathIF.

Summary

An Insight into Instruction Following and Mathematical Reasoning in LLMs

The increasing capability of Large Reasoning Models (LRMs) has spotlighted their prowess at solving complex mathematical problems. Yet a critical area remains underexamined: the ability of these models to follow explicit instructions. This paper explores that issue in depth, introducing the MathIF benchmark to rigorously assess instruction-following in the context of mathematical reasoning tasks. The principal finding is that instruction adherence decreases notably as LRM reasoning capacity scales, with implications for model development and for the deployment of AI systems more broadly.

Core Findings and Contributions

  1. MathIF Benchmark Creation: The paper introduces MathIF, a benchmark designed specifically to evaluate instruction-following in mathematical domains. It consists of Python-verifiable constraints tailored to mathematical reasoning tasks, in contrast to benchmarks aimed at general-purpose instruction following (a minimal sketch of such a constraint check appears after this list).
  2. Instruction Following Versus Reasoning Capability: A critical insight from this paper is the observed trade-off between reasoning strength and instruction adherence. While LRMs demonstrate advanced reasoning abilities, this often comes at the expense of accurately following user-specified instructions, especially as the complexity and length of the generated reasoning chains increase.
  3. Instruction Following Performance Across Models: The paper evaluates 23 different LRMs using MathIF, uncovering a broad inability to maintain instruction fidelity, particularly at larger model scales. Qwen3 models stand out for relatively strong instruction-following across the sizes examined.
  4. Training Paradigms and Their Impact: The analysis shows that reasoning-oriented training, whether supervised fine-tuning on distilled long chains-of-thought or reasoning-focused reinforcement learning, tends to degrade instruction-following performance. This underscores a tension in current training paradigms: improving reasoning by encouraging longer chains of thought can inadvertently harm instruction adherence.
  5. Mitigation Strategies: Interestingly, the paper shows that simple interventions, such as shortening the reasoning chain or repeating the constraints after the reasoning and before the final answer, can recover some instruction-following ability. However, this improvement tends to come at the cost of reasoning performance.
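
To make the evaluation setup concrete, the sketch below illustrates what Python-verifiable constraints of this kind might look like. The specific checks (a word limit and a \boxed{} answer requirement) and all function names are illustrative assumptions, not the actual MathIF constraint definitions, which are available in the paper's repository.

```python
import re
from typing import Callable, Iterable

# Hypothetical constraint checkers, for illustration only; the real MathIF
# constraint definitions live in the paper's repository (TingchenFu/MathIF).

def within_word_limit(response: str, max_words: int = 200) -> bool:
    """Length constraint: the response may use at most `max_words` words."""
    return len(response.split()) <= max_words

def has_boxed_answer(response: str) -> bool:
    """Format constraint: the final answer must appear inside \\boxed{...}."""
    return re.search(r"\\boxed\{.+?\}", response) is not None

def follows_instructions(response: str,
                         checks: Iterable[Callable[[str], bool]]) -> bool:
    """A response counts as compliant only if every attached check passes."""
    return all(check(response) for check in checks)

# A model output that solves the problem but blows past the word limit would
# count as a reasoning success yet an instruction-following failure.
output = "... long chain of thought ...\nThe final answer is \\boxed{42}."
print(follows_instructions(output, [has_boxed_answer,
                                    lambda r: within_word_limit(r, 150)]))
```

Under this kind of setup, instruction-following is scored independently of answer correctness, which is what allows the paper to expose the trade-off between the two.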

Implications and Future Directions

The findings from this research highlight a crucial dilemma in aligning reasoning models with user intent. The inherent trade-off between reasoning excellence and instruction adherence points to fundamental challenges in the LRM development process. This signals a need for refining training methods to cultivate models that are both highly intelligent in reasoning and robust in obeying instructions.

The implications of this paper extend to potential applications in real-world AI systems where compliance with explicit user directives is critical, underscoring a safety consideration in deploying such models effectively. Future research may explore novel training algorithms or model architectures to address this dual requirement.

In conclusion, the paper makes a significant contribution to understanding the dual demands of reasoning and instruction fidelity in LRMs. The MathIF benchmark provides a foundational tool for future work on balancing intelligence and obedience in AI systems, a balance that will be pivotal for future advances in the field.
