L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning (2503.04697v1)

Published 6 Mar 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Reasoning LLMs have shown an uncanny ability to improve performance at test-time by ``thinking longer''-that is, by generating longer chain-of-thought sequences and hence using more compute. However, the length of their chain-of-thought reasoning is not controllable, making it impossible to allocate test-time compute to achieve a desired level of performance. We introduce Length Controlled Policy Optimization (LCPO), a simple reinforcement learning method that optimizes for accuracy and adherence to user-specified length constraints. We use LCPO to train L1, a reasoning LLM that produces outputs satisfying a length constraint given in its prompt. L1's length control allows for smoothly trading off computational cost and accuracy on a wide range of tasks, and outperforms the state-of-the-art S1 method for length control. Furthermore, we uncover an unexpected short chain-of-thought capability in models trained with LCPO. For instance, our 1.5B L1 model surpasses GPT-4o at equal reasoning lengths. Overall, LCPO enables precise control over reasoning length, allowing for fine-grained allocation of test-time compute and accuracy. We release code and models at https://www.cmu-l3.github.io/l1

Summary

  • The paper introduces Length Controlled Policy Optimization (LCPO), a reinforcement learning method for training reasoning language models to generate outputs adhering to user-specified length constraints.
  • Training the L1 model with LCPO demonstrates precise length control and achieves state-of-the-art accuracy, outperforming methods like S1 and matching GPT-4o performance at equivalent reasoning lengths.
  • LCPO generalizes across diverse reasoning tasks, allowing efficient trade-offs between computational cost and accuracy, and enabling highly competitive short Chain-of-Thought reasoning.

The paper introduces Length Controlled Policy Optimization (LCPO), a reinforcement learning method designed to give reasoning LLMs more control over the length of generated reasoning sequences. The method addresses the problem that current reasoning models lack the ability to control the length of their reasoning, which makes it impossible to allocate a test-time compute budget to achieve a target performance level.

The authors train L1, a reasoning language model (LM) that produces outputs satisfying a length constraint given in its prompt. L1's length control allows for trading off computational cost and accuracy on a range of tasks, and it outperforms the S1 method for length control. The 1.5B-parameter L1 model matches or surpasses GPT-4o's performance at equal reasoning lengths.

The key contributions of the paper are:

  • The introduction of LCPO, the first reinforcement learning (RL)-based method for training reasoning LMs to produce outputs adhering to user-specified length constraints.
  • The use of LCPO to train L1, which demonstrates precise length control and strong reasoning accuracy at fixed token budgets on math reasoning benchmarks.
  • A demonstration that L1's length control generalizes beyond math reasoning to diverse tasks, including logical reasoning and general-domain benchmarks such as MMLU (Massive Multitask Language Understanding).
  • A demonstration that LCPO-trained models can act as short-CoT (Chain of Thought) models, outperforming their non-reasoning counterparts and models such as GPT-4o, despite using the same token budget.

The paper discusses prior work in two key areas: test-time scaling in LLMs and length control in LLMs.

  • Test-Time Scaling in LLMs: Increasing test-time computation has consistently been shown to improve performance in tasks such as mathematical problem-solving and code generation. Approaches include parallel sampling of multiple reasoning paths, tree-based search, and iterative refinement techniques. Recent reasoning LMs simplify test-time scaling by generating extended reasoning traces (longer chains-of-thought). Despite their strong results, these methods lack control over the length of the generated reasoning chains, leading to suboptimal performance or unrealized efficiency gains. The presented work complements this line of research by enabling reasoning models to control the length of generated outputs, providing flexibility to calibrate inference compute based on task-specific requirements.
  • Length Control in LLMs: Controlling the length of LLM-generated outputs is an important consideration across various generation tasks. Approaches include architectural modifications, training-objective adjustments to enforce length constraints, or training models on instruction-style data labeled with desired output lengths. Previous work on length control largely falls into two use-case categories: reducing unnecessary verbosity while imposing maximum length budgets, or achieving token-level length adherence. Existing methods focus on general-purpose text generation or instruction-following contexts, where cost-quality efficiency trade-offs are less critical or left unaddressed. The presented work addresses the challenges specific to reasoning models. Recent works emphasize generating shorter reasoning chains for efficiency, but they do not enable explicit length control or alignment with user-specified inference budgets. Another line of work (S1) introduces "budget forcing" by imposing a token limit, but this strategy has drawbacks. In contrast to these prior works, LCPO is designed to train reasoning-specialized models for precise and adaptive length control, where models learn to dynamically allocate inference compute based on constraints provided in the prompt.

Method

The paper presents LCPO, a method for conditioning a model on a target token length provided in the prompt. Given an input prompt $x$ and a target length $n_{gold}$, the model is expected to generate a response $y$ whose length $n_y$ minimizes the absolute difference $|n_{gold} - n_y|$ while simultaneously producing the correct answer. This couples accuracy with output length, ensuring that the generated chains-of-thought adhere to user-specified constraints.

LCPO starts with a pre-trained reasoning LM $LLM_{\theta}$ and a dataset $D = \{(x_i, y_{gold,i})\}_{i=1}^N$, where each instance contains only the input prompt and the final answer. To enable length control, each prompt $x_i$ is augmented by appending a target length instruction:

$$x_i^{new} = \text{Concat}\bigl(x_i,\ \text{``Think for } n_{gold,i} \text{ tokens.''}\bigr),$$

where $n_{gold,i}$ is sampled uniformly from $\mathbb{Z}(n_{min}, n_{max})$. This augmentation yields a new dataset $D^{new} = \{(x_i^{new}, y_{gold,i})\}_{i=1}^N$.
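As a concrete illustration, a minimal Python sketch of this augmentation step might look as follows; the function and variable names are illustrative rather than taken from the released code, and the instruction wording follows the template above.

```python
import random

# Minimal sketch of the LCPO prompt augmentation (illustrative names, not the
# authors' code). Each training prompt is suffixed with a target length
# sampled uniformly over integers; the paper uses n_min = 100 and
# n_max = 4000 during training.
N_MIN, N_MAX = 100, 4000

def augment_prompt(prompt, n_gold=None):
    """Append the LCPO length instruction to a prompt; returns (prompt, n_gold)."""
    if n_gold is None:
        n_gold = random.randint(N_MIN, N_MAX)  # uniform over integers
    return f"{prompt} Think for {n_gold} tokens.", n_gold

# Example: build the augmented dataset D_new from (prompt, gold answer) pairs.
dataset = [("What is 17 * 24?", "408")]
augmented = [(augment_prompt(x)[0], y_gold) for x, y_gold in dataset]
```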

The authors then update $LLM_{\theta}$ using a reinforcement learning objective. In the experiments, they adopt GRPO, though the method is compatible with other RL algorithms. The reward function combines two terms: a correctness reward $r_c$ and a length penalty $r_{length}$. It is defined as

$$r(y, y_{gold}, n_{gold}) = \mathbb{I}(y = y_{gold}) - \alpha \cdot \bigl|n_{gold} - n_y\bigr|,$$

where:

  • $\mathbb{I}(\cdot)$ is the indicator function,
  • $n_y$ is the generated output length,
  • $\alpha$ is a scalar that regulates the trade-off between generating the correct answer and meeting the target length.

A lower value of $\alpha$ prioritizes correctness, whereas a higher value enforces stricter adherence to the length constraint.
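A minimal sketch of this reward, assuming a simple boolean correctness check and token counts computed elsewhere (neither the paper's answer-matching logic nor its tokenizer is reproduced here):

```python
def lcpo_exact_reward(is_correct, n_y, n_gold, alpha=0.0003):
    """LCPO-Exact reward: correctness indicator minus a scaled length deviation.

    is_correct : bool, whether the generated answer matches the gold answer
    n_y        : int, number of tokens in the generated output
    n_gold     : int, target length taken from the prompt
    alpha      : float, trade-off weight (the paper reports 0.0003)
    """
    return float(is_correct) - alpha * abs(n_gold - n_y)

# Example: a correct 1,100-token answer against a 1,024-token target.
print(lcpo_exact_reward(True, 1100, 1024))  # 1.0 - 0.0003 * 76 ≈ 0.9772
```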

At inference, the output length is controlled by selecting a fixed target length $n_{gold}$ (or a set of lengths) that is appended uniformly to every test prompt.

The authors further train a variant called L1-Max, which generates outputs of varying lengths while respecting a maximum length constraint. To train L1-Max, the authors fine-tune the L1-Exact model using the same RL framework but with a modified reward function:

$$r(y, y_{gold}, n_{gold}) = \mathbb{I}(y = y_{gold}) \cdot \text{clip}\bigl(\alpha \cdot (n_{gold} - n_y) + \delta,\ 0,\ 1\bigr),$$

where $\alpha$ controls the penalty for length violations. This formulation applies a soft constraint that gradually penalizes outputs exceeding the target length rather than imposing a hard cutoff (which is necessary to ensure gradient propagation in the GRPO objective), and it incentivizes the model to use fewer tokens when possible without sacrificing correctness. The $\delta = 0.5$ term ensures that correct answers with minor budget violations are still preferred over incorrect answers.

L1-Max is trained with a dual objective: when the prompt requests an exact length, the model uses the first reward function; otherwise, it defaults to the maximum-constraint mode using the second reward function.
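A sketch of the LCPO-Max reward and this dual-objective dispatch is shown below; the helper names and the default α value for the Max reward are illustrative assumptions, with δ = 0.5 taken from the paper.

```python
def _clip(x, lo, hi):
    return max(lo, min(hi, x))

def lcpo_max_reward(is_correct, n_y, n_gold, alpha=0.0003, delta=0.5):
    """LCPO-Max reward (sketch): correct answers score higher the further they
    stay under the budget; delta = 0.5 keeps slightly-over-budget correct
    answers preferable to incorrect ones. The alpha default is illustrative."""
    return float(is_correct) * _clip(alpha * (n_gold - n_y) + delta, 0.0, 1.0)

def lcpo_reward(exact_mode, is_correct, n_y, n_gold, alpha=0.0003, delta=0.5):
    """Dual-objective dispatch (sketch): exact-length prompts use the
    LCPO-Exact reward; otherwise the maximum-constraint reward applies."""
    if exact_mode:
        return float(is_correct) - alpha * abs(n_gold - n_y)  # LCPO-Exact
    return lcpo_max_reward(is_correct, n_y, n_gold, alpha, delta)
```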

Experimental Setup

The authors conduct training on the DeepScaleR-Preview-Dataset, a mathematics dataset consisting of 40K question-answer pairs drawn from AIME, AMC, Omni-Math, and STILL. The models are evaluated on the test sets of four math reasoning datasets (AIME 2025, MATH, AMC, and Olympiad-Bench) and additionally on GPQA, LSAT, and MMLU.

The base model is DeepScaleR-1.5B-Preview, a 1.5B-parameter model originally RL fine-tuned on this dataset with a 24K token context length. Due to compute constraints, the authors restrict the maximum context length to 4K tokens during training and 8K tokens during evaluation. The model is first fine-tuned for 700 steps with the LCPO-Exact objective, yielding L1-Exact, and then further RL fine-tuned for 120 steps with the LCPO-Max objective, yielding L1-Max.

The proposed method is evaluated against the following baselines:

  • DeepSeek-R1-Distill-Qwen-1.5B: the SFT version of Qwen-2.5-1.5B-Instruct, fine-tuned on reasoning traces from DeepSeek's R1 model.
  • DeepScaleR-1.5B-Preview: the original model, evaluated without any length control modifications.
  • DeepScaleR-1.5B-Preview-4K (referred to as Agentica-4K in the results): a version of the original Agentica-24K model fine-tuned with a 4K context length.
  • S1: a budget-forcing method, which controls reasoning length using test-time interventions.

The approaches are evaluated along two dimensions:

  1. The model's ability to adhere to the targeted length, reported as the mean deviation between the generated token length $n_y$ and the target $n_{gold}$.
  2. The problem-solving accuracy when generating responses at different target lengths.

In the experiments, target lengths are selected from $\{512, 1024, 2048, 3600\}$ tokens.
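To make the length-adherence evaluation concrete, the following sketch computes a mean (relative) deviation between generated and requested lengths; the relative normalization is an assumption consistent with the percentage errors reported in the results, not necessarily the paper's exact metric definition, and the example lengths are hypothetical.

```python
def mean_length_deviation(generated_lengths, target_lengths, relative=True):
    """Mean deviation |n_y - n_gold| over an evaluation set.

    With relative=True the deviation is normalized by the target length,
    matching the percentage-style errors reported in the results (an
    assumption, not necessarily the paper's exact definition).
    """
    deviations = [
        abs(n_y - n_gold) / (n_gold if relative else 1)
        for n_y, n_gold in zip(generated_lengths, target_lengths)
    ]
    return sum(deviations) / len(deviations)

# Hypothetical generated lengths at each target budget, for illustration only.
targets = [512, 1024, 2048, 3600]
generated = [530, 1010, 2100, 3550]
print(f"mean relative deviation: {mean_length_deviation(generated, targets):.1%}")
```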

For GRPO training, the authors adopt the same hyperparameters as in DeepScaleR-1.5B-Preview. In particular, they use a learning rate of 1e-6 and a batch size of 128. The maximum context length is set to 4K tokens at training time and extended to 8K tokens during evaluation. Training is performed for 700 steps using the VeRL framework.

During training, the target length $n_{gold}$ is sampled uniformly from $U(n_{min}, n_{max})$, where $n_{min} = 100$ and $n_{max} = 4000$. The balancing parameter $\alpha$ in the first reward equation is fixed at 0.0003.
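For reference, the reported training hyperparameters can be gathered into a single sketch; the key names are illustrative labels rather than VeRL configuration fields, and interpreting "4K"/"8K" as 4096/8192 tokens is an assumption.

```python
# Training hyperparameters reported in the paper, collected for reference.
# Keys are illustrative labels, not VeRL configuration fields.
LCPO_TRAINING_CONFIG = {
    "rl_algorithm": "GRPO",
    "learning_rate": 1e-6,
    "batch_size": 128,
    "max_context_train_tokens": 4096,  # "4K" during training (assumed 4096)
    "max_context_eval_tokens": 8192,   # "8K" during evaluation (assumed 8192)
    "steps_lcpo_exact": 700,           # yields L1-Exact
    "steps_lcpo_max": 120,             # yields L1-Max
    "n_gold_range": (100, 4000),       # uniform sampling of target lengths
    "alpha": 3e-4,                     # length-penalty weight in the reward
    "delta": 0.5,                      # slack term in the LCPO-Max reward
}
```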

Results and Analysis

The paper reports and analyzes the effectiveness of the proposed method (LCPO) across various settings and benchmarks, evaluating its performance relative to baselines, its generalization to out-of-domain tasks, its adherence to length constraints, and its competitiveness in short-CoT setups, and examining the reasoning behaviors it learns.

  • L1 achieves superior performance across all token budgets while maintaining length control. Compared to S1, the only other method designed for length control, L1 shows improvements of over 100-150% relative and 20-25% absolute at both 512 and 1024 token budgets. With L1, the authors observe a log-linear scaling pattern: performance improves linearly with the log-length of the generated reasoning chains. However, the scaling curve for L1 exhibits a smaller slope than S1's (0.24 vs. 0.37), indicating greater effectiveness at lower token ranges. L1-Exact performs approximately 1% below Agentica-4K, which is the same underlying model as L1 but trained without length constraints. L1-Max matches the performance of Agentica-4K by optimizing token usage based on problem difficulty while respecting the upper ceiling.
  • L1 generalizes to new domains: performance scales positively with token budget on OOD (out-of-domain) general reasoning datasets, approaching or matching Agentica-4K despite the length-control constraints. For GPQA and LSAT, the same linear performance scaling trend is observed, with L1 matching Agentica-4K's performance at the larger token budgets. For MMLU, the linear scaling relationship is less pronounced ($R^2 = 0.66$), because these knowledge-focused questions benefit less from extended reasoning.
  • L1 follows length constraints across mathematical reasoning datasets. The model maintains consistent control across all token budgets (512, 1024, 2048, and 3600 tokens), with observed output lengths closely matching the requested lengths. The mean error is close to 3% for all math reasoning datasets. Although OOD datasets exhibit higher errors (20-40%), these remain preferable to prompting without length control.
  • When compared to both its base non-reasoning model (Qwen-2.5-1.5B-Instruct) and significantly larger non-reasoning models (GPT-4o and Llama-3.3-70B) at comparable generation lengths, L1 consistently outperforms or matches all of them across all datasets while using equivalent token budgets. On average, L1 is 5% better than its non-reasoning counterpart and even outperforms GPT-4o by 2%.
  • An analysis of how frequently reasoning-related terms appear in outputs of different lengths shows that self-correction and verification keywords appear approximately twice as often in 4096-token outputs as in 512-token outputs. Similarly, conclusion-drawing terms increase 2-10x with larger token budgets. Most exploration-related keywords decrease in relative frequency at higher token counts, with "Alternatively" being a notable exception. Shorter CoTs exhibit reasoning patterns similar to their longer counterparts, but with relative frequencies shifted toward more self-verification and conclusion drawing in longer chains-of-thought.