Efficient Reasoning via Reward Model (2511.09158v1)
Abstract: Reinforcement learning with verifiable rewards (RLVR) has been shown to enhance the reasoning capabilities of LLMs, enabling the development of large reasoning models (LRMs). However, LRMs such as DeepSeek-R1 and OpenAI o1 often generate verbose responses containing redundant or irrelevant reasoning steps, a phenomenon known as overthinking, which substantially increases computational costs. Prior efforts to mitigate this issue commonly incorporate length penalties into the reward function, but we find they frequently suffer from two critical issues: length collapse and training collapse, resulting in sub-optimal performance. To address them, we propose a pipeline for training a Conciseness Reward Model (CRM) that scores the conciseness of a reasoning path. Additionally, we introduce a novel reward formulation named the Conciseness Reward Function (CRF) with an explicit dependency between the outcome reward and the conciseness score, thereby fostering both more effective and more efficient reasoning. From a theoretical standpoint, we demonstrate the superiority of the new reward from the perspective of variance reduction and improved convergence properties. On the practical side, extensive experiments on five mathematical benchmark datasets demonstrate the method's effectiveness and token efficiency, achieving an 8.1% accuracy improvement and a 19.9% reduction in response token length on Qwen2.5-7B. Furthermore, the method generalizes well to other LLMs, including Llama and Mistral. The implementation code and datasets are publicly available for reproduction: https://anonymous.4open.science/r/CRM
Explain it Like I'm 14
What is this paper about?
This paper looks at how to make AI models think more efficiently. Many advanced LLMs can solve hard math problems by writing out long “chains of thought.” That helps accuracy, but often they “overthink” — they repeat themselves, add unnecessary steps, and waste lots of tokens (which means more time and compute). The authors propose a way to reward models not just for getting the right answer, but also for being concise, so the model learns to think clearly and briefly without losing correctness.
What questions did the researchers ask?
The paper focuses on a few simple, practical questions:
- Can we teach a model to keep its reasoning short while still being correct?
- Why do common “length penalty” tricks sometimes break training or make the model do worse?
- Is there a better reward that balances correctness and conciseness?
- Does this new method improve results across different models and math benchmarks?
How did they do it?
Think of the model as a student. Traditionally, the student gets points only if the final answer is correct (this is called a “verifiable reward” and is common in math tasks). Some research tried to subtract points if the student wrote too much, but that caused two problems:
- Length collapse: the student writes extremely short answers, sometimes skipping real reasoning and “gaming” the system.
- Training collapse: the student’s performance drops badly — both accuracy and output length fall.
To fix this, the authors use two main ideas.
1) Conciseness Reward Model (CRM)
- Imagine a “conciseness judge” that reads a solution and gives it a score for how compact and focused it is.
- How they built it:
- They used a big math dataset (DeepMath-103K) with difficult, diverse problems and existing solutions.
- They asked a strong model to generate two versions of a solution: one that’s concise and one that’s more redundant.
- Another strong model scored these solutions on conciseness (considering repetition, irrelevant steps, and length without hurting quality).
- They trained a smaller model to predict these conciseness scores. This becomes the CRM — a learned judge that can score any reasoning path.
2) Conciseness Reward Function (CRF)
- The key trick: only apply the conciseness bonus if the answer is correct.
- In other words, correctness comes first. If the model is right, then shorter, cleaner reasoning earns extra points. If the model is wrong, it gets no conciseness reward.
- Why this helps:
- It reduces “reward hacking” where the model just prints a short final answer without proper reasoning.
- It gives more stable training by making the reward signal less noisy.
- Two helpful add-ons:
- Annealing coefficient: slowly reduce the weight of conciseness over training. Early on, you encourage conciseness more; later, you let the model settle.
- Difficulty coefficient: harder questions are allowed to have longer reasoning. So the penalty for length is lighter on tough problems and heavier on easy ones.
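Putting these pieces together, the sketch below illustrates the shape of the reward described above. It is a minimal Python illustration, not the authors' implementation: `score_conciseness` is a stand-in for the learned CRM, `alpha` is an illustrative weighting constant, and while the annealing form `exp(-step/T)` and a difficulty coefficient ranging from 1 to e are taken from descriptions elsewhere in this summary, their exact placement in the formula is an assumption.

```python
import math


def score_conciseness(reasoning_path: str) -> float:
    """Stand-in for the learned Conciseness Reward Model (CRM).

    The real CRM is a trained scorer; a fixed dummy value keeps the
    sketch runnable.
    """
    return 0.5


def crf_reward(is_correct: bool,
               reasoning_path: str,
               step: int,
               total_steps: int,
               group_accuracy: float,
               alpha: float = 0.5) -> float:
    """Correctness-gated conciseness reward in the spirit of the CRF."""
    outcome = 1.0 if is_correct else 0.0
    if not is_correct:
        # Wrong answers earn no conciseness bonus, removing the incentive
        # to print a short final answer without reasoning.
        return outcome

    # Annealing coefficient: the conciseness weight decays over training
    # (the summary cites s = exp(-step / T)).
    s = math.exp(-step / max(total_steps, 1))

    # Difficulty coefficient: illustrative form using in-batch accuracy as
    # a difficulty proxy, ranging from 1 (easy) to e (hard); dividing by it
    # softens the conciseness push on hard questions. The paper's exact
    # formula may differ.
    d_q = math.exp(1.0 - group_accuracy)

    c = score_conciseness(reasoning_path)  # CRM conciseness score
    return outcome + alpha * s * c / d_q
```

In a GRPO-style training loop, this scalar would simply take the place of the plain binary verifiable reward before group-relative advantages are computed.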
A note on technical claims (in simple terms)
- Variance reduction: Training is less “noisy,” so updates to the model are more reliable.
- Improved convergence: The model learns faster and more steadily.
What did they find?
The authors tested their method on five math benchmarks (like MATH-500, AIME 2025, OlympiadBench, AMC-23, GPQA) and across several model families (Qwen2.5-7B, Llama 3.1-8B, Mistral 7B).
Key results:
- On Qwen2.5-7B, their method increased accuracy by about 8.1% and cut response length by about 19.9% compared to a strong baseline (GRPO) that rewards only correctness.
- On Llama and Mistral models, it also reduced tokens and often improved accuracy.
- Compared to other “length penalty” methods (like cosine scaling or Kimi 1.5’s length reward), their approach gave a better balance — it reduced length without hurting accuracy as much, and often boosted accuracy.
- Ablation tests (turning off parts of the method) showed the dependency “only reward conciseness when correct” is crucial. Removing it led to worse performance.
A simple example they showed:
- All methods got a math inequality right, but the basic method (GRPO) wrote a lot of unnecessary steps.
- A length-penalized method sometimes just printed the final answer with no reasoning (reward hacking).
- Their method kept reasoning short but still logical, striking a better balance.
Why does this matter?
- Faster and cheaper AI: Shorter answers mean fewer tokens, less compute, lower costs, and quicker responses — especially helpful for large-scale use.
- Better reasoning quality: Encouraging concise, relevant steps trains models to think more clearly, not just more verbosely.
- Safer training: Their reward design avoids common failure modes like length or training collapse, making reinforcement learning more stable.
- Broad usefulness: While they tested on math, the idea of “reward correctness first, then conciseness” could apply to other tasks where clear, efficient reasoning is important (like coding, science answers, or step-by-step explanations).
In short, this paper shows a practical way to train AI to be both smart and efficient: get the right answer, and explain it cleanly.
Knowledge Gaps
Below is a single, concise list of the paper’s unresolved knowledge gaps, limitations, and open questions. Each item is phrased to be concrete and actionable for future work.
- Data/CRM: The Conciseness Reward Model (CRM) is trained solely on math (DeepMath-103K); it is unknown whether CRM scores transfer to other domains (coding, scientific reasoning, step-by-step proofs), other languages, or multimodal settings.
- Data/CRM: CRM labels are produced by LLMs (Qwen2.5-72B) rather than humans; the paper does not quantify label noise, bias, or inter-rater reliability, nor provide calibration of the conciseness scale (0.1–1) against human judgment.
- Data/CRM: The paper does not test robustness of CRM to adversarially concise-yet-misleading reasoning (e.g., omitted critical steps, handwavy jumps), or to stylistic variants (equations-only vs prose, tables, structured proofs).
- Data/CRM: No analysis of CRM error modes (false positives/negatives on conciseness), nor correlation between CRM score and actual length, redundancy, repetition, or reasoning quality across difficulty levels.
- Reward design: The CRF applies conciseness reward only when the outcome is correct; the paper does not explore shaping for partially correct or near-miss reasoning, intermediate verification, or step-level correctness signals to avoid discouraging exploration on hard problems.
- Reward design: The difficulty coefficient d_q uses in-batch group accuracy as a proxy for question difficulty; there is no validation of its reliability, sensitivity to group size G, or comparison to external difficulty estimators.
- Reward design: The annealing schedule s=exp(-step/T) is fixed and heuristic; the paper does not study its sensitivity, alternative schedules, or adaptive annealing tied to training dynamics (e.g., variance, entropy, or success rate).
- Reward design: CRF provides only positive reward; the impact of adding carefully designed negative rewards for incorrect but overlong or reward-hacking responses is untested.
- Theory: Variance reduction and improved convergence are proved under assumptions (e.g., Cov(R_i^o, c_i) > 0, unbiased gradient estimates, smoothness); the paper does not empirically verify these assumptions (e.g., by measuring the correlation or the gradient variance; a small measurement sketch follows this list) or analyze failure when the assumptions break (negative or zero covariance).
- Theory: The analysis abstracts away GRPO-specific clipping and sets KL penalty β=0; it remains unclear how CRF behaves with realistic PPO/GRPO settings (nonzero β, clipping effects), and whether theoretical guarantees extend to these regimes.
- Failure modes: While CRF is proposed to mitigate length/training collapse, there is no systematic evaluation of CRF’s own failure modes (e.g., over-pruning legitimate steps on hard problems, hidden reward gaming against CRM, mode collapse under extreme α, s, d_q).
- Generalization: The method is only evaluated on small to mid-size open models (Qwen2.5-7B, Llama3.1-8B, Mistral-7B); scalability to larger LRMs (e.g., 70B+) and to proprietary o1/R1-style systems is untested.
- Generalization: The paper asserts compatibility with other RL algorithms (e.g., DAPO) but provides no empirical results across diverse RLVR/RLHF methods (PPO variants, RLAIF, off-policy methods, step-wise rewards).
- Metrics: Efficiency is proxied by output tokens only; there is no measurement of wall-clock latency, total compute, memory, throughput, or energy, nor analysis of the trade-off with sampling strategies (temperature, Pass@k).
- Metrics: The evaluation relies mainly on Pass@1 and token length; more comprehensive metrics (process correctness, logical soundness, error localization, faithfulness of reasoning, human preference) are missing.
- Baselines: Key efficient reasoning baselines (ALP, LC-R1, other length-penalty methods) are discussed but not consistently benchmarked head-to-head across all datasets and backbones; reproducible comparisons and unified protocols are lacking.
- Hyperparameters: Sensitivity analyses are limited (mostly α); the paper does not explore sensitivity to group size G, max length, CRM backbone size, reward scaling, or the difficulty/annealing functional forms.
- Cold-start: For Llama/Mistral, CRF is used only after SFT due to cold-start issues; it remains open whether CRF can enable zero-shot RLVR training or reduce the amount of required SFT.
- Safety/robustness: No assessment of unintended behaviors under CRF (e.g., hallucinations, unsafe content, shortcutting via spurious correlations, deterioration in non-math capabilities), or whether increased conciseness harms transparency/explainability.
- Reproducibility: Some implementation details (exact prompts, appendices, training seeds, group sizes, reward scaling) are not fully reported in the main text; rigorous reproducibility and formal data release with licensing are required.
- CRM at inference: The paper does not investigate leveraging CRM at inference time (reranking, early-exit, stop-when-confident), nor compare training-time vs inference-time gains for efficiency.
- Process-level rewards: The approach scores whole-answer conciseness; it leaves unexplored step-level conciseness rewards, local redundancy detection, or structured reasoning-format constraints that could provide finer-grained guidance.
- Domain shift: The difficulty-aware length adjustment is motivated by DeepMath statistics; its effectiveness under domain shift (different benchmarks, task distributions, or non-verifiable tasks) is unknown.
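As a starting point for the empirical check suggested by the theory items above, the sketch below estimates the covariance between logged outcome rewards and CRM conciseness scores. The rollout data and layout are hypothetical; a clearly positive estimate would be consistent with the Cov(R_i^o, c_i) > 0 assumption, while a value near zero or negative would flag the failure mode noted above.

```python
import statistics


def covariance(xs, ys):
    """Sample covariance between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Hypothetical logged rollouts: (binary outcome reward, CRM conciseness score).
rollouts = [(1.0, 0.82), (0.0, 0.35), (1.0, 0.74), (1.0, 0.61), (0.0, 0.48)]

outcomes = [r for r, _ in rollouts]
concise = [c for _, c in rollouts]

# Positive values support the variance-reduction assumption; near-zero or
# negative values indicate the covariance assumption does not hold.
print(f"Cov(R^o, c) = {covariance(outcomes, concise):.3f}")
```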
Practical Applications
Overview
This paper introduces a training pipeline and reward design that make large reasoning models (LRMs) both more accurate and more concise. It trains a Conciseness Reward Model (CRM) to score reasoning paths and proposes a Conciseness Reward Function (CRF) that applies conciseness rewards only when answers are correct. Empirically, the method reduces token usage while improving accuracy; theoretically, it reduces gradient variance and improves convergence. Below are practical applications organized by timeframe, including sector links, potential tools/workflows, and feasibility considerations.
Immediate Applications
The following applications can be deployed now with modest engineering effort, using available RLVR tooling (e.g., GRPO/DAPO), existing LLM backbones (Qwen, Llama, Mistral), and the CRM/CRF designs described in the paper.
- Industry (AI/Software Engineering)
- Use CRF in RLVR post-training for math, coding, and tasks with verifiable outcomes (e.g., unit-testable code generation), to improve accuracy while cutting response tokens and latency.
- Sector: software
- Tools/workflows: integrate CRM as a module in existing RL pipelines (GRPO/DAPO); adopt difficulty-aware and annealed conciseness weighting; track Pass@k and tokens in training dashboards; deploy shorter CoT models for production.
- Dependencies/assumptions: availability of verifiable rewards (tests/validators); CRM is trained on domain-appropriate data; assume positive correlation between correctness and conciseness; compute resources for RL.
- Cost and throughput savings for model-serving platforms by deploying concise reasoning variants for math/coding assistance and structured tasks.
- Sector: software, energy
- Tools/products: “token-budget governor” in inference; model routing to concise variants; capacity planning based on token reductions (~20–40%).
- Dependencies/assumptions: stable performance across workloads; alignment with product UX (concise vs. explanatory); monitoring for failure modes (length collapse/training collapse).
- Faster, more stable RL training jobs due to variance reduction and improved convergence constants.
- Sector: software (ML platform)
- Tools/workflows: reduced wall-clock and GPU-hours; tighter learning-rate schedules enabled by lower gradient variance.
- Dependencies/assumptions: CRF’s variance reduction holds in the target domain; proper group sampling (G) and batch design; reward signal quality.
- Concise developer assistants and code review bots that reward brevity only on correct outputs (tests pass); a verifier-gated sketch is given after this list.
- Sector: software
- Tools/products: CI-integrated assistants; concise rationales; guardrails that incentivize short explanations when test suites succeed.
- Dependencies/assumptions: robust unit/integration tests; domain-specific CRM fine-tuned on code reasoning; careful tuning of α (conciseness weight).
- Customer support bots that solve ticket workflows (with verifiable resolution states) and provide minimal but sufficient steps.
- Sector: software (customer support), finance (operations), education (helpdesk)
- Tools/workflows: resolution-state checkers as verifiers; CRM-guided training for concise troubleshooting scripts.
- Dependencies/assumptions: clear verifiable endpoints (issue resolved/knowledge article matched); CRM tuned to support narratives without losing crucial steps.
- Academia
- A reproducible method to study and mitigate “overthinking” in reasoning models, replacing crude length penalties with a learned conciseness model and correctness-dependent rewards.
- Sector: education, research
- Tools/workflows: open datasets (DeepMath), CRM training with synthetic labels, ablation frameworks to test collapse phenomena; benchmarking token efficiency alongside accuracy.
- Dependencies/assumptions: access to sufficiently strong teacher models for synthetic labeling; domain decontamination; reproducibility standards.
- A template for building new reward models for other preferences (clarity, safety, faithfulness), plugging into CRF-like correctness-dependent reward chaining.
- Sector: research
- Tools/products: “AutoRM builder” pipeline; plug-in reward functions for RLVR frameworks.
- Dependencies/assumptions: availability of preference data or high-quality synthetic labeling; reliable verifiers for the base outcome reward.
- Policy and Governance (near-term operations)
- Compute/energy efficiency reporting for reasoning systems using token-length and accuracy metrics; operational policies that favor CRF-trained models for verifiable tasks.
- Sector: policy, energy
- Tools/workflows: dashboards tracking accuracy vs. tokens; procurement checklists requiring concise inference where verifiable rewards exist.
- Dependencies/assumptions: well-defined KPIs; cross-team buy-in; no degradation in safety or coverage due to shorter reasoning.
- Daily Life
- Concise math tutors and homework helpers that still show necessary steps, tuned by difficulty-aware length penalties.
- Sector: education
- Tools/products: study assistants with adjustable verbosity; settings for step granularity by problem difficulty.
- Dependencies/assumptions: CRM generalizes to the curriculum; difficulty estimation works in non-exam contexts; UX controls for verbosity.
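To make the "reward brevity only when the tests pass" pattern from the developer-assistant bullet above concrete, here is a hedged sketch of a verifier-gated reward for code tasks. It assumes pytest is installed and usable as the outcome verifier, and `score_conciseness` again stands in for a (code-adapted) CRM; none of the names reflect the paper's implementation.

```python
import subprocess


def tests_pass(test_dir: str) -> bool:
    """Verifiable outcome check: run the project's test suite with pytest.

    Exit code 0 means all tests passed; any runner with a reliable
    pass/fail signal would work equally well.
    """
    result = subprocess.run(["pytest", "-q", test_dir], capture_output=True)
    return result.returncode == 0


def score_conciseness(explanation: str) -> float:
    """Placeholder for a code-adapted conciseness reward model (CRM)."""
    return 0.5


def review_reward(explanation: str, test_dir: str, alpha: float = 0.3) -> float:
    """Correctness-gated reward: brevity only pays off once the tests pass."""
    if not tests_pass(test_dir):
        return 0.0  # no conciseness bonus for incorrect output
    return 1.0 + alpha * score_conciseness(explanation)
```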
Long-Term Applications
These require further research, scaling, domain adaptation, or regulatory frameworks before broad deployment.
- Cross-Domain Conciseness Reward Models (beyond math)
- Train CRMs for medicine, law, finance, and scientific writing, scoring redundancy and relevance in discipline-specific reasoning.
- Sectors: healthcare, legal, finance, education, research
- Tools/products: domain-specific CRMs; correctness-dependent CRF calibrated to verifiers (e.g., clinical decision support rules, legal citation checkers, backtesting validators).
- Dependencies/assumptions: strong, reliable verifiable reward signals per domain; expert-labeled or high-quality synthetic preference data; rigorous safety validation.
- Multimodal planning conciseness (e.g., robotics task plans), rewarding short, correct action sequences.
- Sector: robotics
- Tools/workflows: plan verifiers (task completion detectors); CRM for plan succinctness; CRF applied only when tasks are successfully executed in sim/hardware.
- Dependencies/assumptions: robust task verifiers; accurate simulators; transfer from simulation to real-world; safety constraints.
- Regulated and Safety-Critical Systems
- Clinical and legal decision support that maintains necessary justification while minimizing extraneous steps; configurable verbosity to meet documentation standards.
- Sectors: healthcare, legal
- Tools/products: dual-mode outputs (concise for operations, full justification for audit); CRF tuned to reward sufficiency thresholds, not just minimality.
- Dependencies/assumptions: audited verifiers; compliance requirements satisfied; human-in-the-loop oversight; risk management for omission errors.
- Finance and Trading Analytics
- Strategy generation and risk analysis with verifiable backtest rewards; encourage shorter, more direct analysis narratives once correctness (profitability) is established.
- Sector: finance
- Tools/workflows: backtesting verifiers; CRM tuned to financial reasoning; token-efficient reporting pipelines.
- Dependencies/assumptions: robust, non-leaky backtest frameworks; prevention of reward hacking; regulatory compliance for model use.
- Policy and Standards
- Efficiency-first AI governance: standards that require reporting token efficiency and accuracy under verifiable tasks; incentives for CRF-like reward chaining that reduces reward hacking.
- Sector: policy, energy
- Tools/workflows: benchmarking protocols (Pass@k, tokens, collapse detection); carbon-aware compute accounting; procurement language endorsing correctness-dependent rewards.
- Dependencies/assumptions: consensus on metrics; independent audits; sector-specific verifiers (e.g., coding tests, math solvers).
- Tooling and Infrastructure
- RLVR toolkits with plug-and-play CRF, difficulty estimation, and annealing schedulers; monitoring for length/training collapse and variance.
- Sector: software (ML platforms)
- Tools/products: CRF plugins for major RL frameworks; “Collapse Detector” monitors; training orchestration for multi-domain CRMs.
- Dependencies/assumptions: widespread framework support; evolving best practices for hyperparameter tuning; standardized logs.
- Education at Scale
- Curriculum-aware conciseness controllers: adaptive verbosity based on student proficiency and problem difficulty, with verifiable answer checks (auto-graders).
- Sector: education
- Tools/workflows: difficulty estimators; auto-graders as verifiers; personalized verbosity profiles across topics.
- Dependencies/assumptions: reliable proficiency models; coverage of verifiable tasks; fairness and accessibility concerns addressed.
Key Assumptions and Dependencies Across Applications
- Verifiable reward signals exist and are accurate for the target task (e.g., unit tests for code, auto-graders for math, validated rules for clinical/legal contexts).
- Conciseness correlates positively with correctness in the target domain (Cov(R, c) > 0); if not, CRM must be retrained or retuned.
- CRM generalization beyond math requires domain-specific datasets, labeling protocols, and potentially human expert oversight.
- Synthetic labeling relies on strong teacher models; quality control and decontamination are necessary to prevent bias and leakage.
- Proper hyperparameter tuning (e.g., α, annealing schedules, difficulty coefficients) and monitoring for reward hacking and collapse phenomena are essential.
- Shorter reasoning may trade off transparency; products must provide configurable verbosity and audit modes to satisfy safety/compliance.
- Compute availability for RL post-training and data constraints (licensing, privacy) may limit immediate adoption in some settings.
Glossary
- Ada-GRPO: An adaptive variant of GRPO used to train reasoning models. "ARM~\citep{wu2025arm} proposes a model trained with Ada-GRPO with a format diversity reward to support adaptive reasoning formats."
- Advantage: A scalar that measures how much better an action is compared to a baseline, used to weight policy gradients. "where $\epsilon$ is a clipping-related hyper-parameter to stabilize training and $\hat{A}_i$ denotes the advantage."
- ALP: A method that applies adaptive length penalties during RL to control reasoning length. "using ALP~\citep{xiang2025just} in Fig.~\ref{fig:analysis}(b)"
- Annealing coefficient: A time-decaying scalar used to gradually reduce the weight of a term during training. "an annealing coefficient $s$ is introduced that decays with increasing training steps, where $\mathrm{step}$ and $T$ denote the current training step and the total number of training steps, respectively."
- Binary verifiable reward: A reward that returns 1 for a correct answer and 0 otherwise under verifiable checking. "The reward function is usually set as binary verifiable reward considering the correctness of answer:"
- Cardinality: The number of elements in a set. "where $|\cdot|$ denotes the cardinality of a set and this coefficient ranges from 1 to e."
- Clipping (policy optimization): A technique that limits the change in policy ratios to stabilize training. "where $\epsilon$ is a clipping-related hyper-parameter to stabilize training"
- Conciseness Reward Function (CRF): A reward function that applies conciseness scores only when answers are correct. "we introduce a novel reward formulation named Conciseness Reward Function (CRF) with explicit dependency between the outcome reward and conciseness score,"
- Conciseness Reward Model (CRM): A reward model trained to score how concise a reasoning path is. "we propose a pipeline for training a Conciseness Reward Model (CRM) that scores the conciseness of reasoning path."
- Convergence rate: The speed at which an optimization method approaches a stationary point. "achieves a convergence rate of $\mathcal{O}(1/\sqrt{T})$ with improved constants:"
- Cosine scale reward: A length-shaping reward based on a cosine schedule to control CoT length. "Cos~\citep{yeo2025demystifying} denotes the cosine scale reward, which is proposed to stabilize the length scaling of RL training and control CoT length."
- DAPO: A state-of-the-art RL algorithm for LLM post-training; the proposed method is compatible with it. "it can also be easily applied in different state-of-the-art RL algorithms like DAPO~\citep{yu2025dapo}."
- Decontamination procedure: Dataset cleaning to remove training–evaluation overlap. "It undergoes decontamination procedure."
- Difficulty coefficient: A factor that scales reward by estimated problem difficulty. "calculate a difficulty coefficient $d_q$ for each question accordingly:"
- Group Relative Policy Optimization (GRPO): An RL algorithm that optimizes group-sampled responses with relative advantages; a minimal sketch of the advantage computation appears after this glossary. "In this section, we provide the background of Group Relative Policy Optimization (GRPO)~\citep{shao2024deepseekmath}, which is a representative reinforcement learning algorithm in reinforcement learning with verifiable rewards (RLVR)."
- KL-divergence penalty: A regularizer penalizing deviation from a reference policy using KL divergence. "The coefficient of the KL-divergence penalty is set to 0 in this paper for simplicity in optimization,"
- Kimi 1.5 reward: A reward that adds a length term to the outcome reward to improve token efficiency. "Kimi~\citep{team2025kimi} represents the Kimi 1.5 reward which adds a length reward to the original outcome reward to improve token efficiency."
- Large reasoning models (LRMs): LLMs optimized (often with RL) for multi-step reasoning. "enabling the development of large reasoning models (LRMs)."
- LC-R1: A length-compression approach for large reasoning models. "cosine scale reward~\citep{yeo2025demystifying}, ALP~\citep{xiang2025just}, and LC-R1~\citep{cheng2025optimizing}."
- Length collapse: A failure mode where response length collapses while reward appears to improve. "We refer to this phenomenon as `length collapse',"
- Over-thinking: Producing excessively verbose, redundant reasoning steps. "This `over-thinking'~\citep{chen2024not} issue introduces significantly heavy computational overhead."
- Pass@k: An evaluation metric that checks whether at least one of k sampled responses is correct. "the Pass@k metric is adopted,"
- Policy model: The probabilistic mapping from prompts to token sequences parameterized by θ. "the LLM with parameters $\theta$ is denoted as a policy model $\pi_\theta$."
- Reinforcement learning from human feedback (RLHF): Training that uses human preference data to learn a reward model or policy. "the most common way is to train a reward model from the data annotated by human in reinforcement learning from human feedback (RLHF)."
- Reinforcement learning with verifiable rewards (RLVR): RL that uses rule-verifiable signals (e.g., answer correctness) as rewards. "Reinforcement learning with verifiable rewards (RLVR) has been shown to enhance the reasoning capabilities of LLMs,"
- Reward hacking: Exploiting the reward function to get high reward without truly solving the task. "We refer to this phenomenon as `length collapse', which indicates reward hacking~\citep{skalse2022defining}."
- R1-like training: A two-stage regimen: SFT followed by RL (as popularized by R1-style systems). "Therefore, for these two backbones the R1-like training is adopted,"
- Stationary point: A parameter setting where the gradient is zero. "where $T$ is the total number of training steps and $\theta^*$ is a stationary point."
- Stochastic gradient estimator: A random, unbiased estimate of the true gradient used in SGD-style updates. "exhibits reduced variance of the stochastic gradient estimator"
- Supervised fine-tuning (SFT): Supervised fine-tuning of LLMs on input–output pairs. "The training set is leveraged to train the conciseness reward model through supervised fine-tuning (SFT)"
- Variance reduction: Techniques or properties that lower gradient estimator variance, improving stability. "Proposition 1. Variance Reduction: Under the assumption that the conciseness reward is positively correlated with the outcome reward, i.e., $\mathrm{Cov}(R_i^o, c_i) > 0$"
- Zero-R1: Direct RL training of a base model without prior SFT. "directly conduct RL training (same as Zero-R1~\citep{liu2025understanding})."
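As a companion to the GRPO glossary entry, the sketch below shows the standard group-relative advantage computation: each sampled response's reward is normalized against the mean and standard deviation of its own group. This is a generic illustration rather than the paper's exact code; the epsilon term guarding against zero variance is an implementation convenience.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each sampled response's reward within its own group.

    A_i = (r_i - mean(r)) / (std(r) + eps), so responses are compared
    against their group rather than an external value baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Example: a group of G = 4 sampled responses scored by a CRF-style reward.
print(group_relative_advantages([1.3, 0.0, 1.1, 0.0]))
```

In the paper's pipeline, the CRF scalar described earlier would supply the per-response rewards that feed into this normalization.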