Effective test-time scaling for large language model reasoning

Determine principled and effective strategies for test-time scaling—i.e., increasing inference-time compute during generation—to improve reasoning performance in large language models across tasks.

Background

The paper discusses recent advances such as OpenAI's o1 models, which improved reasoning by lengthening the Chain-of-Thought at inference time (inference-time or test-time scaling). Despite these gains, the authors note that no broadly effective and reliable recipe for test-time scaling has yet been identified.

They survey related approaches (process-based reward models, reinforcement learning, and search methods like Monte Carlo Tree Search and Beam Search) and report that none yet achieve the general reasoning performance of the o1 series, motivating their reinforcement-learning-based exploration with DeepSeek-R1 and DeepSeek-R1-Zero.
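As a concrete illustration of one such test-time scaling strategy (not the paper's own method), the sketch below shows best-of-N sampling: spend extra inference compute by drawing several candidate solutions, then keep the one a reward model scores highest. The `score` function here is a hypothetical stand-in for a trained process- or outcome-based reward model.

```python
def score(answer: str) -> float:
    # Hypothetical stand-in for a learned reward model; a real system
    # would score reasoning quality. For illustration, prefer shorter answers.
    return -float(len(answer))

def best_of_n(candidates: list[str]) -> str:
    """Best-of-N selection: more samples = more inference-time compute,
    traded for a better chance that one candidate is correct."""
    return max(candidates, key=score)

# Pretend these came from N independent samples for the same prompt.
samples = ["42", "the answer is 42", "forty-two (42)"]
print(best_of_n(samples))
```

Search-based methods like Beam Search or Monte Carlo Tree Search generalize this idea by scoring and pruning partial reasoning steps rather than only complete answers.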

References

However, the challenge of effective test-time scaling remains an open question for the research community.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (arXiv:2501.12948, DeepSeek-AI et al., 22 Jan 2025), Section 1, Introduction.