- The paper presents a comprehensive survey of post-training techniques for LLMs, categorizing them into fine-tuning, reinforcement learning, and test-time scaling aimed at improving reasoning abilities.
- Post-training refines LLMs beyond pre-training, focusing on enhancing reasoning, factual accuracy, and alignment with complex tasks and user intents.
- Key techniques discussed include parameter-efficient fine-tuning (like LoRA), various RL algorithms (such as DPO), and advanced scaling methods (like Chain-of-Thought and Tree-of-Thought).
The survey paper presents a comprehensive analysis of post-training methodologies for LLMs, categorizing these techniques into fine-tuning, Reinforcement Learning (RL), and test-time scaling. It addresses the shift in focus from pre-training on extensive datasets to refining LLMs through post-training to enhance reasoning, factual accuracy, and alignment with user intents and ethical standards.
The paper highlights the significance of fine-tuning in adapting pre-trained LLMs to specific tasks, using parameter-efficient techniques like Low-Rank Adaptation (LoRA) [hu2021lora] and adapters to mitigate overfitting and computational costs. While fine-tuning improves performance in sentiment analysis, question answering, and medical diagnosis, it faces challenges such as poor out-of-domain generalization.
It explores RL's role in refining LLMs, optimizing sequential decision-making using dynamic feedback, and improving adaptability. The paper contrasts conventional RL with RL in LLMs, emphasizing the complexities of vast vocabulary selection, state representation as growing text sequences, and reliance on subjective feedback. It also touches on hybrid approaches that combine process-based rewards (e.g., chain-of-thought reasoning) and outcome-based evaluations to refine learning.
Scaling techniques are crucial for enhancing LLM performance and efficiency. The survey discusses methods like Chain-of-Thought (CoT) [wei2022chain] reasoning and Tree-of-Thought (ToT) [yao2024tree] frameworks, which improve multi-step reasoning by decomposing complex problems into sequential steps. It also addresses the use of Low-Rank Adaptation (LoRA) [hu2021lora], adapters, and Retrieval-Augmented Generation (RAG) [asai2023self, gao2023retrieval] to improve computational efficiency and factual accuracy. The survey also discusses test-time scaling, which dynamically adjusts parameters based on task complexity.
The paper contrasts its approach with prior surveys, noting that earlier works often focus on specific aspects of RL and LLMs, such as Reinforcement Learning from Human Feedback (RLHF) [ouyang2022training] and Direct Preference Optimization (DPO) [rafailov2024direct], but do not adequately explore fine-tuning, scaling, and benchmarks. Unlike other surveys that classify LLM functionalities in traditional RL tasks, this survey provides a structured overview of combining fine-tuning, RL, and scaling.
The authors explain that LLMs are trained to predict the next token in a sequence using Maximum Likelihood Estimation (MLE), which maximizes the probability of generating the correct sequence given an input:
$\mathcal{L}_{\mathrm{MLE}} = \sum_{t=1}^{T} \log P_\theta(y_t \mid y_{<t}, X)$
where:
- $X$ represents the input, such as a prompt or context
- $Y = (y_1, y_2, \dots, y_T)$ is the corresponding target sequence
- $P_\theta(y_t \mid y_{<t}, X)$ denotes the model's predicted probability for token $y_t$, given the preceding tokens
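As a concrete illustration of the MLE objective above, here is a minimal sketch assuming a PyTorch setting; the tensor shapes and the stand-in `logits` are illustrative placeholders, not any particular model's API:

```python
import torch
import torch.nn.functional as F

# Sketch of the MLE objective: sum log P_theta(y_t | y_<t, X) over the target
# sequence. `logits` stands in for the per-position outputs of an autoregressive
# LM already conditioned on the prompt X and the preceding tokens.
def mle_log_likelihood(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (T, vocab_size); targets: (T,) ground-truth token ids y_1..y_T
    log_probs = F.log_softmax(logits, dim=-1)                 # log P_theta(. | y_<t, X)
    token_log_probs = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return token_log_probs.sum()                              # L_MLE; training minimizes its negation

# Usage with random placeholders standing in for real model outputs:
logits = torch.randn(5, 32000)                # T = 5 positions, 32k-token vocabulary
targets = torch.randint(0, 32000, (5,))
loss = -mle_log_likelihood(logits, targets)   # negative log-likelihood to minimize
```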
The paper details how the autoregressive text generation process of LLMs can be modeled as a sequential decision-making problem within a Markov Decision Process (MDP). The state $s_t$ represents the sequence of tokens generated so far, the action $a_t$ is the next token, and a reward $R(s_t, a_t)$ evaluates the quality of the output. The LLM's policy $\pi_\theta$ is optimized to maximize the expected return:
$J(\pi_\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$
where:
- $\gamma$ is the discount factor that determines how strongly future rewards influence current decisions
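The expected return above can be made concrete with a short sketch; the reward list below is a placeholder standing in for $R(s_t, a_t)$, e.g. a sparse sequence-level reward granted only at the final token:

```python
# Sketch of the discounted return G = sum_t gamma^t * R(s_t, a_t) that the
# policy pi_theta is trained to maximize in expectation.
def discounted_return(rewards: list[float], gamma: float = 0.95) -> float:
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 0.0, 1.0]))  # 0.95**3 = 0.857375
```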
The paper also covers policy gradient methods, including REINFORCE [nguyen2017reinforcement], Curriculum Learning with MIXER, Self-Critical Sequence Training (SCST) and Advantage Actor-Critic (A2C/A3C) algorithms.
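To illustrate the family of policy gradient estimators listed above, here is a minimal REINFORCE-style loss for one sampled sequence; `log_probs`, `sequence_return`, and the optional `baseline` are assumed rollout quantities rather than outputs of a specific library:

```python
import torch

# Sketch of a REINFORCE-style loss: scale the summed log-probabilities of the
# sampled tokens by a (baseline-subtracted) return. SCST, for example, uses the
# reward of a greedy decode as the baseline.
def reinforce_loss(log_probs: torch.Tensor, sequence_return: float, baseline: float = 0.0) -> torch.Tensor:
    advantage = sequence_return - baseline          # how much better than the baseline the sample did
    return -(log_probs.sum() * advantage)           # negate so gradient descent performs ascent on return
```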
It emphasizes that integrating RL into LLM reasoning involves Supervised Fine-Tuning (SFT), Reward Model (RM) training, and RL fine-tuning. Early approaches used Proximal Policy Optimization (PPO) [schulman2017proximal] and Trust Region Policy Optimization (TRPO) [schulman2017trustregionpolicyoptimization] to optimize policies while constraining policy updates. The survey discusses improved alternatives, such as Direct Preference Optimization (DPO) [rafailov2024direct] and Group Relative Policy Optimization (GRPO) [yang2024qwen2, shao2024deepseekmath], which reformulate the alignment objective as a ranking-based loss function.
The paper details reward modeling, explaining that the probability of $y_j$ being preferred over $y_k$ under the Bradley-Terry model is:
$P(y_j \succ y_k \mid x; \theta) = \frac{\exp\left(R_\theta(x, y_j)\right)}{\exp\left(R_\theta(x, y_j)\right) + \exp\left(R_\theta(x, y_k)\right)}$
It further discusses the Plackett-Luce model for full or partial rankings of $m$ responses.
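A minimal numerical sketch of the Bradley-Terry preference probability above; the two scores are placeholders standing in for the reward-model outputs $R_\theta(x, y_j)$ and $R_\theta(x, y_k)$:

```python
import math

# Sketch of the Bradley-Terry preference probability P(y_j > y_k | x)
# computed from two scalar reward-model scores.
def bradley_terry_prob(reward_j: float, reward_k: float) -> float:
    return math.exp(reward_j) / (math.exp(reward_j) + math.exp(reward_k))

print(bradley_terry_prob(1.2, 0.4))  # ~0.69: y_j preferred about 69% of the time
```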
The survey dives into various policy optimization methods. It presents Odds Ratio Preference Optimization (ORPO), where the probability of $y_j$ being preferred over $y_k$ is:
$P_\phi(y_j \succ y_k \mid x) = \sigma\!\left(\ln \frac{\pi_\phi(y_j \mid x)}{\pi_\phi(y_k \mid x)}\right) = \frac{1}{1 + \exp\!\left(\ln \frac{\pi_\phi(y_k \mid x)}{\pi_\phi(y_j \mid x)}\right)}$
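A minimal sketch of the preference probability above, i.e. the sigmoid of the log-ratio of the policy's probabilities for the two responses; the numbers are placeholders standing in for $\pi_\phi(y_j \mid x)$ and $\pi_\phi(y_k \mid x)$:

```python
import math

# Sketch of the pairwise preference probability: sigmoid of the log-ratio of
# the policy's probabilities assigned to the two candidate responses.
def pairwise_pref_prob(p_j: float, p_k: float) -> float:
    return 1.0 / (1.0 + math.exp(-(math.log(p_j) - math.log(p_k))))

print(pairwise_pref_prob(0.02, 0.005))  # 0.8: y_j is 4x as likely, so preferred with probability 4/5
```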
It describes PPO and Reinforcement Learning from Human Feedback (RLHF) [ouyang2022training], detailing that the clipped PPO objective is:
$\mathcal{L}^{\mathrm{PPO}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) A_t\right)\right]$
where:
- $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{ref}}}(a_t \mid s_t)}$ denotes the probability ratio for an action $a_t$ in state $s_t$
- $A_t$ is an estimator of the advantage function
- $\epsilon$ is a hyperparameter controlling the allowable deviation from the previous policy
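The clipped objective above can be sketched as follows; the log-probabilities and advantage estimates are assumed rollout quantities and are not tied to any particular RLHF stack:

```python
import torch

# Sketch of the clipped PPO objective over a batch of actions (tokens).
def ppo_clip_loss(log_probs: torch.Tensor,
                  old_log_probs: torch.Tensor,
                  advantages: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(log_probs - old_log_probs)                    # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                    # negate so minimizing maximizes the objective
```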
Also discussed are Reinforcement Learning from AI Feedback (RLAIF) [lee2023rlaif], TRPO, and DPO, for which the DPO loss function is:
$\mathcal{L}^{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x, y^+, y^-) \sim \mathcal{D}_{\mathrm{train}}} \Bigl[ \log \sigma\!\Bigl( \beta \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\mathrm{ref}}(y^+ \mid x)} - \beta \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\mathrm{ref}}(y^- \mid x)} \Bigr) \Bigr]$
where:
- $\pi_\theta$ is the learnable policy
- $\pi_{\mathrm{ref}}$ is a reference policy
- $\sigma(\cdot)$ is the sigmoid function
- $\beta$ is a scaling parameter
- $\mathcal{D}_{\mathrm{train}}$ is a dataset of triplets $(x, y^+, y^-)$ where $y^+$ is the preferred output over $y^-$
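A minimal sketch of the DPO loss above for one batch of triplets; the four inputs are assumed to be the summed log-probabilities of each response under the trainable policy $\pi_\theta$ and the frozen reference policy $\pi_{\mathrm{ref}}$:

```python
import torch
import torch.nn.functional as F

# Sketch of the DPO loss for a batch of (x, y+, y-) triplets. All four inputs
# are assumed (batch,) tensors of summed per-token log-probabilities.
def dpo_loss(policy_pos: torch.Tensor, policy_neg: torch.Tensor,
             ref_pos: torch.Tensor, ref_neg: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # beta * [log pi_theta(y+|x)/pi_ref(y+|x) - log pi_theta(y-|x)/pi_ref(y-|x)]
    margins = beta * ((policy_pos - ref_pos) - (policy_neg - ref_neg))
    return -F.logsigmoid(margins).mean()
```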
The paper also discusses Offline Reasoning Optimization (OREO) [oreo] and GRPO.
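As a rough sketch of the group-relative idea behind GRPO, assuming a group of sampled responses per prompt scored by a reward model (not the paper's exact formulation):

```python
import torch

# Sketch of the group-relative signal used in GRPO-style training: sample a
# group of responses to the same prompt, score each with the reward model, and
# normalize rewards within the group instead of learning a separate value function.
def group_relative_advantages(group_rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # group_rewards: (G,) scalar rewards for G responses to one prompt
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)
```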
Further, the paper summarizes various supervised fine-tuning methods for LLMs, including instruction fine-tuning, dialogue fine-tuning, CoT reasoning fine-tuning, and domain-specific fine-tuning. It explains distillation-based fine-tuning, where a student LLM is fine-tuned to reproduce both the final answer and the reasoning chain of a teacher LLM, as well as preference and alignment SFT. Also presented are efficient fine-tuning techniques, highlighting PEFT strategies like LoRA [hu2021lora] and adapters.
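To make the PEFT idea concrete, here is a minimal LoRA-style wrapper around a frozen linear layer, a sketch assuming PyTorch; the rank `r` and scaling `alpha` are illustrative hyperparameters:

```python
import torch
import torch.nn as nn

# Sketch of a LoRA-style adapter: the pretrained weight stays frozen and only the
# low-rank update (alpha / r) * B A is trained.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                               # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```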
The survey categorizes test-time scaling methods into parallel scaling, sequential scaling, and search-based methods. It details Beam Search, Best-of-N Search, Compute-Optimal Scaling and CoT prompting [wei2022chain]. It presents Self-Consistency Decoding, ToT framework [yao2024tree] and Graph of Thoughts. Also discussed are confidence-based sampling, search against verifiers, and self-improvement via refinements.
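As an example of the parallel-scaling idea, here is a minimal self-consistency sketch; `sample_answer` is a hypothetical stub standing in for sampling one chain-of-thought completion and extracting its final answer:

```python
from collections import Counter

# Sketch of self-consistency decoding: sample N completions for the same prompt
# and take a majority vote over their final answers.
def self_consistency(sample_answer, prompt: str, n: int = 16) -> str:
    answers = [sample_answer(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```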
The survey concludes by listing benchmarks for LLM post-training evaluation and discussing future research directions.