- The paper presents a comprehensive survey of post-training techniques for LLMs, categorizing them into fine-tuning, reinforcement learning, and test-time scaling aimed at improving reasoning abilities.
- Post-training refines LLMs beyond pre-training, focusing on enhancing reasoning, factual accuracy, and alignment with complex tasks and user intents.
- Key techniques discussed include parameter-efficient fine-tuning (like LoRA), various RL algorithms (such as DPO), and advanced scaling methods (like Chain-of-Thought and Tree-of-Thought).
The survey paper presents a comprehensive analysis of post-training methodologies for LLMs, categorizing these techniques into fine-tuning, Reinforcement Learning (RL), and test-time scaling. It addresses the shift in focus from pre-training on extensive datasets to refining LLMs through post-training to enhance reasoning, factual accuracy, and alignment with user intents and ethical standards.
The paper highlights the significance of fine-tuning in adapting pre-trained LLMs to specific tasks, using parameter-efficient techniques like Low-Rank Adaptation (LoRA) [hu2021lora] and adapters to mitigate overfitting and computational costs. While fine-tuning improves performance in sentiment analysis, question answering, and medical diagnosis, it faces challenges such as poor out-of-domain generalization.
It explores RL's role in refining LLMs, optimizing sequential decision-making using dynamic feedback, and improving adaptability. The paper contrasts conventional RL with RL in LLMs, emphasizing the complexities of vast vocabulary selection, state representation as growing text sequences, and reliance on subjective feedback. It also touches on hybrid approaches that combine process-based rewards (e.g., chain-of-thought reasoning) and outcome-based evaluations to refine learning.
Scaling techniques are crucial for enhancing LLM performance and efficiency. The survey discusses methods like Chain-of-Thought (CoT) [wei2022chain] reasoning and Tree-of-Thought (ToT) [yao2024tree] frameworks, which improve multi-step reasoning by decomposing complex problems into sequential steps. It also addresses the use of Low-Rank Adaptation (LoRA) [hu2021lora], adapters, and Retrieval-Augmented Generation (RAG) [asai2023self, gao2023retrieval] to improve computational efficiency and factual accuracy. The survey also discusses test-time scaling, which dynamically adjusts parameters based on task complexity.
The paper contrasts its approach with prior surveys, noting that earlier works often focus on specific aspects of RL and LLMs, such as Reinforcement Learning from Human Feedback (RLHF) [ouyang2022training] and Direct Preference Optimization (DPO) [rafailov2024direct], but do not adequately explore fine-tuning, scaling, and benchmarks. Unlike other surveys that classify LLM functionalities in traditional RL tasks, this survey provides a structured overview of combining fine-tuning, RL, and scaling.
The authors explain that LLMs are trained to predict the next token in a sequence using Maximum Likelihood Estimation (MLE), which maximizes the probability of generating the correct sequence given an input:
$\mathcal{L}_{\mathrm{MLE}} = \sum_{t=1}^{T} \log P_\theta(y_t \mid y_{<t}, X)$
where:
- $X$ represents the input, such as a prompt or context
- $Y = (y_1, y_2, \dots, y_T)$ is the corresponding target sequence
- $P_\theta(y_t \mid y_{<t}, X)$ denotes the model's predicted probability for token $y_t$, given the preceding tokens
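As a concrete illustration of the MLE objective above, here is a minimal sketch assuming a PyTorch setting; the tensor shapes and the stand-in `logits` are illustrative placeholders, not any particular model's API:

```python
import torch
import torch.nn.functional as F

# Sketch of the MLE objective: sum log P_theta(y_t | y_<t, X) over the target
# sequence. `logits` stands in for the per-position outputs of an autoregressive
# LM already conditioned on the prompt X and the preceding tokens.
def mle_log_likelihood(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (T, vocab_size); targets: (T,) ground-truth token ids y_1..y_T
    log_probs = F.log_softmax(logits, dim=-1)                 # log P_theta(. | y_<t, X)
    token_log_probs = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return token_log_probs.sum()                              # L_MLE; training minimizes its negation

# Usage with random placeholders standing in for real model outputs:
logits = torch.randn(5, 32000)                # T = 5 positions, 32k-token vocabulary
targets = torch.randint(0, 32000, (5,))
loss = -mle_log_likelihood(logits, targets)   # negative log-likelihood to minimize
```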
The paper details how the autoregressive text generation process of LLMs can be modeled as a sequential decision-making problem within a Markov Decision Process (MDP). The state $s_t$ represents the sequence of tokens generated so far, the action $a_t$ is the next token, and a reward $R(s_t, a_t)$ evaluates the quality of the output. The LLM's policy $\pi_\theta$ is optimized to maximize the expected return:
$J(\pi_\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$
where:
- $\gamma$ is the discount factor that determines how strongly future rewards influence current decisions
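The expected return above can be made concrete with a short sketch; the reward list below is a placeholder standing in for $R(s_t, a_t)$, e.g. a sparse sequence-level reward granted only at the final token:

```python
# Sketch of the discounted return G = sum_t gamma^t * R(s_t, a_t) that the
# policy pi_theta is trained to maximize in expectation.
def discounted_return(rewards: list[float], gamma: float = 0.95) -> float:
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([0.0, 0.0, 0.0, 1.0]))  # 0.95**3 = 0.857375
```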
The paper also covers policy gradient methods, including REINFORCE [nguyen2017reinforcement], Curriculum Learning with MIXER, Self-Critical Sequence Training (SCST) and Advantage Actor-Critic (A2C/A3C) algorithms.
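To illustrate the family of policy gradient estimators listed above, here is a minimal REINFORCE-style loss for one sampled sequence; `log_probs`, `sequence_return`, and the optional `baseline` are assumed rollout quantities rather than outputs of a specific library:

```python
import torch

# Sketch of a REINFORCE-style loss: scale the summed log-probabilities of the
# sampled tokens by a (baseline-subtracted) return. SCST, for example, uses the
# reward of a greedy decode as the baseline.
def reinforce_loss(log_probs: torch.Tensor, sequence_return: float, baseline: float = 0.0) -> torch.Tensor:
    advantage = sequence_return - baseline          # how much better than the baseline the sample did
    return -(log_probs.sum() * advantage)           # negate so gradient descent performs ascent on return
```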
It emphasizes that integrating RL into LLM reasoning involves Supervised Fine-Tuning (SFT), Reward Model (RM) training, and RL fine-tuning. Early approaches used Proximal Policy Optimization (PPO) [schulman2017proximal] and Trust Region Policy Optimization (TRPO) [schulman2017trustregionpolicyoptimization] to optimize policies while constraining policy updates. The survey discusses improved alternatives, such as Direct Preference Optimization (DPO) [rafailov2024direct] and Group Relative Policy Optimization (GRPO) [yang2024qwen2, shao2024deepseekmath], which reformulate the alignment objective as a ranking-based loss function.
The paper details reward modeling, explaining that the probability of $y_j$ being preferred over $y_k$ under the Bradley-Terry model is:
$P(y_j \succ y_k \mid x; \theta) = \frac{\exp\left(R_\theta(x, y_j)\right)}{\exp\left(R_\theta(x, y_j)\right) + \exp\left(R_\theta(x, y_k)\right)}$
It further discusses the Plackett-Luce model for full or partial rankings of $m$ responses.
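A minimal numerical sketch of the Bradley-Terry preference probability above; the two scores are placeholders standing in for the reward-model outputs $R_\theta(x, y_j)$ and $R_\theta(x, y_k)$:

```python
import math

# Sketch of the Bradley-Terry preference probability P(y_j > y_k | x)
# computed from two scalar reward-model scores.
def bradley_terry_prob(reward_j: float, reward_k: float) -> float:
    return math.exp(reward_j) / (math.exp(reward_j) + math.exp(reward_k))

print(bradley_terry_prob(1.2, 0.4))  # ~0.69: y_j preferred about 69% of the time
```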
The survey dives into various policy optimization methods. It presents Odds Ratio Preference Optimization (ORPO), where the probability of $y_j$ being preferred over $y_k$ is:
$P_\phi(y_j \succ y_k \mid x) = \sigma\!\left(\ln \frac{\pi_\phi(y_j \mid x)}{\pi_\phi(y_k \mid x)}\right) = \frac{1}{1 + \exp\!\left(\ln \frac{\pi_\phi(y_k \mid x)}{\pi_\phi(y_j \mid x)}\right)}$
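A minimal sketch of the preference probability above, i.e. the sigmoid of the log-ratio of the policy's probabilities for the two responses; the numbers are placeholders standing in for $\pi_\phi(y_j \mid x)$ and $\pi_\phi(y_k \mid x)$:

```python
import math

# Sketch of the pairwise preference probability: sigmoid of the log-ratio of
# the policy's probabilities assigned to the two candidate responses.
def pairwise_pref_prob(p_j: float, p_k: float) -> float:
    return 1.0 / (1.0 + math.exp(-(math.log(p_j) - math.log(p_k))))

print(pairwise_pref_prob(0.02, 0.005))  # 0.8: y_j is 4x as likely, so preferred with probability 4/5
```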
It describes PPO and Reinforcement Learning from Human Feedback (RLHF) [ouyang2022training], detailing that the clipped PPO objective is:
$\mathcal{L}^{\mathrm{PPO}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) A_t\right)\right]$
where:
- $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{ref}}}(a_t \mid s_t)}$ denotes the probability ratio for an action $a_t$ in state $s_t$
- $A_t$ is an estimator of the advantage function
- $\epsilon$ is a hyperparameter controlling the allowable deviation from the previous policy
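The clipped objective above can be sketched as follows; the log-probabilities and advantage estimates are assumed rollout quantities and are not tied to any particular RLHF stack:

```python
import torch

# Sketch of the clipped PPO objective over a batch of actions (tokens).
def ppo_clip_loss(log_probs: torch.Tensor,
                  old_log_probs: torch.Tensor,
                  advantages: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(log_probs - old_log_probs)                    # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                    # negate so minimizing maximizes the objective
```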
Also discussed are Reinforcement Learning from AI Feedback (RLAIF) [lee2023rlaif], TRPO, and DPO, for which the DPO loss function is:
$\mathcal{L}^{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x, y^+, y^-) \sim \mathcal{D}_{\mathrm{train}}} \Bigl[ \log \sigma\!\Bigl( \beta \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\mathrm{ref}}(y^+ \mid x)} - \beta \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\mathrm{ref}}(y^- \mid x)} \Bigr) \Bigr]$
where:
- $\pi_\theta$ is the learnable policy
- $\pi_{\mathrm{ref}}$ is a reference policy
- $\sigma(\cdot)$ is the sigmoid function
- $\beta$ is a scaling parameter
- $\mathcal{D}_{\mathrm{train}}$ is a dataset of triplets $(x, y^+, y^-)$ where $y^+$ is the preferred output over $y^-$
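A minimal sketch of the DPO loss above for one batch of triplets; the four inputs are assumed to be the summed log-probabilities of each response under the trainable policy $\pi_\theta$ and the frozen reference policy $\pi_{\mathrm{ref}}$:

```python
import torch
import torch.nn.functional as F

# Sketch of the DPO loss for a batch of (x, y+, y-) triplets. All four inputs
# are assumed (batch,) tensors of summed per-token log-probabilities.
def dpo_loss(policy_pos: torch.Tensor, policy_neg: torch.Tensor,
             ref_pos: torch.Tensor, ref_neg: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # beta * [log pi_theta(y+|x)/pi_ref(y+|x) - log pi_theta(y-|x)/pi_ref(y-|x)]
    margins = beta * ((policy_pos - ref_pos) - (policy_neg - ref_neg))
    return -F.logsigmoid(margins).mean()
```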
The paper also discusses Offline Reasoning Optimization (OREO) [oreo] and GRPO.
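As a rough sketch of the group-relative idea behind GRPO, assuming a group of sampled responses per prompt scored by a reward model (not the paper's exact formulation):

```python
import torch

# Sketch of the group-relative signal used in GRPO-style training: sample a
# group of responses to the same prompt, score each with the reward model, and
# normalize rewards within the group instead of learning a separate value function.
def group_relative_advantages(group_rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # group_rewards: (G,) scalar rewards for G responses to one prompt
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)
```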
Further, the paper summarizes various supervised fine-tuning methods for LLMs, including instruction fine-tuning, dialogue fine-tuning, CoT reasoning fine-tuning, and domain-specific fine-tuning. It explains distillation-based fine-tuning, where a student LLM is fine-tuned to reproduce both the final answer and the reasoning chain of a teacher LLM, as well as preference and alignment SFT. Also presented are efficient fine-tuning techniques, highlighting PEFT strategies like LoRA [hu2021lora] and adapters.
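To make the PEFT idea concrete, here is a minimal LoRA-style wrapper around a frozen linear layer, a sketch assuming PyTorch; the rank `r` and scaling `alpha` are illustrative hyperparameters:

```python
import torch
import torch.nn as nn

# Sketch of a LoRA-style adapter: the pretrained weight stays frozen and only the
# low-rank update (alpha / r) * B A is trained.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                               # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```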
The survey categorizes test-time scaling methods into parallel scaling, sequential scaling, and search-based methods. It details Beam Search, Best-of-N Search, Compute-Optimal Scaling and CoT prompting [wei2022chain]. It presents Self-Consistency Decoding, ToT framework [yao2024tree] and Graph of Thoughts. Also discussed are confidence-based sampling, search against verifiers, and self-improvement via refinements.
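As an example of the parallel-scaling idea, here is a minimal self-consistency sketch; `sample_answer` is a hypothetical stub standing in for sampling one chain-of-thought completion and extracting its final answer:

```python
from collections import Counter

# Sketch of self-consistency decoding: sample N completions for the same prompt
# and take a majority vote over their final answers.
def self_consistency(sample_answer, prompt: str, n: int = 16) -> str:
    answers = [sample_answer(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```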
The survey concludes by listing benchmarks for LLM post-training evaluation and discussing future research directions.