MBPO-PPO: Memory-Optimized PPO in RLHF
- MBPO-PPO is a set of optimizations for PPO in RLHF that merges the SFT and reward models into a unified Hydra architecture, reducing redundant parameter storage.
- Dynamic LoRA switching enables efficient deactivation of additional weights, decreasing the memory footprint by approximately 20% and lowering per-sample latency by up to 65%.
- The combined techniques improve throughput and scalability, facilitating broader access to RLHF for large language models on resource-constrained hardware.
MBPO-PPO refers to a set of memory and performance optimizations applied to Proximal Policy Optimization (PPO) within the context of Reinforcement Learning with Human Feedback (RLHF), specifically through the integration and dynamic configuration of model components as described in "Efficient RLHF: Reducing the Memory Usage of PPO" (Santacroce et al., 2023). These techniques are aimed at substantially reducing memory requirements and training latency for RLHF pipelines, thereby making PPO-based alignment methods more accessible for LLMs in practical deployments.
1. Context: RLHF and PPO Memory Barriers
Reinforcement Learning with Human Feedback (RLHF) is widely adopted in aligning LLMs with human preferences. A central bottleneck in RLHF arises during the PPO stage, which, in typical implementations, requires more than three times the memory footprint of supervised fine-tuning (SFT). This significant memory overhead primarily stems from duplicative model storage: conventional RLHF pipelines require holding separate models for SFT/reference (static), the actor (trainable), the reward model (RM), and the critic—each potentially at full scale. In practice, this impedes RLHF deployment for large models and restricts experimentation on resource-constrained hardware (Santacroce et al., 2023).
2. Integration of SFT and Reward Model: The Hydra Architecture
A foundational innovation in MBPO-PPO is the merger of the SFT and reward models into a single entity, termed the "Hydra" model. Traditionally, RLHF keeps the reference/SFT and reward models as distinct architectures even though they share substantial parameter overlap, since both are typically initialized from the same checkpoint.
The Hydra model unifies both the causal language modeling head and the reward prediction head within a single network, denoted as $\pi^{\text{Hydra}}_\theta$. The corresponding loss function integrates the SFT and RM objectives:

$$\mathcal{L}_{\text{Hydra}}(x, y_w, y_l) \;=\; \mathcal{L}_{\text{SFT}}(x, y_w) \;+\; \lambda\, \mathcal{L}_{\text{RM}}(x, y_w, y_l),$$

where $x$ represents the prompt, $y_w$ and $y_l$ denote the winning and losing completions from preference comparisons, $\mathcal{L}_{\text{SFT}}$ is the cross-entropy loss for SFT, $\mathcal{L}_{\text{RM}}$ is the reward model loss, and $\lambda$ controls their relative weighting. This framework obviates the need to store redundant parameter sets, reducing the number of resident static models during PPO from four to one (see Table 2 of Santacroce et al., 2023).
| Model Configuration | Traditional Setup | Hydra Model Setup |
|---|---|---|
| # Static Models in PPO | ≥4 | 1 |
| Parameter Storage Redundancy | High | Minimal |
This architectural merger simplifies both parameter storage and compute graphs, directly addressing the memory inefficiency inherent in RLHF workflows.
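To make the head-sharing concrete, the following is a minimal PyTorch-style sketch of a two-headed Hydra wrapper and the combined objective described above. The names (`HydraHeads`, `hydra_loss`) and the backbone interface are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HydraHeads(nn.Module):
    """Hypothetical two-headed wrapper: one shared backbone feeding a causal
    language-modeling head (SFT) and a scalar reward head (RM)."""

    def __init__(self, backbone: nn.Module, hidden_size: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                       # shared transformer body
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.reward_head = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, input_ids):
        hidden = self.backbone(input_ids)              # (batch, seq, hidden)
        lm_logits = self.lm_head(hidden)               # (batch, seq, vocab)
        reward = self.reward_head(hidden[:, -1]).squeeze(-1)  # one score per sequence
        return lm_logits, reward


def hydra_loss(model, win_ids, lose_ids, lam=1.0):
    """Joint objective: SFT cross-entropy on the preferred sequence plus a
    pairwise reward-model loss, weighted by lam (the lambda in the text)."""
    lm_logits_w, reward_w = model(win_ids)
    _, reward_l = model(lose_ids)

    # SFT term: next-token prediction on the winning (prompt + completion) tokens.
    sft = F.cross_entropy(
        lm_logits_w[:, :-1].reshape(-1, lm_logits_w.size(-1)),
        win_ids[:, 1:].reshape(-1),
    )
    # RM term: Bradley-Terry style pairwise loss; the winner should outscore the loser.
    rm = -F.logsigmoid(reward_w - reward_l).mean()
    return sft + lam * rm
```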
3. Dynamic Low-Rank Adaptation: LoRA Switching
Another critical component of MBPO-PPO is the dynamic activation or "switching" of LoRA (Low-Rank Adaptation) modules. LoRA is conventionally deployed by appending small trainable matrices ("LoRA weights") to a large, frozen base model. In standard LoRA-PPO, separate model weights are maintained for the actor and a static reference (both copies of the SFT), whose only difference lies in their respective LoRA adaptations.
MBPO-PPO leverages this by dynamically deactivating the LoRA modules to instantly recover the reference policy, mathematically defined as:

$$\pi_{\text{ref}} \;=\; \Lambda\big(\pi_{\text{actor}}\big),$$

where $\Lambda(\cdot)$ is an operator that ignores LoRA weights. The same approach is used to retrieve the critic from the reward head. By retaining only one copy of the base model and independently storing the small LoRA weight sets for the actor and critic, the memory required for model storage is further reduced. The reported savings are approximately 20% compared to maintaining two full model copies in memory for LoRA-PPO (Santacroce et al., 2023).
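A minimal sketch of this switching mechanism is shown below, assuming a simple LoRA implementation; the `LoRALinear` module and `set_lora` helper are hypothetical and not the paper's code. Toggling the adapters off evaluates the same frozen base weights, which is how the reference policy is recovered without a second resident model.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Minimal LoRA-augmented linear layer (illustrative). With the adapter
    disabled, the layer reduces to the frozen base projection."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                 # frozen base model
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank
        self.lora_enabled = True

    def forward(self, x):
        out = self.base(x)
        if self.lora_enabled:                            # actor / critic path
            out = out + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return out                                       # reference / RM path when off


def set_lora(model: nn.Module, enabled: bool) -> None:
    """Toggle every LoRA adapter in the model on or off in place."""
    for module in model.modules():
        if isinstance(module, LoRALinear):
            module.lora_enabled = enabled
```

In a PPO step under this scheme, actor log-probabilities would be computed with the adapters enabled; calling `set_lora(model, False)` on the same object then yields reference log-probabilities from the identical base weights, so no separate reference model needs to be held in memory.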
4. Performance Metrics: Memory, Latency, and Throughput
MBPO-PPO demonstrates notable empirical gains in both memory consumption and runtime efficiency. Specifically:
- Memory Usage: LoRA-enabled PPO yields memory usage lower than SFT, and dynamic LoRA switching in Hydra-PPO achieves further reduction—approximately 20%—relative to conventional LoRA-PPO.
- Latency per Sample: LoRA-PPO incurs a per-sample latency of roughly 18.75 seconds, whereas Hydra-PPO reduces this to about 6.47 seconds, representing up to a 65% reduction.
- Throughput and Scaling: Hydra-PPO supports larger training batch sizes owing to lower memory overhead. Throughput experiments (see Figure 1 of Santacroce et al., 2023) show that Hydra-PPO scales more robustly as sequence length increases, while traditional methods experience rapidly degrading throughput.
The surrogate PPO training objective remains unchanged, preserving standard stability and convergence properties:

$$L^{\text{CLIP}}(\theta) \;=\; \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\big)\Big],$$

where $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the policy probability ratio and $\hat{A}_t$ denotes the advantage estimate.
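For reference, a compact sketch of this clipped surrogate in PyTorch follows; the function name and argument layout are assumptions, and only the formula above comes from the source.

```python
import torch


def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Standard clipped PPO surrogate, which MBPO-PPO leaves unchanged.

    logp_new / logp_old: log-probabilities of the sampled tokens under the
    current and behaviour policies; advantages: advantage estimates (A_hat_t).
    """
    ratio = torch.exp(logp_new - logp_old)               # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negate because optimizers minimize, while the PPO objective is maximized.
    return -torch.min(unclipped, clipped).mean()
```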
5. Practical Implications for RLHF Pipelines
The combination of Hydra model merging and dynamic LoRA switching enables RLHF with PPO to be executed on hardware with significantly less memory. This technical advance lowers access barriers for practitioners wishing to perform RLHF on large models, since only a single copy of the base model plus small LoRA weight sets for the actor and critic are needed. The reduced per-sample latency and improved throughput scaling also make it easier to increase batch sizes and experiment with longer sequence lengths, which were previously limited by hardware memory.
A plausible implication is that the improved hardware efficiency will facilitate broader access to alignment training techniques and catalyze more extensive experimentation with RLHF variants. Additionally, by minimizing redundant static model storage, the approach could serve as a reference baseline for further research on model efficiency and RLHF in LLMs.
6. Limitations and Considerations
While MBPO-PPO addresses core memory and runtime bottlenecks, the surrogate PPO objective remains unchanged and thus is subject to known PPO limitations, such as sensitivity to hyperparameters and reward misalignment. The architecture presumes compatibility with LoRA-based adaptation; alternative adapter strategies would require verification for seamless integration in the Hydra-PPO paradigm.
Further, the one-to-one mapping between Hydra-SFT heads and RLHF components potentially complicates modular extension of the model; for example, multi-reward heads or multi-reference models would require altered head-sharing logic. Nevertheless, the technique preserves RLHF task performance while delivering quantifiable system-level efficiency improvements, as reflected in the reported benchmarks.
7. Summary of Contributions and Impact
MBPO-PPO, as exemplified by Hydra-RLHF and dynamic LoRA switching, consolidates the SFT and reward models into a single multi-headed network and leverages the dynamic deactivation of LoRA modules to minimize redundant parameter storage. This reduces the number of static models involved in PPO from four to one and achieves a reported 20% memory reduction compared to previous LoRA-PPO approaches, accompanied by a 65% reduction in per-sample latency. These optimizations are directly linked to broader accessibility and scalability of PPO-based RLHF pipelines, particularly for LLMs and resource-constrained deployments (Santacroce et al., 2023).