Bottom-up Policy Optimization (BuPO)
- BuPO is a reinforcement learning framework that decomposes policies into modular internal components for more precise optimization.
- It employs a two-phase training regime that first aligns internal layers before global policy optimization, enhancing sample efficiency and performance.
- Empirical results show BuPO outperforms traditional methods by 2–4 Avg@K points on reasoning benchmarks and supports robust decision-making under distributional shifts.
Bottom-up Policy Optimization (BuPO) is a reinforcement learning (RL) paradigm that diverges from classical top-level policy RL by directly targeting the optimization of internal structure within models—whether through explicit modularity in value-based RL or through the internal hidden state evolution of LLMs. BuPO sheds light on fine-grained policy geometry and the design of hierarchical or modular RL algorithms for reasoning-centric deep networks and offline decision-making under distributional shift (Tan et al., 22 Dec 2025, Zhou, 2023).
1. Foundational Frameworks and Motivation
Traditional RL approaches typically treat the policy—whether parameterized directly or induced via deep networks—as a monolithic mapping from state (or context) to action. However, in architectures such as Transformers or in value-based RL with function approximation, the policy emerges from a complex composition of layered or hierarchical components. BuPO departs from standard conventions by leveraging this compositionality to design RL objectives and updates that align not only the overall policy but also its internal constituents. This principle can be instantiated via explicit bi-level value-policy optimization in offline RL (Zhou, 2023) or by targeting "internal policies" at intermediate layers of LLMs (Tan et al., 22 Dec 2025).
The core motivation is twofold: (1) to provide more targeted and robust optimization under distributional shift or when foundational representations must support complex reasoning, and (2) to exploit the multi-stage evolution of confidence, entropy, and task-specific computations within deep architectures for enhanced sample efficiency and generalization.
2. Decomposition of Policy in Layered Architectures
BuPO's instantiation in the LLM RL context (Tan et al., 22 Dec 2025) relies on a rigorous decomposition of the Transformer policy. In a decoder-only Transformer of depth $L$, the residual representations after each layer are given by
$$\tilde h^{(\ell)} = h^{(\ell-1)} + \mathrm{Attn}^{(\ell)}\!\bigl(h^{(\ell-1)}\bigr), \qquad h^{(\ell)} = \tilde h^{(\ell)} + \mathrm{FFN}^{(\ell)}\!\bigl(\tilde h^{(\ell)}\bigr), \qquad \ell = 1,\dots,L,$$
with $\mathrm{Attn}^{(\ell)}$ (multi-head self-attention), $\mathrm{FFN}^{(\ell)}$ (position-wise feed-forward network), and $W_U$ the unembedding matrix (normalization omitted for clarity). This additive structure induces:
- Internal Layer Policy: The distribution
$$\pi^{(\ell)}(\cdot \mid x) = \mathrm{softmax}\!\bigl(W_U\, h^{(\ell)}\bigr)$$
at any layer $\ell \le L$.
- Internal Modular Policy: The isolated contribution of the attention or FFN submodule $m^{(\ell)} \in \{\mathrm{Attn}^{(\ell)}, \mathrm{FFN}^{(\ell)}\}$ in layer $\ell$ can be sampled via
$$\pi^{(\ell, m)}(\cdot \mid x) = \mathrm{softmax}\!\bigl(W_U\, m^{(\ell)}(h^{(\ell-1)})\bigr).$$
Thus, the full policy can be viewed as a stack and composition of internal (layer- and module-wise) policies, permitting direct analysis and optimization at partial depths.
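As a concrete illustration (not taken from the paper), an internal layer policy can be read off any open-weight decoder-only LM by applying the final norm and unembedding to an intermediate residual state, in the spirit of the logit lens. The sketch below assumes a Hugging Face `transformers` causal LM; Qwen2.5-0.5B is an arbitrary illustrative choice, and module names such as `model.model.norm` and `model.lm_head` follow that library's Llama/Qwen architectures rather than anything specified by BuPO.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any decoder-only causal LM exposing hidden states works.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
model.eval()

@torch.no_grad()
def internal_layer_policy(prompt: str, layer: int) -> torch.Tensor:
    """Next-token distribution induced by the residual stream after `layer`,
    obtained by applying the final norm and the unembedding (logit-lens style)."""
    inputs = tok(prompt, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output; hidden_states[layer] follows block `layer`.
    h = out.hidden_states[layer][:, -1, :]   # residual state at the last position
    h = model.model.norm(h)                  # final RMSNorm (architecture-dependent)
    logits = model.lm_head(h)                # unembedding W_U
    return torch.softmax(logits, dim=-1)     # internal layer policy pi^(layer)

# Example: compare an intermediate layer's policy with the full (top-layer) policy.
p_mid = internal_layer_policy("2 + 2 =", layer=12)
p_top = internal_layer_policy("2 + 2 =", layer=model.config.num_hidden_layers)
```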
3. Entropy Analysis and Internal Reasoning Dynamics
BuPO exposes that LLMs do not propagate decision certainty uniformly across layers. Instead, empirical entropy measurements reveal a progressive collapse:
- Early layers: Maintain high entropy (large $H(\pi^{(\ell)})$), supporting broad exploration and combinatorial reasoning.
- Later layers: Gradually reduce entropy, culminating in a low-entropy, confident final policy.
- Model-specific patterns: Llama models show sharp entropy reduction only at the topmost layers, while Qwen and particularly Qwen3 exhibit a more human-like gradual "exploration–integration–convergence" regime within FFNs, with clear phase transitions in the entropy change $\Delta H^{(\ell)}$.
This stratified pattern suggests the appropriateness of staged, bottom-up alignment protocols, motivating the design of BuPO's training curriculum (Tan et al., 22 Dec 2025).
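To make the entropy-collapse observation concrete, the following sketch (reusing `tok`, `model`, and the readout convention from the previous snippet) computes the Shannon entropy of the next-token internal layer policy at every depth, along with the layer-to-layer entropy change; the prompt is purely illustrative.

```python
import torch

@torch.no_grad()
def layer_entropy_profile(prompt: str) -> list[float]:
    """Shannon entropy H(pi^(l)) of each internal layer policy for the next token.
    Assumes `tok` and `model` from the previous sketch."""
    inputs = tok(prompt, return_tensors="pt")
    hs = model(**inputs, output_hidden_states=True).hidden_states  # length L + 1
    entropies = []
    for h in hs[1:]:                                    # skip the embedding output
        logits = model.lm_head(model.model.norm(h[:, -1, :]))
        p = torch.softmax(logits, dim=-1)
        entropies.append(-(p * torch.log(p.clamp_min(1e-12))).sum().item())
    return entropies

profile = layer_entropy_profile("If x + 3 = 7, then x =")
deltas = [b - a for a, b in zip(profile, profile[1:])]  # layer-wise entropy change
```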
4. BuPO Optimization Objective and Algorithmic Implementation
The BuPO objective introduces a two-phase RL training protocol:
- Phase I (Internal Layer Alignment): For the first $T_{\text{int}}$ optimization steps, BuPO maximizes a Group Relative Policy Optimization (GRPO)-style surrogate using the internal layer policy $\pi^{(\ell)}_\theta$ at a chosen depth $\ell$, with the policy update and backpropagation localized to layers $\le \ell$:
$$\mathcal{J}_{\text{int}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\Big(r^{(\ell)}_{i,t}(\theta)\,\hat{A}_{i,t},\; \mathrm{clip}\big(r^{(\ell)}_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_{i,t}\Big)\right],$$
where $r^{(\ell)}_{i,t}(\theta)$ is the token-level probability ratio computed with the internal layer policy $\pi^{(\ell)}_\theta$ in place of the full policy, and $\hat{A}_{i,t}$ is the group-relative advantage.
- Phase II (Full Policy Optimization): Subsequent steps maximize the usual GRPO surrogate with the rollout policy at the top layer.
The full training loop consists of sampling prompts, generating rollouts, computing rewards and advantages, and updating parameters either in a restricted or unrestricted fashion, as dictated by the current training phase. The two-stage regime is strictly scheduled:
$$\pi^{(t)}_{\text{opt}} = \begin{cases} \pi^{(\ell)}_\theta & t \le T_{\text{int}} \quad \text{(Phase I)}, \\ \pi^{(L)}_\theta & t > T_{\text{int}} \quad \text{(Phase II)}. \end{cases}$$
This structure encourages foundational alignment and representation learning at intermediate layers before final global optimization, with empirical evidence that a moderate Phase I duration $T_{\text{int}}$ yields stability and avoids undesirable policy collapse.
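A minimal schematic of the two-phase loop, assuming hypothetical helpers `collect_rollouts` (samples grouped responses from the full top-layer policy and returns group-relative advantages plus rollout log-probabilities) and `score_tokens` (re-scores those tokens under either the internal layer policy $\pi^{(\ell)}$ or the top-layer policy); the hyperparameter names and the clipped surrogate follow standard GRPO practice rather than the paper's exact implementation.

```python
import torch

# Illustrative hyperparameters; names are hypothetical, not from the paper.
T_INT = 50       # Phase I duration (internal-alignment steps)
ELL = 20         # internal layer targeted during Phase I
CLIP_EPS = 0.2   # PPO/GRPO clipping range

def grpo_surrogate(logp_new, logp_old, advantages, eps=CLIP_EPS):
    """Clipped surrogate on token log-probabilities with group-relative advantages."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()

def bupo_step(step, prompts, policy, optimizer):
    """One BuPO update. Rollouts always come from the full top-layer policy;
    Phase I re-scores them under the internal layer policy pi^(ELL), so the
    gradient only reaches layers <= ELL (plus the shared unembedding)."""
    # Hypothetical helper: grouped rollouts, standardized advantages, rollout log-probs.
    groups, advantages, logp_old = collect_rollouts(policy, prompts)

    if step < T_INT:
        # Phase I: internal layer alignment.
        logp_new = score_tokens(policy, groups, layer=ELL)    # hypothetical helper
    else:
        # Phase II: standard full-policy optimization.
        logp_new = score_tokens(policy, groups, layer=None)   # None = top layer

    loss = -grpo_surrogate(logp_new, logp_old, advantages)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```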
5. Experimental Results and Quantitative Comparison
BuPO has been evaluated on complex reasoning benchmarks, notably AMC12, MATH500, AIME24, and AIME25. Main findings include (Tan et al., 22 Dec 2025):
| Method | Avg@K (Qwen3-4B) | Avg@K Δ (Qwen3-8B) | Avg@K Δ (Llama-3B) | Avg@K Δ (Llama-8B) |
|---|---|---|---|---|
| Base | 47.44 | — | — | — |
| PPO | 55.22 | — | — | — |
| RLOO | 54.00 | — | — | — |
| GRPO | 55.08 | — | — | — |
| BuPO | 58.51 | +2.13 | +1.01 | +3.68 |

For Qwen3-8B, Llama-3B, and Llama-8B, only BuPO's Avg@K improvement (Δ) is reported here; absolute per-method scores are omitted (—).
BuPO consistently outperforms all baselines by 2–4 Avg@K points and maintains superior Pass@K for all evaluated $K$. Similar outperformance is observed across all tested model architectures, confirming the benefit of bottom-up policy alignment.
6. Broader Implications and Theoretical Foundations
The BuPO paradigm generalizes beyond LLMs. In the context of offline RL with function approximation and limited exploration (Zhou, 2023), a related bi-level bottom-up policy optimization design is introduced:
- Lower Level: Construct a value confidence set by enforcing (weighted) Bellman error bounds and detection penalties, effectively capturing support extrapolation uncertainty.
- Upper Level: Maximize a conservative value function lower bound by optimizing over the value confidence set.
- Algorithm: Performs adversarial, penalized saddle-point optimization over the value confidence set, followed by mirror-descent policy updates (a schematic of the bi-level objective is given below).
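In schematic form, using generic notation ($\mathcal{F}$ for the value function class, $\mathcal{E}_{\mathcal{D}}$ for a weighted Bellman-error functional, $b_{\mathcal{D}}$ for the detection penalty, $\delta$ for the confidence radius) that may differ from the exact definitions in (Zhou, 2023), the bi-level design reads:

```latex
% Lower level: value confidence set built from (weighted) Bellman-error
% and off-support detection penalties on the offline dataset D
\mathcal{V}_\delta(\pi) \;=\; \Bigl\{\, V \in \mathcal{F} \;:\;
    \mathcal{E}_{\mathcal{D}}(V, \pi) + b_{\mathcal{D}}(V, \pi) \;\le\; \delta \,\Bigr\}

% Upper level: maximize a conservative lower bound on the value of pi,
% taken over the confidence set (pessimism under distributional shift)
\hat{\pi} \;\in\; \arg\max_{\pi} \;\; \min_{V \in \mathcal{V}_\delta(\pi)} \;
    \mathbb{E}_{s_0 \sim \rho_0}\!\bigl[\, V(s_0) \,\bigr]
```

The inner minimization enforces pessimism: every value function consistent with the data and the detection penalty is a candidate, and the policy is evaluated against the least favorable one.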
Rigorous regret and sample complexity bounds are established, showing polynomial efficiency under realizability and function class complexity conditions, and confirming robust performance—even with partial dataset coverage and off-support bias.
Empirically, this framework achieves state-of-the-art performance across synthetic control, D4RL, and real-world medical RL settings, robustly handling distributional shift due to limited exploration.
7. Key Insights and Limitations
- Progressive Internal Reasoning: BuPO operationalizes the finding that deep networks, especially LLMs, build answers via a multi-phase process: broad early exploration, intermediate integration, and late answer convergence.
- Feature Refinement: Aligning internal policies increases the similarity between lower-layer representations and top-level outputs, effectively moving crucial reasoning load earlier in the network.
- Algorithmic Stability: Overly aggressive internal alignment (excessive $T_{\text{int}}$) can degrade final policy quality; a moderate Phase I duration yields the best end-to-end outcomes.
- Generalizability: The concept of explicitly optimizing "internal" components generalizes to offline RL and may inform future modular RL design.
The bottom-up approach offers a principled framework for modular RL and representation alignment, with empirical and theoretical support that extends beyond a single architecture or domain. BuPO raises important directions for the design of future RL and reasoning algorithms in both language and control domains (Tan et al., 22 Dec 2025, Zhou, 2023).