
Tree-Gate Proximal Policy Optimization

Updated 24 November 2025
  • TGPPO is a reinforcement learning framework that uses a tree-gate transformer architecture to select branching variables in MILPs.
  • It formulates branch-and-bound as a Markov Decision Process and trains with on-policy PPO and generalized advantage estimation to reduce node counts and primal-dual integrals.
  • Empirical evaluations show TGPPO significantly outperforms traditional and learned baselines, especially on out-of-distribution MILP instances.

Tree-Gate Proximal Policy Optimization (TGPPO) is a reinforcement learning (RL) framework for learning branching policies in Mixed Integer Linear Programs (MILPs) solved by the branch-and-bound (B&B) paradigm. TGPPO addresses the challenge of generalization across structurally diverse MILP instances by combining on-policy Proximal Policy Optimization with a transformer-based architecture that encodes the combinatorial context of the evolving B&B search tree. Empirical evidence demonstrates that TGPPO surpasses both hand-crafted and existing learned branching strategies in terms of node-count reduction and primal-dual integral, particularly on out-of-distribution instances (Mhamed et al., 17 Nov 2025).

1. Branch-and-Bound Policy Learning as Markov Decision Process

TGPPO formulates the B&B procedure as a Markov Decision Process (MDP) in which the agent interacts with a MILP solver (SCIP). At decision step $t$, the state $s_t$ consists of:

  • $C_t$: The set of candidate fractional variables for branching, each with a feature vector $c_{t,i} \in \mathbb{R}^{d_c}$, $d_c = 25$.
  • $Tree_t$: Global and local tree statistics, partitioned into $n_t \in \mathbb{R}^{d_n}$ ($d_n = 8$) and $m_t \in \mathbb{R}^{d_m}$ ($d_m = 53$).

The set of available actions $a_t \in \{1, \ldots, |C_t|\}$ corresponds to selecting a candidate variable on which to branch. The solver updates the state by expanding the B&B tree according to SCIP's default node selection. The scalar reward $r_t$ penalizes expansion of large trees and lack of progress, normalized relative to a baseline policy. The transition and reward mechanisms induce variable-length episodes, terminating at optimality or at time/resource limits.
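As a concrete illustration, the state above can be held in a small container. The class and field names below are ours; only the dimensions ($d_c = 25$, $d_n = 8$, $d_m = 53$) come from the paper:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class BranchingState:
    # Illustrative container for one B&B decision point s_t; names are
    # hypothetical, dimensions (d_c=25, d_n=8, d_m=53) follow the paper.
    candidates: np.ndarray  # (|C_t|, 25) feature vectors c_{t,i} of fractional candidates
    stats_n: np.ndarray     # (8,)  tree statistics n_t
    stats_m: np.ndarray     # (53,) tree statistics m_t

    def num_actions(self) -> int:
        # An action a_t indexes one candidate variable to branch on.
        return self.candidates.shape[0]

s = BranchingState(np.zeros((12, 25)), np.zeros(8), np.zeros(53))
```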

2. State Space and Tree-Gate Representation

TGPPO employs a parameterized tree-gate encoding inspired by tree-aware transformer architectures to flexibly represent B&B states. The feature pipeline is:

  • Project candidate features: $\tilde{c}_{t,i} = W_c\,\mathrm{LN}(c_{t,i}) \in \mathbb{R}^{d_h}$, with $W_c \in \mathbb{R}^{d_h \times d_c}$.
  • Project global tree context: $\tilde{t}_t = W_t\,\mathrm{LN}([n_t; m_t]) \in \mathbb{R}^{d_h}$, with $W_t \in \mathbb{R}^{d_h \times (d_n + d_m)}$.
  • Concatenate and linearly transform: $z_{t,i}^{(0)} = W_g[\tilde{c}_{t,i}; \tilde{t}_t] \in \mathbb{R}^{d_h}$, with $W_g \in \mathbb{R}^{d_h \times 2d_h}$.

The set $Z_t^{(0)} = [z_{t,1}^{(0)}, \ldots, z_{t,|C_t|}^{(0)}]$ is permutation-equivariant, allowing the representation of variable-arity candidate sets. A padding mask $M_{\text{pad}} \in \{0,1\}^{|C_t|}$ enables batching.
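A minimal NumPy sketch of this projection pipeline, with random matrices standing in for the learned weights $W_c$, $W_t$, $W_g$ and a LayerNorm without learned affine parameters (both simplifying assumptions):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the last axis to zero mean / unit variance (no learned affine).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

d_c, d_n, d_m, d_h = 25, 8, 53, 256
rng = np.random.default_rng(0)
W_c = rng.standard_normal((d_h, d_c)) / np.sqrt(d_c)
W_t = rng.standard_normal((d_h, d_n + d_m)) / np.sqrt(d_n + d_m)
W_g = rng.standard_normal((d_h, 2 * d_h)) / np.sqrt(2 * d_h)

def encode_state(C, n, m):
    # C: (|C_t|, d_c) candidate features; n: (d_n,), m: (d_m,) tree statistics.
    c_tilde = layer_norm(C) @ W_c.T                     # (|C_t|, d_h)
    t_tilde = W_t @ layer_norm(np.concatenate([n, m]))  # (d_h,) shared tree context
    fused = np.concatenate(
        [c_tilde, np.broadcast_to(t_tilde, c_tilde.shape)], axis=-1)
    return fused @ W_g.T, t_tilde                       # Z_t^(0): (|C_t|, d_h)

C = rng.standard_normal((7, d_c))
n, m = rng.standard_normal(d_n), rng.standard_normal(d_m)
Z0, t_tilde = encode_state(C, n, m)
```

Because every candidate row is processed identically and the tree context is broadcast, permuting the rows of `C` permutes the rows of `Z0` accordingly, which is the permutation-equivariance property stated above.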

3. Network Architecture and Policy Parameterization

TGPPO utilizes an actor-critic model with a shared network core:

  • Transformer Encoder: An $N$-layer transformer, respecting $M_{\text{pad}}$, processes $Z_t^{(0)}$ and outputs $Z_t^{(N)}$.
  • Bi-directional Matching Block: Implements mutual attention between the transformed candidate features ($z_{t,i}^{(N)}$) and the tree context ($\tilde{t}_t$), using:
    • $\alpha_{t,i} = \mathrm{softmax}_i\big((W_{t1}\tilde{t}_t)^\top z_{t,i}^{(N)}\big)$
    • $\beta_{t,i} = \mathrm{softmax}_i\big((W_{c1} z_{t,i}^{(N)})^\top \tilde{t}_t\big)$
    • Aggregation and fusion: $r_{t,i} = \sigma(W_3 e_t + W_4 d_{t,i}) \odot e_t + [1 - \sigma(W_3 e_t + W_4 d_{t,i})] \odot d_{t,i}$
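The matching block can be sketched as below. The section does not spell out how the aggregates $e_t$ and $d_{t,i}$ are built from the attention weights, so the definitions in the code ($e_t$ as the $\alpha$-weighted candidate summary, $d_{t,i}$ as a $\beta$-scaled tree-context residual on each candidate) are plausible assumptions, not the reference implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d_h = 16  # small hidden size for illustration
rng = np.random.default_rng(1)
W_t1 = rng.standard_normal((d_h, d_h)) / np.sqrt(d_h)
W_c1 = rng.standard_normal((d_h, d_h)) / np.sqrt(d_h)
W3 = rng.standard_normal((d_h, d_h)) / np.sqrt(d_h)
W4 = rng.standard_normal((d_h, d_h)) / np.sqrt(d_h)

def matching_block(Z, t_tilde):
    # Z: (|C|, d_h) encoder outputs z^(N); t_tilde: (d_h,) tree context.
    alpha = softmax(Z @ (W_t1 @ t_tilde))        # tree -> candidate attention
    beta = softmax((Z @ W_c1.T) @ t_tilde)       # candidate -> tree attention
    e = alpha @ Z                                # (d_h,) attended summary (assumed)
    D = beta[:, None] * t_tilde[None, :] + Z     # (|C|, d_h) per-candidate term (assumed)
    gate = 1.0 / (1.0 + np.exp(-(e @ W3.T + D @ W4.T)))  # sigma(W3 e + W4 d)
    return gate * e + (1.0 - gate) * D           # gated fusion r_{t,i}

R = matching_block(rng.standard_normal((5, d_h)), rng.standard_normal(d_h))
```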

Actor Head (Tree-Gated MLP):

  • Applies $K$ successive tree-gated layers: $q_{t,i}^{(k)} = f_k(q_{t,i}^{(k-1)} \odot g^{(k)})$, with $g^{(k)} = \sigma(U_k \tilde{t}_t)$ and $q_{t,i}^{(0)} = r_{t,i}$.
  • Computes logits $\ell_{t,i}$; the policy is $\pi_\theta(i \mid s_t) = \mathrm{softmax}_i([\ell_{t,1}, \ldots, \ell_{t,|C_t|}])$.
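A sketch of the tree-gated actor head. The form of $f_k$ and of the logit head is not specified above, so a tanh-activated linear layer for $f_k$ and a learned vector for the logits are assumptions:

```python
import numpy as np

d_h, K = 16, 2  # illustrative sizes
rng = np.random.default_rng(2)
U = [rng.standard_normal((d_h, d_h)) / np.sqrt(d_h) for _ in range(K)]
F = [rng.standard_normal((d_h, d_h)) / np.sqrt(d_h) for _ in range(K)]
w_logit = rng.standard_normal(d_h) / np.sqrt(d_h)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def actor_head(R, t_tilde):
    # R: (|C|, d_h) fused candidate features r_{t,i}; t_tilde: (d_h,) tree context.
    q = R
    for k in range(K):
        g = sigmoid(U[k] @ t_tilde)      # tree-derived gate g^(k)
        q = np.tanh((q * g) @ F[k].T)    # f_k as a tanh linear layer (assumed)
    logits = q @ w_logit                 # one logit per candidate
    z = np.exp(logits - logits.max())
    return z / z.sum()                   # pi_theta(i | s_t) over candidates

pi = actor_head(rng.standard_normal((6, d_h)), rng.standard_normal(d_h))
```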

Critic Head (Value Function):

  • Computes a mean-pooled summary $\bar{r}_t$, concatenates it with the tree context, and passes the result through a two-layer MLP and a tree-gated reduction to obtain $V_\phi(s_t)$.
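The critic head can likewise be sketched. The two-layer MLP below (tanh activation, final linear reduction) is an assumed instantiation, and the tree-gated reduction is omitted for brevity:

```python
import numpy as np

d_h = 16  # illustrative size
rng = np.random.default_rng(3)
W1 = rng.standard_normal((d_h, 2 * d_h)) / np.sqrt(2 * d_h)
w2 = rng.standard_normal(d_h) / np.sqrt(d_h)

def critic_head(R, t_tilde):
    # R: (|C|, d_h) fused candidate features; t_tilde: (d_h,) tree context.
    r_bar = R.mean(axis=0)                               # mean-pooled summary
    h = np.tanh(W1 @ np.concatenate([r_bar, t_tilde]))   # first MLP layer
    return float(w2 @ h)                                 # scalar value V_phi(s_t)

v = critic_head(rng.standard_normal((4, d_h)), rng.standard_normal(d_h))
```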

4. Reinforcement Learning Objective and PPO Algorithm

The RL objective for TGPPO employs Proximal Policy Optimization (PPO), utilizing:

  • Cumulative discounted return: $R_t = \sum_{\ell=0}^{T-t} \gamma^\ell r_{t+\ell}$ (typically $\gamma \approx 0.97$).
  • Generalized Advantage Estimation (GAE): $A_t$ is computed from the temporal-difference residuals $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$.
  • PPO clipped surrogate objective:

$$L^{\text{PPO}}(\theta) = \mathbb{E}_t\!\left[\min\big(r_t(\theta) A_t,\; \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, A_t\big)\right]$$

with $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$.
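The discounted-return and GAE recursions can be computed in a single backward pass over a trajectory (the sample rewards and values are illustrative):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.97, lam=0.92):
    # rewards: length-T list; values: length T+1 (bootstrap value appended).
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

adv = gae_advantages([1.0, 0.0, -1.0], [0.5, 0.4, 0.3, 0.0])
```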

The full objective combines policy, value, and entropy terms:

$$L(\theta, \phi) = -L^{\text{PPO}}(\theta) + \lambda_1\, \mathbb{E}_s\big[(V_\phi(s) - R(s))^2\big] - \lambda_2\, \mathbb{E}_s\big[H(\pi_\theta(\cdot \mid s))\big]$$

where $H(\cdot)$ denotes policy entropy; the $\lambda_2$ term is an entropy bonus that encourages exploration.
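A compact sketch of the combined loss on a batch of transitions; the coefficient values and sample inputs are illustrative, not the paper's settings:

```python
import numpy as np

def ppo_loss(logp_new, logp_old, adv, values, returns, entropy,
             eps=0.16, lam1=0.5, lam2=0.01):
    # Probability ratio r_t(theta) computed in log space for stability.
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    l_ppo = np.minimum(ratio * adv, clipped * adv).mean()  # clipped surrogate
    l_value = ((values - returns) ** 2).mean()             # critic regression
    # Combined objective: maximize surrogate and entropy, minimize value error.
    return -l_ppo + lam1 * l_value - lam2 * entropy.mean()

loss = ppo_loss(np.array([-1.0]), np.array([-1.2]), np.array([2.0]),
                np.array([0.5]), np.array([1.0]), np.array([1.5]))
```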

5. Training Methodology and Hyperparameter Regime

TGPPO is trained via episodic on-policy RL as outlined in Algorithm 1. Each episode proceeds as follows:

  1. Initialize parameters $(\theta, \phi)$; reset the rollout buffer.
  2. Sample a MILP instance and seed; reset the SCIP state (for data augmentation).
  3. For each timestep $t$:
    • Extract candidate and tree features; form $s_t$.
    • Sample action $a_t \sim \pi_\theta(\cdot \mid s_t)$; execute it in SCIP; observe $r_t$ and $s_{t+1}$.
    • Store $(s_t, a_t, r_t, s_{t+1})$ in the buffer.
  4. On episode end or when the buffer is full, compute returns and GAE advantages.
  5. For $E$ epochs, sample mini-batches and update $(\theta, \phi)$ via AdamW on the combined loss.
  6. Repeat for subsequent episodes.

Key hyperparameters: $d_h = 256$, $N = 5$ transformer layers, 8 attention heads, $\epsilon = 0.16$, $\lambda = 0.92$ (GAE), batch size 256, $E = 3$, reward variant H3; all obtained via nested cross-validation with Optuna.

6. Experimental Evaluation and Results

TGPPO's empirical evaluation uses a training set of 25 MIPLIB 3/2010/2017 + CORAL instances (each with 5 SCIP seeds, yielding 125 episodes) and a test set of 66 held-out problems (33 “easy,” 33 “hard”). The experimental protocol imposes a one-hour solve cap per test instance. Metrics:

  • Node count ($N_\text{nodes}$): used on “easy” instances.
  • Primal-dual integral (PDI): used on “hard” instances.

Reporting uses the shifted geometric mean (SGM) and head-to-head comparisons.
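The shifted geometric mean used for reporting is straightforward to compute; the default shift of 10 below is a common convention in MILP benchmarking, not necessarily the paper's choice:

```python
import numpy as np

def shifted_geometric_mean(xs, shift=10.0):
    # SGM with shift s: exp(mean(log(x_i + s))) - s.
    # The shift damps the influence of near-zero measurements.
    xs = np.asarray(xs, dtype=float)
    return float(np.exp(np.log(xs + shift).mean()) - shift)

sgm = shifted_geometric_mean([100.0, 10000.0], shift=10.0)
```

With `shift=0` this reduces to the ordinary geometric mean; with a positive shift, very small values (e.g. node counts near zero) no longer dominate the aggregate.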

Key findings:

  • Against the state-of-the-art tbrant learner, TGPPO achieves superior performance in 78.8% of instances by node count and 90.6% by PDI.
  • TGPPO outperforms classical and learned baselines (pscost, relpscost, brant, ltbrant, tree) on 70%–90% of test instances.
  • Friedman–Nemenyi tests confirm statistically significant improvements (p < 0.001).
  • Comparative scatter plots demonstrate that the majority of test instances favor TGPPO (points fall below the diagonal).

7. Conclusions, Implications, and Future Directions

TGPPO establishes that on-policy PPO training mitigates overfitting to expert demonstrations, a notable limitation in previous imitation-learning-based policies. The tree-gate transformer architecture accommodates variable-length candidate sets and captures hierarchical tree context, enabling generalization across diverse MILP structures. TGPPO substantially lowers computational effort (nodes expanded) on easy instances and reduces PDI on challenging ones, enhancing anytime solution quality under compute constraints.

Residual performance gaps relative to expert heuristics such as relpscost indicate the potential benefit of integrating TGPPO with end-to-end solvers, enriched graph-based representations, and advanced reward schemes (e.g., risk-sensitive objectives). A plausible implication is that this approach provides a robust foundation for automated solver design, especially in scenarios with highly heterogeneous or previously unseen MILP distributions (Mhamed et al., 17 Nov 2025).
