APOLLO Optimizer Innovations

Updated 16 November 2025
  • APOLLO denotes a set of modern optimization algorithms for efficient deep learning training and combinatorial optimization, each built on tailored, scalable techniques.
  • The deep learning variants use innovations such as low-rank random projections and diagonal quasi-Newton updates to reduce memory overhead and computational complexity.
  • The MILP variant integrates machine learning predictions with trust-region correction steps, achieving significant gap reductions while preserving feasibility.

The APOLLO Optimizer refers to a set of modern optimization frameworks and algorithms sharing the “APOLLO” designation, advancing state-of-the-art training efficiency, memory scaling, and solution quality for both deep learning and structured combinatorial optimization. Notable APOLLO optimizers include (i) APOLLO for nonconvex stochastic optimization, a diagonal quasi-Newton scheme, and (ii) APOLLO for memory-efficient large-scale LLM pre-training, which achieves near-SGD memory costs while matching AdamW’s convergence. APOLLO also refers to an alternating predictor-corrector framework (Apollo-MILP) for mixed-integer linear programming. Each variant addresses distinct, domain-specific challenges via tailored algorithmic innovations.

1. APOLLO for Memory-Efficient Large-Scale Neural Network Optimization

The “APOLLO” optimizer for LLMs (Zhu et al., 6 Dec 2024) addresses the critical bottleneck of optimizer-state memory when training transformer-based architectures with very large parameter counts. AdamW requires storing full first-moment ($M_t$) and second-moment ($V_t$) estimates per parameter, incurring a $2mn$-float overhead for a parameter matrix $W \in \mathbb{R}^{m \times n}$. APOLLO approximates this element-wise adaptive scaling by tracking first- and second-moment statistics in a much lower-dimensional subspace, leveraging structured channel-wise scaling and random projection for efficiency.

Algorithmic Structure

For a parameter matrix $W \in \mathbb{R}^{m \times n}$ and batch gradient $G_t$, APOLLO executes the following core steps:

  1. Project gradient: $R_t = P_t G_t$, where $P_t \in \mathbb{R}^{r \times m}$ is sampled as a Gaussian random projection with $r \ll m$.
  2. Update AdamW-like low-rank moments: $M^R_t, V^R_t \in \mathbb{R}^{r \times n}$.
  3. Compute channel-wise scaling: for each column $j$, $s^R_{t,j} = \|\hat{R}_t[:,j]\|_2 / \|R_t[:,j]\|_2$, using $\hat{R}_t = M^R_t / (\sqrt{V^R_t} + \epsilon)$.
  4. Apply adaptive update: $W_{t+1} = W_t - \eta \alpha G_t S_t$, where $S_t = \mathrm{diag}(s^R_t)$ and $\alpha \approx \sqrt{n/r}$ corrects the random-projection bias.

A “rank-1” variant, APOLLO-Mini, collapses the projection to $r = 1$, producing a single scaling factor per parameter matrix and reducing the optimizer state to $2n + 2$ floats, matching that of SGD with momentum.
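
The following is a minimal NumPy sketch of one APOLLO update for a single parameter matrix, under the assumptions above (Gaussian random projection, low-rank AdamW-style moments, channel-wise scaling). Function and variable names are illustrative and not taken from the reference implementation.

```python
import numpy as np

def apollo_step(W, G, M_r, V_r, P, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One APOLLO update for a parameter matrix W (m x n) — illustrative sketch.

    P        : (r x m) Gaussian random projection, r << m
    M_r, V_r : (r x n) low-rank first/second moment estimates
    Returns updated (W, M_r, V_r).
    """
    r, _ = P.shape
    n = W.shape[1]

    # 1. Project the gradient into the low-rank subspace.
    R = P @ G                                        # (r x n)

    # 2. AdamW-like moment updates, but only in the subspace.
    M_r = beta1 * M_r + (1 - beta1) * R
    V_r = beta2 * V_r + (1 - beta2) * R**2
    R_hat = M_r / (np.sqrt(V_r) + eps)

    # 3. Channel-wise scaling factors s_j = ||R_hat[:, j]|| / ||R[:, j]||.
    s = np.linalg.norm(R_hat, axis=0) / (np.linalg.norm(R, axis=0) + eps)

    # 4. Scale the raw gradient column-wise; alpha ~ sqrt(n/r) corrects
    #    the random-projection bias (per the description above).
    alpha = np.sqrt(n / r)
    W = W - lr * alpha * G * s[None, :]              # G @ diag(s)
    return W, M_r, V_r
```

With $r = 1$ (APOLLO-Mini), the same structure reduces to a single scaling factor per column, which is where the $2n + 2$-float state figure comes from.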

Memory and Computational Analysis

| Optimizer | Optimizer state size | SVD required | Per-iteration complexity |
|---|---|---|---|
| AdamW | $2mn$ | No | $O(mn)$ |
| GaLore/Fira | $mr + 2nr$ (plus SVD) | Yes | $O(mr) + O(r^3)$ per SVD |
| APOLLO | $2nr + 2$ | No | $O(nr)$ (projection/update) |
| APOLLO-Mini | $2n + 2$ | No | $O(n)$ |

APOLLO eliminates full-matrix moment tracking and periodic SVD costs (as in GaLore or Fira), maintaining computational simplicity and parallelizability.
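
As a concrete illustration of the table, a small helper (hypothetical, assuming 4-byte floats and a single $m \times n$ weight matrix) compares the optimizer-state footprints:

```python
def optimizer_state_floats(m, n, r):
    """Optimizer-state sizes (in floats) for one m x n weight matrix."""
    return {
        "AdamW":       2 * m * n,          # full M_t and V_t
        "GaLore/Fira": m * r + 2 * n * r,  # projection plus low-rank moments
        "APOLLO":      2 * n * r + 2,      # low-rank moments plus scalars
        "APOLLO-Mini": 2 * n + 2,          # rank-1 variant
    }

# Example: a 4096 x 11008 LLaMA-style MLP weight with rank r = 256.
for name, floats in optimizer_state_floats(4096, 11008, 256).items():
    print(f"{name:12s} {floats * 4 / 2**20:8.1f} MiB")
```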

Empirical Performance and System Impact

Across LLaMA pre-training experiments (7B/13B models), APOLLO matches or exceeds AdamW in perplexity, enables 3× larger batch sizes (and corresponding throughput) on the same GPUs, and permits naively distributed (DDP) or single-GPU 7B pre-training under 12GB of memory with INT8 quantization. Performance remains robust even at $r = 1$ (APOLLO-Mini), where the optimizer-state cost approaches zero.

APOLLO’s channel-wise scaling empirically matches or slightly outperforms full element-wise AdamW on language modeling and fine-tuning (e.g., MMLU), even under aggressive memory constraints.

2. APOLLO: Diagonal Quasi-Newton for Stochastic Nonconvex Optimization

A separate “APOLLO” optimizer targets nonconvex stochastic objectives in deep learning (Ma, 2020), introducing a diagonally parameterized quasi-Newton update for scalable curvature adaptation.

Core Update Rule

Classical quasi-Newton algorithms (e.g., BFGS) maintain a dense Hessian approximation, which is infeasible ($O(d^2)$) and potentially indefinite for high-dimensional, nonconvex objectives. APOLLO’s diagonal approximation updates, given by

$$B_{t+1} = B_t + \Lambda, \qquad \Lambda = \frac{s_t^\top y_t - s_t^\top B_t s_t}{\lVert s_t \rVert_4^4} \, \mathrm{Diag}(s_t^2),$$

enforce a “weak” secant condition parameter-wise. Negative or near-zero eigenvalues are addressed using the rectification

$$D_t = \mathrm{rectify}(B_t, \sigma) = \max(|B_t|, \sigma),$$

ensuring positive-definite curvature for all updates.

Full Algorithm

Each iteration involves:

  • Momentum and exponential moving average (EMA) for gradients.
  • Diagonal quasi-Newton update of BtB_t as above.
  • Gradient preconditioning by $D_t^{-1}$.
  • Parameter update $\theta_{t+1} = \theta_t - \eta_{t+1} d_{t+1}$ with $d_{t+1} = D_{t+1}^{-1} m_{t+1}$.

Memory overhead is $O(4d)$, vs. $O(3d)$ for Adam and $O(2d)$ for SGD.
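
A minimal NumPy sketch of one iteration, combining the weak-secant diagonal update and the rectification above; the state layout and names are illustrative, not the paper's reference code.

```python
import numpy as np

def apollo_qn_step(theta, grad, state, lr=1e-3, beta=0.9, eps=1e-4, sigma=0.01):
    """One diagonal quasi-Newton APOLLO step (sketch).

    state holds: "m" (gradient EMA), "B" (diagonal Hessian approximation,
    stored as a 1-D array), "prev_d" (previous update direction, init zeros).
    """
    m_prev = state["m"]
    # Momentum / EMA of the gradient.
    m = beta * m_prev + (1 - beta) * grad

    # Weak secant update of the diagonal Hessian approximation B.
    s = -lr * state["prev_d"]                       # parameter displacement s_t
    y = m - m_prev                                  # change in smoothed gradient
    coeff = (s @ y - s @ (state["B"] * s)) / (np.sum(s**4) + eps)
    B = state["B"] + coeff * s**2                   # Lambda = coeff * Diag(s_t^2)

    # Rectify to a positive-definite diagonal preconditioner.
    D = np.maximum(np.abs(B), sigma)

    # Preconditioned direction and parameter update.
    d = m / D
    theta = theta - lr * d

    state.update(m=m, B=B, prev_d=d)
    return theta, state
```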

Convergence and Generalization

In convex online settings, APOLLO achieves $O(\sqrt{T})$ regret, matching adaptive optimizers such as Adam and RMSProp. In nonconvex stochastic optimization, APOLLO attains expected gradient norm minimization rates of $O(\log T / \sqrt{T})$, also matching Adam-type algorithms. The scale hyperparameters $\eta$ and $\sigma$ are coupled, simplifying tuning.

Across CIFAR-10 (ResNet-110), ImageNet (ResNeXt-50), One Billion Word (2-layer LSTM), and WMT’14 En→De (Transformer), APOLLO delivers fast convergence and competitive or superior test accuracy, test perplexity, or BLEU versus SGD, Adam, RAdam, AdaBelief, and AdaHessian.

3. APOLLO-MILP: Alternating Prediction-Correction for Mixed-Integer Linear Programming

In mixed-integer linear programming (MILP), the APOLLO framework (Liu et al., 3 Mar 2025) refers to Apollo-MILP, which combines machine learning–driven prediction with combinatorial correction steps, iteratively reducing problem size while preserving solution quality and feasibility.

Predict-and-Correct Workflow

At each iteration $k$, maintain the reduced instance $\mathcal{I}^{(k)}$ with a subset of variables $P^{(k)}$ fixed:

  1. Prediction: Use a GNN-based predictor $p_\theta$ to compute marginal probabilities for the unfixed variables. Extract a partial solution $\hat{x}^{(k)}[P]$ by thresholding the $k_1$ largest and $k_0$ smallest marginals.
  2. Correction: Solve a trust-region sub-MILP around $\hat{x}^{(k)}[P]$,

$$\min_x \; c^\top x \quad \text{s.t.} \quad Ax \leq b,\ \ell \leq x \leq u,\ \|x[P] - \hat{x}[P]\|_1 \leq \Delta,$$

to obtain a reference solution $\tilde{x}^{(k)}$.
  3. Fixing/Reduction: Fix the indices where $\hat{x}_i = \tilde{x}_i$ (high confidence), then update $P^{(k+1)}$, further reducing $\mathcal{I}^{(k+1)}$.

This cycle continues until all integer variables are fixed or resource limits are met.
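
The loop below is a schematic Python sketch of this predict-and-correct cycle. Here `predict_marginals` and `solve_trust_region_milp` are hypothetical callables standing in for the GNN predictor and the underlying MILP solver, and `instance.int_vars` is an assumed interface, not an actual library API.

```python
def apollo_milp_loop(instance, predict_marginals, solve_trust_region_milp,
                     max_rounds=5, k1=100, k0=100, delta=10):
    """Schematic Apollo-MILP predict-and-correct loop (sketch).

    predict_marginals(instance, fixed)             -> {var index: P(x_i = 1)}
    solve_trust_region_milp(instance, fixed,
                            x_hat, delta)          -> {var index: value}
    """
    fixed = {}                                      # variable index -> fixed value
    for _ in range(max_rounds):
        free = [i for i in instance.int_vars if i not in fixed]
        if not free:
            break                                   # all integer variables fixed

        # 1. Prediction: marginals for unfixed variables, thresholded into a
        #    partial assignment x_hat (k1 most-likely ones, k0 most-likely zeros).
        probs = predict_marginals(instance, fixed)
        by_prob = sorted(free, key=lambda i: probs[i])
        x_hat = {i: 0 for i in by_prob[:k0]}
        x_hat.update({i: 1 for i in by_prob[-k1:]})

        # 2. Correction: sub-MILP with the trust-region constraint
        #    ||x[P] - x_hat[P]||_1 <= delta around the prediction.
        x_tilde = solve_trust_region_milp(instance, fixed, x_hat, delta)

        # 3. Fixing/Reduction: fix only variables where prediction and
        #    correction agree, shrinking the instance for the next round.
        fixed.update({i: v for i, v in x_hat.items() if x_tilde[i] == v})

    return fixed
```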

Uncertainty-Based Error Upper Bound (UEBO)

The uncertainty of predictions is evaluated through an upper bound on the (intractable) KL divergence between predicted and true marginals:

$$D_{\mathrm{KL}}\bigl(p_\theta(x_i \mid \mathcal{I}) \,\|\, q(x_i \mid \mathcal{I})\bigr) \leq H\bigl(p_\theta(x_i \mid \mathcal{I})\bigr) + d(p_\theta, q),$$

where $H(\cdot)$ is entropy and $d(\cdot, \cdot)$ is the prediction-correction discrepancy. In practice, single-point prediction-correction consistency $\mathbb{I}[\hat{x}_i = \tilde{x}_i]$ is used to guide variable fixing.
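
A small illustrative helper for the entropy term and the practical agreement test; the discrepancy term $d(\cdot, \cdot)$ is left abstract, as in the text, and the function names are hypothetical.

```python
import numpy as np

def prediction_entropy(p):
    """Binary entropy H(p_theta(x_i)) of a predicted marginal p in (0, 1)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def consistent(x_hat_i, x_tilde_i):
    """Practical surrogate for low UEBO: fix variable i only when the
    prediction and the trust-region correction agree."""
    return x_hat_i == x_tilde_i
```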

Empirical Results and Guarantees

On combinatorial auctions, set covering, item placement, and workload apportionment benchmarks, Apollo-MILP achieves up to a 77.8% reduction in absolute gap versus Gurobi and 70.2% versus SCIP, and closes 100% of the gap on hard real-world datasets. The algorithm maintains feasibility at each reduction step and converges to a lower primal gap faster than all tested baselines.

4. Comparison of Methodologies and Domains

| Variant | Domain | Main Technique | Core Memory/Complexity Benefit |
|---|---|---|---|
| APOLLO (LLM) | Deep learning | Channel/tensor-wise scaling via low-rank random projection | $O(nr)$ or $O(n)$ optimizer state; no SVD |
| APOLLO (quasi-Newton) | Deep learning | Diagonal quasi-Newton with rectified update | $O(4d)$ state; diagonal curvature adaptation |
| APOLLO-MILP | MILP/combinatorial | Alternating ML prediction and trust-region correction | Problem-size reduction while preserving feasibility |

While all variants address scaling—memory or combinatorial dimension—they are specialized for different mathematical problem classes: APOLLO for LLMs prioritizes memory-optimal adaptive learning; APOLLO (quasi-Newton) targets curvature adaptation in stochastic optimization; APOLLO-MILP applies prediction-correction with uncertainty quantification for MILP.

5. Practical Implications and System-Level Benefits

The memory and computational gains of APOLLO (LLM) have significant practical consequences:

  • Training of foundation models (e.g., LLaMA-13B) becomes feasible on mainstream (A100-80GB) hardware without aggressive model sharding or optimizer state offloading.
  • On low-end GPUs, pre-training models up to 7B parameters with INT8 quantization is achieved using a single 12GB card.
  • Batch size scaling is unlocked: APOLLO enables up to 3× throughput versus AdamW, owing to optimizer-state reduction and the removal of the SVD-induced stalls seen in GaLore/Fira.

APOLLO-MILP’s predictor-corrector logic yields substantial problem-size reductions and improved solution quality without risking infeasibility—a critical property for industrial and scientific MILP deployments.

6. Limitations, Compatibility, and Hyperparameter Tuning

  • APOLLO (LLM) requires choosing the subspace rank $r$; a low $r$ (APOLLO-Mini) suffices for most settings, but extremely aggressive compression may trade off some adaptivity.
  • No hyper-parameter tuning beyond AdamW defaults is generally required; APOLLO operates robustly with $\alpha = 1$ or $\alpha = \sqrt{n/r}$.
  • The framework is compatible with activation checkpointing, ZeRO, quantization, and arbitrary large-batch distributed training.
  • APOLLO-MILP presupposes access to high-performance MILP solvers for trust-region correction and requires ML predictor pretraining.

In summary, the APOLLO optimizer family provides complementary, scalable approaches to training efficiency and high-quality solution finding in both deep neural learning and combinatorial domains, anchoring their distinctive advantages on principled memory reduction, statistical uncertainty, and scalable correction mechanisms (Zhu et al., 6 Dec 2024, Ma, 2020, Liu et al., 3 Mar 2025).
