APOLLO Optimizer Innovations

Updated 16 November 2025
  • APOLLO denotes a set of modern optimization algorithms for efficient deep learning training and combinatorial optimization, each built on tailored, scalable techniques.
  • The deep learning variants use innovations such as low-rank random projections and diagonal quasi-Newton updates to reduce memory overhead and computational complexity.
  • The MILP variant integrates machine learning predictions with trust-region correction steps, achieving significant gap reductions while preserving feasibility.

The APOLLO Optimizer refers to a set of modern optimization frameworks and algorithms sharing the “APOLLO” designation, advancing state-of-the-art training efficiency, memory scaling, and solution quality for both deep learning and structured combinatorial optimization. Notable APOLLO optimizers include (i) APOLLO for nonconvex stochastic optimization, a diagonal quasi-Newton scheme, and (ii) APOLLO for memory-efficient large-scale LLM pre-training, which achieves near-SGD memory costs while matching AdamW’s convergence. APOLLO also refers to an alternating predictor-corrector framework (Apollo-MILP) for mixed-integer linear programming. Each variant addresses distinct, domain-specific challenges via tailored algorithmic innovations.

1. APOLLO for Memory-Efficient Large-Scale Neural Network Optimization

The “APOLLO” optimizer for LLMs (Zhu et al., 6 Dec 2024) addresses the critical bottleneck of optimizer-state memory when training transformer-based architectures with very large parameter counts. AdamW requires storing full first-moment ($M_t$) and second-moment ($V_t$) estimates per parameter, incurring a $2mn$-float overhead for a parameter matrix $W \in \mathbb{R}^{m \times n}$. APOLLO approximates this element-wise adaptive scaling by tracking first- and second-moment statistics in a much lower-dimensional subspace, leveraging structured channel-wise scaling and random projection for efficiency.

Algorithmic Structure

For a parameter matrix $W \in \mathbb{R}^{m \times n}$ and batch gradient $G_t$, APOLLO executes the following core steps:

  1. Project gradient: $R_t = P_t G_t$, where $P_t \in \mathbb{R}^{r \times m}$ is sampled as a Gaussian random projection with $r \ll m$.
  2. Update AdamW-like low-rank moments: $M^R_t, V^R_t \in \mathbb{R}^{r \times n}$.
  3. Compute channel-wise scaling: for each column $j$, $s^R_{t,j} = \|\hat{R}_t[:,j]\|_2 / \|R_t[:,j]\|_2$, using $\hat{R}_t = M^R_t / (\sqrt{V^R_t} + \epsilon)$.
  4. Apply adaptive update: $W_{t+1} = W_t - \eta \alpha G_t S_t$, where $S_t = \mathrm{diag}(s^R_t)$ and $\alpha \approx \sqrt{n/r}$ corrects the random-projection bias.

A “rank-1” variant, APOLLO-Mini, collapses the projection to $r = 1$, producing a single scaling factor per parameter matrix and reducing the optimizer state to $2n + 2$ floats, matching that of SGD with momentum.
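
The following is a minimal NumPy sketch of one APOLLO update for a single parameter matrix, under the assumptions above (Gaussian random projection, low-rank AdamW-style moments, channel-wise scaling). Function and variable names are illustrative and not taken from the reference implementation.

```python
import numpy as np

def apollo_step(W, G, M_r, V_r, P, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One APOLLO update for a parameter matrix W (m x n) — illustrative sketch.

    P        : (r x m) Gaussian random projection, r << m
    M_r, V_r : (r x n) low-rank first/second moment estimates
    Returns updated (W, M_r, V_r).
    """
    r, _ = P.shape
    n = W.shape[1]

    # 1. Project the gradient into the low-rank subspace.
    R = P @ G                                        # (r x n)

    # 2. AdamW-like moment updates, but only in the subspace.
    M_r = beta1 * M_r + (1 - beta1) * R
    V_r = beta2 * V_r + (1 - beta2) * R**2
    R_hat = M_r / (np.sqrt(V_r) + eps)

    # 3. Channel-wise scaling factors s_j = ||R_hat[:, j]|| / ||R[:, j]||.
    s = np.linalg.norm(R_hat, axis=0) / (np.linalg.norm(R, axis=0) + eps)

    # 4. Scale the raw gradient column-wise; alpha ~ sqrt(n/r) corrects
    #    the random-projection bias (per the description above).
    alpha = np.sqrt(n / r)
    W = W - lr * alpha * G * s[None, :]              # G @ diag(s)
    return W, M_r, V_r
```

With $r = 1$ (APOLLO-Mini), the same structure reduces to a single scaling factor per column, which is where the $2n + 2$-float state figure comes from.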

Memory and Computational Analysis

| Optimizer | Optimizer state size | SVD required | Per-iteration complexity |
|---|---|---|---|
| AdamW | $2mn$ | No | $O(mn)$ |
| GaLore/Fira | $mr + 2nr$ (plus SVD) | Yes | $O(mr) + O(r^3)$ per SVD |
| APOLLO | $2nr + 2$ | No | $O(nr)$ (projection/update) |
| APOLLO-Mini | $2n + 2$ | No | $O(n)$ |

APOLLO eliminates full-matrix moment tracking and periodic SVD costs (as in GaLore or Fira), maintaining computational simplicity and parallelizability.
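
As a concrete illustration of the table, a small helper (hypothetical, assuming 4-byte floats and a single $m \times n$ weight matrix) compares the optimizer-state footprints:

```python
def optimizer_state_floats(m, n, r):
    """Optimizer-state sizes (in floats) for one m x n weight matrix."""
    return {
        "AdamW":       2 * m * n,          # full M_t and V_t
        "GaLore/Fira": m * r + 2 * n * r,  # projection plus low-rank moments
        "APOLLO":      2 * n * r + 2,      # low-rank moments plus scalars
        "APOLLO-Mini": 2 * n + 2,          # rank-1 variant
    }

# Example: a 4096 x 11008 LLaMA-style MLP weight with rank r = 256.
for name, floats in optimizer_state_floats(4096, 11008, 256).items():
    print(f"{name:12s} {floats * 4 / 2**20:8.1f} MiB")
```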

Empirical Performance and System Impact

Across LLaMA pre-training experiments (7B/13B models), APOLLO matches or exceeds AdamW in perplexity, enables 3× larger batch sizes (and corresponding throughput) on the same GPUs, and permits naively distributed (DDP) or single-GPU 7B pre-training under 12GB of memory with INT8 quantization. Performance remains robust even at $r = 1$ (APOLLO-Mini), where the optimizer-state cost approaches zero.

APOLLO’s channel-wise scaling empirically matches or slightly outperforms full element-wise AdamW on language modeling and fine-tuning (e.g., MMLU), even under aggressive memory constraints.

2. APOLLO: Diagonal Quasi-Newton for Stochastic Nonconvex Optimization

A separate “APOLLO” optimizer targets nonconvex stochastic objectives in deep learning (Ma, 2020), introducing a diagonally parameterized quasi-Newton update for scalable curvature adaptation.

Core Update Rule

Classical quasi-Newton algorithms (e.g., BFGS) maintain a dense Hessian approximation, which is infeasible ($O(d^2)$) and potentially indefinite for high-dimensional, nonconvex objectives. APOLLO’s diagonal approximation updates, given by

$$B_{t+1} = B_t + \Lambda, \qquad \Lambda = \frac{s_t^\top y_t - s_t^\top B_t s_t}{\lVert s_t \rVert_4^4} \, \mathrm{Diag}(s_t^2),$$

enforce a “weak” secant condition parameter-wise. Negative or near-zero eigenvalues are addressed using the rectification

$$D_t = \mathrm{rectify}(B_t, \sigma) = \max(|B_t|, \sigma),$$

ensuring positive-definite curvature for all updates.

Full Algorithm

Each iteration involves:

  • Momentum and exponential moving average (EMA) for gradients.
  • Diagonal quasi-Newton update of BtB_t as above.
  • Gradient preconditioning by $D_t^{-1}$.
  • Parameter update $\theta_{t+1} = \theta_t - \eta_{t+1} d_{t+1}$ with $d_{t+1} = D_{t+1}^{-1} m_{t+1}$.

Memory overhead is $O(4d)$, vs. $O(3d)$ for Adam and $O(2d)$ for SGD.
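
A minimal NumPy sketch of one iteration, combining the weak-secant diagonal update and the rectification above; the state layout and names are illustrative, not the paper's reference code.

```python
import numpy as np

def apollo_qn_step(theta, grad, state, lr=1e-3, beta=0.9, eps=1e-4, sigma=0.01):
    """One diagonal quasi-Newton APOLLO step (sketch).

    state holds: "m" (gradient EMA), "B" (diagonal Hessian approximation,
    stored as a 1-D array), "prev_d" (previous update direction, init zeros).
    """
    m_prev = state["m"]
    # Momentum / EMA of the gradient.
    m = beta * m_prev + (1 - beta) * grad

    # Weak secant update of the diagonal Hessian approximation B.
    s = -lr * state["prev_d"]                       # parameter displacement s_t
    y = m - m_prev                                  # change in smoothed gradient
    coeff = (s @ y - s @ (state["B"] * s)) / (np.sum(s**4) + eps)
    B = state["B"] + coeff * s**2                   # Lambda = coeff * Diag(s_t^2)

    # Rectify to a positive-definite diagonal preconditioner.
    D = np.maximum(np.abs(B), sigma)

    # Preconditioned direction and parameter update.
    d = m / D
    theta = theta - lr * d

    state.update(m=m, B=B, prev_d=d)
    return theta, state
```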

Convergence and Generalization

In convex online settings, APOLLO achieves $O(\sqrt{T})$ regret, matching adaptive optimizers such as Adam and RMSProp. In nonconvex stochastic optimization, APOLLO attains expected gradient norm minimization rates of $O(\log T / \sqrt{T})$, also matching Adam-type algorithms. The scale hyperparameters $\eta$ and $\sigma$ are coupled, simplifying tuning.

Across CIFAR-10 (ResNet-110), ImageNet (ResNeXt-50), One Billion Word (2-layer LSTM), and WMT’14 En→De (Transformer), APOLLO delivers fast convergence and competitive or superior test accuracy, test perplexity, or BLEU versus SGD, Adam, RAdam, AdaBelief, and AdaHessian.

3. APOLLO-MILP: Alternating Prediction-Correction for Mixed-Integer Linear Programming

In mixed-integer linear programming (MILP), the APOLLO framework (Liu et al., 3 Mar 2025) refers to Apollo-MILP, which combines machine learning–driven prediction with combinatorial correction steps, iteratively reducing problem size while preserving solution quality and feasibility.

Predict-and-Correct Workflow

At each iteration $k$, maintain the reduced instance $\mathcal{I}^{(k)}$ with a subset of variables $P^{(k)}$ fixed:

  1. Prediction: Use a GNN-based predictor $p_\theta$ to compute marginal probabilities for the unfixed variables. Extract a partial solution $\hat{x}^{(k)}[P]$ by thresholding the $k_1$ largest and $k_0$ smallest marginals.
  2. Correction: Solve a trust-region sub-MILP around $\hat{x}^{(k)}[P]$,

$$\min_x \; c^\top x \quad \text{s.t.} \quad Ax \leq b,\ \ell \leq x \leq u,\ \|x[P] - \hat{x}[P]\|_1 \leq \Delta,$$

to obtain a reference solution $\tilde{x}^{(k)}$.
  3. Fixing/Reduction: Fix the indices where $\hat{x}_i = \tilde{x}_i$ (high confidence), then update $P^{(k+1)}$, further reducing $\mathcal{I}^{(k+1)}$.

This cycle continues until all integer variables are fixed or resource limits are met.
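
The loop below is a schematic Python sketch of this predict-and-correct cycle. Here `predict_marginals` and `solve_trust_region_milp` are hypothetical callables standing in for the GNN predictor and the underlying MILP solver, and `instance.int_vars` is an assumed interface, not an actual library API.

```python
def apollo_milp_loop(instance, predict_marginals, solve_trust_region_milp,
                     max_rounds=5, k1=100, k0=100, delta=10):
    """Schematic Apollo-MILP predict-and-correct loop (sketch).

    predict_marginals(instance, fixed)             -> {var index: P(x_i = 1)}
    solve_trust_region_milp(instance, fixed,
                            x_hat, delta)          -> {var index: value}
    """
    fixed = {}                                      # variable index -> fixed value
    for _ in range(max_rounds):
        free = [i for i in instance.int_vars if i not in fixed]
        if not free:
            break                                   # all integer variables fixed

        # 1. Prediction: marginals for unfixed variables, thresholded into a
        #    partial assignment x_hat (k1 most-likely ones, k0 most-likely zeros).
        probs = predict_marginals(instance, fixed)
        by_prob = sorted(free, key=lambda i: probs[i])
        x_hat = {i: 0 for i in by_prob[:k0]}
        x_hat.update({i: 1 for i in by_prob[-k1:]})

        # 2. Correction: sub-MILP with the trust-region constraint
        #    ||x[P] - x_hat[P]||_1 <= delta around the prediction.
        x_tilde = solve_trust_region_milp(instance, fixed, x_hat, delta)

        # 3. Fixing/Reduction: fix only variables where prediction and
        #    correction agree, shrinking the instance for the next round.
        fixed.update({i: v for i, v in x_hat.items() if x_tilde[i] == v})

    return fixed
```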

Uncertainty-Based Error Upper Bound (UEBO)

The uncertainty of predictions is evaluated through an upper bound on the (intractable) KL divergence between predicted and true marginals:

$$D_{\mathrm{KL}}\bigl(p_\theta(x_i \mid \mathcal{I}) \,\|\, q(x_i \mid \mathcal{I})\bigr) \leq H\bigl(p_\theta(x_i \mid \mathcal{I})\bigr) + d(p_\theta, q),$$

where $H(\cdot)$ is entropy and $d(\cdot, \cdot)$ is the prediction-correction discrepancy. In practice, single-point prediction-correction consistency $\mathbb{I}[\hat{x}_i = \tilde{x}_i]$ is used to guide variable fixing.
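
A small illustrative helper for the entropy term and the practical agreement test; the discrepancy term $d(\cdot, \cdot)$ is left abstract, as in the text, and the function names are hypothetical.

```python
import numpy as np

def prediction_entropy(p):
    """Binary entropy H(p_theta(x_i)) of a predicted marginal p in (0, 1)."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def consistent(x_hat_i, x_tilde_i):
    """Practical surrogate for low UEBO: fix variable i only when the
    prediction and the trust-region correction agree."""
    return x_hat_i == x_tilde_i
```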

Empirical Results and Guarantees

On combinatorial auctions, set covering, item placement, and workload apportionment benchmarks, Apollo-MILP achieves up to a 77.8% reduction in absolute gap versus Gurobi and 70.2% versus SCIP, and closes 100% of the gap on hard real-world datasets. The algorithm maintains feasibility at each reduction step and converges to a lower primal gap faster than all tested baselines.

4. Comparison of Methodologies and Domains

| Variant | Domain | Main Technique | Core Memory/Complexity Benefit |
|---|---|---|---|
| APOLLO (LLM) | Deep learning | Channel/tensor-wise scaling via low-rank random projection | $O(nr)$ or $O(n)$ optimizer state; no SVD |
| APOLLO (quasi-Newton) | Deep learning | Diagonal quasi-Newton with rectified update | $O(4d)$ state; diagonal curvature adaptation |
| APOLLO-MILP | MILP/combinatorial | Alternating ML prediction and trust-region correction | Problem-size reduction while preserving feasibility |

While all variants address scaling—memory or combinatorial dimension—they are specialized for different mathematical problem classes: APOLLO for LLMs prioritizes memory-optimal adaptive learning; APOLLO (quasi-Newton) targets curvature adaptation in stochastic optimization; APOLLO-MILP applies prediction-correction with uncertainty quantification for MILP.

5. Practical Implications and System-Level Benefits

The memory and computational gains of APOLLO (LLM) have significant practical consequences:

  • Training of foundation models (e.g., LLaMA-13B) becomes feasible on mainstream (A100-80GB) hardware without aggressive model sharding or optimizer state offloading.
  • On low-end GPUs, pre-training models up to 7B parameters with INT8 quantization is achieved using a single 12GB card.
  • Batch size scaling is unlocked: APOLLO enables up to 3× throughput versus AdamW, owing to optimizer-state reduction and the removal of the SVD-induced stalls seen in GaLore/Fira.

APOLLO-MILP’s predictor-corrector logic yields substantial problem-size reductions and improved solution quality without risking infeasibility—a critical property for industrial and scientific MILP deployments.

6. Limitations, Compatibility, and Hyperparameter Tuning

  • APOLLO (LLM) requires choosing the subspace rank $r$; a low $r$ (APOLLO-Mini) suffices for most settings, but extremely aggressive compression may trade off some adaptivity.
  • No hyper-parameter tuning beyond AdamW defaults is generally required; APOLLO operates robustly with $\alpha = 1$ or $\alpha = \sqrt{n/r}$.
  • The framework is compatible with activation checkpointing, ZeRO, quantization, and arbitrary large-batch distributed training.
  • APOLLO-MILP presupposes access to high-performance MILP solvers for trust-region correction and requires ML predictor pretraining.

In summary, the APOLLO optimizer family provides complementary, scalable approaches to training efficiency and high-quality solution finding in both deep neural learning and combinatorial domains, anchoring their distinctive advantages on principled memory reduction, statistical uncertainty, and scalable correction mechanisms (Zhu et al., 6 Dec 2024, Ma, 2020, Liu et al., 3 Mar 2025).
