APOLLO Optimizer Innovations
- APOLLO Optimizer is a suite of modern algorithms designed for efficient deep learning training and combinatorial optimization via tailored, scalable techniques.
- It employs innovations such as low-rank random projections and diagonal quasi-Newton updates to reduce memory overhead and computational complexity.
- The MILP variant integrates machine learning predictions with trust-region correction steps, achieving significant gap reductions while preserving feasibility.
The APOLLO Optimizer refers to a set of modern optimization frameworks and algorithms sharing the “APOLLO” designation, advancing state-of-the-art training efficiency, memory scaling, and solution quality for both deep learning and structured combinatorial optimization. Notable variants include (i) APOLLO for nonconvex stochastic optimization, a diagonal quasi-Newton scheme; (ii) APOLLO for memory-efficient large-scale LLM pre-training, which achieves near-SGD memory costs while matching AdamW’s convergence; and (iii) Apollo-MILP, an alternating predictor-corrector framework for mixed-integer linear programming. Each variant addresses distinct, domain-specific challenges via tailored algorithmic innovations.
1. APOLLO for Memory-Efficient Large-Scale Neural Network Optimization
The “APOLLO” optimizer for LLMs (Zhu et al., 6 Dec 2024) addresses the critical bottleneck of optimizer-state memory when training transformer-based architectures with very large parameter counts. AdamW requires storage of full first ($M_t$) and second moment ($V_t$) estimates per parameter, incurring a $2mn$-float overhead for each parameter matrix $W \in \mathbb{R}^{m \times n}$. APOLLO approximates such element-wise adaptive scaling by tracking first- and second-moment statistics in a much lower-dimensional subspace, leveraging structured channel-wise scaling and random projection for efficiency.
Algorithmic Structure
For parameter matrix $W \in \mathbb{R}^{m \times n}$ and batch gradient $G_t \in \mathbb{R}^{m \times n}$, APOLLO executes the following core steps:
- Project gradient: $R_t = P G_t$, with $P \in \mathbb{R}^{r \times m}$ ($r \ll m$) sampled as a Gaussian random projection, $P_{ij} \sim \mathcal{N}(0, 1/r)$.
- Update AdamW-like low-rank moments: $M_t = \beta_1 M_{t-1} + (1-\beta_1) R_t$ and $V_t = \beta_2 V_{t-1} + (1-\beta_2) R_t^{\odot 2}$, yielding the subspace update $\tilde{R}_t = \hat{M}_t / (\sqrt{\hat{V}_t} + \epsilon)$ after bias correction.
- Compute channel-wise scaling: for each channel (column) $j$, $s_j = \|\tilde{R}_t[:,j]\|_2 \,/\, \|R_t[:,j]\|_2$.
- Apply adaptive update: $W_{t+1} = W_t - \eta\, G_t\, \mathrm{diag}(s_1,\dots,s_n)$, where a norm-correction factor on the order of $\sqrt{n/r}$ corrects random-projection bias. A minimal sketch of one such step follows this list.
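The following NumPy sketch implements these four steps for a single weight matrix. It is a minimal illustration under stated assumptions: the state layout, the fixed (never-resampled) projection, and folding the norm-correction factor into the learning rate are implementation choices made here, not details taken from the reference code.

```python
import numpy as np

def apollo_step(W, G, state, lr=1e-3, beta1=0.9, beta2=0.999,
                eps=1e-8, rank=4):
    """One APOLLO-style update for a weight matrix W (m x n)."""
    m, n = W.shape
    if "P" not in state:
        # Fixed Gaussian random projection; real implementations may
        # resample it periodically.
        state["P"] = np.random.randn(rank, m) / np.sqrt(rank)
        state["M"] = np.zeros((rank, n))
        state["V"] = np.zeros((rank, n))
        state["t"] = 0
    state["t"] += 1
    t, P = state["t"], state["P"]

    R = P @ G                                    # r x n projected gradient
    state["M"] = beta1 * state["M"] + (1 - beta1) * R
    state["V"] = beta2 * state["V"] + (1 - beta2) * R**2
    M_hat = state["M"] / (1 - beta1**t)          # bias-corrected moments
    V_hat = state["V"] / (1 - beta2**t)
    R_tilde = M_hat / (np.sqrt(V_hat) + eps)     # AdamW-style subspace update

    # One scaling factor per channel (column); the sqrt(n/r)-style
    # norm correction for the random projection is folded into lr here.
    s = np.linalg.norm(R_tilde, axis=0) / (np.linalg.norm(R, axis=0) + eps)
    return W - lr * G * s[None, :]
```

Setting `rank=1` collapses the state to two length-$n$ vectors plus the projection, which is the APOLLO-Mini regime described next.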
A “rank-1” variant, APOLLO-Mini, collapses the projection to rank $r = 1$, producing a single scaling factor per parameter matrix and reducing the optimizer state to $2n+2$ floats, matching that of SGD with momentum.
Memory and Computational Analysis
| Optimizer | Optimizer state size (floats) | SVD required | Per-iteration complexity |
|---|---|---|---|
| AdamW | $2mn$ | No | $O(mn)$ |
| GaLore/Fira | $mr + 2nr$ (plus periodic SVD) | Yes | $O(mn \min(m,n))$ per SVD |
| APOLLO | $2nr + 2$ | No | $O(mnr)$ (projection/update) |
| APOLLO-Mini | $2n + 2$ | No | $O(mn)$ |
APOLLO eliminates full-matrix moment tracking and periodic SVD costs (as in GaLore or Fira), maintaining computational simplicity and parallelizability.
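To make the accounting concrete, here is a back-of-the-envelope comparison for a single LLaMA-style projection matrix, using the state-size formulas from the table above; the shape (4096 × 11008) and the GaLore rank are illustrative choices, not values taken from the paper.

```python
# Optimizer-state accounting for one LLaMA-style weight matrix
# (fp32 floats; shapes and rank are illustrative examples).
m, n, r = 4096, 11008, 256          # matrix dims and an assumed GaLore-style rank

adamw  = 2 * m * n                  # full first + second moments
galore = m * r + 2 * r * n          # projection matrix + low-rank moments
apollo = 2 * n * r + 2              # low-rank moments only (per table above)
mini   = 2 * n + 2                  # rank-1 variant

for name, floats in [("AdamW", adamw), ("GaLore", galore),
                     ("APOLLO", apollo), ("APOLLO-Mini", mini)]:
    print(f"{name:12s} {floats * 4 / 2**20:10.2f} MiB")
```

At fp32, AdamW’s two full moment matrices dominate (roughly 344 MiB for this one matrix), while APOLLO-Mini’s state is effectively negligible.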
Empirical Performance and System Impact
Across LLaMA pre-training experiments (7B/13B models), APOLLO matches or exceeds AdamW in perplexity, enables 3× larger batch sizes (and corresponding throughput gains) on the same GPUs, and permits naive distributed (DDP) or single-GPU 7B pre-training under 12GB of memory with INT8 quantization. Performance remains robust even at rank $r=1$ (APOLLO-Mini), where the optimizer-state cost approaches zero.
APOLLO’s channel-wise scaling empirically matches or slightly outperforms full element-wise AdamW on language modeling and fine-tuning (e.g., MMLU), even under aggressive memory constraints.
2. APOLLO: Diagonal Quasi-Newton for Stochastic Nonconvex Optimization
A separate “APOLLO” optimizer targets nonconvex stochastic objectives in deep learning (Ma, 2020), introducing a diagonally parameterized quasi-Newton update for scalable curvature adaptation.
Core Update Rule
Classical quasi-Newton algorithms (e.g., BFGS) maintain a dense Hessian approximation, which is infeasible ($O(d^2)$ memory for $d$ parameters) and potentially indefinite for high-dimensional, nonconvex objectives. APOLLO instead maintains a diagonal approximation $B_t$, updated via
$$B_{t} = B_{t-1} + \frac{s_t^{\top} y_t - s_t^{\top} B_{t-1} s_t}{\|s_t\|_4^4}\, \mathrm{Diag}(s_t \odot s_t),$$
where $s_t$ is the parameter difference and $y_t$ the (EMA-smoothed) gradient difference; this enforces a “weak” secant condition $s_t^{\top} B_t s_t = s_t^{\top} y_t$ parameter-wise. Negative or near-zero eigenvalues are addressed using the rectification
$$D_t = \mathrm{rectify}(B_t, \sigma) = \max(|B_t|, \sigma),$$
ensuring positive-definite curvature for all updates.
Full Algorithm
Each iteration involves:
- Momentum and exponential moving average (EMA) for gradients, $m_t = \beta m_{t-1} + (1-\beta) g_t$.
- Diagonal quasi-Newton update of $B_t$ as above.
- Gradient preconditioning by $D_t^{-1}$.
- Parameter update $\theta_{t+1} = \theta_t - \eta_t D_t^{-1} m_t$ (see the sketch after this list).
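A minimal NumPy sketch of one such iteration, following the weak-secant update and rectification reconstructed above; bias corrections and the paper’s exact learning-rate/$\sigma$ coupling are omitted, so treat this as an illustration rather than the reference algorithm.

```python
import numpy as np

def apollo_qn_step(theta, grad, state, lr=1e-2, beta=0.9,
                   sigma=1e-2, eps=1e-12):
    """One diagonal quasi-Newton step in the spirit of APOLLO (Ma, 2020)."""
    if "m" not in state:
        state["m"] = np.zeros_like(theta)        # EMA of gradients
        state["B"] = np.zeros_like(theta)        # diagonal Hessian approx.
        state["prev_theta"] = theta.copy()

    m_prev = state["m"].copy()
    state["m"] = beta * state["m"] + (1 - beta) * grad

    s = theta - state["prev_theta"]              # parameter difference
    y = state["m"] - m_prev                      # EMA-gradient difference
    # Weak secant condition s^T B s = s^T y, solved on the diagonal.
    alpha = (s @ y - s @ (state["B"] * s)) / (np.sum(s**4) + eps)
    state["B"] = state["B"] + alpha * s**2

    D = np.maximum(np.abs(state["B"]), sigma)    # rectification: positive definite
    state["prev_theta"] = theta.copy()
    return theta - lr * state["m"] / D           # preconditioned update
```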
Memory overhead is $3d$ floats for $d$ parameters (gradient EMA, diagonal $B_t$, and the previous iterate), vs. $2d$ for Adam and $d$ for SGD with momentum.
Convergence and Generalization
In convex online settings, APOLLO achieves $O(\sqrt{T})$ regret, matching adaptive optimizers such as Adam and RMSProp. In nonconvex stochastic optimization, APOLLO attains an expected gradient-norm convergence rate of $O(1/\sqrt{T})$, also matching Adam-type algorithms. The learning rate $\eta$ and rectification parameter $\sigma$ are coupled, simplifying tuning.
Across CIFAR-10 (ResNet-110), ImageNet (ResNeXt-50), One Billion Word (2-layer LSTM), and WMT’14 En→De (Transformer), APOLLO delivers fast convergence and competitive or superior test accuracy, test perplexity, or BLEU versus SGD, Adam, RAdam, AdaBelief, and AdaHessian.
3. APOLLO-MILP: Alternating Prediction-Correction for Mixed-Integer Linear Programming
In mixed-integer linear programming (MILP), the APOLLO framework (Liu et al., 3 Mar 2025) refers to Apollo-MILP, which combines machine learning–driven prediction with combinatorial correction steps, iteratively reducing problem size while preserving solution quality and feasibility.
Predict-and-Correct Workflow
At each iteration $k$, maintain a reduced instance $\mathcal{P}^k$ with a subset of variables fixed:
- Prediction: Use a GNN-based predictor to compute marginal probabilities $p_i$ for the unfixed variables. Extract a partial solution $\hat{x}^k$ by thresholding the variables with the largest and smallest predicted probabilities.
- Correction: Solve a trust-region sub-MILP around $\hat{x}^k$,
$$\min_x \; c^{\top} x \quad \text{s.t.} \quad Ax \le b,\;\; \|x - \hat{x}^k\|_1 \le \Delta,$$
to obtain a reference solution $\bar{x}^k$.
- Fixing/Reduction: Fix indices where $\hat{x}^k_i = \bar{x}^k_i$ (high confidence: prediction and correction agree), then update $\mathcal{P}^{k+1}$, further reducing the problem (see the loop skeleton after this workflow).
This cycle continues until all integer variables are fixed or resource limits are met.
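A compact Python skeleton of this loop is given below. The `milp` container, `predict_marginals` (GNN predictor), and `solve_trust_region` (sub-MILP solver call) are hypothetical stand-ins for the framework’s components, and the fixing ratio and round budget are illustrative.

```python
def apollo_milp_loop(milp, predict_marginals, solve_trust_region,
                     n_rounds=5, fix_ratio=0.2):
    """Alternating predict-and-correct reduction (illustrative skeleton)."""
    fixed = {}                                    # variable index -> 0/1 value
    for _ in range(n_rounds):
        free = [i for i in milp.binary_vars if i not in fixed]
        k = min(int(fix_ratio * len(free)), len(free) // 2)
        if k == 0:
            break
        p = predict_marginals(milp, fixed)        # marginals for free vars
        ranked = sorted(free, key=lambda i: p[i])
        x_hat = {i: 0 for i in ranked[:k]}        # most confident zeros...
        x_hat.update({i: 1 for i in ranked[-k:]}) # ...and most confident ones
        # Correction: best solution within a trust region around x_hat.
        x_bar = solve_trust_region(milp, fixed, x_hat)
        # Fix only where prediction and correction agree.
        fixed.update({i: v for i, v in x_hat.items() if x_bar[i] == v})
    return fixed
```

The key invariant is that variables are only fixed where the prediction and the solver-derived reference agree, which is what preserves feasibility across reductions.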
Uncertainty-Based Error Upper Bound (UEBO)
The uncertainty of predictions is evaluated through an upper bound (UEBO) on the (intractable) KL divergence between the predicted and true marginals, which decomposes into an entropy term $H(\cdot)$ over the predicted marginals plus a prediction–correction discrepancy term. In practice, single-point prediction–correction consistency is used to guide variable fixing.
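The closed form of UEBO is not reproduced here; as a rough illustration of how an entropy term and a prediction–correction discrepancy could combine into a per-variable score, consider the following hypothetical proxy (the names and the unweighted sum are assumptions):

```python
import numpy as np

def uncertainty_proxy(p, x_hat, x_bar, eps=1e-12):
    """Illustrative per-variable score: Bernoulli entropy of the predicted
    marginal plus a 0/1 prediction-correction disagreement term. The paper's
    UEBO combines analogous quantities; the exact weighting is an assumption."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    H = -(p * np.log(p) + (1 - p) * np.log1p(-p))      # entropy term
    disagree = (np.asarray(x_hat) != np.asarray(x_bar)).astype(float)
    return H + disagree
```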
Empirical Results and Guarantees
On combinatorial auctions, set covering, item placement, and workload apportionment benchmarks, Apollo-MILP achieves up to a 77.8% reduction in absolute gap versus Gurobi and 70.2% versus SCIP, and closes 100% of the gap on hard real-world datasets. The algorithm maintains feasibility at each reduction step and converges to a lower primal gap faster than all tested baselines.
4. Comparison of Methodologies and Domains
| Variant | Domain | Main Technique | Core Memory/Complexity Benefit |
|---|---|---|---|
| APOLLO (LLM) | Deep learning | Channel/tensor-wise scaling via low-rank random projection | $O(nr)$ or $O(n)$ optimizer state; no SVD |
| APOLLO (quasi-Newton) | Deep learning | Diagonal quasi-Newton with rectified update | $O(d)$ state; diagonal curvature adaptation |
| APOLLO-MILP | MILP/combinatorial | Alternating ML prediction and trust-region correction | Problem-size reduction while preserving feasibility |
While all variants address scaling—memory or combinatorial dimension—they are specialized for different mathematical problem classes: APOLLO for LLMs prioritizes memory-optimal adaptive learning; APOLLO (quasi-Newton) targets curvature adaptation in stochastic optimization; APOLLO-MILP applies prediction-correction with uncertainty quantification for MILP.
5. Practical Implications and System-Level Benefits
The memory and computational gains of APOLLO (LLM) have significant practical consequences:
- Training of foundation models (e.g., LLaMA-13B) becomes feasible on mainstream (A100-80GB) hardware without aggressive model sharding or optimizer state offloading.
- On low-end GPUs, pre-training models up to 7B parameters with INT8 quantization is achieved using a single 12GB card.
- Batch-size scaling is unlocked: APOLLO enables up to 3× throughput versus AdamW, owing to optimizer-state reduction and removal of the SVD-induced stalls seen in GaLore/Fira.
APOLLO-MILP’s predictor-corrector logic yields substantial problem-size reductions and improved solution quality without risking infeasibility—a critical property for industrial and scientific MILP deployments.
6. Limitations, Compatibility, and Hyperparameter Tuning
- APOLLO (LLM) requires choosing the subspace rank $r$; a low rank (down to $r=1$, APOLLO-Mini) suffices for most settings, but extremely aggressive compression may trade off some adaptivity.
- No hyperparameter tuning beyond AdamW defaults is generally required; APOLLO operates robustly from rank-1 (APOLLO-Mini) up to moderate ranks.
- The framework is compatible with activation checkpointing, ZeRO, quantization, and arbitrary large-batch distributed training.
- APOLLO-MILP presupposes access to high-performance MILP solvers for trust-region correction and requires ML predictor pretraining.
In summary, the APOLLO optimizer family provides complementary, scalable approaches to training efficiency and high-quality solution finding in both deep neural learning and combinatorial domains, anchored in principled memory reduction, uncertainty quantification, and scalable correction mechanisms (Zhu et al., 6 Dec 2024; Ma, 2020; Liu et al., 3 Mar 2025).