Sophia Optimizer: Scalable Second-Order Method
- Sophia Optimizer is a second-order stochastic method that uses inexpensive diagonal Hessian estimates for robust, adaptive updates in large-scale language models.
- It combines momentum tracking, periodic Hessian estimation, and clipped preconditioned steps to ensure stability and accelerate convergence in anisotropic loss landscapes.
- Empirical results show Sophia significantly reduces optimization steps and compute time (e.g., halving steps for GPT-2 770M) while achieving lower validation loss compared to AdamW and Lion.
Sophia is a second-order stochastic optimizer tailored for large-scale LLM pre-training, designed to exploit inexpensive diagonal Hessian information for adaptive, robust, and computationally efficient parameter updates. The method achieves per-step cost close to first-order algorithms such as AdamW, while introducing component-wise second-order scaling through online diagonal Hessian or Gauss–Newton approximations, modulated by entry-wise update clipping for stability. Sophia has demonstrated substantial reductions in the number of optimization steps, compute, and wall-clock time to reach standard perplexity thresholds on GPT-family models, positioning it as a practical alternative to both first-order and heavy-weight second-order methods in the LLM regime (Liu et al., 2023, Schlotthauer et al., 11 Jul 2025, Altınel, 22 Sep 2025).
1. Algorithmic Formulation
Sophia combines three principal mechanisms: first-moment (momentum) tracking, exponentially averaged diagonal Hessian estimation, and clipped preconditioned updates. Let $\theta_t$ denote the parameter vector at step $t$ and $g_t = \nabla f_t(\theta_{t-1})$ the stochastic gradient from the current batch.
- Momentum accumulation: $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
- Diagonal Hessian estimation (updated every $k$ steps, typically $k = 1$ or $10$): $h_t = \beta_2 h_{t-k} + (1 - \beta_2)\, \hat{h}_t$ with $\hat{h}_t = \mathrm{Est}(\theta_{t-1})$, and $h_t = h_{t-1}$ on intermediate steps,
where Est is a low-cost estimator such as Hutchinson or Gauss–Newton–Bartlett.
- Decoupled weight decay (as in AdamW): $\theta'_t = \theta_{t-1} - \eta_t \lambda\, \theta_{t-1}$
- Clipped, curvature-normalized step: $\theta_t = \theta'_t - \eta_t \cdot \operatorname{clip}\!\big(m_t / \max(\gamma h_t, \epsilon),\, 1\big)$
The element-wise clip restricts each coordinate of the preconditioned update to $[-1, 1]$, mitigating instability due to spurious curvatures, noisy Hessian estimates, or saddle points.
Generic pseudocode summarizing these steps:

```
Input: β1, β2, k, learning rate schedule η_t, λ, γ, ε, Est ∈ {Hutchinson, GN-Bartlett}
Initialize θ_0 ∈ ℝⁿ, m_0 ← 0, h_0 ← 0, t ← 0
while not converged:
    t ← t + 1
    g_t ← ∇f_t(θ_{t-1})
    m_t ← β1 · m_{t-1} + (1 − β1) · g_t
    if t mod k == 0:
        ĥ_t ← Est(θ_{t-1})
        h_t ← β2 · h_{t−k} + (1 − β2) · ĥ_t
    else:
        h_t ← h_{t−1}
    θ'_t ← θ_{t-1} − η_t · λ · θ_{t-1}
    θ_t ← θ'_t − η_t · clip( m_t / max(γ · h_t, ε), 1 )
```
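The pseudocode maps directly onto a per-step update routine. The following PyTorch-style functions are a minimal illustrative sketch of one Sophia-style step under the state layout assumed here (lists of parameter, momentum, and Hessian-EMA tensors); the function and argument names are this sketch's own, not those of the reference implementation (Liu et al., 2023).

```python
import torch

@torch.no_grad()
def sophia_step(params, grads, m, h, *, lr, beta1, gamma, eps=1e-12, weight_decay=0.0):
    """One Sophia-style update (illustrative sketch).

    params, grads, m, h: lists of same-shaped tensors, where m is the gradient EMA
    and h is the diagonal-Hessian EMA (refreshed every k steps elsewhere).
    """
    for p, g, m_i, h_i in zip(params, grads, m, h):
        # m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
        m_i.mul_(beta1).add_(g, alpha=1.0 - beta1)
        # Decoupled weight decay (AdamW-style): theta' = theta - lr * lambda * theta
        p.mul_(1.0 - lr * weight_decay)
        # theta_t = theta' - lr * clip(m_t / max(gamma * h_t, eps), 1)
        precond = m_i / torch.clamp(gamma * h_i, min=eps)
        p.add_(torch.clamp(precond, -1.0, 1.0), alpha=-lr)

@torch.no_grad()
def refresh_hessian_ema(h, h_hat, beta2):
    """Every k steps: h_t = beta2 * h_{t-k} + (1 - beta2) * hat{h}_t."""
    for h_i, hh_i in zip(h, h_hat):
        h_i.mul_(beta2).add_(hh_i, alpha=1.0 - beta2)
```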
2. Theoretical Motivations and Underlying Principles
Sophia's update can be interpreted as an approximate diagonal Newton (or Gauss–Newton) step:
- By dividing the (smoothed) gradient by Hessian estimates, Sophia adapts the step size per dimension, accelerating convergence, especially in anisotropic loss landscapes common in transformers (Liu et al., 2023).
- Element-wise clipping directly guards against divergence caused by outlier curvature estimates or non-convex regions where Hessian signs may fluctuate. This replaces the need for trust-region logic typical in traditional second-order schemes.
- Theoretical regime: In convex problems with multiplicative Hessian Lipschitzness, the clipped Newton-like step yields convergence rates independent of the condition number, with a phase transition from a slower, clipping-dominated global phase to quadratic convergence in the local regime (Liu et al., 2023); the per-coordinate bound sketched below makes the stabilizing mechanism explicit.
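To make the role of clipping concrete, the following bound (a sketch implied by the update rule above, not a theorem statement from the paper) shows that each coordinate moves by at most $\eta_t$ regardless of how small or noisy its curvature estimate is, while coordinates with well-scaled curvature receive the unclipped, diagonal Newton-like step:

```latex
|\theta_{t,i} - \theta'_{t,i}|
  = \eta_t \left| \operatorname{clip}\!\left( \frac{m_{t,i}}{\max(\gamma h_{t,i}, \epsilon)},\, 1 \right) \right|
  \le \eta_t,
\qquad
\theta_{t,i} - \theta'_{t,i} = -\,\eta_t \, \frac{m_{t,i}}{\gamma h_{t,i}}
  \quad \text{whenever } \gamma h_{t,i} \ge \epsilon \text{ and } \left| \tfrac{m_{t,i}}{\gamma h_{t,i}} \right| \le 1 .
```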
3. Practical Implementation and Complexity
Estimator Choices and Overhead
Sophia leverages two diagonal Hessian approximators:
- Hutchinson's estimator: uses random (Rademacher) probe vectors and Hessian–vector products to obtain unbiased but noisy diagonal estimates.
- Gauss–Newton–Bartlett: computes the squared gradients on resampled labels to obtain a PSD and more stable curvature estimate.
Implementation requires maintaining two auxiliary state vectors ($m_t$ and $h_t$) of the same dimension as the parameters, matching AdamW's memory cost. Hessian computation overhead is amortized by running the estimator infrequently, e.g., every $k = 10$ steps on a minibatch subset, resulting in a 5–6% per-step compute/time increase over first-order methods in large-scale LLMs (Liu et al., 2023, Schlotthauer et al., 11 Jul 2025).
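For concreteness, below is a minimal sketch of Hutchinson's diagonal estimator via double backpropagation; it is illustrative only (the function name is this sketch's own) and assumes `loss` was built with a differentiable graph so that Hessian–vector products are available. The Gauss–Newton–Bartlett alternative instead squares gradients computed against labels resampled from the model's own predictive distribution.

```python
import torch

def hutchinson_diag_hessian(loss, params, n_samples=1):
    """Unbiased diagonal Hessian estimate: E[u * (H u)] = diag(H)
    for u with i.i.d. Rademacher (+/-1) entries. Illustrative sketch."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag_est = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        # Rademacher probe vectors, one per parameter tensor
        us = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]
        # Hessian-vector product: differentiate <grad, u> w.r.t. the parameters
        dot = sum((g * u).sum() for g, u in zip(grads, us))
        hvps = torch.autograd.grad(dot, params, retain_graph=True)
        for d, u, hv in zip(diag_est, us, hvps):
            d.add_(u * hv, alpha=1.0 / n_samples)
    return diag_est
```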
Default and Tuned Hyperparameters
Typical settings (for GPT-2 or LLaMA 2.7B-sized models, with hyperparameters transferred from μP-proxy tuning) include:

| Hyperparameter | Value |
|-------------------|------------------|
| Learning rate | 0.001–0.002 |
| β₁ | 0.96 |
| β₂ | 0.99 |
| ε | $10^{-15}$–$10^{-12}$ |
| γ (ρ) | 0.3–1.0 |
| Weight decay λ | 0.2 |
| Hessian period k | 1–10 |
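Expressed as a plain configuration dictionary (the key names follow the pseudocode symbols above and are hypothetical, not the argument names of any particular Sophia implementation):

```python
sophia_config = {
    "lr": 2e-3,                   # learning rate (0.001–0.002 typical)
    "beta1": 0.96,                # momentum EMA coefficient
    "beta2": 0.99,                # diagonal-Hessian EMA coefficient
    "eps": 1e-12,                 # numerical floor (1e-15 to 1e-12)
    "gamma": 0.3,                 # clipping / preconditioner scale (aka rho), 0.3–1.0
    "weight_decay": 0.2,          # decoupled weight decay lambda
    "hessian_update_every": 10,   # k: Hessian estimation period (1–10)
}
```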
Hyperparameter transfer according to μP principles is robust across model scales but sensitive to a mismatched learning rate or γ (which controls the fraction of allowed unclipped updates) (Schlotthauer et al., 11 Jul 2025).
4. Empirical Performance and Comparative Analysis
Large-Scale Language Modeling
- Sophia achieves equivalent or lower validation perplexity than AdamW and Lion in both single-epoch and repeated-epoch LLM pre-training regimes, particularly on decoder-only GPT models under compute-constrained schedules (Schlotthauer et al., 11 Jul 2025, Altınel, 22 Sep 2025).
- Wall-clock efficiency: Sophia reaches target perplexity in roughly half the steps of AdamW; owing to its modest per-step overhead, overall compute and time are nearly halved as well, and the effect becomes more pronounced as model scale grows. For example, on GPT-2 770M, Sophia-H/G reduced steps from 400K (AdamW) to 200K at +5–6% wall-time per step (Liu et al., 2023); see the arithmetic after this list.
- On downstream evaluation (zero-shot transfer tasks: ARC, HellaSwag, MMLU), AdamW retains the highest average accuracy; Sophia’s final validation loss is lowest but at a cost of 1–2 pp lower average zero-shot accuracy under tight compute budgets (Schlotthauer et al., 11 Jul 2025).
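The compute claim in the wall-clock bullet follows from simple arithmetic on the reported numbers (halved step count, 5–6% per-step overhead):

```latex
\text{relative wall-clock time} \;\approx\; \frac{200\text{K}}{400\text{K}} \times 1.06 \;\approx\; 0.53,
```

i.e., roughly a 47% reduction in total time to the target perplexity.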
Robustness and Sensitivity
- Sophia is robust to moderate hyperparameter misspecification under μP scaling but exhibits performance drops if the optimal γ or learning rate is missed; it is less robust to out-of-distribution hyperparameters than AdamW (Schlotthauer et al., 11 Jul 2025).
- In small-scale experiments (e.g., 150M transformer pre-training), Sophia consistently matches but does not outperform simpler sign-momentum (Signum), raising questions about the necessity of curvature estimation in this regime (Zhao et al., 2024).
5. Comparison to Other Optimizers
| Optimizer | Core Mechanism | Overhead | Typical Performance |
|---|---|---|---|
| SGD | Fixed stepsize, no adaptivity | Low | Slow convergence, high sensitivity |
| AdamW | EMA of squared gradients, decoupled WD | Medium | Fast initial convergence, strong generalization |
| Lion | Sign-based momentum, no curvature | Low | Fastest wall-clock, early convergence, higher plateauing loss |
| Sophia | Diagonal Hessian EMA, clipped preconditioned updates | Medium (+5–6%) | Lowest loss in multi-epoch runs, best overall compute efficiency, moderate downstream generalization |
Sophia uniquely interpolates between computationally efficient first-order adaptation (AdamW) and robust, geometry-aware second-order updates (full Newton/AdaHessian), without the cost of matrix inversion, block structure, or frequent Hessian-vector products (Liu et al., 2023, Schlotthauer et al., 11 Jul 2025, Altınel, 22 Sep 2025).
6. Limitations and Open Challenges
- Off-diagonal curvature: Sophia exclusively leverages the diagonal of the Hessian; cross-parameter interactions are ignored. A plausible implication is reduced effectiveness in highly non-separable parameter regimes or small-batch, ill-conditioned vision/RL objectives (Liu et al., 2023).
- Downstream generalization: While Sophia often minimizes training/validation loss fastest, AdamW generalizes better for downstream zero-shot and transfer performance on conventional LLM benchmarks (Schlotthauer et al., 11 Jul 2025).
- Estimator choice and stability: Hutchinson estimators, though unbiased, can be noisy; Gauss–Newton–Bartlett is more stable but has bias, which may matter outside the language modeling regime (Altınel, 22 Sep 2025).
- Transfer to non-LLM domains: Sophia’s hyperparameters, update logic, and stability requirements may require careful re-tuning for vision or reinforcement learning tasks, where the loss landscape and batch regime differ substantially (Liu et al., 2023, Altınel, 22 Sep 2025).
- Small-scale empirical benefit: In certain settings (e.g., 150M transformer pre-training), Sophia does not outperform simpler sign-momentum or Lion optimizers (Zhao et al., 2024), suggesting high-curvature adaptivity is less critical at these scales.
7. Best Practices and Practical Guidance
- Warmup and schedule: Use linear or cosine warmup to the peak learning rate, with subsequent linear/cosine decay, mirroring AdamW best-practice.
- Clipping parameter calibration: Run small proxy models and monitor the fraction of unclipped updates, tuning γ so that it ideally lands in the 10–50% range (see the snippet after this list).
- Gradient estimator selection: Start with Gauss–Newton–Bartlett on autoregressive LLMs; fall back to Hutchinson for non-LLM or custom architectures.
- Hessian update frequency: $k = 10$ suffices for a good trade-off; $k = 1$ offers maximal fidelity at roughly 50% higher compute cost.
- No need for model modifications: Sophia is compatible with canonical transformer modules—no temperature scaling, layerwise learning rates, or architecture-specific wiring required (Liu et al., 2023).
- Compute/memory overhead: With the default $k = 10$, per-step memory matches AdamW and wall-clock time increases by 5–6%. For compute- or memory-constrained runs, increase $k$ or downsample the Hessian-estimation batch (Schlotthauer et al., 11 Jul 2025).
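A minimal sketch of the clipping-fraction diagnostic referenced above, reusing the state layout of the earlier step sketch (the helper name is hypothetical):

```python
import torch

@torch.no_grad()
def unclipped_fraction(m, h, gamma, eps=1e-12):
    """Fraction of coordinates whose preconditioned update |m / max(gamma*h, eps)| < 1,
    i.e., coordinates not saturated by the clip; target roughly 10-50%."""
    total, unclipped = 0, 0
    for m_i, h_i in zip(m, h):
        ratio = m_i.abs() / torch.clamp(gamma * h_i, min=eps)
        unclipped += (ratio < 1.0).sum().item()
        total += ratio.numel()
    return unclipped / max(total, 1)
```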
References
- "Sophia: A Scalable Stochastic Second-order Optimizer for LLM Pre-training" (Liu et al., 2023)
- "Pre-Training LLMs on a budget: A comparison of three optimizers" (Schlotthauer et al., 11 Jul 2025)
- "Development of Deep Learning Optimizers: Approaches, Concepts, and Update Rules" (Altınel, 22 Sep 2025)
- "Deconstructing What Makes a Good Optimizer for LLMs" (Zhao et al., 2024)