Sophia Optimizer: Scalable Second-Order Method

Updated 17 January 2026
  • Sophia Optimizer is a second-order stochastic method that uses inexpensive diagonal Hessian estimates for robust, adaptive updates in large-scale language models.
  • It combines momentum tracking, periodic Hessian estimation, and clipped preconditioned steps to ensure stability and accelerate convergence in anisotropic loss landscapes.
  • Empirical results show Sophia significantly reduces optimization steps and compute time (e.g., halving steps for GPT-2 770M) while achieving lower validation loss compared to AdamW and Lion.

Sophia is a second-order stochastic optimizer tailored for large-scale LLM pre-training, designed to exploit inexpensive diagonal Hessian information for adaptive, robust, and computationally efficient parameter updates. The method achieves per-step cost close to first-order algorithms such as AdamW, while introducing component-wise second-order scaling through online diagonal Hessian or Gauss–Newton approximations, modulated by entry-wise update clipping for stability. Sophia has demonstrated substantial reductions in the number of optimization steps, compute, and wall-clock time to reach standard perplexity thresholds on GPT-family models, positioning it as a practical alternative to both first-order and heavy-weight second-order methods in the LLM regime (Liu et al., 2023, Schlotthauer et al., 11 Jul 2025, Altınel, 22 Sep 2025).

1. Algorithmic Formulation

Sophia combines three principal mechanisms: first-moment (momentum) tracking, exponentially averaged diagonal Hessian estimation, and clipped preconditioned updates. Let $\theta_t \in \mathbb{R}^n$ denote the parameter vector at step $t$ and $g_t = \nabla f_t(\theta_{t-1})$ the stochastic gradient from the current batch.

  • Momentum accumulation:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$$

  • Diagonal Hessian estimation (updated every $k$ steps, typically $k=1$ or $k=10$):

$$\hat h_t = \mathrm{Est}(\theta_{t-1}), \qquad h_t = \beta_2\, h_{t-k} + (1-\beta_2)\, \hat h_t \quad \text{if } t \bmod k = 0$$

where $\mathrm{Est}$ is a low-cost estimator such as Hutchinson's or the Gauss–Newton–Bartlett estimator.

  • Decoupled weight decay:

$$\theta'_t = \theta_{t-1} - \eta_t \lambda\, \theta_{t-1}$$

  • Clipped, curvature-normalized step:

$$\theta_t = \theta'_t - \eta_t \cdot \mathrm{clip}\!\left( \frac{m_t}{\max(\gamma h_t,\, \epsilon)},\, 1 \right)$$

The element-wise clip restricts each coordinate to $[-1, 1]$, mitigating instability due to spurious curvatures, noisy Hessian estimates, or saddle points.

Generic pseudocode summarizing these steps:

Input: β1, β2, k, learning rate schedule η_t, λ, γ, ε, Est ∈ {Hutchinson, GN–Bartlett}
Initialize θ_0 ∈ ℝⁿ, m_0 ← 0, h_0 ← 0
while not converged:
  t ← t + 1
  g_t ← ∇f_t(θ_{t−1})
  m_t ← β1 · m_{t−1} + (1 − β1) · g_t
  if t mod k == 0:
      ĥ_t ← Est(θ_{t−1})
      h_t ← β2 · h_{t−k} + (1 − β2) · ĥ_t
  else:
      h_t ← h_{t−1}
  θ'_t ← θ_{t−1} − η_t · λ · θ_{t−1}
  θ_t ← θ'_t − η_t · clip( m_t / max(γ · h_t, ε), 1 )
Key hyperparameters include the learning rate $\eta_t$, momentum factors $(\beta_1, \beta_2)$, Hessian update period $k$, weight decay $\lambda$, stability constant $\epsilon$, and clipping parameter $\gamma$ or, equivalently, the clipping rate $\delta$ (Liu et al., 2023, Schlotthauer et al., 11 Jul 2025, Altınel, 22 Sep 2025).
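
A minimal PyTorch-style sketch of this update loop follows; the function signature, state layout, and `est_fn` hook are illustrative placeholders rather than the authors' reference implementation, with defaults drawn from the table in Section 3.

```python
import torch

def sophia_step(params, grads, state, est_fn, t,
                lr=2e-3, b1=0.96, b2=0.99, k=10,
                weight_decay=0.2, gamma=0.3, eps=1e-12):
    """One Sophia-style update over a list of parameter tensors (sketch).

    state['m'][i] and state['h'][i] are EMAs shaped like params[i].
    est_fn(p) should return a diagonal Hessian (or Gauss-Newton) estimate
    with the same shape as p; it is only invoked every k steps.
    """
    for i, (p, g) in enumerate(zip(params, grads)):
        m, h = state['m'][i], state['h'][i]

        # First-moment EMA (momentum).
        m.mul_(b1).add_(g, alpha=1 - b1)

        # Refresh the diagonal-curvature EMA only every k steps;
        # otherwise h is carried over unchanged (h_t = h_{t-1}).
        if t % k == 0:
            h.mul_(b2).add_(est_fn(p), alpha=1 - b2)

        # Decoupled weight decay: θ' = θ − η·λ·θ.
        p.mul_(1 - lr * weight_decay)

        # Clipped, curvature-normalized step; each coordinate moves at most lr.
        update = (m / torch.clamp(gamma * h, min=eps)).clamp_(-1.0, 1.0)
        p.add_(update, alpha=-lr)
```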

2. Theoretical Motivations and Underlying Principles

Sophia's update can be interpreted as an approximate diagonal Newton (or Gauss–Newton) step:

  • By dividing the (smoothed) gradient by Hessian estimates, Sophia adapts the step size per dimension, accelerating convergence, especially in anisotropic loss landscapes common in transformers (Liu et al., 2023).
  • Element-wise clipping directly guards against divergence caused by outlier curvature estimates or non-convex regions where Hessian signs may fluctuate. This replaces the need for trust-region logic typical in traditional second-order schemes.
  • Theoretical regime: in convex problems with multiplicative Hessian Lipschitzness, the clipped Newton-like step yields convergence rates independent of the condition number, with a phase transition from an $O(1/\epsilon)$ rate to quadratic convergence in the local regime (Liu et al., 2023). The per-coordinate rewrite below makes the clipping behavior explicit.
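
Because the denominator $\max(\gamma h_t, \epsilon)$ is always positive, the update can be rewritten per coordinate (an algebraically equivalent form, not a quotation from the cited papers):

$$\Delta\theta_{t,i} = -\eta_t \,\mathrm{sign}(m_{t,i}) \cdot \min\!\left( \frac{|m_{t,i}|}{\max(\gamma h_{t,i},\, \epsilon)},\, 1 \right)$$

When $\gamma h_{t,i}$ is large and positive relative to $|m_{t,i}|$, the coordinate takes the Newton-like step $-\eta_t\, m_{t,i}/(\gamma h_{t,i})$; when the curvature estimate is small, noisy, or negative, the minimum saturates and the coordinate falls back to a sign-momentum step of magnitude $\eta_t$.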

3. Practical Implementation and Complexity

Estimator Choices and Overhead

Sophia leverages two diagonal Hessian approximators:

  • Hutchinson's estimator: uses random projections to obtain unbiased but noisy diagonal estimates.
  • Gauss–Newton–Bartlett: computes the squared gradients on resampled labels to obtain a PSD and more stable curvature estimate.

Implementation requires maintaining two auxiliary $n$-vectors ($m$ and $h$), matching AdamW's memory cost. Hessian computation overhead is amortized by running the estimator infrequently, e.g., every $k=10$ steps on a minibatch subset, resulting in roughly a 5–6% per-step compute/time increase over first-order methods in large-scale LLMs (Liu et al., 2023, Schlotthauer et al., 11 Jul 2025).
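
A minimal sketch of Hutchinson's estimator using PyTorch autograd (function name and batching are illustrative; the Gauss–Newton–Bartlett variant instead squares gradients computed against labels resampled from the model's own predictions):

```python
import torch

def hutchinson_diag_hessian(loss, params, n_samples=1):
    """Unbiased diagonal Hessian estimate via E[u ⊙ (H u)], u ~ Rademacher.

    loss: scalar loss built from differentiable ops on params.
    params: list of tensors with requires_grad=True.
    Returns a list of tensors shaped like params.
    """
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        # Rademacher probe vectors: entries ±1 with equal probability.
        us = [torch.randint_like(p, 0, 2) * 2.0 - 1.0 for p in params]
        # Hessian-vector product H u via a second backward pass.
        hvps = torch.autograd.grad(grads, params, grad_outputs=us,
                                   retain_graph=True)
        for d, u, hvp in zip(diag, us, hvps):
            d.add_(u * hvp / n_samples)  # u ⊙ (H u) averages to diag(H)
    return diag
```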

Default and Tuned Hyperparameters

Typical settings (for GPT-2 or LLaMA 2.7B-sized models transitioning from μP-proxy tuning) include:

| Hyperparameter    | Value          |
|-------------------|----------------|
| Learning rate     | 0.001–0.002    |
| β₁                | 0.96           |
| β₂                | 0.99           |
| ε                 | 10⁻¹⁵–10⁻¹²    |
| γ (ρ)             | 0.3–1.0        |
| Weight decay λ    | 0.2            |
| Hessian period k  | 1–10           |

Hyperparameter transfer, according to μP principles, is robust across model scales but sensitive to a mismatched learning rate or $\rho$ (fraction of allowed unclipped updates) (Schlotthauer et al., 11 Jul 2025).
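
For concreteness, a hypothetical configuration dictionary with mid-range values from the table above, as it would map onto the sketch in Section 1 (illustrative starting points, not prescriptions):

```python
# Illustrative mid-range defaults from the table above.
sophia_hparams = dict(
    lr=2e-3,           # peak learning rate; pair with warmup + decay schedule
    b1=0.96,           # momentum EMA factor β₁
    b2=0.99,           # Hessian EMA factor β₂
    k=10,              # Hessian refresh period (steps)
    weight_decay=0.2,  # decoupled weight decay λ
    gamma=0.3,         # clipping scale γ (ρ); tune via clip-rate monitoring
    eps=1e-12,         # numerical floor ε
)
```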

4. Empirical Performance and Comparative Analysis

Large-Scale Language Modeling

  • Sophia achieves equivalent or lower validation perplexity than AdamW and Lion in both single-epoch and repeated-epoch LLM pre-training regimes, particularly on decoder-only GPT models under compute-constrained schedules (Schlotthauer et al., 11 Jul 2025, Altınel, 22 Sep 2025).
  • Wall-clock efficiency: Sophia requires roughly half as many steps as AdamW; because the per-step overhead is modest, overall compute and time to reach target perplexity are nearly halved, an effect that becomes more pronounced as model scale grows. For example, for GPT-2 770M models, Sophia-H/G reduced steps from 400K (AdamW) to 200K, at the cost of 5–6% extra wall-time per step (Liu et al., 2023).
  • On downstream evaluation (zero-shot transfer tasks: ARC, HellaSwag, MMLU), AdamW retains the highest average accuracy; Sophia's final validation loss is lowest, but at the cost of roughly 1–2 percentage points lower average zero-shot accuracy under tight compute budgets (Schlotthauer et al., 11 Jul 2025).

Robustness and Sensitivity

  • Sophia is robust to moderate hyperparameter misspecification under μP scaling but exhibits performance drops if the optimal $\rho$ or learning rate is missed; it is less robust to out-of-distribution hyperparameters than AdamW (Schlotthauer et al., 11 Jul 2025).
  • In small-scale experiments (e.g., 150M transformer pre-training), Sophia consistently matches but does not outperform simpler sign-momentum (Signum), raising questions about the necessity of curvature estimation in this regime (Zhao et al., 2024).

5. Comparison to Other Optimizers

| Optimizer | Core Mechanism | Overhead | Typical Performance |
|-----------|----------------|----------|---------------------|
| SGD | Fixed stepsize, no adaptivity | Low | Slow convergence, high sensitivity |
| AdamW | EMA of squared gradients, decoupled WD | Medium | Fast initial convergence, strong generalization |
| Lion | Sign-based momentum, no curvature | Low | Fastest wall-clock, early convergence, higher plateauing loss |
| Sophia | Diagonal Hessian EMA, clipped preconditioned updates | Medium (+5–6%) | Lowest loss in multi-epoch, fastest overall compute efficiency, moderate downstream generalization |

Sophia uniquely interpolates between computationally efficient first-order adaptation (AdamW) and robust, geometry-aware second-order updates (full Newton/AdaHessian), without the cost of matrix inversion, block structure, or frequent Hessian-vector products (Liu et al., 2023, Schlotthauer et al., 11 Jul 2025, Altınel, 22 Sep 2025).

6. Limitations and Open Challenges

  • Off-diagonal curvature: Sophia exclusively leverages the diagonal of the Hessian; cross-parameter interactions are ignored. A plausible implication is reduced effectiveness in highly non-separable parameter regimes or small-batch, ill-conditioned vision/RL objectives (Liu et al., 2023).
  • Downstream generalization: While Sophia often minimizes training/validation loss fastest, AdamW generalizes better for downstream zero-shot and transfer performance on conventional LLM benchmarks (Schlotthauer et al., 11 Jul 2025).
  • Estimator choice and stability: Hutchinson estimators, though unbiased, can be noisy; Gauss–Newton–Bartlett is more stable but has bias, which may matter outside the language modeling regime (Altınel, 22 Sep 2025).
  • Transfer to non-LLM domains: Sophia’s hyperparameters, update logic, and stability requirements may require careful re-tuning for vision or reinforcement learning tasks, where the loss landscape and batch regime differ substantially (Liu et al., 2023, Altınel, 22 Sep 2025).
  • Small-scale empirical benefit: In certain settings (e.g., 150M transformer pre-training), Sophia does not outperform simpler sign-momentum or Lion optimizers (Zhao et al., 2024), suggesting high-curvature adaptivity is less critical at these scales.

7. Best Practices and Practical Guidance

  • Warmup and schedule: Use linear or cosine warmup to the peak learning rate, with subsequent linear/cosine decay, mirroring AdamW best-practice.
  • Clipping parameter calibration: run small proxy models and monitor the fraction of unclipped updates, i.e. coordinates with $|m_t/(\gamma h_t)| < 1$; ideally target 10–50% (see the monitoring sketch after this list).
  • Gradient estimator selection: Start with Gauss–Newton–Bartlett on autoregressive LLMs; fallback to Hutchinson for non-LLM or custom architectures.
  • Hessian update frequency: $k=10$ suffices for a good trade-off; $k=1$ offers maximal fidelity at roughly 50% higher compute cost.
  • No need for model modifications: Sophia is compatible with canonical transformer modules—no temperature scaling, layerwise learning rates, or architecture-specific wiring required (Liu et al., 2023).
  • Compute/memory overhead: with $k=10$, per-step memory matches AdamW and wall-clock time increases by 5–6%. For memory-constrained runs, increase $k$ or downsample the Hessian batch (Schlotthauer et al., 11 Jul 2025).
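
A small sketch of the clip-rate monitoring suggested above (the function, threshold, and state layout are illustrative assumptions):

```python
import torch

def unclipped_fraction(m, h, gamma=0.3, eps=1e-12):
    """Fraction of coordinates whose preconditioned update is NOT clipped,
    i.e. |m / max(γ·h, ε)| < 1. A common target is roughly 10–50%."""
    ratio = m / torch.clamp(gamma * h, min=eps)
    return (ratio.abs() < 1.0).float().mean().item()

# Hypothetical usage with the optimizer state from the Section 1 sketch:
# frac = unclipped_fraction(state['m'][i], state['h'][i], gamma=0.3)
# Note: increasing γ shrinks |ratio| and therefore raises this fraction.
```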
