Sophia Optimizer: Scalable Second-Order Method

Updated 17 January 2026
  • Sophia Optimizer is a second-order stochastic method that uses inexpensive diagonal Hessian estimates for robust, adaptive updates in large-scale language models.
  • It combines momentum tracking, periodic Hessian estimation, and clipped preconditioned steps to ensure stability and accelerate convergence in anisotropic loss landscapes.
  • Empirical results show Sophia significantly reduces optimization steps and compute time (e.g., halving steps for GPT-2 770M) while achieving lower validation loss compared to AdamW and Lion.

Sophia is a second-order stochastic optimizer tailored for large-scale LLM pre-training, designed to exploit inexpensive diagonal Hessian information for adaptive, robust, and computationally efficient parameter updates. The method achieves per-step cost close to first-order algorithms such as AdamW, while introducing component-wise second-order scaling through online diagonal Hessian or Gauss–Newton approximations, modulated by entry-wise update clipping for stability. Sophia has demonstrated substantial reductions in the number of optimization steps, compute, and wall-clock time to reach standard perplexity thresholds on GPT-family models, positioning it as a practical alternative to both first-order and heavy-weight second-order methods in the LLM regime (Liu et al., 2023, Schlotthauer et al., 11 Jul 2025, Altınel, 22 Sep 2025).

1. Algorithmic Formulation

Sophia combines three principal mechanisms: first-moment (momentum) tracking, exponentially averaged diagonal Hessian estimation, and clipped preconditioned updates. Let $\theta_t \in \mathbb{R}^n$ denote the parameter vector at step $t$ and $g_t = \nabla f_t(\theta_{t-1})$ the stochastic gradient from the current batch.

  • Momentum accumulation:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$$

  • Diagonal Hessian estimation (updated every $k$ steps, typically $k=1$ or $k=10$):

$$\hat h_t = \mathrm{Est}(\theta_{t-1}), \qquad h_t = \beta_2\, h_{t-k} + (1-\beta_2)\, \hat h_t \quad \text{if } t \bmod k = 0$$

where $\mathrm{Est}$ is a low-cost estimator such as Hutchinson's or the Gauss–Newton–Bartlett estimator.

  • Decoupled weight decay:

$$\theta'_t = \theta_{t-1} - \eta_t \lambda\, \theta_{t-1}$$

  • Clipped, curvature-normalized step:

$$\theta_t = \theta'_t - \eta_t \cdot \mathrm{clip}\!\left( \frac{m_t}{\max(\gamma h_t,\, \epsilon)},\, 1 \right)$$

The element-wise clip restricts each coordinate to $[-1, 1]$, mitigating instability due to spurious curvatures, noisy Hessian estimates, or saddle points.

Generic pseudocode summarizing these steps:

Input: β1, β2, k, learning rate schedule η_t, λ, γ, ε, Est ∈ {Hutchinson, GN–Bartlett}
Initialize θ_0 ∈ ℝⁿ, m_0 ← 0, h_0 ← 0
while not converged:
  t ← t + 1
  g_t ← ∇f_t(θ_{t−1})
  m_t ← β1 · m_{t−1} + (1 − β1) · g_t
  if t mod k == 0:
      ĥ_t ← Est(θ_{t−1})
      h_t ← β2 · h_{t−k} + (1 − β2) · ĥ_t
  else:
      h_t ← h_{t−1}
  θ'_t ← θ_{t−1} − η_t · λ · θ_{t−1}
  θ_t ← θ'_t − η_t · clip( m_t / max(γ · h_t, ε), 1 )
Key hyperparameters include the learning rate $\eta_t$, momentum factors $(\beta_1, \beta_2)$, Hessian update period $k$, weight decay $\lambda$, stability constant $\epsilon$, and clipping parameter $\gamma$ or, equivalently, the clipping rate $\delta$ (Liu et al., 2023, Schlotthauer et al., 11 Jul 2025, Altınel, 22 Sep 2025).
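
A minimal PyTorch-style sketch of this update loop follows; the function signature, state layout, and `est_fn` hook are illustrative placeholders rather than the authors' reference implementation, with defaults drawn from the table in Section 3.

```python
import torch

def sophia_step(params, grads, state, est_fn, t,
                lr=2e-3, b1=0.96, b2=0.99, k=10,
                weight_decay=0.2, gamma=0.3, eps=1e-12):
    """One Sophia-style update over a list of parameter tensors (sketch).

    state['m'][i] and state['h'][i] are EMAs shaped like params[i].
    est_fn(p) should return a diagonal Hessian (or Gauss-Newton) estimate
    with the same shape as p; it is only invoked every k steps.
    """
    for i, (p, g) in enumerate(zip(params, grads)):
        m, h = state['m'][i], state['h'][i]

        # First-moment EMA (momentum).
        m.mul_(b1).add_(g, alpha=1 - b1)

        # Refresh the diagonal-curvature EMA only every k steps;
        # otherwise h is carried over unchanged (h_t = h_{t-1}).
        if t % k == 0:
            h.mul_(b2).add_(est_fn(p), alpha=1 - b2)

        # Decoupled weight decay: θ' = θ − η·λ·θ.
        p.mul_(1 - lr * weight_decay)

        # Clipped, curvature-normalized step; each coordinate moves at most lr.
        update = (m / torch.clamp(gamma * h, min=eps)).clamp_(-1.0, 1.0)
        p.add_(update, alpha=-lr)
```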

2. Theoretical Motivations and Underlying Principles

Sophia's update can be interpreted as an approximate diagonal Newton (or Gauss–Newton) step:

  • By dividing the (smoothed) gradient by Hessian estimates, Sophia adapts the step size per dimension, accelerating convergence, especially in anisotropic loss landscapes common in transformers (Liu et al., 2023).
  • Element-wise clipping directly guards against divergence caused by outlier curvature estimates or non-convex regions where Hessian signs may fluctuate. This replaces the need for trust-region logic typical in traditional second-order schemes.
  • Theoretical regime: in convex problems with multiplicative Hessian Lipschitzness, the clipped Newton-like step yields convergence rates independent of the condition number, with a phase transition from an $O(1/\epsilon)$ rate to quadratic convergence in the local regime (Liu et al., 2023). The per-coordinate rewrite below makes the clipping behavior explicit.
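
Because the denominator $\max(\gamma h_t, \epsilon)$ is always positive, the update can be rewritten per coordinate (an algebraically equivalent form, not a quotation from the cited papers):

$$\Delta\theta_{t,i} = -\eta_t \,\mathrm{sign}(m_{t,i}) \cdot \min\!\left( \frac{|m_{t,i}|}{\max(\gamma h_{t,i},\, \epsilon)},\, 1 \right)$$

When $\gamma h_{t,i}$ is large and positive relative to $|m_{t,i}|$, the coordinate takes the Newton-like step $-\eta_t\, m_{t,i}/(\gamma h_{t,i})$; when the curvature estimate is small, noisy, or negative, the minimum saturates and the coordinate falls back to a sign-momentum step of magnitude $\eta_t$.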

3. Practical Implementation and Complexity

Estimator Choices and Overhead

Sophia leverages two diagonal Hessian approximators:

  • Hutchinson's estimator: uses random projections to obtain unbiased but noisy diagonal estimates.
  • Gauss–Newton–Bartlett: computes the squared gradients on resampled labels to obtain a PSD and more stable curvature estimate.

Implementation requires maintaining two auxiliary $n$-vectors ($m$ and $h$), matching AdamW's memory cost. Hessian computation overhead is amortized by running the estimator infrequently, e.g., every $k=10$ steps on a minibatch subset, resulting in roughly a 5–6% per-step compute/time increase over first-order methods in large-scale LLMs (Liu et al., 2023, Schlotthauer et al., 11 Jul 2025).
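
A minimal sketch of Hutchinson's estimator using PyTorch autograd (function name and batching are illustrative; the Gauss–Newton–Bartlett variant instead squares gradients computed against labels resampled from the model's own predictions):

```python
import torch

def hutchinson_diag_hessian(loss, params, n_samples=1):
    """Unbiased diagonal Hessian estimate via E[u ⊙ (H u)], u ~ Rademacher.

    loss: scalar loss built from differentiable ops on params.
    params: list of tensors with requires_grad=True.
    Returns a list of tensors shaped like params.
    """
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        # Rademacher probe vectors: entries ±1 with equal probability.
        us = [torch.randint_like(p, 0, 2) * 2.0 - 1.0 for p in params]
        # Hessian-vector product H u via a second backward pass.
        hvps = torch.autograd.grad(grads, params, grad_outputs=us,
                                   retain_graph=True)
        for d, u, hvp in zip(diag, us, hvps):
            d.add_(u * hvp / n_samples)  # u ⊙ (H u) averages to diag(H)
    return diag
```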

Default and Tuned Hyperparameters

Typical settings (for GPT-2 or LLaMA 2.7B-sized models transitioning from μP-proxy tuning) include:

| Hyperparameter    | Value          |
|-------------------|----------------|
| Learning rate     | 0.001–0.002    |
| β₁                | 0.96           |
| β₂                | 0.99           |
| ε                 | 10⁻¹⁵–10⁻¹²    |
| γ (ρ)             | 0.3–1.0        |
| Weight decay λ    | 0.2            |
| Hessian period k  | 1–10           |

Hyperparameter transfer, according to μP principles, is robust across model scales but sensitive to a mismatched learning rate or $\rho$ (fraction of allowed unclipped updates) (Schlotthauer et al., 11 Jul 2025).
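
For concreteness, a hypothetical configuration dictionary with mid-range values from the table above, as it would map onto the sketch in Section 1 (illustrative starting points, not prescriptions):

```python
# Illustrative mid-range defaults from the table above.
sophia_hparams = dict(
    lr=2e-3,           # peak learning rate; pair with warmup + decay schedule
    b1=0.96,           # momentum EMA factor β₁
    b2=0.99,           # Hessian EMA factor β₂
    k=10,              # Hessian refresh period (steps)
    weight_decay=0.2,  # decoupled weight decay λ
    gamma=0.3,         # clipping scale γ (ρ); tune via clip-rate monitoring
    eps=1e-12,         # numerical floor ε
)
```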

4. Empirical Performance and Comparative Analysis

Large-Scale Language Modeling

  • Sophia achieves equivalent or lower validation perplexity than AdamW and Lion in both single-epoch and repeated-epoch LLM pre-training regimes, particularly on decoder-only GPT models under compute-constrained schedules (Schlotthauer et al., 11 Jul 2025, Altınel, 22 Sep 2025).
  • Wall-clock efficiency: Sophia requires roughly half as many steps as AdamW; because the per-step overhead is modest, overall compute and time to reach target perplexity are nearly halved, an effect that becomes more pronounced as model scale grows. For example, for GPT-2 770M models, Sophia-H/G reduced steps from 400K (AdamW) to 200K, at the cost of 5–6% extra wall-time per step (Liu et al., 2023).
  • On downstream evaluation (zero-shot transfer tasks: ARC, HellaSwag, MMLU), AdamW retains the highest average accuracy; Sophia's final validation loss is lowest, but at the cost of roughly 1–2 percentage points lower average zero-shot accuracy under tight compute budgets (Schlotthauer et al., 11 Jul 2025).

Robustness and Sensitivity

  • Sophia is robust to moderate hyperparameter misspecification under μP scaling but exhibits performance drops if the optimal $\rho$ or learning rate is missed; it is less robust to out-of-distribution hyperparameters than AdamW (Schlotthauer et al., 11 Jul 2025).
  • In small-scale experiments (e.g., 150M transformer pre-training), Sophia consistently matches but does not outperform simpler sign-momentum (Signum), raising questions about the necessity of curvature estimation in this regime (Zhao et al., 2024).

5. Comparison to Other Optimizers

| Optimizer | Core Mechanism | Overhead | Typical Performance |
|-----------|----------------|----------|---------------------|
| SGD | Fixed stepsize, no adaptivity | Low | Slow convergence, high sensitivity |
| AdamW | EMA of squared gradients, decoupled WD | Medium | Fast initial convergence, strong generalization |
| Lion | Sign-based momentum, no curvature | Low | Fastest wall-clock, early convergence, higher plateauing loss |
| Sophia | Diagonal Hessian EMA, clipped preconditioned updates | Medium (+5–6%) | Lowest loss in multi-epoch, fastest overall compute efficiency, moderate downstream generalization |

Sophia uniquely interpolates between computationally efficient first-order adaptation (AdamW) and robust, geometry-aware second-order updates (full Newton/AdaHessian), without the cost of matrix inversion, block structure, or frequent Hessian-vector products (Liu et al., 2023, Schlotthauer et al., 11 Jul 2025, Altınel, 22 Sep 2025).

6. Limitations and Open Challenges

  • Off-diagonal curvature: Sophia exclusively leverages the diagonal of the Hessian; cross-parameter interactions are ignored. A plausible implication is reduced effectiveness in highly non-separable parameter regimes or small-batch, ill-conditioned vision/RL objectives (Liu et al., 2023).
  • Downstream generalization: While Sophia often minimizes training/validation loss fastest, AdamW generalizes better for downstream zero-shot and transfer performance on conventional LLM benchmarks (Schlotthauer et al., 11 Jul 2025).
  • Estimator choice and stability: Hutchinson estimators, though unbiased, can be noisy; Gauss–Newton–Bartlett is more stable but has bias, which may matter outside the language modeling regime (Altınel, 22 Sep 2025).
  • Transfer to non-LLM domains: Sophia’s hyperparameters, update logic, and stability requirements may require careful re-tuning for vision or reinforcement learning tasks, where the loss landscape and batch regime differ substantially (Liu et al., 2023, Altınel, 22 Sep 2025).
  • Small-scale empirical benefit: In certain settings (e.g., 150M transformer pre-training), Sophia does not outperform simpler sign-momentum or Lion optimizers (Zhao et al., 2024), suggesting high-curvature adaptivity is less critical at these scales.

7. Best Practices and Practical Guidance

  • Warmup and schedule: Use linear or cosine warmup to the peak learning rate, with subsequent linear/cosine decay, mirroring AdamW best-practice.
  • Clipping parameter calibration: run small proxy models and monitor the fraction of unclipped updates, i.e. coordinates with $|m_t/(\gamma h_t)| < 1$; ideally target 10–50% (see the monitoring sketch after this list).
  • Gradient estimator selection: Start with Gauss–Newton–Bartlett on autoregressive LLMs; fallback to Hutchinson for non-LLM or custom architectures.
  • Hessian update frequency: $k=10$ suffices for a good trade-off; $k=1$ offers maximal fidelity at roughly 50% higher compute cost.
  • No need for model modifications: Sophia is compatible with canonical transformer modules—no temperature scaling, layerwise learning rates, or architecture-specific wiring required (Liu et al., 2023).
  • Compute/memory overhead: with $k=10$, per-step memory matches AdamW and wall-clock time increases by 5–6%. For memory-constrained runs, increase $k$ or downsample the Hessian batch (Schlotthauer et al., 11 Jul 2025).
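
A small sketch of the clip-rate monitoring suggested above (the function, threshold, and state layout are illustrative assumptions):

```python
import torch

def unclipped_fraction(m, h, gamma=0.3, eps=1e-12):
    """Fraction of coordinates whose preconditioned update is NOT clipped,
    i.e. |m / max(γ·h, ε)| < 1. A common target is roughly 10–50%."""
    ratio = m / torch.clamp(gamma * h, min=eps)
    return (ratio.abs() < 1.0).float().mean().item()

# Hypothetical usage with the optimizer state from the Section 1 sketch:
# frac = unclipped_fraction(state['m'][i], state['h'][i], gamma=0.3)
# Note: increasing γ shrinks |ratio| and therefore raises this fraction.
```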
