The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton (2510.09378v1)
Abstract: Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.
Explain it Like I'm 14
What is this paper about?
This paper asks a simple question: can we speed up training LLMs by using smarter math to choose each step during training? The authors test a powerful “second-order” technique called Gauss–Newton on transformer models and compare it to popular optimizers like Adam, SOAP, and Muon. They want to understand the best possible performance we could get if we used richer curvature information about the loss landscape, even if it’s computationally heavy today.
Key questions the researchers asked
- How fast could LLM training be if we used a more exact second-order method (Gauss–Newton) instead of the usual first-order methods?
- Which parts of second-order information matter most? Do we need the full, across-all-layers curvature, or is just per-layer curvature enough?
- Does adding even higher-order loss information help beyond Gauss–Newton?
- How well do these methods work with very large batch sizes, which are important for making training more parallel and faster?
How did they study it?
Think of training like trying to get to the bottom of a valley (the lowest loss). A first-order optimizer (like Adam) looks at the local slope and takes a step downhill. A second-order optimizer also looks at the curve of the land—the “bend” or steepness changes—so it can pick smarter steps that reach the bottom faster.
- First-order vs second-order:
- First-order: uses the slope (gradient) only.
- Second-order: also uses curvature (like how quickly the slope changes). The matrix that encodes this curvature is often called the Hessian. Because the full Hessian is too big and expensive for huge models, this paper uses a closely related, safer version called the Gauss–Newton matrix.
- What is Gauss–Newton (in simple terms)?
- It’s a way to estimate the curvature of the loss that avoids some pitfalls of the full Hessian (like making unstable updates).
- You can think of it as getting better “glasses” to see the shape of the path down the valley, so your steps are more direct and less wobbly.
- Two key variations they test:
- Full Gauss–Newton: uses curvature information across the entire model.
- Layerwise Gauss–Newton: uses curvature separately per layer, ignoring cross-layer interactions. This is much closer to what practical optimizers do (like Shampoo/SOAP), but here they compute it more precisely to see the upper bound of its potential.
- GN-prox-linear: instead of using a second-order loss approximation, they keep the full (convex) loss on a linearized model. This checks whether including extra loss curvature terms helps beyond Gauss–Newton.
- How they measured performance:
- Iteration complexity: how many training steps it takes to reach a target loss (lower is better).
- Critical batch size: the largest batch size before sample efficiency starts dropping. Past this size, making the batch bigger doesn’t reduce steps much and can waste data.
- Setup:
- Models: transformer-based LLaMA variants with 45M and 150M parameters.
- Data: C4 dataset (a large text corpus).
- Baselines: AdamW, Muon, SOAP.
- For the heavy Gauss–Newton step, they didn’t build a brand-new fast solver; instead, they used an “inner optimizer” (Muon) to minimize a carefully constructed approximation of the loss that’s equivalent to applying Gauss–Newton preconditioning.
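To make the last bullet concrete, here is a minimal sketch of one such outer step on a toy two-layer network: linearize with a JVP, build the second-order loss model in output space, and minimize it with a plain gradient-descent inner loop standing in for Muon (no line search). The model, shapes, and hyperparameters are illustrative assumptions, not the paper's setup.

```python
# One Gauss-Newton-style outer step: linearize the model with a JVP, build the
# local quadratic objective in output space, and minimize it with an inner loop.
import jax
import jax.numpy as jnp

def model(params, x):
    # Tiny two-layer network standing in for a transformer.
    return jnp.tanh(x @ params["w1"]) @ params["w2"]

def loss_fn(logits, y):
    # Cross-entropy over logits, as in LLM pretraining.
    logp = jax.nn.log_softmax(logits)
    return -jnp.mean(jnp.take_along_axis(logp, y[:, None], axis=1))

def gn_outer_step(params, x, y, inner_steps=20, inner_lr=0.1):
    f0 = model(params, x)                                   # outputs at current params
    loss_grad = jax.grad(lambda f: loss_fn(f, y))
    g_out = loss_grad(f0)                                    # d(loss)/d(outputs)

    def local_objective(delta):
        # f(params + delta) ~ f(params) + J @ delta, via a Jacobian-vector product.
        _, jd = jax.jvp(lambda p: model(p, x), (params,), (delta,))
        # Second-order Taylor model of the loss in output space (the GN objective):
        # <g, J delta> + 0.5 <J delta, H_loss @ (J delta)>.
        h_jd = jax.jvp(loss_grad, (f0,), (jd,))[1]
        return jnp.vdot(g_out, jd) + 0.5 * jnp.vdot(jd, h_jd)

    # Inner loop: plain gradient descent on the local model (the paper uses Muon).
    delta = jax.tree_util.tree_map(jnp.zeros_like, params)
    for _ in range(inner_steps):
        grads = jax.grad(local_objective)(delta)
        delta = jax.tree_util.tree_map(lambda d, g: d - inner_lr * g, delta, grads)

    # The paper applies a line search here; this sketch simply takes the full step.
    return jax.tree_util.tree_map(lambda p, d: p + d, params, delta)

key = jax.random.PRNGKey(0)
params = {"w1": 0.1 * jax.random.normal(key, (16, 32)),
          "w2": 0.1 * jax.random.normal(key, (32, 8))}
x = jax.random.normal(key, (4, 16))
y = jnp.array([0, 1, 2, 3])
params = gn_outer_step(params, x, y)
```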
What did they find?
Here are the most important results:
- Major speed-up in steps:
- Full Gauss–Newton reached a target validation loss in 54 steps.
- Layerwise Gauss–Newton did it in 78 steps.
- SOAP needed 292 steps.
- In other words, full Gauss–Newton took about 5.4× fewer steps than SOAP, and layerwise Gauss–Newton was close behind.
- Better scaling to huge batch sizes:
- Gauss–Newton kept improving as the batch size grew, pushing the “critical batch size” much higher than AdamW, Muon, and SOAP.
- This means Gauss–Newton can use bigger batches more effectively, which is great for parallel training on many GPUs.
- Layerwise is almost as good as full:
- Even when ignoring cross-layer curvature, the layerwise Gauss–Newton was close to the full method in both steps and final loss.
- That suggests most of the benefit comes from accurate per-layer curvature, not necessarily cross-layer interactions.
- Extra higher-order loss terms didn’t help much:
- The GN-prox-linear method (which keeps more detailed loss information on a linearized model) performed similarly to Gauss–Newton.
- This implies Gauss–Newton’s approximation is already capturing nearly all the useful curvature for faster convergence.
- There’s a gap between today’s approximations and the “oracle” layerwise method:
- Current practical optimizers (like SOAP and Muon) use efficient approximations. The precise layerwise Gauss–Newton showed that, if we could compute better per-layer curvature, we could get substantially better performance than what’s common today.
Why does this matter?
- Faster training: If we can get close to Gauss–Newton’s performance with practical approximations, LLMs could train in far fewer steps, saving time and energy.
- Better use of big hardware: Gauss–Newton handles very large batches well, which makes it easier to use lots of GPUs in parallel without wasting data or slowing down progress.
- Design direction: The results suggest focusing on:
- Accurate layerwise curvature approximations (since they get most of the gains).
- Smart ways to incorporate Gauss–Newton-like preconditioning without computing giant matrices.
- Reality check: The full method they used is slower per step right now and not production-ready. But it sets a target—an “upper bound” of what’s possible. Future optimizers can aim to get close to this performance while being cheaper to run.
In short, the paper shows that second-order methods—especially Gauss–Newton—have a lot of untapped potential for speeding up LLM training. Most of the benefits can be achieved with layerwise curvature, and there’s room to build more practical algorithms that close the gap with this idealized performance.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a focused list of open issues the paper leaves unresolved, organized by theme to guide future research.
Scalability and systems considerations
- Scaling to billion-parameter LLMs: The paper is limited to 45M–150M parameters; it remains unclear whether the proposed full Gauss-Newton (GN) approach (with JVPs, inner optimization, and line search) remains feasible, stable, and effective for 1B–70B+ models under realistic memory and communication constraints.
- Wall-clock and energy efficiency: Results are reported in iteration complexity and tokens; comprehensive wall-clock time, GPU-hours, throughput, and energy comparisons (including line search and JVP overheads) are missing, especially in multi-node settings with data/model parallelism.
- Memory and communication footprint: The actual memory usage and communication overhead of the GN and layerwise GN implementations are not quantified; practical deployment requires profiling peak memory, activations, and cross-device communication costs.
- Distributed implementation details: No analysis of scalability on large clusters (e.g., DDP/ZeRO, pipeline and tensor parallel); it’s unclear how GN’s inner loops and line search interact with distributed training efficiency and straggler effects.
- Numerical stability and precision: The paper does not report dtype/precision (e.g., bf16/fp8) and numerical stability considerations for JVP/HVP-heavy workloads; robustness under mixed-precision is unknown.
Methodological and theoretical questions
- Inner-solver dependence and accuracy: The method’s performance is tightly coupled to Muon as the inner optimizer; there is no systematic study of inner-solve accuracy (e.g., KKT residuals), stopping criteria, and how solve accuracy trades off with outer-loop convergence.
- “Upper bound” claim validation: The assertion that GN's performance is upper-bounded by that of Muon run with batch size b_inner on the true model is not formally justified under line search and approximation mismatch; proofs or counterexamples are needed, along with quantification of the empirical gap to this bound.
- Line search policy and overhead: The choice of line search (candidate step sizes, data used, frequency, stopping conditions) is ad hoc; its compute cost, stability contribution, and sensitivity are not quantified or compared to trust-region/damping alternatives (e.g., Levenberg–Marquardt).
- When do higher-order loss terms matter? GN vs GN-prox-linear appears similar on cross-entropy; it is unknown for other convex losses, non-convex composite objectives, or regimes (e.g., late training) where third/higher-order loss terms might matter; a principled criterion for when to include these terms is missing.
- Spectral and structural analysis: No analysis of curvature spectra, effective condition numbers, or block-structure in attention/MLP to explain why layerwise GN nearly matches full GN; understanding which cross-layer blocks actually matter could inform practical blockwise preconditioners.
- Damping/regularization strategies: The approach uses line search but not systematic damping/trust-region methods; the effect of damping strength, Tikhonov regularization, or LM-style updates on stability and sample efficiency is unexplored.
- Preconditioner variants: Only G^{-1} is used; investigating fractional powers (G^{-α}), adaptive damping, or hybrid Fisher/GN combinations is left open.
- Convergence guarantees: There are no theoretical guarantees for the outer loop (with inexact inner solves and line search) on non-convex neural objectives; conditions ensuring monotone decrease or global convergence are unaddressed.
Generality across tasks, losses, and architectures
- Beyond language modeling on C4: The paper is confined to pretraining on C4; generality to other corpora (code, multilingual, instruction tuning), modalities (vision/speech), or downstream tasks (reasoning, QA, RLHF) is unknown.
- Other loss functions and training regimes: Results hinge on cross-entropy; applicability to other losses (e.g., contrastive, RL objectives), curriculum/mixture-of-objectives, or fine-tuning settings is untested.
- Architectural diversity: Only LLaMA-like transformers (no MoE, long-context models, normalization variants, activation functions) are considered; whether GN gains persist for architectures with different curvature structure is open.
- Sequence length effects: Experiments fix sequence length at 1024; how GN performance and its curvature approximations scale with longer contexts (e.g., 8k–32k) remains unknown.
Batch size scaling and measurement choices
- Mapping “global batch” to practical parallelism: Defining b = N × b_inner (with N inner steps) confounds comparisons with standard large-batch data-parallel training; a principled equivalence between this accumulation scheme and true large minibatches is not established.
- Critical batch size characterization: The paper probes up to 120M tokens but does not report uncertainty or sensitivity across seeds; the robustness of critical batch size estimates and the shape of scaling laws (exponents) remain open.
- Early vs late training dynamics: GN’s largest gains appear early; it is unclear how performance evolves in later phases or with much longer training budgets (e.g., tens of billions of tokens).
Fairness, sensitivity, and reproducibility
- Hyperparameter robustness: While sweeps are performed, differences in schedules (e.g., EWA vs cosine) and line-search availability across methods may bias results; a standardized tuning protocol and ablation on schedule choices are needed for fair comparisons.
- Seed variance and confidence intervals: Results are presented without multiple seeds or confidence intervals; variance in iteration counts and final losses should be quantified.
- Code and reproducibility: Implementation details (e.g., exact data for line search, batching policies, autodiff knobs) and code artifacts needed for reproduction at scale are not released or documented.
Alternatives and comparative baselines
- Comparison to Hessian-free CG and K-FAC: The GN method is not directly compared to strong second-order baselines like damped HF+CG or modern K-FAC variants under matched compute; head-to-head wall-clock and sample-efficiency comparisons are missing.
- Specialized solvers for GN-prox-linear: The convex inner problem is solved with Muon; whether specialized convex solvers (e.g., kernel methods, coordinate descent, proximal methods, or second-order convex solvers) yield better accuracy-speed trade-offs is untested.
- Blockwise and hybrid approximations: Between full and layerwise GN, intermediate blockwise schemes (e.g., attention block coupling, QKV tied blocks, residual coupling) are not explored; identifying minimal cross-layer structure needed for near-GN performance is an open design space.
Generalization and implicit bias
- Effects on generalization and calibration: The impact of GN preconditioning on generalization (beyond validation loss), calibration, and robustness is not assessed; implicit bias differences vs first-order methods remain unexplored.
- Overfitting and regularization: Interactions between GN, weight decay, EWA, and other regularizers are not systematically studied; principled regularization for GN (e.g., curvature-aware decay) is open.
Towards new scaling laws and compute-optimality
- Compute-optimal training with second-order methods: Chinchilla scaling is used as a reference, but the compute-optimal batch/sequence/token trade-offs may differ under GN; deriving and validating new scaling laws for second-order-optimized LLMs is an open direction.
- Token efficiency vs step efficiency: A rigorous framework to trade off sample efficiency (tokens) and iteration efficiency (steps), accounting for GN overheads, is needed to judge end-to-end cost-performance at scale.
Practical Applications
Immediate Applications
The following items can be deployed now to improve training workflows, benchmarking, and optimizer selection, with attention to the paper’s empirical results and practical constraints.
- Build a Gauss–Newton “oracle” benchmarking harness for optimizer selection (industry, academia)
- Use full GN and layerwise GN on small-to-mid models (≤150M params) to estimate lower bounds on iteration complexity and critical batch size, then select/tune practical optimizers (e.g., SOAP, Muon) accordingly.
- Potential tools/products/workflows: a “second‑order oracle” suite that runs short GN episodes to rank candidate optimizers and batch sizes; dashboards for iteration/throughput tradeoffs; extensions to AlgoPerf-style benchmarks.
- Assumptions/dependencies: JVP/HVP support in the stack (e.g., JAX/Neural Tangents or equivalent), multiple GPUs, performance ceiling upper-bounded by inner optimizer; results validated on LLaMA-like architectures and C4.
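A minimal sketch of the kind of steps-to-target and “gap to GN-oracle” report such a harness could produce from logged validation-loss curves; the function names, target loss, and curves below are hypothetical.

```python
# Hypothetical oracle-gap report from logged validation-loss curves.
def steps_to_target(losses, target):
    """1-indexed step at which the loss first reaches `target`, else None."""
    for step, loss in enumerate(losses, start=1):
        if loss <= target:
            return step
    return None

def oracle_gap(curves, target, oracle="gauss_newton"):
    """Ratio of each optimizer's steps-to-target to the oracle's (None if never reached)."""
    ref = steps_to_target(curves[oracle], target)
    report = {}
    for name, losses in curves.items():
        steps = steps_to_target(losses, target)
        report[name] = None if steps is None else steps / ref
    return report

# Made-up curves for illustration; real use would read training logs.
curves = {"gauss_newton": [5.0, 4.0, 3.4, 3.29],
          "soap":         [5.0, 4.5, 4.0, 3.7, 3.5, 3.35, 3.29]}
print(oracle_gap(curves, target=3.3))
```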
- Plan batch-size scaling and gradient accumulation informed by measured critical batch sizes (industry, cloud training ops)
- Apply findings that AdamW saturates near ~4M tokens, SOAP/Muon ~12–40M, while GN extends further; set gradient accumulation and data-parallel strategies to avoid sample-efficiency loss.
- Potential tools: a “critical batch size estimator” script integrated into MLOps pipelines; automated batch-size schedules tied to validation-loss plateaus.
- Assumptions/dependencies: comparable model/dataset setups; monitoring infrastructure for validation‑loss vs steps; large-batch data loaders.
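A minimal sketch of one way to turn a batch-size sweep into a critical-batch-size estimate: walk up the sweep and stop when increasing the batch no longer cuts steps-to-target by a chosen factor. The threshold and numbers are illustrative assumptions, not the paper's fitting procedure.

```python
# Heuristic critical-batch-size estimate from (batch_size_tokens, steps_to_target) pairs.
def estimate_critical_batch_size(measurements, min_gain=0.7):
    """`measurements` must be sorted by batch size; smaller `min_gain` is stricter."""
    for (b_small, s_small), (b_big, s_big) in zip(measurements, measurements[1:]):
        # Perfect scaling would give s_big ~= s_small * b_small / b_big.
        if s_big > min_gain * s_small:
            return b_small  # gains beyond this batch size are strongly diminishing
    return measurements[-1][0]

# Illustrative sweep: token batch sizes vs steps to a fixed validation loss.
sweep = [(1_200_000, 4000), (4_800_000, 1400), (19_200_000, 600), (76_800_000, 520)]
print(estimate_critical_batch_size(sweep))
```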
- Stabilize large-batch training with schedule and line-search strategies distilled from GN runs (software/MLOps)
- Incorporate global cosine or constant+inner cosine schedules, backtracking line search, and warm-starting next inner loop from pre-line-search parameters to reduce instability at large LR.
- Potential tools: optimizer wrappers adding line search and LR schedules; training recipes for SOAP/Muon at large batch sizes.
- Assumptions/dependencies: access to optimizer internals; reliable loss evaluation for line search; telemetry to detect divergence.
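A minimal backtracking line-search sketch along a proposed update direction, in the spirit of the stabilization described above; the shrink factor, acceptance rule, and helper names are generic assumptions rather than the paper's exact recipe.

```python
# Generic backtracking line search: shrink the step until the loss decreases.
import jax
import jax.numpy as jnp

def backtracking_line_search(loss_fn, params, delta, init_step=1.0,
                             shrink=0.5, max_trials=8):
    base = loss_fn(params)
    step = init_step
    for _ in range(max_trials):
        candidate = jax.tree_util.tree_map(lambda p, d: p + step * d, params, delta)
        if loss_fn(candidate) < base:
            return candidate, step
        step *= shrink
    return params, 0.0  # reject the update if no trial step improves the loss

# Toy example: an overshooting descent direction on a quadratic loss.
loss = lambda p: jnp.sum(p["w"] ** 2)
params = {"w": jnp.ones(3)}
delta = {"w": -2.0 * params["w"]}
new_params, used_step = backtracking_line_search(loss, params, delta)  # accepts step 0.5
```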
- Curvature-aware tuning of layerwise optimizers (software)
- Use the paper’s evidence that precise layerwise curvature captures most GN gains to refine Shampoo/SOAP settings (e.g., preconditioner accumulation, eigenbasis stability, regularization).
- Potential tools: “Shampoo/SOAP curator” that auto-tunes per-layer preconditioners based on curvature proxies; periodic eigenbasis diagnostics.
- Assumptions/dependencies: existing Shampoo/SOAP implementations; telemetry of per-layer gradient statistics; consistency with cross-entropy objectives.
- Fast fine-tuning with few steps on small models (education, startups, research labs)
- Exploit GN/GN‑prox‑linear to push rapid adaptation on ≤150M models for domain-specific corpora when wall-clock costs are acceptable and step count must be minimal.
- Potential tools: JAX-based fine-tuning scripts with GN/GN‑prox‑linear; line-search-enabled trainers for quick convergence.
- Assumptions/dependencies: heavy per-step cost (4–5x), JVP/HVP and line search in stack, cross-entropy tasks; diminishing returns for very large models.
- Extend optimization benchmarks to include “oracle gap” metrics (academia, benchmarking orgs)
- Report the gap between practical methods (e.g., SOAP/Muon) and GN/layerwise GN on standard tasks to set targets for optimizer development.
- Potential tools: expanded AlgoPerf-like benchmark track; public leaderboards with “gap to GN-oracle.”
- Assumptions/dependencies: community adoption, reproducible GN baselines, standard datasets (e.g., C4).
- Training-cost and ESG reporting with optimization-aware projections (policy, enterprise governance)
- Use empirically measured iteration reductions and batch-size scaling behaviors to inform energy/cost projections and procurement decisions emphasizing optimizer R&D.
- Potential tools: cost/energy calculators that incorporate optimizer and batch-size effects; policy briefs on optimization’s role in training efficiency.
- Assumptions/dependencies: mapping iteration changes to energy/wall-clock under current hardware; GN’s per-step overhead may offset step savings.
- Curriculum and methodology for teaching modern second-order methods (academia)
- Incorporate GN, layerwise GN, and GN‑prox‑linear into advanced optimization coursework and labs on LLM training dynamics.
- Potential tools: teaching notebooks; assignments using JVP-based GN to study curvature and critical batch size empirically.
- Assumptions/dependencies: suitable hardware for labs; accessible code (e.g., JAX/Neural Tangents).
Long-Term Applications
The items below require further research, scaling, engineering, or ecosystem development before broad deployment.
- Practical layerwise Gauss–Newton optimizer for billion-parameter LLMs (software, cloud)
- Design a memory/computation‑efficient algorithm that more precisely captures per-layer curvature (approaching the paper’s layerwise oracle) and scales distributedly.
- Potential tools/products: “GN-LLM Trainer” library; PyTorch/JAX kernels for efficient HVP/JVP; distributed line-search modules; cloud managed second‑order optimizer service.
- Assumptions/dependencies: efficient HVP/JVP primitives, numerically stable preconditioners, robust distributed implementations, hardware acceleration.
- Adaptive batch-size schedulers that push the critical batch size upward (cloud training ops)
- Develop curvature-aware controllers that adjust batch size and LR on-the-fly to stay near optimal scaling and minimize sample-efficiency loss.
- Potential products: “Adaptive CBS Scheduler” integrated into training platforms; ML-based controllers trained against GN-oracle targets.
- Assumptions/dependencies: live curvature proxies; reliable loss/gradient telemetry; safe adjustments without destabilizing training.
- Meta-optimization (AutoML for optimizers) using GN-oracle guidance (software/MLOps)
- Use short GN runs to evaluate candidate optimizer configurations, then select per-layer strategies (e.g., Shampoo vs SOAP variants) to minimize the oracle gap.
- Potential tools: optimizer AutoML that searches LR schedules, momentum, preconditioner parameters; CI/CD integration for model training recipes.
- Assumptions/dependencies: compute budget for oracle evaluations; generalization from small-oracle runs to full-scale training.
- Energy- and cost-efficient training policies and standards (policy, consortia)
- Draft procurement guidelines and sustainability standards encouraging adoption of curvature-aware optimization research and reporting of optimizer choices and batch-size scaling.
- Potential outputs: standard metrics for “optimizer efficiency,” reporting templates, incentives for optimization R&D in grants.
- Assumptions/dependencies: large-scale validations; stakeholder buy-in; standardized benchmarking.
- Hardware–software co-design for second-order primitives (semiconductors, cloud)
- Accelerate JVP/HVP and line-search evaluations via specialized kernels or instruction support; rework memory hierarchies to accommodate curvature computations.
- Potential products: GPU/TPU libraries optimized for second-order ops; compiler passes that fuse curvature computations.
- Assumptions/dependencies: vendor collaboration; proven demand from training teams; stability of numerical routines at scale.
- Curvature-informed training curricula and scheduling across phases (LLM training pipelines)
- Integrate second-order phases (e.g., early rapid progress with GN-like steps, then cheaper first-order phases) to reduce total wall-clock while maintaining stability.
- Potential workflows: phased training recipes; automated transitions based on curvature/instability signals.
- Assumptions/dependencies: robust phase-change criteria; consistent gains across datasets/objectives; orchestration support.
- Domain-specific model training acceleration (healthcare, finance, robotics)
- Apply advanced layerwise second-order methods to speed pretraining/fine-tuning of domain LLMs where data and compute are costly, reducing time-to-deployment.
- Potential tools/products: verticalized training services; optimizer presets for medical/financial corpora; rapid compliance documentation enabled by shorter training cycles.
- Assumptions/dependencies: method generalizes beyond C4; domain objectives remain compatible (often cross-entropy-like); privacy/compute constraints.
- Edge/on-device fine-tuning with efficient second-order approximations (mobile/IoT)
- Create compressed, approximate curvature methods enabling small-step adaptation on the edge (e.g., personalization) with limited compute budgets.
- Potential products: lightweight curvature-aware fine-tuning SDKs; hybrid server–edge workflows where server estimates curvature and ships updates.
- Assumptions/dependencies: extreme resource constraints; robust approximations; secure update mechanisms.
Notes on Assumptions and Dependencies Across Applications
- Scalability: Evidence is from 45M–150M param LLaMA models on C4; behavior at billion-scale is promising but unverified.
- Loss/architecture: Results assume cross-entropy with LLaMA-like transformers; other objectives or architectures may differ in PSD properties and curvature behavior.
- Implementation overhead: Full GN is 4–5x slower per step in the paper; net wall-clock gains depend on practical approximations approaching the layerwise oracle.
- Inner optimizer dependence: GN performance is upper-bounded by the inner optimizer (Muon in the paper); swapping inner optimizers changes ceilings.
- Stability tooling: Backtracking line search and warm-starting inner loops were critical for stability; productionization requires robust implementations and monitoring.
- Ecosystem: Efficient JVP/HVP and distributed second-order support in mainstream frameworks are prerequisites for broad deployment.
Glossary
- AdaGrad: An adaptive gradient optimizer that accumulates past squared gradients to adjust per-parameter learning rates. "AdaGrad \citep{JMLR:v12:duchi11a} maintains an accumulator over the vectorized gradient ."
- AdamW: A variant of Adam that decouples weight decay from the gradient-based update to improve generalization. "We run AdamW \citep{adamw}, Muon \citep{jordan2024muon}, and SOAP \citep{vyas2025soapimprovingstabilizingshampoo} as baselines."
- AlgoPerf: A benchmark suite for comparing optimization algorithms across tasks and models. "Shampoo won the recent optimization algorithms benchmark called AlgoPerf \citep{Kasimbeg2025AlgoPerfResults}, outperforming Adam by a margin of 28\%."
- Chinchilla-optimal: Scaling laws that prescribe optimal data and model size trade-offs to maximize performance under a compute budget. "We train 150M-parameter models for 3B tokens following Chinchilla-optimal scaling laws \citep{hoffmann2022trainingcomputeoptimallargelanguage}, ranging batch size from 1.2M to 120M tokens."
- Conjugate gradient (CG): An iterative method for solving large linear systems, often used to approximately compute Newton steps without forming matrices. "prior work on Hessian-free optimizers use the conjugate gradient (CG) to solve an incomplete (unconverged) optimization of the Newton step rather than storing an approximation to the Hessian."
- Critical batch size: The batch size beyond which increasing the batch provides little to no reduction in steps or worsens sample efficiency. "we use a batch size significantly beyond that method's critical batch size \citep{mccandlish2018empiricalmodellargebatchtraining, shallue2019measuringeffectsdataparallelism} such that further increasing batch size does not reduce the number of training steps needed to achieve a given performance."
- Data parallelism: A training paradigm that distributes mini-batches across workers; its efficiency degrades beyond the critical batch size. "We also analyze how this idealized method affects the critical batch size~\citep{mccandlish2018empiricalmodellargebatchtraining, shallue2019measuringeffectsdataparallelism,jain2018parallelizing}, a key measure of data parallelism efficiency."
- Gauss-Newton matrix: A positive semi-definite approximation to the Hessian that captures the curvature of the loss with respect to model outputs while ignoring model curvature. "The Gauss-Newton matrix is defined to be the first term of the following decomposition of the Hessian,"
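For reference, this decomposition is commonly written as follows for a loss ℓ composed with model outputs f(θ) (notation mine, not copied from the paper); the first term is the Gauss-Newton matrix G, which is PSD whenever ℓ is convex and which replaces the Hessian in the Newton-style update.

```latex
\nabla^2_{\theta}\,\ell\big(f(\theta)\big)
  = \underbrace{J_f(\theta)^{\top}\,\nabla^2_{f}\ell\;J_f(\theta)}_{G(\theta)\ \text{(Gauss-Newton)}}
  + \sum_{i}\frac{\partial \ell}{\partial f_i}\,\nabla^2_{\theta} f_i(\theta),
\qquad
\theta_{t+1} = \theta_t - \eta\,G(\theta_t)^{-1}\,\nabla_{\theta}\ell .
```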
- GN-prox-linear: A prox-linear variant of Gauss-Newton that minimizes the original convex loss over the linearized model instead of using a second-order loss approximation. "A prox-linear version of Gauss-Newton (GN-prox-linear) \citep{Burke85, drusvyatskiy2017proximalpointmethodrevisited}, which utilizes the higher order information of the loss function itself (see Algorithms~\ref{algorithm:gn} and ~\ref{algorithm:lm} for comparison)."
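The textbook prox-linear subproblem in those references keeps the true convex loss but evaluates it on the linearized model, with a proximal term controlling the step (my notation; the paper's variant may control the step differently, e.g., via its inner optimizer and line search).

```latex
\theta_{t+1} = \arg\min_{\theta}\;
  \ell\!\big(f(\theta_t) + J_f(\theta_t)\,(\theta-\theta_t)\big)
  + \tfrac{1}{2\eta}\,\lVert \theta-\theta_t \rVert^{2} .
```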
- Gradient accumulation: A technique to simulate large batch sizes by accumulating gradients over multiple smaller steps before updating parameters. "Since batch sizes are increased using gradient accumulation in our experiments, we choose batch size based on each method's critical batch size to save compute."
- Hessian-free optimization: Second-order optimization that avoids explicitly forming the Hessian by using Hessian-vector products within iterative solvers. "Most related to our work is Hessian-free optimization, which avoids explicit Hessian formation by leveraging Hessian-vector products \citep{hessianfree}."
- Hessian-vector products (HVPs): Products of the Hessian with a vector that enable curvature-aware updates without forming the Hessian. "Most related to our work is Hessian-free optimization, which avoids explicit Hessian formation by leveraging Hessian-vector products \citep{hessianfree}."
- Iteration complexity: The number of optimization steps required to reach a target performance (e.g., a specific validation loss). "we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters."
- Jacobian-vector products (JVPs): Efficient operations that multiply the Jacobian by a vector, used to compute curvature-related terms without materializing large matrices. "we instead run a functionally equivalent method that leverages Jacobian-vector products (JVPs) to avoid explicitly storing the Hessian."
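A minimal JAX illustration of both primitives on toy functions (the functions and shapes are assumptions, not from the paper): jax.jvp gives J·v directly, and applying it to a gradient function gives H·v without forming the Hessian.

```python
# Jacobian- and Hessian-vector products without materializing J or H.
import jax
import jax.numpy as jnp

# JVP: J_f(x) @ v for a toy vector-valued function f (forward-mode autodiff).
f = lambda x: jnp.stack([x[0] * x[1], jnp.sin(x[1])])
x = jnp.array([2.0, 0.5])
v = jnp.array([1.0, -1.0])
_, jv = jax.jvp(f, (x,), (v,))

# HVP: H_loss(w) @ u by differentiating the gradient along direction u
# (forward-over-reverse differentiation).
loss = lambda w: jnp.sum(jnp.cos(w)) + jnp.sum(w ** 4)
w = jnp.array([0.3, -1.2, 2.0])
u = jnp.array([1.0, 0.0, 0.0])
_, hu = jax.jvp(jax.grad(loss), (w,), (u,))

print(jv, hu)
```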
- Kernelized classification: A framework that uses kernel methods to perform classification in high-dimensional feature spaces induced by kernels. "Moreover, this problem is related to the richly studied literature of kernelized classification \citep{shaikernelclassificatin} (albeit with cross-entropy loss, instead of the max-margin loss)."
- Layerwise Gauss-Newton: A variant that applies Gauss-Newton updates separately per layer, ignoring cross-layer curvature, to reduce computational and memory costs. "Gauss-Newton and Layerwise Gauss-Newton reach the target loss in 54 and 78 steps respectively, compared to 292 steps for SOAP."
- Learning rate schedules: Prescribed variations of the learning rate over training to improve convergence and stability (e.g., cosine schedules). "We experiment with three learning rate schedules for Gauss-Newton runs, which we refer to as “global cosine,” “global+inner cosine,” and “constant+inner cosine.”"
- Line search: A procedure that selects the step size along a proposed update direction by minimizing the loss, improving stability. "We then update the model parameters using a line search."
- Muon: An optimizer that orthonormalizes gradient updates (via Newton-Schulz) and can be viewed as Shampoo without preconditioner accumulation. "Muon \citep{jordan2024muon} tracks the first moment of the gradient, denoted , and performs an orthonormal update:"
- Neural-tangents: A library for computing neural tangent kernels and related Taylor approximations for models. "To compute the necessary Taylor approximations we use the neural-tangents library from \cite{neuraltangents2020}."
- Newton-Schulz orthogonalization procedure: An iterative matrix method used to produce orthonormal updates from gradient information. "where denotes a Newton-Schulz orthogonalization procedure."
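A minimal sketch of the classical (cubic) Newton-Schulz iteration, which drives a matrix toward its nearest orthogonal factor; Muon's public implementation uses a tuned higher-order polynomial, so the update rule and iteration count here are illustrative rather than Muon's exact recipe.

```python
# Classical Newton-Schulz iteration: pushes X toward the orthogonal polar factor of G.
import jax.numpy as jnp

def newton_schulz_orthogonalize(g, steps=10):
    # Scale so all singular values are < 1 (the Frobenius norm bounds the spectral norm).
    x = g / (jnp.linalg.norm(g) + 1e-7)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

g = jnp.array([[3.0, 1.0], [0.0, 2.0]])
q = newton_schulz_orthogonalize(g)
print(q @ q.T)  # approximately the identity
```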
- Newton's method: A second-order optimization technique that updates parameters using the inverse Hessian times the gradient. "This is known as Newton's method, and results in the following update rule:"
- Positive semi-definite (PSD): A matrix property where all eigenvalues are non-negative, ensuring well-behaved quadratic forms. "the Hessian is not guaranteed to be positive semi-definite (PSD), and therefore Newton's method does not guarantee that the loss decreases in each iteration or even converges."
- Preconditioner: A matrix or transformation applied to gradients to improve conditioning and accelerate convergence. "Shampoo maintains a separate preconditioner for each dimension of the weight matrix:"
- Shampoo: A second-order optimizer that maintains layerwise left/right preconditioners, approximating Gauss-Newton curvature efficiently. "Shampoo \citep{gupta2018shampoopreconditionedstochastictensor} was originally motivated by AdaGrad, but can be viewed as an approximation of the Gauss-Newton component of the Hessian"
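A single-matrix sketch of Shampoo-style preconditioning (no momentum, grafting, or the production inverse-root solvers; hyperparameters and the toy gradient are illustrative): accumulate left/right statistics from the gradient and precondition with their inverse fourth roots.

```python
# Single-layer Shampoo sketch: precondition the gradient with L^{-1/4} G R^{-1/4}.
import jax.numpy as jnp

def matrix_power(sym, power, eps=1e-6):
    # Power of a symmetric PSD matrix via eigendecomposition.
    vals, vecs = jnp.linalg.eigh(sym)
    return (vecs * jnp.power(jnp.maximum(vals, 0.0) + eps, power)) @ vecs.T

def shampoo_update(grad, L, R, lr=0.1):
    L = L + grad @ grad.T                     # left statistic  (m x m)
    R = R + grad.T @ grad                     # right statistic (n x n)
    precond = matrix_power(L, -0.25) @ grad @ matrix_power(R, -0.25)
    return -lr * precond, L, R

m, n = 4, 3
L, R = jnp.zeros((m, m)), jnp.zeros((n, n))
grad = jnp.arange(12.0).reshape(m, n) / 10.0
step, L, R = shampoo_update(grad, L, R)
```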
- SOAP: A variant of Shampoo that performs AdamW updates in Shampoo’s eigenbasis to stabilize and improve performance. "SOAP \citep{vyas2025soapimprovingstabilizingshampoo} is a recent variant of Shampoo that runs AdamW \citep{adamw} in the eigenbasis provided by Shampoo."
- Taylor expansion: A local polynomial approximation of a function using its derivatives; used here for first-order model and second-order loss approximations. "be the loss function on the first-order Taylor expansion of around current parameters ."
- Transformer models: Attention-based neural network architectures widely used in LLMs. "applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters."
- Weight averaging: Exponential moving average of parameters to stabilize training and improve generalization. "For baselines, weight averaging is used only with the constant schedule."
- Weight decay: L2 regularization applied via parameter shrinkage to improve generalization and stability. "we add weight decay to the optimizer as well as a weight decay term to the loss, which adds regularization on the norm of the magnitude of the parameter update."