Levenberg-Marquardt Damping

Updated 9 November 2025
  • Levenberg-Marquardt damping is a regularization technique in nonlinear optimization that balances curvature information with trust-region adjustments for robust, adaptive updates.
  • It adaptively tunes a scalar damping parameter using gain ratios or trust-region tests to stabilize the solution process in noisy, ill-posed, or high-dimensional problems.
  • Variants such as stochastic, Riemannian, and diffusion-based LM extend the method to handle model inaccuracies and improve convergence rates in practical, complex applications.

Levenberg-Marquardt Damping is a foundational technique in nonlinear optimization and regularized inverse problems, central to controlling the trade-off between curvature-based (Gauss-Newton) updates and conservative trust-region or step-size-limited updates in iterative methods for nonlinear least squares and their extensions. Damping, or regularization, is enforced by augmenting the Gauss-Newton (or Hessian) subproblem with a scalar multiple of the identity. This promotes numerical stability, bounds the updates, and, when adaptively tuned, yields robust global convergence with strong complexity guarantees even in stochastic, ill-posed, and constrained settings.

1. Mathematical Definition and Core Update Mechanism

In the canonical Levenberg-Marquardt (LM) method, each iteration forms a regularized subproblem for the nonlinear least-squares objective

$$\min_x\ \tfrac12 \|F(x)\|^2$$

with model

$$m_k(s) = \tfrac12 \|F(x_k) + J(x_k)s\|^2 + \tfrac12 \gamma_k \|s\|^2$$

where $J(x_k)$ denotes the Jacobian at $x_k$ and $\gamma_k > 0$ is the damping or regularization parameter. The subproblem yields the update

$$(J_k^T J_k + \gamma_k I)\, s_k = -J_k^T F(x_k).$$

For small $\gamma_k$, this is effectively a Gauss-Newton update (fast, potentially unstable); for large $\gamma_k$, the update resembles steepest (gradient) descent (robust, slow). The mechanism for updating $\gamma_k$, classically a gain-ratio or trust-region test, is precisely what is called LM damping.
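
The interpolation between the two regimes can be seen directly by solving the damped normal equations for different values of $\gamma_k$. The following minimal NumPy sketch uses a hypothetical two-dimensional toy residual (not taken from any cited work) purely to illustrate the effect of small versus large damping:

```python
import numpy as np

def lm_step(F, J, x, gamma):
    """One damped LM step: solve (J^T J + gamma*I) s = -J^T F(x)."""
    r, Jk = F(x), J(x)
    A = Jk.T @ Jk + gamma * np.eye(x.size)   # damped Gauss-Newton system
    return np.linalg.solve(A, -Jk.T @ r)

# Hypothetical toy residual, chosen only to show the two regimes:
F = lambda x: np.array([x[0]**2 + x[1] - 3.0, x[0] + x[1]**2 - 5.0])
J = lambda x: np.array([[2.0 * x[0], 1.0], [1.0, 2.0 * x[1]]])

x0 = np.array([1.0, 1.0])
print(lm_step(F, J, x0, gamma=1e-3))  # small gamma: close to the Gauss-Newton step
print(lm_step(F, J, x0, gamma=1e3))   # large gamma: short, gradient-descent-like step
```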

Multiple adaptive updates exist:

  • In classical deterministic LM, $\gamma_k$ is decreased after a successful step (actual reduction consistent with or exceeding the predicted reduction) and increased otherwise, typically via

$$\gamma_{k+1} = \begin{cases} \gamma_k / \lambda, & \text{if the step is successful} \\ \lambda\,\gamma_k, & \text{otherwise} \end{cases}$$

with $\lambda > 1$.

A crucial functional form is

$$\gamma_k = \mu_k \|J_k^T r_k\|$$

or

$$\gamma_k = \mu_k \|F(x_k)\|^2$$

with $\mu_k$ updated by multiplicative factors according to acceptance criteria involving actual-to-predicted reduction ratios.
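
Putting these pieces together, the sketch below implements a basic LM loop with the $\gamma_k = \mu_k \|J_k^T r_k\|$ parameterization and a gain-ratio acceptance test. The factor $\lambda = 2$, the floor $\mu_{\min}$, and the acceptance threshold are illustrative defaults rather than values prescribed by any particular reference:

```python
import numpy as np

def lm_solve(F, J, x0, mu=1.0, lam=2.0, mu_min=1e-8, tol=1e-8, max_iter=100):
    """Levenberg-Marquardt with gamma_k = mu_k * ||J^T r|| damping and a
    gain-ratio test. Factors and thresholds are illustrative choices."""
    x = x0.copy()
    for _ in range(max_iter):
        r, Jk = F(x), J(x)
        g = Jk.T @ r                                      # gradient of 0.5*||F(x)||^2
        if np.linalg.norm(g) < tol:
            break
        gamma = mu * np.linalg.norm(g)                    # damping tied to the gradient norm
        s = np.linalg.solve(Jk.T @ Jk + gamma * np.eye(x.size), -g)
        pred = 0.5 * np.linalg.norm(r)**2 - 0.5 * np.linalg.norm(r + Jk @ s)**2
        actual = 0.5 * np.linalg.norm(r)**2 - 0.5 * np.linalg.norm(F(x + s))**2
        rho = actual / max(pred, 1e-16)                   # gain ratio
        if rho > 0:                                       # successful step: accept, relax damping
            x = x + s
            mu = max(mu_min, mu / lam)
        else:                                             # unsuccessful step: reject, tighten damping
            mu = lam * mu
    return x
```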

2. Adaptive Damping, Trust-Region Connection, and Probabilistic Extensions

LM damping is fundamentally connected to the trust-region radius. The subproblem

$$\min_s\ \tfrac12 \|r(x_k) + J(x_k)s\|^2 \quad \text{s.t.}\quad \|s\| \leq \Delta_k$$

is equivalent—in terms of optimality measures and update policies—to the penalized LM formulation with

$$\gamma_k \approx \|J(x_k)^T r(x_k)\| / \Delta_k,$$

so that shrinking $\gamma_k$ corresponds to expanding the trust region, and vice versa.

Stochastic generalizations (Bergou et al., 2018) replace exact models and function values with random surrogates, enforcing accuracy, descent, and stationarity conditions only in expectation or with high probability. Damping (here, $\gamma_j = \mu_j \|J_{m_j}^T r_{m_j}\|$) is crucial for robust convergence in noisy, data-unstable, or oracle-inexact regimes. The update

$$\mu_{j+1} = \begin{cases} \max\{\mu_{\min},\ \mu_j/\lambda\}, & \text{if the step is accepted and the gradient norm is above a threshold} \\ \lambda\,\mu_j, & \text{otherwise} \end{cases}$$

is shown to yield worst-case complexity matching deterministic methods under appropriate probabilistic accuracy (Bergou et al., 2018).
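
For concreteness, the acceptance-dependent update of $\mu_j$ described above can be written as a small helper. The threshold value and multiplicative factor below are placeholders, not the constants used by Bergou et al. (2018):

```python
def stochastic_lm_damping_update(mu, step_accepted, grad_norm,
                                 eps=1e-4, lam=2.0, mu_min=1e-8):
    """Damping update of the form used in stochastic LM: relax mu only when the
    step is accepted AND the (inexact) gradient norm is still above a threshold;
    otherwise tighten. eps, lam, mu_min are illustrative values.
    The resulting gamma_j = mu * ||J^T r|| then enters the damped subproblem."""
    if step_accepted and grad_norm > eps:
        return max(mu_min, mu / lam)
    return lam * mu
```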

A table comparing representative parameterizations:

| Method / Ref | Damping Formulation | Update Mechanism |
| --- | --- | --- |
| Classical LM | scalar $\gamma_k$ | Ratio/gain-based |
| Stochastic LM | $\gamma_j = \mu_j \lVert J^T r \rVert$ | Probabilistic ratio test |
| Riemannian LM | $\lambda_k = \mu_k \lVert F(x_k) \rVert^2$ | Trust-region style |
| Prox-linear LM | $\mu_k = \rho \sqrt{F(x_k) - F^*}$ | Backtracking/monotonicity |
| Inertial LM | Fixed sequence $\{\lambda_k\}$ | No adaptation |

3. Damping in Variants: Stochastic, Inertial, Constrained, and Riemannian LM

Stochastic LM (Bergou et al., 2018) requires a damping rule that behaves reliably under random model/gradient or function-value noise. The "memory" parameter (e.g., the last successful $\mu_k$) and the probabilistic enforcement of stationarity and descent are the key enhancements relative to purely deterministic counterparts.

Inertial LM (Leitão et al., 11 Jun 2024) and range-relaxed variants (Leitao et al., 2020) may forgo per-iteration adaptive updating of $\lambda_k$, instead requiring only a priori lower (and possibly upper) bounds to ensure uniform invertibility and convergence.

Constraint handling via majorization–minimization (MM-LM; Marumo et al., 2020) requires $\lambda_k$ to be large enough that the quadratic model majorizes the true objective, typically

$$\lambda_k = M_k \|F(x_k)\|$$

with $M_k$ increased adaptively based on an upper-bounding acceptance test.
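
A hedged sketch of this mechanism follows, with an assumed doubling factor and an explicit upper-bounding test at the trial point; the exact acceptance condition in Marumo et al. (2020) may differ in detail:

```python
import numpy as np

def mm_lm_step(F, J, x, M, inc=2.0, max_tries=50):
    """One MM-LM step (sketch): lambda_k = M_k * ||F(x_k)||, with M_k increased
    until the damped quadratic model upper-bounds (majorizes) the objective at
    the trial point. The doubling factor and retry cap are illustrative choices."""
    r, Jk = F(x), J(x)
    for _ in range(max_tries):
        lam = M * np.linalg.norm(r)
        s = np.linalg.solve(Jk.T @ Jk + lam * np.eye(x.size), -Jk.T @ r)
        model = 0.5 * np.linalg.norm(r + Jk @ s)**2 + 0.5 * lam * np.linalg.norm(s)**2
        if 0.5 * np.linalg.norm(F(x + s))**2 <= model:   # majorization holds: accept
            return x + s, M
        M *= inc                                          # model did not majorize: increase M_k
    return x, M                                           # fallback after max_tries
```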

Riemannian LM (Adachi et al., 2022) employs

$$\lambda_k = \mu_k \|F(x_k)\|^2$$

with trust-region–like gain ratios, enabling both global and local convergence under error-bound assumptions. This generalizes LM damping to manifold optimization, retaining complexity and local rate guarantees without explicit manifold Hessians.
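
As a concrete (and entirely illustrative) instance, the sketch below performs one such step on the unit sphere, solving the damped subproblem in tangent-space coordinates and mapping back with the normalization retraction. The choice of manifold, basis construction, and retraction are assumptions for illustration, not the general setting of Adachi et al. (2022):

```python
import numpy as np

def riemannian_lm_step(F, J, x, mu):
    """One Riemannian LM step on the unit sphere (illustrative manifold choice):
    damping lambda_k = mu_k * ||F(x_k)||^2, subproblem solved in tangent-space
    coordinates, step mapped back via the normalization retraction."""
    r, Jk = F(x), J(x)
    # Orthonormal tangent-space basis at x: rows 1..n-1 of V^T from the SVD of x^T.
    _, _, Vt = np.linalg.svd(x.reshape(1, -1))
    B = Vt[1:].T                                  # shape (n, n-1), columns orthogonal to x
    Jt = Jk @ B                                   # Jacobian restricted to the tangent space
    lam = mu * np.linalg.norm(r)**2               # damping tied to the squared residual norm
    s_coord = np.linalg.solve(Jt.T @ Jt + lam * np.eye(B.shape[1]), -Jt.T @ r)
    x_new = x + B @ s_coord                       # move along the tangent direction
    return x_new / np.linalg.norm(x_new)          # retract back onto the sphere
```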

Prox-linear/Generalized LM (Marumo et al., 2022) invokes a damping parameter

$$\mu_k = \rho \sqrt{F(x_k) - F^*}$$

linked to the current objective gap and adjusted by backtracking to enforce sufficient decrease and optimal subproblem solvability, supporting local quadratic rates and optimal oracle complexity bounds.
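
A minimal sketch of this scheme, assuming a known lower bound $F^*$ (zero for consistent zero-residual problems) and an illustrative sufficient-decrease test; the backtracking factor and acceptance rule are placeholders rather than the exact conditions of Marumo et al. (2022):

```python
import numpy as np

def prox_linear_lm_step(F_res, J, obj, x, rho, F_star=0.0, inc=2.0, max_bt=20):
    """One generalized/prox-linear LM step (sketch): mu_k = rho * sqrt(f(x_k) - F*),
    with rho increased by backtracking until a sufficient-decrease test passes.
    F_star, the increase factor, and the acceptance test are illustrative assumptions."""
    r, Jk = F_res(x), J(x)
    fx = obj(x)                                   # objective value, e.g. 0.5*||F(x)||^2
    for _ in range(max_bt):
        mu = max(rho * np.sqrt(max(fx - F_star, 0.0)), 1e-12)   # small floor for solvability
        s = np.linalg.solve(Jk.T @ Jk + mu * np.eye(x.size), -Jk.T @ r)
        if obj(x + s) <= fx - 0.5 * mu * np.linalg.norm(s)**2:   # sufficient decrease
            return x + s, rho
        rho *= inc                                               # backtrack: tighten damping
    return x, rho                                                # fallback after max_bt
```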

4. Theoretical Properties, Complexity, and Convergence Guarantees

Across deterministic and stochastic LM frameworks, adaptive damping with appropriate lower bounds provides uniform invertibility of regularized normal equations and promotes monotonic descent, even in the presence of noise or ill-posedness.

Typical complexity results guarantee an $\varepsilon$-stationary point within $\mathcal{O}(\varepsilon^{-2})$ iterations in the deterministic setting, with matching worst-case bounds for the stochastic variants under appropriate probabilistic accuracy (Bergou et al., 2018).

For regularization in ill-posed settings, simply imposing a fixed lower bound (determined by bounds on derivative norms and problem constants) suffices for monotonicity and stability (Leitao et al., 2020, Leitão et al., 11 Jun 2024). Range-relaxed criteria further adapt $\lambda_k$ to land the linearized residual within a computable interval, decreasing computational search overhead while ensuring step-size control.
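
Because the linearized residual $\|F(x_k) + J(x_k)s(\lambda)\|$ is monotone in $\lambda$, a range-relaxed choice can be implemented with a simple log-scale bisection, as sketched below. The bracket endpoints and iteration cap are assumptions, and the target interval $[c_k, d_k]$ is taken as given and attainable:

```python
import numpy as np

def range_relaxed_lambda(F, J, x, c, d, lam_lo=1e-8, lam_hi=1e8, iters=60):
    """Bisection sketch of a range-relaxed damping choice: find a lambda whose
    linearized residual ||F(x) + J(x) s(lambda)|| lies in [c, d]. The residual
    grows monotonically with lambda, so bisection applies. Brackets and the
    iteration count are illustrative; [c, d] is assumed attainable."""
    r, Jk = F(x), J(x)

    def lin_res(lam):
        s = np.linalg.solve(Jk.T @ Jk + lam * np.eye(x.size), -Jk.T @ r)
        return np.linalg.norm(r + Jk @ s), s

    lo, hi = lam_lo, lam_hi
    for _ in range(iters):
        lam = np.sqrt(lo * hi)                    # bisect on a log scale
        res, s = lin_res(lam)
        if c <= res <= d:
            return lam, s                         # any lambda in the admissible range works
        if res > d:
            hi = lam                              # residual too large: decrease damping
        else:
            lo = lam                              # residual too small: increase damping
    return lam, s                                 # best bracket midpoint after iters steps
```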

5. Empirical and Application-Oriented Insights

Empirical studies (Bergou et al., 2020, Marumo et al., 2020, Li et al., 2022) show that the practical tuning and adaptation of the damping parameter can have a dramatic effect on convergence rate, stability, and ease of implementation:

  • SLM ("step-size" LM) uses a fixed damping parameter but adaptively scales the trial step, which empirically aids convergence in high-dimensional robot calibration tasks (Li et al., 2022).
  • Range-relaxed and fixed-lower-bound algorithms outperform dynamic gain-based LM when the noise model or ill-posedness induces instability (Leitao et al., 2020, Leitão et al., 11 Jun 2024).
  • For strongly nonlinear problems or poorly conditioned Jacobians, gain-ratio adaptation prevents divergence, facilitating stable convergence even when standard Gauss-Newton fails (Nadjiasngar et al., 2011).
  • MM-LM and prox-linear (APG-based) LM variants accelerate convergence on constrained and/or high-dimensional problems and display robustness to initialization and parameter scaling (Marumo et al., 2020, Marumo et al., 2022).

6. Damping in High-Dimensional and Diffusion-Based Methods

Recent work explores LM-type damping in contexts well beyond classical nonlinear least-squares. In high-dimensional diffusion models, LM-Langevin algorithms employ a low-rank Gauss–Newton Hessian approximation, regularized by a damping scalar:

$$H_{\rm LM}(x_t) = \frac{1}{\sigma^2 \|s_\theta\|^2}\, s_\theta s_\theta^\top + \lambda I$$

with $\lambda$ selected via a quick binary search to optimize sample quality. Critically, $\lambda$ not only stabilizes the inversion of near-singular preconditioners but also admits theoretical guarantees on exponential ergodicity, stationarity, and bounded condition numbers (Wang et al., 30 May 2025).

Empirical selections of $\lambda$ in generative models are based on minimizing downstream error metrics (FID) and are reported in standard ranges per model, typically $10^{-3}$ to $10^{-2}$. The approach smoothly interpolates between traditional Langevin diffusion (large $\lambda$) and a Newton/Langevin hybrid (small $\lambda$) and is agnostic to training details, leveraging score network outputs directly.
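
Because $H_{\rm LM}$ is a rank-one matrix plus a scaled identity, its inverse can be applied in $O(d)$ without ever forming a $d \times d$ matrix. The sketch below does this with the Sherman-Morrison identity, which is our illustration of the structure rather than necessarily how the cited work implements it:

```python
import numpy as np

def lm_preconditioner_apply(score, sigma, lam, v):
    """Apply H_LM^{-1} v for H_LM = (1/(sigma^2 ||s||^2)) s s^T + lambda*I
    via the Sherman-Morrison formula, so no d x d matrix is formed.
    This is a structural sketch, not the authors' implementation."""
    s = score
    alpha = 1.0 / (sigma**2 * (s @ s))            # rank-one coefficient
    # (lam*I + alpha s s^T)^{-1} v = v/lam - alpha*(s@v)/(lam*(lam + alpha*s@s)) * s
    denom = lam * (lam + alpha * (s @ s))
    return v / lam - (alpha * (s @ v) / denom) * s
```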

7. Summary Table: Damping Strategies and Their Domains

| Setting / Variant | Damping Rule | Update Type | Complexity / Guarantee |
| --- | --- | --- | --- |
| Classical LM | Adaptive $\gamma_k$ via gain ratios | Ratio-based | $\mathcal{O}(\varepsilon^{-2})$ |
| Stochastic LM | $\gamma_j = \mu_j \lVert J^T r \rVert$ | Probabilistic/ratio | Matches deterministic LM |
| Range-Relaxed LM | Any $\lambda_k$ putting the residual in $[c_k, d_k]$ | Interval search | Geometric decay, monotonicity |
| MM-LM / Prox-linear | $\lambda_k = M_k \lVert F_k \rVert$ (MM-LM) or $\mu_k = \rho \sqrt{F(x_k) - F^*}$ | Acceptance test/backtracking | Quadratic local, $\mathcal{O}(\varepsilon^{-2})$ |
| Riemannian LM | $\lambda_k = \mu_k \lVert F(x_k) \rVert^2$ | Trust-region style | Global/local under error bound |
| Inertial LM | Fixed lower ($>0$) and upper bounds, no per-step adaptation | Regularization | Strong/semi-convergence |
| Diffusion LM | $H_{\rm LM} =$ low-rank $+\ \lambda I$ | Fixed scalar (empirical) | Exponential ergodicity, empirical FID |

Levenberg-Marquardt damping, as formalized and extended in these works, reveals a unifying principle: carefully regularized, adaptively controlled curvature information enables globally convergent, robust, and locally fast optimization across a wide spectrum of nonlinear and stochastic models. Adaptive damping rules grounded in gain-ratio logic, model-majorization, or fixed lower/upper bounds translate into provable and empirically strong performance, even as the underlying problems grow in noise, ill-posedness, nonlinearity, and dimension.
