
Newton-Kaczmarz Algorithm

Updated 11 December 2025
  • Newton-Kaczmarz Algorithm is an iterative projection method that hybridizes Newton's method and the Kaczmarz technique to solve nonlinear systems one equation at a time.
  • It updates parameters by sequentially linearizing scalar equations using row-vector pseudoinverses, thereby reducing computational burden without full Jacobian inversion.
  • Applied to Kolmogorov-Arnold models, the method enables robust, efficient parameter estimation in large-scale regression problems, with improved convergence when the initial guess is far from the true solution.

The Newton-Kaczmarz (NK) algorithm is an iterative projection-based method for solving nonlinear systems of equations, developed as a hybridization of Newton's method and the classical Kaczmarz row-action technique. Its principal application, as presented by Poluektov & Polar, is the efficient estimation of parameters in so-called Kolmogorov-Arnold models—structured representations of multivariate functions via compositions of univariate functions, as guaranteed by the Kolmogorov-Arnold theorem. The NK method linearizes and optimizes one scalar equation at a time, thus avoiding the explicit computation and inversion of full Jacobian matrices, and is particularly well-suited for large-scale regression problems where the number of equations or data records is considerable (Poluektov et al., 2023).

1. Mathematical Formulation

Given a system of $N$ nonlinear equations in $r$ unknowns, $\mathbf{L}(\mathbf{Z}) = 0$, where $\mathbf{L}(\mathbf{Z}) = [L^1(\mathbf{Z}), \ldots, L^N(\mathbf{Z})]^T$ and $\mathbf{Z} \in \mathbb{R}^r$, the objective is to find $\mathbf{Z}$ such that all residuals vanish. Instead of employing a classical Newton update, which requires the computation and inversion of the $r \times r$ Jacobian $J(\mathbf{Z})$, the NK algorithm updates $\mathbf{Z}$ sequentially with respect to one equation, indexed by $i$, at each iteration:

$$\mathbf{Z}^{q+1} = \mathbf{Z}^q - \mu \frac{L^i(\mathbf{Z}^q)}{\|\mathbf{J}_i(\mathbf{Z}^q)\|^2} \mathbf{J}_i(\mathbf{Z}^q)^T$$

where $\mathbf{J}_i(\mathbf{Z}) = \nabla L^i(\mathbf{Z})^T$ is the row-Jacobian of the $i$-th equation and $\mu \in (0,2)$ is a relaxation parameter. This excludes second-order terms and projects onto the local linearization hyperplane defined by $L^i$. The update exploits the row-vector pseudoinverse

$$\mathbf{J}_i(\mathbf{Z})^\dagger = \frac{\mathbf{J}_i(\mathbf{Z})^T}{\|\mathbf{J}_i(\mathbf{Z})\|^2},$$

yielding an efficient one-dimensional adaptation at each step [(Poluektov et al., 2023), eqs. (8)-(9)].
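
For illustration (a hypothetical single step, not an example taken from the paper), take the scalar equation $L^i(\mathbf{Z}) = Z_1^2 + Z_2 - 3$ at the iterate $\mathbf{Z}^q = (1, 1)^T$ with $\mu = 1$:

$$L^i(\mathbf{Z}^q) = -1, \qquad \mathbf{J}_i(\mathbf{Z}^q) = (2Z_1, 1) = (2, 1), \qquad \|\mathbf{J}_i(\mathbf{Z}^q)\|^2 = 5,$$

$$\mathbf{Z}^{q+1} = (1, 1)^T - 1 \cdot \frac{-1}{5}\,(2, 1)^T = (1.4, 1.2)^T.$$

The residual of the selected equation drops from $-1$ to $L^i(\mathbf{Z}^{q+1}) = 1.96 + 1.2 - 3 = 0.16$ without forming or inverting any Jacobian matrix.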

2. Algorithmic Structure

The generic NK iteration applies the above update in a cyclic or randomized fashion across the $N$ equations (or data records); a minimal code sketch follows the enumerated steps:

  1. Initialize $\mathbf{Z}^0$.
  2. For each iteration $q$:
    • Select index $i$ (cyclic: $i = (q \bmod N) + 1$, or random).
    • Compute the residual $L^i(\mathbf{Z}^q)$ and the gradient $\nabla L^i(\mathbf{Z}^q)$.
    • If $\|\nabla L^i(\mathbf{Z}^q)\|^2$ is too small, break (singular linearization).
    • Update $\mathbf{Z}$ via the projected step with relaxation $\mu$.
    • Stop upon convergence of the parameter update or the residual.
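
A minimal Python sketch of the cyclic variant, applied to a hypothetical two-equation system (the system and all names are illustrative, not taken from the paper), is:

```python
import numpy as np

# Toy system (illustrative only): L^1(z) = z_1^2 + z_2 - 3, L^2(z) = z_1 + z_2^2 - 5,
# with exact solution z = (1, 2).
residuals = [lambda z: z[0]**2 + z[1] - 3.0,
             lambda z: z[0] + z[1]**2 - 5.0]
gradients = [lambda z: np.array([2.0 * z[0], 1.0]),
             lambda z: np.array([1.0, 2.0 * z[1]])]

def newton_kaczmarz(residuals, gradients, z0, mu=1.0, max_sweeps=500, tol=1e-10):
    """Cyclic Newton-Kaczmarz iteration: one projected step per scalar equation."""
    z = np.asarray(z0, dtype=float)
    for _ in range(max_sweeps):
        for L_i, grad_i in zip(residuals, gradients):   # cyclic sweep over the equations
            g = grad_i(z)
            norm_sq = g @ g
            if norm_sq < 1e-14:                         # near-singular linearization guard
                continue
            z = z - mu * L_i(z) / norm_sq * g           # projected one-equation update
        if max(abs(L(z)) for L in residuals) < tol:     # residual-based stopping
            break
    return z

print(newton_kaczmarz(residuals, gradients, z0=[0.5, 0.5]))   # approx. [1., 2.]
```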

When specialized to the Kolmogorov-Arnold (KA) model, the method adapts to the parameterization of the representation's inner and outer univariate functions. The parameters $H_{kjp}$ and $G_{kl}$ govern the decomposition into basis functions $\phi^p(x)$ and $\psi^l(t)$, respectively. The update rules for these parameters are:

  • For all $k, j, p$: $H_{kjp}^{q+1} = H_{kjp}^q + \mu B_{kjp}\Delta$
  • For all $k, l$: $G_{kl}^{q+1} = G_{kl}^q + \mu A_{kl}\Delta$

where the coefficients $A_{kl}$, $B_{kjp}$ and the scaling factor $\zeta$ are computed from the current $H$ and $G$, and $\Delta = (y_i - E)/\zeta$ is the normalized residual between the target $y_i$ and the model output $E$ [(Poluektov et al., 2023), eqs. (17)-(18)].

3. Application to Kolmogorov-Arnold Models

Kolmogorov-Arnold models, or networks, express continuous multivariate functions by composition of parameterized univariate transforms. These are constructed from basis expansions:

$$f(\mathbf{X}) = \sum_{k, l} G_{kl} \, \psi^l \left( \sum_{j, p} H_{kjp} \, \phi^p(X_j) \right)$$

Determining suitable $H_{kjp}$ and $G_{kl}$ from data constitutes a nonlinear inverse problem. The NK approach decomposes the solution into iterative 1D projections, significantly reducing the computational burden per update to $O(nm + s)$, where $n$ and $s$ are the grid sizes of the basis expansions [(Poluektov et al., 2023), section 4.2]. This structure confers distinct practical advantages in memory usage and batchwise computation.
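
A minimal sketch of how a single record $(\mathbf{x}_i, y_i)$ could drive one update of $(H, G)$ is given below. It combines the forward evaluation above with the update rules of Section 2, under the assumption (an interpretation, not the paper's verbatim definitions) that $A_{kl}$ and $B_{kjp}$ are the partial derivatives of the model output $E$ with respect to $G_{kl}$ and $H_{kjp}$, and that $\zeta$ is the squared norm of that gradient; all names are illustrative.

```python
import numpy as np

def ka_forward(x, H, G, phi, psi):
    """Evaluate E = sum_{k,l} G[k,l] * psi_l(t_k), with t_k = sum_{j,p} H[k,j,p] * phi_p(x_j)."""
    K, m, n = H.shape                              # inner sums, input variables, inner basis size
    Phi = np.array([[phi[p](x[j]) for p in range(n)] for j in range(m)])   # phi_p(x_j), shape (m, n)
    t = np.einsum('kjp,jp->k', H, Phi)             # inner arguments t_k
    Psi = np.array([[psi[l](t[k]) for l in range(G.shape[1])] for k in range(K)])  # psi_l(t_k)
    return float(np.sum(G * Psi)), t, Phi, Psi

def ka_nk_update(x, y, H, G, phi, psi, dpsi, mu=1.0):
    """One Newton-Kaczmarz-style update of (H, G) driven by a single record (x, y)."""
    E, t, Phi, Psi = ka_forward(x, H, G, phi, psi)
    A = Psi                                        # assumed: A_kl = dE/dG_kl = psi_l(t_k)
    dpsi_t = np.array([[dpsi[l](t[k]) for l in range(G.shape[1])] for k in range(H.shape[0])])
    w = np.sum(G * dpsi_t, axis=1)                 # sum_l G_kl * psi_l'(t_k), one value per k
    B = w[:, None, None] * Phi[None, :, :]         # assumed: B_kjp = dE/dH_kjp
    zeta = np.sum(A**2) + np.sum(B**2)             # assumed scaling: squared norm of the gradient
    delta = (y - E) / zeta                         # normalized residual Delta = (y_i - E)/zeta
    return H + mu * B * delta, G + mu * A * delta
```

Applied cyclically or at random over the $N$ records, one such update per record reproduces the algorithmic structure of Section 2, with the per-record cost dominated by the basis evaluations.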

4. Convergence and Robustness

Under the conditions that each $L^i(\mathbf{Z})$ is continuously differentiable in a neighborhood of a solution $\mathbf{Z}^*$ and that $\nabla L^i(\mathbf{Z}^*) \ne 0$, the NK algorithm exhibits local convergence for sufficiently good initial guesses [(Poluektov et al., 2023), appendix A]. Empirical results indicate improved robustness relative to the Gauss-Newton (GN) method in fitting KA model parameters, particularly as the initial guess is perturbed away from the true solution. In ridge-function identification tasks (for example, $m=5$, $s=3$, $N=400$), NK maintains a high frequency of low-RMSE solutions even for poor initializations; for perturbation magnitude $\alpha = 1.2$, GN achieves RMSE $< 10\%$ in $\approx 33\%$ of runs, compared to NK's $\approx 78\%$ [(Poluektov et al., 2023), Table 1].

The practical convergence rate with the KA model and piecewise-linear basis functions $\phi$, $\psi$ can be estimated empirically by

$$\log \mathrm{RMSE} \simeq -\alpha \log(\mathrm{epochs}),$$

with RMSE approaching $0.5\%$ after $500$ passes through a dataset of $N = 10^4$ records, when $\mu = 1$ [(Poluektov et al., 2023), section 4.2].
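
The exponent $\alpha$ can be estimated from a logged RMSE-versus-epochs curve by a least-squares fit in log-log coordinates; the numbers in the sketch below are hypothetical and purely illustrative.

```python
import numpy as np

# Hypothetical logged values: RMSE measured after selected epochs (illustrative only).
epochs = np.array([10, 50, 100, 200, 500])
rmse = np.array([0.060, 0.022, 0.014, 0.009, 0.005])

# Fit log(RMSE) = c - alpha * log(epochs); alpha is the empirical convergence exponent.
slope, intercept = np.polyfit(np.log(epochs), np.log(rmse), deg=1)
alpha = -slope
print(f"estimated alpha ~ {alpha:.2f}")
```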

5. Practical Considerations for Implementation

Efficient implementation of the NK method for KA models is contingent on several choices:

  • Basis selection: Piecewise-linear functions $\phi^p$, $\psi^l$ defined on uniform grids are recommended for their compact support, sparsity, and straightforward derivative calculation [(Poluektov et al., 2023), eqs. (25)-(27)]; see the sketch after this list.
  • Relaxation parameter: $\mu$ should be chosen in $(0,2)$; empirically, $\mu \approx 1$ achieves a favorable tradeoff between step size and noise filtering.
  • Initialization: The initial parameters $H^0$, $G^0$ are sampled uniformly from ranges that scale with the data output range $y_\mathrm{min}$, $y_\mathrm{max}$ and the model size, ensuring internal states remain within the region of basis support [(Poluektov et al., 2023), eq. (30)].
  • Regularization and model tuning: Validation-based selection of the grid sizes $n$, $s$ and the number of terms (typically $2m+1$ for the full KA representation) mitigates overfitting.
  • Stopping criteria: Convergence can be monitored by update norms, $|\Delta|$, or residuals.
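
A minimal sketch of such a piecewise-linear ("hat") basis on a uniform grid, together with its derivative, is shown below; the construction is generic, and the exact grid and boundary conventions of the paper (eqs. (25)-(27)) may differ.

```python
import numpy as np

def hat_basis(grid):
    """Piecewise-linear basis functions and their derivatives on a uniform 1-D grid."""
    h = grid[1] - grid[0]                      # uniform spacing assumed

    def phi(p):
        def f(x):
            # Hat function centered at grid[p], supported on [grid[p] - h, grid[p] + h].
            return max(0.0, 1.0 - abs(x - grid[p]) / h)
        return f

    def dphi(p):
        def df(x):
            d = x - grid[p]
            if abs(d) >= h:                    # outside the compact support
                return 0.0
            return -1.0 / h if d > 0 else 1.0 / h   # subgradient convention at the kink
        return df

    n = len(grid)
    return [phi(p) for p in range(n)], [dphi(p) for p in range(n)]

# Example: 6 hat functions on [0, 1]; at any interior x, at most two of them are non-zero.
phis, dphis = hat_basis(np.linspace(0.0, 1.0, 6))
```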

6. Comparative Analysis

In direct comparisons on synthetic regression tasks, the NK method demonstrates superior robustness and efficiency vis-à-vis the Gauss-Newton method, especially under poor initial guesses. Each NK update involves only a subset of the parameters and does not require storing or manipulating large Jacobian matrices, significantly lowering computational and memory requirements.

While the referenced work does not include partial differential equation (PDE)-based benchmarks or direct comparisons with modern multilayer perceptrons (MLPs) on massive datasets, it documents the theoretical scalability and empirical efficiency of the approach for high-dimensional, large-sample nonlinear regression (Poluektov et al., 2023). The explicit focus on basis expansions and single-equation update steps distinguishes the NK algorithm from other nonlinear solvers deployed in machine learning and scientific computing.

7. Future Perspectives and Limitations

The paper by Poluektov & Polar does not address parallel or block implementations of the NK algorithm. Extension to parallel or distributed environments, such as asynchronous or block-Kaczmarz schemes, remains an open avenue, with anticipated complexities in synchronization and communication.

A plausible implication is that advances in this direction could further reduce wall-clock times for massive datasets, though these must be validated in practice. The algorithm's empirical performance in PDEs, extreme dimension settings, or with real-world structured noise awaits further demonstration, as such applications are explicitly marked as outside the scope of the current study (Poluektov et al., 2023).

