
Sprecher Networks: Efficient Neural Spline Models

Updated 29 December 2025
  • Sprecher Networks are neural architectures based on KAS theory that use shared spline functions and affine transformations for universal function approximation.
  • Their structured blocks, including explicit shift parameters and lateral mixing, achieve high expressivity with linear parameter and memory scaling.
  • Empirical evaluations show that SNs match or outperform traditional MLPs and KANs on synthetic regression, tabular data, and high-dimensional classification tasks.

Sprecher Networks (SNs) are a family of neural architectures grounded in the Kolmogorov-Arnold-Sprecher (KAS) theory of multivariate function representation. These networks provide universal approximation capabilities via a parameter-efficient formulation based on a shared, learnable univariate basis—a construction directly inspired by Sprecher's refinement of the superposition theorem. By leveraging shared splines, explicit shift parameters, and mixing weights within structured blocks, SNs achieve the expressivity of Kolmogorov-Arnold Networks (KANs) while scaling parameter and memory requirements linearly in network width, thus enabling deep architectures even in high-dimensional regimes (Eliasson, 9 Dec 2025, Hägg et al., 22 Dec 2025).

1. Theoretical Foundation: Kolmogorov–Arnold–Sprecher Theorems

Classical KAS theory establishes that every continuous function $f : [0,1]^d \to \mathbb{R}$ can be exactly represented as a finite superposition of univariate functions. The Kolmogorov–Arnold (1957) formulation states:

$$f(x_1, \ldots, x_d) = \sum_{q=1}^{2d+1} \Phi_q\!\left( \sum_{p=1}^{d} \varphi_{p,q}(x_p) \right)$$

where the $\varphi_{p,q}$ and $\Phi_q$ are continuous univariate maps. Sprecher (1965) refined this result, demonstrating that all inner branches can share a single function up to linear shift and scaling:

$$f(x_1, \ldots, x_d) = \sum_{q=1}^{2d+1} \Phi\!\left( \sum_{p=1}^{d} \lambda^{p \cdot q}\, \psi(x_p + \epsilon q) \right)$$

where $\psi$ (the "parent" function) and $\Phi$ are continuous, and the weights $\lambda$ and shifts $\epsilon$ are constants. This result motivates architectures that use shared basis functions and affine transformations to approximate any continuous multivariate map via compositions of shifted and linearly-mixed univariate splines (Eliasson, 9 Dec 2025, Hägg et al., 22 Dec 2025).
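For concreteness, instantiating Sprecher's formula at $d = 2$ (so the exponent $p \cdot q$ yields $\lambda^{q}$ for $p=1$ and $\lambda^{2q}$ for $p=2$) gives:

$$f(x_1, x_2) = \sum_{q=1}^{5} \Phi\!\left( \lambda^{q}\, \psi(x_1 + \epsilon q) + \lambda^{2q}\, \psi(x_2 + \epsilon q) \right)$$

Five outer terms ($2d+1 = 5$) suffice, each evaluating the single shared parent function $\psi$ at shifted copies of both coordinates.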

2. Core Architecture: The Sprecher Block

A Sprecher Network is constructed from blocks that implement the sum-of-shifted-splines strategy. Each block maps an input vector $\mathbf{x} \in \mathbb{R}^{d_{\ell-1}}$ to $\mathbf{h} \in \mathbb{R}^{d_\ell}$ using shared, learnable inner and outer spline functions, channel-wise mixing, and explicit channel shifts. The parameterization comprises:

  • Inner monotonic spline $\phi^{(\ell)} : \mathbb{R} \to [0,1]$ (with $G$ knots)
  • Outer general spline $\Phi^{(\ell)} : \mathbb{R} \to \mathbb{R}$ (with $G$ knots)
  • Weight vector $\lambda^{(\ell)} \in \mathbb{R}^{d_{\ell-1}}$
  • Shift parameter $\eta^{(\ell)} \in \mathbb{R}$
  • Optional lateral mixing scale $\tau^{(\ell)}$ and weights $\omega^{(\ell)}$
  • Optional cyclic or linear residual connections $R^{(\ell)}$

For block output index $q$, the mapping is:

$$\begin{aligned} s_q^{(\ell)} &= \sum_{i=1}^{d_{\ell-1}} \lambda_i^{(\ell)}\,\phi^{(\ell)}\!\left(x_i + \eta^{(\ell)} q\right) + \alpha\, q \\ \tilde s_q^{(\ell)} &= s_q^{(\ell)} + \tau^{(\ell)} \sum_{j \in \mathcal N(q)} \omega^{(\ell)}_{q,j}\, s_j^{(\ell)} \\ [B^{(\ell)}(\mathbf{x})]_q &= \Phi^{(\ell)}\!\left(\tilde s_q^{(\ell)}\right) + R^{(\ell)}(\mathbf{x})_q \end{aligned}$$

Here, $\mathcal N(q)$ specifies the neighborhood for lateral mixing (e.g., cyclic neighbors), and $\alpha$ is a fixed output-channel offset (typically $1$) (Hägg et al., 22 Dec 2025).
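The block equations can be sketched in a few lines of NumPy. This is a minimal illustration, not the published implementation: the learnable splines $\phi^{(\ell)}, \Phi^{(\ell)}$ are replaced by fixed smooth stand-ins (a sigmoid and tanh), the residual term is omitted, and uniform lateral weights $\omega_{q,j} = 1$ over cyclic nearest neighbors are assumed.

```python
import numpy as np

def sprecher_block(x, lam, eta, alpha=1.0, tau=0.0, d_out=4):
    """One Sprecher block: x (d_in,) -> h (d_out,).

    lam: (d_in,) mixing weights; eta: scalar channel shift.
    """
    phi = lambda t: 1.0 / (1.0 + np.exp(-t))  # stand-in monotone inner map into (0, 1)
    Phi = np.tanh                             # stand-in outer map
    q = np.arange(d_out)                      # output channel indices

    # Inner sums: s_q = sum_i lam_i * phi(x_i + eta*q) + alpha*q
    s = (lam[None, :] * phi(x[None, :] + eta * q[:, None])).sum(axis=1) + alpha * q

    # Optional cyclic lateral mixing over nearest neighbors (uniform weights)
    s_tilde = s + tau * (np.roll(s, 1) + np.roll(s, -1))
    return Phi(s_tilde)

x = np.array([0.3, -0.7])
h = sprecher_block(x, lam=np.array([0.5, 0.5]), eta=0.1, tau=0.05)
print(h.shape)  # (4,)
```

Note how the same inner map `phi` is evaluated for every channel $q$, only at shifted arguments; the per-edge spline tables of a KAN never materialize.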

3. Deep Composition and Structural Enhancements

Deep SNs are constructed by stacking multiple Sprecher blocks. The final output for a scalar-valued function is obtained by summing over output channels in the last block; for vector-valued tasks, an additional block maps to the required output dimensionality.
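The stacking scheme can be sketched as follows. The block internals are abbreviated to a toy shared-shift map purely to show the composition pattern and the final channel sum; this is an illustrative sketch, not the paper's architecture.

```python
import numpy as np

def toy_block(x, d_out, eta=0.1):
    """Abbreviated block: shared inner map at shifted inputs, per-channel offset."""
    q = np.arange(d_out)
    s = np.tanh(x[None, :] + eta * q[:, None]).sum(axis=1) + q
    return np.tanh(s)

def deep_sn(x, widths=(8, 8, 8)):
    """Stack blocks, then sum the last block's channels for a scalar output."""
    h = x
    for w in widths:
        h = toy_block(h, w)
    return h.sum()

y = deep_sn(np.array([0.2, -0.4]))
```

For vector-valued outputs, the final sum would be replaced by one more block mapping to the target dimensionality, as described above.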

Lateral mixing enables communication and parameter-sharing across output channels with $\mathcal{O}(N)$ parameters, compared to the $\mathcal{O}(N^2)$ scaling of full attention. Cyclic or linear residual connections provide additional optimization stability and regularization benefits. Optional batch-normalization layers (affine, per-channel) may be placed before or after each block to improve training dynamics (Hägg et al., 22 Dec 2025).

4. Parameter and Memory Complexity

Sprecher Networks circumvent the parameter inefficiency of standard KANs. Whereas naïve KANs require an individually parameterized spline per edge (yielding $\mathcal{O}(N^2 G)$ parameters per layer for width $N$ and $G$ spline knots), SNs share splines across output channels and parameterize affine transformations per channel or edge:

| Model | Parameters (total, $L$ layers) | Memory per Layer (naïve) | Sequential Memory (SN) |
|-------|-------------------------------|--------------------------|------------------------|
| MLP   | $\mathcal{O}(LN^2)$           | $\mathcal{O}(N^2)$       | —                      |
| LAN   | $\mathcal{O}(LN^2 + LNG)$     | $\mathcal{O}(N^2)$       | —                      |
| KAN   | $\mathcal{O}(LN^2 G)$         | $\mathcal{O}(N^2 G)$     | —                      |
| SN    | $\mathcal{O}(LN + LG)$        | $\mathcal{O}(N^2)$       | $\mathcal{O}(N)$       |

All parameterized splines (notably PCHIP or cubic B-splines) are shared within a block. Sequential evaluation of the output channels keeps peak forward memory per block at $\mathcal{O}(N)$, compared to $\mathcal{O}(N^2)$ for MLPs/KANs, allowing much wider layers under strict memory constraints (Hägg et al., 22 Dec 2025, Eliasson, 9 Dec 2025).
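The asymptotic counts in the table can be made concrete with a quick order-of-magnitude check (constant factors dropped; these expressions simply transcribe the table's scalings, they are not exact parameter counts for any published model):

```python
# Order-of-magnitude parameter counts for depth L, width N, grid size G,
# transcribing the scalings from the complexity table (constants dropped).
def params_mlp(L, N):
    return L * N * N          # dense weight matrices

def params_kan(L, N, G):
    return L * N * N * G      # one G-knot spline per edge

def params_sn(L, N, G):
    return L * (N + G)        # per-channel affine params + shared spline knots

L_, N, G = 4, 256, 16
print(params_mlp(L_, N), params_kan(L_, N, G), params_sn(L_, N, G))
```

At this width the SN budget is three orders of magnitude below the MLP and five below the KAN, which is why the grid size $G$ can be increased without touching the width-dependent cost.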

5. Empirical Evaluation and Functional Expressivity

SNs have been empirically benchmarked against MLPs, KANs, and learnable activation networks (LANs) on tasks spanning synthetic regression, tabular data, and high-dimensional classification:

  • Synthetic Function Approximation: SNs achieve MSEs competitive with or superior to parameter-matched KANs and MLPs. For example, GS-KAN (a Sprecher-type SN) achieves MSE $\approx 8.4 \times 10^{-4}$ in a "nano" (200-parameter) regime for $f(x,y) = \sin(3\pi x)\cos(3\pi y) + \mathcal{N}(0, 10^{-4})$, outperforming both MLPs and standard KANs (Eliasson, 9 Dec 2025, Hägg et al., 22 Dec 2025).
  • Tabular Regression: On the California Housing dataset, GS-KAN outperforms MLPs in all parameter regimes and matches or surpasses standard KANs, e.g., MSE $\approx 0.290$ for GS-KAN vs. $0.294$ for the MLP (200-parameter regime).
  • High-Dimensional Classification: On Fashion-MNIST with $d_{in} = 784$ and $\sim$12.5K parameters, GS-KAN achieves accuracy $\approx 87.03\%$, exceeding MLP accuracy ($86.00\%$), demonstrating scalability to high-dimensional domains without the prohibitive parameter explosion of KANs.
  • Physics-Informed and Quantile Tasks: SNs attain lower MSE than KANs for physics-informed PDE regression and dense quantile prediction under tight parameter budgets (Hägg et al., 22 Dec 2025).

Optional cyclic lateral mixing reduces MSE (e.g., on a 2→[10,10,10]→1 synthetic task, the cyclic residual variant achieves $1.79 \times 10^{-3}$ MSE vs. $4.33 \times 10^{-2}$ for the linear variant) while using an order of magnitude fewer parameters.

6. Scalability, Limitations, and Future Directions

Sprecher Networks decouple spline basis capacity (determined by $G$) from network width, permitting training under stringent memory and parameter constraints even with $d_{in} \gg 10^4$. Fixed spline domains (e.g., $[-3,3]$) combined with learned shifts and scales enable flexible mapping of features, though occasional out-of-domain inputs incur zero local gradient; batch-level adaptation mitigates this.
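The out-of-domain gradient issue is easy to see numerically. In this sketch a quadratic stands in for a learned spline on a fixed domain $[-3, 3]$; clamping the input to that domain makes the finite-difference derivative vanish for any input outside it, so no gradient signal reaches upstream parameters.

```python
import numpy as np

def clamped_eval(spline_fn, t, lo=-3.0, hi=3.0):
    """Evaluate a spline on a fixed domain by clamping the input."""
    return spline_fn(np.clip(t, lo, hi))

f = lambda t: t ** 2          # stand-in for a learned spline
eps = 1e-4
t0 = 5.0                      # outside the [-3, 3] domain

# Central finite difference w.r.t. the input: both probes clamp to 3.0
grad = (clamped_eval(f, t0 + eps) - clamped_eval(f, t0 - eps)) / (2 * eps)
print(grad)  # 0.0 — no local gradient for out-of-domain inputs
```

This is the failure mode that batch-level adaptation of the learned shifts and scales is meant to mitigate: keeping activations inside the active spline domain keeps gradients alive.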

Limitations include the use of fixed, uniform knot grids (which may underutilize spline capacity in regions of high nonlinearity) and the computational cost associated with recursive spline evaluation. The architecture remains fundamentally fully connected; integration with convolutional or attention-based patterns is a potential avenue for further research. Learnable knot positions and alternative smooth bases such as RBFs are listed as promising extensions (Eliasson, 9 Dec 2025, Hägg et al., 22 Dec 2025).

Sprecher Networks occupy a distinct region in the landscape of function-approximating architectures:

  • MLPs: Rely on fixed node activations and quadratic scaling in weight parameters.
  • KANs: Feature learnable edge activations with a quadratic (or higher) parameter count.
  • LANs: Use node-wise learnable activations but retain $\mathcal{O}(N^2)$ scaling with width $N$.
  • SNs: Realize KAS universality in a parameter- and memory-efficient form, sharing splines blockwise with linear scaling in width and a minor spline overhead.

Empirical evidence shows that SNs maintain or improve upon the approximation capabilities of MLPs and KANs while imposing substantially reduced parameter and memory burdens, particularly for wide or deep network configurations (Eliasson, 9 Dec 2025, Hägg et al., 22 Dec 2025).
