- The paper identifies "channels to infinity": regions of the loss landscape, asymptotically parallel to neuron-duplication saddle lines, in which the input weights of two or more neurons converge while their output weights diverge, so that their combined contribution approximates a gated linear unit.
- It employs high-precision gradient flow simulations to confirm empirically that these extremely flat channels, which run asymptotically parallel to symmetry-induced saddle lines, are reached with high probability across diverse network architectures.
- The study shows that standard gradient-based optimizers tend to get stuck early in these channels because of their extreme flatness along the channel direction, offering insights into optimizer behavior and the potential generalization benefits of the resulting flat solutions.
This paper investigates the structure of neural network loss landscapes, focusing on fully connected layers. It identifies and characterizes a novel feature: "channels to infinity." These are regions of the high-dimensional parameter space along which the training loss decreases extremely slowly, leading towards minima that lie at infinite parameter norm.
The concept builds on prior work showing that duplicating a neuron in a trained neural network creates a line of critical points (often saddle points) in the loss landscape of the expanded network [fukumizu2000local, fukumizu2019semi, simsek2021geometry]. The paper empirically confirms the existence of these "saddle lines" and shows that, in certain settings (such as MLPs without biases), they can contain stable regions called "plateau saddles" where gradient flow converges (\autoref{fig2}, \autoref{fig3}). However, the authors note that adding biases seems to significantly reduce the occurrence of these stable plateau saddles (\autoref{app1}), in line with theoretical predictions for multi-layer or multi-output networks [petzka2021non].
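To make the duplication symmetry concrete, here is a minimal numpy sketch (my own construction, not the paper's code): copying a hidden neuron and splitting its output weight in any ratio leaves the network function, and hence the loss, unchanged, which is why a single critical point of the narrow network maps to an entire line of critical points of the widened one.

```python
# Neuron-duplication symmetry behind the saddle lines: duplicate neuron k and
# split its output weight a_k into (lam * a_k, (1 - lam) * a_k); the network
# function is identical for every lam.
import numpy as np

rng = np.random.default_rng(0)
softplus = lambda z: np.logaddexp(0.0, z)

d, h = 4, 3
W = rng.normal(size=(h, d))   # input weights of a single-hidden-layer MLP
a = rng.normal(size=h)        # output weights

def f(X, W, a):
    return softplus(X @ W.T) @ a

X = rng.normal(size=(10, d))
k = 1
for lam in (-0.5, 0.3, 2.0):
    W_dup = np.vstack([W, W[k]])                        # duplicate input weights of neuron k
    a_dup = np.concatenate([a, [0.0]])
    a_dup[k], a_dup[-1] = lam * a[k], (1 - lam) * a[k]  # split its output weight
    assert np.allclose(f(X, W, a), f(X, W_dup, a_dup))  # same function for every lam
print("duplication + output-weight split leaves the network function unchanged")
```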
The core contribution is the identification of "channels to infinity," which are shown to be asymptotically parallel to these symmetry-induced saddle lines (\autoref{fig1}). These channels are characterized by the behavior of at least two neurons within a layer: their input weight vectors become increasingly similar (converging), while their corresponding output weights diverge to positive and negative infinity, respectively, in a specific ratio (\autoref{fig4}).
The paper provides substantial empirical evidence for the prevalence of these channels. Using gradient flow simulations with high-precision ODE solvers [breamlpgradientflowgoingflow2023a], the authors show that gradient-based optimization methods reach these channels with high probability across diverse regression tasks, network architectures (varying width, depth), and input dimensions (\autoref{fig5}b). These channels are observed in different layers of deep networks (\autoref{fig5}f) and can involve more than two neurons, forming multi-dimensional channels (\autoref{fig5}g, \autoref{app9}, \autoref{app10}).
Crucially, the paper offers a functional interpretation of these channels. As the parameter norm diverges along a channel (i.e., the input weights become equal and the output weights diverge), the combined contribution of the involved neurons to the network's output converges to a specific form: a gated linear unit (GLU). Specifically, for two neurons $i$ and $j$ with output weights $a_i, a_j$ and input weights $w_i, w_j$, their combined contribution $a_i\,\sigma(w_i \cdot x) + a_j\,\sigma(w_j \cdot x)$ approaches $c\,\sigma(w \cdot x) + (v \cdot x)\,\sigma'(w \cdot x)$ as $\|w_i - w_j\| \to 0$ and $|a_i| + |a_j| \to \infty$. Here, $w = (w_i + w_j)/2$, $c = a_i + a_j$, and $v$ is related to the limit of $(a_i - a_j)(w_i - w_j)$. If the activation function $\sigma$ is the softplus, its derivative $\sigma'$ is the standard sigmoid, yielding a standard GLU structure [dauphin2017language]. Convergence to this functional form incurs an $O(\epsilon^2)$ error in the network output and loss, where $\epsilon = \|w_i - w_j\|/2$ (\autoref{eq:central_difference_appendix}, \autoref{eq:loss_eps_expansion_appendix}).
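The central-difference structure behind this limit is easy to check numerically. The following sketch (my own construction, not the paper's code) parametrizes a pair of softplus neurons by $\epsilon$ with $a_i + a_j = c$ and $(a_i - a_j)\,\epsilon = \gamma$ held fixed, and confirms that the pair's output approaches the GLU-like limit with an error shrinking like $\epsilon^2$:

```python
# Numerical sanity check of the GLU limit: with w_i = w + eps*u, w_j = w - eps*u,
# a_i + a_j = c and (a_i - a_j)*eps = gamma, the pair
# a_i*softplus(w_i.x) + a_j*softplus(w_j.x) approaches
# c*softplus(w.x) + gamma*(u.x)*sigmoid(w.x) with an O(eps^2) error.
import numpy as np

rng = np.random.default_rng(0)
softplus = lambda z: np.logaddexp(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

d = 5
X = rng.normal(size=(1000, d))
w = rng.normal(size=d)
u = rng.normal(size=d); u /= np.linalg.norm(u)
c, gamma = 1.0, 2.0

glu = c * softplus(X @ w) + gamma * (X @ u) * sigmoid(X @ w)  # limiting GLU-like unit

for eps in (1e-1, 1e-2, 1e-3):
    ai, aj = c / 2 + gamma / (2 * eps), c / 2 - gamma / (2 * eps)
    wi, wj = w + eps * u, w - eps * u
    pair = ai * softplus(X @ wi) + aj * softplus(X @ wj)
    print(f"eps={eps:.0e}  max |pair - GLU| = {np.max(np.abs(pair - glu)):.2e}")
# the printed error drops by ~100x per 10x decrease in eps, consistent with O(eps^2)
```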
From an optimization perspective, the channels are extremely flat along the direction of the saddle line (the direction in which parameters diverge), while becoming progressively sharper perpendicular to this direction as the parameter norm increases (\autoref{fig5}e). Standard gradient-based optimizers like SGD or Adam often reach the "entrance" of these channels but then get stuck, because the gradient magnitude becomes very small along the flat direction (\autoref{fig6}b). The resulting solutions therefore look like flat local minima at finite parameter values, even though they lie on a path leading to a minimum at infinity. The paper uses a "jump procedure" to traverse the channel faster, showing that the loss continues to decrease and the GLU approximation improves as the parameter norm increases (\autoref{fig6}a, \autoref{app:jump}).
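As a rough illustration of this anisotropy (a toy finite-difference check at a hand-constructed channel-like configuration, not one of the paper's experiments), one can compare the loss curvature along the saddle-line direction with the curvature along the direction that pulls the two input weight vectors apart; the latter comes out many orders of magnitude larger:

```python
# Toy finite-difference check of the flatness claim: at a channel-like pair of
# softplus neurons, the loss is nearly flat along the saddle-line direction
# (a_i, a_j) -> (a_i + t, a_j - t) and far sharper along the direction that
# splits the two input weight vectors further apart.
import numpy as np

rng = np.random.default_rng(0)
softplus = lambda z: np.logaddexp(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

d = 5
X = rng.normal(size=(2000, d))
w = rng.normal(size=d); u = rng.normal(size=d); u /= np.linalg.norm(u)
c, gamma = 1.0, 2.0

# target: the limiting GLU-like function of the two-neuron pair (toy choice)
y = c * softplus(X @ w) + gamma * (X @ u) * sigmoid(X @ w)

def loss(ai, aj, wi, wj):
    pred = ai * softplus(X @ wi) + aj * softplus(X @ wj)
    return np.mean((pred - y) ** 2)

eps = 1e-2  # position along the channel (smaller = deeper in)
ai, aj = c / 2 + gamma / (2 * eps), c / 2 - gamma / (2 * eps)
wi, wj = w + eps * u, w - eps * u

def curvature(step, h=1e-3):
    """Second finite difference of the loss along a parameter perturbation."""
    dai, daj, dwi, dwj = step
    lp = loss(ai + h * dai, aj + h * daj, wi + h * dwi, wj + h * dwj)
    lm = loss(ai - h * dai, aj - h * daj, wi - h * dwi, wj - h * dwj)
    return (lp - 2 * loss(ai, aj, wi, wj) + lm) / h ** 2

flat_dir  = (1.0, -1.0, np.zeros(d), np.zeros(d))  # along the saddle line
sharp_dir = (0.0, 0.0, u.copy(), -u.copy())        # splits wi, wj further apart

print("curvature along saddle-line direction  :", curvature(flat_dir))
print("curvature along weight-splitting dir.  :", curvature(sharp_dir))
```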
Practical Implications and Implementation:
- Understanding Learned Functions: The emergence of GLUs (or higher-order derivative structures for multi-neuron channels) from standard training is a surprising finding. This suggests that even simple fully connected layers, when trained with gradient methods, can implicitly learn more complex, derivative-like computational units by driving specific neurons towards the channel to infinity configuration.
- Interpreting Trained Models: Identifying neurons involved in channels can provide insights into the network's learned representation and functional decomposition. This can be done by inspecting the learned weights: look for pairs (or larger groups) of neurons in a layer whose input weight vectors are nearly parallel (e.g., high cosine similarity) and whose output weights have large magnitudes and opposite signs; a heuristic sketch is given after this list.
- Optimization Behavior: The observation that standard optimizers get "stuck" early in channels explains why networks might converge to points that appear flat. This perceived flatness in certain directions could potentially relate to good generalization properties, as suggested by sharpness-aware minimization literature [foret2020sharpness].
- Diagnostic Tools: High-precision ODE solvers such as MLPGradientFlow.jl [breamlpgradientflowgoingflow2023a] are valuable research tools for precisely analyzing gradient flow dynamics and identifying these structures, although they are computationally expensive for large-scale training. For practical applications, monitoring parameter norms and the convergence of input/output weight pairs of neurons could serve as a heuristic indicator that training has entered a channel (see the sketch after this list).
- Limitations: The primary analysis focuses on single-hidden-layer MLPs with smooth activations and regression. While the paper argues the mechanism is general, empirical verification across non-smooth activations (like ReLU), classification tasks, and different architectures (CNNs, Transformers) is needed. Identifying the specific saddle line associated with a converged channel remains an open challenge (\autoref{app:channels}).
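As a concrete starting point for the weight-inspection heuristic mentioned above, the sketch below (a hypothetical helper, not tooling from the paper) scans a hidden layer for pairs of neurons with nearly parallel input weight vectors and large, opposite-sign output weights, which is the signature of a channel pair:

```python
# Hypothetical heuristic for spotting channel-like neuron pairs in a trained
# fully connected layer: near-parallel input weight vectors together with
# large, opposite-sign output weights.
import numpy as np

def find_channel_pairs(W, a, cos_thresh=0.999, out_thresh=10.0):
    """W: (n_hidden, d_in) input weights; a: (n_hidden,) output weights."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    pairs = []
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            cos = float(Wn[i] @ Wn[j])
            opposite = a[i] * a[j] < 0
            large = min(abs(a[i]), abs(a[j])) > out_thresh
            if cos > cos_thresh and opposite and large:
                pairs.append((i, j, cos, a[i], a[j]))
    return pairs

# Usage (hypothetical names): call find_channel_pairs(layer_W, layer_a) after
# training, and also track np.linalg.norm(layer_a) over epochs; a steadily
# growing output-weight norm together with a collapsing angle between two input
# weight vectors suggests the optimizer has entered a channel.
```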
In summary, the paper "Flat Channels to Infinity in Neural Loss Landscapes" (2506.14951) reveals that neural networks trained with gradient descent can implicitly learn functional forms resembling gated linear units by traversing specific parameter paths (channels) leading towards infinity. These channels are extremely flat in certain directions, causing standard optimizers to converge slowly or get stuck early, making them appear as flat local minima. This phenomenon provides a new perspective on the benign nature of neural network loss landscapes and highlights the unexpected computational capabilities arising from simple neuron interactions.