- The paper demonstrates that optimizer choice encodes inductive bias, leading to qualitatively distinct neural network solutions.
- Empirical analysis reveals that optimizers with non-diagonal preconditioners, such as Shampoo, reduce catastrophic forgetting by inducing more localized, lower-dimensional representations.
- The study advocates leveraging optimizer design to induce target properties such as sparsity and robustness without modifying network architecture.
Optimizers as Inductive Bias: Their Qualitative Impact on Neural Network Solutions
The paper "Optimizers Qualitatively Alter Solutions And We Should Leverage This" (2507.12224) presents a comprehensive argument that the choice of optimizer in neural network training is not merely a matter of convergence speed or computational efficiency, but fundamentally shapes the qualitative properties of the solutions found. The authors contend that optimizers encode inductive biases, influence the effective expressivity of model classes, and can be leveraged to induce desirable properties in learned representations—an aspect that has been underappreciated relative to architectural and data-centric approaches.
Reframing the Role of Optimizers
Historically, the deep learning community has drawn heavily from convex optimization, focusing on convergence guarantees and efficiency. This perspective, while fruitful, has led to a bias: optimizers are often viewed as tools for faster or more stable minimization of loss, with little attention paid to the nature of the solutions they produce. The paper challenges this view, arguing that in the non-convex landscapes characteristic of neural networks, different optimizers can lead to qualitatively distinct minima, even when starting from identical initializations and using the same data and architecture.
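To make this concrete, here is a minimal toy sketch (our own illustration, not an experiment from the paper): the same network, initialized identically and trained on the same data, can end up with measurably different weight statistics depending solely on the optimizer. The task, metric, and hyperparameters below are arbitrary choices for demonstration.

```python
# Toy comparison: identical initialization and data, two optimizers,
# then measure a property of the resulting solutions (near-zero weights).
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)
y = (X[:, 0] > 0).long()  # synthetic binary task

def train(model, optimizer, steps=500):
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss_fn(model(X), y).backward()
        optimizer.step()

base = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
m_sgd, m_adam = copy.deepcopy(base), copy.deepcopy(base)
train(m_sgd, torch.optim.SGD(m_sgd.parameters(), lr=0.1))
train(m_adam, torch.optim.Adam(m_adam.parameters(), lr=1e-3))

def near_zero_fraction(model, tol=1e-2):
    # One possible solution property: fraction of weights near zero.
    w = torch.cat([p.detach().flatten() for p in model.parameters()])
    return (w.abs() < tol).float().mean().item()

print("SGD :", near_zero_fraction(m_sgd))
print("Adam:", near_zero_fraction(m_adam))
```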
The authors emphasize two central claims:
- Optimizers as Inductive Bias: The optimizer, like architecture and data, is a source of inductive bias. Its design can be used to favor solutions with specific properties, such as sparsity, robustness, or reduced interference in continual learning.
- Optimizer-Dependent Expressivity: The effective expressivity of a model class is not determined solely by its architecture; the optimizer constrains which functions are reachable during training, thus shaping the set of solutions that can be realized in practice.
Empirical and Theoretical Evidence
The paper substantiates its thesis through both theoretical discussion and empirical demonstrations. Two primary examples are provided:
1. Non-Diagonal Preconditioners in Continual Learning
The authors analyze the impact of second-order optimizers with non-diagonal preconditioners (e.g., Shampoo, K-FAC) in continual learning settings. They demonstrate that such optimizers, by accounting for parameter interactions, reduce "wasteful" movement in parameter space, leading to more localized and lower-dimensional representations. This, in turn, mitigates catastrophic forgetting and interference between tasks.
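For intuition about what "non-diagonal preconditioning" means in practice, here is a minimal sketch of a Shampoo-style update for a single weight matrix, following the general scheme of Gupta et al. (2018). Practical implementations add damping schedules, blocking, and infrequent root recomputation that are omitted here.

```python
# Sketch of a Shampoo-style preconditioned update for one weight matrix.
import torch

def inv_root(mat, p, eps=1e-6):
    # (symmetric PSD matrix)^(-1/p) via eigendecomposition.
    vals, vecs = torch.linalg.eigh(mat)
    vals = vals.clamp(min=eps) ** (-1.0 / p)
    return vecs @ torch.diag(vals) @ vecs.T

def shampoo_step(W, G, L, R, lr=0.01):
    # Accumulate second-moment statistics over rows and columns of the gradient.
    L += G @ G.T
    R += G.T @ G
    # Non-diagonal preconditioning couples updates across parameters in W,
    # unlike the elementwise (diagonal) scaling of Adam/AdamW.
    W -= lr * inv_root(L, 4) @ G @ inv_root(R, 4)
    return W, L, R

m, n = 8, 5
W = torch.randn(m, n)
L = 1e-4 * torch.eye(m)  # damped initial statistics
R = 1e-4 * torch.eye(n)
G = torch.randn(m, n)    # stand-in for the gradient of the loss w.r.t. W
W, L, R = shampoo_step(W, G, L, R)
```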
Empirical results on permuted and class-incremental MNIST show that models trained with Shampoo exhibit higher average accuracy across tasks and lower effective rank in their learned representations compared to those trained with SGD or AdamW. Notably, the reduction in forgetting is not attributable to convergence speed, but to the optimizer's qualitative influence on representation structure.
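The effective rank of a layer's representations can be computed directly from its activations. The sketch below uses the entropy-based definition of Roy and Vetterli (2007), a common choice for this kind of analysis; the paper's exact variant is not reproduced here.

```python
# Entropy-based effective rank of a matrix of hidden activations.
import torch

def effective_rank(H, eps=1e-12):
    # H: (num_examples, num_features) matrix of activations from one layer.
    s = torch.linalg.svdvals(H)
    p = s / s.sum()                                 # normalized singular values
    entropy = -(p * torch.log(p.clamp(min=eps))).sum()
    return torch.exp(entropy).item()                # exp of Shannon entropy

H = torch.randn(1000, 128)   # placeholder: activations collected from a layer
print(effective_rank(H))     # lower values indicate lower-dimensional features
```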
2. Sparsity via Preconditioner Design
The Power-propagation method, originally proposed as a reparameterization to induce sparsity, is reinterpreted as a specific choice of preconditioner. By scaling parameter updates according to their magnitude, the optimizer can bias learning toward sparse solutions, independent of explicit regularization. This reframing highlights that optimizer design can be a more natural and efficient vehicle for certain inductive biases than architectural changes or additive regularization.
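A minimal sketch of this idea appears below, assuming a simple magnitude-scaled update rule; the exponent and scaling are illustrative stand-ins rather than the paper's exact derivation.

```python
# Magnitude-scaled SGD as a sparsity-inducing preconditioner,
# in the spirit of Power-propagation reinterpreted as optimizer design.
import torch

def magnitude_preconditioned_step(params, lr=0.1, alpha=2.0):
    # Scale each coordinate's update by a power of its own magnitude:
    # large weights move more, small weights stall near zero, a
    # rich-get-richer dynamic that encourages sparse solutions.
    for p in params:
        if p.grad is not None:
            scale = p.detach().abs() ** (alpha - 1.0)
            p.data -= lr * scale * p.grad

w = torch.randn(10, requires_grad=True)
((w - 1.0) ** 2).sum().backward()   # toy quadratic loss
magnitude_preconditioned_step([w])
```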
Implications for Model Selection and Generalization
The paper's perspective has several important implications:
- Model Selection: Expressivity arguments that ignore the optimizer may be misleading. For example, while RNNs are theoretically Turing complete, the vanishing/exploding gradient problem can render certain functions unreachable in practice with standard optimizers (illustrated in the sketch after this list).
- Inductive Bias Beyond Architecture: In scenarios where architectural modification is infeasible (e.g., fine-tuning large pretrained models), optimizer choice becomes a primary lever for introducing new biases or desiderata.
- Generalization and OOD Behavior: The optimizer's implicit regularization can affect not only in-domain generalization but also properties such as compositionality, robustness, and the ability to learn algorithmic or causal structures.
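The reachability point about RNNs can be demonstrated in a few lines. The sketch below (our own illustration, not from the paper) shows the gradient of a vanilla RNN's final output with respect to its first input shrinking as sequence length grows, which is why functions depending on long-range inputs are hard to reach with standard gradient-based training.

```python
# Vanishing gradient signal through a vanilla RNN as sequences get longer.
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=32, nonlinearity="tanh")
for T in (10, 50, 200):
    x = torch.randn(T, 1, 4, requires_grad=True)  # (seq_len, batch, features)
    out, _ = rnn(x)
    out[-1].sum().backward()
    # Gradient reaching the first timestep decays with sequence length.
    print(T, x.grad[0].norm().item())
```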
Limitations and Counterarguments
The authors acknowledge that many inductive biases achievable via optimizer design can also be realized through reparameterization. However, they argue that the optimizer-centric view may offer practical advantages, especially in settings where architecture is fixed. They also recognize the challenge of determining which biases should be engineered versus learned from data, but maintain that making these biases explicit is preferable to relying on implicit, unexamined choices.
Future Directions
The paper advocates for a research agenda that treats optimizer design as a first-class citizen in the development of learning systems. This includes:
- Systematic characterization of the inductive biases of existing and novel optimizers.
- Development of optimizers tailored to induce specific solution properties (e.g., sparsity, modularity, robustness).
- Integration of optimizer choice into model selection and evaluation pipelines, alongside architecture and data considerations.
- Exploration of the interplay between optimizer-induced and architecture-induced expressivity, particularly in the context of large-scale, pretrained, or continually learned models.
Conclusion
This work provides a compelling case for expanding the community's focus beyond convergence speed and efficiency in optimizer research. By recognizing and leveraging the qualitative impact of optimizers on learned solutions, researchers can unlock new avenues for controlling and understanding the behavior of deep learning systems. The optimizer, far from being a neutral tool, is a powerful mechanism for shaping the inductive biases and effective capabilities of neural networks.