- The paper shows that optimizer choice acts as an inductive bias, fundamentally altering solution properties like sparsity and robustness.
- Empirical results reveal that non-diagonal preconditioners lead to more compact, lower-rank representations, reducing catastrophic forgetting.
- The study advocates for treating optimizer selection as a design axis, emphasizing its role in shaping effective expressivity alongside architecture and data.
Optimizers as Inductive Bias: Their Qualitative Impact on Learned Solutions
This paper challenges the prevailing paradigm in deep learning optimization, which has historically prioritized convergence speed and efficiency, by arguing that the choice of optimizer fundamentally shapes the qualitative properties of the solutions neural networks find. The authors contend that optimizers are not merely tools for efficiently minimizing a loss; they are critical sources of inductive bias that can alter the effective expressivity of a model class and the nature of the solutions reachable through learning.
Summary of Core Arguments
The central thesis is twofold:
- Optimizers as Inductive Bias: The optimizer, like architecture and data, encodes inductive biases that influence not just generalization but also properties such as sparsity, representation structure, and robustness to catastrophic forgetting. The optimizer modulates the credit assignment mechanism, thereby shaping the representations and solutions learned.
- Optimizer-Dependent Expressivity: The effective expressivity of a neural network is not solely determined by its architecture and data, but also by the optimizer. The set of functions a model can realize in principle (expressivity) is distinct from the set of functions it can reach in practice (reachable set), which is constrained by the optimizer and training protocol.
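One way to make this distinction explicit (the notation below is ours, not the paper's): the set of functions the architecture can represent is generally a superset of the set that training with a given optimizer actually reaches from typical initializations.

```latex
% F_arch: functions the architecture can represent (expressivity).
% R_A(D): functions reached by training with optimizer A on data D
%         from initializations drawn from p_init (reachable set).
\mathcal{F}_{\mathrm{arch}} = \{\, f_\theta : \theta \in \Theta \,\},
\qquad
\mathcal{R}_{\mathcal{A}}(\mathcal{D}) = \{\, f_{\theta_T} : \theta_T = \mathcal{A}(\theta_0, \mathcal{D}),\ \theta_0 \sim p_{\mathrm{init}} \,\},
\qquad
\mathcal{R}_{\mathcal{A}}(\mathcal{D}) \subseteq \mathcal{F}_{\mathrm{arch}}.
```

Different optimizers induce different reachable sets even for the same architecture and data, which is the sense in which effective expressivity becomes optimizer-dependent.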
Empirical and Theoretical Evidence
The paper provides concrete examples to support its claims:
1. Non-Diagonal Preconditioners in Continual Learning
- Observation: Second-order optimizers with non-diagonal preconditioners (e.g., Shampoo, K-FAC) lead to more localized, lower-rank representations and reduced catastrophic forgetting compared to first-order methods (e.g., SGD, AdamW).
- Mechanism: Non-diagonal preconditioners capture parameter interactions, allowing updates that avoid "wasteful" movement in parameter space. This results in more compact representations, which are beneficial for continual learning as they reduce interference between tasks.
- Empirical Results: On sequential variants of MNIST, models trained with Shampoo exhibit higher average accuracy across tasks and lower effective rank in their learned representations. In class-incremental settings, Shampoo-trained networks show less performance degradation on earlier tasks and less degenerate feature representations compared to Adam.
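The "effective rank" diagnostic referenced above can be reproduced in spirit with a standard entropy-based estimator over the singular values of a feature matrix (Roy & Vetterli, 2007). The sketch below is our illustration, not the paper's evaluation code, and the `hidden_activations` accessor in the usage comment is hypothetical.

```python
import numpy as np

def effective_rank(features: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of an (n_samples, n_units) feature matrix.

    Computed as exp(H(p)), where p is the singular-value spectrum normalized
    to sum to 1. Lower values indicate the more compact, lower-rank
    representations the paper associates with non-diagonal preconditioners.
    """
    s = np.linalg.svd(features - features.mean(axis=0), compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))

# Usage sketch: feats_* would be hidden activations of two trained networks
# on the same held-out batch (hypothetical accessors, for illustration only).
# feats_shampoo = hidden_activations(model_shampoo, x_eval)
# feats_adam    = hidden_activations(model_adam, x_eval)
# print(effective_rank(feats_shampoo), effective_rank(feats_adam))
```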
2. Sparsity via Optimizer-Induced Preconditioning
- Observation: Powerpropagation, originally framed as a weight reparameterization that induces sparsity, can be equivalently viewed as a specific choice of preconditioner in the optimizer.
- Mechanism: By scaling updates according to parameter magnitude, the optimizer can create "saddles" in the loss landscape that make it difficult for small weights to move away from zero, thus promoting sparsity.
- Implementation: Instead of reparameterizing the model, one can directly modify the optimizer to use a preconditioner P = diag(|θ|^β), achieving similar sparsity-inducing dynamics with potentially lower computational overhead.
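A minimal optimizer-side sketch of this idea follows (our illustration, with a hypothetical class name and default `beta`; it implements the P = diag(|θ|^β) description above, not any official Powerpropagation code):

```python
import torch

class MagnitudePreconditionedSGD(torch.optim.Optimizer):
    """SGD with a diagonal preconditioner P = diag(|theta|^beta). Illustrative sketch.

    Scaling each update by |theta|^beta shrinks the steps taken by weights that
    are already close to zero, mimicking the sparsity-inducing dynamics that
    Powerpropagation obtains through reparameterization.
    """

    def __init__(self, params, lr=1e-2, beta=1.0):
        super().__init__(params, dict(lr=lr, beta=beta))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            lr, beta = group["lr"], group["beta"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                precond = p.abs().pow(beta)          # diagonal of P, one entry per weight
                p.add_(precond * p.grad, alpha=-lr)  # theta <- theta - lr * P * grad
```

This drops in wherever `torch.optim.SGD` would be used; with `beta > 0`, weights near zero receive vanishingly small updates, so they tend to stay near zero rather than being pulled away by gradient noise.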
Implications and Claims
The paper makes several strong claims:
- Optimizer choice can qualitatively alter the solution: Different optimizers can converge to minima with distinct properties, even when starting from the same initialization and using the same architecture and data.
- Inductive bias via optimization is underexplored: The community has focused on architectural and data-centric inductive biases, neglecting the optimizer as a vehicle for encoding desired solution properties.
- Expressivity arguments must include optimization: Theoretical discussions of model expressivity (e.g., Turing completeness) are incomplete if they ignore the constraints imposed by the optimizer and training dynamics.
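The first claim is straightforward to probe. The toy harness below (ours, not from the paper) trains two copies of the same initialization on the same data with SGD and AdamW, then compares one simple solution property, the fraction of near-zero weights; the same setup extends to representation-level diagnostics such as effective rank.

```python
import copy
import torch
import torch.nn as nn

# Toy probe (ours): identical initialization, data, and architecture;
# only the optimizer differs. Compare a simple solution property afterwards.
torch.manual_seed(0)
x = torch.randn(512, 20)
y = (x[:, :2].sum(dim=1, keepdim=True) > 0).float()  # labels depend on 2 of 20 inputs

def train(model, optimizer, steps=2000):
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

base = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
m_sgd, m_adamw = copy.deepcopy(base), copy.deepcopy(base)

train(m_sgd, torch.optim.SGD(m_sgd.parameters(), lr=0.1))
train(m_adamw, torch.optim.AdamW(m_adamw.parameters(), lr=1e-3, weight_decay=1e-2))

def near_zero_fraction(model, tol=1e-2):
    w = torch.cat([p.detach().flatten() for p in model.parameters()])
    return (w.abs() < tol).float().mean().item()

print("near-zero weight fraction, SGD:  ", near_zero_fraction(m_sgd))
print("near-zero weight fraction, AdamW:", near_zero_fraction(m_adamw))
```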
Practical Considerations
- Optimizer selection as a design axis: For practitioners, the optimizer should be considered alongside architecture and data when targeting specific solution properties (e.g., robustness, sparsity, transferability).
- Preconditioner design: Custom preconditioners can be engineered to bias learning toward desired qualitative properties, even at the expense of convergence speed.
- Large-scale and transfer learning: In scenarios where architecture modification is infeasible (e.g., finetuning large pretrained models), optimizer choice becomes the primary lever for introducing new inductive biases.
Limitations and Counterarguments
The authors acknowledge that many optimizer-induced effects can, in principle, be replicated via reparameterization. However, they argue that in practice, optimizer-based approaches may be more efficient or feasible, especially in transfer learning contexts. They also note the challenge of determining which inductive biases should be engineered versus learned from data, but maintain that all current systems embody some form of bias, whether explicit or implicit.
Future Directions
- Systematic study of optimizer-induced biases: There is a need for a more rigorous taxonomy and empirical investigation of the qualitative effects different optimizers induce across tasks and architectures.
- Co-design of architecture and optimizer: The interplay between model structure and optimization dynamics should be explored to jointly achieve desired solution properties.
- Meta-optimization and learning-to-learn: Meta-learning frameworks could be leveraged to automatically discover optimizers that encode specific inductive biases, potentially leading to more robust and adaptable AI systems.
Conclusion
This work reframes the role of optimization in deep learning, advocating for a shift from a narrow focus on convergence speed to a broader perspective that recognizes the optimizer as a key determinant of solution quality and model behavior. By leveraging optimizer-induced inductive biases, researchers and practitioners can more effectively shape the properties of learned models, opening new avenues for both theoretical understanding and practical algorithm design.