Optimizers Qualitatively Alter Solutions And We Should Leverage This (2507.12224v1)

Published 16 Jul 2025 in cs.LG

Abstract: Due to the nonlinear nature of Deep Neural Networks (DNNs), one cannot guarantee convergence to a unique global minimum of the loss when using optimizers relying only on local information, such as SGD. Indeed, this was a primary source of skepticism regarding the feasibility of DNNs in the early days of the field. The past decades of progress in deep learning have revealed this skepticism to be misplaced, and a large body of empirical evidence shows that sufficiently large DNNs following standard training protocols exhibit well-behaved optimization dynamics that converge to performant solutions. This success has biased the community to use convex optimization as a mental model for learning, leading to a focus on training efficiency, whether in terms of required iterations, FLOPs, or wall-clock time, when improving optimizers. We argue that, while this perspective has proven extremely fruitful, another perspective specific to DNNs has received considerably less attention: the optimizer not only influences the rate of convergence, but also the qualitative properties of the learned solutions. Restated, the optimizer can and will encode inductive biases and change the effective expressivity of a given class of models. Furthermore, we believe the optimizer can be an effective way of encoding desiderata in the learning process. We contend that the community should aim at understanding the biases of already existing methods, as well as aim to build new optimizers with the explicit intent of inducing certain properties of the solution, rather than solely judging them based on their convergence rates. We hope our arguments will inspire research to improve our understanding of how the learning process can impact the type of solution we converge to, and lead to a greater recognition of optimizer design as a critical lever that complements the roles of architecture and data in shaping model outcomes.

Summary

  • The paper demonstrates that optimizer choice encodes inductive bias, leading to qualitatively distinct neural network solutions.
  • Empirical analysis reveals that advanced methods like Shampoo reduce catastrophic forgetting by shaping lower-dimensional representations.
  • The study advocates leveraging optimizer design to induce target properties such as sparsity and robustness without modifying network architecture.

Optimizers as Inductive Bias: Their Qualitative Impact on Neural Network Solutions

The paper "Optimizers Qualitatively Alter Solutions And We Should Leverage This" (2507.12224) presents a comprehensive argument that the choice of optimizer in neural network training is not merely a matter of convergence speed or computational efficiency, but fundamentally shapes the qualitative properties of the solutions found. The authors contend that optimizers encode inductive biases, influence the effective expressivity of model classes, and can be leveraged to induce desirable properties in learned representations—an aspect that has been underappreciated relative to architectural and data-centric approaches.

Reframing the Role of Optimizers

Historically, the deep learning community has drawn heavily from convex optimization, focusing on convergence guarantees and efficiency. This perspective, while fruitful, has led to a bias: optimizers are often viewed as tools for faster or more stable minimization of loss, with little attention paid to the nature of the solutions they produce. The paper challenges this view, arguing that in the non-convex landscapes characteristic of neural networks, different optimizers can lead to qualitatively distinct minima, even when starting from identical initializations and using the same data and architecture.

The authors emphasize two central claims:

  1. Optimizers as Inductive Bias: The optimizer, like architecture and data, is a source of inductive bias. Its design can be used to favor solutions with specific properties, such as sparsity, robustness, or reduced interference in continual learning.
  2. Optimizer-Dependent Expressivity: The effective expressivity of a model class is not determined solely by its architecture; the optimizer constrains which functions are reachable during training, thus shaping the set of solutions that can be realized in practice.
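
Both claims can be illustrated with a toy experiment that is not taken from the paper: on an underdetermined linear regression problem, plain gradient descent and a sign-based update (a crude stand-in for normalized or adaptive methods) both drive the training loss down from the same zero initialization, yet converge to measurably different weight vectors. The dimensions, step sizes, and iteration counts below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                           # underdetermined: more parameters than examples
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def grad(w):
    """Gradient of the squared-error loss 0.5 * ||Xw - y||^2."""
    return X.T @ (X @ w - y)

# Plain gradient descent from zero: iterates stay in the row space of X,
# so the limit is (approximately) the minimum-L2-norm interpolant.
w_gd = np.zeros(d)
for _ in range(5000):
    w_gd -= 1e-3 * grad(w_gd)

# Sign-based update (a crude proxy for normalized/adaptive methods):
# iterates leave the row space and settle near a different solution.
w_sign = np.zeros(d)
for t in range(5000):
    w_sign -= 1e-2 / (1 + 0.01 * t) * np.sign(grad(w_sign))

w_ref = np.linalg.pinv(X) @ y            # closed-form minimum-norm interpolant

for name, w in [("gradient descent", w_gd), ("sign update", w_sign), ("min-norm ref", w_ref)]:
    loss = 0.5 * np.sum((X @ w - y) ** 2)
    print(f"{name:16s} loss={loss:9.2e}  ||w||_2={np.linalg.norm(w):.3f}")
print("distance between the two solutions:", np.linalg.norm(w_gd - w_sign))
```

Gradient descent initialized at zero stays in the row space of the data and approximately recovers the minimum-norm interpolant; the sign-based update leaves that subspace and ends up at a visibly different solution, despite identical model, data, and initialization.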

Empirical and Theoretical Evidence

The paper substantiates its thesis through both theoretical discussion and empirical demonstrations. Two primary examples are provided:

1. Non-Diagonal Preconditioners in Continual Learning

The authors analyze the impact of second-order optimizers with non-diagonal preconditioners (e.g., Shampoo, K-FAC) in continual learning settings. They demonstrate that such optimizers, by accounting for parameter interactions, reduce "wasteful" movement in parameter space, leading to more localized and lower-dimensional representations. This, in turn, mitigates catastrophic forgetting and interference between tasks.
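
As a concrete illustration of the mechanism, the following is a minimal single-matrix sketch of a Shampoo-style update; it is not the authors' implementation and omits the blocking, grafting, and inverse-root scheduling used in practical Shampoo variants. The preconditioner accumulates the Kronecker factors L = sum of G G^T and R = sum of G^T G and applies W <- W - lr * L^(-1/4) G R^(-1/4), so each update accounts for correlations across rows and columns instead of treating parameters independently, as SGD's identity preconditioner does.

```python
import numpy as np

def sym_matrix_power(mat, power, eps=1e-4):
    """Power of a symmetric PSD matrix via eigendecomposition (damped for stability)."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * np.maximum(vals, eps) ** power) @ vecs.T

class ShampooLikeMatrixOptimizer:
    """Minimal Shampoo-style preconditioned update for a single weight matrix."""

    def __init__(self, shape, lr=0.1, eps=1e-4):
        self.lr = lr
        self.L = eps * np.eye(shape[0])   # left factor, accumulates G @ G.T
        self.R = eps * np.eye(shape[1])   # right factor, accumulates G.T @ G

    def step(self, W, G):
        self.L += G @ G.T
        self.R += G.T @ G
        precond_G = sym_matrix_power(self.L, -0.25) @ G @ sym_matrix_power(self.R, -0.25)
        return W - self.lr * precond_G

# Illustrative use on a toy quadratic: fit W to a random target matrix.
rng = np.random.default_rng(0)
W_target = rng.normal(size=(8, 4))
W = np.zeros((8, 4))
opt = ShampooLikeMatrixOptimizer(W.shape, lr=0.5)
for _ in range(200):
    G = W - W_target                      # gradient of 0.5 * ||W - W_target||_F^2
    W = opt.step(W, G)
# Final error vs. initial distance: the error should be a small fraction of the latter.
print(np.linalg.norm(W - W_target), np.linalg.norm(W_target))
```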

Empirical results on permuted and class-incremental MNIST show that models trained with Shampoo exhibit higher average accuracy across tasks and lower effective rank in their learned representations compared to those trained with SGD or AdamW. Notably, the reduction in forgetting is not attributable to convergence speed, but to the optimizer's qualitative influence on representation structure.
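
The "effective rank" of a representation can be measured directly from hidden activations. The sketch below uses the entropy-based effective rank of Roy and Vetterli; the paper's exact metric may differ, so treat this as one reasonable instantiation rather than the paper's evaluation code.

```python
import numpy as np

def effective_rank(features, eps=1e-12):
    """Entropy-based effective rank of a (num_examples, feature_dim) activation matrix."""
    centered = features - features.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)     # singular values
    p = s / (s.sum() + eps)                           # normalized spectrum
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))                     # between 1 and min(n, d)

# Example: low-rank features have low effective rank despite full ambient dimension.
rng = np.random.default_rng(0)
full_rank = rng.normal(size=(512, 64))
low_rank = rng.normal(size=(512, 3)) @ rng.normal(size=(3, 64))
print(effective_rank(full_rank))   # high: near the ambient dimension of 64
print(effective_rank(low_rank))    # low: close to the true rank of 3
```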

2. Sparsity via Preconditioner Design

The Power-propagation method, originally proposed as a reparameterization to induce sparsity, is reinterpreted as a specific choice of preconditioner. By scaling parameter updates according to their magnitude, the optimizer can bias learning toward sparse solutions, independent of explicit regularization. This reframing highlights that optimizer design can be a more natural and efficient vehicle for certain inductive biases than architectural changes or additive regularization.
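
A minimal sketch of this reinterpretation (illustrative, not the paper's code): under the signed-power reparameterization w = theta * |theta|^(alpha - 1), the chain rule multiplies the loss gradient by alpha * |theta|^(alpha - 1), so to first order a plain gradient step on theta is a step on w preconditioned by the diagonal factor alpha^2 * |theta|^(2(alpha - 1)). Updates therefore scale with parameter magnitude, so small weights barely move and drift toward zero. The toy quadratic loss and the names below are illustrative assumptions.

```python
import numpy as np

alpha, lr = 2.0, 1e-2                        # alpha > 1 biases learning toward sparsity

def to_w(theta):
    """Signed-power reparameterization: w = theta * |theta|^(alpha - 1)."""
    return np.sign(theta) * np.abs(theta) ** alpha

rng = np.random.default_rng(0)
target = rng.normal(size=5)                  # toy quadratic loss 0.5 * ||w - target||^2
grad_w = lambda w: w - target                # dL/dw for that toy loss

theta = rng.normal(size=5) * 0.5

# One gradient step in theta-space (chain rule: dL/dtheta = dL/dw * alpha * |theta|^(alpha-1)).
g_theta = grad_w(to_w(theta)) * alpha * np.abs(theta) ** (alpha - 1)
w_after_theta_step = to_w(theta - lr * g_theta)

# The same step viewed in w-space: a diagonally *preconditioned* update with
# per-parameter scale alpha^2 * |theta|^(2(alpha-1)), i.e. proportional to the
# parameter's magnitude, so small weights barely move.
precond = alpha ** 2 * np.abs(theta) ** (2 * (alpha - 1))
w_after_precond_step = to_w(theta) - lr * precond * grad_w(to_w(theta))

print(w_after_theta_step)
print(w_after_precond_step)
print(np.max(np.abs(w_after_theta_step - w_after_precond_step)))  # O(lr^2) discrepancy
```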

Implications for Model Selection and Generalization

The paper's perspective has several important implications:

  • Model Selection: Expressivity arguments that ignore the optimizer may be misleading. For example, while RNNs are theoretically Turing complete, the vanishing/exploding gradient problem can render certain functions unreachable in practice with standard optimizers (see the sketch after this list).
  • Inductive Bias Beyond Architecture: In scenarios where architectural modification is infeasible (e.g., fine-tuning large pretrained models), optimizer choice becomes a primary lever for introducing new biases or desiderata.
  • Generalization and OOD Behavior: The optimizer's implicit regularization can affect not only in-domain generalization but also properties such as compositionality, robustness, and the ability to learn algorithmic or causal structures.
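
The RNN point in the first bullet can be checked directly: in a vanilla tanh RNN whose recurrent matrix has spectral norm well below one, the gradient of the final state with respect to the first input decays roughly geometrically with sequence length, so functions that depend on long-range interactions are practically unreachable for gradient-based training regardless of the architecture's nominal expressivity. The dimensions, weight scaling, and sequence lengths below are illustrative choices.

```python
import torch

torch.manual_seed(0)
d = 32
W_h = torch.randn(d, d) * 0.25 / d ** 0.5   # recurrent weights, spectral norm well below 1
W_x = torch.randn(d, d) / d ** 0.5          # input weights

def grad_norm_wrt_first_input(seq_len):
    """Norm of d(final state)/d(first input) for a vanilla tanh RNN."""
    xs = [torch.randn(d, requires_grad=(t == 0)) for t in range(seq_len)]
    h = torch.zeros(d)
    for x in xs:
        h = torch.tanh(W_h @ h + W_x @ x)
    h.sum().backward()
    return xs[0].grad.norm().item()

for seq_len in (5, 20, 50, 100):
    print(seq_len, grad_norm_wrt_first_input(seq_len))   # shrinks rapidly with length
```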

Limitations and Counterarguments

The authors acknowledge that many inductive biases achievable via optimizer design can also be realized through reparameterization. However, they argue that the optimizer-centric view may offer practical advantages, especially in settings where architecture is fixed. They also recognize the challenge of determining which biases should be engineered versus learned from data, but maintain that making these biases explicit is preferable to relying on implicit, unexamined choices.

Future Directions

The paper advocates for a research agenda that treats optimizer design as a first-class citizen in the development of learning systems. This includes:

  • Systematic characterization of the inductive biases of existing and novel optimizers.
  • Development of optimizers tailored to induce specific solution properties (e.g., sparsity, modularity, robustness).
  • Integration of optimizer choice into model selection and evaluation pipelines, alongside architecture and data considerations.
  • Exploration of the interplay between optimizer-induced and architecture-induced expressivity, particularly in the context of large-scale, pretrained, or continually learned models.

Conclusion

This work provides a compelling case for expanding the community's focus beyond convergence speed and efficiency in optimizer research. By recognizing and leveraging the qualitative impact of optimizers on learned solutions, researchers can unlock new avenues for controlling and understanding the behavior of deep learning systems. The optimizer, far from being a neutral tool, is a powerful mechanism for shaping the inductive biases and effective capabilities of neural networks.