- The paper shows that optimizer choice acts as an inductive bias, fundamentally altering solution properties like sparsity and robustness.
- Empirical results reveal that non-diagonal preconditioners lead to more compact, lower-rank representations, reducing catastrophic forgetting.
- The study advocates for treating optimizer selection as a design axis, emphasizing its role in shaping effective expressivity alongside architecture and data.
Optimizers as Inductive Bias: Their Qualitative Impact on Learned Solutions
This paper challenges the prevailing paradigm in deep learning optimization, which has historically prioritized convergence speed and efficiency, by arguing that the choice of optimizer fundamentally shapes the qualitative properties of the solutions neural networks find. The authors contend that optimizers are not merely tools for efficiently minimizing a loss; they are critical sources of inductive bias that can alter the effective expressivity of a model class and the nature of the solutions reachable through learning.
Summary of Core Arguments
The central thesis is twofold:
- Optimizers as Inductive Bias: The optimizer, like architecture and data, encodes inductive biases that influence not just generalization but also properties such as sparsity, representation structure, and robustness to catastrophic forgetting. The optimizer modulates the credit assignment mechanism, thereby shaping the representations and solutions learned.
- Optimizer-Dependent Expressivity: The effective expressivity of a neural network is not solely determined by its architecture and data, but also by the optimizer. The set of functions a model can realize in principle (expressivity) is distinct from the set of functions it can reach in practice (reachable set), which is constrained by the optimizer and training protocol.
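One way to make this distinction explicit (the notation below is ours, not the paper's): the set of functions the architecture can represent is generally a superset of the set that training with a given optimizer actually reaches from typical initializations.

```latex
% F_arch: functions the architecture can represent (expressivity).
% R_A(D): functions reached by training with optimizer A on data D
%         from initializations drawn from p_init (reachable set).
\mathcal{F}_{\mathrm{arch}} = \{\, f_\theta : \theta \in \Theta \,\},
\qquad
\mathcal{R}_{\mathcal{A}}(\mathcal{D}) = \{\, f_{\theta_T} : \theta_T = \mathcal{A}(\theta_0, \mathcal{D}),\ \theta_0 \sim p_{\mathrm{init}} \,\},
\qquad
\mathcal{R}_{\mathcal{A}}(\mathcal{D}) \subseteq \mathcal{F}_{\mathrm{arch}}.
```

Different optimizers induce different reachable sets even for the same architecture and data, which is the sense in which effective expressivity becomes optimizer-dependent.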
Empirical and Theoretical Evidence
The paper provides concrete examples to support its claims:
1. Non-Diagonal Preconditioners in Continual Learning
- Observation: Second-order optimizers with non-diagonal preconditioners (e.g., Shampoo, K-FAC) lead to more localized, lower-rank representations and reduced catastrophic forgetting compared to first-order methods (e.g., SGD, AdamW).
- Mechanism: Non-diagonal preconditioners capture parameter interactions, allowing updates that avoid "wasteful" movement in parameter space. This results in more compact representations, which are beneficial for continual learning as they reduce interference between tasks.
- Empirical Results: On sequential variants of MNIST, models trained with Shampoo exhibit higher average accuracy across tasks and lower effective rank in their learned representations. In class-incremental settings, Shampoo-trained networks show less performance degradation on earlier tasks and less degenerate feature representations compared to Adam.
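The "effective rank" diagnostic referenced above can be reproduced in spirit with a standard entropy-based estimator over the singular values of a feature matrix (Roy & Vetterli, 2007). The sketch below is our illustration, not the paper's evaluation code, and the `hidden_activations` accessor in the usage comment is hypothetical.

```python
import numpy as np

def effective_rank(features: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of an (n_samples, n_units) feature matrix.

    Computed as exp(H(p)), where p is the singular-value spectrum normalized
    to sum to 1. Lower values indicate the more compact, lower-rank
    representations the paper associates with non-diagonal preconditioners.
    """
    s = np.linalg.svd(features - features.mean(axis=0), compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -np.sum(p * np.log(p + eps))
    return float(np.exp(entropy))

# Usage sketch: feats_* would be hidden activations of two trained networks
# on the same held-out batch (hypothetical accessors, for illustration only).
# feats_shampoo = hidden_activations(model_shampoo, x_eval)
# feats_adam    = hidden_activations(model_adam, x_eval)
# print(effective_rank(feats_shampoo), effective_rank(feats_adam))
```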
2. Sparsity via Optimizer-Induced Preconditioning
- Observation: Powerpropagation, originally framed as a weight reparameterization that induces sparsity, can be equivalently viewed as a specific choice of preconditioner in the optimizer.
- Mechanism: By scaling updates according to parameter magnitude, the optimizer can create "saddles" in the loss landscape that make it difficult for small weights to move away from zero, thus promoting sparsity.
- Implementation: Instead of reparameterizing the model, one can directly modify the optimizer to use a preconditioner P = diag(|θ|^β), achieving similar sparsity-inducing dynamics with potentially lower computational overhead.
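A minimal optimizer-side sketch of this idea follows (our illustration, with a hypothetical class name and default `beta`; it implements the P = diag(|θ|^β) description above, not any official Powerpropagation code):

```python
import torch

class MagnitudePreconditionedSGD(torch.optim.Optimizer):
    """SGD with a diagonal preconditioner P = diag(|theta|^beta). Illustrative sketch.

    Scaling each update by |theta|^beta shrinks the steps taken by weights that
    are already close to zero, mimicking the sparsity-inducing dynamics that
    Powerpropagation obtains through reparameterization.
    """

    def __init__(self, params, lr=1e-2, beta=1.0):
        super().__init__(params, dict(lr=lr, beta=beta))

    @torch.no_grad()
    def step(self, closure=None):
        for group in self.param_groups:
            lr, beta = group["lr"], group["beta"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                precond = p.abs().pow(beta)          # diagonal of P, one entry per weight
                p.add_(precond * p.grad, alpha=-lr)  # theta <- theta - lr * P * grad
```

This drops in wherever `torch.optim.SGD` would be used; with `beta > 0`, weights near zero receive vanishingly small updates, so they tend to stay near zero rather than being pulled away by gradient noise.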
Implications and Claims
The paper makes several strong claims:
- Optimizer choice can qualitatively alter the solution: Different optimizers can converge to minima with distinct properties, even when starting from the same initialization and using the same architecture and data.
- Inductive bias via optimization is underexplored: The community has focused on architectural and data-centric inductive biases, neglecting the optimizer as a vehicle for encoding desired solution properties.
- Expressivity arguments must include optimization: Theoretical discussions of model expressivity (e.g., Turing completeness) are incomplete if they ignore the constraints imposed by the optimizer and training dynamics.
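The first claim is straightforward to probe. The toy harness below (ours, not from the paper) trains two copies of the same initialization on the same data with SGD and AdamW, then compares one simple solution property, the fraction of near-zero weights; the same setup extends to representation-level diagnostics such as effective rank.

```python
import copy
import torch
import torch.nn as nn

# Toy probe (ours): identical initialization, data, and architecture;
# only the optimizer differs. Compare a simple solution property afterwards.
torch.manual_seed(0)
x = torch.randn(512, 20)
y = (x[:, :2].sum(dim=1, keepdim=True) > 0).float()  # labels depend on 2 of 20 inputs

def train(model, optimizer, steps=2000):
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

base = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
m_sgd, m_adamw = copy.deepcopy(base), copy.deepcopy(base)

train(m_sgd, torch.optim.SGD(m_sgd.parameters(), lr=0.1))
train(m_adamw, torch.optim.AdamW(m_adamw.parameters(), lr=1e-3, weight_decay=1e-2))

def near_zero_fraction(model, tol=1e-2):
    w = torch.cat([p.detach().flatten() for p in model.parameters()])
    return (w.abs() < tol).float().mean().item()

print("near-zero weight fraction, SGD:  ", near_zero_fraction(m_sgd))
print("near-zero weight fraction, AdamW:", near_zero_fraction(m_adamw))
```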
Practical Considerations
- Optimizer selection as a design axis: For practitioners, the optimizer should be considered alongside architecture and data when targeting specific solution properties (e.g., robustness, sparsity, transferability).
- Preconditioner design: Custom preconditioners can be engineered to bias learning toward desired qualitative properties, even at the expense of convergence speed.
- Large-scale and transfer learning: In scenarios where architecture modification is infeasible (e.g., finetuning large pretrained models), optimizer choice becomes the primary lever for introducing new inductive biases.
Limitations and Counterarguments
The authors acknowledge that many optimizer-induced effects can, in principle, be replicated via reparameterization. However, they argue that in practice, optimizer-based approaches may be more efficient or feasible, especially in transfer learning contexts. They also note the challenge of determining which inductive biases should be engineered versus learned from data, but maintain that all current systems embody some form of bias, whether explicit or implicit.
Future Directions
- Systematic study of optimizer-induced biases: There is a need for a more rigorous taxonomy and empirical investigation of the qualitative effects different optimizers induce across tasks and architectures.
- Co-design of architecture and optimizer: The interplay between model structure and optimization dynamics should be explored to jointly achieve desired solution properties.
- Meta-optimization and learning-to-learn: Meta-learning frameworks could be leveraged to automatically discover optimizers that encode specific inductive biases, potentially leading to more robust and adaptable AI systems.
Conclusion
This work reframes the role of optimization in deep learning, advocating for a shift from a narrow focus on convergence speed to a broader perspective that recognizes the optimizer as a key determinant of solution quality and model behavior. By leveraging optimizer-induced inductive biases, researchers and practitioners can more effectively shape the properties of learned models, opening new avenues for both theoretical understanding and practical algorithm design.