Orthogonal Weight Normalization: Solution to Optimization over Multiple Dependent Stiefel Manifolds in Deep Neural Networks (1709.06079v2)

Published 16 Sep 2017 in cs.LG

Abstract: Orthogonal matrices have shown advantages in training Recurrent Neural Networks (RNNs), but such matrices are limited to being square for the hidden-to-hidden transformation in RNNs. In this paper, we generalize the square orthogonal matrix to an orthogonal rectangular matrix and formulate this problem in feed-forward Neural Networks (FNNs) as Optimization over Multiple Dependent Stiefel Manifolds (OMDSM). We show that the rectangular orthogonal matrix can stabilize the distribution of network activations and regularize FNNs. We also propose a novel orthogonal weight normalization method to solve OMDSM. In particular, it constructs an orthogonal transformation over proxy parameters to ensure the weight matrix is orthogonal and back-propagates gradient information through the transformation during training. To guarantee stability, we minimize the distortions between proxy parameters and canonical weights over all tractable orthogonal transformations. In addition, we design an orthogonal linear module (OLM) to learn orthogonal filter banks in practice, which can be used as an alternative to the standard linear module. Extensive experiments demonstrate that simply substituting OLM for the standard linear module, without revising any experimental protocols, largely improves the performance of state-of-the-art networks, including Inception and residual networks, on the CIFAR and ImageNet datasets. In particular, we reduce the test error of the wide residual network on CIFAR-100 from 20.04% to 18.61% with this simple substitution. Our code is available online for result reproduction.

Citations (215)

Summary

  • The paper presents a novel approach that frames weight constraints as optimization over multiple dependent Stiefel manifolds, extending orthogonality from square to rectangular matrices.
  • It introduces an orthogonal linear module (OLM) that parameterizes weights through a transformation guaranteeing orthogonality while back-propagating gradients through that transformation, leading to more stable activations and improved convergence.
  • Empirical results demonstrate that integrating OLM with standard techniques like batch normalization and Adam optimization significantly reduces test errors, as shown on CIFAR-100 and other benchmarks.

Orthogonal Weight Normalization: Solving Optimization over Multiple Dependent Stiefel Manifolds

This paper introduces a novel orthogonal weight normalization methodology aimed at addressing the optimization challenge over multiple dependent Stiefel manifolds (OMDSM) in deep neural networks. The proposed solution extends the notion of orthogonal matrices, traditionally used in RNNs as square matrices, to more general rectangular matrices applicable to feed-forward neural networks (FNNs). This generalization is particularly beneficial for stabilizing neural network activations and regularizing networks, thereby addressing challenges associated with optimization and overfitting in deep learning.

The authors formulate the problem of learning orthogonal filters as optimization over these dependent Stiefel manifolds, wherein each layer's weight matrix is constrained to be orthogonal. The constraint restricts each weight matrix from its full Euclidean parameter space to a lower-dimensional embedded submanifold. To make this practical for training deep networks, a novel orthogonal weight normalization method is introduced: the weights are parameterized as a transformation of unconstrained proxy parameters that guarantees orthogonality throughout learning, and gradients are back-propagated through this transformation, ensuring stability and efficient convergence.
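To make the mechanism concrete, the following is a minimal sketch of one way to realize such a proxy-parameter transformation, assuming a symmetric (ZCA-style) orthogonalization W = (V V^T)^{-1/2} V; this is an illustration in PyTorch, not the authors' released code.

```python
# Minimal sketch (assumed PyTorch realization, not the authors' implementation) of a
# proxy-parameter orthogonalization: the stored parameter V is unconstrained, and the
# weight used in the forward pass is W = (V V^T)^{-1/2} V, which has orthonormal rows.
import torch

def orthogonalize(V: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Map a proxy matrix V (n x d, n <= d) to W satisfying W @ W.T ~= I_n."""
    # Row covariance of V; eps keeps the inverse square root numerically stable.
    cov = V @ V.T + eps * torch.eye(V.shape[0], dtype=V.dtype, device=V.device)
    # eigh is differentiable, so gradients back-propagate through the transform to V.
    eigvals, eigvecs = torch.linalg.eigh(cov)
    inv_sqrt = eigvecs @ torch.diag(eigvals.clamp_min(eps).rsqrt()) @ eigvecs.T
    return inv_sqrt @ V

V = torch.randn(64, 256, requires_grad=True)              # unconstrained proxy parameter
W = orthogonalize(V)                                      # effective weight, orthonormal rows
print(torch.allclose(W @ W.T, torch.eye(64), atol=1e-4))  # True (up to eps)
W.sum().backward()                                        # gradient reaches V, not W directly
```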

A key component of the proposed method is the orthogonal linear module (OLM), which can replace standard linear modules in practice. The OLM stabilizes the distribution of activations across layers, making optimization more efficient without modifying existing training protocols. When tested on state-of-the-art architectures such as Inception and residual networks, it yielded marked improvements in test error across datasets, notably reducing the test error of the wide residual network on CIFAR-100 from 20.04% to 18.61%.
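As a hedged illustration of the drop-in idea (class name and details below are assumptions, not the paper's published API), an OLM-style layer can expose the same interface as a standard linear layer while re-orthogonalizing its proxy weights on each forward pass:

```python
# Illustrative OLM-style drop-in for a standard linear layer (PyTorch sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrthogonalLinear(nn.Module):
    """Linear layer whose effective weight has orthonormal rows."""
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        assert out_features <= in_features, "orthonormal rows require out <= in"
        self.proxy = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Re-orthogonalize the proxy on every forward pass: W = (V V^T)^{-1/2} V.
        cov = self.proxy @ self.proxy.T + 1e-5 * torch.eye(
            self.proxy.shape[0], dtype=x.dtype, device=x.device)
        eigvals, eigvecs = torch.linalg.eigh(cov)
        weight = eigvecs @ torch.diag(eigvals.rsqrt()) @ eigvecs.T @ self.proxy
        return F.linear(x, weight, self.bias)

# Used exactly like nn.Linear, so existing training scripts need no protocol changes:
layer = OrthogonalLinear(256, 64)
out = layer(torch.randn(32, 256))   # shape (32, 64)
```

A convolutional counterpart would reshape each filter bank into a matrix before applying the same orthogonalization; when the number of filters exceeds the reshaped input dimension, one practical workaround is to split the filters into groups so each group can satisfy the constraint.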

Intuitively, orthogonal weight matrices preserve energy, a property also exploited by filter banks in signal processing. Enforcing orthogonal constraints keeps the learned filters orthonormal, providing intrinsic regularization without sacrificing convergence speed. The paper highlights that common Riemannian optimization approaches, while able to handle a single manifold or independent manifolds, become unstable when the layers of a deep network impose multiple dependent manifold constraints.
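A small numerical check makes the energy-preservation intuition concrete (a standalone NumPy illustration, not code from the paper): for a weight matrix with orthonormal rows, the filter responses never amplify the input's energy, and mapping back with the transpose preserves the responses' energy exactly.

```python
# Standalone numeric illustration of the energy-preservation property of a
# row-orthonormal weight matrix W (W @ W.T = I).
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((256, 64)))   # Q: 256 x 64, orthonormal columns
W = Q.T                                               # W: 64 x 256, orthonormal rows

x = rng.standard_normal(256)
y = W @ x                                             # filter-bank responses
print(np.allclose(W @ W.T, np.eye(64)))               # filters are orthonormal
print(np.linalg.norm(y) <= np.linalg.norm(x))         # responses never gain energy
print(np.isclose(np.linalg.norm(W.T @ y), np.linalg.norm(y)))  # transpose preserves energy
```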

Empirically, OLM's practical utility was confirmed through comprehensive experiments on MLPs and CNNs, encompassing popular datasets such as CIFAR-10, CIFAR-100, and ImageNet. Notably, the approach synergizes well with techniques like batch normalization and Adam optimization, further amplifying the performance gains.

This research posits that enforcing orthogonality via the proposed OLM not only supports faster convergence but can substantially enhance the generalization and robustness of neural networks. As deep learning moves toward increasingly complex models and datasets, such methodology can strengthen both theoretical understanding and practical performance. Future work could extend this framework to semi-supervised or unsupervised settings and explore its implications for adversarial robustness. In summary, this work makes significant inroads in incorporating geometrically motivated constraints into deep network training, opening new opportunities for improved model performance and stability.