Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks (1901.06523v7)

Published 19 Jan 2019 in cs.LG and stat.ML

Abstract: We study the training process of Deep Neural Networks (DNNs) from the Fourier analysis perspective. We demonstrate a very universal Frequency Principle (F-Principle) -- DNNs often fit target functions from low to high frequencies -- on high-dimensional benchmark datasets such as MNIST/CIFAR10 and deep neural networks such as VGG16. This F-Principle of DNNs is opposite to the behavior of most conventional iterative numerical schemes (e.g., Jacobi method), which exhibit faster convergence for higher frequencies for various scientific computing problems. With a simple theory, we illustrate that this F-Principle results from the regularity of the commonly used activation functions. The F-Principle implies an implicit bias that DNNs tend to fit training data by a low-frequency function. This understanding provides an explanation of good generalization of DNNs on most real datasets and bad generalization of DNNs on parity function or randomized dataset.

Citations (463)

Summary

  • The paper introduces the Frequency Principle, showing that DNNs learn low-frequency components before high frequencies.
  • It employs projection and filtering methods to validate the low-to-high frequency fitting behavior across various architectures and datasets.
  • Findings attribute this behavior to the regularity of common activation functions, an implicit bias with implications for future DNN design.

Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks

The paper, "Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks," introduces a framework for understanding the training dynamics of Deep Neural Networks (DNNs) through a Fourier analysis lens. The authors propose a universal Frequency Principle (F-Principle), which posits that DNNs typically fit target functions from low to high frequencies during training. This principle provides an explanation for the generalization capabilities of DNNs on real datasets and their limitations on functions dominated by high frequencies.

Core Insights and Claims

The F-Principle is the opposite of the behavior of most conventional iterative numerical schemes, such as the Jacobi method, whose errors converge fastest in the high-frequency components. The authors attribute the difference to the regularity (smoothness) of commonly used activation functions, which induces an implicit bias toward fitting low-frequency components first. This insight helps explain why DNNs generalize well on datasets like MNIST and CIFAR10 but struggle with parity functions or randomized datasets.
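
To make the contrast concrete, here is a minimal sketch (an illustration of standard Jacobi behavior, not code from the paper): damped Jacobi sweeps are applied to sine error modes of a discretized 1D Poisson problem, and the surviving amplitude of each mode is reported. High-frequency modes are damped dramatically faster than low-frequency ones, the reverse of the order in which a DNN fits them.

```python
import numpy as np

# Illustration (not from the paper): damped Jacobi on the error equation of a
# discretized 1D Poisson problem. Each sine mode is an eigenvector of the sweep,
# so its amplitude shrinks by a fixed factor per iteration, quickly for
# high-frequency modes and very slowly for low-frequency ones.
n, w, sweeps = 63, 2.0 / 3.0, 50
j = np.arange(1, n + 1)

for k in (1, 8, 32):                         # low, medium, high frequency mode
    e = np.sin(k * np.pi * j / (n + 1))      # error mode with zero boundary values
    for _ in range(sweeps):
        e_avg = 0.5 * (np.roll(e, 1) + np.roll(e, -1))
        e_avg[0], e_avg[-1] = 0.5 * e[1], 0.5 * e[-2]    # Dirichlet boundaries
        e = (1 - w) * e + w * e_avg                      # damped Jacobi sweep
    print(f"mode k={k:2d}: amplitude after {sweeps} sweeps = {np.abs(e).max():.2e}")
```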

Methodology

The paper employs two primary methods to substantiate the F-Principle across high-dimensional benchmark datasets and various DNN architectures:

  1. Projection Method: The target function and the DNN output are examined along a single direction in input space via a one-dimensional (directional) Fourier transform. Experiments on MNIST and CIFAR10 confirm that the low-frequency components along this direction are fitted earlier in training.
  2. Filtering Method: The target and the DNN output are each split into low- and high-frequency parts, and the relative error of each part is tracked during training. The low-frequency part consistently converges first, reaffirming the F-Principle (a minimal one-dimensional sketch follows this list).
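
As a concrete, deliberately low-dimensional illustration of the filtering idea, the sketch below fits a two-frequency 1D target with a small tanh network and, at checkpoints, splits both the target and the network output into low- and high-frequency parts with an FFT cutoff, reporting the relative error of each part. The target, network size, cutoff, and optimizer settings are illustrative choices of this summary, not the paper's high-dimensional setup.

```python
import numpy as np
import torch
import torch.nn as nn

# 1D stand-in for the filtering experiment: fit a two-frequency target and
# track the relative error of the low- and high-frequency parts separately.
torch.manual_seed(0)
n = 256
x = np.linspace(-1, 1, n, endpoint=False)
y = np.sin(2 * np.pi * x) + 0.5 * np.sin(12 * np.pi * x)    # low + high frequency

xt = torch.tensor(x, dtype=torch.float32).unsqueeze(1)
yt = torch.tensor(y, dtype=torch.float32).unsqueeze(1)
net = nn.Sequential(nn.Linear(1, 128), nn.Tanh(),
                    nn.Linear(128, 128), nn.Tanh(),
                    nn.Linear(128, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def split_errors(pred, target, k_cut=4):
    """Relative error of the low (bins <= k_cut) and high (bins > k_cut) parts."""
    P, T = np.fft.rfft(pred), np.fft.rfft(target)
    e_low = np.linalg.norm(P[:k_cut + 1] - T[:k_cut + 1]) / np.linalg.norm(T[:k_cut + 1])
    e_high = np.linalg.norm(P[k_cut + 1:] - T[k_cut + 1:]) / np.linalg.norm(T[k_cut + 1:])
    return e_low, e_high

for step in range(5001):
    opt.zero_grad()
    ((net(xt) - yt) ** 2).mean().backward()
    opt.step()
    if step % 1000 == 0:
        e_low, e_high = split_errors(net(xt).detach().numpy().ravel(), y)
        print(f"step {step:5d}   e_low = {e_low:.3f}   e_high = {e_high:.3f}")
# Expected trend (F-Principle): e_low drops well before e_high does.
```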

Numerical Results

Experiments with fully connected networks, CNNs, and deeper architectures such as VGG16 consistently exhibit the F-Principle. On real-world datasets, changes to architecture and training settings affect how quickly individual frequencies are learned, but the low-to-high fitting order itself remains consistent.

The paper further highlights the relevance of the F-Principle to solving differential equations, suggesting hybrid numerical schemes in which a DNN first captures the low-frequency part of the solution and a conventional iterative method then resolves the remaining high-frequency components.
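
A hedged sketch of such a hybrid scheme is given below. To stay self-contained it regresses a known two-scale solution of -u'' = f directly (whereas a real solver would train the network on the equation itself), then uses the briefly trained network's grid values as the initial guess for damped Jacobi sweeps, which efficiently remove exactly the high-frequency residual the network tends to leave behind. The test problem, network size, and iteration counts are illustrative choices, not the paper's.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical hybrid scheme for -u'' = f on (0, 1) with u(0) = u(1) = 0.
# Stage 1: a briefly trained network captures the low-frequency part of u.
# Stage 2: damped Jacobi sweeps remove the remaining high-frequency error.
n = 127
h = 1.0 / (n + 1)
x = np.linspace(h, 1 - h, n)
u_true = np.sin(np.pi * x) + 0.1 * np.sin(20 * np.pi * x)          # two-scale solution
f = np.pi**2 * np.sin(np.pi * x) + 0.1 * (20 * np.pi)**2 * np.sin(20 * np.pi * x)

torch.manual_seed(0)
xt = torch.tensor(x, dtype=torch.float32).unsqueeze(1)
ut = torch.tensor(u_true, dtype=torch.float32).unsqueeze(1)
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(1000):   # short run: per the F-Principle, mostly the low frequencies get fitted
    opt.zero_grad()
    ((net(xt) - ut) ** 2).mean().backward()
    opt.step()
u0 = net(xt).detach().numpy().ravel()

def damped_jacobi(u, f, h, sweeps=200, w=2.0 / 3.0):
    """Damped Jacobi for the standard 3-point discretization of -u''."""
    for _ in range(sweeps):
        u_new = 0.5 * (np.roll(u, 1) + np.roll(u, -1) + h * h * f)
        u_new[0] = 0.5 * (u[1] + h * h * f[0])      # left boundary value is 0
        u_new[-1] = 0.5 * (u[-2] + h * h * f[-1])   # right boundary value is 0
        u = (1 - w) * u + w * u_new
    return u

u_hybrid = damped_jacobi(u0.copy(), f, h)
print("max error after DNN stage    :", np.abs(u0 - u_true).max())
print("max error after Jacobi sweeps:", np.abs(u_hybrid - u_true).max())
```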

Theoretical Implications

A simplified theoretical framework elucidates the connection between activation-function smoothness and the relative size of gradients in the frequency domain: the smoother (more regular) the activation, the faster its Fourier transform decays, and hence the more slowly the loss decays at high frequencies during training. The authors present this as a heuristic account and leave a fully rigorous mathematical analysis to future work.
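
The flavor of the argument can be sketched for a single tanh unit (a simplification; the convention and constants below are chosen here, not quoted from the paper). Writing the Fourier transform as \(\hat{f}(k)=\int f(x)\,e^{-ikx}\,dx\), the transform of tanh decays exponentially, so the contribution of a neuron \(\sigma(wx+b)\) at frequency \(k\), and with it the gradient signal it receives from a mismatch at that frequency, is exponentially small once \(|k|\) is much larger than \(|w|\):

\[
\bigl|\widehat{\sigma(w\,\cdot+b)}(k)\bigr|
 = \frac{\pi}{|w|\,\bigl|\sinh\!\bigl(\pi k/(2w)\bigr)\bigr|}
 \;\approx\; \frac{2\pi}{|w|}\,e^{-\pi|k|/(2|w|)}
 \qquad \text{for } |k|\gg|w|,\ \ \sigma=\tanh .
\]

Smoother activations thus starve high frequencies of gradient, which is the mechanism behind the low-to-high fitting order.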

Generalization and Aliasing

The generalization capacity of DNNs is re-examined under the F-Principle. DNNs generalize well on datasets dominated by low frequencies, in line with the F-Principle, whereas problems dominated by high frequencies, such as the parity function, generalize poorly: the low-frequency function a DNN finds agrees with the training samples but deviates from the high-frequency target away from them, an effect the authors relate to aliasing.
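
To see in what sense parity is purely high-frequency, the sketch below computes its spectrum in the Walsh-Hadamard basis on the Boolean cube, used here as a discrete stand-in for the paper's Fourier analysis; the dimension d is an arbitrary choice. All spectral mass sits on the single highest-order coefficient.

```python
import numpy as np
from itertools import combinations, product

# Illustration (Walsh-Hadamard basis as a discrete stand-in for "frequency"):
# the parity function on {-1, 1}^d has all of its spectral mass on the single
# highest-order coefficient, i.e. it is as high-frequency as possible.
d = 8
X = np.array(list(product([-1, 1], repeat=d)))     # all 2^d inputs
parity = np.prod(X, axis=1).astype(float)          # target: product of the coordinates

def wh_coefficient(f_vals, subset):
    """Mean of f(x) * prod_{i in subset} x_i over the full cube."""
    chi = np.prod(X[:, subset], axis=1) if subset else np.ones(len(X))
    return float(np.mean(f_vals * chi))

for r in range(d + 1):                             # group coefficients by order
    biggest = max(abs(wh_coefficient(parity, list(S)))
                  for S in combinations(range(d), r))
    print(f"order {r}: largest |coefficient| = {biggest:.3f}")
# Expected output: 0.000 for every order except r = d, where it is 1.000.
```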

Future Directions

The insights from this research pave the way for future exploration into the design and optimization of DNNs by shifting focus towards understanding their inherent frequency-fitting biases. Improved network architectures and training protocols that harness the F-Principle could offer enhanced performance in specific applications, especially where conventional iterative methods fall short.

This paper's investigation into the frequency dynamics of DNNs contributes significantly to the theoretical and practical understanding of deep learning, suggesting a new paradigm for analyzing and utilizing neural networks in complex domains.