Deep Learning and Computational Physics (Lecture Notes) (2301.00942v1)

Published 3 Jan 2023 in cs.LG, math-ph, and math.MP

Abstract: These notes were compiled as lecture notes for a course developed and taught at the University of Southern California. They should be accessible to a typical engineering graduate student with a strong background in Applied Mathematics. The main objective of these notes is to introduce a student who is familiar with concepts in linear algebra and partial differential equations to select topics in deep learning. These lecture notes exploit the strong connections between deep learning algorithms and the more conventional techniques of computational physics to achieve two goals. First, they use concepts from computational physics to develop an understanding of deep learning algorithms. Not surprisingly, many concepts in deep learning can be connected to similar concepts in computational physics, and one can utilize this connection to better understand these algorithms. Second, several novel deep learning algorithms can be used to solve challenging problems in computational physics. Thus, they offer someone who is interested in modeling a physical phenomenon a complementary set of tools.

This paper (Ray et al., 2023) provides lecture notes covering the interface between deep learning and computational physics. The notes are aimed at graduate students with a background in applied mathematics, particularly linear algebra and partial differential equations. The core idea is to leverage the connections between deep learning algorithms and conventional computational physics techniques both to understand deep learning algorithms better and to use novel deep learning methods to solve challenging problems in physics.

The notes begin by contrasting computational physics, which relies on postulating physical laws and numerically solving mathematical models (like ODEs/PDEs), with machine learning, which extracts patterns and relationships directly from data without requiring explicit physical laws. The motivations for combining these fields include using physics knowledge to regularize data-hungry ML models, using ML for components of a system that lack a known physical law, and applying analysis tools from computational physics (like functional analysis, numerical analysis) to deep learning.

The document covers several key topics at this intersection:

  1. Introduction to Deep Neural Networks (MLPs):
    • Architecture: Explains the structure of Multi-Layer Perceptrons (MLPs) consisting of input, hidden, and output layers, where each layer performs an affine transformation followed by a component-wise non-linear activation. The weights and biases are the trainable parameters.
    • Activation Functions: Discusses common activation functions like Linear, ReLU, Leaky ReLU, Logistic (Sigmoid), Tanh, and Sine. Their properties (smoothness, range, gradient behavior) are described with practical implications for network training and function approximation. For instance, ReLU's non-smoothness at zero can lead to "dying neurons", while smooth activations like Sine enable the approximation of higher-order derivatives needed for solving PDEs.
    • Expressivity: Introduces the concept of network expressivity, illustrating how the depth and width of an MLP, particularly with non-linear activations like ReLU, contribute to its ability to approximate complex functions with many "kinks" or changes in slope. Universal approximation theorems (by Pinkus [Pinkus1999], Kidger [Kidger2020], Yarotsky [Yarotsky2021]) are mentioned, stating that sufficiently wide or deep MLPs can approximate any continuous function to arbitrary accuracy.
    • Training Process: Describes the standard supervised learning paradigm involving splitting data into training, validation, and testing sets (typically 60/20/20). Training involves minimizing a loss function (like Mean Squared Error - MSE or Mean Absolute Error - MAE) over the training set to find optimal network parameters $\theta$. Validation is used to tune hyperparameters, like network architecture or regularization strength, using a validation set. Testing evaluates the performance of the final model on unseen data using a test set, providing an estimate of the generalization error. A minimal training-loop sketch is given after this item's bullets.
    • Generalizability and Regularization: Discusses the problem of overfitting, where the network learns the noise in the training data instead of the underlying pattern. Regularization techniques, such as L1 or L2 penalties added to the loss function, are introduced to mitigate overfitting by discouraging large weights. Large weights are shown to make the network output overly sensitive to input variations, indicating ill-posedness. Regularization encourages "flatter" minima in the loss landscape, which empirically tend to generalize better.
    • Gradient Descent and Optimization: Explains Gradient Descent (GD) as the core algorithm for minimizing the loss function by iteratively updating parameters in the direction opposite to the gradient. Discusses its convergence properties (especially for convex functions) and the impact of the learning rate $\eta$. Advanced optimization algorithms like Momentum and Adam [kingma2017adam] are presented as improvements that use gradient history and adaptive, per-parameter learning rates to accelerate convergence and navigate complex loss landscapes, potentially avoiding oscillations.
    • Stochastic Optimization: Introduces Stochastic Gradient Descent (SGD) and mini-batch GD as necessary methods for training with large datasets, where computing the gradient over the entire dataset (full batch) is computationally prohibitive. SGD uses the gradient of a single sample, while mini-batch GD uses the gradient over a small batch of samples. Mini-batch GD is the practical standard, balancing computational efficiency with gradient accuracy. A decaying learning rate is often necessary for SGD/mini-batch convergence, especially near the minimum.
    • Back-propagation: Details the back-propagation algorithm, an efficient application of the chain rule, for computing the gradients of the loss function with respect to all network parameters. This is crucial for implementing gradient-based optimization. Computational graphs are used to illustrate the flow of forward computation and backward gradient calculation; a small chain-rule check is sketched after this item's bullets.
    • Regression vs. Classification: Distinguishes these two common supervised tasks. Regression predicts continuous outputs (e.g., house price), typically using MSE/MAE loss. Classification predicts discrete categories (e.g., image class), commonly using a Softmax output layer to output probabilities and the Cross-Entropy loss function, which penalizes confident incorrect predictions more severely than MSE. One-hot encoding of labels is discussed for classification.
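
A minimal, hypothetical sketch of the supervised workflow described above, assuming PyTorch: an MLP regressor on synthetic data, a 60/20/20 split, MSE loss, mini-batch Adam, and L2 regularization via `weight_decay`. The data, architecture, and hyperparameter values are illustrative, not taken from the notes.

```python
# Hypothetical sketch (not from the notes): training an MLP regressor
# with mini-batch Adam and L2 regularization, assuming PyTorch.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic 1D regression data: y = sin(2*pi*x) + noise
x = torch.rand(1000, 1)
y = torch.sin(2 * torch.pi * x) + 0.1 * torch.randn_like(x)

# 60/20/20 train/validation/test split
n_train, n_val = 600, 200
x_train, y_train = x[:n_train], y[:n_train]
x_val,   y_val   = x[n_train:n_train + n_val], y[n_train:n_train + n_val]
x_test,  y_test  = x[n_train + n_val:], y[n_train + n_val:]

# MLP: affine layers followed by component-wise nonlinearities (Tanh here)
model = nn.Sequential(
    nn.Linear(1, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, 1),
)

loss_fn = nn.MSELoss()
# weight_decay adds an L2 penalty on the parameters (regularization)
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

batch_size = 64
for epoch in range(200):
    perm = torch.randperm(n_train)
    for i in range(0, n_train, batch_size):           # mini-batch optimization
        idx = perm[i:i + batch_size]
        opt.zero_grad()
        loss = loss_fn(model(x_train[idx]), y_train[idx])
        loss.backward()                                # back-propagation
        opt.step()                                     # parameter update
    if epoch % 50 == 0:
        with torch.no_grad():
            val_loss = loss_fn(model(x_val), y_val)    # used to tune hyperparameters
        print(f"epoch {epoch:3d}  val MSE {val_loss.item():.4f}")

with torch.no_grad():
    print("test MSE:", loss_fn(model(x_test), y_test).item())  # generalization estimate
```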
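And a tiny hypothetical check that back-propagation is just the chain rule applied over the computational graph; the two-parameter "network" and the numerical values are arbitrary.

```python
# Hypothetical sketch: reverse-mode autograd vs. a hand-applied chain rule
# for loss = (w2 * tanh(w1 * x) - y_true)^2.
import torch

w1 = torch.tensor(0.7, requires_grad=True)
w2 = torch.tensor(-1.3, requires_grad=True)
x, y_true = torch.tensor(0.5), torch.tensor(0.2)

# Forward pass through the computational graph
a1 = torch.tanh(w1 * x)
y = w2 * a1
loss = (y - y_true) ** 2
loss.backward()  # reverse traversal accumulates dloss/dw1 and dloss/dw2

# Hand-applied chain rule for comparison
dl_dy = 2 * (y - y_true)
dl_dw2 = dl_dy * a1
dl_dw1 = dl_dy * w2 * (1 - a1 ** 2) * x  # d tanh(u)/du = 1 - tanh(u)^2

print(w1.grad.item(), dl_dw1.item())  # identical up to floating point
print(w2.grad.item(), dl_dw2.item())
```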
  2. Residual Neural Networks (ResNets):
    • Vanishing Gradients: Explains the problem where gradients become extremely small during back-propagation through many layers, preventing effective training of early layers in very deep networks. This limits the practical depth of standard MLPs.
    • ResNets Architecture: Introduces ResNets [he2015deep] and their key feature: skip connections (identity mappings) that add the input of a block directly to its output. The update rule becomes $x^{(l)} = \sigma(W^{(l)} x^{(l-1)} + b^{(l)}) + x^{(l-1)}$ (a block of this form is sketched in code after this item's bullets).
    • Mitigation of Vanishing Gradients: Explains how skip connections enable gradients to propagate more effectively through deep networks, even if the learned transformations ($W^{(l)} x^{(l-1)} + b^{(l)}$) are small. This allows for training networks with significantly greater depth than previously possible.
    • Connection to ODEs: Highlights the analogy between the ResNet layer update and the forward Euler discretization of an Ordinary Differential Equation (ODE). This connection suggests that ResNets implicitly model the evolution of state variables over "depth" as if it were time in an ODE.
    • Neural ODEs: Introduces Neural ODEs [chen2019neural] as models that explicitly parameterize the vector field $\dot{x} = V(x,t)$ of an ODE using a neural network. Solving the ODE numerically (e.g., using Runge-Kutta methods) defines the network's output. Compared to ResNets, Neural ODEs can offer memory efficiency (memory cost is independent of the number of integration steps) and the ability to use adaptive, higher-order ODE solvers.
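
A minimal sketch, assuming PyTorch, of a residual block with the update $x^{(l)} = \sigma(W^{(l)} x^{(l-1)} + b^{(l)}) + x^{(l-1)}$; the dimensions and depth are arbitrary, chosen only to show that many such blocks can be stacked while gradients still flow through the identity paths.

```python
# Hypothetical sketch: a ResNet block whose update mirrors one forward-Euler
# step of dx/dt = V(x), with "depth" playing the role of time.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.act = nn.Tanh()

    def forward(self, x):
        # skip connection: output = learned transformation + identity
        return x + self.act(self.lin(x))

class ResNet(nn.Module):
    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList([ResidualBlock(dim) for _ in range(depth)])

    def forward(self, x):
        # x_{l} = x_{l-1} + V(x_{l-1}), analogous to forward Euler with the step folded into V
        for block in self.blocks:
            x = block(x)
        return x

net = ResNet(dim=16, depth=50)   # very deep, yet trainable thanks to the skips
x0 = torch.randn(8, 16)
print(net(x0).shape)             # torch.Size([8, 16])
```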
  3. Solving PDEs with MLPs (PINNs):
    • Compares traditional numerical methods for PDEs (Finite Difference, Spectral Collocation) with Deep Learning approaches.
    • Finite Difference Method: Briefly describes discretization of the domain, approximating derivatives with finite differences, and solving the resulting system of algebraic equations.
    • Spectral Collocation Method: Explains representing the solution as a sum of global basis functions (e.g., Chebyshev polynomials) and determining coefficients by enforcing the PDE and boundary conditions at a set of collocation points. Introduces a least-squares variant that minimizes a loss function based on PDE and BC residuals at these points.
    • Physics-Informed Neural Networks (PINNs): Presents PINNs [raissi2019] as MLPs $\mathcal{F}(x;\theta)$ that approximate the solution $u(x)$ of a PDE. The network takes the independent variables $x$ as input. Derivatives of the network output with respect to its input (which correspond to derivatives of $u$) are computed efficiently using automatic differentiation (back-propagation). The network is trained by minimizing a loss function composed of PDE residuals evaluated at interior collocation points and boundary condition residuals evaluated at boundary points; a 1D example is sketched in code after this item's bullets. This is analogous to the least-squares spectral method but uses a neural network as the function approximator. Key considerations include choosing appropriate activation functions (smooth ones like Sine) and balancing the weights of different loss terms (e.g., interior vs. boundary loss), potentially using self-adaptive weighting [wang2021understanding, mcclenny2020self, bischof2021multi].
    • Generalization: Explains how PINNs can solve general multi-dimensional, time-dependent, or systems of PDEs by structuring the input and output dimensions appropriately.
    • Error Analysis: For linear PDEs, an error bound is presented [mishraPINNs] showing that the $L^2$ error of the PINN solution is related to the achieved training loss (which decreases with network size) and the density of the collocation points (which reduces discretization error).
    • Data Assimilation: Describes how PINNs can incorporate observational data by adding a data-mismatch term to the loss function, allowing the network to find a solution that both fits sparse data and satisfies the governing physics.
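
A hypothetical PINN sketch, assuming PyTorch autograd, for the 1D Poisson problem $-u''(x) = \pi^2 \sin(\pi x)$ on $(0,1)$ with $u(0)=u(1)=0$ (exact solution $u(x)=\sin(\pi x)$); the collocation counts, architecture, and loss weights are illustrative choices, not the notes' settings.

```python
# Hypothetical PINN sketch for -u''(x) = f(x), u(0) = u(1) = 0.
import torch
import torch.nn as nn

torch.manual_seed(0)
# Smooth activation (Tanh) so second derivatives are well defined
model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
                      nn.Linear(32, 32), nn.Tanh(),
                      nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x_int = torch.rand(200, 1, requires_grad=True)          # interior collocation points
x_bc = torch.tensor([[0.0], [1.0]])                     # boundary points
f = lambda x: torch.pi**2 * torch.sin(torch.pi * x)     # source term

for step in range(5000):
    opt.zero_grad()
    u = model(x_int)
    du = torch.autograd.grad(u, x_int, torch.ones_like(u), create_graph=True)[0]
    d2u = torch.autograd.grad(du, x_int, torch.ones_like(du), create_graph=True)[0]
    loss_pde = ((-d2u - f(x_int)) ** 2).mean()           # PDE residual at interior points
    loss_bc = (model(x_bc) ** 2).mean()                  # boundary-condition residual
    loss = loss_pde + 10.0 * loss_bc                     # weighted sum of the two terms
    loss.backward()
    opt.step()

with torch.no_grad():
    x_plot = torch.linspace(0, 1, 5).unsqueeze(1)
    print(model(x_plot).squeeze())                       # should be close to sin(pi * x_plot)
```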
  4. Convolutional Neural Networks (CNNs):
    • Images as Functions: Views images as discretized functions, highlighting the high dimensionality of pixel data for MLPs and the loss of spatial structure when flattening.
    • Convolutions: Introduces continuous and discrete convolution as an operation where a kernel (filter) is slid over the input, performing local weighted sums. Kernels can perform operations like smoothing or edge detection, acting like finite difference stencils (illustrated in the sketch after this item's bullets).
    • Convolution Layers: Explains convolution layers as applying multiple kernels to an input image (or feature map) to produce output channels (feature maps). Discusses parameter sharing (the same kernel weights are used across the entire input), padding (controlling output size), and stride (downsampling).
    • Pooling Layers: Describes pooling (Max or Average) as non-parameterized operations that reduce the spatial resolution of feature maps (downsampling). This helps create hierarchies of features at different scales, analogous to multi-grid methods.
    • CNN Architecture: Presents the typical CNN for tasks like image classification: alternating convolution and pooling layers for extracting hierarchical spatial features, followed by fully connected layers for classification. Emphasizes the advantages of CNNs over MLPs for image data due to parameter sharing and preservation of spatial locality.
    • Transpose Convolution: Introduces transpose convolution (or deconvolution/fractional-strided convolution) as the inverse operation to convolution, used for upsampling images. Discusses issues like checker-boarding.
    • Image-to-Image Transformations: Describes networks like U-Nets [ronneberger2015unet], which combine downsampling (encoder) and upsampling (decoder) pathways with skip connections, effective for tasks like semantic segmentation or super-resolution. Notes similarities to multi-grid V-cycles.
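
A short hypothetical sketch, assuming PyTorch, showing (i) a fixed 3x3 kernel acting as a discrete Laplacian stencil, i.e. an edge detector, and (ii) a learnable convolution followed by max pooling as used in a CNN encoder; the toy image is arbitrary.

```python
# Hypothetical sketch: a convolution kernel as a finite-difference stencil.
import torch
import torch.nn.functional as F

# Discrete Laplacian stencil, shape (out_channels, in_channels, H, W)
laplacian = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).reshape(1, 1, 3, 3)

# A toy "image": a bright square on a dark background, shape (batch, channels, H, W)
img = torch.zeros(1, 1, 8, 8)
img[:, :, 2:6, 2:6] = 1.0

edges = F.conv2d(img, laplacian, padding=1)   # padding=1 keeps the 8x8 resolution
print(edges[0, 0])                            # nonzero only near the square's boundary

# A learnable convolution layer plus pooling, as in a CNN encoder
conv = torch.nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3, padding=1)
pool = torch.nn.MaxPool2d(kernel_size=2)      # downsampling by a factor of 2
print(pool(conv(img)).shape)                  # torch.Size([1, 4, 4, 4])
```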
  5. Operator Networks:
    • Limitations of PINNs: Points out that standard PINNs solve a PDE for specific inputs (source, BCs, parameters). To solve for a different input function (e.g., a new source term), the network must be retrained. This is inefficient when the goal is to solve the PDE for many different inputs.
    • Parametrized PDEs: Explains how a PINN can be adapted to handle a parameterized input function by adding the parameter to the network input and training over a range of parameter values.
    • Operators: Defines mathematical operators as mappings between function spaces (e.g., mapping a source function $f(\x)$ to a solution function $u(\x)$). Many PDEs define such operators. The goal is to learn these operators directly.
    • Deep Operator Network (DeepONet): Introduces DeepONets [deeponet] as networks designed to approximate operators $\mathcal{N}: A \rightarrow U$. A DeepONet consists of a "branch net" that takes the input function $a$ sampled at a fixed set of sensor points, and a "trunk net" that takes the location $x$ in the output domain. The network output is a combination (typically a dot product) of the outputs of the branch and trunk nets: $\tilde{\mathcal{N}}(x, a) = \sum_k \beta_k(a) \tau_k(x)$. The branch net learns the "coefficients" dependent on the input function, while the trunk net learns "basis functions" dependent on the location (see the sketch after this item's bullets). DeepONets are trained in a supervised manner on pairs of input functions (sampled at the sensors) and their corresponding output function values.
    • Error Analysis: Mentions universal approximation theorems for DeepONets [chen95, sid_deeponet] and error bounds [patel2022variationally] showing that the approximation error depends on network capacity, data quality (numerical solution error), and discretization errors from sampling the input/output functions.
    • Physics-Informed DeepONets (PI-DeepONets): Describes how to combine the data-driven training of DeepONets with physics constraints by adding a PDE residual term to the loss function, similar to PINNs [pi_deeponet]. This can reduce data requirements and improve generalization.
    • Fourier Neural Operators (FNOs): Introduces FNOs [li2020fourier] as another class of operator networks that directly approximate operators between function spaces. The key innovation is using convolution operations in the Fourier domain within the network layers. This leverages the fact that convolution in physical space corresponds to multiplication in Fourier space.
    • Architecture: FNOs extend the MLP structure by having function-valued hidden states. The transformation between layers involves a linear (affine) part and a convolution part.
    • Discretization and Fourier Transforms: Explains that implementing FNOs requires discretizing functions on a grid. A naive discrete convolution is computationally expensive. FNOs use the Fast Fourier Transform (FFT) to compute the convolution efficiently ($O(N \log N)$ instead of $O(N^2)$ for $N$ grid points) by performing multiplication of Fourier coefficients. This makes FNOs practical for high-resolution problems; a single Fourier layer is sketched in code after this item's bullets.
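
A hypothetical DeepONet forward pass, assuming PyTorch: a branch net maps the sensor samples of $a$ to coefficients $\beta_k(a)$, a trunk net maps a location $x$ to basis values $\tau_k(x)$, and the output is their dot product. The layer sizes and number of sensors are illustrative.

```python
# Hypothetical DeepONet sketch: output(a, x) = sum_k beta_k(a) * tau_k(x).
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    def __init__(self, n_sensors: int, d_x: int, p: int = 64):
        super().__init__()
        self.branch = nn.Sequential(nn.Linear(n_sensors, 64), nn.Tanh(), nn.Linear(64, p))
        self.trunk = nn.Sequential(nn.Linear(d_x, 64), nn.Tanh(), nn.Linear(64, p))

    def forward(self, a_sensors, x):
        # a_sensors: (batch, n_sensors) values of the input function at fixed sensor points
        # x:         (batch, d_x)       query locations in the output domain
        beta = self.branch(a_sensors)   # "coefficients" depending on the input function
        tau = self.trunk(x)             # "basis functions" depending on the location
        return (beta * tau).sum(dim=-1, keepdim=True)   # dot product

net = DeepONet(n_sensors=100, d_x=1)
a = torch.randn(32, 100)   # 32 input functions sampled at 100 sensors
x = torch.rand(32, 1)      # one query location per function (could be many)
print(net(a, x).shape)     # torch.Size([32, 1])
```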
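And a hypothetical single 1D Fourier layer in the spirit of FNOs: FFT, multiplication of the retained low-frequency modes by learned complex weights, inverse FFT, plus a pointwise affine path. Channel counts and the number of retained modes are illustrative.

```python
# Hypothetical sketch of one 1D Fourier layer (convolution via FFT).
import torch
import torch.nn as nn

class FourierLayer1d(nn.Module):
    def __init__(self, channels: int, modes: int):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        # Learned complex multipliers for the retained Fourier modes
        self.weights = nn.Parameter(
            scale * torch.randn(channels, channels, modes, dtype=torch.cfloat))
        self.linear = nn.Conv1d(channels, channels, kernel_size=1)  # pointwise affine path

    def forward(self, x):
        # x: (batch, channels, n_grid) -- a function discretized on a uniform grid
        x_hat = torch.fft.rfft(x, dim=-1)                           # O(N log N) via the FFT
        out_hat = torch.zeros_like(x_hat)
        out_hat[:, :, :self.modes] = torch.einsum(                  # convolution = multiplication
            "bim,iom->bom", x_hat[:, :, :self.modes], self.weights) # in Fourier space
        conv = torch.fft.irfft(out_hat, n=x.size(-1), dim=-1)
        return torch.tanh(conv + self.linear(x))

layer = FourierLayer1d(channels=8, modes=12)
u = torch.randn(4, 8, 128)   # batch of 4 functions, 8 channels, 128 grid points
print(layer(u).shape)        # torch.Size([4, 8, 128])
```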
  6. Probabilistic Deep Learning:
    • Motivates the need for probabilistic models when dealing with noisy data, inherent stochasticity (e.g., turbulence), or multi-valued inverse problems. Treats inputs, outputs, and parameters as random variables.
    • Probability Theory Review: Provides a concise review of key concepts: sample space, events, probability laws, random variables (discrete/continuous), cumulative distribution functions (cdf), probability density functions (pdf), expectation (mean), variance, joint and conditional distributions, and independence.
    • Unsupervised Probabilistic Learning (GANs): Focuses on learning the probability distribution $f_X$ of input data samples $\{x_i\}$ and generating new samples. Introduces Generative Adversarial Networks (GANs) [goodfellow2014generative]. A GAN consists of a generator ($g: \Omega_Z \rightarrow \Omega_X$) that maps a simple latent distribution to the data distribution, and a discriminator/critic ($d: \Omega_X \rightarrow \mathbb{R}$) that distinguishes real data from generated data. The networks are trained adversarially in a min-max game. Wasserstein GANs (WGANs) [arjovsky2017wasserstein_proc] and their gradient penalty variant [gulrajani2017improved] are discussed for improved stability. Training involves alternating optimization steps for the critic (maximizing the objective) and the generator (minimizing the objective); a WGAN-GP training cycle is sketched in code after this item's bullets. Theoretical results show the generated distribution can weakly converge to the true distribution.
    • Supervised Probabilistic Learning (Conditional GANs): Addresses learning the conditional distribution $f_{Y|X}(y|\hat{x})$ from paired data $(x_i, y_i)$. Introduces Conditional GANs (cGANs) [mirza2014]. The generator ($g: \Omega_Z \times \Omega_X \rightarrow \Omega_Y$) takes both the latent variable $z$ and the input $x$ to generate a sample $y$. The critic ($d: \Omega_X \times \Omega_Y \rightarrow \mathbb{R}$) distinguishes real pairs $(x, y)$ from fake pairs $(x, g(z, x))$. cGANs, particularly conditional WGANs [adler2018deep, ray2022], are used to learn the conditional distribution and sample from it, which is highly relevant for probabilistic inverse problems in physics.
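
A hypothetical WGAN-GP training cycle on toy 2D data, assuming PyTorch: several critic updates (with gradient penalty) per generator update. The toy target distribution, network sizes, and hyperparameters are illustrative, not taken from the notes.

```python
# Hypothetical WGAN-GP sketch: alternating critic and generator updates.
import torch
import torch.nn as nn

torch.manual_seed(0)
dim_z, dim_x = 2, 2
G = nn.Sequential(nn.Linear(dim_z, 64), nn.ReLU(), nn.Linear(64, dim_x))   # generator g: Z -> X
D = nn.Sequential(nn.Linear(dim_x, 64), nn.ReLU(), nn.Linear(64, 1))       # critic    d: X -> R
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.9))

def sample_real(n):                        # "true" distribution: a shifted Gaussian
    return torch.randn(n, dim_x) + torch.tensor([2.0, -1.0])

def gradient_penalty(real, fake, lam=10.0):
    eps = torch.rand(real.size(0), 1)
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_hat = D(x_hat)
    grads = torch.autograd.grad(d_hat, x_hat, torch.ones_like(d_hat), create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

batch = 128
for it in range(2000):
    for _ in range(5):                                   # several critic steps per generator step
        real = sample_real(batch)
        fake = G(torch.randn(batch, dim_z)).detach()
        loss_d = D(fake).mean() - D(real).mean() + gradient_penalty(real, fake)
        opt_d.zero_grad()
        loss_d.backward()
        opt_d.step()
    fake = G(torch.randn(batch, dim_z))
    loss_g = -D(fake).mean()                             # generator tries to raise the critic's score
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

print(G(torch.randn(1000, dim_z)).mean(dim=0))           # should drift toward [2.0, -1.0]
```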

Overall, the notes effectively connect concepts from computational physics, such as discretization, basis functions, finite element/difference stencils, and multi-grid methods, to the underlying mechanisms and architectures of deep learning models like MLPs, ResNets, CNNs, PINNs, DeepONets, FNOs, and GANs. This dual perspective provides a richer understanding of both fields and practical insights into implementing and applying deep learning to solve complex problems in physics and engineering.

Authors (3)
  1. Deep Ray
  2. Orazio Pinti
  3. Assad A. Oberai