Deep Neural Network Catapult Phase

Updated 10 April 2026

The catapult phase is a training regime of deep neural networks in which a learning rate above the conventional stability threshold induces an initial loss spike followed by a rapid reduction in curvature, so that training ends in a much flatter minimum.
It is characterized by non-monotonic dynamics: gradient descent temporarily increases the loss before settling into a stable, flatter minimum, which is associated with better generalization.
Empirical observations across deep linear and nonlinear architectures indicate that training in the catapult phase often yields higher test accuracy and better feature alignment than training with small learning rates.

A deep neural network enters the "catapult phase" when trained with a learning rate above the stability bound of the locally linearized dynamics but below a second, architecture-dependent divergence threshold. In this regime, gradient descent temporarily drives up the loss and sharply reduces the curvature of the loss landscape before settling into a minimum that is significantly flatter than those found using small learning rates. The catapult phase has been observed robustly in deep linear and nonlinear networks, under both full-batch and stochastic gradient descent, and is strongly associated with superior generalization performance across a range of realistic tasks. This nonperturbative phenomenon bridges the gap between linearized, infinite-width theory and the fully nonlinear, large-learning-rate dynamics prevalent in practical training scenarios (Lewkowycz et al., 2020, Zhu et al., 2023, Zhu et al., 2022, Meltzer et al., 2023, Huang et al., 2020).

1. Formal Definition and Learning-Rate Regimes

Let $f(\theta;x)$ be a neural network with parameters $\theta$, and let $L(\theta)$ denote the mean squared error (MSE) loss. The training dynamics exhibit three distinct regimes as the learning rate $\eta$ is varied:

NTK (lazy) phase ($\eta<\eta_1$): Training evolves close to linearly around initialization, governed by the Neural Tangent Kernel (NTK). The top eigenvalue of the Hessian or NTK, $\lambda_0$, remains approximately constant. All parameter-space directions are contracted.
Catapult phase ($\eta_1<\eta<\eta_2$, with $\eta_1=2/\lambda_0$): The linearized model predicts divergence, but the finite-width nonlinear network displays non-monotonic dynamics: the training loss initially spikes (the "catapult"), then quickly drops as the top curvature eigenvalue $\lambda_t$ decreases dramatically, and training stabilizes once $\lambda_t < 2/\eta$. The minimum reached is much flatter, with final curvature $\lambda_{\mathrm{final}} < 2/\eta < \lambda_0$.
Divergent phase ($\eta>\eta_2$): Training is unstable and diverges completely, or the network collapses (e.g., to dead ReLU units).

Empirically, in practical ReLU networks, $\eta_2 \approx c/\lambda_0$ with $c \approx 12$; the theoretical toy model yields $\eta_2 = 4/\lambda_0$ for two-layer linear networks (Lewkowycz et al., 2020, Meltzer et al., 2023).
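As a small worked example of the three regimes, the sketch below classifies a learning rate from the initial top eigenvalue. It assumes the lower threshold $2/\lambda_0$ and an upper threshold of the form $c/\lambda_0$; the function name and the default `c=4.0` (the two-layer linear toy-model value) are illustrative choices, and for a real network `c` would have to be measured.

```python
def learning_rate_regime(eta, lam0, c=4.0):
    """Classify a learning rate given the initial top eigenvalue lam0.

    Assumptions (not a general rule): the lazy/catapult boundary is 2 / lam0
    and the catapult/divergent boundary is c / lam0, with c = 4 for the
    two-layer linear toy model. Behavior exactly at a boundary is not
    meaningful; the comparisons below just pick one side.
    """
    if eta <= 0 or lam0 <= 0:
        raise ValueError("eta and lam0 must be positive")
    eta_1 = 2.0 / lam0   # stability bound of the linearized dynamics
    eta_2 = c / lam0     # assumed divergence threshold
    if eta < eta_1:
        return "lazy"
    if eta < eta_2:
        return "catapult"
    return "divergent"


# Hypothetical numbers: lam0 = 0.5 gives eta_1 = 4 and eta_2 = 8 when c = 4.
for eta in (1.0, 6.0, 10.0):
    print(eta, learning_rate_regime(eta, lam0=0.5))
```

Passing a larger `c` widens the catapult window accordingly, so the same learning rate can be "divergent" under the toy-model constant and "catapult" under an empirically measured one.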

2. Catapult Dynamics: Theory and Toy Models

The canonical toy model is a one-hidden-layer linear network of width $m$, $f = m^{-1/2}\, v^{\top} u\, x$, trained on the single example $(x, y) = (1, 0)$ with the squared loss $L = \tfrac{1}{2} f^2$. Recursion relations derived for the output $f_t$ and an NTK-like measure $\lambda_t = m^{-1}\left(\|u_t\|^2 + \|v_t\|^2\right)$ clarify the phase transitions:

$$f_{t+1} = \left(1 - \eta\lambda_t + \frac{\eta^2 f_t^2}{m}\right) f_t$$

$$\lambda_{t+1} = \lambda_t + \frac{\eta f_t^2}{m}\left(\eta\lambda_t - 4\right)$$

In the catapult phase, the unstable dynamics inflate $|f_t|$, and the leading curvature $\lambda_t$ drops via nonlinear feedback until $\lambda_t < 2/\eta$ and convergence resumes. The essence of the catapult is captured even in low-dimensional quadratic models, which demonstrate the universality and analytic tractability of this transition (Zhu et al., 2022, Meltzer et al., 2023).

Regime	Learning Rate $\eta$	Behavior
NTK/Lazy Phase	$\eta < 2/\lambda_0$	Monotonic loss decay, constant curvature
Catapult Phase	$2/\lambda_0 < \eta < \eta_2$	Loss spike, rapid curvature drop, flat minima
Divergence	$\eta > \eta_2$	Loss explodes, no convergence
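The three regimes can be reproduced numerically with a few lines. The sketch below iterates the two-variable recursion of the two-layer linear toy model of Lewkowycz et al. (2020) for the output $f_t$ and curvature $\lambda_t$ (the update rules are written out in the code). The initial values, the width of 1000, and the blow-up cutoff are hypothetical choices made only for illustration.

```python
def simulate_toy_model(eta, f0=1.0, lam0=1.0, width=1000, steps=200, blowup=1e8):
    """Iterate the toy-model recursion and return (losses, curvatures).

    Update rules (single example, squared loss L = f**2 / 2):
        f_next   = f * (1 - eta * lam + eta**2 * f**2 / width)
        lam_next = lam + (eta * f**2 / width) * (eta * lam - 4)
    The loop stops early once |f| exceeds `blowup`, which is treated
    as divergence (an arbitrary cutoff, not part of the model).
    """
    f, lam = f0, lam0
    losses, curvatures = [0.5 * f * f], [lam]
    for _ in range(steps):
        f_sq = f * f
        f_next = f * (1.0 - eta * lam + eta**2 * f_sq / width)
        lam_next = lam + (eta * f_sq / width) * (eta * lam - 4.0)
        f, lam = f_next, lam_next
        losses.append(0.5 * f * f)
        curvatures.append(lam)
        if abs(f) > blowup:
            break
    return losses, curvatures


# With lam0 = 1 the toy-model thresholds are eta_1 = 2 and eta_2 = 4.
for eta in (1.0, 3.0, 5.0):
    losses, curvatures = simulate_toy_model(eta)
    print(f"eta={eta}: peak loss={max(losses):.3g}, "
          f"final loss={losses[-1]:.3g}, final lambda={curvatures[-1]:.3g}")
```

Under these settings, $\eta = 1$ gives a monotonically decreasing loss with nearly unchanged curvature, $\eta = 3$ produces a loss spike after which the curvature settles below $2/\eta$, and $\eta = 5$ hits the blow-up cutoff within a few steps.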

3. Empirical Manifestations: NTK, Loss, and Feature Learning

Across architectures (MLPs, CNNs, WideResNets) and datasets (MNIST, CIFAR, SVHN), the catapult phase is evidenced by:

Training loss: Characteristic spike followed by rapid decay, localized to the subspace spanned by leading NTK/Hessian eigenvectors.
Curvature: Top NTK or Hessian eigenvalue $\lambda_t$ rapidly diminishes to a value below $2/\eta$, confirming migration to a flatter region.
Generalization: Test accuracy typically peaks in the catapult window, outperforming both smaller and excessively large learning rates (Lewkowycz et al., 2020, Zhu et al., 2023, Zhu et al., 2022).

Practical SGD, even with small batch sizes, produces catapults whenever $\eta$ momentarily exceeds the critical value $2/\lambda_{\max}$ for the current batch, where $\lambda_{\max}$ is the top NTK eigenvalue evaluated on that batch. Smaller batch sizes introduce more variance in NTK eigenvalues, thereby triggering more frequent catapult events and facilitating superior feature learning via alignment with the average gradient outer product (AGOP) of the true predictor (Zhu et al., 2023).
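A minimal sketch of this per-batch check follows. It assumes the batch loss is the mean of $\tfrac{1}{2}(f - y)^2$, so that the batch critical learning rate is $2/\lambda_{\max}(J_B J_B^{\top}/B)$ with $J_B$ the per-sample Jacobian of the batch. The tiny tanh network, the random data, all sizes, and the value of `eta` are hypothetical and stand in for a real model.

```python
import numpy as np


def per_sample_jacobian(W, v, X):
    """Rows are d f(x_i) / d(theta) for f(x) = v . tanh(W x) / sqrt(m)."""
    m = W.shape[0]
    H = np.tanh(X @ W.T)                     # hidden activations, shape (n, m)
    dv = H / np.sqrt(m)                      # df/dv, shape (n, m)
    dpre = (1.0 - H**2) * v / np.sqrt(m)     # df/d(pre-activation), shape (n, m)
    dW = dpre[:, :, None] * X[:, None, :]    # df/dW, shape (n, m, d)
    return np.concatenate([dv, dW.reshape(len(X), -1)], axis=1)


def batch_top_eigenvalue(J_batch):
    """Top eigenvalue of the batch NTK J J^T / B (mean-of-squares loss assumed)."""
    B = J_batch.shape[0]
    return np.linalg.eigvalsh(J_batch @ J_batch.T / B)[-1]


def fraction_of_catapult_batches(J, batch_size, eta, n_batches=200, seed=0):
    """Fraction of random batches whose critical rate 2 / lambda_max is below eta."""
    rng = np.random.default_rng(seed)
    lam = np.array([
        batch_top_eigenvalue(J[rng.choice(len(J), batch_size, replace=False)])
        for _ in range(n_batches)
    ])
    return float(np.mean(eta > 2.0 / lam)), lam


# Hypothetical setup: random inputs and a randomly initialized network.
rng = np.random.default_rng(0)
n, d, m = 512, 10, 64
X = rng.normal(size=(n, d))
W = rng.normal(size=(m, d)) / np.sqrt(d)
v = rng.normal(size=m)
J = per_sample_jacobian(W, v, X)

eta = 1.0  # illustrative learning rate
for B in (4, 64):
    frac, lam = fraction_of_catapult_batches(J, B, eta)
    print(f"batch size {B}: fraction above threshold={frac:.2f}, "
          f"std of lambda_max={lam.std():.3f}")
```

Comparing the printed spread of $\lambda_{\max}$ across batch sizes gives a rough sense of how often a fixed $\eta$ would cross the per-batch threshold; in an actual training run the Jacobian changes every step, so the check would have to be repeated along the trajectory.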

4. Mechanistic Interpretation and Generalization Impact

The critical mechanism underlying the catapult is nonlinear feedback: in the unstable phase, loss growth amplifies leading eigenmodes, provoking a sharp reduction in the NTK/Hessian spectrum (i.e., global curvature). Convergence resumes in a much flatter valley, which is empirically associated with improved generalization. This is a deterministic contraction to flat regions, in contrast to stochastic flattening via noise.
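As a worked illustration of this feedback, consider the two-layer linear toy model of Lewkowycz et al. (2020) with width $m$, output $f_t$, loss $L_t = \tfrac{1}{2} f_t^2$, and curvature $\lambda_t$. Its curvature update can be written as

$$\lambda_{t+1} - \lambda_t = \frac{\eta f_t^2}{m}\left(\eta\lambda_t - 4\right) = \frac{2\eta L_t}{m}\left(\eta\lambda_t - 4\right).$$

The right-hand side is negative whenever $\eta\lambda_t < 4$ and is proportional to the current loss, so in this model the curvature falls fastest exactly when the loss spike is largest. The factor $1/m$ makes the change negligible while $L_t = O(1)$, which recovers the lazy regime at large width; the feedback only becomes significant once the loss has grown to $O(m)$. For $\eta\lambda_t > 4$ the sign flips and the curvature grows, which corresponds to the divergent phase of the toy model.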

Repeated catapults (e.g., via multiple learning-rate increases or frequent SGD batch transitions) drive enhanced feature alignment and learning, as measured by AGOP alignment. The magnitude of generalization improvements is most pronounced for low-rank target functions or tasks (Zhu et al., 2023).
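The AGOP of a predictor $f$ over inputs $x_1,\dots,x_n$ is $\frac{1}{n}\sum_i \nabla_x f(x_i)\,\nabla_x f(x_i)^{\top}$. The sketch below computes it by central finite differences and, as an assumption about how alignment is scored, compares two AGOP matrices by cosine similarity under the Frobenius inner product. The target functions and sample sizes are hypothetical examples of low-rank targets.

```python
import numpy as np


def agop(f, X, eps=1e-5):
    """Average gradient outer product of a scalar function f over the rows of X.

    f maps an (n, d) array to an (n,) array; input gradients are estimated
    with central differences, one coordinate at a time.
    """
    n, d = X.shape
    grads = np.empty((n, d))
    for j in range(d):
        step = np.zeros(d)
        step[j] = eps
        grads[:, j] = (f(X + step) - f(X - step)) / (2 * eps)
    return grads.T @ grads / n


def agop_alignment(G, G_ref):
    """Cosine similarity between two AGOP matrices (Frobenius inner product)."""
    return float(np.sum(G * G_ref) / (np.linalg.norm(G) * np.linalg.norm(G_ref)))


# Hypothetical low-rank target: depends only on the first two coordinates.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
target = lambda Z: Z[:, 0] * Z[:, 1]
unrelated = lambda Z: Z[:, 2] * Z[:, 3]   # depends on different coordinates

G_target = agop(target, X)
print("self alignment:     ", agop_alignment(G_target, G_target))
print("unrelated alignment:", agop_alignment(agop(unrelated, X), G_target))
```

In practice `f` would be the trained network evaluated on held-out inputs, with gradients taken by automatic differentiation rather than finite differences.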

5. Extensions: Losses, Architectures, and Theory

The bulk of analytic understanding is for MSE loss and full-batch gradient descent. Catapult boundaries are shifted by momentum and adaptive optimizers (for example, with heavy-ball momentum coefficient $\beta$, instability in SGD requires $\eta > 2(1+\beta)/\lambda_0$). Cross-entropy loss, with its non-constant Hessian, complicates direct analysis, though catapult-like dynamics are observed empirically.
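The shift of the stability boundary under momentum can be checked on a one-dimensional quadratic. The sketch below assumes heavy-ball momentum in the form $v_{t+1} = \beta v_t - \eta \nabla L$, $\theta_{t+1} = \theta_t + v_{t+1}$, for which the classical stability bound on $L(\theta) = \tfrac{1}{2}\lambda\theta^2$ is $\eta < 2(1+\beta)/\lambda$; other momentum parametrizations (for example with dampening) move the boundary differently. Function names and the numbers used are illustrative.

```python
def heavy_ball_on_quadratic(eta, beta, lam, steps=200, x0=1.0):
    """Run heavy-ball momentum on L(x) = 0.5 * lam * x**2; return |x| at the end."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - eta * lam * x   # gradient of the quadratic is lam * x
        x = x + v
    return abs(x)


def momentum_stability_threshold(beta, lam):
    """Largest stable learning rate for heavy-ball momentum on curvature lam."""
    return 2.0 * (1.0 + beta) / lam


lam, beta = 1.0, 0.9
threshold = momentum_stability_threshold(beta, lam)
print("threshold:", threshold)
print("just below:", heavy_ball_on_quadratic(0.95 * threshold, beta, lam))
print("just above:", heavy_ball_on_quadratic(1.05 * threshold, beta, lam))
# eta = 3 exceeds the plain gradient-descent bound 2 / lam but not the momentum bound.
print("eta=3, no momentum:  ", heavy_ball_on_quadratic(3.0, 0.0, lam))
print("eta=3, with momentum:", heavy_ball_on_quadratic(3.0, beta, lam))
```

Setting `beta = 0` recovers the plain gradient-descent bound $2/\lambda$, so the same learning rate can be unstable without momentum and stable with it.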

Quadratic models (second-order Taylor approximations of the network in its parameters) replicate the essential catapult phenomenon and generalization trends of neural networks, while linear/NTK models do not, indicating that a nonlinear dependence on the parameters, at least at second order, is the minimal requirement for catapult behavior (Zhu et al., 2022). Extensions to nonlinear, homogeneous two-layer networks confirm the universality of the phase structure (Meltzer et al., 2023).

6. Training Protocols, Limitations, and Open Challenges

Practical guidelines for exploiting the catapult phase include:

Estimating the initial curvature ($\lambda_0$, via Lanczos or similar methods)
Selecting $2/\lambda_0 < \eta < c/\lambda_0$, with $c \approx 12$ reported for deep ReLU networks (Lewkowycz et al., 2020)
Employing learning-rate warmup to stabilize initial catapult transitions
Employing learning-rate decay to refine convergence after entry into the flat basin
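The guidelines above can be sketched as follows. The curvature estimate uses plain power iteration on finite-difference Hessian-vector products, a simpler stand-in for Lanczos that assumes the largest-magnitude Hessian eigenvalue is the positive top eigenvalue. The schedule (linear warmup to a peak of `c / lam0`, then cosine decay) and every default value are hypothetical choices, not settings prescribed by the cited papers.

```python
import math
import numpy as np


def top_hessian_eigenvalue(grad_fn, theta, iters=100, eps=1e-4, seed=0):
    """Estimate the top Hessian eigenvalue by power iteration.

    grad_fn(theta) must return the loss gradient as an array shaped like theta.
    Hessian-vector products are approximated by central differences of grad_fn.
    """
    rng = np.random.default_rng(seed)
    v = rng.normal(size=theta.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)
        lam = float(v @ hv)              # Rayleigh quotient for the current direction
        norm = np.linalg.norm(hv)
        if norm == 0.0:
            break
        v = hv / norm
    return lam


def lr_at_step(step, lam0, c=3.0, warmup_steps=100, total_steps=1000):
    """Linear warmup to peak = c / lam0, then cosine decay to zero.

    c is an assumed target inside the catapult window (above 2, below the
    architecture-dependent upper constant).
    """
    peak = c / lam0
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Hypothetical check on a quadratic loss 0.5 * theta^T A theta (gradient A @ theta).
A = np.diag([1.0, 2.0, 5.0])
lam0 = top_hessian_eigenvalue(lambda th: A @ th, np.zeros(3))
print("estimated lambda_0:", lam0)
print("peak learning rate:", lr_at_step(99, lam0), "vs 2/lambda_0 =", 2.0 / lam0)
```

For a real network, `grad_fn` would wrap a full-batch (or large-batch) gradient computation, and automatic-differentiation Hessian-vector products would replace the finite differences.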

Key limitations and open questions include:

Precise analytic characterization of the catapult phase for general nonlinear and finite-width networks remains incomplete
The dependence of the upper-bound constant $c$ (in $\eta_2 \approx c/\lambda_0$) on architecture, depth, and normalization is not known in closed form
The global picture for cross-entropy loss and adaptive optimization schemes is not yet systematically developed
While strong empirical and toy-model evidence exists for improved generalization via catapulting, a fully rigorous, general theory is an unsolved problem (Lewkowycz et al., 2020, Zhu et al., 2023, Huang et al., 2020)

7. Broader Implications and Universality

The catapult phase elucidates why empirically successful large learning rates in deep-network training can exceed the NTK regime's purported stability limit. By leveraging initial instability, networks traverse out of sharp, initialization-dominated basins and into broad, flat minima, translating into improved test-time behavior. This phenomenon has been robustly identified across architectures, tasks, and optimization settings, and forms a central piece in the modern understanding of implicit bias, curvature manipulation, and learning dynamics in deep neural networks (Lewkowycz et al., 2020, Zhu et al., 2023, Zhu et al., 2022, Meltzer et al., 2023, Huang et al., 2020).