LL-GLM: Last-Layer Linearization
- LL-GLM is a method that transforms deep neural networks into generalized linear models by linearizing only the final layer while keeping the backbone fixed.
- It leverages closed-form optimization and Bayesian inference for uncertainty quantification, improving predictability and interpretability.
- Empirical and theoretical analyses demonstrate that LL-GLM achieves faster, more efficient training with competitive uncertainty metrics compared to standard approaches.
Last-layer linearization (often termed "LL-GLM" in the literature) refers to the methodology of reducing a deep neural network to an effective generalized linear model (GLM) by either optimizing, linearizing, or performing inference solely in the output (last) linear layer, with the backbone or feature-extracting layers treated as deterministic functions. This approach brings together closed-form optimization, uncertainty quantification, and Bayes-optimality for the final layer while retaining the advantages of deep nonlinear feature extraction. LL-GLM has become central in scalable Bayesian inference, rigorous uncertainty quantification, efficient optimization, and interpretability for modern neural networks across supervised learning regimes.
1. Mathematical Formalism and Problem Setup
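Let $f(x) = W\phi_\theta(x)$, where $x$ is the input, $\phi_\theta$ denotes the backbone feature map parameterized by $\theta$, and $W$ is the final linear layer. The canonical loss over $n$ samples is the regularized squared loss: $$\mathcal{L}(\theta, W) = \sum_{i=1}^{n} \lVert y_i - W\phi_\theta(x_i) \rVert_2^2 + \lambda \lVert W \rVert_F^2.$$ In the Bayesian treatment, $W$ is endowed with a (matrix) normal prior and inference proceeds by optimizing the marginal likelihood over both backbone parameters and observation-level noise (homoscedastic or heteroscedastic).
For Bayesian LL-GLM, the generative model for $y$ is: $$y = W\phi_\theta(x) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}\left(0, \sigma^2(x) I\right), \qquad \mathrm{vec}(W) \sim \mathcal{N}\left(0, \sigma_w^2 I\right),$$ with $\sigma^2(x)$ constant in the homoscedastic case. The marginal likelihood (evidence) after integrating out $W$ is analytically tractable, allowing for end-to-end training by backpropagation and tractable posterior predictive distributions (Fiedler et al., 2023, Wang et al., 2024).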
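As a rough illustration of why the evidence is tractable, the sketch below evaluates the log marginal likelihood in the simplest setting: scalar outputs, an isotropic Gaussian prior on the last-layer weights, and homoscedastic Gaussian noise. These modelling choices, and the names `log_evidence`, `prior_var`, and `noise_var`, are assumptions made for the example rather than the exact setup of the cited works.

```python
import numpy as np

def log_evidence(Phi, y, prior_var=1.0, noise_var=0.1):
    """Log marginal likelihood of y with the last-layer weights integrated out.

    Assumed model (illustrative): y = Phi @ w + eps, w ~ N(0, prior_var * I),
    eps ~ N(0, noise_var * I), hence y ~ N(0, K) with
    K = prior_var * Phi @ Phi.T + noise_var * I.
    """
    n = Phi.shape[0]
    K = prior_var * Phi @ Phi.T + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)                    # K = L @ L.T
    alpha = np.linalg.solve(L, y)                # L^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))   # log det K
    return -0.5 * (alpha @ alpha + log_det + n * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 8))                   # stand-in for backbone features
w_true = rng.normal(size=8)
y = Phi @ w_true + np.sqrt(0.1) * rng.normal(size=50)
print(log_evidence(Phi, y))
```

Because the returned value is a smooth function of the features, a differentiable framework could in principle backpropagate through it to the backbone parameters; the NumPy version here only evaluates it.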
2. Closed-Form Optimization and Alternating Minimization
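Under squared loss with $\ell_2$ regularization of strength $\lambda$, the optimal last-layer weights $W$ for fixed backbone parameters $\theta$ are obtained by ridge regression: $$W^*(\theta) = Y^\top \Phi_\theta \left(\Phi_\theta^\top \Phi_\theta + \lambda I\right)^{-1},$$ where $\Phi_\theta \in \mathbb{R}^{n \times d}$ is the feature matrix and $Y \in \mathbb{R}^{n \times k}$ is the output matrix. LL-GLM proceeds by alternating:
- Fixing $\theta$, solving for $W^*(\theta)$ in closed form.
- Taking a gradient step in $\theta$ with $W = W^*(\theta)$ held fixed, followed by re-solving $W^*(\theta)$.
The update does not require backpropagation through the matrix inverse, leveraging the envelope theorem: $\nabla_\theta \mathcal{L}(\theta, W^*(\theta)) = \partial_\theta \mathcal{L}(\theta, W)\big|_{W = W^*(\theta)}$, because the partial derivative with respect to $W$ vanishes at the minimizer. The deterministic algorithm alternates backbone gradient steps and closed-form last-layer solutions (Galashov et al., 6 Oct 2025). In stochastic mini-batch settings, a proximal regularization is applied for stability and efficiency. Empirically, this procedure outperforms vanilla SGD, especially at large batches or low noise, and is competitive or superior to alternatives in tasks such as regression and classification (CIFAR-10/100, ImageNet), as well as applied operator learning and two-stage instrumental variable (IV) regression (Galashov et al., 6 Oct 2025).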
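The alternating scheme can be sketched in a few lines of NumPy. The one-layer tanh backbone, the synthetic data, the hyperparameters, and the step-halving safeguard are all assumptions chosen to keep the example small; this is not the reference implementation from the cited paper.

```python
import numpy as np

def features(X, A):
    """Toy backbone (hypothetical): a single tanh layer with weights A."""
    return np.tanh(X @ A)

def solve_last_layer(Phi, Y, lam):
    """Closed-form ridge solution for the last layer, shape (k, d)."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ Y).T

def loss(X, Y, A, W, lam):
    R = features(X, A) @ W.T - Y
    return np.sum(R ** 2) + lam * np.sum(W ** 2)

def reduced_loss(X, Y, A, lam):
    """Loss with the last layer re-solved exactly for the given backbone."""
    W = solve_last_layer(features(X, A), Y, lam)
    return loss(X, Y, A, W, lam)

def backbone_grad(X, Y, A, W):
    """Gradient w.r.t. A with W held fixed; no derivative of the inverse."""
    Phi = features(X, A)
    R = Phi @ W.T - Y                        # residuals, shape (n, k)
    dPhi = 2.0 * R @ W                       # dL/dPhi, shape (n, d)
    return X.T @ (dPhi * (1.0 - Phi ** 2))   # chain rule through tanh

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = np.sin(X @ rng.normal(size=(5, 2)))      # synthetic targets
A = 0.5 * rng.normal(size=(5, 16))
lam, lr = 1e-2, 1e-3

history = []
for step in range(100):
    W = solve_last_layer(features(X, A), Y, lam)    # exact last-layer solve
    current = loss(X, Y, A, W, lam)
    history.append(current)
    g = backbone_grad(X, Y, A, W)
    # Safeguard added for this sketch: halve the step size until the
    # re-solved objective does not increase.
    while reduced_loss(X, Y, A - lr * g, lam) > current and lr > 1e-12:
        lr *= 0.5
    A = A - lr * g                                  # backbone gradient step
history.append(reduced_loss(X, Y, A, lam))
print(history[0], history[-1])
```

Note that `backbone_grad` never differentiates `solve_last_layer`: since `W` minimizes the loss for the current features, the gradient taken with `W` held fixed coincides with the gradient of the re-solved objective.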
3. Bayesian Last-Layer Inference and Uncertainty Quantification
LL-GLM provides a principled Bayesian mechanism for uncertainty quantification by treating the last-layer weights as random variables, with all other parameters fixed or optimized by empirical Bayes. For multivariate regression under general noise models, the full posterior is available in closed form:
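- Posterior mean (of $W^\top$): $M_N = \Sigma_N \Phi^\top \Lambda^{-1} Y$
- Posterior covariance (shared across output dimensions): $\Sigma_N = \left(\Phi^\top \Lambda^{-1} \Phi + \sigma_w^{-2} I\right)^{-1}$, where $\Phi$ is the $n \times d$ feature matrix, $\Lambda = \mathrm{diag}\left(\sigma^2(x_1), \dots, \sigma^2(x_n)\right)$ collects the observation noise variances, and $\sigma_w^2$ is the prior variance of the last-layer weights.
For a test input $x_*$, the predictive distribution decomposes the variance into aleatoric and epistemic components: $$\mathrm{Var}[y_* \mid x_*, \mathcal{D}] = \underbrace{\sigma^2(x_*)}_{\text{aleatoric}} + \underbrace{\phi_*^\top \Sigma_N \phi_*}_{\text{epistemic}},$$ where $\phi_* = \phi_\theta(x_*)$, and $\sigma^2(x)$ models input-dependent noise. This yields a single-forward-pass algorithm for decomposed uncertainty (Wang et al., 2024). Tuning the last-layer prior, e.g., its variance $\sigma_w^2$, enables well-calibrated uncertainty extrapolation beyond the training manifold (Fiedler et al., 2023).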
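A minimal sketch of this closed-form posterior and the variance decomposition follows, assuming scalar outputs, per-sample noise variances that are already known, and an isotropic Gaussian prior on the weights. The function names and the synthetic data are hypothetical; in practice the noise variances would come from a learned noise head.

```python
import numpy as np

def last_layer_posterior(Phi, y, noise_var, prior_var=1.0):
    """Gaussian posterior over last-layer weights for scalar outputs.

    noise_var is a length-n vector of per-sample noise variances
    (heteroscedastic); pass a constant vector for the homoscedastic case.
    """
    d = Phi.shape[1]
    Phi_w = Phi / noise_var[:, None]                   # Lambda^{-1} Phi
    precision = Phi.T @ Phi_w + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ (Phi_w.T @ y)
    return mean, cov

def predict(phi_star, noise_var_star, mean, cov):
    """Predictive mean, aleatoric variance, epistemic variance for one input."""
    epistemic = phi_star @ cov @ phi_star              # weight uncertainty
    aleatoric = noise_var_star                         # observation noise
    return phi_star @ mean, aleatoric, epistemic

rng = np.random.default_rng(1)
Phi = rng.normal(size=(100, 6))                        # stand-in features
noise_var = 0.05 + 0.1 * rng.uniform(size=100)
w_true = rng.normal(size=6)
y = Phi @ w_true + np.sqrt(noise_var) * rng.normal(size=100)

mean, cov = last_layer_posterior(Phi, y, noise_var)
near = Phi[0]                  # feature vector seen during training
far = 10.0 * Phi[0]            # same direction, far outside the data scale
print(predict(near, 0.1, mean, cov))
print(predict(far, 0.1, mean, cov))
```

The epistemic term is a quadratic form in the test features, so it grows as the features move away from the training data, while the aleatoric term is unaffected by the amount of data.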
4. Theoretical Guarantees and Limits
Random matrix theory and the infinite-width limit (NTK regime) provide rigorous support for LL-GLM:
- In the NTK regime, LL-GLM achieves global minimization of the objective $\mathcal{L}(\theta, W^*(\theta))$, i.e., the loss evaluated at the closed-form last-layer solution, provided the tangent kernel is positive-definite and the feature rank is sufficient. The projected outputs and features evolve under kernel gradient flow, and the residual error contracts to zero as training time $t \to \infty$ (Galashov et al., 6 Oct 2025).
- Full-network (“DNN-GLM”) and last-layer (“LL-GLM”) Bayesian linearizations offer identical asymptotic uncertainty quantification in the double-scaling limit. The free energy of the last-layer conjugate kernel matches that of the full NTK kernel under broad conditions (Wilson et al., 29 May 2026). Empirically, the uncertainty metrics (RMSE, NLL, ECE for regression; distance-aware AUC-ROC and LPPD for classification and OOD tasks) are statistically indistinguishable up to 3–4.
Theoretical tables:
| Regime | LL-GLM | DNN-GLM | Reference |
|---|---|---|---|
| Infinite width/data | Exact UQ (CK/NTK) | Exact UQ (NTK) | (Wilson et al., 29 May 2026) |
| Finite width, n5d | Small underperf. | Exact | (Wilson et al., 29 May 2026) |
A plausible implication is that UQ for post-hoc neural predictors is dominated by uncertainty at the feature-to-output mapping, not in the backbone feature extraction (Wilson et al., 29 May 2026).
5. Advanced Algorithms and High-Dimensional Extensions
Approximate Message Passing (AMP) theory situates LL-GLM as a specialization of single-index models for high-dimensional estimation in random linear systems with nonlinear output channels. When all but the last layer are linear, the ML-AMP algorithm reduces to an efficient iterative G-AMP/OAMP procedure, with state evolution equations describing the empirical Bayes-optimal MMSE for signal recovery (Manoel et al., 2017). The replica potential and scalar fixed-point recursions coincide with traditional GLMs, simplifying the analysis.
In the context of optimization, stochastic LL-GLM combines a batch-wise proximal loss with a Kalman-filter interpretation for the last layer, balancing the current batch error against the previous running estimate. This is critical for stability and efficiency in large-scale or small-batch scenarios (Galashov et al., 6 Oct 2025).
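One way to picture the proximal last-layer step is as a ridge problem that is pulled toward the previous estimate instead of toward zero. The sketch below uses a generic proximal objective, $\lVert \Phi_b W^\top - Y_b \rVert_F^2 + \beta \lVert W - W_{\text{prev}} \rVert_F^2$, which is an assumption for illustration and not necessarily the exact update of the cited paper. The names `proximal_update`, `gain_form_update`, and `beta` are hypothetical.

```python
import numpy as np

def proximal_update(Phi_b, Y_b, W_prev, beta):
    """Minimize ||Phi_b W^T - Y_b||_F^2 + beta * ||W - W_prev||_F^2 over W."""
    d = Phi_b.shape[1]
    A = Phi_b.T @ Phi_b + beta * np.eye(d)
    rhs = Phi_b.T @ Y_b + beta * W_prev.T
    return np.linalg.solve(A, rhs).T           # shape (k, d)

def gain_form_update(Phi_b, Y_b, W_prev, beta):
    """Same update written as 'previous estimate + gain * innovation'."""
    d = Phi_b.shape[1]
    A = Phi_b.T @ Phi_b + beta * np.eye(d)
    innovation = Y_b - Phi_b @ W_prev.T        # batch prediction error
    return W_prev + np.linalg.solve(A, Phi_b.T @ innovation).T

rng = np.random.default_rng(3)
d, k, batch = 8, 2, 32
W_true = rng.normal(size=(k, d))
W = np.zeros((k, d))
for _ in range(50):
    Phi_b = rng.normal(size=(batch, d))        # stand-in for backbone features
    Y_b = Phi_b @ W_true.T + 0.1 * rng.normal(size=(batch, k))
    W = proximal_update(Phi_b, Y_b, W, beta=10.0)
print(np.linalg.norm(W - W_true))
```

The two functions compute the same result; the second form makes the filter-like reading explicit, with a larger `beta` trusting the running estimate more and the current batch less.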
6. Practical Implementation and Computational Considerations
LL-GLM is attractive for large-scale deep learning due to its computationally favorable properties:
- The required matrix inversion is $O(d^3)$ per update/prediction (tractable for moderate feature dimension $d$), with further acceleration via Woodbury identities if the number of samples $n$ is smaller than $d$, or by maintaining running covariance statistics instead of explicit inversion (Galashov et al., 6 Oct 2025).
- For Bayesian predictive uncertainty, only one $d \times d$ or $n \times n$ matrix inversion is required per test input.
- For classification and non-Gaussian likelihoods, Laplace or variational approximations extend the method to general exponential-family outputs (Wang et al., 2024, Immer et al., 2020).
- Memory and compute costs are dominated by the last layer. Recommended practices include a constant feature for the bias, validation-based tuning of the regularization strength, and diagonal/K-FAC approximations in full Laplace setups (Galashov et al., 6 Oct 2025, Immer et al., 2020).
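The Woodbury-style speed-up mentioned in the list above can be illustrated with the push-through identity $(\Phi^\top\Phi + \lambda I_d)^{-1}\Phi^\top = \Phi^\top(\Phi\Phi^\top + \lambda I_n)^{-1}$, which replaces a $d \times d$ solve by an $n \times n$ one. The sketch assumes a batch with far fewer rows than features; the function names are hypothetical.

```python
import numpy as np

def ridge_primal(Phi, Y, lam):
    """Solve the d x d system (Phi^T Phi + lam I) W^T = Phi^T Y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ Y)

def ridge_dual(Phi, Y, lam):
    """Equivalent solution via the n x n system (Phi Phi^T + lam I)."""
    n = Phi.shape[0]
    return Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), Y)

rng = np.random.default_rng(2)
Phi = rng.normal(size=(32, 512))     # n = 32 samples, d = 512 features
Y = rng.normal(size=(32, 3))
W_primal = ridge_primal(Phi, Y, 1e-1)
W_dual = ridge_dual(Phi, Y, 1e-1)
print(np.max(np.abs(W_primal - W_dual)))
```

Both routines return the transposed last-layer weights with shape `(d, k)`; the dual form only factorizes a 32 x 32 matrix here, which is the regime where it pays off.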
Empirically, LL-GLM is 1–2 faster and uses 3–4 less memory than full-network approaches, with no statistically significant UQ degradation across modern datasets and architectures (MLPs, CNNs, ResNets, GPT-2) (Wilson et al., 29 May 2026).
7. Local Linearization and Interpretability in LLMs
LL-GLM can be applied to frozen transformer LLMs via gradient-hacked backward passes. By "detaching" nonlinear intermediates (e.g., LayerNorm statistics, gated activations, softmax matrices) in the backward computational graph, the network’s output logits become exactly linear with respect to the input embedding at a fixed context. The effective Jacobian $J(x)$ encapsulates this mapping: $$y = J(x)\,x,$$ where $x$ is the input embedding and $y$ is the network output. Singular value decomposition of $J(x)$ reveals that LLMs operate in low-dimensional subspaces, with dominant singular vectors corresponding to semantically salient directions in the input/output space. This enables SVD-based interpretability of token concepts and facilitates direct “steering” of outputs via concept-aligned interventions. The process reduces next-token inference to a single matrix-vector multiplication per context (Golden, 30 May 2025).
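The detaching idea can be shown on a toy block instead of a full transformer. The sketch below assumes a bias-free RMS normalization followed by a SiLU-gated layer, with the normalization statistic and the gate values treated as constants at the given input; the architecture, shapes, and function names are illustrative assumptions, not the cited paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, g, W1, W2, eps=1e-6):
    """Toy nonlinear block: RMS normalization, then a SiLU layer, no biases."""
    rms = np.sqrt(np.mean(x ** 2) + eps)
    h = g * x / rms
    u = W1 @ h
    return W2 @ (u * sigmoid(u))

def detached_jacobian(x, g, W1, W2, eps=1e-6):
    """Linear map J(x) obtained by freezing the norm statistic and the gate."""
    rms = np.sqrt(np.mean(x ** 2) + eps)       # treated as a constant
    h = g * x / rms
    gate = sigmoid(W1 @ h)                     # treated as a constant
    # J = W2 @ diag(gate) @ W1 @ diag(g / rms)
    return (W2 @ (gate[:, None] * W1)) * (g / rms)[None, :]

rng = np.random.default_rng(4)
x = rng.normal(size=12)                        # stand-in input embedding
g = rng.normal(size=12)                        # normalization gain
W1 = rng.normal(size=(20, 12))
W2 = rng.normal(size=(7, 20))

J = detached_jacobian(x, g, W1, W2)
U, S, Vt = np.linalg.svd(J)
print(np.max(np.abs(J @ x - forward(x, g, W1, W2))))
print(S[:3])
```

At the input where it was computed, `J @ x` reproduces the nonlinear forward pass up to floating-point error, and the singular vectors in `U` and `Vt` are the directions an SVD-based analysis would inspect. The matrix is only valid for that particular `x` and must be recomputed for a different input.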
8. Extensions and Future Perspectives
LL-GLM’s formalism accommodates a variety of extensions:
- Heteroscedastic and structured-output regression, with Bayesian last-layer inference over matrix-variate weights and non-Gaussian output noise (Wang et al., 2024).
- Non-Gaussian likelihoods (Bernoulli, softmax, Poisson), where generalized linearization enables variational or Laplace-based last-layer inference (Immer et al., 2020).
- Information-theoretic analysis of MMSE and free energy, with connections to the replica method and statistical physics (Manoel et al., 2017).
- Deep operator learning and two-stage instrumental variable regression, where the last layer is never updated by gradient steps and is instead solved in closed form at train and test time, obviating the need for repeated "last-layer retraining" (Galashov et al., 6 Oct 2025).
A well-supported recommendation is to standardize LL-GLM approaches for scalable, uncertainty-aware supervised deep learning in both research and applied contexts (Wilson et al., 29 May 2026).