
Mixture-of-GAMs: Interpretable Local Regression

Updated 23 December 2025
  • Mixture-of-GAMs framework is a regression model that combines local generalized additive models with kernel approximations and clustering for enhanced interpretability and prediction accuracy.
  • The method employs random Fourier features and PCA to approximate kernels and reduce dimensionality, enabling stable Gaussian mixture model clustering and local regime identification.
  • Empirical results show significant RMSE improvements over global GAMs on benchmarks like California Housing and NASA Airfoil, demonstrating practical gains in both accuracy and interpretability.

A mixture-of-GAMs framework integrates locally adaptive model structure and kernel-inspired expressivity into inherently interpretable regression. The central construction leverages random Fourier feature (RFF) based embeddings to approximate kernel methods, principal component analysis (PCA) for dimensionality reduction, and Gaussian mixture models (GMM) to discover latent local regimes within data. Within each identified regime, a generalized additive model (GAM) is trained, preserving transparency through univariate spline components. The final regression function combines these local GAMs via soft cluster weights, achieving competitive prediction accuracy and interpretability. This strategy addresses the long-standing challenge of reconciling black-box predictive power with explainable modeling (Huang et al., 22 Dec 2025).

1. Fundamental Structure of the Mixture-of-GAMs Framework

The mixture-of-GAMs estimator is designed for regression scenarios where local data heterogeneity is pronounced, yet model transparency is imperative. The method comprises the following sequence:

  • Random Fourier Features: Construct a mapping $\phi(x)$ from $\mathbb{R}^p$ to $\mathbb{R}^D$ whose inner products approximate a shift-invariant kernel $K(x, x') = \kappa(x - x')$. The RFF embedding is determined by sampling frequencies $\omega_1, \dots, \omega_D \sim \rho$ (the spectral density of $\kappa$) and phases $b_1, \dots, b_D \sim \mathrm{Uniform}[0, 2\pi]$, using

$$\phi(x) = \sqrt{\frac{2}{D}}\left[\cos(\omega_1 \cdot x + b_1), \dots, \cos(\omega_D \cdot x + b_D)\right]^\top$$

  • Principal Component Analysis: The RFF features are compressed to dimension $d \ll D$ to stabilize clustering and mitigate the curse of dimensionality. Given centered activations $S$, the principal directions $V_d$ yield low-dimensional representations $z_i = V_d^\top (\phi(x_i) - \bar{s})$.
  • Gaussian Mixture Model Clustering: The $z_i$ are clustered via a GMM, parameterized by weights $\pi_\ell$, means $\mu_\ell$, and covariances $\Sigma_\ell$, yielding soft responsibilities $\gamma_{i\ell} = p(\ell \mid z_i)$ (posterior probabilities).
  • Cluster-wise GAMs: For each cluster $\ell = 1, \dots, L$, fit

$$g_\ell(x) = \alpha^{(\ell)} + \sum_{j=1}^p g_{\ell,j}(x_j)$$

where each $g_{\ell,j}$ is a smooth univariate spline with roughness penalization.

  • Final Prediction: For an input $x$, compute its low-dimensional representation and cluster posteriors, and predict via

$$f(x) = \sum_{\ell=1}^L \gamma_\ell(x)\, g_\ell(x)$$

This framework achieves near-kernel regression accuracy with interpretable additive decomposition within each local regime (Huang et al., 22 Dec 2025).

2. Random Fourier Feature Embedding

Random Fourier features approximate continuous, shift-invariant, positive-definite kernels via explicit feature maps. For $K(x, x') = \kappa(x - x')$, Bochner’s theorem gives

$$\kappa(\delta) = \int_{\mathbb{R}^p} \rho(\omega)\, e^{i \omega \cdot \delta}\, d\omega,$$

where $\rho(\omega)$ is the spectral density of $\kappa$. Sampling $\omega_k \sim \rho$ and $b_k \sim \mathrm{Uniform}[0, 2\pi]$ for $k = 1, \dots, D$ yields a feature mapping $\phi(x)$, and the kernel is approximated as

$$\phi(x)^\top \phi(x') \approx K(x, x').$$

This enables scalable kernel ridge regression by operating in the $D$-dimensional feature space, avoiding the $O(N^2)$ cost of forming the full kernel matrix in standard kernel methods.

After fitting the RFF-ridge model $y_i \approx \beta^\top \phi(x_i)$, the activations $S \in \mathbb{R}^{N \times D}$ are extracted and passed to PCA for dimensionality reduction.
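The RFF stage can be sketched in a few lines of Python with NumPy and scikit-learn. This is a minimal illustration for an RBF kernel, whose spectral density is Gaussian; the bandwidth `sigma`, feature count `D`, ridge penalty, and the synthetic stand-in data are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge

def rff_map(X, D=500, sigma=1.0, rng=None):
    """Random Fourier features for the RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).

    For this kernel, Bochner's theorem gives a Gaussian spectral density, so frequencies
    are drawn from N(0, sigma^{-2} I) and phases from Uniform[0, 2*pi].
    """
    rng = np.random.default_rng(rng)
    p = X.shape[1]
    Omega = rng.normal(0.0, 1.0 / sigma, size=(p, D))  # frequencies omega_1, ..., omega_D
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)          # phases b_1, ..., b_D
    return np.sqrt(2.0 / D) * np.cos(X @ Omega + b), (Omega, b)

# Synthetic stand-in data for illustration; replace with a real design matrix and targets.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 8))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=1000)

# Fit the RFF-ridge model y_i ≈ beta^T phi(x_i); S holds the activations phi(x_i).
S, rff_params = rff_map(X_train, D=500, sigma=1.0, rng=0)
ridge = Ridge(alpha=1.0).fit(S, y_train)
```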

3. Dimensionality Reduction via PCA

Given the potentially high-dimensional RFF map, clustering directly in $D$ dimensions is unstable. PCA is performed on the centered activations $S - 1_N \bar{s}$ to compute the leading $d$ directions:

$$S - 1_N \bar{s} = \tilde{S} = U \Sigma V^\top, \qquad V_d \in \mathbb{R}^{D \times d}$$

The compressed representations $Z = \tilde{S} V_d$ minimize reconstruction error. Each $z_i$ encodes the location of $x_i$ in the compressed RFF latent space. This step is essential for robust downstream clustering.
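Continuing the sketch above, the centering and projection can be performed with scikit-learn’s PCA; the choice of $d = 10$ components is illustrative.

```python
from sklearn.decomposition import PCA

# Compress the N x D activation matrix S to d dimensions; PCA centers S internally,
# so Z is equivalent to (S - 1_N s_bar) V_d from the equation above.
pca = PCA(n_components=10)
Z = pca.fit_transform(S)  # shape (N, d)
```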

4. Gaussian Mixture Model Clustering and Cluster Assignment

A Gaussian mixture model with $L$ components is fit to $\{z_i\}$:

$$p(z; \Theta) = \sum_{\ell=1}^L \pi_\ell\, \mathcal{N}(z \mid \mu_\ell, \Sigma_\ell)$$

Model parameters $\Theta = \{\pi_\ell, \mu_\ell, \Sigma_\ell\}$ are estimated using the expectation-maximization (EM) algorithm. Soft cluster assignments (responsibilities) are given by

$$\gamma_{i\ell} = p(\ell \mid z_i) = \frac{\pi_\ell\, \mathcal{N}(z_i \mid \mu_\ell, \Sigma_\ell)}{\sum_{\ell'=1}^L \pi_{\ell'}\, \mathcal{N}(z_i \mid \mu_{\ell'}, \Sigma_{\ell'})}$$

These quantify the affinity of data point $x_i$ to each cluster, facilitating localized modeling in the next stage.
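In the same sketch, scikit-learn’s GaussianMixture runs EM and exposes the responsibilities directly; the number of components $L = 4$ is an illustrative choice.

```python
from sklearn.mixture import GaussianMixture

# Fit an L-component GMM with full covariances to the compressed representations Z.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0).fit(Z)
Gamma = gmm.predict_proba(Z)         # responsibilities gamma_{i,l}, shape (N, L)
hard_labels = Gamma.argmax(axis=1)   # hard assignments used when fitting cluster-wise GAMs
```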

5. Construction of the Mixture-of-GAMs Predictor

Each cluster $\ell$ specifies a GAM:

$$g_\ell(x) = \alpha^{(\ell)} + \sum_{j=1}^p g_{\ell,j}(x_j)$$

where $g_{\ell,j}$ is a univariate function represented in a spline basis:

$$g_{\ell,j}(t) = \sum_{q=1}^{Q_j} \theta^{(\ell)}_{j,q}\, \phi_{j,q}(t)$$

Smoothness is enforced by penalizing the integrated squared second derivative,

$$\int \big(g_{\ell,j}''(t)\big)^2\, dt \approx (\theta_j^{(\ell)})^\top \Omega_j\, \theta_j^{(\ell)}$$

with $\Omega_j$ a finite-difference penalty matrix. The final mixture output is

$$f(x) = \sum_{\ell=1}^L \gamma_\ell(x)\, g_\ell(x)$$

where $\gamma_\ell(x)$ is the soft cluster assignment for input $x$’s latent representation.
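A minimal sketch of the cluster-wise GAMs and the soft-weighted predictor, built from the pieces above. SplineTransformer plus Ridge serves as a simple additive-spline surrogate: the ridge penalty on spline coefficients stands in for the second-derivative penalty $\Omega_j$, and the helper names (`fit_cluster_gams`, `predict_mixture`) are illustrative rather than from the paper.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import Ridge

def fit_cluster_gams(X, y, hard_labels, L, n_knots=8, alpha=1.0):
    """Fit one additive spline model per cluster on its hard-assigned points.

    SplineTransformer expands each feature into a univariate B-spline basis, so a
    linear model on top is additive in the features; the ridge penalty is a crude
    substitute for the integrated-squared-second-derivative penalty.
    """
    gams = []
    for l in range(L):
        mask = hard_labels == l
        gam = make_pipeline(SplineTransformer(n_knots=n_knots, degree=3), Ridge(alpha=alpha))
        gam.fit(X[mask], y[mask])
        gams.append(gam)
    return gams

def predict_mixture(X_new, gams, rff_params, pca, gmm):
    """f(x) = sum_l gamma_l(x) * g_l(x), with gamma_l(x) from the RFF -> PCA -> GMM path."""
    Omega, b = rff_params
    phi = np.sqrt(2.0 / Omega.shape[1]) * np.cos(X_new @ Omega + b)  # RFF embedding of new points
    gamma = gmm.predict_proba(pca.transform(phi))                    # soft weights gamma_l(x)
    preds = np.column_stack([g.predict(X_new) for g in gams])        # per-cluster GAM outputs
    return np.sum(gamma * preds, axis=1)
```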

6. Training Objectives and Optimization Pipeline

The method is trained via a staged optimization pipeline rather than full joint training:

  1. RFF Ridge Regression: Minimize

$$L_{\rm RFF}(\beta) = \sum_{i=1}^N \big(\beta^\top \phi(x_i) - y_i\big)^2 + \lambda \|\beta\|^2$$

  2. GMM Fitting: Maximize the log-likelihood on $\{z_i\}$,

$$\ell_{\rm GMM}(\Theta) = \sum_{i=1}^N \ln p(z_i; \Theta)$$

  3. GAM Fitting: For each cluster $\ell$, assign training points via $\ell_i = \arg\max_\ell \gamma_{i\ell}$ and fit $g_\ell$ by minimizing

$$L_{\rm GAM}^{(\ell)}(\alpha^{(\ell)}, \theta^{(\ell)}) = \sum_{i:\, \mathrm{cluster}(i) = \ell} \Big(y_i - \alpha^{(\ell)} - \sum_j g_{\ell,j}(x_{ij})\Big)^2 + \lambda_{\rm smooth} \sum_j (\theta_j^{(\ell)})^\top \Omega_j\, \theta_j^{(\ell)}$$

  4. Final Prediction Formation: Combine cluster predictions via soft weights to yield $f(x)$. Optional iterative refinement (e.g., updating the RFF or GMM on residuals) is possible but not central to the primary study.

A summary pseudocode for the pipeline is provided in the primary reference (Huang et al., 22 Dec 2025).
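For concreteness, the staged pipeline can be assembled end to end from the illustrative helpers sketched earlier (`rff_map`, `fit_cluster_gams`, `predict_mixture`); the hyperparameters here are placeholder choices rather than the paper’s settings, so the resulting RMSE will not match the reported numbers.

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Stage 1: RFF embedding (the optional ridge fit on phi(x) is omitted for brevity).
S, rff_params = rff_map(X_tr, D=500, sigma=1.0, rng=0)
# Stage 2: PCA compression of the activations.
pca = PCA(n_components=10).fit(S)
Z = pca.transform(S)
# Stage 3: GMM clustering in the latent space.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0).fit(Z)
labels = gmm.predict_proba(Z).argmax(axis=1)
# Stage 4: cluster-wise GAMs, then soft-weighted prediction on held-out data.
gams = fit_cluster_gams(X_tr, y_tr, labels, L=4)
y_hat = predict_mixture(X_te, gams, rff_params, pca, gmm)
print("test RMSE:", np.sqrt(mean_squared_error(y_te, y_hat)))
```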

7. Empirical Performance and Applications

Performance was assessed on regression benchmarks:

  • California Housing (N ≈ 20,640, p = 8)
  • NASA Airfoil Self-Noise (N = 1,503, p = 5)
  • Bike Sharing (N ≈ 17,379, p ≈ 12)

Root-mean-squared error (RMSE) was the primary metric, with the following comparative results:

| Dataset | Global GAM RMSE | Mixture-of-GAMs RMSE | RFF RMSE | MLM RMSE | Notable Findings |
|---|---|---|---|---|---|
| California Housing | ≈ 0.567 | ≈ 0.501 | ≈ 0.44 | ≈ 0.57 | Mixture-of-GAMs outperforms all interpretable baselines |
| NASA Airfoil | ≈ 4.51 dB | ≈ 2.22 dB | ≈ 1.08 dB | - | Substantial (>2×) error reduction over global GAM |
| Bike Sharing | ≈ 88.8 | ≈ 58.2 | - | ≈ 60.9 | Comparable to mixture-of-linear-models (MLM-cell) |

These results demonstrate that the RFF-driven mixture-of-GAMs framework identifies meaningful local regimes in data and achieves substantially improved prediction accuracy over classical additive models, while remaining interpretable (Huang et al., 22 Dec 2025). The construction is applicable to real-world regression problems requiring both predictive strength and local interpretability.
