
Mixture-of-GAMs: Interpretable Local Regression

Updated 23 December 2025
  • Mixture-of-GAMs framework is a regression model that combines local generalized additive models with kernel approximations and clustering for enhanced interpretability and prediction accuracy.
  • The method employs random Fourier features and PCA to approximate kernels and reduce dimensionality, enabling stable Gaussian mixture model clustering and local regime identification.
  • Empirical results show significant RMSE improvements over global GAMs on benchmarks like California Housing and NASA Airfoil, demonstrating practical gains in both accuracy and interpretability.

A mixture-of-GAMs framework integrates locally adaptive model structure and kernel-inspired expressivity into inherently interpretable regression. The central construction leverages random Fourier feature (RFF) based embeddings to approximate kernel methods, principal component analysis (PCA) for dimensionality reduction, and Gaussian mixture models (GMM) to discover latent local regimes within data. Within each identified regime, a generalized additive model (GAM) is trained, preserving transparency through univariate spline components. The final regression function combines these local GAMs via soft cluster weights, achieving competitive prediction accuracy and interpretability. This strategy addresses the long-standing challenge of reconciling black-box predictive power with explainable modeling (Huang et al., 22 Dec 2025).

1. Fundamental Structure of the Mixture-of-GAMs Framework

The mixture-of-GAMs estimator is designed for regression scenarios where local data heterogeneity is pronounced, yet model transparency is imperative. The method comprises the following sequence:

  • Random Fourier Features: Construct a mapping $\phi(x)$ from $\mathbb{R}^p$ to $\mathbb{R}^D$ whose inner products approximate a shift-invariant kernel $K(x, x') = \kappa(x - x')$. The RFF embedding is determined by sampling frequencies $\omega_1, \dots, \omega_D \sim \rho$ (the spectral density of $\kappa$) and phases $b_1, \dots, b_D \sim \mathrm{Uniform}[0, 2\pi]$, using

$$\phi(x) = \sqrt{\frac{2}{D}}\left[\cos(\omega_1 \cdot x + b_1), \dots, \cos(\omega_D \cdot x + b_D)\right]^\top$$

  • Principal Component Analysis: The RFF features are compressed to dimension $d \ll D$ to stabilize clustering and mitigate the curse of dimensionality. Given centered activations $S$, the principal directions $V_d$ yield low-dimensional representations $z_i = V_d^\top (\phi(x_i) - \bar{s})$.
  • Gaussian Mixture Model Clustering: The $z_i$ are clustered via a GMM, parameterized by weights $\pi_\ell$, means $\mu_\ell$, and covariances $\Sigma_\ell$, yielding soft responsibilities $\gamma_{i\ell} = p(\ell \mid z_i)$ (posterior probabilities).
  • Cluster-wise GAMs: For each cluster $\ell = 1, \dots, L$, fit

$$g_\ell(x) = \alpha^{(\ell)} + \sum_{j=1}^p g_{\ell,j}(x_j)$$

where each $g_{\ell,j}$ is a smooth univariate spline with roughness penalization.

  • Final Prediction: For an input $x$, compute its low-dimensional representation and cluster posteriors, and predict via

$$f(x) = \sum_{\ell=1}^L \gamma_\ell(x)\, g_\ell(x)$$

This framework achieves near-kernel regression accuracy with interpretable additive decomposition within each local regime (Huang et al., 22 Dec 2025).

2. Random Fourier Feature Embedding

Random Fourier features approximate continuous, shift-invariant, positive-definite kernels via explicit feature maps. For $K(x, x') = \kappa(x - x')$, Bochner’s theorem gives

$$\kappa(\delta) = \int_{\mathbb{R}^p} \rho(\omega)\, e^{i \omega \cdot \delta}\, d\omega,$$

where $\rho(\omega)$ is the spectral density of $\kappa$. Sampling $\omega_k \sim \rho$ and $b_k \sim \mathrm{Uniform}[0, 2\pi]$ for $k = 1, \dots, D$ yields a feature mapping $\phi(x)$, and the kernel is approximated as

$$\phi(x)^\top \phi(x') \approx K(x, x').$$

This enables scalable kernel ridge regression by operating in the $D$-dimensional feature space, avoiding the $O(N^2)$ cost of forming the full kernel matrix in standard kernel methods.

After fitting the RFF-ridge model $y_i \approx \beta^\top \phi(x_i)$, the activations $S \in \mathbb{R}^{N \times D}$ are extracted and passed to PCA for dimensionality reduction.
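The RFF stage can be sketched in a few lines of Python with NumPy and scikit-learn. This is a minimal illustration for an RBF kernel, whose spectral density is Gaussian; the bandwidth `sigma`, feature count `D`, ridge penalty, and the synthetic stand-in data are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge

def rff_map(X, D=500, sigma=1.0, rng=None):
    """Random Fourier features for the RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)).

    For this kernel, Bochner's theorem gives a Gaussian spectral density, so frequencies
    are drawn from N(0, sigma^{-2} I) and phases from Uniform[0, 2*pi].
    """
    rng = np.random.default_rng(rng)
    p = X.shape[1]
    Omega = rng.normal(0.0, 1.0 / sigma, size=(p, D))  # frequencies omega_1, ..., omega_D
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)          # phases b_1, ..., b_D
    return np.sqrt(2.0 / D) * np.cos(X @ Omega + b), (Omega, b)

# Synthetic stand-in data for illustration; replace with a real design matrix and targets.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 8))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=1000)

# Fit the RFF-ridge model y_i ≈ beta^T phi(x_i); S holds the activations phi(x_i).
S, rff_params = rff_map(X_train, D=500, sigma=1.0, rng=0)
ridge = Ridge(alpha=1.0).fit(S, y_train)
```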

3. Dimensionality Reduction via PCA

Given the potentially high-dimensional RFF map, clustering directly in $D$ dimensions is unstable. PCA is performed on the centered activations $S - 1_N \bar{s}$ to compute the leading $d$ directions:

$$S - 1_N \bar{s} = \tilde{S} = U \Sigma V^\top, \qquad V_d \in \mathbb{R}^{D \times d}$$

The compressed representations $Z = \tilde{S} V_d$ minimize reconstruction error. Each $z_i$ encodes the location of $x_i$ in the compressed RFF latent space. This step is essential for robust downstream clustering.
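Continuing the sketch above, the centering and projection can be performed with scikit-learn’s PCA; the choice of $d = 10$ components is illustrative.

```python
from sklearn.decomposition import PCA

# Compress the N x D activation matrix S to d dimensions; PCA centers S internally,
# so Z is equivalent to (S - 1_N s_bar) V_d from the equation above.
pca = PCA(n_components=10)
Z = pca.fit_transform(S)  # shape (N, d)
```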

4. Gaussian Mixture Model Clustering and Cluster Assignment

A Gaussian mixture model with $L$ components is fit to $\{z_i\}$:

$$p(z; \Theta) = \sum_{\ell=1}^L \pi_\ell\, \mathcal{N}(z \mid \mu_\ell, \Sigma_\ell)$$

Model parameters $\Theta = \{\pi_\ell, \mu_\ell, \Sigma_\ell\}$ are estimated using the expectation-maximization (EM) algorithm. Soft cluster assignments (responsibilities) are given by

$$\gamma_{i\ell} = p(\ell \mid z_i) = \frac{\pi_\ell\, \mathcal{N}(z_i \mid \mu_\ell, \Sigma_\ell)}{\sum_{\ell'=1}^L \pi_{\ell'}\, \mathcal{N}(z_i \mid \mu_{\ell'}, \Sigma_{\ell'})}$$

These quantify the affinity of data point $x_i$ to each cluster, facilitating localized modeling in the next stage.
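In the same sketch, scikit-learn’s GaussianMixture runs EM and exposes the responsibilities directly; the number of components $L = 4$ is an illustrative choice.

```python
from sklearn.mixture import GaussianMixture

# Fit an L-component GMM with full covariances to the compressed representations Z.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0).fit(Z)
Gamma = gmm.predict_proba(Z)         # responsibilities gamma_{i,l}, shape (N, L)
hard_labels = Gamma.argmax(axis=1)   # hard assignments used when fitting cluster-wise GAMs
```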

5. Construction of the Mixture-of-GAMs Predictor

Each cluster $\ell$ specifies a GAM:

$$g_\ell(x) = \alpha^{(\ell)} + \sum_{j=1}^p g_{\ell,j}(x_j)$$

where $g_{\ell,j}$ is a univariate function represented in a spline basis:

$$g_{\ell,j}(t) = \sum_{q=1}^{Q_j} \theta^{(\ell)}_{j,q}\, \phi_{j,q}(t)$$

Smoothness is enforced by penalizing the integrated squared second derivative,

$$\int \big(g_{\ell,j}''(t)\big)^2\, dt \approx (\theta_j^{(\ell)})^\top \Omega_j\, \theta_j^{(\ell)}$$

with $\Omega_j$ a finite-difference penalty matrix. The final mixture output is

$$f(x) = \sum_{\ell=1}^L \gamma_\ell(x)\, g_\ell(x)$$

where $\gamma_\ell(x)$ is the soft cluster assignment for input $x$’s latent representation.
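A minimal sketch of the cluster-wise GAMs and the soft-weighted predictor, built from the pieces above. SplineTransformer plus Ridge serves as a simple additive-spline surrogate: the ridge penalty on spline coefficients stands in for the second-derivative penalty $\Omega_j$, and the helper names (`fit_cluster_gams`, `predict_mixture`) are illustrative rather than from the paper.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer
from sklearn.linear_model import Ridge

def fit_cluster_gams(X, y, hard_labels, L, n_knots=8, alpha=1.0):
    """Fit one additive spline model per cluster on its hard-assigned points.

    SplineTransformer expands each feature into a univariate B-spline basis, so a
    linear model on top is additive in the features; the ridge penalty is a crude
    substitute for the integrated-squared-second-derivative penalty.
    """
    gams = []
    for l in range(L):
        mask = hard_labels == l
        gam = make_pipeline(SplineTransformer(n_knots=n_knots, degree=3), Ridge(alpha=alpha))
        gam.fit(X[mask], y[mask])
        gams.append(gam)
    return gams

def predict_mixture(X_new, gams, rff_params, pca, gmm):
    """f(x) = sum_l gamma_l(x) * g_l(x), with gamma_l(x) from the RFF -> PCA -> GMM path."""
    Omega, b = rff_params
    phi = np.sqrt(2.0 / Omega.shape[1]) * np.cos(X_new @ Omega + b)  # RFF embedding of new points
    gamma = gmm.predict_proba(pca.transform(phi))                    # soft weights gamma_l(x)
    preds = np.column_stack([g.predict(X_new) for g in gams])        # per-cluster GAM outputs
    return np.sum(gamma * preds, axis=1)
```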

6. Training Objectives and Optimization Pipeline

The method is trained via a staged optimization pipeline rather than full joint training:

  1. RFF Ridge Regression: Minimize

$$L_{\rm RFF}(\beta) = \sum_{i=1}^N \big(\beta^\top \phi(x_i) - y_i\big)^2 + \lambda \|\beta\|^2$$

  2. GMM Fitting: Maximize the log-likelihood on $\{z_i\}$,

$$\ell_{\rm GMM}(\Theta) = \sum_{i=1}^N \ln p(z_i; \Theta)$$

  3. GAM Fitting: For each cluster $\ell$, assign training points via $\ell_i = \arg\max_\ell \gamma_{i\ell}$ and fit $g_\ell$ by minimizing

$$L_{\rm GAM}^{(\ell)}(\alpha^{(\ell)}, \theta^{(\ell)}) = \sum_{i:\, \mathrm{cluster}(i) = \ell} \Big(y_i - \alpha^{(\ell)} - \sum_j g_{\ell,j}(x_{ij})\Big)^2 + \lambda_{\rm smooth} \sum_j (\theta_j^{(\ell)})^\top \Omega_j\, \theta_j^{(\ell)}$$

  4. Final Prediction Formation: Combine cluster predictions via soft weights to yield $f(x)$. Optional iterative refinement (e.g., updating the RFF or GMM on residuals) is possible but not central to the primary study.

A summary pseudocode for the pipeline is provided in the primary reference (Huang et al., 22 Dec 2025).
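For concreteness, the staged pipeline can be assembled end to end from the illustrative helpers sketched earlier (`rff_map`, `fit_cluster_gams`, `predict_mixture`); the hyperparameters here are placeholder choices rather than the paper’s settings, so the resulting RMSE will not match the reported numbers.

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Stage 1: RFF embedding (the optional ridge fit on phi(x) is omitted for brevity).
S, rff_params = rff_map(X_tr, D=500, sigma=1.0, rng=0)
# Stage 2: PCA compression of the activations.
pca = PCA(n_components=10).fit(S)
Z = pca.transform(S)
# Stage 3: GMM clustering in the latent space.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0).fit(Z)
labels = gmm.predict_proba(Z).argmax(axis=1)
# Stage 4: cluster-wise GAMs, then soft-weighted prediction on held-out data.
gams = fit_cluster_gams(X_tr, y_tr, labels, L=4)
y_hat = predict_mixture(X_te, gams, rff_params, pca, gmm)
print("test RMSE:", np.sqrt(mean_squared_error(y_te, y_hat)))
```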

7. Empirical Performance and Applications

Performance was assessed on regression benchmarks:

  • California Housing (N ≈ 20,640, p = 8)
  • NASA Airfoil Self-Noise (N = 1,503, p = 5)
  • Bike Sharing (N ≈ 17,379, p ≈ 12)

Root-mean-squared error (RMSE) was the primary metric, with the following comparative results:

| Dataset | Global GAM RMSE | Mixture-of-GAMs RMSE | RFF RMSE | MLM RMSE | Notable Findings |
|---|---|---|---|---|---|
| California Housing | ≈ 0.567 | ≈ 0.501 | ≈ 0.44 | ≈ 0.57 | Mixture-of-GAMs outperforms all interpretable baselines |
| NASA Airfoil | ≈ 4.51 dB | ≈ 2.22 dB | ≈ 1.08 dB | - | Substantial (>2×) error reduction over global GAM |
| Bike Sharing | ≈ 88.8 | ≈ 58.2 | - | ≈ 60.9 | Comparable to mixture-of-linear-models (MLM-cell) |

These results demonstrate that the RFF-driven mixture-of-GAMs framework identifies meaningful local regimes in data and achieves substantially improved prediction accuracy over classical additive models, while remaining interpretable (Huang et al., 22 Dec 2025). The construction is applicable to real-world regression problems requiring both predictive strength and local interpretability.
