SPAM: Scalable Polynomial Additive Models

Updated 22 April 2026
  • Scalable Polynomial Additive Models (SPAM) are interpretable models that use symmetric low-rank tensor decompositions to efficiently capture high-order feature interactions.
  • The framework employs mini-batch optimization, input rescaling, and basis dropout to ensure scalable parameter learning and robust regularization.
  • Empirical benchmarks demonstrate that SPAM achieves state-of-the-art performance while providing clear feature attributions that traditional black-box models lack.

Scalable Polynomial Additive Models (SPAM) represent a fundamental advance in interpretable machine learning, addressing both the expressive limitations of traditional Generalized Additive Models (GAMs) and the scalability challenges posed by high-dimensional data and higher-order interactions. SPAMs employ low-rank tensor polynomial structures to model all orders of feature interactions with a parameter count and computational demand that scale linearly in the ambient feature dimension, thus matching or exceeding the performance of state-of-the-art black-box models while retaining inherent interpretability (Dubey et al., 2022).

1. Formal Definition and Model Structure

A Scalable Polynomial Additive Model seeks to approximate an unknown target function over $d$-dimensional input features $x = (x_1,\dots,x_d) \in \mathbb{R}^d$ as a polynomial of degree $k$:

$$P(x) = b + \sum_{i=1}^d w_i^{(1)} x_i + \sum_{i<j} w_{ij}^{(2)} x_i x_j + \cdots + \sum_{i_1<\dots<i_k} w_{i_1,\dots,i_k}^{(k)} x_{i_1}\cdots x_{i_k},$$

where each order-$l$ term captures all $l$-way feature interactions (Dubey et al., 2022).
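For concreteness, here is a minimal NumPy sketch of evaluating this polynomial naively at order $k = 2$; the function and variable names are illustrative, not from the reference implementation:

```python
import numpy as np

def naive_poly2(x, b, w1, W2):
    """Naive order-2 polynomial: P(x) = b + <w1, x> + sum_{i<j} W2[i, j] x_i x_j.

    x:  (d,) input features
    b:  scalar bias
    w1: (d,) first-order weights
    W2: (d, d) second-order weights (only the strict upper triangle is used)
    """
    iu, ju = np.triu_indices(x.shape[0], k=1)   # all index pairs i < j
    return b + w1 @ x + np.sum(W2[iu, ju] * x[iu] * x[ju])

rng = np.random.default_rng(0)
d = 5
x, w1, W2 = rng.normal(size=d), rng.normal(size=d), rng.normal(size=(d, d))
print(naive_poly2(x, b=0.1, w1=w1, W2=W2))      # already O(d^2) work at order 2
```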

A naïve expansion of this polynomial faces a combinatorial explosion: the number of parameters at order $l$ is $\binom{d}{l}$ (for instance, $d = 10^4$ and $l = 3$ already yield over $10^{11}$ terms), rendering direct implementation infeasible for large $d$ or $k$. SPAM circumvents this with two observations:

  • Each order-$l$ weight tensor $W^{(l)}$ (with $l$ factors) can be assumed symmetric without loss of generality.
  • Any symmetric $l$-th order tensor admits a symmetric low-rank CP decomposition:

$$W^{(l)} = \sum_{r=1}^{R_l} \lambda_{l,r}\, u_{l,r}^{\otimes l}.$$

Plugging these decompositions into the polynomial yields

$$P(x) = b + \sum_{l=1}^{k} \sum_{r=1}^{R_l} \lambda_{l,r}\, \langle u_{l,r}, x \rangle^{l},$$

with total parameter count $O\big(d \sum_l R_l\big)$. Each rank $R_l$ is typically much smaller than $d$, ensuring scalability (Dubey et al., 2022).
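A minimal NumPy sketch of this low-rank forward pass (names and shapes are illustrative assumptions):

```python
import numpy as np

def spam_forward(x, b, lams, Us):
    """Low-rank SPAM: P(x) = b + sum_l sum_r lams[l][r] * <Us[l][r], x> ** l.

    lams: list over orders l = 1..k of (R_l,) coefficient vectors
    Us:   list over orders l = 1..k of (R_l, d) basis matrices
    Cost is O(d * sum_l R_l) per sample, never materializing binom(d, l) terms.
    """
    out = b
    for order, (lam, U) in enumerate(zip(lams, Us), start=1):
        out += np.sum(lam * (U @ x) ** order)   # <u_{l,r}, x> raised to order l
    return out

rng = np.random.default_rng(0)
d, ranks = 8, [8, 4, 2]                         # orders 1..3 with ranks R_l
lams = [rng.normal(size=r) for r in ranks]
Us = [rng.normal(size=(r, d)) for r in ranks]
print(spam_forward(rng.normal(size=d), 0.0, lams, Us))
```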

2. Optimization and Algorithmic Implementation

SPAM parameters $\theta = \{b, \lambda_{l,r}, u_{l,r}\}$ are learned through empirical risk minimization:

$$\min_\theta \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(P(x_i), y_i\big) + \Omega(\theta),$$

where $\ell$ is the loss (e.g., cross-entropy for classification, squared loss for regression), and $\Omega$ encodes $\ell_1$ or $\ell_2$ penalties for regularization (Dubey et al., 2022).

The entire parameterization supports mini-batch SGD or AdamW with full GPU acceleration. Key implementation innovations include:

  • Input rescaling for stabilization: features entering the order-$l$ terms are rescaled so that term magnitudes remain comparable across interaction orders.
  • “Basis dropout” on rank-1 components: randomly zeroing out basis terms $u_{l,r}$ during training to regularize and improve robustness (see the sketch after this list).
  • For multi-class extensions, basis vectors $u_{l,r}$ can be shared across classes, with only the coefficients $\lambda_{l,r}$ kept class-specific, so the parameter count for $C$ classes grows as $O\big(d \sum_l R_l + C \sum_l R_l\big)$ rather than $O\big(C\, d \sum_l R_l\big)$.
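A hedged PyTorch sketch of an order-2 SPAM with basis dropout and AdamW training follows; the class name, rank, dropout rate, and initialization are illustrative assumptions rather than the authors' reference implementation:

```python
import torch
import torch.nn as nn

class SPAMOrder2(nn.Module):
    """Order-2 SPAM: P(x) = b + <w1, x> + sum_r lam_r <u_r, x>^2 (sketch)."""

    def __init__(self, d, rank, basis_dropout=0.2):
        super().__init__()
        self.b = nn.Parameter(torch.zeros(1))
        self.w1 = nn.Parameter(torch.zeros(d))
        self.lam = nn.Parameter(0.01 * torch.randn(rank))
        self.U = nn.Parameter(0.01 * torch.randn(rank, d))  # rank-1 bases u_r
        self.basis_dropout = basis_dropout

    def forward(self, x):                        # x: (batch, d)
        proj = x @ self.U.T                      # (batch, rank): <u_r, x>
        terms = self.lam * proj.pow(2)           # per-rank order-2 terms
        if self.training and self.basis_dropout > 0:
            # Basis dropout: zero out whole rank-1 components at random.
            keep = (torch.rand(terms.shape[-1], device=x.device)
                    > self.basis_dropout).float()
            terms = terms * keep / (1.0 - self.basis_dropout)
        return self.b + x @ self.w1 + terms.sum(dim=-1)

model = SPAMOrder2(d=100, rank=16)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
x, y = torch.randn(32, 100), torch.randn(32)
loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad(); loss.backward(); opt.step()
```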

The computational cost of evaluation and training scales as $O\big(d \sum_l R_l\big)$ per sample, avoiding the $O(d^k)$ regime of naïve polynomial expansions.

3. Theoretical Risk Guarantees and Expressivity

SPAM provably matches the risk convergence rates of full-rank polynomials under mild “spectral-decay” assumptions (i.e., rapid decay of the singular values $\lambda_{l,r}$ of the true polynomial). Specifically, for $\ell_2$-regularization and 1-Lipschitz losses, the excess risk decomposes into a standard $O(1/\sqrt{n})$ estimation term plus a low-rank approximation remainder controlled by the truncated tail of the spectrum; the remainder vanishes as the ranks $R_l$ increase. This ensures that statistical efficiency is not sacrificed for scalability or interpretability (Dubey et al., 2022).

4. Interpretability and Feature Attribution

SPAMs remain inherently interpretable despite modeling arbitrary higher-order feature interactions. Each polynomial term $\lambda_{l,r} \langle u_{l,r}, x \rangle^{l}$ can be directly inspected, and the model’s output decomposes additively across these terms. In practice, feature importances and attributions can be computed in closed form or by differentiation, providing clarity on how individual features and interaction groups influence predictions (Dubey et al., 2022); a small sketch follows.
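For instance, the gradient of the order-2 part has the closed form $\partial_x \sum_r \lambda_r \langle u_r, x \rangle^2 = \sum_r 2 \lambda_r \langle u_r, x \rangle u_r$, which yields per-feature attributions directly; the snippet below is an illustrative sketch, not the paper's attribution code:

```python
import numpy as np

def order2_attributions(x, lams, U):
    """Closed-form gradient attribution for the order-2 SPAM part.

    d/dx_i of sum_r lam_r <u_r, x>^2  =  sum_r 2 lam_r <u_r, x> U[r, i]
    Returns a (d,) vector of per-feature contributions.
    """
    proj = U @ x                        # (R,) inner products <u_r, x>
    return (2.0 * lams * proj) @ U      # (d,) gradient, one pass over the bases

rng = np.random.default_rng(1)
d, R = 6, 3
x, lams, U = rng.normal(size=d), rng.normal(size=R), rng.normal(size=(R, d))
attr = order2_attributions(x, lams, U)
print(np.argsort(-np.abs(attr)))        # feature indices ranked by saliency
```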

A human-subject evaluation (a prediction task on Amazon Mechanical Turk) empirically confirmed improved interpretability over linear explanations (0.67 mean user accuracy) and LIME explanations (0.65), with SPAM achieving 0.71 (statistically significant under a one-sided test) when users were shown the most salient features (Dubey et al., 2022).

5. Benchmark Performance and Empirical Scalability

On a wide range of tasks—including regression, binary and multi-class classification, and object detection—SPAM matches or outperforms prior interpretable models (linear, EBMs, NAMs) and aligns closely with deep neural networks and boosted ensembles (e.g., XGBoost). For example:

  • SPAM-Linear (order 2) surpasses all interpretable baselines and matches XGBoost on tasks such as News20 and object detection.
  • SPAM-Neural (order 2) matches or exceeds XGBoost for regression and binary classification.
  • SPAM (order 3) closes the remaining gap to MLPs on tasks like CoverType (Dubey et al., 2022).

Throughput measurements demonstrate that SPAM sustains tens of millions of examples/sec for datasets with up to 146,000 features, outperforming MLPs and exceeding Neural Additive Models (NAMs) by orders of magnitude in speed (Dubey et al., 2022).

6. Extensions: Tensorization and Higher-Order Data

The concept of scalable polynomial additive modeling generalizes to tensor-valued inputs, as formalized in the Tensor Polynomial Additive Model (TPAM). Instead of flattening tensors and losing structural information, TPAMs employ hierarchical low-order symmetric tensor decompositions—first representing high-order parameter tensors via a sum of symmetric rank-1 tensors, then factorizing each core via a CP decomposition across modes (Chen et al., 2024):

  • T-TPAM maintains the full tensor basis, with a parameter count that scales with the sum of the input mode dimensions at each interaction order rather than with their product.
  • V-TPAM further reduces complexity by using vectorized CP decompositions (see the sketch after this list).
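To make the compression concrete, the sketch below counts parameters for a rank-$R$ CP factorization of a dense weight tensor and reconstructs the tensor from its mode factors; it illustrates the general CP principle under assumed shapes, not the exact TPAM parameterization:

```python
import numpy as np

def cp_param_count(dims, rank):
    """Dense vs. rank-R CP parameter counts for a tensor of shape `dims`."""
    dense = int(np.prod(dims))              # product of mode dimensions
    cp = rank * int(np.sum(dims))           # one (d_i, R) factor per mode
    return dense, cp

def cp_reconstruct(factors):
    """Rebuild a tensor from CP factors [(d_1, R), ..., (d_m, R)]."""
    rank = factors[0].shape[1]
    out = np.zeros([f.shape[0] for f in factors])
    for r in range(rank):
        comp = factors[0][:, r]
        for f in factors[1:]:
            comp = np.multiply.outer(comp, f[:, r])  # rank-1 outer product
        out += comp
    return out

dense, cp = cp_param_count(dims=(32, 32, 3), rank=8)
print(f"dense: {dense} params, CP: {cp} params")     # 3072 vs 536
```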

This approach delivers substantial parameter compression and enables application to image and multi-way data, with empirical improvements in both predictive accuracy and localization metrics (PI-CAM yields >80% saliency overlap with true object boxes) relative to traditional SPAMs and state-of-the-art CNN attribution methods (Chen et al., 2024).

7. Limitations, Open Problems, and Future Directions

While SPAMs achieve compelling empirical and theoretical properties, certain limitations persist:

  • The worst-case tensor rank required for an exact decomposition may still be very large, though empirical spectral decay mitigates this in practical datasets (Dubey et al., 2022).
  • SPAM models yield dense basis vectors $u_{l,r}$, complicating direct interpretation in high dimensions; group-sparsity regularizers may address this at the cost of certain fidelity–sparsity trade-offs (a sketch of one such penalty appears after this list).
  • Current SPAM and TPAM frameworks handle real-valued data; applications to discrete, structured, or graph domains remain active topics.
  • Extensions beyond interpretability use cases include incorporating SPAM polynomial layers into LLMs, privacy-aware architectures, and other domains demanding high expressivity with transparent attribution.
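One plausible instantiation of such a group-sparsity regularizer is an $\ell_{2,1}$ penalty that groups each feature's coefficients across all rank-1 bases, encouraging entire features to drop out of the model; this is an assumed sketch, not a method specified in either paper:

```python
import torch

def group_sparsity_penalty(U, weight=1e-3):
    """l_{2,1} penalty over features: weight * sum_i ||U[:, i]||_2.

    U: (rank, d) basis matrix. Driving a whole column to zero removes
    feature i from every rank-1 component, sparsifying attributions.
    """
    return weight * U.norm(dim=0).sum()

U = torch.randn(16, 100, requires_grad=True)
penalty = group_sparsity_penalty(U)     # added to the training loss in practice
penalty.backward()
print(penalty.item())
```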

Further research continues on architectural innovations (e.g., structured sparsity, tensor networks), hardware optimization (distributed solvers), and expanded self-explanatory mechanisms (e.g., new activation mapping strategies for deep tensorized features) (Dubey et al., 2022; Chen et al., 2024).

References

1. Dubey, A., Radenovic, F., & Mahajan, D. (2022). Scalable Interpretability via Polynomials. NeurIPS 2022.
2. Chen et al. (2024). Tensor Polynomial Additive Model (TPAM).
