SPAM: Scalable Polynomial Additive Models

Updated 22 April 2026
  • Scalable Polynomial Additive Models (SPAM) are interpretable models that use symmetric low-rank tensor decompositions to efficiently capture high-order feature interactions.
  • The framework employs mini-batch optimization, input rescaling, and basis dropout to ensure scalable parameter learning and robust regularization.
  • Empirical benchmarks demonstrate that SPAM achieves state-of-the-art performance while providing clear feature attributions that traditional black-box models lack.

Scalable Polynomial Additive Models (SPAM) represent a fundamental advance in interpretable machine learning, addressing both the expressive limitations of traditional Generalized Additive Models (GAMs) and the scalability challenges posed by high-dimensional data and higher-order interactions. SPAMs employ low-rank tensor polynomial structures to model all orders of feature interactions with a parameter count and computational demand that scale linearly in the ambient feature dimension, thus matching or exceeding the performance of state-of-the-art black-box models while retaining inherent interpretability (Dubey et al., 2022).

1. Formal Definition and Model Structure

A Scalable Polynomial Additive Model seeks to approximate an unknown target function over $d$-dimensional input features $x = (x_1,\dots,x_d) \in \mathbb{R}^d$ as a polynomial of degree $k$:

$$P(x) = b + \sum_{i=1}^d w_i^{(1)} x_i + \sum_{i<j} w_{ij}^{(2)} x_i x_j + \cdots + \sum_{i_1<\dots<i_k} w_{i_1,\dots,i_k}^{(k)} x_{i_1}\cdots x_{i_k},$$

where each order-$l$ term captures all $l$-way feature interactions (Dubey et al., 2022).
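For concreteness, here is a minimal NumPy sketch of evaluating this polynomial naively at order $k = 2$; the function and variable names are illustrative, not from the reference implementation:

```python
import numpy as np

def naive_poly2(x, b, w1, W2):
    """Naive order-2 polynomial: P(x) = b + <w1, x> + sum_{i<j} W2[i, j] x_i x_j.

    x:  (d,) input features
    b:  scalar bias
    w1: (d,) first-order weights
    W2: (d, d) second-order weights (only the strict upper triangle is used)
    """
    iu, ju = np.triu_indices(x.shape[0], k=1)   # all index pairs i < j
    return b + w1 @ x + np.sum(W2[iu, ju] * x[iu] * x[ju])

rng = np.random.default_rng(0)
d = 5
x, w1, W2 = rng.normal(size=d), rng.normal(size=d), rng.normal(size=(d, d))
print(naive_poly2(x, b=0.1, w1=w1, W2=W2))      # already O(d^2) work at order 2
```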

A naïve expansion of this polynomial faces a combinatorial explosion: the number of parameters at order $l$ is $\binom{d}{l}$ (for instance, $d = 10^4$ and $l = 3$ already yield over $10^{11}$ terms), rendering direct implementation infeasible for large $d$ or $k$. SPAM circumvents this with two observations:

  • Each order-$l$ weight tensor $W^{(l)}$ (with $l$ factors) can be assumed symmetric without loss of generality.
  • Any symmetric $l$-th order tensor admits a symmetric low-rank CP decomposition:

$$W^{(l)} = \sum_{r=1}^{R_l} \lambda_{l,r}\, u_{l,r}^{\otimes l}.$$

Plugging these decompositions into the polynomial yields

$$P(x) = b + \sum_{l=1}^{k} \sum_{r=1}^{R_l} \lambda_{l,r}\, \langle u_{l,r}, x \rangle^{l},$$

with total parameter count $O\big(d \sum_l R_l\big)$. Each rank $R_l$ is typically much smaller than $d$, ensuring scalability (Dubey et al., 2022).
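A minimal NumPy sketch of this low-rank forward pass (names and shapes are illustrative assumptions):

```python
import numpy as np

def spam_forward(x, b, lams, Us):
    """Low-rank SPAM: P(x) = b + sum_l sum_r lams[l][r] * <Us[l][r], x> ** l.

    lams: list over orders l = 1..k of (R_l,) coefficient vectors
    Us:   list over orders l = 1..k of (R_l, d) basis matrices
    Cost is O(d * sum_l R_l) per sample, never materializing binom(d, l) terms.
    """
    out = b
    for order, (lam, U) in enumerate(zip(lams, Us), start=1):
        out += np.sum(lam * (U @ x) ** order)   # <u_{l,r}, x> raised to order l
    return out

rng = np.random.default_rng(0)
d, ranks = 8, [8, 4, 2]                         # orders 1..3 with ranks R_l
lams = [rng.normal(size=r) for r in ranks]
Us = [rng.normal(size=(r, d)) for r in ranks]
print(spam_forward(rng.normal(size=d), 0.0, lams, Us))
```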

2. Optimization and Algorithmic Implementation

SPAM parameters $\theta = \{b, \lambda_{l,r}, u_{l,r}\}$ are learned through empirical risk minimization:

$$\min_\theta \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(P(x_i), y_i\big) + \Omega(\theta),$$

where $\ell$ is the loss (e.g., cross-entropy for classification, squared loss for regression), and $\Omega$ encodes $\ell_1$ or $\ell_2$ penalties for regularization (Dubey et al., 2022).

The entire parameterization supports mini-batch SGD or AdamW with full GPU acceleration. Key implementation innovations include:

  • Input rescaling for stabilization: features entering the order-$l$ terms are rescaled so that term magnitudes remain comparable across interaction orders.
  • “Basis dropout” on rank-1 components: randomly zeroing out basis terms $u_{l,r}$ during training to regularize and improve robustness (see the sketch after this list).
  • For multi-class extensions, basis vectors $u_{l,r}$ can be shared across classes, with only the coefficients $\lambda_{l,r}$ kept class-specific, so the parameter count for $C$ classes grows as $O\big(d \sum_l R_l + C \sum_l R_l\big)$ rather than $O\big(C\, d \sum_l R_l\big)$.
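A hedged PyTorch sketch of an order-2 SPAM with basis dropout and AdamW training follows; the class name, rank, dropout rate, and initialization are illustrative assumptions rather than the authors' reference implementation:

```python
import torch
import torch.nn as nn

class SPAMOrder2(nn.Module):
    """Order-2 SPAM: P(x) = b + <w1, x> + sum_r lam_r <u_r, x>^2 (sketch)."""

    def __init__(self, d, rank, basis_dropout=0.2):
        super().__init__()
        self.b = nn.Parameter(torch.zeros(1))
        self.w1 = nn.Parameter(torch.zeros(d))
        self.lam = nn.Parameter(0.01 * torch.randn(rank))
        self.U = nn.Parameter(0.01 * torch.randn(rank, d))  # rank-1 bases u_r
        self.basis_dropout = basis_dropout

    def forward(self, x):                        # x: (batch, d)
        proj = x @ self.U.T                      # (batch, rank): <u_r, x>
        terms = self.lam * proj.pow(2)           # per-rank order-2 terms
        if self.training and self.basis_dropout > 0:
            # Basis dropout: zero out whole rank-1 components at random.
            keep = (torch.rand(terms.shape[-1], device=x.device)
                    > self.basis_dropout).float()
            terms = terms * keep / (1.0 - self.basis_dropout)
        return self.b + x @ self.w1 + terms.sum(dim=-1)

model = SPAMOrder2(d=100, rank=16)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
x, y = torch.randn(32, 100), torch.randn(32)
loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad(); loss.backward(); opt.step()
```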

The computational cost of evaluation and training scales as $O\big(d \sum_l R_l\big)$ per sample, avoiding the $O(d^k)$ regime of naïve polynomial expansions.

3. Theoretical Risk Guarantees and Expressivity

SPAM provably matches the risk convergence rates of full-rank polynomials under mild “spectral-decay” assumptions (i.e., rapid decay of the singular values $\lambda_{l,r}$ of the true polynomial). Specifically, for $\ell_2$-regularization and 1-Lipschitz losses, the excess risk decomposes into a standard $O(1/\sqrt{n})$ estimation term plus a low-rank approximation remainder controlled by the truncated tail of the spectrum; the remainder vanishes as the ranks $R_l$ increase. This ensures that statistical efficiency is not sacrificed for scalability or interpretability (Dubey et al., 2022).

4. Interpretability and Feature Attribution

SPAMs remain inherently interpretable despite modeling arbitrary higher-order feature interactions. Each polynomial term $\lambda_{l,r} \langle u_{l,r}, x \rangle^{l}$ can be directly inspected, and the model’s output decomposes additively across these terms. In practice, feature importances and attributions can be computed in closed form or by differentiation, providing clarity on how individual features and interaction groups influence predictions (Dubey et al., 2022); a small sketch follows.
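For instance, the gradient of the order-2 part has the closed form $\partial_x \sum_r \lambda_r \langle u_r, x \rangle^2 = \sum_r 2 \lambda_r \langle u_r, x \rangle u_r$, which yields per-feature attributions directly; the snippet below is an illustrative sketch, not the paper's attribution code:

```python
import numpy as np

def order2_attributions(x, lams, U):
    """Closed-form gradient attribution for the order-2 SPAM part.

    d/dx_i of sum_r lam_r <u_r, x>^2  =  sum_r 2 lam_r <u_r, x> U[r, i]
    Returns a (d,) vector of per-feature contributions.
    """
    proj = U @ x                        # (R,) inner products <u_r, x>
    return (2.0 * lams * proj) @ U      # (d,) gradient, one pass over the bases

rng = np.random.default_rng(1)
d, R = 6, 3
x, lams, U = rng.normal(size=d), rng.normal(size=R), rng.normal(size=(R, d))
attr = order2_attributions(x, lams, U)
print(np.argsort(-np.abs(attr)))        # feature indices ranked by saliency
```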

A human-subject evaluation (a prediction task on Amazon Mechanical Turk) empirically confirmed improved interpretability over linear explanations (0.67 mean user accuracy) and LIME explanations (0.65), with SPAM achieving 0.71 (statistically significant under a one-sided test) when users were shown the most salient features (Dubey et al., 2022).

5. Benchmark Performance and Empirical Scalability

On a wide range of tasks—including regression, binary and multi-class classification, and object detection—SPAM matches or outperforms prior interpretable models (linear, EBMs, NAMs) and aligns closely with deep neural networks and boosted ensembles (e.g., XGBoost). For example:

  • SPAM-Linear (order 2) surpasses all interpretable baselines and matches XGBoost on tasks such as News20 and object detection.
  • SPAM-Neural (order 2) matches or exceeds XGBoost for regression and binary classification.
  • SPAM (order 3) closes the remaining gap to MLPs on tasks like CoverType (Dubey et al., 2022).

Throughput measurements demonstrate that SPAM sustains tens of millions of examples/sec for datasets with up to 146,000 features, outperforming MLPs and exceeding Neural Additive Models (NAMs) by orders of magnitude in speed (Dubey et al., 2022).

6. Extensions: Tensorization and Higher-Order Data

The concept of scalable polynomial additive modeling generalizes to tensor-valued inputs, as formalized in the Tensor Polynomial Additive Model (TPAM). Instead of flattening tensors and losing structural information, TPAMs employ hierarchical low-order symmetric tensor decompositions—first representing high-order parameter tensors via a sum of symmetric rank-1 tensors, then factorizing each core via a CP decomposition across modes (Chen et al., 2024):

  • T-TPAM maintains the full tensor basis, with a parameter count that scales with the sum of the input mode dimensions at each interaction order rather than with their product.
  • V-TPAM further reduces complexity by using vectorized CP decompositions (see the sketch after this list).
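To make the compression concrete, the sketch below counts parameters for a rank-$R$ CP factorization of a dense weight tensor and reconstructs the tensor from its mode factors; it illustrates the general CP principle under assumed shapes, not the exact TPAM parameterization:

```python
import numpy as np

def cp_param_count(dims, rank):
    """Dense vs. rank-R CP parameter counts for a tensor of shape `dims`."""
    dense = int(np.prod(dims))              # product of mode dimensions
    cp = rank * int(np.sum(dims))           # one (d_i, R) factor per mode
    return dense, cp

def cp_reconstruct(factors):
    """Rebuild a tensor from CP factors [(d_1, R), ..., (d_m, R)]."""
    rank = factors[0].shape[1]
    out = np.zeros([f.shape[0] for f in factors])
    for r in range(rank):
        comp = factors[0][:, r]
        for f in factors[1:]:
            comp = np.multiply.outer(comp, f[:, r])  # rank-1 outer product
        out += comp
    return out

dense, cp = cp_param_count(dims=(32, 32, 3), rank=8)
print(f"dense: {dense} params, CP: {cp} params")     # 3072 vs 536
```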

This approach delivers substantial parameter compression and enables application to image and multi-way data, with empirical improvements in both predictive accuracy and localization metrics (PI-CAM yields >80% saliency overlap with true object boxes) relative to traditional SPAMs and state-of-the-art CNN attribution methods (Chen et al., 2024).

7. Limitations, Open Problems, and Future Directions

While SPAMs achieve compelling empirical and theoretical properties, certain limitations persist:

  • The worst-case tensor rank required for an exact decomposition may still be very large, though empirical spectral decay mitigates this in practical datasets (Dubey et al., 2022).
  • SPAM models yield dense basis vectors $u_{l,r}$, complicating direct interpretation in high dimensions; group-sparsity regularizers may address this at the cost of certain fidelity–sparsity trade-offs (a sketch of one such penalty appears after this list).
  • Current SPAM and TPAM frameworks handle real-valued data; applications to discrete, structured, or graph domains remain active topics.
  • Extensions beyond interpretability use cases include incorporating SPAM polynomial layers into LLMs, privacy-aware architectures, and other domains demanding high expressivity with transparent attribution.
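One plausible instantiation of such a group-sparsity regularizer is an $\ell_{2,1}$ penalty that groups each feature's coefficients across all rank-1 bases, encouraging entire features to drop out of the model; this is an assumed sketch, not a method specified in either paper:

```python
import torch

def group_sparsity_penalty(U, weight=1e-3):
    """l_{2,1} penalty over features: weight * sum_i ||U[:, i]||_2.

    U: (rank, d) basis matrix. Driving a whole column to zero removes
    feature i from every rank-1 component, sparsifying attributions.
    """
    return weight * U.norm(dim=0).sum()

U = torch.randn(16, 100, requires_grad=True)
penalty = group_sparsity_penalty(U)     # added to the training loss in practice
penalty.backward()
print(penalty.item())
```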

Further research continues on architectural innovations (e.g., structured sparsity, tensor networks), hardware optimization (distributed solvers), and expanded self-explanatory mechanisms (e.g., new activation mapping strategies for deep tensorized features) (Dubey et al., 2022; Chen et al., 2024).

References

1. Dubey, A., Radenovic, F., & Mahajan, D. (2022). Scalable Interpretability via Polynomials. NeurIPS 2022.
2. Chen et al. (2024). Tensor Polynomial Additive Model (TPAM).
