Pairwise Feature Interactions

Updated 22 April 2026

Pairwise feature interactions are statistical components that capture the joint influence of two variables beyond their individual effects.
They are modeled with methods ranging from bilinear terms and graph-based models to deep network modules, enabling both mechanistic insight and robust prediction.
Their application spans physics, biology, and social sciences, facilitating efficient discovery of dependency structures in complex systems.

Pairwise feature interactions quantify the dependence between two input variables beyond their marginal (main) effects. By explicitly modeling how the joint state of a pair of features influences an output, such interactions provide mechanistic insight, improve predictive accuracy in non-additive systems, and facilitate interpretability, especially in domains where interactions encode physical, biological, or social mechanisms. Pairwise interactions are fundamental across statistical models, machine learning, and physical sciences, with diverse operationalizations ranging from explicit bilinear terms and graph-based structures to differentiable interaction attributions in deep learning architectures.

1. Mathematical Definitions and Modeling Frameworks

A pairwise interaction, generically, refers to a model component or statistical quantity capturing how the combined values $(x_i, x_j)$ of two features jointly affect an outcome variable, beyond what is explained by their individual (main) effects. In standard statistical and machine learning practice, this is realized by including products such as $x_i x_j$ for continuous variables or indicator products for categorical pairs.

Generalized linear interaction models: For binary (logistic) regression, the full pairwise interaction form is

$\log \frac{P(y=1|x)}{P(y=0|x)} = \sum_{i}\beta_i x_i + \sum_{i < j} \beta_{ij} x_i x_j,$

where $\beta_{ij}$ quantifies the strength of interaction between $x_i$ and $x_j$ (Xu et al., 2016).
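As a minimal sketch of the log-odds expression above, the following NumPy code builds the product columns $x_i x_j$ for all $i < j$ and evaluates the linear predictor. The helper names (`pairwise_design`, `log_odds`) and the coefficient values are illustrative assumptions, not taken from Xu et al.

```python
import itertools
import numpy as np

def pairwise_design(X):
    """Append all products x_i * x_j (i < j) to the main-effect columns."""
    n, p = X.shape
    pairs = list(itertools.combinations(range(p), 2))
    inter = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
    return np.hstack([X, inter]), pairs

def log_odds(X, beta_main, beta_pair):
    """Evaluate sum_i beta_i x_i + sum_{i<j} beta_ij x_i x_j for each row of X."""
    Z, pairs = pairwise_design(X)
    # Pairs missing from beta_pair are treated as non-interacting (coefficient 0).
    coef = np.concatenate([beta_main, [beta_pair.get(pr, 0.0) for pr in pairs]])
    return Z @ coef

# Made-up example: three features, only x_0 and x_1 interact.
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
beta_main = np.array([0.5, -1.0, 0.25])
beta_pair = {(0, 1): 2.0}
eta = log_odds(X, beta_main, beta_pair)   # log-odds per row
prob = 1.0 / (1.0 + np.exp(-eta))         # P(y = 1 | x)
```

In the second row both $x_0$ and $x_1$ are active, so the interaction coefficient contributes to the log-odds; in the first row it does not.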

Energy-based models/Ising models: For $N$ binary variables $s \in \{-1, +1\}^N$, the energy-based (Ising) model restricts the energy function to

$E(s) = -\sum_{i} h_i s_i - \sum_{i < j} J_{ij} s_i s_j,$

with $J_{ij}$ representing pairwise couplings (Feinauer et al., 2020).
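The energy function can be evaluated directly. The sketch below assumes a symmetric coupling matrix `J` with zero diagonal and counts each pair $i < j$ once; the numerical values are made up for illustration.

```python
import numpy as np

def ising_energy(s, h, J):
    """E(s) = -sum_i h_i s_i - sum_{i<j} J_ij s_i s_j for s in {-1, +1}^N."""
    J_upper = np.triu(J, k=1)        # keep each pair i < j exactly once
    return -(h @ s) - s @ J_upper @ s

h = np.array([0.1, -0.2, 0.0])
J = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, -0.5],
              [0.0, -0.5, 0.0]])
s = np.array([1, -1, 1])
energy = ising_energy(s, h, J)
```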

Graph-based representations: Interactions can be encoded in a graph $G = (V, E)$, with an edge $(i, j) \in E$ if features $x_i$, $x_j$ interact. Functions over this graph admit the form

$f(x) = \sum_{i \in V} f_i(x_i) + \sum_{(i,j) \in E} f_{ij}(x_i, x_j)$

(Yamchote et al., 19 Feb 2025).

Deep networks and embeddings: In neural architectures, interaction blocks are implemented via shared or individual sub-networks on pairs of feature embeddings, e.g., in Tree-like Pairwise Interaction Networks (PINs), where modules compute a term $h_{ij}(x_i, x_j)$ for each feature pair, with the final output aggregating these terms (Richman et al., 21 Aug 2025).
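The additive pairwise structure can be illustrated with a small forward pass. This is only a sketch under stated assumptions, not the PIN architecture of Richman et al.: the linear per-feature embeddings, the learned per-pair token, the layer sizes, and the random weights are all hypothetical choices made for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
p, d, hidden = 4, 3, 8               # features, embedding size, hidden units

# Hypothetical parameters: one embedding vector per feature, one token per
# pair, and a single two-layer MLP shared across all pairs.
embed_w = rng.normal(size=(p, d))
pair_token = {pr: rng.normal(size=d) for pr in itertools.combinations(range(p), 2)}
W1 = rng.normal(size=(3 * d, hidden))
W2 = rng.normal(size=hidden)

def pair_term(x, i, j):
    """Pairwise term: shared MLP on the two feature embeddings plus the pair token."""
    z = np.concatenate([x[i] * embed_w[i], x[j] * embed_w[j], pair_token[(i, j)]])
    return np.tanh(z @ W1) @ W2

def predict(x):
    """Additive aggregation of all pairwise terms."""
    return sum(pair_term(x, i, j) for (i, j) in pair_token)

x = rng.normal(size=p)
terms = {pr: pair_term(x, *pr) for pr in pair_token}   # one inspectable value per pair
```

Because each term depends on only two inputs, it can be plotted as a surface over the values of that pair, which is the property that makes such models inspectable.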

The essential mathematical ingredient is the construction of a second-order term that is sensitive to the joint state of $(x_i, x_j)$, such that neither feature alone suffices to explain the induced effect.

2. Statistical Inference and Identification of Pairwise Interactions

Identifying nonzero or significant pairwise interactions poses methodological and computational challenges due to the quadratic scaling with the number of features and potential statistical confounds. Multiple regimes and frameworks have been proposed:

Influence-based Graph Recovery: In acyclic logistic interaction models, an empirical “influence” statistic computed for each feature pair $(i, j)$ can be used to construct a weight matrix over pairs. For forests (acyclic graphs), the true structure is recovered exactly by computing a maximum-weight spanning tree and thresholding the resulting edge weights (Xu et al., 2016).
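The spanning-tree step can be sketched as follows, assuming a symmetric matrix of pair weights is already available. The matrix `W` and the threshold below are made up for illustration and are not computed with the influence statistic of Xu et al.; the function name `max_spanning_forest` is likewise hypothetical.

```python
import numpy as np

def max_spanning_forest(W, threshold=0.0):
    """Kruskal's algorithm on a symmetric weight matrix, keeping only edges
    whose weight exceeds the threshold."""
    p = W.shape[0]
    parent = list(range(p))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    # Visit candidate edges from heaviest to lightest.
    edges = sorted(((W[i, j], i, j) for i in range(p) for j in range(i + 1, p)),
                   reverse=True)
    kept = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj and w > threshold:      # no cycle, and weight is large enough
            parent[ri] = rj
            kept.append((i, j))
    return sorted(kept)

W = np.array([[0.00, 0.90, 0.10, 0.05],
              [0.90, 0.00, 0.80, 0.02],
              [0.10, 0.80, 0.00, 0.03],
              [0.05, 0.02, 0.03, 0.00]])
edges = max_spanning_forest(W, threshold=0.2)
```

With these weights the chain 0-1-2 is retained and feature 3 is left isolated, since all of its candidate edges fall below the threshold.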

Convex Hierarchical Testing: Jointly tests all main effects and interactions via a convex optimization formulation over the main-effect coefficients $\beta_i$ and interaction coefficients $\beta_{ij}$ with hierarchy constraints. The hierarchy, which forces $\beta_{ij}$ to be nonzero only if at least one of the corresponding main effects is nonzero, prevents the spurious inclusion of interactions (Bien et al., 2012).

Hierarchical Group-Lasso: Penalizes the inclusion of pairwise terms unless corresponding main effects are also active (strong hierarchy). The loss combines block $\ell_2$ penalties for main and interaction coefficients, with overlaps ensuring hierarchical inclusion (Lim et al., 2013).
Bayesian and Kernel Methods: The “kernel interaction trick” reparameterizes the full interaction space using hyperparameters for main effects, enabling scalable Bayesian inference and exact shrinkage in high dimensions (Agrawal et al., 2019). The kernelized function admits closed-form expressions for all $O(p^2)$ pairwise interaction effects using only $O(p)$ parameters, where $p$ is the number of features.
Feature Graph Selection in GNNs: Pairwise interaction graphs are constructed such that only truly interacting edges are retained. Theoretical results based on the Minimum Description Length (MDL) principle state that adding noise edges or removing true ones always increases the total description length (model plus data fit), leading to suboptimal performance (Yamchote et al., 19 Feb 2025).
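The hierarchy conditions mentioned above can be made concrete with a small check on a fitted coefficient set. This is a post-hoc test only, not the convex or group-lasso estimators themselves; the function name and the dictionary layout for interaction coefficients are assumptions made for illustration.

```python
import numpy as np

def satisfies_hierarchy(beta_main, beta_pair, strong=True, tol=1e-12):
    """Return True if every nonzero interaction has active main effects.

    strong=True  : both main effects of the pair must be nonzero.
    strong=False : at least one main effect of the pair must be nonzero (weak).
    """
    active = np.abs(beta_main) > tol
    for (i, j), b in beta_pair.items():
        if abs(b) <= tol:
            continue                        # inactive interactions impose nothing
        ok = (active[i] and active[j]) if strong else (active[i] or active[j])
        if not ok:
            return False
    return True

beta_main = np.array([1.0, 0.0, 0.0])
beta_pair = {(0, 1): 0.5, (1, 2): 0.0}
weak_ok = satisfies_hierarchy(beta_main, beta_pair, strong=False)
strong_ok = satisfies_hierarchy(beta_main, beta_pair, strong=True)
```

Here the interaction between features 0 and 1 is admissible under weak hierarchy, because feature 0 has a main effect, but not under strong hierarchy, because feature 1 does not.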

3. Extensions in Deep Neural Architectures

Deep learning frameworks extend pairwise interactions beyond classical parametric forms:

Explicit architectural modules: PINs define for each pair $(i, j)$ a function $h_{ij}(x_i, x_j)$ computed via a two-dimensional embedding and shared multilayer perceptron, which is then combined in an additive fashion for the final prediction. This design enables intrinsic interpretability (inspection of the $h_{ij}$ surfaces) and efficient computation of SHAP values due to the restriction to pairwise terms (Richman et al., 21 Aug 2025).
Molecular Physics/MD: In atomistic simulation, neural architectures aggregate learnable embeddings over each atom’s neighbors, where each atom pair is described by a feature vector derived from atom pair distances, local Coulomb matrix eigenvalues, or atomic numbers. Explicit pairwise aggregation is reported to match the accuracy of high-level DFT calculations and, when enhanced by environmental descriptors, to accurately reproduce many-body effects (Nguyen et al., 2021).

Energy-Based Model Hybrids: For data containing higher-order effects, hybrid energy-based models combine an interpretable pairwise model $E_{\mathrm{pair}}(s)$ with a high-capacity neural term $E_{\mathrm{NN}}(s)$, jointly learned under pseudolikelihood:

$E(s) = E_{\mathrm{pair}}(s) + E_{\mathrm{NN}}(s), \qquad E_{\mathrm{pair}}(s) = -\sum_{i} h_i s_i - \sum_{i < j} J_{ij} s_i s_j.$

This approach is robust to higher-order contamination and consistently isolates the true $J_{ij}$ from spurious effects, even when complex dependencies are present (Feinauer et al., 2020).
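As a sketch of the pseudolikelihood objective for the pairwise part alone, the code below evaluates the negative log pseudolikelihood of an Ising model with fields $h_i$ and symmetric couplings $J_{ij}$. The neural term of the hybrid model is deliberately omitted, and the function name and toy data are assumptions made for illustration.

```python
import numpy as np

def neg_log_pseudolikelihood(S, h, J):
    """Mean over samples of -sum_i log P(s_i | s_{-i}) for a pairwise Ising model.

    S : (n, N) array of +/-1 spins; J : symmetric (N, N) array with zero diagonal.
    Uses P(s_i | s_{-i}) = 1 / (1 + exp(-2 s_i (h_i + sum_j J_ij s_j))).
    """
    field = h + S @ J                       # local field acting on each spin
    return np.mean(np.sum(np.logaddexp(0.0, -2.0 * S * field), axis=1))

# Toy data in which the two spins always agree.
S = np.array([[1, 1], [-1, -1]])
loss_independent = neg_log_pseudolikelihood(S, np.zeros(2), np.zeros((2, 2)))
loss_coupled = neg_log_pseudolikelihood(
    S, np.zeros(2), np.array([[0.0, 1.0], [1.0, 0.0]]))
```

A positive coupling lowers the loss on data where the spins agree, which is the signal a gradient-based fit would follow.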

Interaction Attribution in Deep Networks: Integrated Hessians defines a local pairwise interaction attribution for any differentiable model $f$ as, for $i \neq j$,

$\Gamma_{ij}(x) = (x_i - x'_i)(x_j - x'_j) \int_0^1 \int_0^1 \alpha\beta \, \frac{\partial^2 f\big(x' + \alpha\beta (x - x')\big)}{\partial x_i \, \partial x_j} \, d\alpha \, d\beta,$

where $x' + \alpha\beta(x - x')$ traces the straight-line path from a baseline $x'$ to the input $x$. This satisfies axioms including interaction completeness, symmetry, and sensitivity, and is scalable for high-dimensional networks (Janizek et al., 2020).
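A numerical sketch of this attribution for a pair $i \neq j$ is given below. It assumes the double-integral form $(x_i - x'_i)(x_j - x'_j)\int_0^1\!\int_0^1 \alpha\beta\, \partial^2 f / \partial x_i \partial x_j \, d\alpha\, d\beta$, evaluated along the straight-line path from the baseline $x'$ to $x$, and approximates it with a midpoint rule. The mixed partial derivative is supplied analytically through a hypothetical callback `hess_ij`; in practice it would come from automatic differentiation.

```python
import numpy as np

def integrated_hessian_pair(hess_ij, x, baseline, i, j, steps=64):
    """Midpoint-rule estimate of the i != j interaction attribution."""
    grid = (np.arange(steps) + 0.5) / steps          # midpoints of [0, 1]
    total = 0.0
    for a in grid:
        for b in grid:
            point = baseline + a * b * (x - baseline)   # position on the path
            total += a * b * hess_ij(point, i, j)
    integral = total / steps**2
    return (x[i] - baseline[i]) * (x[j] - baseline[j]) * integral

# Toy model f(x) = x_0 * x_1, whose mixed second derivative is 1 everywhere.
def hess_product(point, i, j):
    return 1.0 if {i, j} == {0, 1} else 0.0

x = np.array([2.0, 3.0])
baseline = np.zeros(2)
gamma_01 = integrated_hessian_pair(hess_product, x, baseline, 0, 1)
```

For this toy model the integral of $\alpha\beta$ over the unit square is $1/4$, so the attribution equals $x_0 x_1 / 4$.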

4. Information-Theoretic and Time Series Approaches

Pairwise feature interactions arise centrally in dependence discovery, time series, and network analysis. A unifying view casts each method as a statistic of pairwise interaction (SPI) measuring the strength of association between two processes, with over 200 such metrics catalogued in recent literature (Cliff et al., 2022):

Classical measures: Pearson correlation (runtime linear in the series length, high interpretability) detects linear synchronous dependencies; mutual information generalizes to nonlinear relations.
Directional and causal measures: Granger causality (linear, time-lagged), transfer entropy (nonlinear, time-lagged), and Kozachenko–Leonenko causally conditioned entropy enable detection of lagged or causally directed pairwise effects.
Kernel/statistical dependence: Distance correlation and HSIC (Hilbert–Schmidt independence criterion) capture arbitrary nonlinear dependencies with nonparametric kernels.
Feature-based timescale interactions: For complex systems where the interaction is mediated by long-timescale features rather than raw sample values, methods extract candidate time-series features on sliding windows and assess dependence (via mutual information) between such features and future values of the target process. Such approaches outperform classic correlation and mutual information under high noise, short series, or long interaction timescales (Nguyen et al., 2024).
Selection guidelines: Interpretation, computational cost, and assumptions (linearity, synchrony, stationarity) determine the choice of SPI. Empirical findings recommend hybrid or multi-method pipelines and sparse selections for interpretability and efficiency (Cliff et al., 2022).
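The contrast between a linear statistic and an information-theoretic one can be illustrated on synthetic data. The sketch below uses a simple plug-in histogram estimate of mutual information, chosen here only for brevity; it is not one of the estimators catalogued by Cliff et al., and the bin count and data-generating process are arbitrary assumptions.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    return float(np.corrcoef(x, y)[0, 1])

def mutual_information(x, y, bins=8):
    """Plug-in histogram estimate of I(X; Y) in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)     # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)     # marginal of y, shape (1, bins)
    mask = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = x**2 + 0.1 * rng.normal(size=5000)      # nonlinear, symmetric dependence
r = pearson(x, y)
mi = mutual_information(x, y)
```

Because the dependence is symmetric about zero, the correlation is close to zero while the mutual information estimate is clearly positive, which is the kind of difference in assumptions that guides the choice of statistic.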

5. Interpretability, Model Selection, and Practical Guidelines

Interpretability, model complexity, and inductive bias are central themes in pairwise interaction modeling:

Intrinsic interpretability: PINs and GA²Ms are designed so that each pairwise term’s influence can be decomposed and visualized, facilitating insight into complex decision boundaries (Richman et al., 21 Aug 2025).
Sparse modeling: Both theoretical (MDL, graph-theoretic) and empirical results indicate that models with only genuine pairwise interactions yield the most efficient, interpretable, and robust solutions. Introducing superfluous interaction edges is detrimental due to increased variance and noise-fitting (Yamchote et al., 19 Feb 2025).
Hierarchical constraints: Imposing (weak or strong) hierarchy—that is, forcing interactions to be included only if relevant main effects are present—prevents overfitting and ensures scientific plausibility. Hierarchical group-lasso and convex hierarchical testing operationalize these principles effectively in large-scale data (Lim et al., 2013, Bien et al., 2012).
Regularization and scalability: Modern optimization (FISTA, ADMM, block soft-thresholding, adaptive screening) and kernel tricks (Gaussian process reparameterization) mitigate the quadratic scaling, rendering high-dimensional pairwise interaction discovery feasible with rigorous uncertainty quantification (Lim et al., 2013, Agrawal et al., 2019).
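Block soft-thresholding, mentioned above, is the proximal operator of the unsquared $\ell_2$ norm and is the basic update used by proximal-gradient solvers for group penalties. The sketch below shows the operator on a single coefficient block; the function name is an illustrative choice.

```python
import numpy as np

def block_soft_threshold(v, lam):
    """Proximal operator of lam * ||v||_2: shrink the whole block toward zero,
    setting it exactly to zero when its norm does not exceed lam."""
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)
    return (1.0 - lam / norm) * v

v = np.array([3.0, 4.0])                    # block with Euclidean norm 5
shrunk = block_soft_threshold(v, 1.0)       # norm reduced from 5 to 4
zeroed = block_soft_threshold(v, 5.0)       # whole block removed
```

Zeroing a block as a unit is what lets a group penalty switch an entire interaction term on or off rather than shrinking its coefficients one at a time.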

6. Empirical Results and Application Domains

Pairwise modeling techniques have demonstrated strong empirical performance:

Robust recovery of interactions: In acyclic logistic models, influence-graph methods recover the true interaction structure with a sample-complexity guarantee, outperforming generic feature selection (Xu et al., 2016).
Sample-efficient robust hybrid models: Hybrid EBM/NN models accurately isolate the true pairwise couplings $J_{ij}$, even amidst high levels of higher-order noise, with a reported reduction in reconstruction error over pairwise-only models (Feinauer et al., 2020).
Superior interpretability and efficiency: In insurance pricing (French motor MTPL dataset), PIN achieves the lowest Poisson deviance among traditional and deep-learning benchmarks, delivering exact SHAP attributions and easily visualizable interaction structures (Richman et al., 21 Aug 2025).
Force-field and quantum chemistry: Pairwise DNN architectures reach DFT-level accuracy and transferability, with environmental descriptors enhancing many-body physics, as demonstrated on large silicon and Si–Li systems (Nguyen et al., 2021).
Time series and complex systems: Highly comparative pipelines using 237 SPIs deliver improved classification and mechanistic interpretation in wearable, EEG, and fMRI applications (Cliff et al., 2022). Feature-based MI approaches excel in noisy, short, or long-timescale coupling detection (Nguyen et al., 2024).

7. Theoretical Insights, Limitations, and Future Directions

Identifiability and overfitting: The inclusion of spurious (non-interacting) edges, as observed in GNN and MDL analyses, introduces noise, increases overfitting risk, and is theoretically guaranteed to worsen the modeling code length (Yamchote et al., 19 Feb 2025).
Higher-order and nonlinear interactions: While pairwise terms dominate in many domains (physical systems, epidemiology, insurance), real-world data often contain significant higher-order effects. Hybrid models (e.g., EBM plus neural networks) modularly capture such effects while preserving correct reconstruction of pairwise structure (Feinauer et al., 2020).
Expanding interpretability: Integrated Hessians, SHAP-based methods for pairwise terms, and graph-based visualizations expand transparency across a wider range of models and input domains (Janizek et al., 2020, Richman et al., 21 Aug 2025).
Scalability: Advances in optimization (screening, FISTA, kernel compression) and theoretical guarantees (hierarchy, MDL) have enabled application to high-dimensional feature spaces with correspondingly large numbers of interaction pairs (Lim et al., 2013, Agrawal et al., 2019).
Open questions: Future directions include model selection and structure learning for higher-order interactions, principled architecture constraints ensuring no pairwise leakage in neural modules, and unification across continuous, categorical, and sequence-based inputs. Improved methods for sparse and structured discovery of interactions in graph-based deep networks and in non-i.i.d. settings are also ongoing research fronts.

Pairwise feature interactions remain a cornerstone in interpretable, robust, and physically-informed modeling across scientific and engineering domains. Ongoing methodological advances span inference, scalability, hybrid modeling, and attribution, supported by a growing body of empirical evidence and theoretical analysis (Feinauer et al., 2020, Richman et al., 21 Aug 2025, Cliff et al., 2022, Yamchote et al., 19 Feb 2025, Nguyen et al., 2021, Janizek et al., 2020, Xu et al., 2016, Bien et al., 2012, Lim et al., 2013, Nguyen et al., 2024, Agrawal et al., 2019).