
Strong Heredity Constraint in Regression Models

Updated 10 December 2025
  • Strong Heredity Constraint is a rule requiring that every higher-order term have all of its parent main effects included, ensuring the model structure aligns with scientific reasoning.
  • Modeling methodologies such as convex penalty frameworks, hierarchical standardization, and Bayesian priors enforce this constraint and reduce the search space in high-dimensional settings.
  • Enforcing the constraint improves interpretability, controls false positives, and maintains prediction accuracy, illustrating its practical benefits in complex regression analyses.

The strong heredity constraint is a structural restriction applied to statistical models, particularly in high-dimensional regression with interaction or higher-order polynomial terms: no interaction or higher-order term may be included in a model unless all of its lower-order "parent" main effects are also included. The constraint is motivated by the principle of marginality, prevalent in experimental design, statistics, and machine learning with structured features, and it promotes model interpretability, stability, and alignment with scientific reasoning (Ye et al., 2018; Chen et al., 2020; Haris et al., 2014; Taylor-Rodriguez et al., 2013).

1. Formal Definition and Mathematical Structure

The strong heredity constraint (SHC) mandates that every included higher-order term have all of its strict lower-order parent terms present in the model. For a model with $p$ covariates and up to two-way interactions, let $\beta\in\mathbb{R}^{p+\binom{p}{2}}$ with main-effect coefficients $\beta^{(1)}=(\beta_1,\dots,\beta_p)$ and interaction coefficients $\beta^{(2)}=(\beta_{i,j}:1\leq i<j\leq p)$. The strong heredity constraint is formalized as

$$\mathbb{R}^p_{\text{strong}} := \left\{ \beta\in\mathbb{R}^{p+\binom{p}{2}} \,\middle|\, \forall\, 1\leq i<j\leq p,\ \mathbf{1}\{\beta_{i,j}\neq0\} \leq \mathbf{1}\{\beta_i\neq 0\}\cdot\mathbf{1}\{\beta_j\neq 0\} \right\},$$

so $\beta_{i,j}\neq 0$ implies both $\beta_i\neq 0$ and $\beta_j\neq 0$ (Ye et al., 2018; Chen et al., 2020; Haris et al., 2014). For higher-order polynomials or general interaction terms, the constraint reads: for each term $\alpha$, its inclusion indicator $\gamma_\alpha$ must satisfy $\gamma_\alpha\leq \min_{\alpha'\in P(\alpha)}\gamma_{\alpha'}$, where $P(\alpha)$ denotes the set of parent terms of $\alpha$ (Taylor-Rodriguez et al., 2013).

By contrast, the weak heredity constraint (WHC) only requires at least one parent to be present.
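In code, the indicator form of the constraint is a simple feasibility check on a sparsity pattern. A minimal sketch (function names are illustrative, not from any of the cited packages):

```python
def satisfies_strong_heredity(beta_main, beta_int):
    """Strong heredity: an interaction coefficient beta[i, j] may be
    nonzero only if BOTH parent main effects are nonzero.

    beta_main : dict {i: coefficient}
    beta_int  : dict {(i, j): coefficient} with i < j
    """
    for (i, j), b_ij in beta_int.items():
        if b_ij != 0 and (beta_main.get(i, 0) == 0 or beta_main.get(j, 0) == 0):
            return False
    return True

def satisfies_weak_heredity(beta_main, beta_int):
    """Weak heredity: at least one parent main effect must be nonzero."""
    for (i, j), b_ij in beta_int.items():
        if b_ij != 0 and beta_main.get(i, 0) == 0 and beta_main.get(j, 0) == 0:
            return False
    return True
```

For example, with main effects $\beta_1=0.5$, $\beta_2=0$, $\beta_3=1.2$, the interaction $(1,3)$ is SHC-admissible while $(1,2)$ satisfies only weak heredity.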

2. Modeling Methodologies Satisfying Strong Heredity

A variety of algorithmic frameworks implement the strong heredity constraint. They can be broadly categorized as follows:

  • Penalty-based convex methods: Frameworks such as FAMILY (Haris et al., 2014) encode SHC with group-structured convex penalties. For a bilinear interaction model $y = X\beta + Z\gamma + \sum_{j,k} \Gamma_{j,k}X_{\cdot,j}Z_{\cdot,k} + \varepsilon$, SHC is enforced via penalties on the rows and columns of the joint coefficient array $B$, e.g., group $\ell_2$, $\ell_\infty$, or hybrid $\ell_1/\ell_\infty$ norms applied over groups consisting of a main effect plus its interactions. No hard constraints are required: appropriate group penalties (e.g., as in hierNet [Bien et al.]) guarantee that interaction terms cannot be selected unless their parent main effects are nonzero.
  • Hierarchical standardization: The hierarchical standardization procedure (Chen et al., 2020) ensures SHC by transforming variables so that any selection method applied on the transformed space, followed by a prescribed back-transformation, yields coefficients that satisfy the SHC algebraically. For main effects $X_j$ and interactions $X_jX_k$, variables are standardized as $Z_j = (X_j-\bar{X}_j)/s_j$ and $Z_{jk} = Z_jZ_k$. Any nonzero interaction coefficient $\hat{\beta}_{jk}$ on the original scale forces both $\hat{\beta}_j$ and $\hat{\beta}_k$ to be nonzero after back-transformation, provided no $X_k$ is constant.
  • Bayesian model space priors: Bayesian variable selection (Taylor-Rodriguez et al., 2013) enforces SHC through hierarchical priors on the binary inclusion indicators, equivalently encoded as feasibility constraints on the model space: a term can be included only if all of its parents are included. Efficient model search is performed via tailored Metropolis–Hastings samplers that propose moves only within the SHC-admissible subspace.
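The hierarchical-standardization idea can be sketched in a few lines. The back-transformation below expands $\gamma_{jk}Z_jZ_k = \gamma_{jk}(X_j-\bar{X}_j)(X_k-\bar{X}_k)/(s_js_k)$, so a nonzero child coefficient generically forces nonzero parent coefficients on the original scale. Function names and the intercept-free form are illustrative, not the authors' exact implementation:

```python
import numpy as np

def hierarchical_standardize(X):
    """Standardize main effects, then build interactions from the
    standardized columns (sketch of Chen et al., 2020)."""
    mu, sd = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mu) / sd
    p = X.shape[1]
    pairs = [(j, k) for j in range(p) for k in range(j + 1, p)]
    ZI = np.column_stack([Z[:, j] * Z[:, k] for j, k in pairs])
    return Z, ZI, pairs, mu, sd

def back_transform(gamma_main, gamma_int, pairs, mu, sd):
    """Map coefficients fitted on (Z, Z_jk) back to the original X scale
    (intercept omitted). A nonzero gamma_jk contributes
    -gamma_jk * mu_k / (sd_j * sd_k) to beta_j, so selected children
    force their parents to be nonzero (generically)."""
    beta_main = gamma_main / sd
    beta_int = {}
    for g, (j, k) in zip(gamma_int, pairs):
        beta_int[(j, k)] = g / (sd[j] * sd[k])
        beta_main[j] -= g * mu[k] / (sd[j] * sd[k])
        beta_main[k] -= g * mu[j] / (sd[j] * sd[k])
    return beta_main, beta_int
```

Any off-the-shelf selector (Lasso, stepwise) can be run on `(Z, ZI)`; the heredity property comes entirely from `back_transform`, not from the fitting algorithm.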

3. Statistical Impact: Minimax Rates and Model Complexity

The effect of SHC is most evident in high-dimensional regimes. For a model class with $n$ samples, $p_n$ predictors, and sparsity levels $r_1$ (main effects) and $r_2$ (interaction terms), the minimax prediction loss under SHC is

$$R^*(n) \asymp \frac{\sigma^2}{n} \max \left\{ r_1\left(1+\log\frac{p_n}{r_1}\right),\ r_2\left(1+\log\frac{\binom{r_1}{2}}{r_2}\right) \right\}$$

(Ye et al., 2018). SHC reduces the effective search space for interactions from $\binom{p_n}{2}$ (unconstrained) or $r_1(p_n-(r_1+1)/2)$ (WHC) to $\binom{r_1}{2}$. The gain is pronounced when $r_1$ and $r_2$ are large relative to $p_n$; otherwise the main effects may dominate the risk, and the rate is unaffected by heredity. Adaptive estimators such as the ABC estimator achieve the minimax rate simultaneously over all heredity regimes, selecting models by a penalized residual sum of squares with hereditary-complexity penalties.
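The three search-space sizes quoted above are easy to compare numerically; a small helper (illustrative, using exactly those formulas):

```python
from math import comb

def interaction_search_space(p, r1, heredity):
    """Number of candidate two-way interactions given r1 selected main
    effects, under each heredity regime (formulas from Ye et al., 2018)."""
    if heredity == "strong":   # both parents among the r1 main effects
        return comb(r1, 2)
    if heredity == "weak":     # at least one parent among the r1
        return r1 * p - comb(r1, 2) - r1   # = r1 * (p - (r1 + 1) / 2)
    return comb(p, 2)          # unconstrained
```

For $p_n = 1000$ and $r_1 = 10$, the candidate interactions number 45 under SHC, 9,945 under WHC, and 499,500 unconstrained, which is the source of the logarithmic-factor savings in the minimax rate.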

4. Algorithmic and Theoretical Properties

Different families of methods offer distinct computational and theoretical guarantees:

  • FAMILY convex program (Haris et al., 2014): Provides a unified ADMM-based algorithm with guaranteed global convergence; all prior convex methods (hierNet, VANISH, all-pairs lasso) are special cases via choices of group penalties. Degrees of freedom can be unbiasedly estimated in this regime; extensions to GLMs and cross-validation-based model selection are straightforward.
  • Hierarchical standardization (Chen et al., 2020): Imposes heredity at the preprocessing stage; preserves convexity and is transparent to choice of selection algorithm. Theoretical results guarantee that all selected interaction (child) terms necessarily have their parent main effects nonzero.
  • Bayesian SHC priors and MCMC (Taylor-Rodriguez et al., 2013): By assigning hierarchical beta priors (HIP, HUP, HOP, HLP, HTP) conditional on parent inclusion, and designing an MCMC sampler that proposes only SHC-admissible moves (local, intermediate, global), the full posterior model space is efficiently explored. Theorem 1 establishes that the SHC posterior contracts, as nn\to\infty, onto the closure of the true model under the parental relation, yielding a unique asymptotic model—unlike WHC, for which multiple minimal-size models may attain positive posterior mass.
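The SHC-admissible move structure behind such samplers can be sketched as a neighborhood enumeration: an interaction may be added only when both parents are in the model, and a main effect may be dropped only when it has no included children. A minimal illustration of the admissible single-term moves (names are illustrative, not the authors' code; the actual samplers also use intermediate and global proposals):

```python
from itertools import combinations

def admissible_moves(main_in, int_in, p):
    """Enumerate single-term add/drop moves that keep a two-way
    interaction model inside the SHC-admissible space.

    main_in : set of included main-effect indices
    int_in  : set of included interaction pairs (i, j), i < j
    """
    moves = []
    # Adding a main effect is always admissible.
    moves += [("add_main", j) for j in range(p) if j not in main_in]
    # Dropping a main effect is admissible only if it has no children.
    moves += [("drop_main", j) for j in main_in
              if not any(j in pair for pair in int_in)]
    # Adding an interaction requires both parents to be included.
    moves += [("add_int", (i, j)) for i, j in combinations(sorted(main_in), 2)
              if (i, j) not in int_in]
    # Dropping an interaction is always admissible.
    moves += [("drop_int", pair) for pair in int_in]
    return moves
```

A Metropolis–Hastings step would then draw uniformly from `admissible_moves` and accept or reject with the usual ratio, so the chain never leaves the SHC subspace.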

5. Practical Implications and Empirical Results

Maintaining strong heredity improves interpretability and often yields better control of false positives in feature selection, at little or no loss in predictive performance:

  • Variable selection: Traditional approaches (standardization plus Lasso/stepwise) can violate SHC in 45–75% of selected models; hierarchical standardization or SHIM maintain SHC nearly 100% of the time, as quantified by the "MSH" (maintenance of strong heredity) metric (Chen et al., 2020).
  • Sensitivity/specificity tradeoff: Hierarchical standardization typically increases sensitivity by 10–15 percentage points, with a modest specificity decrease, since parents are forced in when their children are selected.
  • Prediction accuracy: Empirical analyses show that imposing SHC generally leaves prediction MSE unchanged or marginally improved relative to unconstrained methods, even on real-world high-dimensional datasets (e.g., TCGA glioblastoma), where strong heredity solutions produced biologically plausible gene selections with full SHC compliance (Chen et al., 2020).
  • Bayesian findings: In simulation and data analyses, model-space priors under SHC deliver lower false-positive rates and concentrate posterior mass on more parsimonious, scientifically reasonable models; unconstrained or WHC priors yield higher complexity and elevated false-positive rates (Taylor-Rodriguez et al., 2013).
  • Complexity penalty: SHC dramatically reduces the number of admissible models. For $p=8$ with $d=2$ (quadratic), the number of SHC-admissible models is $7.1\times 10^{10}$, compared to $6.5\times 10^{11}$ under WHC (Taylor-Rodriguez et al., 2013).
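The model-space reduction can be verified by brute force for tiny $p$ (the cited $p=8$, $d=2$ counts also include quadratic terms; the sketch below covers main effects and two-way interactions only, and the function name is illustrative):

```python
from itertools import chain, combinations

def powerset(items):
    items = list(items)
    return chain.from_iterable(combinations(items, k)
                               for k in range(len(items) + 1))

def count_models(p, heredity):
    """Brute-force count of admissible two-way-interaction models under
    each heredity regime; tractable only for tiny p, but it illustrates
    the model-space reduction quantified by Taylor-Rodriguez et al. (2013)."""
    pairs = list(combinations(range(p), 2))
    total = 0
    for mains in powerset(range(p)):
        s = set(mains)
        if heredity == "strong":      # both parents must be selected
            eligible = [pr for pr in pairs if pr[0] in s and pr[1] in s]
        elif heredity == "weak":      # at least one parent selected
            eligible = [pr for pr in pairs if pr[0] in s or pr[1] in s]
        else:                         # unconstrained
            eligible = pairs
        total += 2 ** len(eligible)   # any subset of eligible interactions
    return total
```

Already at $p=3$ the ordering is visible: 18 SHC-admissible models versus 45 under WHC and $2^6 = 64$ unconstrained.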

6. Comparison with Alternative Heredity Structures

| Constraint | Interaction Eligibility | Effective Interaction Space | Asymptotic Model Uniqueness |
|---|---|---|---|
| Strong heredity | Both parents in the model | $\binom{r_1}{2}$ | Unique |
| Weak heredity | At least one parent in the model | $r_1(p_n - (r_1+1)/2)$ | Non-unique |
| None | Unconstrained | $\binom{p_n}{2}$ | Non-unique |

SHC imposes the strictest hierarchy, maximizing parsimony and interpretational clarity. The practical benefit—reduced model complexity, lower search penalties, and improved control of false positive selection—scales with the number of potential interactions and the complexity of the true model (Ye et al., 2018, Taylor-Rodriguez et al., 2013).

7. Implementation and Recommendations

  • Convex group-penalty frameworks (FAMILY, hierNet, glinternet): Empirically effective for large $p$, supported by ADMM algorithms with theoretical guarantees. Select tuning parameters via cross-validation; for unbiased estimation, consider "relaxed" fits (retraining with the selected predictors only) (Haris et al., 2014).
  • Hierarchical standardization: Recommended as a simple preprocessing for any feature selection pipeline (Lasso, stepwise, etc.) where SHC is desirable. This approach is universally compatible with existing pipelines without need for nonconvex optimization or custom penalties (Chen et al., 2020).
  • Bayesian structured-model priors: Prior families such as HIP, HUP, HOP, HLP, and HTP afford direct parameterization of complexity penalization and facilitate principled model averaging and uncertainty quantification. Efficient MCMC schemes permit tractable exploration of exponentially large SHC-constrained spaces even for moderate $p$ (Taylor-Rodriguez et al., 2013).
  • Selection guidance: SHC is most beneficial when the scientific context or data-generating mechanism motivates hierarchical variable inclusion, particularly in high-dimensional regimes with potential for dense interactions.
  • Limitation: In settings where the number of nonzero interaction terms is low or where main effect sparsity dominates, the benefit of SHC may be minimal in terms of minimax error or selection power (Ye et al., 2018).

References:

  • Ye et al., 2018: High-dimensional Adaptive Minimax Sparse Estimation with Interactions
  • Chen et al., 2020: An Easy-to-Implement Hierarchical Standardization for Variable Selection Under Strong Heredity Constraint
  • Haris et al., 2014: Convex Modeling of Interactions with Strong Heredity
  • Taylor-Rodriguez et al., 2013: Bayesian Variable Selection on Model Spaces Constrained by Heredity Conditions
