Strong Heredity Constraint in Regression Models
- The strong heredity constraint is a rule mandating that every higher-order term have all of its parent main effects included, ensuring model structure aligns with scientific reasoning.
- Modeling methodologies such as convex penalty frameworks, hierarchical standardization, and Bayesian priors enforce this constraint and reduce the search space in high-dimensional settings.
- Enforcing the constraint improves interpretability, controls false positives, and maintains prediction accuracy, illustrating its practical benefits in complex regression analyses.
The strong heredity constraint is a structural restriction applied to statistical models, particularly high-dimensional regression with interaction or higher-order polynomial terms. It enforces that no interaction or higher-order term is included in a model unless all of its lower-order "parent" main effects are also included. The constraint is motivated by the principle of marginality, prevalent in experimental design, statistics, and machine learning with structured features, and promotes model interpretability, stability, and alignment with scientific reasoning (Ye et al., 2018; Chen et al., 2020; Haris et al., 2014; Taylor-Rodriguez et al., 2013).
1. Formal Definition and Mathematical Structure
The strong heredity constraint (SHC) mandates that every included higher-order term must have all its strict lower-order parent terms present in the model. In the case of models with covariates $x_1, \dots, x_p$ and up to two-way interactions, let

$$y = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \sum_{j<k} \gamma_{jk} x_j x_k + \varepsilon,$$

with main-effect coefficients $\beta_j$ and interaction coefficients $\gamma_{jk}$. The strong heredity constraint is formalized as

$$\gamma_{jk} \neq 0 \;\Longrightarrow\; \beta_j \neq 0 \ \text{and} \ \beta_k \neq 0,$$

so a nonzero $\gamma_{jk}$ implies both $\beta_j \neq 0$ and $\beta_k \neq 0$ (Ye et al., 2018, Chen et al., 2020, Haris et al., 2014). In higher-order polynomials or general interaction terms, the constraint is: for each term $t$, its inclusion indicator $\delta_t \in \{0, 1\}$ must satisfy $\delta_t \leq \min_{s \in \mathcal{P}(t)} \delta_s$, where $\mathcal{P}(t)$ denotes the set of parent terms of $t$ (Taylor-Rodriguez et al., 2013).
By contrast, the weak heredity constraint (WHC) only requires at least one parent to be present.
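The indicator form of the two constraints can be checked mechanically. A minimal sketch (function names are ours, not from the cited papers), with main effects indexed by integers and interactions by parent pairs:

```python
def satisfies_strong_heredity(main_effects, interactions):
    """Strong heredity: every interaction (j, k) has BOTH parents j and k
    among the selected main effects."""
    main = set(main_effects)
    return all(j in main and k in main for j, k in interactions)

def satisfies_weak_heredity(main_effects, interactions):
    """Weak heredity: every interaction has AT LEAST ONE parent selected."""
    main = set(main_effects)
    return all(j in main or k in main for j, k in interactions)

# (1, 2) has both parents selected -> strong heredity holds
print(satisfies_strong_heredity({1, 2, 3}, {(1, 2)}))   # True
# (1, 4) is missing parent 4 -> strong fails, but weak still holds
print(satisfies_strong_heredity({1, 2, 3}, {(1, 4)}))   # False
print(satisfies_weak_heredity({1, 2, 3}, {(1, 4)}))     # True
```

A selection procedure satisfies SHC exactly when every model it can output passes the first predicate.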
2. Modeling Methodologies Satisfying Strong Heredity
A variety of algorithmic frameworks implement the strong heredity constraint. They can be broadly categorized as follows:
- Penalty-based convex methods: Frameworks such as FAMILY (Haris et al., 2014) encode SHC by using group-structured convex penalties. For a two-way interaction model $y = \beta_0 + \sum_j \beta_j x_j + \sum_{j<k} \theta_{jk} x_j x_k + \varepsilon$, SHC is enforced via penalties on rows/columns of the joint coefficient array $(\beta, \Theta)$, e.g., group $\ell_2$, $\ell_\infty$, or hybrid norms applied over groups corresponding to a main effect plus its interactions. No hard constraints are required; appropriate group penalties (e.g., as in hierNet [Bien et al.]) guarantee that interaction terms cannot be selected unless their parent main effects are nonzero.
- Hierarchical standardization: The hierarchical standardization procedure (Chen et al., 2020) ensures SHC by transforming variables such that any selection method on the transformed space, followed by a prescribed back-transformation, results in coefficients that satisfy the SHC algebraically. Main effects $x_j$ are standardized first, and interaction columns are then formed from the standardized main effects (the exact transform and back-transformation are given in Chen et al., 2020). Any nonzero interaction coefficient on the original scale forces both parent coefficients to be nonzero after back-transformation, provided no $x_j$ is constant.
- Bayesian model space priors: Bayesian variable selection (Taylor-Rodriguez et al., 2013) applies SHC through hierarchical priors on the binary inclusion indicators, with the SHC constraint equivalently encoded as feasibility constraints on the model space: a term can only be included if all its parents are included. Efficient model-search is performed via tailored Metropolis–Hastings samplers that only propose moves within the SHC-admissible subspace.
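The preprocessing step behind hierarchical standardization can be sketched as follows. This is a simplified illustration, not the exact transform of Chen et al. (2020): we z-score the main effects and build interaction columns as products of the standardized mains; the function name and shapes are ours.

```python
import numpy as np

def hierarchical_features(X):
    """Standardize main effects, then form two-way interaction columns
    from the standardized mains (simplified sketch; see Chen et al., 2020
    for the exact hierarchical standardization and back-transformation)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized main effects
    p = Z.shape[1]
    pairs = [(j, k) for j in range(p) for k in range(j + 1, p)]
    inter = np.column_stack([Z[:, j] * Z[:, k] for j, k in pairs])
    return Z, inter, pairs

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Z, inter, pairs = hierarchical_features(X)
print(Z.shape, inter.shape)   # (100, 4) (100, 6)
```

Any off-the-shelf selector (Lasso, stepwise) can then be run on the concatenated `[Z, inter]` design; the heredity guarantee comes from the back-transformation to the original scale.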
3. Statistical Impact: Minimax Rates and Model Complexity
The effect of SHC is most evident in high-dimensional regimes. For a model class with $n$ samples, $p$ predictors, and sparsity levels $s_1$ (main effects) and $s_2$ (interaction terms), the minimax prediction loss under SHC is of order

$$\frac{\sigma^2}{n} \left( s_1 \log\frac{p}{s_1} + s_2 \log\frac{s_1^2}{s_2} \right)$$

(Ye et al., 2018). SHC reduces the effective search space for interactions from the $\binom{p}{2}$ candidate pairs (unconstrained), or the roughly $s_1 p$ pairs touching at least one selected main effect (WHC), to the $\binom{s_1}{2}$ pairs among selected main effects under SHC. The gain is pronounced when $p$ and $s_2$ are large relative to $s_1$; otherwise, the main-effect term may dominate the risk, and the rate is unaffected by heredity. Adaptive estimators such as the ABC estimator achieve the minimax rate simultaneously over all heredity regimes, selecting models by penalized residual sum of squares plus hereditary-complexity penalties.
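A quick arithmetic check makes the search-space reduction concrete. The sparsity values below are illustrative choices of ours, not figures from the cited paper:

```python
from math import comb

p = 1000          # predictors (illustrative)
s1, s2 = 10, 5    # main-effect and interaction sparsity (illustrative)

# Unconstrained: choose s2 interactions from all p(p-1)/2 candidate pairs.
unconstrained = comb(p * (p - 1) // 2, s2)
# SHC: interactions may only pair the s1 selected main effects.
shc = comb(s1 * (s1 - 1) // 2, s2)

print(shc)                   # comb(45, 5) = 1221759
print(unconstrained // shc)  # reduction factor: astronomically large
```

Even at this modest sparsity, the SHC-admissible interaction sets number in the millions while the unconstrained count is astronomical, which is precisely the combinatorial gain reflected in the $\log(s_1^2/s_2)$ versus $\log(p^2/s_2)$ terms of the minimax rate.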
4. Algorithmic and Theoretical Properties
Different families of methods offer distinct computational and theoretical guarantees:
- FAMILY convex program (Haris et al., 2014): Provides a unified ADMM-based algorithm with guaranteed global convergence; all prior convex methods (hierNet, VANISH, all-pairs lasso) are special cases via choices of group penalties. Degrees of freedom can be unbiasedly estimated in this regime; extensions to GLMs and cross-validation-based model selection are straightforward.
- Hierarchical standardization (Chen et al., 2020): Imposes heredity at the preprocessing stage; preserves convexity and is transparent to choice of selection algorithm. Theoretical results guarantee that all selected interaction (child) terms necessarily have their parent main effects nonzero.
- Bayesian SHC priors and MCMC (Taylor-Rodriguez et al., 2013): By assigning hierarchical beta priors (HIP, HUP, HOP, HLP, HTP) conditional on parent inclusion, and designing an MCMC sampler that proposes only SHC-admissible moves (local, intermediate, global), the full posterior model space is efficiently explored. Theorem 1 establishes that the SHC posterior contracts, as $n \to \infty$, onto the closure of the true model under the parental relation, yielding a unique asymptotic model—unlike WHC, for which multiple minimal-size models may attain positive posterior mass.
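The idea of proposing only SHC-admissible moves can be illustrated with a small helper that enumerates the legal single-term additions and deletions from a current model. This is a hypothetical function of ours, in the spirit of the local moves described by Taylor-Rodriguez et al. (2013), not their actual proposal kernel:

```python
from itertools import combinations

def admissible_moves(main, inter, p):
    """List single-term add/delete moves that keep the model SHC-admissible.
    `main` is a set of main-effect indices, `inter` a set of (j, k) pairs."""
    moves = []
    # Adding a main effect is always admissible.
    for j in range(p):
        if j not in main:
            moves.append(("add_main", j))
    # An interaction may be added only if both parents are present.
    for j, k in combinations(sorted(main), 2):
        if (j, k) not in inter:
            moves.append(("add_inter", (j, k)))
    # Deleting an interaction is always admissible.
    for pair in inter:
        moves.append(("del_inter", pair))
    # A main effect may be deleted only if it has no child interactions.
    for j in sorted(main):
        if all(j not in pair for pair in inter):
            moves.append(("del_main", j))
    return moves

# From model {x0, x1, x0*x1}: may add x2 or drop the interaction,
# but neither parent main effect can be removed.
print(admissible_moves({0, 1}, {(0, 1)}, p=3))
# [('add_main', 2), ('del_inter', (0, 1))]
```

Restricting a Metropolis–Hastings sampler to such a move set keeps every visited state inside the SHC-feasible model space by construction.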
5. Practical Implications and Empirical Results
Maintaining strong heredity improves interpretability and often yields better control of false positives in feature selection, at little or no loss in predictive performance:
- Variable selection: Traditional approaches (standardization plus Lasso/stepwise) can violate SHC in 45–75% of selected models; hierarchical standardization or SHIM maintain SHC nearly 100% of the time, as quantified by the "MSH" (maintenance of strong heredity) metric (Chen et al., 2020).
- Sensitivity/specificity tradeoff: Hierarchical standardization typically increases sensitivity by 10–15 percentage points, with a modest specificity decrease, since parents are forced in when their children are selected.
- Prediction accuracy: Empirical analyses show that imposing SHC generally leaves prediction MSE unchanged or marginally improved relative to unconstrained methods, even on real-world high-dimensional datasets (e.g., TCGA glioblastoma), where strong heredity solutions produced biologically plausible gene selections with full SHC compliance (Chen et al., 2020).
- Bayesian findings: In simulation and data analyses, model-space priors under SHC deliver lower false-positive rates and concentrate posterior mass on more parsimonious, scientifically reasonable models; unconstrained or WHC priors yield higher complexity and elevated false-positive rates (Taylor-Rodriguez et al., 2013).
- Complexity penalty: SHC dramatically reduces the number of admissible models relative to WHC and unconstrained selection; for quadratic response-surface model spaces, Taylor-Rodriguez et al. (2013) tabulate exact counts showing the SHC-admissible space is substantially smaller than the WHC space.
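The model-count reduction can be verified by brute force for a tiny example. The space below (main effects plus all two-way interactions, $p = 3$) is an illustration of ours and differs from the quadratic spaces tabulated in the paper:

```python
from itertools import combinations

def count_models(p, heredity):
    """Count admissible models over p main effects and all two-way
    interactions, under heredity rule 'strong', 'weak', or 'none'."""
    mains = list(range(p))
    pairs = list(combinations(mains, 2))
    count = 0
    for r in range(p + 1):
        for m in combinations(mains, r):
            mset = set(m)
            if heredity == "strong":
                eligible = [pr for pr in pairs
                            if pr[0] in mset and pr[1] in mset]
            elif heredity == "weak":
                eligible = [pr for pr in pairs
                            if pr[0] in mset or pr[1] in mset]
            else:
                eligible = pairs
            count += 2 ** len(eligible)  # each eligible pair in or out
    return count

for rule in ("strong", "weak", "none"):
    print(rule, count_models(3, rule))
# strong 18, weak 45, none 64
```

Even at $p = 3$ the SHC space (18 models) is well under a third of the unconstrained space ($2^6 = 64$), and the gap widens rapidly with $p$.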
6. Comparison with Alternative Heredity Structures
| Constraint | Interaction Eligibility | Effective Interaction Space | Asymptotic Model Uniqueness |
|---|---|---|---|
| Strong heredity | Both parents in model | $O(s_1^2)$ pairs | Unique |
| Weak heredity | At least one parent in model | $O(s_1 p)$ pairs | Non-unique |
| None | Unconstrained | $O(p^2)$ pairs | Non-unique |
SHC imposes the strictest hierarchy, maximizing parsimony and interpretational clarity. The practical benefit—reduced model complexity, lower search penalties, and improved control of false positive selection—scales with the number of potential interactions and the complexity of the true model (Ye et al., 2018, Taylor-Rodriguez et al., 2013).
7. Implementation and Recommendations
- Convex group-penalty frameworks (FAMILY, hierNet, glinternet): Empirically effective for large , supported by ADMM algorithms with theoretical guarantees. Select tuning parameters via cross-validation; for unbiased estimation, consider "relaxed" fits (retraining with selected predictors only) (Haris et al., 2014).
- Hierarchical standardization: Recommended as a simple preprocessing for any feature selection pipeline (Lasso, stepwise, etc.) where SHC is desirable. This approach is universally compatible with existing pipelines without need for nonconvex optimization or custom penalties (Chen et al., 2020).
- Bayesian structured-model priors: Prior families such as HIP, HUP, HOP, HLP, and HTP afford direct parameterization of complexity penalization and facilitate principled model averaging and uncertainty quantification. Efficient MCMC schemes permit tractable exploration of exponentially large SHC-constrained spaces even for moderate $p$ (Taylor-Rodriguez et al., 2013).
- Selection guidance: SHC is most beneficial when the scientific context or data-generating mechanism motivates hierarchical variable inclusion, particularly in high-dimensional regimes with potential for dense interactions.
- Limitation: In settings where the number of nonzero interaction terms is low or where main effect sparsity dominates, the benefit of SHC may be minimal in terms of minimax error or selection power (Ye et al., 2018).
References:
- (Ye et al., 2018): High-dimensional Adaptive Minimax Sparse Estimation with Interactions
- (Chen et al., 2020): An Easy-to-Implement Hierarchical Standardization for Variable Selection Under Strong Heredity Constraint
- (Haris et al., 2014): Convex Modeling of Interactions with Strong Heredity
- (Taylor-Rodriguez et al., 2013): Bayesian Variable Selection on Model Spaces Constrained by Heredity Conditions