Generalized Incomplete Contingency Tables (GICTs)
- GICTs are multidimensional categorical data arrays that include entire rows of zeros due to random missingness or design constraints.
- The framework uses Poisson/multinomial sampling and log-linear parametrizations to model complex interactions within observed data.
- Sharp nonparametric bounding techniques enable robust causal inference even when standard methods fail due to missing cells.
Generalized Incomplete Contingency Tables (GICTs) are multidimensional arrays of categorical data that allow for entire rows of zero counts within the empirical support—either due to random missingness, sampling zeros, or structural design limitations. These tables arise naturally in settings where certain combinations of variables are completely unobserved for random or unpredictable reasons, as opposed to pre-specified structural exclusions. The present entry summarizes the defining properties, mathematical frameworks, inference strategies, and practical causal analysis methodologies associated with GICTs, integrating recent nonparametric bounding procedures and model-theoretic perspectives.
1. Definition and Conceptual Structure
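Let $X_1, \dots, X_k$ be categorical variables with supports $\mathcal{X}_1, \dots, \mathcal{X}_k$ and $Y$ an outcome with support $\mathcal{Y}$. A classical contingency table consists of cell counts
$$n_{x,y}$$
for $x \in \mathcal{X}_1 \times \cdots \times \mathcal{X}_k$ and $y \in \mathcal{Y}$, with total sample size $N = \sum_{x,y} n_{x,y}$. A table is a GICT if there exists at least one empirical-support row $x^*$ such that
$$n_{x^*,y} = 0 \quad \text{for all } y \in \mathcal{Y}.$$
These are sampling zeros entirely within the observed domain, not design-induced missingness.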
GICTs generalize classical incomplete tables by treating such random zeros as primary and by explicitly modeling the absence of information in affected rows. The distinction is crucial: in GICTs, the presence of entire rows of zeros prevents definition or direct estimation of certain conditional probabilities or marginal effects, necessitating alternative inferential strategies.
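The defining condition can be checked mechanically. The following is a minimal sketch, assuming the table is stored as a dictionary keyed by (covariate configuration, outcome); the function name `zero_rows` and the counts are hypothetical and chosen only for illustration.

```python
from itertools import product

def zero_rows(counts, x_support, y_support):
    """Return covariate configurations x whose counts are zero for every outcome y."""
    return [x for x in x_support
            if all(counts.get((x, y), 0) == 0 for y in y_support)]

# Hypothetical table: two binary covariates and a binary outcome.
x_support = list(product([0, 1], repeat=2))
y_support = [0, 1]
counts = {
    ((0, 0), 0): 12, ((0, 0), 1): 8,
    ((0, 1), 0): 5,  ((0, 1), 1): 0,   # one sampling zero, but the row is still observed
    ((1, 0), 0): 0,  ((1, 0), 1): 0,   # entire row is zero
    ((1, 1), 0): 7,  ((1, 1), 1): 9,
}

missing = zero_rows(counts, x_support, y_support)
is_gict = len(missing) > 0
```

Note that the row `(0, 1)` contains a single zero cell yet is not flagged: only rows with no observations at all leave the conditional distribution of the outcome undefined.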
2. Model-Theoretic Foundations of GICTs
GICT analysis formalizes incomplete tables by adopting either Poisson or multinomial sampling frameworks. Consider a cell index set $\mathcal{I}$, potentially omitting inaccessible (structurally forbidden) cells but including all empirically observable combinations, including those with random zeros. The random vector of counts $n = (n_i)_{i \in \mathcal{I}}$ is modeled as:
- Poisson: $n_i \sim \mathrm{Poisson}(\lambda_i)$, independent,
- Multinomial: $n \sim \mathrm{Multinomial}(N, p)$.
Multiplicative (“log-linear-type”) models for GICTs are constructed using a model matrix $A$ whose rows correspond to cell subsets, yielding parameterizations
$$\log \delta = A^\top \beta,$$
with either $\delta = \lambda$ or $\delta = p$. The “relational” model framework captures both traditional log-linear and generalized odds-ratio models; the structure extends cleanly to GICTs with random zeros provided $\mathcal{I}$ covers only the empirical domain (Klimova et al., 2011).
MLE existence and uniqueness are governed by whether the observed sufficient statistics fall in the interior of the convex hull of their support. When the overall effect (all-ones row) is absent due to missing rows, the model becomes a curved exponential family, and a mixed mean–canonical parameterization involving both subset sums and non-homogeneous odds ratios is employed.
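To make the multiplicative parameterization of this section concrete, here is a minimal sketch that writes the model as $\log \delta = A^\top \beta$, where each row of $A$ is the 0/1 indicator of a subset of cells. The symbols $A$, $\beta$, $\delta$, the 2x2 independence model, and the numerical values are assumptions made for illustration; they are not taken from the cited work.

```python
import math

# Cells of a 2x2 table in a fixed order.
cells = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Each row of A is the 0/1 indicator of a subset of cells.
A = [
    [1, 1, 1, 1],  # overall effect: every cell
    [0, 0, 1, 1],  # cells where the first variable equals 1
    [0, 1, 0, 1],  # cells where the second variable equals 1
]

def cell_parameters(A, beta):
    """Return delta with log(delta_i) = sum_j A[j][i] * beta[j]."""
    n_cells = len(A[0])
    return [math.exp(sum(A[j][i] * beta[j] for j in range(len(A))))
            for i in range(n_cells)]

# Hypothetical parameter values, read here as Poisson cell means.
beta = [math.log(10.0), math.log(2.0), math.log(3.0)]
delta = cell_parameters(A, beta)

# With only main-effect subsets, the cross-product ratio equals 1 (independence).
odds_ratio = delta[0] * delta[3] / (delta[1] * delta[2])
```

Dropping the all-ones row from `A` would give a model without the overall effect, which is the case the text singles out as requiring separate treatment.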
3. Hierarchical Log-Linear Parametrization and Missing Data Mechanisms
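For GICTs arising from data subject to random or systematic missingness, the hierarchical log-linear parametrization extends to incorporate missingness indicators $R_j$ for each variable $X_j$, with $R_j = 1$ (observed) or $2$ (missing). The resulting model for the augmented table is
$$\log m_{x,r} = \sum_{S \subseteq \{X_1, \dots, X_k, R_1, \dots, R_k\}} \lambda_S(x, r),$$
where $m_{x,r}$ is the expected count in cell $(x, r)$ and each term $\lambda_S$ depends only on the coordinates in $S$, subject to zero-sum constraints. This formulation encapsulates the full joint distribution over observed and missing-data patterns (Ghosh et al., 2016).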
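The augmented table can be assembled directly from raw records. The sketch below assumes each record is a tuple with `None` marking a missing entry and uses the indicator coding from the text (1 for observed, 2 for missing); the function name `augmented_counts` and the sample records are hypothetical.

```python
from collections import Counter

def augmented_counts(records):
    """Tally records into cells keyed by (values, missingness pattern).

    Each record is a tuple of values, with None marking a missing entry.
    The indicator for a variable is 1 if it is observed and 2 if it is missing.
    """
    table = Counter()
    for rec in records:
        pattern = tuple(1 if v is not None else 2 for v in rec)
        table[(rec, pattern)] += 1
    return table

# Hypothetical records on two binary variables.
records = [(0, 1), (0, 1), (1, None), (None, 0), (None, None), (1, None)]
table = augmented_counts(records)

# Count of records in which both variables are observed.
fully_observed = sum(n for (_, r), n in table.items() if r == (1, 1))
```

Cells with pattern `(1, 1)` form the fully observed sub-table, while the other patterns contribute only supplementary margins, which is the information a log-linear model with missingness indicators has to work with.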
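Missing-data mechanisms for each variable $X_j$ can be characterized through the interaction terms $\lambda_{X_k R_j}$ between the variables $X_k$ and the missingness indicator $R_j$ of $X_j$:
- MCAR (missing completely at random): all $\lambda_{X_k R_j} = 0$.
- NMAR (not missing at random): only $\lambda_{X_j R_j} \neq 0$, all other $\lambda_{X_k R_j} = 0$.
- MAR (missing at random): some $\lambda_{X_k R_j} \neq 0$ for $k \neq j$, but $\lambda_{X_j R_j} = 0$.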
Direct, closed-form sensitivity analyses of the missingness mechanism (MAR vs. MCAR/NMAR) are carried out by comparing response and non-response odds intervals derived purely from fully observed and partially observed margins.
4. Inference and Sharp Nonparametric Bounding of Interventional Queries
When entire rows in a GICT are random zeros, causal or probabilistic queries involving those combinations become non-identifiable in the classical sense. The framework developed in (Lodato et al., 7 Nov 2025) introduces a sharp nonparametric bounding approach:
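- Unknown cell probabilities in missing rows are parameterized by free vectors $q_{x^*} = (q_{x^*,y})_{y \in \mathcal{Y}}$ subject to
- Non-negativity: $q_{x^*,y} \ge 0$,
- Normalization: $\sum_{y \in \mathcal{Y}} q_{x^*,y} = 1$ for each missing-row index $x^*$.
Given a symbolic expression $Q(q)$ for the query of interest (such as $P(Y = y \mid do(X = x))$ or ATE), under these constraints, the lower and upper sharp bounds are
$$Q_{\min} = \min_{q} Q(q), \qquad Q_{\max} = \max_{q} Q(q),$$
where optimization is performed over the feasible set determined by the probability simplex for each missing row.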
In practical scenarios where missing rows are known to have small total frequency compared to $N$, the expressions often reduce to a linear (or ratio-of-linear) function of the free vectors $q$, and standard linear programming (or fractional programming after Charnes–Cooper transformation) produces the bounds efficiently. These bounds are mechanism-independent, requiring only support and basic probability axioms, and provide formal quantification of inferential uncertainty in the presence of GICTs.
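For small problems the optimization can be sketched without an LP solver. The example below swaps in plain vertex enumeration: a linear function, or a ratio of linear functions whose denominator keeps a constant sign, attains its minimum and maximum over a product of simplices at a vertex, so evaluating the query at every combination of unit vectors is enough. The function names, the query, and the numbers (`observed_part`, `weights`) are hypothetical and only illustrate the mechanics.

```python
from itertools import product

def simplex_vertices(dim):
    """Unit vectors of length dim: the vertices of the probability simplex."""
    return [tuple(1.0 if i == j else 0.0 for i in range(dim)) for j in range(dim)]

def sharp_bounds(query, n_missing_rows, n_outcomes):
    """Min and max of query(q) over a product of simplices, by vertex enumeration.

    Valid when query is linear, or linear-fractional with a denominator of
    constant sign, because such functions attain their extrema at vertices.
    """
    vertices = simplex_vertices(n_outcomes)
    values = [query(q) for q in product(vertices, repeat=n_missing_rows)]
    return min(values), max(values)

# Hypothetical linear query: a known observed part plus, for each missing row,
# an assumed weight times the free probability of outcome 1 in that row.
observed_part = 0.40
weights = [0.10, 0.05]

def query(q):
    return observed_part + sum(w * q_r[1] for w, q_r in zip(weights, q))

lower, upper = sharp_bounds(query, n_missing_rows=2, n_outcomes=2)
```

Enumeration visits `n_outcomes ** n_missing_rows` vertices, so it is only practical for a handful of missing rows; beyond that, a linear-programming formulation as described in the text scales better.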
5. Application to Causal Inference: Worked Example
To illustrate, consider a binary setting:
- $X$ (treatment), $Z$ (covariate), $Y$ (outcome), each in $\{0, 1\}$.
- Causal graph: , .
- The observed GICT for features two missing rows: and , both all-zero.
Empirical counts:
- , for and ,
- , for and ,
- .
The target, , is minimized and maximized over , producing . Even with two corners of the table unobserved, this bounds the average treatment effect under minimal, nonparametric assumptions.
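A computation of this kind can be sketched in a few lines. The counts below are hypothetical and are not the ones from the example above; the sketch also assumes that the covariate is the only confounder, so that the ATE is given by backdoor adjustment over the covariate, and that the covariate distribution is estimated by its empirical frequencies. For a binary outcome, the free vector of each missing row reduces to a single free value of $P(Y = 1 \mid x, z)$ in $[0, 1]$.

```python
from itertools import product

# Hypothetical counts n[(x, z, y)]; rows (x=0, z=1) and (x=1, z=0) are all-zero.
counts = {
    (0, 0, 0): 30, (0, 0, 1): 10,
    (0, 1, 0): 0,  (0, 1, 1): 0,
    (1, 0, 0): 0,  (1, 0, 1): 0,
    (1, 1, 0): 20, (1, 1, 1): 40,
}
total = sum(counts.values())

# Empirical distribution of the covariate z.
p_z = {z: sum(n for (x, zz, y), n in counts.items() if zz == z) / total
       for z in (0, 1)}

def row_total(x, z):
    return counts[(x, z, 0)] + counts[(x, z, 1)]

missing_rows = [(x, z) for x, z in product((0, 1), repeat=2) if row_total(x, z) == 0]

def ate(free):
    """Backdoor-adjusted ATE; free maps each missing row to a value of P(Y=1 | x, z)."""
    def p_y1(x, z):
        if (x, z) in free:
            return free[(x, z)]
        return counts[(x, z, 1)] / row_total(x, z)
    return sum(p_z[z] * (p_y1(1, z) - p_y1(0, z)) for z in (0, 1))

# The ATE is linear in the free values, so its extrema lie at 0/1 corners.
values = [ate(dict(zip(missing_rows, corner)))
          for corner in product((0.0, 1.0), repeat=len(missing_rows))]
ate_lower, ate_upper = min(values), max(values)
```

With these hypothetical counts the interval is approximately $[-0.3, 0.7]$: its width equals the total covariate mass attached to strata with a missing treatment arm, which here is the whole sample.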
6. Implications, Assumptions, and Limitations
- GICT bounds are conservative, often wide, but always contain the true value under the assumptions:
- All random zeros must be internal to the empirical support;
- The small-missing-frequency approximation yields sharper and algebraically simpler bounds;
- No assumptions are made about the missing-data mechanism (MCAR, MAR, NMAR are all accommodated without modeling).
- The approach does not impute missing data or discard affected entries, but preserves all uncertainty in the explicit optimization.
- If large portions of the table are missing, or missingness is not negligible relative to $N$, bounds may be less informative.
A plausible implication is that the GICT framework enables disciplined, mechanism-agnostic causal inference in high-dimensional settings with moderate 'random' unobservability, in contrast to traditional methods that require either imputation or strong missing-data assumptions.
7. Connections to Other Methodologies and Extensions
GICTs generalize both traditional incomplete tables and structural-zero models; the relational modeling perspective (Klimova et al., 2011) extends to arbitrary sets of allowed cells and encompasses curved exponential families, odds-ratio models, and canonical parameterizations for structural zeros. Log-linear parametrizations with auxiliary missingness indicators provide a systematic means for both modeling and sensitivity testing of possible missingness mechanisms (Ghosh et al., 2016).
Sensitivity analysis procedures can be applied non-iteratively, using only observed cell and margin counts, for empirical assessment of the plausibility of MAR, MCAR, or NMAR regimes.
This suggests a unifying role for GICTs as a framework for rigorous, mechanism-robust statistical modeling and inference in categorical data analysis—even beyond the original context of contingency tables, extending to applications in epidemiology, social science, and high-dimensional causal inference where empirical supports are often only partially observed.