Non-Separable Two-Way Fixed Effects (NSTW)

Updated 28 July 2025

Non-Separable Two-Way Fixed Effects (NSTW) models are advanced econometric frameworks that accommodate nonlinear and non-additive latent unit and time effects to address unobserved heterogeneity in panel and network data.
Estimation methods include approximate factor techniques using PCA-based decompositions and grouped fixed effects that discretize latent heterogeneity, balancing approximation bias and estimation variance.
NSTW models provide robust identification and inference in complex settings such as matched employer-employee and teacher-student data by leveraging network connectivity and diagnostic tools.

Non-Separable Two-Way Fixed Effects (NSTW) models generalize the classical additive and interactive fixed effects frameworks in panel and network data analysis, enabling researchers to accommodate more general forms of unobserved heterogeneity. This conceptual expansion reflects both recent econometric theory and empirical practice. NSTW models subsume standard two-way fixed effects (TWFE) and interactive fixed effects (IFE) as special, more restrictive cases, admitting much richer (potentially nonlinear and non-additive) relationships between latent unit and time effects. The primary motivation stems from documented failures of separability and homogeneity in applied work, especially in matched employer-employee datasets, student-teacher learning production studies, and multi-dimensional policy evaluation contexts.

1. Model Definition and Network Representation

Canonical two-way fixed effects models (TWFE) are written as

$y = B_1 \mu + B_2 \eta + X\beta + u$

where $\mu$ and $\eta$ are group- or dimension-specific fixed effects (e.g., teacher/firm and student/worker), $B_1$ and $B_2$ are incidence matrices, and $X$ denotes observed covariates. This model is separable: it models unit and time (or other) effects as additive linear terms.

NSTW models, in contrast, allow the unobserved effect to enter as a general, possibly nonlinear and non-additive, function of latent unit and time-specific components,

$c_{it} = h(z_i, f_t)$

where $h$ is an unknown smooth function, $z_i$ are (possibly multidimensional) latent unit characteristics, and $f_t$ are latent time-varying factors (Ditzen et al., 25 Jul 2025). Notably, if $h(z_i, f_t)$ is bilinear (e.g., $z_i'f_t$), the model reduces to the standard IFE case.
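
As a quick check on the nesting claims, the following worked special cases show how additive and interactive effects arise from particular bilinear forms of $h$. The specific choices of $z_i$ and $f_t$ (and the loading symbol $\lambda_i$) are illustrative assumptions written in the notation above, not constructions taken from the cited papers:

$h(z_i, f_t) = z_i'f_t, \quad z_i = (\mu_i, 1)', \quad f_t = (1, \eta_t)' \quad \Rightarrow \quad c_{it} = \mu_i + \eta_t \quad \text{(additive TWFE)}$

$h(z_i, f_t) = z_i'f_t, \quad z_i = \lambda_i \in \mathbb{R}^R, \quad f_t \in \mathbb{R}^R \quad \Rightarrow \quad c_{it} = \lambda_i'f_t \quad \text{(IFE with } R \text{ factors)}$

By contrast, a smooth function such as $h(z_i, f_t) = \exp(z_i f_t)$ admits no exact finite-rank representation of this kind, which is the situation the NSTW framework is meant to cover.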

For bipartite or matched data, as in worker-firm or teacher-student networks, the two-way model can be reframed as a network: each observation represents an "edge" connecting two vertices (e.g., worker $i$ and firm $j$ ); the corresponding regression is

$y = B\alpha + X\beta + u, \quad \alpha_i = \mu_i,~ \alpha_j = -\eta_j$

with $B$ the concatenation of $B_1$ and $-B_2$, embedding the model within a network context where estimation, identification, and inference rely on graph-theoretic properties (Jochmans et al., 2016).
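
The stacked representation can be checked numerically. The sketch below builds $B_1$, $B_2$, and $B = (B_1, -B_2)$ for a small hypothetical worker-firm edge list (the data and variable names are made up for illustration) and confirms that $B\alpha$ with $\alpha = (\mu', -\eta')'$ reproduces $B_1\mu + B_2\eta$. It also shows that $B'B$ is the Laplacian (degree matrix minus adjacency matrix) of the bipartite graph, which is why graph-theoretic properties enter the analysis.

```python
import numpy as np

# Hypothetical toy data: each row is one observation (worker index, firm index).
edges = np.array([[0, 0], [1, 0], [1, 1], [2, 1]])
n_workers, n_firms = 3, 2
m = len(edges)

# Incidence matrices: B1[k, i] = 1 if observation k involves worker i,
# B2[k, j] = 1 if observation k involves firm j.
B1 = np.zeros((m, n_workers))
B2 = np.zeros((m, n_firms))
B1[np.arange(m), edges[:, 0]] = 1.0
B2[np.arange(m), edges[:, 1]] = 1.0

# Stacked network representation: B = (B1, -B2), alpha = (mu, -eta).
B = np.hstack([B1, -B2])
mu = np.array([0.5, -0.2, 1.0])
eta = np.array([0.3, -0.4])
alpha = np.concatenate([mu, -eta])

# Both parameterizations give the same fixed-effect contribution to y.
fit_network = B @ alpha
fit_two_way = B1 @ mu + B2 @ eta

# B'B equals the graph Laplacian: vertex degrees on the diagonal,
# minus edge counts off the diagonal.
laplacian = B.T @ B
degrees = np.diag(laplacian)
```

Because every row of `laplacian` sums to zero, adding a constant to all entries of `alpha` leaves `B @ alpha` unchanged, which is the source of the normalization requirement.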

2. Identification: Separability, Network Structure, and Harmonic Connectivity

NSTW models are identified under less restrictive conditions than TWFE or IFE, leveraging both within-network connectivity and smoothness of $h$. In the network representation, identification depends on the connectedness of the underlying bipartite graph. Specifically:

Only differences such as $\alpha_i - \alpha_j$ (equivalently, sums $\mu_i + \eta_j$) are identified, implying the necessity of normalization constraints (e.g., $d'\alpha = 0$, with $d$ the vector of (weighted) vertex degrees).
The variance of estimator components, $\operatorname{var}(\hat\alpha_i)$, is governed by the local degree $d_i$, the global spectral connectivity $\lambda_2$ (the second-smallest eigenvalue of the normalized Laplacian), and the harmonic mean $h_i$ of the degrees of the neighbors of vertex $i$:

$\operatorname{var}(\hat\alpha_i) \le \frac{\sigma^2}{d_i}\left(1 + \frac{1}{\lambda_2 h_i}\right) - \frac{2\sigma^2}{m}$

with equality in special cases (Jochmans et al., 2016). Strong identification requires $\lambda_2 h_i \to \infty$ (as $n \to \infty$), reflecting high local and global connectivity.
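
To make the connectivity quantities concrete, the following sketch computes $d_i$, $\lambda_2$, $h_i$, and the right-hand side of the variance bound for two small bipartite graphs: a complete bipartite graph (dense) and a path graph (sparse). Several details are assumptions made for illustration: $h_i$ is computed as $d_i / \sum_j A_{ij}/d_j$, $m$ is taken to be the number of edges, $\sigma^2 = 1$, and the helper name `connectivity_measures` is hypothetical. The code only evaluates the bound as stated above; it does not verify it against the exact estimator variance.

```python
import numpy as np

def connectivity_measures(adj, sigma2=1.0):
    """Degrees, lambda_2 of the normalized Laplacian, harmonic neighbor
    degree means, and the variance bound from the text (connected graphs)."""
    adj = np.asarray(adj, dtype=float)
    d = adj.sum(axis=1)                      # vertex degrees
    m = adj.sum() / 2.0                      # number of edges (assumed meaning of m)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    norm_lap = np.eye(len(d)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    lam2 = np.linalg.eigvalsh(norm_lap)[1]   # second-smallest eigenvalue
    h = d / (adj @ (1.0 / d))                # harmonic mean of neighbor degrees
    bound = sigma2 / d * (1.0 + 1.0 / (lam2 * h)) - 2.0 * sigma2 / m
    return lam2, d, h, bound

# Dense case: complete bipartite graph with 5 vertices on each side.
n_side = 5
J = np.ones((n_side, n_side))
Z = np.zeros((n_side, n_side))
dense_adj = np.block([[Z, J], [J, Z]])

# Sparse case: path graph on 10 vertices (bipartite, weakly connected).
n_path = 10
sparse_adj = np.diag(np.ones(n_path - 1), 1) + np.diag(np.ones(n_path - 1), -1)

lam2_dense, d_dense, h_dense, bound_dense = connectivity_measures(dense_adj)
lam2_sparse, d_sparse, h_sparse, bound_sparse = connectivity_measures(sparse_adj)
```

In this toy comparison the dense graph has a much larger $\lambda_2$ than the path graph, so the bound for a typical vertex is far tighter in the dense case even though the two graphs have the same number of vertices.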

For functional NSTW models,

$y_{it} = x_{it}'\beta + h(z_i, f_t) + \epsilon_{it}$

injectivity of first-moment statistics is necessary: variation in observed averages or clustering proxies must span the latent types (Ditzen et al., 25 Jul 2025).

3. Estimation Methods for NSTW

Due to the infinite-dimensional and nonparametric nature of $h(\cdot,\cdot)$, estimation approaches typically employ one of two broad strategies (A and B), complemented by a moment-based approach for causal estimands (C):

A. Approximate Factor Estimation (ILS, PCA-based)

Drawing on the singular value decomposition of $h$ , one expands:

$h(z_i, f_t) = \sum_{r=1}^\infty \sigma_r \phi_r(z_i) \psi_r(f_t)$

and then approximates with $R$ components, fitting

$(\hat\beta, \hat\lambda, \hat f) = \arg\min_{\beta, \lambda, f} \sum_{it} [y_{it} - x_{it}'\beta - \sum_{r=1}^R \lambda_{ir} f_{tr}]^2$

As $R \to \infty$ appropriately with $N, T$, this method can recover the NSTW structure up to an approximation error declining with $R$, at the cost of increased estimation variance (Freeman et al., 2021). The optimal rate of $R$ trades off approximation and estimation error, often yielding $\sqrt{\min(N,T)}$-consistency.
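
The least-squares problem above is commonly solved by alternating between a rank-$R$ principal components fit of $y - x\beta$ and an OLS update of $\beta$. The sketch below is a minimal version of that iteration for a single regressor on a balanced panel; it is not the cited authors' implementation. The function name `ils_factor`, the simulated design with $h(z_i, f_t) = \exp(z_i f_t)$, and the choice $R = 3$ are assumptions for illustration.

```python
import numpy as np

def ils_factor(y, x, R, n_iter=500, tol=1e-10):
    """Alternate a rank-R PCA fit of (y - beta * x) with an OLS update of beta.

    y, x: (N, T) arrays; a single regressor is assumed for simplicity.
    """
    beta = (x * y).sum() / (x * x).sum()           # pooled OLS starting value
    common = np.zeros_like(y)
    for _ in range(n_iter):
        # Best rank-R approximation of the current residual matrix.
        U, s, Vt = np.linalg.svd(y - beta * x, full_matrices=False)
        common = (U[:, :R] * s[:R]) @ Vt[:R]
        # OLS of (y - common) on x, holding the factor fit fixed.
        beta_new = (x * (y - common)).sum() / (x * x).sum()
        converged = abs(beta_new - beta) < tol
        beta = beta_new
        if converged:
            break
    return beta, common

# Simulated non-separable design (illustrative assumption).
rng = np.random.default_rng(0)
N, T, beta_true = 100, 60, 1.0
z = rng.uniform(size=N)
f = rng.uniform(size=T)
h = np.exp(z[:, None] * f[None, :])                # non-separable latent effect
x = h + rng.normal(size=(N, T))                    # regressor correlated with h
y = beta_true * x + h + 0.5 * rng.normal(size=(N, T))

beta_ols = (x * y).sum() / (x * x).sum()           # ignores h, hence biased
beta_ils, _ = ils_factor(y, x, R=3)
```

In this design the pooled OLS estimate absorbs the correlation between $x$ and $h$, while the low-rank fit removes most of it. Raising `R` shrinks the approximation error but lets the factor fit absorb more noise, which is the trade-off described above.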

B. Grouped Fixed Effects (Clustering-based Discretization)

Alternatively, researchers discretize the latent heterogeneity by clustering units and times into $G$ and $C$ clusters, respectively, then estimate

$y_{it} = x_{it}'\beta + c_{g_i, l_t} + \epsilon_{it}$

where $g_i$ and $l_t$ denote the groupings. The two-step grouped fixed effects (TSGF) estimator involves a first-stage clustering (e.g., k-means or hierarchical) followed by a fixed effects regression on group pairs (Ditzen et al., 25 Jul 2025; Freeman et al., 2021). Split-sample implementations may avoid overfitting bias. As $G, C \to \infty$ with sample size, the nonseparable function $h$ can be approximated arbitrarily closely, but an empirical bias-variance tradeoff emerges based on the smoothness of $h$ and the number of groups.
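
The two steps can be sketched as follows, under simplifying assumptions that are not taken from the cited papers: units are clustered on their time averages of $y$ and periods on their cross-sectional averages of $y$ using a small hand-written one-dimensional k-means, there is a single regressor, and no sample splitting is used. The helper names `kmeans_1d` and `tsgf` are hypothetical.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100):
    """Plain Lloyd iterations on scalar data, initialized at evenly spaced quantiles."""
    centers = np.quantile(values, (np.arange(k) + 0.5) / k)
    labels = np.zeros(len(values), dtype=int)
    for _ in range(n_iter):
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        new_centers = np.array([
            values[labels == g].mean() if np.any(labels == g) else centers[g]
            for g in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def tsgf(y, x, G, C):
    """Step 1: cluster units and periods. Step 2: OLS with group-pair effects."""
    g_unit = kmeans_1d(y.mean(axis=1), G)          # unit groups from unit averages
    l_time = kmeans_1d(y.mean(axis=0), C)          # time groups from period averages
    cell = (g_unit[:, None] * C + l_time[None, :]).ravel()
    counts = np.bincount(cell, minlength=G * C)

    def demean(a):
        # Subtract the mean of each (unit group, time group) cell.
        sums = np.bincount(cell, weights=a.ravel(), minlength=G * C)
        return a.ravel() - (sums / np.maximum(counts, 1))[cell]

    x_t, y_t = demean(x), demean(y)
    return (x_t @ y_t) / (x_t @ x_t)

# Simulated non-separable design (illustrative assumption).
rng = np.random.default_rng(0)
N, T, beta_true = 200, 100, 1.0
z = rng.uniform(size=N)
f = rng.uniform(size=T)
h = np.exp(z[:, None] * f[None, :])
x = h + rng.normal(size=(N, T))
y = beta_true * x + h + 0.5 * rng.normal(size=(N, T))

beta_ols = (x * y).sum() / (x * x).sum()
beta_tsgf = tsgf(y, x, G=10, C=10)
```

Clustering on averages works here because the unit and period averages of $y$ are monotone in $z_i$ and $f_t$ in this design, so the averages are informative about the latent types. With a non-monotone $h$, richer clustering moments would be needed.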

Relevant error expansions for these procedures are:

$\hat\beta = \beta + H^{-1}\left(N^{-1}\sum_{i=1}^N s_i\right) + O_p\left(\frac{1}{T} + \frac{1}{N} + \frac{GC}{NT}\right) + O_p\left(G^{-2/K} + C^{-2/K}\right) + o_p\left((NT)^{-1/2}\right)$

where $H$ is the Hessian, $s_i$ are influence components, and $K$ is the dimension of the latent features.

C. Minimal Bridge Function (Moment Equations for Counterfactual Identification)

For identifying average causal effects in NSTW/factor models, bridge functions $h(Y_{\text{pre}}, X; \theta^*)$ are constructed to remove latent confounding via balancing moments:

$\mathbb{E}[Y_0(0) - h(Y_{\text{pre}}, X; \theta^*) | U, A=0, X] = 0$

The minimal bridge function is chosen via regularized GMM to minimize $\| \theta \|_2$ under the identifying restrictions, yielding root-$N$ consistency for average effects even with fixed $T$ (Imbens et al., 2021).

4. Inference and Asymptotic Properties

Achieving valid inference for $\beta$ in NSTW models requires accounting for both the latent structure and possible bias from group discretization or weak factor approximation. When clustering/grouping adequately captures the heterogeneity, the TSGF or TSGF-M estimator is asymptotically normal and $\sqrt{NT}$-consistent.

Diagnostic tools, namely Pesaran-type cross-sectional dependence tests (CD, CDw, CD*) and factor number selection criteria (Eigenvalue Ratio (ER), Growth Rate (GR), GOS), aid in verifying that the estimator has adequately absorbed common latent components (Ditzen et al., 25 Jul 2025). In cluster-based approaches, approximation error rates depend on the group counts and the smoothness of $h$.
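
As an illustration of the residual diagnostics, the sketch below computes the basic Pesaran CD statistic for a balanced panel, $CD = \sqrt{2T / (N(N-1))} \sum_{i<j} \hat\rho_{ij}$, where $\hat\rho_{ij}$ is the sample correlation between the residual series of units $i$ and $j$. Only the basic CD statistic is shown; the weighted and bias-corrected variants (CDw, CD*) are not implemented here. The function name `pesaran_cd` and the simulated residuals are assumptions for illustration.

```python
import numpy as np

def pesaran_cd(resid):
    """Basic CD statistic for a balanced (N, T) matrix of residuals."""
    N, T = resid.shape
    rho = np.corrcoef(resid)                    # N x N pairwise correlations
    upper = rho[np.triu_indices(N, k=1)]        # each pair i < j once
    return np.sqrt(2.0 * T / (N * (N - 1))) * upper.sum()

rng = np.random.default_rng(1)
N, T = 30, 50
independent = rng.normal(size=(N, T))                     # no common component
with_factor = independent + rng.normal(size=T)[None, :]   # unabsorbed common factor

cd_indep = pesaran_cd(independent)
cd_factor = pesaran_cd(with_factor)
```

Under cross-sectional independence the statistic is approximately standard normal, so a large absolute value, as in the second case, suggests that common latent components remain in the residuals.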

Variance formulas for estimators of fixed effect components must account for network connectivity. For network/graph-based NSTW models, the variance of individual fixed effects is dominated by local degree and global connectivity (e.g., the second-smallest eigenvalue $\lambda_2$ of the normalized Laplacian and the harmonic mean $h_i$ of neighbor degrees) (Jochmans et al., 2016).

5. Empirical Illustration and Comparative Findings

The practical impact of NSTW modeling is evident in empirical studies:

In teacher value-added models, sparse bipartite graphs (few teachers per student, little mixing) yield ill-identified effects: the true variance is underestimated by standard approximations by a factor of 2.5, and confidence intervals for teacher effects are greatly over-optimistic when using TWFE-type variance estimates (Jochmans et al., 2016).
Occupational wage decompositions with dense worker-occupation networks exhibit strong global connectivity, ensuring that traditional fixed-effects methods and their variance estimates are nearly accurate.
In the growth–inflation and Feldstein–Horioka puzzles, residual cross-section dependence is eliminated only when high-dimensional IFE or TSGF-M estimators are used; conventional FE estimators leave persistent structure in residuals (Ditzen et al., 25 Jul 2025).
Simulation studies confirm that increasing the number of factors or clusters reduces the approximation bias of $\hat\beta$; yet over-parameterization may elevate variance if clustering is too granular relative to sample size (Freeman et al., 2021).

6. Diagnostics, Specification Testing, and Model Assessment

Validity of the NSTW approach must be empirically verified. Specification tests compare restricted models (TWFE/IFE) to grouped fixed effects estimators:

A bootstrap generalized Hausman test contrasts the restricted estimator with a more flexible GFE estimator, with critical values adjusted to account for approximation biases and the incidental parameter problem, yielding asymptotically correct size and good power (Pigini et al., 2023).
Standard residual cross-section dependence diagnostics (CD, CDw, CD*) and eigenvalue-based methods for factor number estimation are central to practice, ensuring that latent heterogeneity is adequately captured.
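
The eigenvalue-based factor number estimation can be sketched with a simple eigenvalue ratio rule: order the eigenvalues of the panel's second-moment matrix and pick the index $k$ that maximizes the ratio of the $k$-th to the $(k+1)$-th eigenvalue. This is a stripped-down version written for illustration. It omits demeaning and the zero-factor case, and the function name `eigenvalue_ratio`, the default `k_max`, and the simulated two-factor panel are assumptions.

```python
import numpy as np

def eigenvalue_ratio(panel, k_max=8):
    """Return the k in 1..k_max maximizing eig_k / eig_{k+1}, plus all ratios."""
    N, T = panel.shape
    # Eigenvalues of the scaled second-moment matrix, sorted in descending order.
    eig = np.linalg.eigvalsh(panel @ panel.T / (N * T))[::-1]
    ratios = eig[:k_max] / eig[1:k_max + 1]
    return int(np.argmax(ratios)) + 1, ratios

rng = np.random.default_rng(2)
N, T, R_true = 100, 100, 2
loadings = rng.normal(size=(N, R_true))
factors = rng.normal(size=(T, R_true))
panel = loadings @ factors.T + rng.normal(size=(N, T))   # two strong factors plus noise

r_hat, ratios = eigenvalue_ratio(panel)
```

Applied to regression residuals rather than raw data, the same ratios give a rough check on whether strong common components remain after estimation.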

Model selection, which balances overfitting against bias, is guided by these diagnostics, with the clustering counts $(G, C)$ or the number of factors chosen to minimize finite-sample mean squared error given the presumed smoothness of $h$.

7. Implications for Network Data, Dense vs. Sparse Regimes, and Generalizations

NSTW methods clarify that the precision of fixed effect estimates in network data is controlled not simply by the total sample size but by both local and global connectivity:

In dense networks with high degrees and global mixing ($\lambda_2$ bounded away from zero), parametric rates of convergence (e.g., standard errors of order $1/\sqrt{d_i}$ for fixed effect estimates) and normality are achieved.
Sparse networks with weak connectivity face enlarged variances, and the effective sample size for estimating an effect is essentially its local degree.
In practical NSTW estimation, attention must be paid to the network’s structural properties to avoid underestimating uncertainty in fixed effect (e.g., teacher, firm) estimands.

When the NSTW framework is neglected in favor of restrictive TWFE or IFE models, bias and invalid inference are likely, particularly in the presence of nonlinear or interactive omitted variable structures, weak connectivity, or substantial latent group-time interaction heterogeneity. Proper modeling and diagnosis, using grouped fixed effects, high-rank factor approximations, and network analysis, enable robust identification and inference in modern panel and network datasets.