
Centered Internal IV Estimator for Clustered Data

Updated 20 August 2025
  • The estimator is a novel IV procedure that centers internal instruments to mitigate bias in clustered data and high-dimensional controls.
  • It employs a leave-out mechanism via a tailored weighting matrix to satisfy partialling-out and correct centering conditions for robust inference.
  • It offers practical advantages in settings with feedback, interference, and weak identification by ensuring consistent estimation under mild regularity conditions.

The correctly centered internal IV estimator is an instrumental variables procedure designed for linear regression models with clustered data, high-dimensional controls, and general patterns of exclusion restrictions. The estimator addresses the bias that arises when internal instruments—constructed from existing regressors or their transformations—are not exogenous with respect to all error components, especially under clustering, feedback, or interference. Instead of seeking unbiasedness in the traditional sense, it enforces a weaker but sufficient moment “centering” property on the estimator's numerator, ensuring consistent estimation under mild regularity conditions and facilitating robust inference even in settings with within-cluster dependence and weak identification (Mikusheva et al., 18 Aug 2025).

1. Definition and Motivation

The correctly centered internal IV estimator is defined for models where the regression specification

Y = X\beta + W\gamma + \epsilon

includes endogenous regressors X, high-dimensional controls W, and potentially complex exclusion patterns, encoded by a user-specified indicator matrix \mathcal{E}. The presence of clustering and non-random assignment (e.g., fixed effects, feedback, or spillovers) induces bias in standard OLS or naive partialling-out procedures due to correlation between the endogenous regressors and disturbances within clusters.

The estimator’s defining property is not classical unbiasedness (which may be impossible when denominators are random and endogenous), but “correct centering” of the estimating equation: for any distribution F in a class of plausible data-generating processes,

\mathbb{E}_F[ C_1(x, y) ] = \beta\, \mathbb{E}_F[ C_2(x) ]

where \hat\beta = C_1(x, y) / C_2(x). This ensures that bias attributable to internal instrument mis-centering is controlled in key directions, supporting asymptotic consistency.
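A small simulation makes the centering property concrete. In the classic weak-instrument setup below (a hypothetical data-generating process, not taken from the paper), the numerator moment C_1 = z'y is exactly centered at \beta\,\mathbb{E}[C_2] with C_2 = z'x, even though the ratio C_1/C_2 can be badly behaved in finite samples:

```python
import numpy as np

# Hypothetical weak-IV simulation: x is endogenous (its error v is
# correlated with the structural error e) and the instrument z is weak.
# The centering gap C1 - beta*C2 = z'(y - beta*x) = z'e averages to zero,
# even though the ratio C1/C2 is a biased, heavy-tailed estimator here.
rng = np.random.default_rng(0)
beta, n, reps = 1.0, 10, 20000
gaps = []
for _ in range(reps):
    z = rng.normal(size=n)
    v = rng.normal(size=n)
    e = 0.9 * v + np.sqrt(1 - 0.81) * rng.normal(size=n)  # endogeneity
    x = 0.1 * z + v                                        # weak first stage
    y = beta * x + e
    gaps.append(z @ y - beta * (z @ x))  # C1 - beta * C2
print(np.mean(gaps))  # close to zero: the moment is correctly centered
```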

2. Mathematical Formulation

The estimator takes the form

\hat{\beta}^A = \frac{x' A y}{x' A x}

where x and y are the concatenated design and outcome vectors, and A is an n \times n deterministic matrix, usually determined by two conditions:

  • Partialling-Out Property (POP): A M = A, where M = I_n - W(W'W)^{-1}W' is the projection that residualizes the high-dimensional controls. This guarantees that the estimator residualizes only along restricted directions and does not “over-partial out” when internal instruments cannot be fully separated from controls.
  • Correct Centering Condition (CC): for any pair (i, j) where \mathcal{E}_{ij} = 0 (i.e., no exclusion restriction is imposed for that pair), A_{ij} = 0. This structurally zeros out moment conditions prone to feedback or violations of exogeneity.

The default estimator utilizes the matrix A^* that is closest (in Frobenius norm) to the standard projection M among all matrices satisfying POP and CC: A^* = \arg\min_{A \in \mathcal{A}} \| A - M \|_F, where \mathcal{A} is the set of matrices obeying both POP and CC. This construction delivers a leave-out “internal” IV: for each focal observation i, the i-th row of A^* corresponds to projecting out controls using only observations with which i shares a valid exclusion restriction (Mikusheva et al., 18 Aug 2025).
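The ratio form of the estimator and the two defining conditions are easy to state in code. The sketch below (an illustration, not the paper's implementation) computes \hat\beta^A for a given weighting matrix and checks POP and CC numerically:

```python
import numpy as np

def beta_hat(x, y, A):
    """The ratio estimator beta_hat^A = x'Ay / x'Ax."""
    return (x @ A @ y) / (x @ A @ x)

def satisfies_pop_cc(A, W, E, tol=1e-8):
    """Numerically verify the two defining conditions on A:
    POP: A M = A, with M = I - W (W'W)^+ W' the residual maker of W;
    CC:  A_ij = 0 wherever the exclusion indicator has E_ij = 0."""
    n = W.shape[0]
    M = np.eye(n) - W @ np.linalg.pinv(W.T @ W) @ W.T
    pop = np.allclose(A @ M, A, atol=tol)
    cc = np.all(np.abs(A[E == 0]) <= tol)
    return bool(pop and cc)
```

For instance, when every pair carries an exclusion restriction (E all ones), the standard residual maker M itself satisfies both conditions and \hat\beta^M recovers the slope exactly in a noiseless model.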

3. Leave-Out Interpretation and Computational Strategy

A key property of A^* is its leave-out, sample-splitting structure: for each observation i, define \mathcal{S}_i as the subset of data points with which exclusion restrictions are believed to hold (i.e., for which \mathcal{E}_{ij} = 1). For each i, the matrix block M^{(i)} is the projection that residualizes controls using only \mathcal{S}_i. Thus,

A^*_{ij} = M^{(i)}_{ij}

This guarantees that, for each estimating moment, only information from “safe” pairs is used—minimizing potential bias due to within-cluster feedback, interference, or contamination.
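The row-wise characterization above translates directly into a construction. The sketch below (an illustration of A^*_{ij} = M^{(i)}_{ij}, not the paper's reference implementation) builds the leave-out matrix row by row, assuming each unit belongs to its own safe set (\mathcal{E}_{ii} = 1):

```python
import numpy as np

def leave_out_A(W, E):
    """Row-wise leave-out construction: row i of A* residualizes the
    controls W using only the observations in S_i = {j : E_ij = 1}.
    Assumes E_ii = 1 so that i belongs to its own safe set."""
    n = W.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        S = np.flatnonzero(E[i])                  # safe set S_i
        Ws = W[S]
        # M^(i) = I - Ws (Ws'Ws)^+ Ws', the residual maker on S_i
        Mi = np.eye(len(S)) - Ws @ np.linalg.pinv(Ws.T @ Ws) @ Ws.T
        pos = int(np.flatnonzero(S == i)[0])      # position of i within S_i
        A[i, S] = Mi[pos]                         # copy row i of M^(i)
    return A
```

By construction, every row of A^* annihilates the controls (A^* W = 0) and zeros out all pairs with \mathcal{E}_{ij} = 0, so the CC condition holds structurally.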

Computationally, the structure of \mathcal{E} and potential cluster independence often allow block-diagonal or local computation. For instance, in cluster settings, both the controls and the exclusion matrix can be organized so that within-cluster estimators are computed independently, while across-cluster moments are fully exploited when exogeneity holds.

4. Robustness and Theoretical Properties

The estimator provides robustness to violation of traditional instrument exogeneity within clusters and to feedback/adaptive interventions. By imposing CC only on pairs for which exogeneity is plausible (as indicated in \mathcal{E}), it naturally adjusts to partial identification environments and spatial/temporal interference.

The paper develops a central limit theorem tailored for quadratic forms (such as x' A^* e) under clustered data and high-dimensional controls via a U-statistic (Hoeffding) decomposition. Under regularity conditions (notably, that cluster sizes are moderate relative to the total sample and error moments are bounded), the estimator is asymptotically normal.

For inference, a jackknife variance estimator is proposed that remains valid—possibly conservative—under arbitrary within-cluster dependence and weak identification. The construction also supports Anderson–Rubin (AR) test statistics (robust to weak IV), with AR confidence intervals asymptotically matching those from standard t-statistics whenever identification is strong.
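The flavor of cluster-robust variance estimation can be conveyed with a plain delete-one-cluster jackknife for \hat\beta^A. This generic sketch is not necessarily the exact variance estimator proposed in the paper:

```python
import numpy as np

def jackknife_variance(x, y, A, cluster):
    """Generic delete-one-cluster jackknife for beta_hat^A = x'Ay / x'Ax:
    recompute the estimator leaving each cluster out in turn, then scale
    the spread of the leave-out estimates. A sketch, not the paper's
    exact proposal."""
    labels = np.unique(cluster)
    G = len(labels)
    betas = np.empty(G)
    for g, c in enumerate(labels):
        keep = cluster != c                        # drop cluster c entirely
        Ak = A[np.ix_(keep, keep)]
        betas[g] = (x[keep] @ Ak @ y[keep]) / (x[keep] @ Ak @ x[keep])
    return (G - 1) / G * np.sum((betas - betas.mean()) ** 2)
```

In a noiseless model every leave-one-cluster-out estimate coincides with the full-sample slope, so the jackknife variance collapses to zero, as expected.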

5. Modeling of Exclusion Restrictions and Practical Implications

A distinguishing feature is the explicit modeling of the exclusion restriction structure via the indicator matrix \mathcal{E}, which encodes where the internal IV moment conditions are believed valid. For instance:

  • Across-cluster exclusion is typically enforced by setting \mathcal{E}_{ij} = 1 for all pairs in different clusters; within-cluster pairs with possible feedback can have \mathcal{E}_{ij} = 0.
  • In dynamic panels, lag-lead feedbacks can be accounted for by enforcing CC for contemporaneous pairs and suppressing it in feedback directions.
  • In spatial panels or networks (e.g., fiscal interventions with interference across geographic units), the spatial decay of interference is modeled via a cutoff on the distance d_{ij}, with \mathcal{E}_{ij} = 1 for pairs beyond a threshold.
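The spatial case in the last bullet is simple to encode. The helper below (a hypothetical construction, with illustrative names) builds \mathcal{E} from pairwise distances, keeping only pairs beyond the interference cutoff:

```python
import numpy as np

def exclusion_from_distance(coords, cutoff):
    """Illustrative exclusion indicator for spatial interference:
    E_ij = 1 only when units i and j are farther apart than `cutoff`,
    so moment conditions are imposed only on distant pairs. The diagonal
    is set to 1 so each unit belongs to its own safe set."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    E = (d > cutoff).astype(int)
    np.fill_diagonal(E, 1)
    return E
```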

By allowing the practitioner to encode economic or biological knowledge into \mathcal{E}, the approach generalizes classical dynamic panels and spatial regression estimators.

6. Empirical Illustration and Inference under Weak Identification

The estimator’s practical usage is demonstrated in an empirical application studying a large-scale fiscal intervention in rural Kenya under spatial interference. Villages are clustered, and exogeneity of instrument assignment is assumed only for sufficiently distant pairs (e.g., those separated by more than 2 km). By varying the cutoff distance, the analyst can trace the sensitivity of the estimated direct treatment effect and its confidence interval to the degree of allowable interference. Wider cutoffs increase robustness to spillovers but reduce effective sample size (as measured by the matrix trace \mathrm{trace}(A^*)). Simultaneously, the AR test-based confidence intervals remain valid regardless of instrument strength.
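The cutoff/effective-sample-size trade-off can be sketched on synthetic locations. Under the simplifying assumption of intercept-only controls, row i of A^* just demeans within \mathcal{S}_i, its diagonal entry is 1 - 1/|\mathcal{S}_i|, and \mathrm{trace}(A^*) = \sum_i (1 - 1/|\mathcal{S}_i|); the data and cutoffs below are purely illustrative:

```python
import numpy as np

# Illustrative sensitivity check on synthetic locations: as the
# interference cutoff grows, fewer pairs count as exogenous, each safe
# set S_i shrinks, and the effective sample size trace(A*) falls.
# With intercept-only controls, trace(A*) = sum_i (1 - 1/|S_i|).
rng = np.random.default_rng(1)
coords = rng.uniform(0, 10, size=(30, 2))      # unit locations (km)
d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)

traces = []
for cutoff in [1.0, 2.0, 4.0]:
    E = (d > cutoff).astype(int)
    np.fill_diagonal(E, 1)                      # i stays in its own safe set
    sizes = E.sum(axis=1)                       # |S_i| for each unit
    traces.append(np.sum(1.0 - 1.0 / sizes))
print(traces)  # non-increasing in the cutoff
```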

This sensitivity and robustness provide practitioners with a flexible toolkit for IV estimation in the presence of high-dimensional controls, clustering, and complex exclusion patterns.

7. Theoretical Considerations and Extensions

The estimator fundamentally relaxes the requirement for global unbiasedness, focusing on correct centering relative to the data-generating process and specified exogeneity. The U-structure and leave-out machinery facilitate extension to nonlinear models or higher-order moment conditions, though these remain open directions. The approach is explicitly designed to work under arbitrary within-cluster dependence, potentially unbalanced cluster sizes, and high-dimensional control settings. Classic Nickell bias and failure modes of naive panel IV estimators are directly repaired by the estimator’s centering correction (Mikusheva et al., 18 Aug 2025).

Summary Table: Correctly Centered Internal IV Estimator—Key Features

| Feature | Implementation | Purpose/Benefit |
| --- | --- | --- |
| Weighting matrix A^* | Closest matrix to M (Frobenius norm) satisfying POP and CC | Ensures partialling out and moment centering |
| Exclusion indicator \mathcal{E} | User-encoded exclusion structure | Models where moment restrictions are valid |
| Leave-out structure | Row-wise tailored residualization | Prevents contamination from invalid pairs |
| Cluster/dependence support | Within-cluster blocks computed independently | Handles within-cluster endogeneity |
| Robust variance, AR CI | Jackknife estimator and AR test | Valid inference even under weak identification |

The correctly centered internal IV estimator thus extends the IV methodology to clustered, high-dimensional, and partially identified settings by centering the internal moment functions according to explicit exclusion assumptions, supporting robust and computationally tractable inference in complex empirical environments (Mikusheva et al., 18 Aug 2025).
