Centered Internal IV Estimator for Clustered Data
- The estimator is a novel IV procedure that centers internal instruments to mitigate bias in settings with clustered data and high-dimensional controls.
- It employs a leave-out mechanism via a tailored weighting matrix to satisfy partialling-out and correct centering conditions for robust inference.
- It offers practical advantages in settings with feedback, interference, and weak identification by ensuring consistent estimation under mild regularity conditions.
The correctly centered internal IV estimator is an instrumental variables procedure designed for linear regression models with clustered data, high-dimensional controls, and general patterns of exclusion restrictions. The estimator addresses the bias that arises when internal instruments—constructed from existing regressors or their transformations—are not exogenous with respect to all error components, especially under clustering, feedback, or interference. Instead of seeking unbiasedness in the traditional sense, it enforces a weaker but sufficient moment “centering” property on the estimator's numerator, ensuring consistent estimation under mild regularity conditions and facilitating robust inference even in settings with within-cluster dependence and weak identification (Mikusheva et al., 18 Aug 2025).
1. Definition and Motivation
The correctly centered internal IV estimator is defined for models where the regression specification
$$Y = X\beta + W\gamma + e$$
includes endogenous regressors $X$, high-dimensional controls $W$, and potentially complex exclusion patterns, encoded by a user-specified indicator matrix $S$ with entries $S_{ij} \in \{0,1\}$. The presence of clustering and non-random assignment (e.g., fixed effects, feedback, or spillovers) induces bias in standard OLS or naive partialling-out procedures due to correlation between the endogenous regressors and disturbances within clusters.
The estimator’s defining property is not classical unbiasedness (which may be impossible when denominators are random and endogenous), but “correct centering” of the estimating equation: for any distribution $P$ in a class of plausible data-generating processes,
$$\mathbb{E}_P\!\left[X'A(Y - X\beta)\right] = 0,$$
where $X'A(Y - X\beta)$ is the estimator’s numerator evaluated at the true parameter and $A$ is the deterministic weighting matrix defined below. This ensures that bias attributable to internal instrument mis-centering is controlled in key directions, supporting asymptotic consistency.
2. Mathematical Formulation
The estimator takes the form
$$\hat\beta_A = (X'AX)^{-1}X'AY,$$
where $X$ and $Y$ are the concatenated design and outcome vectors, and $A$ is an $n \times n$ deterministic matrix, usually determined by two conditions:
- Partialling-Out Property (POP): $AW = 0$, or equivalently $AM_W = A$, where $M_W$ is the projection that residualizes the high-dimensional controls. This guarantees that the estimator residualizes only along restricted directions and does not “over-partial out” when internal instruments cannot be fully separated from controls.
- Correct Centering Condition (CC): for any pair $(i,j)$ with $S_{ij} = 0$ (i.e., no exclusion restriction is imposed for that pair), $A_{ij} = 0$. This structurally zeros out moment conditions prone to feedback or violation of exogeneity; the short derivation after this list shows how POP and CC combine to deliver correct centering.
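To see why these two requirements suffice, a short sketch in the notation above (assuming the model $Y = X\beta + W\gamma + e$ and that the exclusion restrictions encoded by $S$ hold, i.e. $\mathbb{E}[X_i e_j] = 0$ whenever $S_{ij} = 1$):

```latex
\mathbb{E}_P\!\bigl[X'A(Y - X\beta)\bigr]
  \;=\; \underbrace{\mathbb{E}_P\!\bigl[X'AW\bigr]\gamma}_{=\,0\ \text{since } AW=0\ \text{(POP)}}
  \;+\; \sum_{i,j} A_{ij}\,\mathbb{E}_P\!\bigl[X_i e_j\bigr]
  \;=\; \sum_{i,j:\;S_{ij}=1} A_{ij}\,\underbrace{\mathbb{E}_P\!\bigl[X_i e_j\bigr]}_{=\,0\ \text{by exclusion}}
  \;=\; 0.
```

The restriction of the sum to pairs with $S_{ij} = 1$ uses CC ($A_{ij} = 0$ elsewhere). Unbiasedness of $\hat\beta_A$ itself is not claimed, since the denominator $X'AX$ remains random and endogenous; only the numerator moment is centered.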
The default estimator utilizes the matrix that is closest (in Frobenius norm) to the standard residualizing projection $M_W$ among all matrices satisfying POP and CC:
$$A^{*} = \arg\min_{A \in \mathcal{A}} \;\|A - M_W\|_F,$$
where $\mathcal{A}$ is the set of matrices obeying both POP and CC. This construction delivers a leave-out “internal” IV: for each focal observation $i$, the $i$-th row of $A^{*}$ corresponds to projecting out controls using only the observations with which $i$ shares a valid exclusion restriction (Mikusheva et al., 18 Aug 2025).
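Because CC is entrywise and POP acts row by row, the Frobenius projection decouples across rows: each row of $A^{*}$ is the corresponding row of $M_W$, restricted to the observations allowed by $S$ and residualized on the controls of those observations only. A minimal computational sketch under these assumptions (the function name and implementation are illustrative, not the paper's reference code):

```python
import numpy as np

def centered_weight_matrix(W, S):
    """Frobenius-closest matrix to M_W = I - W (W'W)^+ W' among matrices A with
    A @ W == 0 (POP) and A[i, j] == 0 whenever S[i, j] == 0 (CC).
    Both constraints act row by row, so each row solves a small projection."""
    n = W.shape[0]
    M_W = np.eye(n) - W @ np.linalg.pinv(W.T @ W) @ W.T   # standard residualizing projection
    A = np.zeros((n, n))
    for i in range(n):
        idx = np.flatnonzero(S[i])          # exclusion set N(i) for observation i
        if idx.size == 0:
            continue                        # no valid pairs: row stays zero
        W_i = W[idx]                        # controls of the exclusion-compatible observations only
        P_i = W_i @ np.linalg.pinv(W_i.T @ W_i) @ W_i.T   # projection onto span of W_i
        A[i, idx] = (np.eye(idx.size) - P_i) @ M_W[i, idx]
    return A
```

When $S$ allows every pair, this sketch returns $M_W$ itself; zeroing out more pairs trades precision for robustness to invalid moments.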
3. Leave-Out Interpretation and Computational Strategy
A key property of $A^{*}$ is its leave-out, sample-splitting structure: for each observation $i$, define $N(i) = \{\, j : S_{ij} = 1 \,\}$ as the subset of data points with which exclusion restrictions are believed to hold. For each $i$, the nonzero entries of the $i$-th row of $A^{*}$ are obtained from the projection that residualizes the controls using only the observations in $N(i)$.
This guarantees that, for each estimating moment, only information from “safe” pairs is used—minimizing potential bias due to within-cluster feedback, interference, or contamination.
Computationally, the structure of $S$ and potential cluster independence often allow for block-diagonal or local computation. For instance, in cluster settings, both the controls and the exclusion matrix $S$ can be organized so that within-cluster estimators are computed independently, and across-cluster moments are fully exploited when exogeneity holds.
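Given such a weighting matrix, the point estimate is the ratio of quadratic forms from Section 2; a minimal sketch reusing numpy and the hypothetical helper above:

```python
def centered_iv_estimate(Y, X, A):
    """beta_hat = (X' A X)^{-1} X' A Y for a given centered weighting matrix A."""
    return np.linalg.solve(X.T @ A @ X, X.T @ A @ Y)

# Usage sketch (illustrative dimensions only)
rng = np.random.default_rng(0)
n, p = 200, 5
W = rng.normal(size=(n, p))                 # high-dimensional controls
X = rng.normal(size=(n, 1))                 # endogenous regressor
Y = rng.normal(size=(n, 1))                 # outcome
S = np.ones((n, n), dtype=int)              # allow every pair: A reduces to M_W
beta_hat = centered_iv_estimate(Y, X, centered_weight_matrix(W, S))
```

When $A$ is sparse or block-structured, the products $X'AX$ and $X'AY$ can be accumulated block by block instead of forming the dense $n \times n$ matrix.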
4. Robustness and Theoretical Properties
The estimator provides robustness to violation of traditional instrument exogeneity within clusters and to feedback/adaptive interventions. By using only the pairs for which exogeneity is plausible (as indicated in $S$) in its moment conditions, it naturally adjusts to partial identification environments and spatial/temporal interference.
The paper develops a central limit theorem tailored for quadratic forms (such as the numerator $X'A^{*}e$) under clustered data and high-dimensional controls via a U-statistic (Hoeffding) decomposition. Under regularity conditions (notably, that cluster sizes are moderate relative to the total sample and error moments are bounded), the estimator is asymptotically normal.
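Schematically, in the notation above, the quadratic form separates into within-cluster and cross-cluster contributions, the latter carrying the second-order U-statistic structure that the Hoeffding decomposition organizes (an illustration of the structure, not the paper's exact statement):

```latex
X'A^{*}e \;=\;
  \underbrace{\sum_{g}\ \sum_{i,j \in g} A^{*}_{ij}\,X_i e_j}_{\text{within-cluster terms}}
  \;+\;
  \underbrace{\sum_{g \neq h}\ \sum_{i \in g,\; j \in h} A^{*}_{ij}\,X_i e_j}_{\text{cross-cluster (U-statistic) terms}},
```

where $g$ and $h$ index clusters.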
For inference, a jackknife variance estimator is proposed that remains valid—possibly conservative—under arbitrary within-cluster dependence and weak identification. The construction also supports Anderson–Rubin (AR) test statistics (robust to weak IV), with AR confidence intervals asymptotically matching those from standard t-statistics whenever identification is strong.
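As a concrete illustration of the leave-a-cluster-out idea, the sketch below is a generic delete-one-cluster jackknife built on the hypothetical helpers above; it is not the paper's specific variance formula or AR statistic, only the general principle.

```python
def jackknife_cluster_variance(Y, X, W, S, clusters):
    """Generic delete-one-cluster jackknife: recompute the estimate with each
    cluster removed and use the dispersion of the leave-out estimates as a
    (possibly conservative) variance proxy, one entry per coefficient."""
    clusters = np.asarray(clusters)
    labels = np.unique(clusters)
    G = labels.size
    estimates = []
    for g in labels:
        keep = clusters != g                                    # drop cluster g
        A_g = centered_weight_matrix(W[keep], S[np.ix_(keep, keep)])
        estimates.append(centered_iv_estimate(Y[keep], X[keep], A_g))
    estimates = np.asarray(estimates).reshape(G, -1)
    return (G - 1) / G * np.sum((estimates - estimates.mean(axis=0)) ** 2, axis=0)
```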
5. Modeling of Exclusion Restrictions and Practical Implications
A distinguishing feature is the explicit modeling of the exclusion restriction structure via the indicator matrix $S$, which encodes where the internal IV moment conditions are believed valid. For instance:
- Across-cluster exclusion is typically enforced by setting $S_{ij} = 1$ for all pairs in different clusters; within-cluster pairs with possible feedback can have $S_{ij} = 0$.
- In dynamic panels, lag-lead feedback can be accounted for by retaining the moment conditions for contemporaneous (non-feedback) pairs and setting $S_{ij} = 0$ along feedback directions.
- In spatial panels or networks (e.g., fiscal interventions with interference across geographic units), the spatial decay of interference is modeled via a cutoff on pairwise distance, with $S_{ij} = 1$ only for pairs beyond the threshold.
By allowing the practitioner to encode economic or biological knowledge into $S$, the approach generalizes classical dynamic panel and spatial regression estimators.
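A minimal sketch of how $S$ might be encoded for the cluster-based and distance-based cases above (function names and the planar-distance computation are illustrative assumptions; a real application would likely use geodesic distances):

```python
def exclusion_matrix_clusters(clusters):
    """S[i, j] = 1 for pairs in different clusters, 0 for within-cluster pairs."""
    c = np.asarray(clusters)
    return (c[:, None] != c[None, :]).astype(int)

def exclusion_matrix_spatial(coords_km, cutoff_km):
    """S[i, j] = 1 only for pairs farther apart than the cutoff (planar coordinates in km)."""
    d = np.linalg.norm(coords_km[:, None, :] - coords_km[None, :, :], axis=-1)
    return (d > cutoff_km).astype(int)
```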
6. Empirical Illustration and Inference under Weak Identification
The estimator’s practical usage is demonstrated in an empirical application studying a large-scale fiscal intervention in rural Kenya under spatial interference. Villages are clustered, and exogeneity of instrument assignment is assumed only for sufficiently distant pairs (e.g., those separated by more than 2 km). By varying the cutoff distance, the analyst can trace the sensitivity of the estimated direct treatment effect and its confidence interval to the degree of allowable interference. Wider cutoffs increase robustness to spillovers but reduce effective sample size (as measured by the trace of the weighting matrix $A^{*}$). Simultaneously, the AR test-based confidence intervals remain valid regardless of instrument strength.
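Tracing out this sensitivity analysis amounts to a loop over cutoff values; the sketch below reuses the hypothetical helpers from earlier sections and assumes `coords_km`, `W`, `X`, and `Y` hold the application's data arrays, reporting the trace of the weighting matrix as the effective-sample-size diagnostic:

```python
# Sensitivity of the estimate to the interference cutoff (illustrative values)
for cutoff_km in (2.0, 4.0, 6.0, 8.0):
    S = exclusion_matrix_spatial(coords_km, cutoff_km)     # wider cutoff: fewer valid pairs
    A = centered_weight_matrix(W, S)
    beta_hat = centered_iv_estimate(Y, X, A)
    print(f"cutoff={cutoff_km} km  beta={float(beta_hat.squeeze()):.3f}  effective n={np.trace(A):.1f}")
```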
This sensitivity and robustness provide practitioners with a flexible toolkit for IV estimation in the presence of high-dimensional controls, clustering, and complex exclusion patterns.
7. Theoretical Considerations and Extensions
The estimator fundamentally relaxes the requirement for global unbiasedness, focusing on correct centering relative to the data-generating process and the specified exogeneity structure. The U-statistic structure and leave-out machinery facilitate extension to nonlinear models or higher-order moment conditions, though these remain open directions. The approach is explicitly designed to work under arbitrary within-cluster dependence, potentially unbalanced cluster sizes, and high-dimensional control settings. The classic Nickell bias and related failure modes of naive panel IV estimators are directly corrected by the estimator's centering adjustment (Mikusheva et al., 18 Aug 2025).
Summary Table: Correctly Centered Internal IV Estimator—Key Features
| Feature | Implementation | Purpose/Benefit |
|---|---|---|
| Weighting matrix $A^{*}$ | Closest matrix to $M_W$ (Frobenius norm) satisfying POP and CC | Ensures partialling out and moment centering |
| Exclusion indicator $S$ | User-encoded exclusion structure | Models where moment restrictions are valid |
| Leave-out structure | Row-wise tailored residualization over $N(i)$ | Prevents contamination from invalid pairs |
| Cluster/dependence support | Within-cluster blocks computed independently | Handles within-cluster endogeneity |
| Robust variance, AR CI | Jackknife variance estimator and AR test | Valid inference even under weak identification |
The correctly centered internal IV estimator thus extends the IV methodology to clustered, high-dimensional, and partially identified settings by centering the internal moment functions according to explicit exclusion assumptions, supporting robust and computationally tractable inference in complex empirical environments (Mikusheva et al., 18 Aug 2025).