Tukey Model in Robust Estimation
- The Tukey model is a robust contamination paradigm that formalizes how adversarial outliers affect statistical estimators via Huber’s additive and total variation frameworks.
- It underpins the assessment of breakdown points in high-dimensional, multivariate settings, guiding evaluation of estimators like the Tukey median.
- Its applications extend to robust mean estimation and algorithm design, with projection methods offering enhanced robustness under additional structural assumptions.
The Tukey model in robust statistics refers to a foundational contamination paradigm that governs the influence of anomalous observations in statistical estimation. It provides the framework for evaluating the robustness of estimators, particularly in the context of high-dimensional and multivariate data, by characterizing their resistance to adversarial data corruption. This model sets the theoretical standard for both estimator construction (such as the Tukey median) and the quantification of their breakdown properties under different adversarial capabilities, notably Huber's additive contamination and stronger total variation (TV) corruptions.
1. Definition: The Classical Tukey Contamination Model
The core of the Tukey model is its formalization of adversarial contamination. For a distribution $p^*$ on $\mathbb{R}^d$, the contaminated model replaces a fraction $\epsilon \in [0, 1)$ of the distribution with arbitrary noise:

$$p = (1 - \epsilon)\, p^* + \epsilon\, q.$$

This is also known as Huber's additive contamination model. Here, the adversary introduces outliers by drawing a fraction $\epsilon$ of the samples from an unrestricted distribution $q$. The total variation (TV) contamination model, which generalizes the above, allows the adversary both to add and to remove probability mass:

$$\mathrm{TV}(p, p^*) \leq \epsilon,$$

where $\mathrm{TV}$ denotes the total variation distance. Every additive contamination of level $\epsilon$ satisfies this constraint, so the TV adversary is at least as powerful; it is strictly more powerful because it can also delete mass.
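To make the additive model concrete, the following is a minimal sketch of sampling from a Huber-contaminated distribution. The clean distribution (a standard Gaussian), the outlier distribution (a point mass), and the function name `sample_huber_contaminated` are illustrative assumptions, not part of the source.

```python
import numpy as np

def sample_huber_contaminated(n, d, eps, outlier_location, rng):
    """Draw n points from (1 - eps) * N(0, I_d) + eps * (point mass at outlier_location).

    The Gaussian clean distribution and the point-mass outlier distribution
    are assumptions chosen for illustration only.
    """
    samples = rng.standard_normal((n, d))      # clean draws
    is_outlier = rng.random(n) < eps           # each draw is an outlier with probability eps
    samples[is_outlier] = outlier_location     # broadcast the outlier location over flagged rows
    return samples, is_outlier

rng = np.random.default_rng(0)
x, mask = sample_huber_contaminated(
    n=2000, d=2, eps=0.1, outlier_location=np.array([50.0, 50.0]), rng=rng
)
print("fraction of outliers:", mask.mean())
print("sample mean:", x.mean(axis=0))
print("coordinatewise median:", np.median(x, axis=0))
```

In this setup the sample mean is pulled toward the outlier location roughly in proportion to the contamination fraction, while the coordinatewise median moves only slightly.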
2. Breakdown Point: Formalism and Consequences
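The central robustness criterion in the Tukey model is the breakdown point of an estimator, defined as the largest contamination fraction below which the estimator remains bounded regardless of the specific realization of corruption:

$$\epsilon^*(\hat{\theta}, p^*) = \sup\left\{ \epsilon \geq 0 : \sup_{p \in \mathcal{C}_\epsilon(p^*)} \lVert \hat{\theta}(p) \rVert < \infty \right\},$$

where $\mathcal{C}_\epsilon(p^*)$ represents the contamination class (additive or TV) of level $\epsilon$ around the clean distribution $p^*$, and $\hat{\theta}$ denotes the estimator (e.g., Tukey median). The inner supremum is taken over all adversarial corruptions allowed by the contamination model, and for a family of distributions $\mathcal{G}$, robustness is assessed as the infimum of $\epsilon^*(\hat{\theta}, p^*)$ over all $p^* \in \mathcal{G}$.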
Breakdown points encapsulate the maximal tolerable contamination for a given estimator. This paradigm supplies the decision rules that justify or refute specific methods in robust statistics and underlies the information-theoretic boundaries for robust estimation.
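A small numerical probe can illustrate what "remaining bounded" means. The sketch below replaces a fraction of one-dimensional samples with a single far-away value and records how far an estimator moves as that value grows. It tests only one particular adversary (a point mass), so it can show that an estimator breaks but cannot certify a breakdown point; the helper name `worst_case_shift` is hypothetical.

```python
import numpy as np

def worst_case_shift(estimator, clean, eps, magnitudes):
    """Largest |estimate - clean estimate| when floor(eps * n) points are
    replaced by a far-away value m, over the supplied magnitudes m.

    This probes a single point-mass adversary only (an assumption made for
    illustration), not the full contamination class.
    """
    n = len(clean)
    k = int(np.floor(eps * n))
    base = estimator(clean)
    shifts = []
    for m in magnitudes:
        corrupted = clean.copy()
        corrupted[:k] = m                      # replace the first k samples
        shifts.append(abs(estimator(corrupted) - base))
    return max(shifts)

rng = np.random.default_rng(1)
clean = rng.standard_normal(1001)
magnitudes = [1e2, 1e4, 1e6]

for eps in (0.4, 0.6):
    print(
        f"eps={eps}: mean shift={worst_case_shift(np.mean, clean, eps, magnitudes):.3g}, "
        f"median shift={worst_case_shift(np.median, clean, eps, magnitudes):.3g}"
    )
```

Under this probe the mean's shift grows with the outlier magnitude at either contamination level, whereas the univariate median stays bounded at 40% replacement and follows the outliers once more than half the points are replaced, consistent with a univariate breakdown point of $1/2$.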
3. Tukey Median: Definition and High-Dimensional Behavior
3.1 Tukey Median via Tukey Depth
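The Tukey median generalizes the univariate median to multivariate settings via halfspace (Tukey) depth. For a point $\mu \in \mathbb{R}^d$ and a distribution $p$, the depth is

$$D(\mu, p) = \inf_{v \in \mathbb{R}^d,\, v \neq 0} \Pr_{X \sim p}\left[ v^\top (X - \mu) \geq 0 \right].$$

The Tukey median is the maximizer (or set of maximizers) of depth, $\arg\max_{\mu} D(\mu, p)$: the points most centrally situated with respect to the underlying distribution.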
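The halfspace depth of a point is the smallest fraction of mass in any closed halfspace whose boundary passes through that point. Computing it exactly is expensive in high dimensions, so the sketch below approximates it by minimizing over randomly sampled directions and restricts the candidate medians to the data points themselves. Both simplifications, and the function names, are assumptions for illustration; the random-direction minimum is an upper bound on the true depth.

```python
import numpy as np

def approx_tukey_depth(point, data, directions):
    """Approximate halfspace depth of `point` in the empirical distribution of `data`.

    Takes the minimum, over the supplied unit directions v, of the fraction of
    samples with v . (x - point) >= 0. Using finitely many directions gives an
    upper bound on the exact depth.
    """
    projections = (data - point) @ directions.T        # shape (n_samples, n_directions)
    mass_per_halfspace = (projections >= 0).mean(axis=0)
    return mass_per_halfspace.min()

def approx_tukey_median(data, n_directions=200, rng=None):
    """Return the data point of largest approximate depth, and that depth."""
    rng = np.random.default_rng() if rng is None else rng
    d = data.shape[1]
    directions = rng.standard_normal((n_directions, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    depths = np.array([approx_tukey_depth(x, data, directions) for x in data])
    best = int(np.argmax(depths))
    return data[best], depths[best]

rng = np.random.default_rng(0)
data = rng.standard_normal((300, 2))
data[:30] = [20.0, 20.0]                               # 10% of points replaced by outliers
median, depth = approx_tukey_median(data, rng=np.random.default_rng(1))
print("approximate Tukey median:", median, "depth:", depth)
print("sample mean:", data.mean(axis=0))
```

With a tenth of the points moved far away, the deepest data point stays near the center of the clean cloud while the sample mean is dragged toward the outliers.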
3.2 Breakdown Point Results for Halfspace-Symmetric Distributions
For the class of halfspace-symmetric distributions, the Tukey median's breakdown point under differing contamination models is sharply characterized:

| Corruption Model | Estimator | $d = 1$ | $d = 2$ | $d \geq 3$ |
|----------------------|---------------------|-------|-------|-----------|
| Additive (Huber's) | Tukey median | $1/2$ | $1/3$ | $1/3$ |
| Total Variation (TV) | Tukey median | $1/2$ | $1/3$ | $1/4$ |
| Total Variation (TV) | Projection alg. | $1/2$ | $1/2$ | $1/2$ |
Under Huber's model, the Tukey median achieves the optimal univariate breakdown point $1/2$, but in dimensions $d \geq 2$ the breakdown point drops to $1/3$. In the stronger TV model, the breakdown point reduces further to $1/4$ in dimensions $d \geq 3$, revealing a severe limitation: the Tukey median tolerates only contamination fractions below one quarter in high dimensions. These results delimit the estimator's robustness frontier for high-dimensional, affine-equivariant estimation.
4. Sample Complexity and Bias of the Tukey Median
For halfspace-symmetric and other well-behaved multivariate distributions (e.g., Gaussian), the maximum bias of the Tukey median in finite samples remains controlled provided there is sufficient measure near the true center. In the contaminated setting with sample size $n$ and contamination level $\epsilon$, the population and finite-sample breakdown points coincide, with a sample size $n$ that grows linearly in the dimension $d$ sufficing to ensure the estimator is as robust as the breakdown analysis predicts.
Explicitly, the bias due to contamination is upper bounded as a function of $\epsilon$ under mass and moment constraints near the center, and the statistical sample complexity is linear in dimension, matching the best possible rates for robust mean estimation given adversarial outliers.
5. Projection Algorithm and Attainment of the Information-Theoretic Limit
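A projection algorithm is introduced to overcome the suboptimality of the Tukey median:

$$\hat{q} = \arg\min_{q \in \mathcal{G}} \widetilde{\mathrm{TV}}_{\mathcal{H}}(q, \hat{p}_n), \qquad \hat{\mu} = \mathbb{E}_{X \sim \hat{q}}[X],$$

where $\mathcal{G}$ is a class of distributions (e.g., halfspace-symmetric distributions with a prescribed tail decay), $\hat{p}_n$ is the observed (potentially contaminated) empirical measure, and $\widetilde{\mathrm{TV}}_{\mathcal{H}}(p, q) = \sup_{v \in \mathbb{R}^d,\, t \in \mathbb{R}} \lvert p(v^\top X \geq t) - q(v^\top X \geq t) \rvert$ is the halfspace distance. The algorithm projects $\hat{p}_n$ onto the desired structured family via the halfspace distance and then reports the mean of the projection.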
Key properties:
- Breakdown point: $1/2$ (the information-theoretic maximum possible for nonparametric mean estimation) under TV contamination, regardless of the dimension $d$.
- Sample complexity: matches the Tukey median.
- Tradeoff: The projection method achieves optimal robustness only with access to additional structural information about the distribution (via the class $\mathcal{G}$), and may lack affine equivariance, unlike the fully nonparametric Tukey median.
The Tukey median thus serves as the benchmark fully nonparametric, affine-equivariant estimator, but projection approaches can surpass its breakdown point if weakly parametric assumptions are accepted.
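The projection idea described in this section can be sketched in one dimension, where the halfspace distance between an empirical measure and a continuous distribution reduces to the Kolmogorov distance between their CDFs. The sketch assumes a structured family of unit-variance Gaussians $N(\mu, 1)$ searched over a grid of candidate means; that family, the grid search, and all function names are hypothetical stand-ins, not the source's construction.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def normal_cdf(x, mu):
    """CDF of N(mu, 1), the assumed structured family for this sketch."""
    return 0.5 * (1.0 + _erf((x - mu) / math.sqrt(2.0)))

def halfspace_distance_1d(sorted_x, mu):
    """Sup over halfspaces {x <= t}, {x >= t} of |empirical mass - N(mu, 1) mass|.

    In one dimension this is the Kolmogorov distance between the empirical
    CDF of sorted_x and the candidate CDF.
    """
    n = len(sorted_x)
    cdf = normal_cdf(sorted_x, mu)
    upper = np.arange(1, n + 1) / n - cdf      # empirical CDF above candidate CDF
    lower = cdf - np.arange(0, n) / n          # candidate CDF above empirical CDF
    return max(upper.max(), lower.max())

def project_and_report_mean(x, mu_grid):
    """Pick the candidate N(mu, 1) closest to the data and report its mean."""
    sorted_x = np.sort(x)
    distances = np.array([halfspace_distance_1d(sorted_x, mu) for mu in mu_grid])
    best = int(np.argmin(distances))
    return mu_grid[best], distances[best]

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
x[:300] = 1e6                                  # replace 30% of the samples by a distant value
mu_hat, dist = project_and_report_mean(x, np.linspace(-5.0, 5.0, 201))
print("projection estimate:", mu_hat, "distance:", dist)
print("sample mean:", x.mean())
```

The minimizer need not be unique: in this example many candidate means attain nearly the same distance, and `np.argmin` simply returns the first. The point of the sketch is that the reported mean stays in a bounded range however far the replaced samples are moved, whereas the sample mean does not.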
6. Practical Implications and Open Problems
6.1 Implications for Robust Mean Estimation
The key lesson is that under strong adversarial contamination (TV distance), the Tukey median does not suffice for optimal robust mean estimation in $\mathbb{R}^d$ once $d \geq 2$, and procedures that exploit more structure are required:
- To achieve robustness up to $1/2$ contamination, projection techniques or estimators leveraging additional structure are necessary.
- Whether an estimator can be both affine-equivariant and attain breakdown point $1/2$ under TV contamination with strong finite-sample guarantees remains unresolved; none of the procedures discussed here achieves all of these properties at once.
6.2 Theoretical Frontiers
- The results establish a sharp threshold for the Tukey median in high dimensions ($d \geq 3$): its breakdown point of $1/4$ under TV contamination cannot be improved for halfspace-symmetric distributions.
- For halfspace-symmetric distributions, this delineates the class for which depth-based methods are maximally robust.
6.3 Summary Table: Breakdown Points
| Model | $d = 1$ | $d = 2$ | $d \geq 3$ |
|---|---|---|---|
| Tukey median (additive) | $1/2$ | $1/3$ | $1/3$ |
| Tukey median (TV) | $1/2$ | $1/3$ | $1/4$ |
| Projection (TV) | $1/2$ | $1/2$ | $1/2$ |
7. Connections to Related Notions and Future Directions
The Tukey model is central to contemporary robust statistics and underpins both theoretical and algorithmic advances in high-dimensional settings:
- Depth-based statistics (e.g., Tukey depth regions, medians) rely on the contamination paradigm for their robustness guarantees.
- Extensions of the model, such as cellwise or independent contaminations (Agostinelli et al., 2014), demand fundamentally new estimators.
- Ongoing research aims to close the gap between affine equivariance, computational feasibility, and optimal adversarial breakdown, particularly under the TV contamination model.
These advances clarify both the power and limitations of classic robust methods rooted in the Tukey model, and they guide the choice of estimators in practical applications requiring stringent contamination resistance.