
Tukey Model in Robust Estimation

Updated 28 October 2025
  • The Tukey model is a robust contamination paradigm that formalizes how adversarial outliers affect statistical estimators via Huber’s additive and total variation frameworks.
  • It underpins the assessment of breakdown points in high-dimensional, multivariate settings, guiding evaluation of estimators like the Tukey median.
  • Its applications extend to robust mean estimation and algorithm design, with projection methods offering enhanced robustness under additional structural assumptions.

The Tukey model in robust statistics refers to a foundational contamination paradigm that governs the influence of anomalous observations in statistical estimation. It provides the framework for evaluating the robustness of estimators, particularly in the context of high-dimensional and multivariate data, by characterizing their resistance to adversarial data corruption. This model sets the theoretical standard for both estimator construction (such as the Tukey median) and the quantification of their breakdown properties under different adversarial capabilities, notably Huber's additive contamination and stronger total variation (TV) corruptions.

1. Definition: The Classical Tukey Contamination Model

The core of the Tukey model is its formalization of adversarial contamination. For a distribution $p^*$ on $\mathbb{R}^d$, the contaminated model replaces a fraction $\epsilon$ of the distribution with arbitrary noise:

$$p = (1 - \epsilon)\, p^* + \epsilon\, r, \qquad r \text{ arbitrary}$$

This is also known as Huber's additive contamination model. Here, the adversary introduces outliers by mixing in a fraction $\epsilon$ of observations drawn from an unrestricted distribution $r$. The total variation (TV) contamination model, which generalizes the above, allows the adversary both to add and remove probability mass:

$$TV(p, p^*) \leq \epsilon$$

where $TV(\cdot, \cdot)$ denotes the total variation distance. Every additive contamination satisfies this constraint, since $TV\big((1-\epsilon)p^* + \epsilon r,\, p^*\big) = \epsilon\, TV(r, p^*) \leq \epsilon$, but not every TV corruption is additive. The TV adversary is therefore strictly more powerful than the additive one.
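The relationship between the two contamination classes can be checked on a small discrete example. The sketch below is illustrative only: the four-atom distributions and the helper names `mix`, `tv_distance`, and `is_additive` are assumptions made for this example, not part of the source.

```python
def mix(p_star, r, eps):
    """Huber additive contamination: (1 - eps) * p_star + eps * r."""
    return [(1 - eps) * a + eps * b for a, b in zip(p_star, r)]


def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def is_additive(q, p_star, eps, tol=1e-12):
    """q lies in the additive class at level eps iff q >= (1 - eps) * p_star pointwise."""
    return all(b >= (1 - eps) * a - tol for a, b in zip(p_star, q))


eps = 0.2
p_star = [0.25, 0.5, 0.25, 0.0]   # clean distribution on four atoms (made up)
r = [0.0, 0.0, 0.0, 1.0]          # adversarial noise placed on a new atom

# Additive corruption: mass is only added (after rescaling the clean part).
p_add = mix(p_star, r, eps)

# TV corruption: mass is removed from the second atom and moved to the fourth.
p_tv = [0.25, 0.3, 0.25, 0.2]

tv_add = tv_distance(p_add, p_star)
tv_tv = tv_distance(p_tv, p_star)
```

Both corrupted distributions are within TV distance $\epsilon$ of the clean one, but only `p_add` passes the `is_additive` check. This is the sense in which the TV class strictly contains the additive class.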

2. Breakdown Point: Formalism and Consequences

The central robustness criterion in the Tukey model is the breakdown point of an estimator: the smallest contamination fraction $\epsilon$ at which the adversary can move the estimate arbitrarily far from its uncontaminated value. It is defined through the maximum bias $b$:

$$b(p^*, \epsilon) = \sup_{p \in \mathcal{C}(p^*, \epsilon),\, x \in T(p),\, y \in T(p^*)} \|x - y\|$$

$$\epsilon^*(p^*) = \inf \{\, \epsilon : b(p^*, \epsilon) = \infty \,\}$$

where $\mathcal{C}(p^*, \epsilon)$ represents the contamination class (additive or TV) at level $\epsilon$, and $T(\cdot)$ denotes the estimator (e.g., the Tukey median). The supremum is taken over all corruptions permitted by the contamination model, and for a family of distributions $G$, robustness is assessed as the infimum of $\epsilon^*(p^*)$ over all $p^* \in G$.
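A univariate illustration of breakdown, as a minimal sketch: the clean sample, the outlier counts, and the helper names `contaminate` and `bias` are assumptions chosen for the example. Outliers are appended to the sample, which mimics additive contamination with fraction $k/(n+k)$ for $k$ outliers and $n$ clean points.

```python
import statistics

# 101 clean points, symmetric about 0 (an arbitrary illustrative sample).
clean = [i / 10 for i in range(-50, 51)]


def contaminate(sample, n_outliers, magnitude):
    """Append n_outliers copies of `magnitude` to the sample."""
    return list(sample) + [magnitude] * n_outliers


def bias(estimator, n_outliers, magnitude):
    """Distance between the estimate on contaminated and on clean data."""
    corrupted = contaminate(clean, n_outliers, magnitude)
    return abs(estimator(corrupted) - estimator(clean))


for magnitude in (1e3, 1e6, 1e9):
    # 30 outliers out of 131 points: contamination fraction about 0.23.
    mean_bias = bias(statistics.mean, 30, magnitude)
    median_bias = bias(statistics.median, 30, magnitude)
    # 102 outliers out of 203 points: contamination fraction just above 1/2.
    median_bias_broken = bias(statistics.median, 102, magnitude)
    print(magnitude, mean_bias, median_bias, median_bias_broken)
```

With 30 outliers the bias of the mean grows in proportion to the outlier magnitude, while the bias of the median stays fixed however far the outliers are placed. Once the outliers form a majority, the median follows them as well, consistent with a univariate breakdown point of $1/2$.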

Breakdown points encapsulate the maximal tolerable contamination for a given estimator. This paradigm supplies the decision rules that justify or refute specific methods in robust statistics and underlies the information-theoretic boundaries for robust estimation.

3. Tukey Median: Definition and High-Dimensional Behavior

3.1 Tukey Median via Tukey Depth

The Tukey median generalizes the univariate median to multivariate settings via halfspace (Tukey) depth:

$$D_{\mathsf{Tukey}}(\mu, p) = \inf_{v \in \mathbb{R}^d} p\big( v^\top (X-\mu) \geq 0 \big)$$

$$T(p) = \arg\max_{\mu \in \mathbb{R}^d} D_{\mathsf{Tukey}}(\mu, p)$$

The Tukey median is the set of maximizers of depth, that is, the points most centrally situated with respect to the underlying distribution.
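For an empirical distribution in the plane, halfspace depth can be approximated by scanning a finite grid of directions. The sketch below is a brute-force illustration under that assumption: the function name `tukey_depth_2d`, the direction grid, and the eight-point sample are hypothetical. Because only finitely many directions are scanned, the returned value is an upper bound on the exact depth.

```python
import math


def tukey_depth_2d(mu, points, n_dirs=360):
    """Approximate empirical halfspace depth of mu.

    Returns the minimum, over n_dirs evenly spaced unit directions v,
    of the fraction of points x with v . (x - mu) >= 0.
    """
    n = len(points)
    depth = 1.0
    for j in range(n_dirs):
        theta = 2 * math.pi * j / n_dirs
        vx, vy = math.cos(theta), math.sin(theta)
        count = sum(
            1 for (x, y) in points
            if vx * (x - mu[0]) + vy * (y - mu[1]) >= 0
        )
        depth = min(depth, count / n)
    return depth


# Eight points evenly spaced on the unit circle, offset by 22.5 degrees so
# that no scanned direction is exactly perpendicular to a data point.
pts = [
    (math.cos(math.radians(22.5 + 45 * k)), math.sin(math.radians(22.5 + 45 * k)))
    for k in range(8)
]

depth_center = tukey_depth_2d((0.0, 0.0), pts)   # center of symmetry
depth_far = tukey_depth_2d((10.0, 0.0), pts)     # point outside the data cloud
```

The center of symmetry attains depth $1/2$, while a point outside the convex hull of the data has depth $0$. A Tukey median would be found by maximizing this depth over candidate locations `mu`.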

3.2 Breakdown Point Results for Halfspace-Symmetric Distributions

For the class of halfspace-symmetric distributions, the Tukey median's breakdown point under differing contamination models is sharply characterized:

| Corruption Model     | Estimator       | $d=1$ | $d=2$ | $d\geq 3$ |
|----------------------|-----------------|-------|-------|-----------|
| Additive (Huber's)   | Tukey median    | $1/2$ | $1/3$ | $1/3$     |
| Total Variation (TV) | Tukey median    | $1/2$ | $1/3$ | $1/4$     |
| Total Variation (TV) | Projection alg. | $1/2$ | $1/2$ | $1/2$     |

Under Huber's model, the Tukey median achieves the optimal univariate breakdown point $1/2$, but in dimensions $d\geq 2$ the breakdown point drops to $1/3$. In the stronger TV model, it falls further to $1/4$ in dimensions $d\geq 3$, revealing a severe limitation: in high dimensions the Tukey median tolerates a contamination fraction of only one quarter. These results delimit the estimator's robustness frontier for high-dimensional, affine-equivariant estimation.

4. Sample Complexity and Bias of the Tukey Median

For halfspace-symmetric and other well-behaved multivariate distributions (e.g., Gaussian), the maximum bias of the Tukey median in finite samples remains controlled provided there is sufficient measure near the true center. In the contaminated setting with sample size $n$ and contamination level $\epsilon$, both the population and finite-sample breakdown coincide, with $O(d/\epsilon^2)$ samples sufficing to ensure the estimator is as robust as the breakdown analysis predicts.

Explicitly, the bias due to contamination is upper bounded as a function of $\epsilon$ under mass and moment constraints near the center, and the statistical sample complexity is linear in dimension, matching the best possible rates for robust mean estimation given adversarial outliers.

5. Projection Algorithm and Attainment of the Information-Theoretic Limit

A projection algorithm is introduced to overcome the suboptimality of the Tukey median:

$$\hat{\mu}(p) = \mathbb{E}_q[X], \quad q = \arg\min_{q \in G(h)} \widetilde{TV}_{\mathcal{H}}(q, p)$$

$$\widetilde{TV}_{\mathcal{H}}(p, q) = \sup_{v\in \mathbb{R}^d,\ t\in \mathbb{R}}\, \big| p( v^\top X \geq t ) - q( v^\top X \geq t ) \big|$$

where $\widetilde{TV}_{\mathcal{H}}$ denotes the halfspace distance and $G(h)$ is a class of distributions, e.g., halfspace-symmetric with decay $h$. This algorithm projects the observed (potentially contaminated) empirical measure onto the desired structured family via the halfspace distance and then reports the mean of the projection.
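The halfspace distance between two empirical samples can be approximated in the plane by scanning directions and, for each direction, comparing the upper-tail frequencies of the projected samples at every observed threshold. The sketch below covers only this distance computation, not the projection onto the structured family. The function names `tail_gap` and `halfspace_distance_2d`, the direction grid, and the toy samples are assumptions made for illustration.

```python
import bisect
import math


def tail_gap(a, b):
    """sup over t of |P_a(X >= t) - P_b(X >= t)| for two 1-D samples.

    Both tail frequencies are step functions that change only at data
    points, so checking the observed values is enough.
    """
    a, b = sorted(a), sorted(b)
    best = 0.0
    for t in a + b:
        tail_a = (len(a) - bisect.bisect_left(a, t)) / len(a)
        tail_b = (len(b) - bisect.bisect_left(b, t)) / len(b)
        best = max(best, abs(tail_a - tail_b))
    return best


def halfspace_distance_2d(xs, ys, n_dirs=180):
    """Approximate halfspace distance between two 2-D samples.

    Scans n_dirs evenly spaced unit directions; the result is a lower
    bound on the supremum over all directions.
    """
    best = 0.0
    for j in range(n_dirs):
        theta = 2 * math.pi * j / n_dirs
        vx, vy = math.cos(theta), math.sin(theta)
        proj_x = [vx * x1 + vy * x2 for (x1, x2) in xs]
        proj_y = [vx * y1 + vy * y2 for (y1, y2) in ys]
        best = max(best, tail_gap(proj_x, proj_y))
    return best


cluster_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
cluster_b = [(100.0, 100.0), (101.0, 100.0), (100.0, 101.0), (101.0, 101.0)]

dist_same = halfspace_distance_2d(cluster_a, cluster_a)
dist_apart = halfspace_distance_2d(cluster_a, cluster_b)
```

Identical samples are at distance $0$, and two samples that can be separated by a single halfspace are at the maximal distance $1$. A full implementation of the projection step would additionally minimize this distance over the chosen family, which this sketch does not attempt.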

Key properties:

  • Breakdown point: $1/2$ (the information-theoretic maximum possible for nonparametric mean estimation) under TV contamination, regardless of $d$.
  • Sample complexity: $O(d)$, matching the Tukey median.
  • Tradeoff: The projection method achieves optimal robustness only with access to additional structural information about the distribution (via hh), and may lack affine equivariance, unlike the fully nonparametric Tukey median.

The Tukey median thus represents the most robust fully nonparametric, affine-equivariant estimator, but projection approaches can surpass its breakdown point if weakly parametric assumptions are accepted.

6. Practical Implications and Open Problems

6.1 Implications for Robust Mean Estimation

The key lesson is that under strong adversarial contamination (TV distance), the Tukey median does not suffice for optimal robust mean estimation in $d\geq 3$, and more aggressive procedures are required:

  • To achieve robustness up to $1/2$ contamination, projection techniques or estimators leveraging additional structure are necessary.
  • Whether any estimator can be both affine-equivariant and attain a $1/2$ breakdown point with strong finite-sample guarantees remains unresolved; neither of the procedures discussed here achieves both.

6.2 Theoretical Frontiers

  • The results establish the sharp threshold for nonparametric affine-equivariant location estimation in high dimensions: $1/4$ under TV contamination is unimprovable for the Tukey median.
  • For halfspace-symmetric distributions, this delineates the class for which depth-based methods are maximally robust.

6.3 Summary Table: Breakdown Points

| Model                   | $d=1$ | $d=2$ | $d\geq 3$ |
|-------------------------|-------|-------|-----------|
| Tukey median (additive) | $1/2$ | $1/3$ | $1/3$     |
| Tukey median (TV)       | $1/2$ | $1/3$ | $1/4$     |
| Projection (TV)         | $1/2$ | $1/2$ | $1/2$     |

The Tukey model is central to contemporary robust statistics and underpins both theoretical and algorithmic advances in high-dimensional settings:

  • Depth-based statistics (e.g., Tukey depth regions, medians) rely on the contamination paradigm for their robustness guarantees.
  • Extensions of the model, such as cellwise or independent contaminations (Agostinelli et al., 2014), demand fundamentally new estimators.
  • Ongoing research aims to close the gap between affine equivariance, computational feasibility, and optimal adversarial breakdown, particularly under the TV contamination model.

These advances clarify both the power and limitations of classic robust methods rooted in the Tukey model, and they guide the choice of estimators in practical applications requiring stringent contamination resistance.
