Azadkia-Chatterjee Coefficient
- The Azadkia–Chatterjee coefficient is a nonparametric, rank-based measure defined via conditional probability variance that ranges from 0 (independence) to 1 (functional dependence).
- It is estimated via graph-based constructions using nearest-neighbor ranks, which achieve strong consistency, parametric rates, and asymptotic normality in both marginal and conditional settings.
- Its extensions include multivariate responses and scale-invariant variants, making it central for independence testing, graphical models, and model-free variable selection.
The Azadkia–Chatterjee coefficient is a nonparametric, rank-based measure of directed dependence between a vector-valued predictor and a univariate or multivariate response, defined at the population level via the variance of conditional probabilities and estimated using nearest-neighbor graphs. It features an interpretable scale—zero under independence and one under functional dependence—and a graph-based empirical estimator that admits parametric rates, strong consistency, bandwidth-free implementation, and central limit theorems in both marginal and conditional versions. Multivariate extensions, scale-invariant variants, and connections to broader classes of geometric graph and kernel-based dependence measures position the coefficient as a central object for independence testing, graphical models, and model-free variable selection.
1. Definition and Fundamental Properties
Let $(X, Y)$ be jointly distributed random elements with $X \in \mathbb{R}^p$ and $Y$ either univariate or a vector in $\mathbb{R}^q$. For univariate $Y$ with law $\mu$, the Azadkia–Chatterjee (AC) coefficient of $Y$ on $X$ is defined by
$$T(Y, X) = \frac{\int \operatorname{Var}\big(\mathbb{P}(Y \ge t \mid X)\big)\, d\mu(t)}{\int \operatorname{Var}\big(\mathbf{1}\{Y \ge t\}\big)\, d\mu(t)}.$$
An equivalent form based on the cumulative distribution function $F$ of $Y$ yields, for continuous $Y$ (for which the denominator equals $\int F(1-F)\, d\mu = 1/6$),
$$T(Y, X) = 6 \int \operatorname{Var}\big(\mathbb{P}(Y \ge t \mid X)\big)\, d\mu(t).$$
Characterizing properties:
- $T(Y, X) = 0$ if and only if $Y$ and $X$ are independent.
- $T(Y, X) = 1$ if and only if $Y$ is almost surely a measurable function of $X$.
The definition is directional and scale-invariant: strictly increasing transformations of $Y$ and bijections of $X$ preserve $T$ (Ansari et al., 14 Mar 2025, Ansari et al., 2022). For conditional dependence, let $(X, Y, Z)$ be jointly distributed and set
$$T(Y, Z \mid X) = \frac{\int \mathbb{E}\big[\operatorname{Var}\big(\mathbb{P}(Y \ge t \mid Z, X) \mid X\big)\big]\, d\mu(t)}{\int \mathbb{E}\big[\operatorname{Var}\big(\mathbf{1}\{Y \ge t\} \mid X\big)\big]\, d\mu(t)}.$$
Then $T(Y, Z \mid X) = 0$ if and only if $Y \perp\!\!\!\perp Z \mid X$, and $T(Y, Z \mid X) = 1$ if and only if $Y$ is almost surely a measurable function of $Z$ given $X$ (Shi et al., 2021, Huang et al., 2020).
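A quick check of the two endpoints, computed directly from the definition above (not taken from the cited papers), with $F$ the CDF of a continuous $Y$:
$$Y = X:\quad \mathbb{P}(Y \ge t \mid X) = \mathbf{1}\{X \ge t\}, \qquad T(Y, X) = \frac{\int F(1-F)\, d\mu}{\int F(1-F)\, d\mu} = 1,$$
$$Y \perp\!\!\!\perp X:\quad \mathbb{P}(Y \ge t \mid X) = \mathbb{P}(Y \ge t) \text{ is constant in } X, \qquad T(Y, X) = \frac{0}{\int F(1-F)\, d\mu} = 0.$$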
2. Graph-Based and Rank-Based Estimator Construction
For i.i.d. data $(X_1, Y_1), \dots, (X_n, Y_n)$, construct the following graph-based estimator:
- Compute the univariate ranks $R_i = \#\{j : Y_j \le Y_i\}$ and $L_i = \#\{j : Y_j \ge Y_i\}$.
- Let $N(i)$ be the index $j \ne i$ such that $X_j$ is the nearest neighbor of $X_i$ in Euclidean distance (ties broken at random).
- The empirical AC coefficient is
$$T_n = \frac{\sum_{i=1}^n \big(n \min\{R_i, R_{N(i)}\} - L_i^2\big)}{\sum_{i=1}^n L_i\,(n - L_i)}.$$
This estimator generalizes Chatterjee's original proposal to multivariate covariates by utilizing nearest-neighbor graphs in $\mathbb{R}^p$ (Lin et al., 2022); a minimal implementation sketch follows.
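The following is a minimal Python sketch of this construction, assuming numpy and scipy are available and that $Y$ has no ties; the function name `ac_coefficient` and the use of a kd-tree are illustrative choices, not a reference implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def ac_coefficient(X, Y):
    """Graph-based Azadkia-Chatterjee estimator T_n (sketch; assumes no ties in Y)."""
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(Y), -1)
    n = len(Y)

    # Ranks R_i = #{j : Y_j <= Y_i}; without ties, L_i = #{j : Y_j >= Y_i} = n - R_i + 1.
    order = np.argsort(Y)
    R = np.empty(n, dtype=float)
    R[order] = np.arange(1, n + 1)
    L = n - R + 1

    # N(i): nearest neighbor of X_i among the other points
    # (k=2 because the closest point to X_i is X_i itself).
    _, idx = cKDTree(X).query(X, k=2)
    N = idx[:, 1]

    numer = np.sum(n * np.minimum(R, R[N]) - L ** 2)
    denom = np.sum(L * (n - L))
    return numer / denom

# Functional dependence gives a value near 1; independence a value near 0
# (both up to sampling noise and finite-sample graph effects).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
print(ac_coefficient(X, np.sin(X[:, 0])))        # large, approaching 1
print(ac_coefficient(X, rng.normal(size=2000)))  # near 0
```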
Multivariate response: For $\mathbf{Y} = (Y_1, \dots, Y_q)$ with $q \ge 2$, a "chain rule" or copula-based construction is used (Ansari et al., 2022, Huang et al., 8 Dec 2025): each component $Y_k$ is conditioned on $(\mathbf{X}, Y_1, \dots, Y_{k-1})$, the resulting coefficient reduces to $T(Y, X)$ for $q = 1$, and it can be strongly consistently estimated using graph-based estimators for each univariate constituent.
Scale invariance: The standard estimator is not invariant to affine changes of the coordinates of $X$, since these alter the Euclidean nearest-neighbor graph; a fully scale-invariant version applies coordinatewise rank transforms to $X$ before constructing the NNG (Tran et al., 3 Dec 2024).
3. Distributional Properties and Limit Theory
Asymptotic Normality and Variance Bounds
The central limit theorem holds under broad conditions. For i.i.d. draws from a continuous law,
$$\sqrt{n}\,\big(T_n - T(Y, X)\big) \xrightarrow{d} N(0, \sigma^2)$$
whenever $Y$ is not almost surely a measurable function of $X$ (Lin et al., 2022). The asymptotic variance $\sigma^2$ admits a universal upper bound and, under absolute continuity of the underlying distribution, a sharper bound involving explicit dimension-dependent constants.
Under independence of $Y$ and $X$, $\sqrt{n}\,T_n$ is asymptotically $N(0, \sigma_p^2)$, with $\sigma_p^2$ linked to the geometry of the NNG in $\mathbb{R}^p$ (Lin et al., 2022, Han et al., 2022). Under manifold support, the limiting variance depends solely on the intrinsic dimension.
A consistent explicit estimator of the variance is available, allowing for valid inference (Lin et al., 2022).
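Combined with the central limit theorem above, this yields the usual plug-in interval; here $\hat\sigma_n$ denotes any consistent estimator of the asymptotic standard deviation (the explicit construction from the cited work is not reproduced here):
$$T_n \pm z_{1-\alpha/2}\, \frac{\hat\sigma_n}{\sqrt{n}}$$
is an asymptotic level-$(1-\alpha)$ confidence interval for $T(Y, X)$, valid whenever $Y$ is not almost surely a function of $X$.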
Symmetric and Conditional Extensions
A symmetrized version, taking $\max\{T_n(Y, X),\, T_n(X, Y)\}$, allows construction of two-sided tests; its limit law under independence is skew-normal with explicit variance (Zhang, 2022).
The conditional AC coefficient admits an empirical estimator with parallel asymptotics; under the null of conditional independence, the suitably normalized estimator is asymptotically normal, with variance determined by the dimensions of the variables and graph-count statistics (Shi et al., 2021).
Continuity Considerations
Unlike classical measures (Spearman's rho, Kendall's tau), $T$ is not weakly continuous under distributional convergence. Instead, it is continuous with respect to convergence of Markov products (pairs of conditionally i.i.d. copies) under additional marginal quantile convergence or specific copula convergence. Practical families and models (elliptical and Archimedean copulas, additive-noise models) satisfy the required continuity, so stable large-sample inference is possible within these classes (Ansari et al., 14 Mar 2025).
4. Algorithmic and Computational Aspects
- Nearest-neighbor graph construction can be done in $O(n^2)$ time by brute force for small $n$, or in roughly $O(n \log n)$ time with kd-trees or approximate methods for larger $n$.
- Rank computations for $Y$ (and optionally for each coordinate of $X$ in the scale-invariant version) cost $O(n \log n)$ per coordinate.
- Multivariate response: Efficient merge-sort or divide-and-conquer algorithms exist for blockwise rank counts, with near-linear time complexity up to logarithmic factors (Huang et al., 8 Dec 2025).
- Taken together, nearest-neighbor search and rank calculations admit nearly linear scaling in $n$, enabling use in large datasets (a sketch of the scale-invariant preprocessing follows this list).
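A sketch of the scale-invariant variant under these cost considerations, assuming the `ac_coefficient` helper from the Section 2 sketch: the coordinatewise rank transform of $X$ adds only an $O(n \log n)$-per-coordinate preprocessing step before the same nearest-neighbor estimator is applied (the function name is illustrative).

```python
import numpy as np
from scipy.stats import rankdata

def ac_coefficient_rank(X, Y):
    """Scale-invariant variant (sketch): rank-transform each coordinate of X,
    then reuse the nearest-neighbor estimator on the transformed covariates."""
    X = np.asarray(X, dtype=float)
    # Coordinatewise ranks make the NN graph invariant to strictly
    # increasing transformations of each coordinate of X.
    X_ranked = np.column_stack([rankdata(X[:, j]) for j in range(X.shape[1])])
    return ac_coefficient(X_ranked, Y)  # helper sketched in Section 2
```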
5. Connections to Broader Dependence Measures
The AC coefficient is a specific instance within the family of graph–RKHS–OT dependency measures (Deb et al., 2020, Deb et al., 20 Nov 2024):
- Population level: For sufficiently rich kernels (e.g., the min kernel on $[0,1]$, or the indicator-integral kernel), the corresponding normalized conditional MMD directly recovers $T$.
- Sample level: The estimator is a geometric graph functional over empirical OT ranks.
- Distribution-free: Under the null of independence, the law of the AC coefficient (when computed using empirical OT ranks and graph structure) is exactly permutation invariant, enabling finite-sample calibration for independence tests.
Multivariate extensions (both in predictors and responses) and conditional variants fit naturally into this graph–kernel framework, relating directly to kernel partial correlation (Huang et al., 2020), distance multivariance, and more general measures indexed by RKHS (Deb et al., 20 Nov 2024).
6. Practical Application Domains
Independence and Conditional Independence Testing
The AC coefficient and its conditional extension are used for:
- Testing independence in arbitrary dimensions (direct, distribution-free under the null, with consistent critical values); a permutation-test sketch follows this list.
- Conditional independence testing, e.g., through graph-based statistics evaluated with (conditional) randomization tests (Shi et al., 2021). However, these are known to exhibit low local power against contiguous local alternatives unless the nearest-neighbor graph is appropriately generalized or replaced with $k$-NN approaches.
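As an illustration of null calibration by permutation, here is a minimal sketch reusing the `ac_coefficient` helper from Section 2; the function name and the default of 999 permutations are illustrative choices, not prescribed by the cited works.

```python
import numpy as np

def ac_permutation_test(X, Y, n_perm=999, seed=0):
    """Permutation test of independence based on the AC coefficient (sketch).

    Under the null, permuting Y breaks any dependence on X, so the observed
    statistic is compared with its permutation distribution.
    """
    rng = np.random.default_rng(seed)
    t_obs = ac_coefficient(X, Y)  # estimator sketched in Section 2
    t_perm = np.array([
        ac_coefficient(X, rng.permutation(Y)) for _ in range(n_perm)
    ])
    # One-sided p-value: large values of T_n indicate dependence.
    p_value = (1 + np.sum(t_perm >= t_obs)) / (n_perm + 1)
    return t_obs, p_value
```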
Graphical Model Structure Learning
Pairwise conditional AC coefficients are used as entries in adjacency matrices for learning undirected graphs representing conditional independence relationships in high dimensions, outperforming standard penalized Gaussian graphical model approaches in various regimes (Furmańczyk, 2023).
Model-Free Feature Selection and Network Analysis
The multivariate extension of $T$ and its estimator enable the following (a forward-selection sketch follows this list):
- Directional, scale-invariant variable selection in high-dimensional regression settings (Ansari et al., 2022, Ansari et al., 14 Mar 2025).
- Ranking and forward feature selection for multivariate outcomes, with no tuning parameters and explicit stopping rules.
- Directed network inference in financial, biological, and climatological data (Ansari et al., 2022).
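A simplified sketch of FOCI-style greedy forward selection for a univariate response, reusing the `ac_coefficient` helper from Section 2; the stopping rule shown (stop when the coefficient no longer increases) is a simplification of the exact rules in the cited works, and the function name is illustrative.

```python
import numpy as np

def forward_select(X, Y, max_features=None):
    """Greedy forward feature selection driven by the AC coefficient (sketch).

    At each step, add the covariate that maximizes the coefficient of Y on
    the currently selected set; stop when the coefficient no longer
    increases (a tuning-parameter-free stopping rule, simplified here).
    """
    n, p = X.shape
    selected, best = [], -np.inf
    limit = p if max_features is None else max_features
    while len(selected) < limit:
        remaining = [j for j in range(p) if j not in selected]
        scores = [ac_coefficient(X[:, selected + [j]], Y) for j in remaining]
        if max(scores) <= best:   # no improvement: stop
            break
        best = max(scores)
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```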
7. Theoretical Limitations and Open Problems
- Under local parametric or minimax-detection boundary alternatives, the standard 1-NN estimator is asymptotically powerless unless the graph construction is strengthened (e.g., $k$-NN graphs with $k$ increasing with $n$) (Shi et al., 2021).
- Weak continuity of $T$ fails under convergence in law, but holds under stricter Markov-product and copula-derivative types of convergence, implying care is needed in statistical inference (Ansari et al., 14 Mar 2025).
- In practical high-dimensional settings, the curse of dimensionality in nearest-neighbor search may be partially circumvented due to intrinsic dimension adaptivity, but further analysis on computational–statistical tradeoffs remains ongoing (Han et al., 2022).
References:
- (Lin et al., 2022) Limit theorems of Chatterjee's rank correlation
- (Zhang, 2022) On the asymptotic distribution of the symmetrized Chatterjee's correlation coefficient
- (Shi et al., 2021) On Azadkia-Chatterjee's conditional dependence coefficient
- (Tran et al., 3 Dec 2024) On a rank-based Azadkia-Chatterjee correlation coefficient
- (Huang et al., 8 Dec 2025) A multivariate extension of Azadkia-Chatterjee's rank coefficient
- (Ansari et al., 2022) A direct extension of Azadkia & Chatterjee's rank correlation to multi-response vectors
- (Han et al., 2022) Azadkia-Chatterjee's correlation coefficient adapts to manifold data
- (Deb et al., 2020) Measuring Association on Topological Spaces Using Kernels and Geometric Graphs
- (Huang et al., 2020) Kernel Partial Correlation Coefficient -- a Measure of Conditional Dependence
- (Furmańczyk, 2023) A construction of a graphical model
- (Ansari et al., 14 Mar 2025) On continuity of Chatterjee's rank correlation and related dependence measures
- (Deb et al., 20 Nov 2024) Distribution-free Measures of Association based on Optimal Transport