A new coefficient of correlation (1909.10140v4)

Published 23 Sep 2019 in math.ST, math.PR, and stat.TH

Abstract: Is it possible to define a coefficient of correlation which is (a) as simple as the classical coefficients like Pearson's correlation or Spearman's correlation, and yet (b) consistently estimates some simple and interpretable measure of the degree of dependence between the variables, which is 0 if and only if the variables are independent and 1 if and only if one is a measurable function of the other, and (c) has a simple asymptotic theory under the hypothesis of independence, like the classical coefficients? This article answers this question in the affirmative, by producing such a coefficient. No assumptions are needed on the distributions of the variables. There are several coefficients in the literature that converge to 0 if and only if the variables are independent, but none that satisfy any of the other properties mentioned above.

Summary

The paper introduces a simple, rank-based coefficient that detects associations between two variables, including non-monotonic ones.
A simple asymptotic theory under independence yields p-values without permutation tests, and the coefficient can be computed in $O(n \log n)$ time.
Its broad applicability and resistance to outliers make this coefficient a valuable tool for modern statistical analysis.

A New Coefficient of Correlation: An Analysis

Sourav Chatterjee's paper proposes a new correlation coefficient that addresses a shortcoming of classical coefficients such as Pearson's, Spearman's, and Kendall's $\tau$. These classical measures are effective at detecting linear or monotonic relationships but can miss non-monotonic associations. Chatterjee's coefficient fills this gap: it consistently estimates the degree of dependence between two variables while satisfying several desirable properties at once.

Conceptual Framework

The proposed correlation measure is appealing due to its simplicity and comprehensive properties:

Simplicity: The coefficient possesses a straightforward formula comparable to classical coefficients.
Interpretability: It consistently estimates a measure of dependence that lies between 0 and 1, equals 0 if and only if the variables are independent, and equals 1 if and only if one variable is a measurable function of the other.
Simple Asymptotic Theory: Under the independence hypothesis, the coefficient has a simple asymptotic distribution, which makes p-values straightforward to calculate.

The coefficient, denoted $\xi_n$, is rank-based. Given independent and identically distributed pairs $(X_1, Y_1), \ldots, (X_n, Y_n)$, the pairs are rearranged so that the $X$-values are in increasing order, and the coefficient is computed from the differences between the ranks of successive $Y$-values. The coefficient is deliberately asymmetric: $\xi_n(X, Y)$ measures how well $Y$ can be expressed as a function of $X$, not the reverse. A symmetric measure can be obtained by taking the maximum of $\xi_n(X, Y)$ and $\xi_n(Y, X)$.
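The rank-based computation described above can be sketched in a few lines. When the $Y$-values have no ties, the paper's formula is $\xi_n = 1 - \frac{3 \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{n^2 - 1}$, where $r_i$ is the rank of the $Y$-value attached to the $i$-th smallest $X$-value. The function name `xi_coefficient` is hypothetical, and the sketch makes two simplifying assumptions: the $Y$-values are distinct, and ties among the $X$-values are broken by input order rather than uniformly at random as the paper prescribes.

```python
def xi_coefficient(x, y):
    """Sketch of Chatterjee's xi_n, assuming all y values are distinct.

    Hypothetical helper, not the author's reference implementation.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")

    # Rank of each y value, 1 = smallest (valid only when y has no ties).
    order_y = sorted(range(n), key=lambda i: y[i])
    rank = [0] * n
    for position, i in enumerate(order_y, start=1):
        rank[i] = position

    # Rearrange the pairs so that x is increasing and read off the y ranks.
    # Assumption: ties in x are broken by input order (sorted is stable).
    order_x = sorted(range(n), key=lambda i: x[i])
    r = [rank[i] for i in order_x]

    # Sum of absolute differences between ranks of successive y values.
    total = sum(abs(r[i + 1] - r[i]) for i in range(n - 1))
    return 1.0 - 3.0 * total / (n * n - 1)


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5]
    ys = [v * v for v in xs]  # y is an increasing function of x
    print(xi_coefficient(xs, ys))
```

For a strictly monotone relationship on $n$ points the rank differences all equal 1, so the formula gives $1 - 3/(n+1)$: the finite-sample value approaches 1 only as $n$ grows.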

Strong Numerical Results and Claims

The paper establishes a theoretical foundation for this new correlation coefficient:

Consistency and Convergence: As $n \to \infty$, $\xi_n$ converges almost surely to a deterministic limit $\xi(X,Y)$. The only assumption needed is that $Y$ is not almost surely constant. The limit always lies between 0 and 1; it equals 0 if and only if $X$ and $Y$ are independent, and 1 if and only if $Y$ is a measurable function of $X$.
Testing Independence: When $X$ and $Y$ are independent and $Y$ is continuous, $\sqrt{n}\,\xi_n$ converges in distribution to a normal distribution with mean 0 and variance $2/5$. This allows independence to be tested from the asymptotic distribution, without computationally expensive permutation tests.
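As a rough illustration of how the asymptotic result turns into a test, the sketch below converts an observed $\xi_n$ into an approximate p-value using the $N(0, 2/5)$ limit for continuous $Y$. The function name `xi_independence_pvalue` is hypothetical, and the one-sided form is an assumption made here on the grounds that dependence pushes $\xi_n$ toward a positive limit.

```python
import math


def xi_independence_pvalue(xi_n, n):
    """Approximate one-sided p-value for H0: X and Y are independent.

    Assumes Y is continuous, so that sqrt(n) * xi_n is approximately
    N(0, 2/5) under H0. Hypothetical helper for illustration only.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    # Standardize the statistic using the asymptotic variance 2/5.
    z = math.sqrt(n) * xi_n / math.sqrt(2.0 / 5.0)
    # Upper tail of the standard normal: P(Z >= z) = erfc(z / sqrt(2)) / 2.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    # A value of xi_n near 0 gives a p-value near 0.5;
    # larger values of xi_n give smaller p-values.
    print(xi_independence_pvalue(0.0, 100))
    print(xi_independence_pvalue(0.3, 100))
```

Because the approximation is asymptotic, the resulting p-value should be treated with some caution for very small $n$.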

Implications and Practical Considerations

The novel coefficient offers significant advantages for the correlation analysis landscape:

Broad Applicability: With no assumptions about distributions, it can be applied widely, accommodating both continuous and discrete variables.
Robustness: As a rank-based measure, it resists the influence of outliers and is unchanged by strictly increasing transformations of either variable.
Computational Efficiency: Unlike many alternatives that require quadratic time, it can be computed in $O(n \log n)$ time, since the cost is dominated by sorting. This is a significant advantage for large sample sizes.
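The properties listed above can be checked on a small example. The sketch below uses the paper's general formula, which allows ties in $Y$: $\xi_n = 1 - \frac{n \sum_{i=1}^{n-1} |r_{i+1} - r_i|}{2 \sum_{i=1}^{n} l_i (n - l_i)}$, where $r_i$ counts the $Y$-values less than or equal to the $i$-th one (in $X$-sorted order) and $l_i$ counts those greater than or equal to it. The function name `xi_with_ties` is hypothetical. Sorting plus binary search keeps the cost at $O(n \log n)$. As an assumption of this sketch, ties in $X$ are broken by input order instead of at random.

```python
from bisect import bisect_left, bisect_right


def xi_with_ties(x, y):
    """Sketch of xi_n using the general formula that permits ties in y.

    Hypothetical helper, not the author's reference implementation.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")

    y_sorted = sorted(y)                             # O(n log n)
    order_x = sorted(range(n), key=lambda i: x[i])   # O(n log n), stable

    # r_i = #{j : y_j <= y_i} and l_i = #{j : y_j >= y_i}, by binary search.
    r = [bisect_right(y_sorted, y[i]) for i in order_x]
    l = [n - bisect_left(y_sorted, y[i]) for i in order_x]

    numerator = n * sum(abs(r[i + 1] - r[i]) for i in range(n - 1))
    denominator = 2 * sum(li * (n - li) for li in l)
    if denominator == 0:
        raise ValueError("y is constant; the coefficient is undefined")
    return 1.0 - numerator / denominator


if __name__ == "__main__":
    xs = [0, 1, 2, 3, 4, 5]
    ys = [0, 0, 0, 1, 1, 1]          # discrete y with ties
    print(xi_with_ties(xs, ys))

    # Strictly increasing transformations leave the value unchanged.
    print(xi_with_ties([2 * v + 1 for v in xs], [v ** 3 for v in ys]))
```

When the $Y$-values are all distinct, the denominator reduces to $n(n^2 - 1)/3$ and the expression coincides with the simpler no-ties formula.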

The paper supports these claims with simulation studies. The results indicate that the coefficient retains its diagnostic power at small sample sizes and that its finite-sample behavior is consistent with the asymptotic theory.

Future Developments in AI

This work invites further exploration into adaptive correlation measures in artificial intelligence and data science where datasets often defy traditional statistical assumptions. Potential areas for development may include enhancing the flexibility of the coefficient to automatically adjust to sample peculiarities or integrating it with machine learning models to improve feature selection processes.

Conclusion

Chatterjee's new correlation coefficient stands as a significant contribution to modern statistical methods. By accommodating non-linear dependencies without additional assumptions on distributions, it enhances the toolkit available to statisticians and data scientists, particularly in domains where complex dependencies exist. The balance between theoretical rigor and computational simplicity makes it an accessible and powerful alternative to classical measures.
