XicorAttention: Rank Correlation in Transformers

Updated 19 January 2026
  • The paper introduces XicorAttention, which leverages a differentiable version of Chatterjee’s rank correlation coefficient to replace dot-product attention in Transformers.
  • XicorAttention employs continuous relaxations via SoftSort and FastSoftRank, achieving up to a 9.1% MAE reduction on time series forecasting benchmarks.
  • Integrating XicorAttention into Transformer models provides robust performance improvements for capturing complex nonlinear dependencies, despite moderate computational overhead.

Nonlinear Rank Correlation-Based Attention (XicorAttention) is an attention mechanism constructed on Chatterjee's rank correlation coefficient $\xi$, designed to replace the standard dot-product attention in Transformers with a differentiable measure of nonlinear dependency between query and key pairs. XicorAttention leverages rank correlation to capture complex relationships in data, particularly for time series forecasting, where nonlinear structure and regime shifts are prevalent and often unaddressed by conventional attention measures. Its key innovation is the replacement of the standard inner-product kernel with a soft (differentiable) version of Chatterjee's coefficient, computed via relaxations of sorting and ranking to ensure gradient flow. Empirically, XicorAttention achieves significant performance improvements on standard forecasting benchmarks, with up to 9.1% MAE reduction over baselines, at a moderate computational overhead (Kimura et al., 3 Jun 2025).

1. Chatterjee’s Rank Correlation Coefficient and Theoretical Foundations

Chatterjee's rank correlation coefficient $\xi$ is a rank-based statistic for measuring general (including nonlinear) dependence between two variables. For a pair of random variables $(X, Y)$ with joint CDF $F(x, y)$ and marginals $F_X(x)$, $F_Y(y)$, the population correlation is defined as

$$\xi(X, Y) = \frac{\int \operatorname{Var}_X[\, p(y \mid X) \,] \, dF_Y(y)}{\int \operatorname{Var}[\, I\{Y \geq y\} \,] \, dF_Y(y)},$$

where $p(y \mid X) := \mathbb{P}(Y \geq y \mid X)$.

Given data $\{(X_i, Y_i) : i = 1, \dots, n\}$, the finite-sample estimator under no ties is

$$\xi_n(X, Y) = 1 - \frac{3}{n^2 - 1} \sum_{i=1}^{n-1} \left| r_{i+1} - r_i \right|,$$

where $r_i$ is the rank of $Y_{(i)}$ after sorting $X_{(1)} < \cdots < X_{(n)}$ (Shi et al., 2020; Kimura et al., 3 Jun 2025).
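Under the stated no-ties assumption, the estimator takes only a few lines of NumPy. A minimal sketch (the function name `chatterjee_xi` is ours):

```python
import numpy as np

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation xi_n, assuming no ties in x or y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    y_sorted = y[np.argsort(x)]               # reorder Y by increasing X
    r = np.argsort(np.argsort(y_sorted)) + 1  # 1-based ranks r_i of Y_(i)
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n**2 - 1)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)
y_dep = x**2                                  # deterministic nonlinear dependence
y_ind = rng.uniform(-1.0, 1.0, size=2000)     # independent of x

print(chatterjee_xi(x, y_dep))                # near 1: Y is a function of X
print(chatterjee_xi(x, y_ind))                # near 0: independence
```

Because the statistic compares only consecutive ranks, it detects the deterministic relation $Y = X^2$ that Pearson correlation (approximately zero here by symmetry) misses.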

Chatterjee's $\xi$ equals zero if and only if $X$ and $Y$ are independent. It is a consistent estimator for general dependence and is distribution-free under the null. However, its local power is rate-suboptimal: under weak dependencies of order $\Theta(1/\sqrt{n})$, $\xi_n$ does not exhibit power exceeding the nominal level, contrasting with alternatives such as Hoeffding's $D$, Blum–Kiefer–Rosenblatt's $R$, and Bergsma–Dassios–Yanagimoto's $\tau^*$, which are rate-optimal for such alternatives (Shi et al., 2020).

2. Differentiable Relaxations: SoftSort and FastSoftRank

The original computation of $\xi_n$ uses non-differentiable operations (sorting and ranking), which are incompatible with standard neural network backpropagation. XicorAttention resolves this via continuous relaxations:

  • SoftSort: Given a vector $\mathbf{q} \in \mathbb{R}^n$, SoftSort computes a row-stochastic soft permutation matrix via a row-wise softmax:

$$\widehat{P} = \mathrm{SoftSort}_\tau(\mathbf{q}) = \mathrm{softmax}_{\mathrm{row}}\left( -\frac{d\big(\operatorname{sort}(\mathbf{q})\,\mathbf{1}^\top,\ \mathbf{1}\,\mathbf{q}^\top\big)}{\tau} \right),$$

where $d(\cdot,\cdot)$ is a smooth distance (typically $\ell_1$) and $\tau > 0$ is a temperature.

  • FastSoftRank: The ranking operator is approximated via strongly convex projection onto the permutahedron (an isotonic regression problem). For input $\mathbf{k}$, the soft rank is given by

$$r^\varepsilon_\Phi(\mathbf{k}) = \operatorname*{arg\,min}_{y \in \mathrm{Perm}(\rho)} \frac{1}{2} \left\| y + \frac{\mathbf{k}}{\varepsilon} \right\|^2,$$

where $\rho = (n, n-1, \ldots, 1)$ and $\varepsilon > 0$ controls the regularization.

SoftSort costs $O(d^2)$ per query–key pair (it forms all pairwise distances), while FastSoftRank runs in $O(d \log d)$; both remain practical for typical Transformer head dimensions $d$.
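Both relaxations can be sketched compactly in NumPy (forward passes only; in practice they live in an autodiff framework, and the simple pairwise-sigmoid rank surrogate below is a stand-in for the permutahedron projection used by FastSoftRank):

```python
import numpy as np

def softsort(q, tau=0.1):
    """Row-stochastic relaxation of the permutation matrix that sorts q
    in descending order (SoftSort with the l1 distance)."""
    q = np.asarray(q, dtype=float)
    q_sorted = np.sort(q)[::-1]                          # descending sort
    logits = -np.abs(q_sorted[:, None] - q[None, :]) / tau
    logits -= logits.max(axis=1, keepdims=True)          # stable row softmax
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def soft_rank(k, eps=0.01):
    """Differentiable surrogate for 1-based ascending ranks:
    rank_i ~= 1 + sum_{j != i} sigmoid((k_i - k_j) / eps)."""
    k = np.asarray(k, dtype=float)
    pair = 0.5 * (1.0 + np.tanh(0.5 * (k[:, None] - k[None, :]) / eps))
    np.fill_diagonal(pair, 0.0)
    return 1.0 + pair.sum(axis=1)

q = np.array([0.3, 2.0, -1.0, 0.7])
P = softsort(q, tau=0.1)
print(P @ q)                      # ~ [2.0, 0.7, 0.3, -1.0] (soft-sorted)
print(soft_rank(q, eps=0.01))     # ~ [2.0, 4.0, 1.0, 3.0]
```

Note that the sigmoid surrogate is $O(d^2)$; FastSoftRank attains $O(d \log d)$ via isotonic regression, which is why the paper adopts it.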

3. XicorAttention Mechanism

XicorAttention reconceptualizes attention weights as nonlinear rank correlations between projected queries and keys:

  1. For an input $\mathbf{X} \in \mathbb{R}^{T \times C}$, obtain queries, keys, and values via linear maps, then split into $h$ heads of dimension $d$.
  2. For each head $m$, and for every pair $(i, j)$ of query/key positions, consider $\mathbf{q} = \mathbf{Q}^m_{i,:}$, $\mathbf{k} = \mathbf{K}^m_{j,:}$ as vector samples of size $n = d$.
  3. Apply SoftSort to $\mathbf{q}$, permute both $\mathbf{q}$ and $\mathbf{k}$, and obtain differentiable soft ranks of $\mathbf{k}$.
  4. Compute the relaxed Chatterjee coefficient:

$$\widehat{\xi}_d(\mathbf{q}, \mathbf{k}) = 1 - \frac{3}{d^2 - 1} \sum_{\ell=1}^{d-1} \left| \widehat{r}_{\ell+1} - \widehat{r}_\ell \right|,$$

yielding an attention score in $[0, 1]$.

  5. Compile these into an attention matrix $\mathcal{A}^m \in [0, 1]^{T \times T}$ and output:

$$\mathrm{XicorAttn}^m(\mathbf{Q}^m, \mathbf{K}^m, \mathbf{V}^m) = \mathcal{A}^m \mathbf{V}^m.$$

Unlike standard attention, no additional softmax is applied.

Pseudocode (single head):

INPUT:  Q ∈ R^{T×d}, K ∈ R^{T×d}, V ∈ R^{T×d}, temperature τ, regularizer ε
OUTPUT: Y ∈ R^{T×d}

for i in 1..T:
  for j in 1..T:
    q = Q[i,:]
    k = K[j,:]
    # 1) soft-sort q, permute k accordingly
    Psoft = SoftSort_τ(q)               # (d×d) soft permutation
    q_sorted = Psoft @ q
    k_sorted = Psoft @ k
    # 2) soft-rank k_sorted
    r_soft = FastSoftRank_ε(k_sorted)   # vector in R^d
    # 3) relaxed Chatterjee coefficient
    ξ̂ = 1 - 3/(d^2 - 1) * sum(|r_soft[ℓ+1] - r_soft[ℓ]| for ℓ in 1..d-1)
    A[i,j] = ξ̂
Y = A @ V
(Kimura et al., 3 Jun 2025)
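For concreteness, the loop above can be made runnable. The following NumPy sketch is our own transcription (forward pass only, with a pairwise-sigmoid rank surrogate standing in for FastSoftRank):

```python
import numpy as np

def softsort(q, tau=0.1):
    """SoftSort: row-stochastic relaxation of the descending-sort permutation."""
    q_sorted = np.sort(q)[::-1]
    logits = -np.abs(q_sorted[:, None] - q[None, :]) / tau
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def soft_rank(k, eps=0.01):
    """Differentiable rank surrogate (stand-in for FastSoftRank)."""
    pair = 0.5 * (1.0 + np.tanh(0.5 * (k[:, None] - k[None, :]) / eps))
    np.fill_diagonal(pair, 0.0)
    return 1.0 + pair.sum(axis=1)

def xicor_attention(Q, K, V, tau=0.1, eps=0.01):
    """Single-head XicorAttention: A[i, j] = relaxed Chatterjee xi(q_i, k_j)."""
    T, d = Q.shape
    A = np.zeros((T, T))
    for i in range(T):
        P = softsort(Q[i], tau)          # depends only on q_i, so hoisted
        for j in range(T):
            k_perm = P @ K[j]            # permute k_j by q_i's soft sort
            r = soft_rank(k_perm, eps)   # soft ranks of permuted key
            A[i, j] = 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (d**2 - 1)
    return A @ V, A

rng = np.random.default_rng(0)
T, d = 4, 16
Q, K, V = rng.standard_normal((3, T, d))
Y, A = xicor_attention(Q, K, V)
print(Y.shape)   # (4, 16)
```

Hoisting the soft permutation out of the inner loop reflects that it depends only on the query row; the relaxed coefficient is bounded above by 1 by construction, matching the score range stated above.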

4. Transformer Integration and Empirical Evaluation

XicorAttention is integrated into standard time series Transformer architectures by replacing the multi-head self-attention dot-product block. Embedding, positional encoding, feed-forward networks, and output projections remain unaltered. The method is implemented in several state-of-the-art forecasting backbones, including PatchTST, iTransformer, TimeXer, Informer, Autoformer, FEDformer, and vanilla Transformer (Kimura et al., 3 Jun 2025).

Empirical evaluation spans real-world multivariate time-series datasets (ETTh1/2, ETTm1/2, Exchange, Weather, Electricity, Traffic), using mean absolute error (MAE) and mean squared error (MSE) as metrics. XicorAttention consistently improves forecasting performance, achieving up to 9.1% MAE reduction on the Exchange dataset, with 3–5% average gains across all datasets and forecast horizons. It achieves the best absolute MAE/MSE on 5 out of 6 benchmarks.

Ablation studies on head dimension $d$ indicate that performance increases until $d \approx 128$, beyond which gains plateau; thus, a per-head dimension of at least 128 is recommended. The main computational overhead arises from SoftSort ($O(d^2)$); empirically, XicorAttention is about 1.5× slower than the standard attention mechanism for moderate $T$ (Kimura et al., 3 Jun 2025).

5. Computational Complexity, Differentiability, and Comparative Analysis

The standard attention mechanism incurs $O(T^2 d)$ complexity per layer. In contrast, XicorAttention incurs $O(T^2 d^2)$ per head, driven by the SoftSort ($O(d^2)$) and FastSoftRank ($O(d \log d)$) operations applied to each query–key pair. End-to-end differentiability is preserved via analytic gradients through both the soft permutation and soft ranking steps.
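A back-of-the-envelope comparison makes the extra $d$ factor concrete (operation counts only; constants and vectorization ignored, and the values of $T$ and $d$ are illustrative):

```python
import math

T, d = 96, 128   # illustrative lookback length and recommended head dimension

standard = T**2 * d                        # dot-product attention per layer
xicor = T**2 * (d**2 + d * math.log2(d))   # SoftSort + soft-rank per (i, j) pair

print(xicor / standard)   # = d + log2(d) = 135.0
```

Asymptotically the ratio grows like $d$, although in practice vectorization and the cost of the rest of the network keep the measured slowdown near the reported 1.5×.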

Comparison to other independence measures yields the following:

| Statistic | Null Distribution | Local Power | Complexity |
|---|---|---|---|
| $\xi_n$ | $N(0, 2/(5n))$ | Suboptimal | $O(n \log n)$ |
| $D_n$ | Non-degenerate | Optimal | $O(n \log n)$ |
| $R_n$ | Non-degenerate | Optimal | $O(n \log n)$ |
| $\tau^*_n$ | Non-degenerate | Optimal | $O(n \log n)$ |
| $\xi_n^*$ | Degenerate (null) | Consistent | $O(n^{5/3})$ |

Chatterjee's $\xi_n$ is distribution-free under the null but demonstrates low sensitivity to dependencies of order $O(1/\sqrt{n})$. For strong nonlinear signals, as often encountered in time series forecasting, this limitation is less impactful, justifying the use of XicorAttention for moderate to strong dependencies. The earlier Dette–Siburg–Stoimenov kernel estimator $\xi_n^*$ is harder to tune and computationally less favorable (Shi et al., 2020).

6. Practical Recommendations and Limitations

  • XicorAttention is suited for scenarios where detection of nonlinear dependence is crucial, and moderate-to-strong signals are expected.
  • The primary performance gain derives from the use of Chatterjee's $\xi$ as a principled, permutation-invariant, and nonlinear similarity, distinguishing it from inner-product attention.
  • In regimes where only very weak dependencies ($\sim 1/\sqrt{n}$) are present, alternative U-statistic-based attention mechanisms may be preferred for local power optimality (Shi et al., 2020).
  • Recommended head dimensions are $d \ge 128$. The original implementation uses a SoftSort temperature of $\tau = 1.0$ and FastSoftRank regularization of $\varepsilon = 0.01$.
  • Future optimization may include more efficient differentiable sorting techniques to mitigate the quadratic overhead of SoftSort.
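Gathering the reported settings in one place (the dictionary layout and the name `xicor_defaults` are ours; the values are those stated above):

```python
xicor_defaults = {
    "head_dim": 128,       # per-head dimension, d >= 128 recommended
    "softsort_tau": 1.0,   # SoftSort temperature
    "softrank_eps": 0.01,  # FastSoftRank regularization
}
print(xicor_defaults)
```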

7. Conceptual Impact and Relation to Broader Research

XicorAttention contributes a new hybrid between classical nonparametric dependence measures and modern self-attention architectures. It demonstrates that differentiable rank-based statistics can serve as effective plug-ins for neural attention, enabling detection of a broader spectrum of statistical dependencies. The dual requirements of computational tractability and differentiability are addressed via continuous relaxations, connecting algorithmic developments in statistical dependence testing (Shi et al., 2020) with deep learning advances (Kimura et al., 3 Jun 2025). This approach highlights the possibility of adopting a wide variety of statistical kernels in neural models, subject to constraints on local power and computational efficiency. A plausible implication is the potential for further exploration of rank- or kernel-based attention mechanisms in settings beyond time series forecasting.
