Token-Grid Correlation Module

Updated 21 November 2025
  • Token-Grid Correlation Module is a structured approach that models dual-axis dependencies via bidirectional attention, capturing both temporal and variate-wise patterns.
  • It leverages horizontal (temporal) and vertical (variate-aware) attention mechanisms to directly exploit local and global relationships while reducing computational complexity.
  • Empirical results in GridTST and multimodal document compression demonstrate state-of-the-art performance and efficiency gains across diverse benchmarks.

The Token-Grid Correlation Module is a structurally principled approach for modeling complex dependencies in structured data using Transformer architectures. It exploits the inherent two-dimensionality of domains where tokens can be semantically organized along dual axes, most notably multivariate time series (time × variate grid) and, in a distinct context, high-resolution visual documents (patch × semantic content grid). The construct encompasses both bidirectional attention mechanisms and correlation-driven token processing that jointly capture inter-token dependencies along both axes, thus overcoming trade-offs inherent to single-axis encoding.

1. Multivariate Grid Representation and Slicing

In time series forecasting, the Token-Grid Correlation Module was formalized in the GridTST model, where a multivariate time series $X \in \mathbb{R}^{T\times N}$ is interpreted as a two-dimensional grid: time steps along the $x$-axis and variates (channels) along the $y$-axis. Each univariate series $X_{:,n}$ undergoes patch segmentation via a sliding window of length $P$ and stride $S$, yielding $M = \lceil (T-P)/S \rceil + 2$ patches per variate. The resulting $X_{p,n} \in \mathbb{R}^{M\times P}$ are linearly projected and positionally encoded:

$$X_{d,n} = X_{p,n} W_p + W_{\mathrm{pos}}, \qquad X_{d,n} \in \mathbb{R}^{M\times D}$$

Stacking over all $N$ variates produces a 3-D grid $X_d \in \mathbb{R}^{M\times N \times D}$. Horizontal slices $X_{d,t,:} \in \mathbb{R}^{N\times D}$ yield variate tokens, while vertical slices $X_{d,:,n} \in \mathbb{R}^{M\times D}$ yield time tokens (Cheng et al., 22 May 2024).

This bidirectional grid conception allows subsequent attention mechanisms to directly exploit local and global dependencies across both axes.
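The following PyTorch sketch illustrates the grid construction under stated assumptions: the repeat-last-value padding scheme, tensor layout, and hyperparameter values are illustrative choices, not the GridTST reference implementation.

```python
import torch
import torch.nn as nn

def build_token_grid(x, patch_len=16, stride=8, d_model=128):
    """Turn a multivariate series x of shape (T, N) into a token grid of shape (M, N, D).

    Padding the series by repeating the last time step (an assumed convention)
    yields M = floor((T - patch_len) / stride) + 2 patches per variate.
    """
    T, N = x.shape
    x_padded = torch.cat([x, x[-1:].expand(stride, N)], dim=0)             # (T + S, N)
    patches = x_padded.T.unfold(dimension=1, size=patch_len, step=stride)  # (N, M, P)
    M = patches.shape[1]

    w_p = nn.Linear(patch_len, d_model)          # shared projection W_p
    w_pos = torch.zeros(M, d_model)              # positional encoding W_pos (learnable in practice)
    grid = w_p(patches) + w_pos                  # (N, M, D)
    return grid.permute(1, 0, 2)                 # (M, N, D) grid of patch embeddings

# Horizontal slice grid[t]    -> (N, D): variate tokens at patch index t.
# Vertical slice   grid[:, n] -> (M, D): time tokens for variate n.

x = torch.randn(512, 21)                         # e.g. T = 512 time steps, N = 21 variates
grid = build_token_grid(x)
print(grid.shape)                                # torch.Size([64, 21, 128])
```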

2. Bidirectional Attention: Horizontal and Vertical Mechanisms

The core of the module comprises two orthogonal attention schemes:

  • Horizontal (Temporal) Attention: For each variate $n$, attend over the sequence of $M$ patches $X_{d,:,n}$ using multi-headed self-attention. Each head forms queries, keys, and values from $X_{d,:,n}$ and computes:

$$O_{h,:,n} = \mathrm{Softmax}\!\left( \frac{Q_{h,:,n} K_{h,:,n}^\top}{\sqrt{d_k}} \right) V_{h,:,n} \in \mathbb{R}^{M\times D}$$

Post-attention residual, BatchNorm, and feed-forward sublayers yield updated temporal representations for each variate.

  • Vertical (Variate-Aware) Attention: For each temporal patch $t$, apply multi-headed attention across all variates $X_{d,t,:} \in \mathbb{R}^{N\times D}$, constructing $Q, K, V$ with distinct projection matrices. The result

$$\hat{O}_{h,t,:} = \mathrm{Softmax}\!\left( \frac{\hat{Q}_{h,t,:} \hat{K}_{h,t,:}^\top}{\sqrt{d_k}} \right) \hat{V}_{h,t,:} \in \mathbb{R}^{N\times D}$$

encodes instantaneous inter-variate correlation per patch.

An encoder layer consists of vertical attention followed by horizontal attention with intervening normalization and residual connections. Empirical analysis indicates that the "channel-first" order optimizes forecasting performance in GridTST (Cheng et al., 22 May 2024).
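A schematic PyTorch sketch of one encoder layer in the channel-first order described above; nn.MultiheadAttention and LayerNorm here stand in for the paper's attention and BatchNorm sublayers, and all module names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GridEncoderLayer(nn.Module):
    """One encoder layer: vertical (variate-aware) then horizontal (temporal) attention."""

    def __init__(self, d_model=128, n_heads=8, d_ff=256):
        super().__init__()
        self.vert_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.horz_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, grid):                      # grid: (M, N, D)
        # Vertical attention: each of the M patch indices attends across its N variate tokens.
        v_out, _ = self.vert_attn(grid, grid, grid)
        grid = self.norm1(grid + v_out)

        # Horizontal attention: each of the N variates attends across its M time tokens.
        h_in = grid.permute(1, 0, 2)              # (N, M, D)
        h_out, _ = self.horz_attn(h_in, h_in, h_in)
        grid = self.norm2(grid + h_out.permute(1, 0, 2))

        return grid + self.ff(grid)               # position-wise feed-forward with residual

layer = GridEncoderLayer()
out = layer(torch.randn(64, 21, 128))             # (M, N, D) in, (M, N, D) out
```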

3. Complexity, Parameterization, and Variate Sampling

Attention costs are divided between the horizontal and vertical mechanisms. For $n = M$ (patches) and $m = N$ (variates), the per-layer complexity shifts from a standard single-axis $O(n^2 D)$ to $O(n^2 D/2 + m^2 D/2)$. For datasets with relatively small $N$, this yields significant computational savings.

The dual-attention design does increase parameter count by requiring an additional set of projection matrices. However, these share size with their temporal counterparts and induce only a minor constant overhead.

For high-dimensional datasets (large $N$), variate sampling restricts vertical attention to a randomly selected subset of channels per batch, reducing memory and latency by up to 3× without significant loss in predictive accuracy (Cheng et al., 22 May 2024).
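A minimal sketch of the variate-sampling idea; the keep ratio and per-batch resampling policy below are assumptions for illustration, not values from the paper.

```python
import torch

def sample_variates(grid, keep_ratio=0.25):
    """Restrict vertical attention to a random channel subset for the current batch.

    grid: (M, N, D) token grid. Returns the reduced grid plus the kept channel
    indices, so outputs can be scattered back to the full set of N variates.
    """
    M, N, D = grid.shape
    n_keep = max(1, int(N * keep_ratio))
    idx = torch.randperm(N)[:n_keep]      # fresh random subset each batch
    return grid[:, idx, :], idx
```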

4. Tradeoffs in Representation: Time-Centric vs. Variate-Centric

Single-view Transformer approaches traditionally collapse $N$ variates into a single embedding per timestep ("time token"), causing loss of covariate structure, or treat each channel as a series-wide "variate token," leading to loss of fine temporal fidelity. The Token-Grid Correlation Module, by simultaneously learning across both token axes, resolves this dichotomy:

  • Horizontal attention models trends and temporal dependencies per variate.
  • Vertical attention captures instantaneous couplings among variables.

Empirically, this architecture achieves state-of-the-art MSE/MAE on 26 out of 28 forecasting tasks, consistently outperforming unidirectional paradigms (e.g., PatchTST, iTransformer), especially with increased lookback windows (Cheng et al., 22 May 2024).

| Dataset | Channels (N) | Horizon | GridTST MSE | PatchTST MSE | iTransformer MSE |
|---|---|---|---|---|---|
| Weather | 21 | 336 | 0.243 | 0.247 | 0.255 |
| Traffic | 862 | 96 | 0.337 | 0.366 | 0.358 |
| Electricity | 321 | 720 | 0.186 | 0.210 | 0.207 |

This suggests the bidirectional module effectively scales predictive accuracy as data histories lengthen.

5. Patch Segmentation and Local Semantics

Segmenting each variate’s temporal axis into patches confers two operational benefits:

  1. Quadratic Time Complexity Reduction: Self-attention complexity decreases from $O(T^2 D)$ to $O(M^2 D)$ with $M \approx T/S$, a substantial saving for long input sequences.
  2. Localized Semantic Enrichment: Each patch encodes contiguous local patterns, augmenting the representation with richer subseries-level descriptors compared to single timepoint tokens.

Patch size $P$ and stride $S$ define the granularity of semantic capture and attention windowing, directly affecting both inductive bias and efficiency (Cheng et al., 22 May 2024).
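As an illustrative calculation (hypothetical values, not drawn from the paper): with $T = 512$, $P = 16$, and $S = 8$, patching yields $M = \lceil (512-16)/8 \rceil + 2 = 64$ tokens per variate, so the per-variate attention score matrix shrinks from $512^2 = 262{,}144$ entries to $64^2 = 4{,}096$, a 64-fold reduction before the factor of $D$.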

6. Token Correlation in Document Understanding: Compression by Redundancy Mining

Adaptation of the core correlation principle extends to multimodal document understanding via token-level correlation-guided compression (Zhang et al., 19 Jul 2024). Here, the module constructs a patch–patch cosine-similarity matrix in $\mathbb{R}^{N \times N}$ using normalized key vectors from a CLIP-ViT vision encoder. By thresholding on correlation magnitude and neighbor count, it measures redundancy $r$ and information density $d$:

$$d = 1 - r = 1 - \frac{N_R}{N}$$

CLS–patch attention distributions (from shallow and deep layers) guide both local (sampling proportional to $A_L$) and global (selecting outlier tokens by attention-weight IQR) mining of informative visual tokens. Selected tokens, optionally merged with nearest neighbors, form the compressed set.

This compression module

  • Is parameter-free and plug-and-play;
  • Achieves up to 34% reduction in token count and 1.3–1.5× processing speedups;
  • Maintains competitive accuracy on 10 benchmarks, outperforming rigid fixed-rate approaches such as PruMerge, especially when combined with brief fine-tuning (Zhang et al., 19 Jul 2024).
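A minimal sketch of the redundancy measurement described above; the similarity threshold and neighbor-count criterion are hypothetical settings, and the full selection pipeline additionally relies on the CLS–patch attention mining steps.

```python
import torch
import torch.nn.functional as F

def information_density(keys, sim_threshold=0.8, min_neighbors=1):
    """Estimate redundancy r and information density d = 1 - N_R / N from patch keys.

    keys: (N, D) key vectors of the N visual patch tokens (e.g. from a ViT layer).
    A patch counts as redundant when at least `min_neighbors` other patches
    exceed `sim_threshold` cosine similarity with it.
    """
    k = F.normalize(keys, dim=-1)
    sim = k @ k.T                                   # (N, N) patch–patch cosine similarity
    sim.fill_diagonal_(0.0)                         # exclude self-similarity
    neighbors = (sim > sim_threshold).sum(dim=-1)   # high-correlation neighbors per patch
    n_redundant = (neighbors >= min_neighbors).sum().item()   # N_R
    return 1.0 - n_redundant / keys.shape[0]        # d = 1 - r
```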

7. Empirical Results and Interpretations

In time series forecasting, the Token-Grid Correlation Module empowers GridTST to outperform prior SOTA approaches and sustain improvement as the lookback window $T$ increases (demonstrating enhanced absorption of long-term temporal and variate information). In document understanding, correlation-guided compression adapts token filtering dynamically to each instance and dataset, as confirmed by adaptive histograms and box plots contrasting fixed and adaptive methods.

A plausible implication is that token-grid correlation principles—by exploiting the full 2D structure of transformer-native embeddings—constitute a general design paradigm for efficient, scalable, and information-preserving sequence modeling across temporal, visual, and multimodal domains (Cheng et al., 22 May 2024, Zhang et al., 19 Jul 2024).
