
TEAFormer: Translation Equivariance in Transformers

Updated 8 July 2025
  • TEAFormer is an advanced neural architecture that enforces translation equivariance using slide indexing and adaptive mechanisms to ensure shift-consistent outputs.
  • It employs component stacking and parallel branches to balance local detail with global context, enhancing computational efficiency and training convergence.
  • TEAFormer achieves superior performance in image restoration by maintaining strict spatial alignment, outperforming traditional transformers that lack inherent translation equivariance.

The Translation Equivariance Adaptive Transformer (TEAFormer) refers to a class of neural network architectures, particularly within transformer models, that incorporate translation equivariance as an explicit inductive bias. This property ensures that a translated input yields a correspondingly translated output, a principle especially important for dense prediction tasks such as image restoration, where precise spatial alignment between input and output is required. TEAFormer architectures have been developed to address the breakdown of translation equivariance in standard transformers, especially those employing attention mechanisms with absolute positional encodings or global feature mixing, and to reconcile the need for both large receptive fields and efficient computation in deep vision models (2506.18520).

1. Motivation for Translation Equivariance in Transformers

Translation equivariance is the property that a function $f$ satisfies $f(T_\tau(x)) = T_\tau(f(x))$ for an input $x$ and translation operator $T_\tau$, i.e., shifting the input by $\tau$ shifts the output by $\tau$. Convolutional neural networks (CNNs) implement this property natively due to weight sharing across spatial positions, making them highly effective for spatially structured tasks. However, transformers, especially those with global self-attention and absolute positional encoding, often sacrifice this property. When applied to image restoration or tasks requiring precise spatial correspondence, such loss of translation equivariance can lead to output inconsistencies under spatial transformations, impaired generalization, and slowed training convergence (2506.18520).
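
To make the contrast concrete, the following minimal PyTorch sketch (illustrative, not taken from the paper) checks that a convolution with circular padding commutes with a cyclic spatial shift, while adding an absolute positional embedding before the same convolution breaks that property:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A convolution with circular padding is exactly equivariant to cyclic shifts
# (circular padding avoids boundary effects in this toy check).
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode="circular")

def translate(t, shift):
    # Cyclic spatial shift of a (B, C, H, W) tensor.
    return torch.roll(t, shifts=shift, dims=(-2, -1))

x = torch.randn(1, 1, 16, 16)
shift = (3, 5)  # shift by 3 rows and 5 columns

# Equivariance: conv(translate(x)) == translate(conv(x))
print(torch.allclose(conv(translate(x, shift)),
                     translate(conv(x), shift), atol=1e-6))       # True

# An absolute (position-indexed) embedding added to the input breaks it:
pos = torch.randn(1, 1, 16, 16)
print(torch.allclose(conv(translate(x, shift) + pos),
                     translate(conv(x + pos), shift), atol=1e-6))  # False
```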

TEAFormer architectures are designed to restore and adapt translation equivariance to transformer-based image and signal processing pipelines without relinquishing the modelling power and long-range context capabilities that attention offers.

2. Core Strategies for Translation Equivariance

TEAFormer incorporates two principal strategies to achieve translation equivariance in attention-based architectures:

2.1 Slide Indexing

Slide indexing is a mechanism whereby operator responses are indexed by fixed relative window positions rather than by absolute positions or tokens. In practice, this involves constructing outputs by aggregating information from a window around each query position that slides over the input with the same stride, regardless of input translation. The simplest instance of slide indexing is standard sliding window attention: for each position, attention is computed only within a local window centered on it, with the set of indices invariant to shifts of the input.

Mathematically, let $O_i$ denote the output at position $i$ and $W$ be an attention window with fixed offsets. Then for input $X$,

$$O_i = \sum_{j \in W} \alpha_{i,j} \, V(X_{i+j}),$$

where $\alpha_{i,j}$ are attention weights. If the input is translated by $\tau$, all indices shift by $\tau$, but the relative window structure is preserved, ensuring $O_{i+\tau}(X_{(\cdot+\tau)}) = O_i(X)$.
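
A minimal PyTorch sketch of slide-indexed (sliding-window) attention corresponding to this formula is given below; the function name, the circular padding, and the omission of query/key/value projections are simplifications for illustration, not the paper's implementation.

```python
import math
import torch
import torch.nn.functional as F

def slide_indexed_attention(x, window=7):
    """Sliding-window self-attention with fixed relative offsets.

    x: (B, C, H, W) feature map. Queries, keys, and values are the features
    themselves (learned projections omitted for clarity); circular padding
    keeps this toy version exactly shift-equivariant at the borders.
    """
    B, C, H, W = x.shape
    p = window // 2
    x_pad = F.pad(x, (p, p, p, p), mode="circular")

    # Gather the k*k neighbourhood around every position: (B, C, k*k, H*W).
    neigh = F.unfold(x_pad, kernel_size=window).view(B, C, window * window, H * W)

    q = x.view(B, C, 1, H * W)                      # query at each window centre
    scores = (q * neigh).sum(dim=1) / math.sqrt(C)  # (B, k*k, H*W)
    attn = scores.softmax(dim=1)                    # weights alpha_{i,j} over the window
    out = (neigh * attn.unsqueeze(1)).sum(dim=2)    # (B, C, H*W)
    return out.view(B, C, H, W)

# Shift-equivariance check: the output of a shifted input equals the shifted output.
x = torch.randn(1, 8, 16, 16)
shifted = torch.roll(x, shifts=(2, 3), dims=(-2, -1))
print(torch.allclose(slide_indexed_attention(shifted),
                     torch.roll(slide_indexed_attention(x), (2, 3), (-2, -1)),
                     atol=1e-5))  # True
```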

2.2 Component Stacking

Component stacking refers to assembling multiple translation-equivariant operators either in series (composition) or in parallel (summation). The design guarantees that the resulting architecture is equivariant, as both the composition and sum of equivariant operators remain equivariant. For example, stacking several slide-indexed attention blocks, possibly interleaved with equivariant convolutions, does not break translation equivariance. This property enables construction of deep and expressive architectures while maintaining strict translation correspondence between input and output (2506.18520).
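
As an illustrative sketch (not the paper's code), the closure property can be exercised directly: serial composition via nn.Sequential and parallel summation of shift-equivariant modules yield a block that is still shift-equivariant.

```python
import torch
import torch.nn as nn

class ParallelSum(nn.Module):
    """Parallel branch: the sum of two translation-equivariant branches stays equivariant."""
    def __init__(self, branch_a, branch_b):
        super().__init__()
        self.a, self.b = branch_a, branch_b

    def forward(self, x):
        return self.a(x) + self.b(x)

# Serial composition of equivariant operators is also equivariant.
block = nn.Sequential(
    nn.Conv2d(8, 8, 3, padding=1, padding_mode="circular"),
    nn.GELU(),  # pointwise nonlinearity, hence equivariant
    ParallelSum(
        nn.Conv2d(8, 8, 3, padding=1, padding_mode="circular"),
        nn.Conv2d(8, 8, 5, padding=2, padding_mode="circular"),
    ),
)

x = torch.randn(1, 8, 16, 16)
shift = lambda t: torch.roll(t, (4, 1), (-2, -1))
print(torch.allclose(block(shift(x)), shift(block(x)), atol=1e-5))  # True
```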

3. Adaptive Sliding Indexing Mechanism

A limitation of strict slide indexing (such as using only fixed local windows) is the trade-off between receptive field size and computational complexity: large windows confer global context but incur quadratic cost, while small windows limit expressivity. TEAFormer introduces an adaptive sliding indexing mechanism that selects the set of key-value pairs for each query within a local window in a data-dependent way, based on features extracted by small convolutional kernels. These adaptive indices allow dynamic reordering or selection of local features while still preserving equivariance, since the underlying selection mechanism is built from translation-equivariant convolutions.

To further balance local detailed context and global information, TEAFormer introduces a parallel branch: Downsampled Self Attention (DSA), which aggregates global information at lower resolution using translation-equivariant pooling. The two branches—adaptive sliding (local) and downsampled (global)—are concatenated or aggregated via learnable weights (2506.18520):

$$\mathrm{TEA}(X) = \alpha_s \cdot \mathrm{ASkvSA}(X) + \alpha_d \cdot \mathrm{DSA}(X),$$

where $\mathrm{ASkvSA}$ denotes adaptive slide-indexed self-attention and $\mathrm{DSA}$ is downsampled self-attention.
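
A minimal sketch of this parallel fusion with learnable scalar weights follows; the DownsampledSelfAttention class and the depthwise-convolution stand-in for the adaptive local branch are assumptions made to keep the example self-contained, not the paper's modules (note that strided pooling is exactly equivariant only to shifts that are multiples of the stride).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampledSelfAttention(nn.Module):
    """Global self-attention computed at reduced resolution (stand-in for DSA)."""
    def __init__(self, channels, stride=4):
        super().__init__()
        self.stride = stride
        self.attn = nn.MultiheadAttention(channels, num_heads=1, batch_first=True)

    def forward(self, x):
        B, C, H, W = x.shape
        low = F.avg_pool2d(x, self.stride)            # pool to a coarse token grid
        tokens = low.flatten(2).transpose(1, 2)       # (B, L, C)
        out, _ = self.attn(tokens, tokens, tokens)    # global mixing at low resolution
        out = out.transpose(1, 2).reshape(B, C, H // self.stride, W // self.stride)
        return F.interpolate(out, size=(H, W), mode="nearest")

class TEA(nn.Module):
    """Learnable fusion of a local (sliding) branch and a global (downsampled) branch."""
    def __init__(self, channels):
        super().__init__()
        # Stand-in for ASkvSA: a circular depthwise convolution keeps the sketch
        # self-contained; see the sliding-window attention sketch in Section 2.1.
        self.local = nn.Conv2d(channels, channels, 7, padding=3,
                               padding_mode="circular", groups=channels)
        self.globl = DownsampledSelfAttention(channels, stride=4)
        self.alpha_s = nn.Parameter(torch.tensor(1.0))  # weight on the local branch
        self.alpha_d = nn.Parameter(torch.tensor(1.0))  # weight on the global branch

    def forward(self, x):
        return self.alpha_s * self.local(x) + self.alpha_d * self.globl(x)

print(TEA(16)(torch.randn(2, 16, 32, 32)).shape)  # torch.Size([2, 16, 32, 32])
```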

4. TEAFormer Network Architecture

The TEAFormer network typically integrates the above mechanisms into a hierarchical architecture as follows:

  • Shallow feature extraction: An initial convolutional layer extracts low-level spatial features, maintaining translation equivariance.
  • Deep feature extraction: The main body consists of multiple Translation Equivariance Groups (TEGs), each containing several Translation Equivariance Blocks (TEBs), where each TEB includes one TEA module (as defined above) and a feed-forward layer.
  • Aggregation and upsampling: The network head integrates multi-scale features, optionally using upsampling or residual connections, to generate restored outputs that are guaranteed to shift in correspondence with the input.

Parallel and serial arrangements of TEA modules and equivariant feed-forward pathways enable complex representational capacity while maintaining the desired shift-equivariance at every layer.
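
One way these pieces could be arranged is sketched below; the block counts, channel widths, pixel-shuffle upsampler, and the placeholder standing in for the TEA module are all assumptions for illustration, not the published configuration.

```python
import torch
import torch.nn as nn

def tea_placeholder(channels):
    # Stand-in for the TEA module of Section 3 (any shift-equivariant mixer works here).
    return nn.Conv2d(channels, channels, 7, padding=3,
                     padding_mode="circular", groups=channels)

class TEB(nn.Module):
    """Translation Equivariance Block: one TEA module plus a feed-forward layer."""
    def __init__(self, channels):
        super().__init__()
        self.tea = tea_placeholder(channels)
        self.ffn = nn.Sequential(  # 1x1 convolutions are position-wise, hence equivariant
            nn.Conv2d(channels, 2 * channels, 1), nn.GELU(),
            nn.Conv2d(2 * channels, channels, 1),
        )

    def forward(self, x):
        x = x + self.tea(x)        # residual connections preserve equivariance
        return x + self.ffn(x)

class TEAFormerSketch(nn.Module):
    """Shallow conv -> stacked TEGs (groups of TEBs) -> reconstruction head."""
    def __init__(self, in_ch=3, channels=48, groups=4, blocks_per_group=4, scale=2):
        super().__init__()
        self.shallow = nn.Conv2d(in_ch, channels, 3, padding=1, padding_mode="circular")
        self.body = nn.Sequential(*[
            nn.Sequential(*[TEB(channels) for _ in range(blocks_per_group)])
            for _ in range(groups)
        ])
        self.head = nn.Sequential(  # pixel-shuffle upsampling for restoration/SR outputs
            nn.Conv2d(channels, in_ch * scale ** 2, 3, padding=1, padding_mode="circular"),
            nn.PixelShuffle(scale),
        )

    def forward(self, x):
        feat = self.shallow(x)
        return self.head(feat + self.body(feat))

print(TEAFormerSketch()(torch.randn(1, 3, 32, 32)).shape)  # torch.Size([1, 3, 64, 64])
```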

5. Performance and Empirical Evaluation

TEAFormer demonstrates improved effectiveness on image restoration benchmarks, delivering superior spatial consistency, better generalization, and faster training convergence than previous transformer- or CNN-based baselines. Key empirical observations, as reported, include:

  • Higher quantitative metrics: TEAFormer achieves substantial improvements in PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index) on super-resolution and denoising tasks compared to non-equivariant transformer models.
  • Faster training convergence: The architecture shows lower Neural Tangent Kernel (NTK) condition numbers, indicating more stable and rapid convergence.
  • Generalization: TEAFormer generalizes better to unseen spatial degradations, owing to its strict enforcement of translation equivariance across all layers.
  • Computational efficiency: By combining local adaptive attention and global downsampling in parallel, TEAFormer maintains linear complexity in the number of tokens, similar to sliding window attention approaches, while retaining access to large effective receptive fields.

6. Theoretical Properties and Design Guarantees

The design of TEAFormer is underpinned by provable mathematical properties:

  • Closure of translation equivariance under sum and composition: it is proven formally that the sum and (function) composition of translation-equivariant operators are themselves translation equivariant, ensuring that complex assemblies remain equivariant if constructed from such modules.
  • Equivariance of adaptive sliding: Since the adaptive indexing in ASkvSA is implemented by convolutional operators, which are themselves translation equivariant, the entire adaptive selection remains shift-consistent.
  • End-to-end equivariance: All modules—attention, skip connections, and feed-forward layers—are constructed or constrained so that any spatial translation of the input is matched exactly by the same translation in the output.
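
These guarantees are straightforward to probe empirically. A minimal check (illustrative, and assuming periodic boundaries so that a cyclic roll is an exact translation) shifts the input and compares against the correspondingly shifted output:

```python
import torch

def check_translation_equivariance(model, x, shift=(3, 5), atol=1e-5):
    """Return True if model(roll(x)) matches roll(model(x)) for a cyclic shift.

    Assumes the model keeps spatial resolution or scales it by an integer
    factor (e.g. x2 super-resolution), in which case the output shift scales too.
    """
    model.eval()
    with torch.no_grad():
        y = model(x)
        y_from_shifted = model(torch.roll(x, shift, dims=(-2, -1)))
    scale = y.shape[-1] // x.shape[-1]
    out_shift = (shift[0] * scale, shift[1] * scale)
    return torch.allclose(y_from_shifted,
                          torch.roll(y, out_shift, dims=(-2, -1)), atol=atol)

# Example: any circularly padded convolution passes the check.
conv = torch.nn.Conv2d(3, 3, 3, padding=1, padding_mode="circular")
print(check_translation_equivariance(conv, torch.randn(1, 3, 16, 16)))  # True
```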

7. Relation to Other Approaches and Applicability

Unlike some earlier efforts that incorporate equivariance via group convolution (as in CubeNet (1804.04458), Harmonic Networks (1612.04642), or Harmformer (2411.03794)) or through relative positional encodings in attention (as in translationally equivariant kernelizable attention (2102.07680)), TEAFormer’s innovation lies in (1) systematically replacing all position-sensitive mechanisms with slide-indexed and translation-equivariant components, and (2) adaptively balancing local and global context using parallel branches with learnable fusion.

TEAFormer is particularly well-suited for dense prediction, image restoration, and any vision domain where the preservation of translation structure is critical—such as super-resolution, denoising, inpainting, and pixel-accurate segmentation. By restoring this key inductive bias to attention-based models, TEAFormer bridges the gap between the detail-fidelity of convolutions and the context-modelling power of transformers (2506.18520).


In summary, the Translation Equivariance Adaptive Transformer (TEAFormer) is characterized by slide indexing and component stacking strategies, an adaptive sliding indexing mechanism, and parallel aggregation of local and global attention. These properties collectively enable strictly translation-equivariant transformer architectures that excel in image restoration, overcoming the limitations of standard attention-based models lacking such inductive bias (2506.18520).