Efficient Transformer Encoders (ECO-M2F)
- One variant replaces the scaled dot-product attention score with a multiplication-free Laplacian kernel, trading multiplications for additions and absolute differences to cut energy cost.
- Complementary techniques, dynamic encoder-depth selection and progressive token-length scaling, adapt computation per input and reduce encoder GFLOPs by up to 52%.
- Empirical benchmarks show these methods maintain or improve accuracy across NLP, bioinformatics, and vision tasks while lowering overall compute and power consumption.
Efficient Transformer Encoders (ECO-M2F) comprise a class of techniques and architectures for reducing the computational and energy cost of transformer-based models in vision, language, and scientific domains. These approaches maintain or improve task accuracy while significantly reducing compute requirements. Prominent variants include: (1) multiplication-free attention mechanisms such as the Laplacian kernel-based ECO-M2F attention, (2) dynamic encoder-depth selection tailored to each input instance, and (3) progressive scaling of token sequence length throughout the encoding process. Modern encoder “toolkits” allow for further efficiency via segment-wise partial attention, selective computation, or architectural hybridization. This article details the main algorithmic designs, complexity analyses, implementation strategies, and benchmarking results underlying ECO-M2F approaches in recent research.
1. Multiplication-Free Attention: Mathematical Foundations and Algorithm
ECO-M2F replaces the conventional scaled dot-product attention in transformers with a multiplication-free Laplacian kernel operating over the distance between projected queries and keys (Gao et al., 27 Jul 2025). Given input tokens $X \in \mathbb{R}^{N \times D}$, the standard projections
$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$$
(where $W_Q, W_K, W_V \in \mathbb{R}^{D \times H D_k}$) remain unchanged. The core replacement is the attention score
$$s_{ij} = \exp\!\left(-\frac{\lVert q_i - k_j \rVert_1}{\tau}\right) = \exp\!\left(-\frac{1}{\tau} \sum_{m=1}^{D_k} \lvert q_{im} - k_{jm} \rvert\right),$$
with normalization
$$a_{ij} = \frac{s_{ij}}{\sum_{j'} s_{ij'}}, \qquad c_i = \sum_{j} a_{ij}\, v_j,$$
where $\tau > 0$ is a temperature parameter.
This Laplacian-form “kernel convolution” eliminates all multiplications in the attention scoring stage, replacing them with absolute-difference and addition operations. The multi-head extension reshapes $Q$, $K$, $V$ to shape $[H, N, D_k]$ for $H$ attention heads. Per head, score computation is performed with only addition, absolute-value, and exponential operations. The final linear mixing of heads reintroduces a matrix multiply, but this stage is shared with all transformer architectures.
Pseudocode for ECO-M2F Multi-Head Attention (score computation is multiplication-free):
```python
import numpy as np

def ECO_M2F_MultiHeadAttention(X, W_Q, W_K, W_V, W_O, H, D_k, tau):
    # X: [N, D] input tokens; W_Q, W_K, W_V: [D, H*D_k]; W_O: [H*D_k, D_out]
    N = X.shape[0]
    Q_all = X @ W_Q                                   # [N, H*D_k] standard projection
    K_all = X @ W_K
    V_all = X @ W_V
    Q = Q_all.reshape(N, H, D_k).transpose(1, 0, 2)   # [H, N, D_k]
    K = K_all.reshape(N, H, D_k).transpose(1, 0, 2)
    V = V_all.reshape(N, H, D_k).transpose(1, 0, 2)

    C = np.zeros((H, N, D_k))
    for h in range(H):
        for i in range(N):
            # Multiplication-free scoring: L1 distance via abs + add, Laplacian kernel
            d = np.abs(Q[h, i] - K[h]).sum(axis=-1)   # [N]
            score = np.exp(-d / tau)
            weights = score / score.sum()             # normalize over keys
            C[h, i] = weights @ V[h]                  # aggregate values

    C_cat = C.transpose(1, 0, 2).reshape(N, H * D_k)  # concatenate heads
    return C_cat @ W_O                                # shared output projection
```
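A quick usage sketch (the shapes, random weights, and $\tau$ value below are illustrative assumptions, not values from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, D_k = 8, 32, 4, 8                    # illustrative sizes
X = rng.standard_normal((N, D))
W_Q, W_K, W_V = (0.1 * rng.standard_normal((D, H * D_k)) for _ in range(3))
W_O = 0.1 * rng.standard_normal((H * D_k, D))

# tau is chosen arbitrarily here; the source does not pin down a specific value
out = ECO_M2F_MultiHeadAttention(X, W_Q, W_K, W_V, W_O, H, D_k, tau=float(np.sqrt(D_k)))
print(out.shape)                              # (8, 32)
```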
2. Complexity, Energy, and Hardware Implications
Both conventional and ECO-M2F attention have $O(N^2 D)$ arithmetic complexity per layer. The crucial distinction is in the operation type:
- Standard attention: $O(N^2 D_k)$ floating-point multiplications per head for score computation, plus $O(N^2 D_k)$ additions.
- ECO-M2F: the score-stage multiplications are replaced with absolute differences and additions.
Energy cost per operation (per Horowitz, 2014; 45 nm estimates):
- 32-bit FP multiply: ≈ 3.7 pJ
- 32-bit FP addition: ≈ 0.9 pJ
Theoretical module-level energy savings from eliminating score multiplications are therefore substantial within the attention block. Empirical inference measurements show ECO-M2F latency stays close to that of dot-product attention on matrix-multiply-optimized hardware (GPUs), while projected ASIC implementations could realize larger wall-clock and power reductions in the attention block (Gao et al., 27 Jul 2025).
Memory bandwidth requirements remain unchanged, as both approaches read the same projected $Q$, $K$, $V$ values and write $O(N^2)$ attention scores per head.
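As a back-of-envelope illustration of the module-level energy argument (a sketch using the per-operation figures above; the sequence length, head count, head dimension, and the treatment of absolute value as an add-like operation are assumptions):

```python
# Rough score-stage energy per attention layer under illustrative assumptions.
E_MUL = 3.7e-12   # J per 32-bit FP multiply (Horowitz, 2014)
E_ADD = 0.9e-12   # J per 32-bit FP add; |.| treated as add-like (assumption)

def score_stage_energy(N, H, D_k):
    ops = H * N * N * D_k                    # pairwise query-key feature interactions
    dot_product = ops * (E_MUL + E_ADD)      # multiply + accumulate per interaction
    laplacian = ops * (2 * E_ADD)            # subtract/abs + accumulate
    return dot_product, laplacian

dp, lap = score_stage_energy(N=1024, H=8, D_k=64)
print(f"dot-product: {dp * 1e6:.0f} uJ, Laplacian: {lap * 1e6:.0f} uJ, "
      f"saving: {100 * (1 - lap / dp):.0f}%")
```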
3. Dynamic Encoder Depth Selection in Mask2Former-Style Architectures
ECO-M2F for segmentation tasks (Mask2Former or “M2F”-style) introduces a three-step input-adaptive encoder truncation strategy (Yao et al., 23 Apr 2024):
- Early-exit training: All encoder blocks are connected to the decoder head, allowing output after every layer. The loss is summed over all exits, weighted to encourage performance at both early and deep exits:
$$\mathcal{L}_{\text{train}} = \sum_{l=1}^{L} w_l\, \mathcal{L}_l,$$
with the weights $w_l$ increasing in $l$. Each layer is thereby optimized to be a valid stopping point.
- Derived dataset generation: For each training input $x$, the "ideal" exit layer $l^*(x)$ is determined by maximizing the per-exit quality metric $Q_l(x)$ (e.g., panoptic quality) penalized by a compute term weighted by $\beta$, which controls the compute-performance tradeoff.
- Gating network training: A lightweight gating head predicts the exit layer from pooled backbone features. A softmax over its $L$ outputs yields per-layer probabilities, and the head is trained with cross-entropy against the derived labels $l^*(x)$.
At inference, the gating network selects a depth $\hat{l}$ per sample, and only the first $\hat{l}$ encoder layers are executed, producing a contextually adaptive tradeoff between speed and accuracy (a sketch of this logic follows below).
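A minimal sketch of the derived-label and depth-selection logic, assuming per-exit quality scores and per-layer compute costs are available as arrays and that the utility takes the quality-minus-weighted-cost form implied above (the function and argument names are illustrative, not from the source):

```python
import numpy as np

def derive_exit_label(quality_per_exit, cost_per_exit, beta):
    """Choose the exit layer that best trades per-exit quality against compute.

    quality_per_exit: [L] quality (e.g., PQ) of exiting after each encoder layer.
    cost_per_exit:    [L] compute cost of running the first l encoder layers.
    beta:             compute-performance tradeoff (larger -> earlier exits).
    """
    utility = np.asarray(quality_per_exit) - beta * np.asarray(cost_per_exit)
    return int(np.argmax(utility))            # 0-indexed "ideal" exit layer

def select_depth(gate_logits):
    """Inference-time depth selection from the gating head's logits over L depths."""
    probs = np.exp(gate_logits - np.max(gate_logits))
    probs /= probs.sum()                      # softmax over candidate depths
    return int(np.argmax(probs))              # execute only the first (depth + 1) layers
```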
4. Progressive Token Length Scaling: Stage-wise Encoder Pruning
PRO-SCALE implements progressive scaling of token sequence length through the transformer stack (Aich et al., 23 Apr 2024). The Mask2Former encoder (which keeps multi-scale tokens concatenated at full length in all layers) is split into three stages:
- Stage 1: Only the coarsest-scale ($1/32$) features ($N_{32}$ tokens)
- Stage 2: Coarse- and medium-scale features ($1/32$ and $1/16$; $N_{32} + N_{16}$ tokens)
- Stage 3: Full multi-scale features ($1/32$, $1/16$, and $1/8$; $N_{32} + N_{16} + N_{8}$ tokens)
Formally, with $n_s$ layers in stage $s$ operating on $N_s$ tokens ($N_1 < N_2 < N_3$), the total encoder FLOPs scale as $\sum_{s=1}^{3} n_s \cdot \mathrm{FLOPs}_{\text{layer}}(N_s)$ and are significantly reduced compared to the baseline, which runs all $n_1 + n_2 + n_3$ layers on the full $N_3$ multi-scale tokens. For the $(3,3,3)$ configuration, encoder GFLOPs are reduced by roughly 52% compared to baseline Mask2Former, with total GFLOPs down by roughly 27% (see the table in Section 5). Empirical results on COCO and Cityscapes confirm that, with suitable schedules, segmentation accuracy is maintained or slightly improved relative to the baseline; a cost-accounting sketch follows below.
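A minimal cost-accounting sketch, under the simplifying assumption that per-layer encoder cost is proportional to the number of tokens processed (the token counts and stage split below are illustrative; actual per-layer constants depend on the attention variant):

```python
def encoder_cost(layers_per_stage, tokens_per_stage):
    """Relative encoder cost, assuming per-layer cost scales with tokens processed."""
    return sum(n * t for n, t in zip(layers_per_stage, tokens_per_stage))

# Illustrative token counts for the 1/32, 1/32+1/16, and 1/32+1/16+1/8 scale sets
# (e.g., roughly 1k, 5k, and 21k tokens for a 1024x1024 input).
tokens = (1024, 1024 + 4096, 1024 + 4096 + 16384)
layers = (3, 3, 3)                                      # assumed stage split

baseline = encoder_cost(layers, (tokens[2],) * 3)       # every layer sees full-length tokens
pro_scale = encoder_cost(layers, tokens)                # progressive token-length scaling

print(f"relative encoder cost: {pro_scale / baseline:.2f}")   # ~0.43 under these assumptions
```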
Auxiliary techniques such as Token Re-Calibration (TRC) and Light Pixel Embedding (LPE) further improve efficiency with minimal performance sacrifice.
5. Empirical Performance and Benchmarks
NLP, Bioinformatics, and Vision
Replacing scaled dot-product attention with the ECO-M2F Laplacian kernel yields:
- NLP tasks: comparable or improved accuracy relative to the dot-product baseline on SciQ, StoryCloze, HellaSwag, and BoolQ.
- Bioinformatics/Vision: substantial gains reported on TCGA, METABRIC, VDJdb, and CIFAR-10 (Gao et al., 27 Jul 2025).
- Energy cost: a substantial theoretical reduction in attention-module power usage, with modest end-to-end savings measured on GPUs and larger reductions projected for future ASICs.
Segmentation
ECO-M2F dynamic depth and PRO-SCALE both deliver substantial cost savings:
| Model | PQ | Total GFLOPs | Encoder GFLOPs | ΔPQ | ΔEnc GFLOPs |
|---|---|---|---|---|---|
| Mask2Former (baseline) | 52.03 | 234.5 | 117.0 | – | – |
| PRO-SCALE (3,3,3) | 52.82 | 171.7 | 56.18 | +0.79 | –52.0% |
| ECO-M2F dynamic depth | ≈ baseline | – | – | negligible | –20% to –30% (Yao et al., 23 Apr 2024) |
PRO-SCALE and ECO-M2F techniques also generalize to detection (e.g., DINO with a ResNet-50 backbone, where encoder GFLOPs are reduced while AP is maintained or improved).
6. Integration with Other Efficiency Strategies
ECO-M2F attention is a direct drop-in replacement for the softmax + scaled dot-product operation in any transformer encoder or decoder; all linear projection layers and residual/MLP blocks remain unchanged (Gao et al., 27 Jul 2025); a minimal sketch of the swap appears at the end of this section. The approach composes with:
- Delayed interaction layers: Partial/local self-attention in early layers, global attention in the final layers, further reducing complexity in multi-segment applications such as open-domain QA (Siblini et al., 2020).
- Sparse/low-rank attention: ECO-M2F’s Laplacian kernel can replace the dot-product kernel in sparse/approximate attention schemes (Longformer, Linformer, Performer).
- Dynamic depth/truncation: Depth gating, as in segmentation, may be adopted for text or multimodal transformers with similar early-exit or adaptive layer-count strategies.
- Progressive token scaling: Sequence shortening or segmentation can be staged throughout the encoder for further quadratic cost reduction.
These strategies can be compounded to match accuracy requirements and hardware constraints, with some requiring only minor retraining of lightweight auxiliary modules (e.g., gating head).
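A minimal sketch of the drop-in substitution, assuming a generic single-head attention routine parameterized by a score function (function names are illustrative, not from any library):

```python
import numpy as np

def dot_product_scores(Q, K, tau):
    return np.exp((Q @ K.T) / tau)                # conventional unnormalized softmax scores

def laplacian_scores(Q, K, tau):
    # Multiplication-free scoring: pairwise L1 distances via abs + add
    d = np.abs(Q[:, None, :] - K[None, :, :]).sum(axis=-1)
    return np.exp(-d / tau)

def attention(Q, K, V, tau, score_fn=laplacian_scores):
    S = score_fn(Q, K, tau)                       # [N, N] unnormalized scores
    A = S / S.sum(axis=-1, keepdims=True)         # row-normalize
    return A @ V                                  # aggregate values

# Swapping score_fn is the only change; projections, residuals, and MLPs stay intact.
```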
7. Practical Implementation and Trade-offs
Adoption of ECO-M2F requires only minimal modifications to transformer codebases: swapping the attention score computation for the Laplacian convolutional kernel, and, for dynamic depth methods, incorporating the gated early-exit mechanism and associated loss terms.
A single tunable parameter in dynamic depth methods (the compute-performance tradeoff weight $\beta$) yields a Pareto frontier of compute vs. task performance; efficient adaptation to reduced budgets is possible by retraining only the gating module. PRO-SCALE’s adjustment of stage widths and depths balances early coarse-scale computation with progressively full multi-scale attention in later layers.
Measured on standard benchmarks and public datasets, all ECO-M2F methods exhibit:
- Encoder cost reductions ranging from 20%–52%
- Minimal impact on core segmentation, classification, or detection metrics
- Adaptation potential to different backbone architectures and application domains
References
- EcoTransformer: Attention without Multiplication (Gao et al., 27 Jul 2025)
- Efficient Transformer Encoders for Mask2Former-style models (Yao et al., 23 Apr 2024)
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation (Aich et al., 23 Apr 2024)
- Delaying Interaction Layers in Transformer-based Encoders for Efficient Open Domain Question Answering (Siblini et al., 2020)