
Attention-based Double Compression (ADC)

Updated 20 September 2025
  • Attention-based Double Compression (ADC) is a technique that uses attention mechanisms to compress data along two axes, enhancing semantic retention.
  • It employs a two-stage process where activations are first clustered using attention scores and then refined by selecting top tokens or features.
  • ADC is applied in split learning, model pruning, and image restoration, yielding significant gains in communication efficiency and reductions in computational cost.

Attention-based Double Compression (ADC) refers to methods leveraging attention mechanisms, often in deep neural architectures, to realize compression along two distinct, often orthogonal, dimensions. Recent research has formalized ADC in the context of communication-efficient split learning, model-based compression, post-processing of image compression artifacts, and context-sensitive lossless coding. Across these domains, ADC exploits learned attention for identifying, merging, and retaining only the most semantically meaningful representations while discarding redundancy or less informative elements.

1. ADC Frameworks and Core Principles

ADC encompasses frameworks wherein attention mechanisms are deployed to guide structured compression, frequently in two stages. In split learning for Vision Transformers (“ViTs”), ADC implements (i) batch-wise compression via merging activations according to similarity in attention distributions, and (ii) token-wise compression via selection of the most informative tokens, typically evaluated through softmax-normalized attention scores of the class token (CLS) (Alvetreti et al., 18 Sep 2025). In model compression, ADC may refer to reinforcement learning agents that sequentially attend to layer-specific attributes, determining fine-grained sparsity ratios and applying pruning/reduction at structural and computational levels (Hakkak, 2018). In post-compression artifact removal, attention-enhanced networks adaptively recalibrate channel/spatial information to restore fidelity (Xue et al., 2019).

A common thread is the dual application of attention: first to identify or group high-importance regions (across samples, layers, or spatial/feature channels), and second to further distill representations by retaining only those components most pertinent to the downstream task or objective.

2. Compression Methodologies Leveraging Attention

ADC in split learning of ViTs (Alvetreti et al., 18 Sep 2025) employs a two-stage compression pipeline:

  • Batch Compression: Activations $z \in \mathbb{R}^{n \times d}$ per sample are characterized by CLS attention scores. Samples are grouped into $T$ clusters via K-means over these scores. Assignment indicator $\mathbb{1}(z, i)$ directs samples to their closest centroid $C_j$ according to:

$$\mathbb{1}(z, i) = \begin{cases} 1 & \text{if } i = \arg\min_j \|CLS\_score(z) - C_j\|^2 \\ 0 & \text{otherwise} \end{cases}$$

Averaged cluster activations and labels result in compressed, class-agnostic super-samples.

  • Token Selection: Within each merged activation, the top-$k$ tokens (those with the highest average attention scores) are retained; the remaining tokens are discarded.

This two-stage compression achieves an overall ratio of

$$\xi_{ADC} = \frac{T}{B} \cdot \frac{k}{n}$$

where $B$ is the original batch size and $n$ is the number of tokens per sample.
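
The pipeline above can be summarized in a short sketch. The following is a minimal NumPy/scikit-learn illustration, assuming CLS attention scores are already available at the split point; the array shapes, the helper name adc_compress, and the use of K-means from scikit-learn are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def adc_compress(activations, cls_scores, labels, T, k):
    """Two-stage ADC sketch: batch clustering over CLS scores, then top-k token selection.

    activations: (B, n, d) split-layer activations for a batch of B samples
    cls_scores:  (B, n) per-token attention scores from the CLS query
    labels:      (B, C) one-hot (or soft) label vectors
    T:           number of clusters (merged "super-samples") to transmit
    k:           number of tokens to retain per merged activation
    """
    B, n, d = activations.shape

    # Stage 1: batch compression. Cluster samples by their CLS score profiles and
    # average activations/labels within each cluster (class-agnostic merging).
    assignments = KMeans(n_clusters=T, n_init=10).fit_predict(cls_scores)
    merged_acts = np.stack([activations[assignments == j].mean(axis=0) for j in range(T)])
    merged_scores = np.stack([cls_scores[assignments == j].mean(axis=0) for j in range(T)])
    merged_labels = np.stack([labels[assignments == j].mean(axis=0) for j in range(T)])

    # Stage 2: token selection. Keep only the k tokens with the highest
    # averaged attention score in each merged activation.
    top_k = np.argsort(merged_scores, axis=1)[:, -k:]
    compressed = np.take_along_axis(merged_acts, top_k[..., None], axis=1)

    # Overall compression ratio xi_ADC = (T / B) * (k / n)
    ratio = (T / B) * (k / n)
    return compressed, merged_labels, ratio
```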

For deep model compression, ADC agents (see (Hakkak, 2018)) observe a vector of layer characteristics, output a continuous-valued compression ratio $a \in (0, 1]$ per layer, and pursue a reward balancing error and computation via

$$R_{FLOPS} = -\text{Error} \times \log(\text{FLOPs})$$

where FLOPs measures computational cost.
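
As a concrete reading of this reward, the short Python sketch below shows the reward computation and the clipping of a raw policy output into a valid per-layer ratio; the function names and the lower clipping bound are illustrative assumptions.

```python
import math

def flops_aware_reward(error: float, flops: float) -> float:
    """Reward balancing accuracy and compute: R = -Error * log(FLOPs).

    Lower error and lower FLOPs both increase the reward; the log keeps
    very large models from dominating the compute term.
    """
    return -error * math.log(flops)

def clip_action(raw_action: float, eps: float = 1e-3) -> float:
    """Map a raw policy output to a valid per-layer compression ratio a in (0, 1]."""
    return min(max(raw_action, eps), 1.0)
```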

Post-processing CNNs for image compression utilize combined channel and spatial attention mechanisms within residual blocks to recalibrate feature maps, optimizing for both pixel-wise MAE and multi-scale structural similarity (MS-SSIM) (Xue et al., 2019).

3. Attention Scoring and Selection Mechanisms

The attention engine is typically instantiated as a score computed from self- or cross-attention modules. In Vision Transformers (Alvetreti et al., 18 Sep 2025), attention for each token is defined as the mean softmax output across all heads from the CLS query:

$$CLS\_score(z) \approx \text{average}_{\text{heads}} \left( \text{softmax}\left( \frac{QK^\top}{\sqrt{d_k}} \right) \right)$$

These scores underpin both clustering for batch compression and selection for token reduction.
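
A minimal PyTorch sketch of this scoring step follows, assuming access to the attention probabilities of a transformer block with the CLS token at index 0; the tensor layout (batch, heads, tokens, tokens) is an assumption about how the attention maps are exposed.

```python
import torch

def cls_scores(attn_probs: torch.Tensor) -> torch.Tensor:
    """Average CLS-query attention over heads.

    attn_probs: (B, H, n, n) softmax(QK^T / sqrt(d_k)) from one transformer block,
                with the CLS token assumed at index 0.
    returns:    (B, n) per-token importance scores used for clustering and top-k selection.
    """
    # Row 0 of the attention matrix is the CLS query attending to all tokens.
    return attn_probs[:, :, 0, :].mean(dim=1)
```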

Channel and spatial attention (see (Xue et al., 2019)) compute weights per feature channel or spatial region, multiplying the original features by attention-derived importance coefficients:

  • Channel attention: $y_c = f(x_c) \cdot w_c$.
  • Spatial attention: $y_s(i,j) = x_s(i,j) \cdot w_s(i,j)$.
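
The gating pattern in the two bullets above can be written as a small PyTorch module; the squeeze ratio, kernel size, and layer composition below are assumptions rather than the exact architecture of (Xue et al., 2019).

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Recalibrate features with channel weights w_c and spatial weights w_s(i, j)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze spatially, excite per channel.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: one weight per spatial location.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)   # y_c = f(x_c) * w_c
        x = x * self.spatial_gate(x)   # y_s(i, j) = x_s(i, j) * w_s(i, j)
        return x
```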

In Bayesian Attention Networks (BAN) for data compression (Tetelman, 2021), attention factors $\rho_i(s)$ are derived from loss correlations and incorporated into Bayesian predictive probabilities:

$$P(s \mid \{s_i\}, H) = \int_{w} P(s \mid w) \prod_{i} e^{-\rho_i(s)\, l(s_i \mid w)}\, P_0(w \mid H)\, dw$$
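
A plain Monte Carlo reading of this predictive integral is sketched below, assuming parameter samples from the prior $P_0(w \mid H)$ are available; the self-normalization over the attention-weighted likelihood and all function names are assumptions for illustration, not the construction used in (Tetelman, 2021).

```python
import numpy as np

def ban_predictive(p_s_given_w, per_example_losses, rhos, prior_samples):
    """Monte Carlo sketch of P(s | {s_i}, H) with attention factors rho_i(s).

    p_s_given_w:        callable w -> P(s | w) for the query point s
    per_example_losses: callable w -> array of losses l(s_i | w) over the training set
    rhos:               array of attention factors rho_i(s), one per training example
    prior_samples:      iterable of parameter samples w drawn from the prior P_0(w | H)
    """
    # Attention-weighted likelihood of the training data under each prior sample.
    weights = np.array([np.exp(-np.dot(rhos, per_example_losses(w))) for w in prior_samples])
    likes = np.array([p_s_given_w(w) for w in prior_samples])
    # Self-normalizing approximates the conditioning on {s_i} (an assumption here).
    return float(np.sum(weights * likes) / np.sum(weights))
```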

4. Impact on Communication Efficiency, Accuracy, and Model Dynamics

ADC frameworks demonstrate substantial reductions in data transmission and computational requirements. In split learning (Alvetreti et al., 18 Sep 2025), forward and backward communication costs are cut by the compound factor $\xi_{ADC}$, enabling efficient training with minimal symbols sent, while preserving baseline accuracy. For VGG-16 on ImageNet, reinforcement learning-based ADC achieves a 4-fold reduction in FLOPs and a 2.8% improvement in accuracy over manual compression policies (Hakkak, 2018). In post-processing for traditional codecs, attention-based CNNs deliver improvements in image fidelity, e.g., a 0.64 dB PSNR gain at 0.15 bpp (Xue et al., 2019).

ADC methods inherently compress gradients alongside activations, obviating the need for extra tuning in distributed learning setups. The class-agnostic merging enabled by attention-driven batch compression facilitates generalization and prevents loss in task performance, because merging is driven by semantic importance rather than label identity.

5. Comparative Evaluation and Empirical Findings

Benchmarks for ADC against state-of-the-art baselines (BottleNet++, Top-K, RandTopK, C3-SL) establish its communication efficiency and stability. When applied to DeiT-T training on CIFAR100, ADC outperforms other methods at compression ratios down to $\xi_{ADC} \approx 0.1$ (Alvetreti et al., 18 Sep 2025), maintaining accuracy close to uncompressed models. ADC also yields smoother training convergence and reduced fluctuations in loss curves compared with alternative compression strategies.

Model-based ADC via reinforcement learning generalizes across architectures (VGG, ResNet, MobileNet) and tasks (classification, detection), with learned policies outperforming manually crafted schemes (Hakkak, 2018). Attention-based CNN post-processing yields the highest PSNR and MS-SSIM among post-codec restoration pipelines (Xue et al., 2019).

6. Mathematical Formalisms and Reproducible Algorithms

Key mathematical tools in ADC-related research include:

  • Continuous action space $a \in (0, 1]$ for dynamic compression ratios.
  • Actor-critic optimization loss (see the sketch after this list):

$$L = \frac{1}{N} \sum_{i=1}^N \left(y_i - Q(s_i, a_i \mid \theta^Q)\right)^2$$

$$y_i = r_i + \gamma\, Q'\!\left(s_{i+1}, u(s_{i+1} \mid \theta^{u'}) \mid \theta^{Q'}\right)$$

  • Attention-based clustering for batch merging:

$$\mathbb{1}(z, i) = \text{indicator for assignment}$$

  • Quantization objectives for post-attention representations:

$$\min\, \mathbb{E}\left[-\log P(Q_E^{out})\right]$$
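
The actor-critic formulas in the second bullet above correspond to a DDPG-style critic update; a minimal PyTorch sketch is given below, assuming the critic Q, target critic Q', and target actor u' are provided as modules and that a replay batch has already been sampled.

```python
import torch
import torch.nn.functional as F

def critic_loss(Q, Q_target, u_target, states, actions, rewards, next_states, gamma=0.99):
    """DDPG-style critic loss matching the formulas above.

    y_i = r_i + gamma * Q'(s_{i+1}, u'(s_{i+1}))
    L   = mean_i (y_i - Q(s_i, a_i))^2

    Q, Q_target, u_target are assumed to be torch.nn.Module instances mapping
    (state, action) -> value and state -> action, respectively.
    """
    with torch.no_grad():
        next_actions = u_target(next_states)                        # u(s_{i+1} | theta^{u'})
        y = rewards + gamma * Q_target(next_states, next_actions)   # target values y_i
    return F.mse_loss(Q(states, actions), y)
```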

BAN utilizes the sharpened Jensen's inequality

$$\langle e^x \rangle \geq e^{\langle x \rangle + \frac{1}{2} \langle (x - \langle x \rangle)^2 \rangle}$$

to preserve attention dependencies, with latent mapping encoders $q(z \mid s)$ supporting context-adaptive coding (Tetelman, 2021).

7. Implications and Applications

Attention-based Double Compression methodologies have broad reach in distributed split learning, model compression for resource-constrained deployment, post-processing pipelines in image restoration, and adaptive lossless coding. By extracting and preserving only the most informative dimensions (samples, tokens, channels, spatial regions), these techniques facilitate efficient, scalable, and accurate transmission, storage, and deployment of deep learning models. ADC methods are especially suitable for scenarios with bandwidth or compute constraints, such as edge computing, mobile inference, federated learning, and neural compression in embedded vision systems.

A plausible implication is the extension of ADC frameworks for future integration with more sophisticated attention modules, multi-objective optimization (e.g., latency, energy, accuracy), finer levels of semantic abstraction, and practical deployments in diverse modalities including vision, language, and structured data.


| ADC Instance | Stage 1 Compression | Stage 2 Compression | Key Attention Metric |
| --- | --- | --- | --- |
| ViT split learning (Alvetreti et al., 18 Sep 2025) | Batch clustering via CLS scores | Token selection via top-k scores | Average attention from CLS token |
| Model compression (Hakkak, 2018) | RL-derived sparsity ratios | Channel/spatial pruning | Layer state vector (embedding) |
| Post-processing CNN (Xue et al., 2019) | Channel attention | Spatial attention | Channel/spatial attention multipliers |

The structural duality of ADC, often implemented as attention-guided selection along two or more axes, provides a unifying paradigm for efficient, adaptive, and robust compression in contemporary deep learning systems.
