
Asymmetric Co-Attention Block (ACB)

Updated 30 May 2026
  • ACB is an attention mechanism that integrates information from two sources in a non-symmetric, directionally biased manner.
  • It utilizes lightweight channel gating and one-way cross-attention to effectively address sequence length and feature discrepancies.
  • Empirical studies indicate that ACB enhances tasks like image restoration and video moment retrieval while reducing computational overhead.

An Asymmetric Co-Attention Block (ACB) is an attention-based neural module designed to adaptively integrate information from two sources—such as distinct feature states or different modalities—by explicitly modeling their interactions in a non-symmetric manner. Unlike symmetric co-attention, which treats both inputs equivalently, ACB mechanisms prioritize directional information flow or weighted fusion, thereby addressing specific challenges such as sequence length imbalance or feature disparity. The ACB concept has been instantiated in both convolutional (e.g., image restoration and super-resolution) and transformer-based (e.g., cross-modal grounding) contexts, and is characterized by lightweight parametrization, explicit channel- or token-level competition, and empirical improvements over naive fusion strategies (Li et al., 2020, Li et al., 2020, Panta et al., 2023).

1. Motivation and Problem Setting

The development of ACB was motivated by two principal challenges observed in deep neural network architectures:

  • Inefficient fusion of multi-state or multi-modal features: In deep vision backbones (e.g., for super-resolution or restoration), single-path streams fail to propagate low-level context effectively to deeper layers. Multimodal tasks like video moment retrieval face information asymmetry, where modalities differ strongly in sequence length or representational density (Li et al., 2020, Li et al., 2020, Panta et al., 2023).
  • Asymmetry in input data: In cross-modal settings, the video modality often comprises tens or hundreds of segment tokens, while the paired language modality is much shorter. Naive dual-stream cross-attention or concatenation can bias feature fusion towards the longer modality or incur computational overhead (Panta et al., 2023).

By introducing directional (asymmetric) co-attention blocks, networks can learn to adaptively emphasize informative features from the preferred source and mitigate the imbalance between modalities or feature states.

2. Architectural Instantiations

2.1 CNN-Based: Feature Map Fusion (Image Restoration/SR)

In deep interleaved networks (DIN), ACB modules are integrated at every feature fusion point between parallel branches. Each ACB takes as input two feature tensors $X_1, X_2 \in \mathbb{R}^{C\times H\times W}$ (typically, one from the previous branch and one from the current branch's previous layer) and outputs a fused tensor of identical shape. The process involves four steps: concatenation and integration, global average pooling (squeeze), channel gating with bottleneck, and per-channel asymmetric fusion (Li et al., 2020, Li et al., 2020).

2.2 Transformer-Based: Cross-Modal Token Attention

In video moment retrieval and similar tasks, ACB is implemented as a one-way multi-head cross-attention block within transformer architectures. Here, the shorter text sequence $T\in\mathbb{R}^{m\times d}$ queries the longer video sequence $V\in\mathbb{R}^{n\times d}$, but not vice versa. Multi-head projections, scaled dot-product attention, a feed-forward update, and skip connections are used, but all attention runs in the direction text $\to$ video, never the reverse (Panta et al., 2023).

Summary Table: ACB Variants Across Tasks

Context                 | Inputs       | Fusion Mechanism
Image Restoration/SR    | $X_1$, $X_2$ | Channel-wise gating
Video Moment Retrieval  | $V$, $T$     | One-way cross-attention

3. Mathematical Formulation and Forward Pass

3.1 CNN Fusion Version

Given $X_1, X_2 \in \mathbb{R}^{C\times H\times W}$, the canonical ACB (AsyCA) computes:

  1. Concatenate: $\tilde X = \text{concat}(X_1, X_2) \in \mathbb{R}^{2C\times H\times W}$
  2. Integrate: $U = \mathrm{Conv}_{1\times 1}(\tilde X) \in \mathbb{R}^{C\times H\times W}$
  3. Squeeze: $s = \mathrm{GAP}(U) \in \mathbb{R}^{C}$, with $s_c = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} U_c(i,j)$
  4. Channel gating:
    • $z = W_2\,\mathrm{ReLU}(W_1 s)$, $W_1 \in \mathbb{R}^{(C/r)\times C}$, $W_2 \in \mathbb{R}^{2C\times (C/r)}$
    • split $z$ into $p, q \in \mathbb{R}^{C}$
  5. Per-channel softmax and fusion:
    • $a_c = \frac{e^{p_c}}{e^{p_c}+e^{q_c}}$, $b_c = \frac{e^{q_c}}{e^{p_c}+e^{q_c}}$
    • $Y_c = a_c\,X_{1,c} + b_c\,X_{2,c}$

This module yields one fused output, not two reciprocally attended ones, and the asymmetric softmax ensures inputs compete per channel, enabling adaptive emphasis of informative states (Li et al., 2020, Li et al., 2020).
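
The five steps above (concatenate, integrate, squeeze, gate, fuse) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `asyca_fuse`, the weight names `w_int`, `w_down`, `w_up`, the reduction ratio `r`, and the choice to map the concatenated input back to C channels are all hypothetical, and biases are omitted.

```python
import numpy as np

def asyca_fuse(x1, x2, w_int, w_down, w_up):
    """Fuse two (C, H, W) feature maps into a single (C, H, W) map.

    Assumed (hypothetical) weight shapes:
      w_int : (C, 2C)     1x1 conv integrating the concatenated input
      w_down: (C//r, C)   bottleneck projection
      w_up  : (2C, C//r)  expansion into two C-dimensional logit vectors
    """
    c = x1.shape[0]
    x_cat = np.concatenate([x1, x2], axis=0)         # 1. concatenate -> (2C, H, W)
    u = np.einsum("oc,chw->ohw", w_int, x_cat)       # 2. integrate (1x1 conv) -> (C, H, W)
    s = u.mean(axis=(1, 2))                          # 3. squeeze (global average pool) -> (C,)
    z = w_up @ np.maximum(w_down @ s, 0.0)           # 4. gating MLP with ReLU -> (2C,)
    p, q = z[:c], z[c:]                              #    split into two logit vectors
    shift = np.maximum(p, q)                         # 5. stable two-way softmax per channel
    ep, eq = np.exp(p - shift), np.exp(q - shift)
    a, b = ep / (ep + eq), eq / (ep + eq)
    y = a[:, None, None] * x1 + b[:, None, None] * x2  # single fused output
    return y, a, b

rng = np.random.default_rng(0)
C, H, W, r = 8, 6, 6, 4
x1 = rng.standard_normal((C, H, W))
x2 = rng.standard_normal((C, H, W))
w_int = rng.standard_normal((C, 2 * C)) * 0.1
w_down = rng.standard_normal((C // r, C)) * 0.1
w_up = rng.standard_normal((2 * C, C // r)) * 0.1

y, a, b = asyca_fuse(x1, x2, w_int, w_down, w_up)
print(y.shape, np.allclose(a + b, 1.0))  # (8, 6, 6) True
```

Because the two weights for each channel sum to one, raising the weight on one input necessarily lowers it on the other, which is the per-channel competition described above.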

3.2 Transformer Cross-Attention Version

Given $V\in\mathbb{R}^{n\times d}$ (video tokens) and $T\in\mathbb{R}^{m\times d}$ (text tokens), multi-head cross-attention is defined as:

For each head $h$:

  • $Q_h = T W_h^Q$, $K_h = V W_h^K$, $V_h = V W_h^V$
  • $A_h = \mathrm{softmax}\left(Q_h K_h^\top / \sqrt{d_k}\right)$
  • $\mathrm{head}_h = A_h V_h$
  • Aggregate heads, project, and feed-forward to produce visually-informed text (Panta et al., 2023)

Crucially, no attention flows from video to text. This unidirectionality both resolves sequence length disparities and preserves spatio-temporal localization in video.
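
A minimal NumPy sketch of the one-way direction may help make this concrete: queries come from the text tokens, keys and values from the video tokens, and the video tokens are never updated. The function name `one_way_cross_attention`, the weight names, and the toy sizes are assumptions for illustration; the feed-forward sub-layer and layer normalization are left out for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def one_way_cross_attention(text, video, wq, wk, wv, wo, num_heads):
    """text: (m, d), video: (n, d), all weights: (d, d).

    Returns visually-informed text of shape (m, d) and the attention
    maps of shape (num_heads, m, n). Video tokens are read, never written.
    """
    m, d = text.shape
    n = video.shape[0]
    dk = d // num_heads
    # Project, then split the feature axis into heads: (heads, tokens, dk)
    q = (text @ wq).reshape(m, num_heads, dk).transpose(1, 0, 2)
    k = (video @ wk).reshape(n, num_heads, dk).transpose(1, 0, 2)
    v = (video @ wv).reshape(n, num_heads, dk).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dk))   # (heads, m, n)
    heads = attn @ v                                          # (heads, m, dk)
    merged = heads.transpose(1, 0, 2).reshape(m, d)           # concatenate heads
    return text + merged @ wo, attn                           # skip connection

rng = np.random.default_rng(0)
m, n, d, num_heads = 4, 30, 16, 4
text = rng.standard_normal((m, d))
video = rng.standard_normal((n, d))
wq, wk, wv, wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

out, attn = one_way_cross_attention(text, video, wq, wk, wv, wo, num_heads)
print(out.shape, attn.shape)  # (4, 16) (4, 4, 30)
```

Each attention map has one row per text token and one column per video segment, so its size grows with the product of the two lengths rather than with the square of the longer one.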

4. Integration within Broader Architectures

4.1 Deep Interleaved Networks

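Within DIN, ACB modules are mounted at every interleaved node, fusing outputs of weighted residual dense blocks (WRDBs) between branches. The input to a given WRDB depends on its position:

  • Where no preceding branch supplies a feature, the WRDB receives the output of the previous block in its own branch directly.
  • Otherwise, an ACB first fuses the preceding branch's feature with the output of the current branch's previous block, and the fused tensor feeds the WRDB.

The ACB thus provides adaptive channel selection before each WRDB, controlling the flow and integration of both shallow and deep features (Li et al., 2020, Li et al., 2020).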

4.2 Video Moment Retrieval Pipelines

In cross-modal transformers, ACB appears at the start of each cross-modal block:

  1. Video C3D embeddings $V$ and text GloVe embeddings $T$ are projected.
  2. ACB fuses text queries with visual tokens via one-way attention for S layers.
  3. Outputs are concatenated and passed to a single-stream self-attention module (CAB).
  4. Multi-stage heads process final representations for video proposal ranking and masked language prediction (Panta et al., 2023).
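
The ordering of steps 1 to 3 can be sketched as below. This is a rough illustration under several assumptions rather than the published architecture: single-head attention stands in for multi-head attention, the feature sizes are toy values, the names `init`, `forward`, and the `params` dictionary are hypothetical, the concatenated outputs are assumed to be the video tokens followed by the updated text tokens, and the task heads of step 4 are omitted.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(queries, memory, wq, wk, wv):
    """Single-head scaled dot-product attention: queries read from memory."""
    q, k, v = queries @ wq, memory @ wk, memory @ wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def init(rng, d_video, d_text, d, num_layers):
    mat = lambda rows, cols: rng.standard_normal((rows, cols)) / np.sqrt(rows)
    return {
        "proj_v": mat(d_video, d),                      # step 1: video projection
        "proj_t": mat(d_text, d),                       # step 1: text projection
        "acb": [(mat(d, d), mat(d, d), mat(d, d)) for _ in range(num_layers)],
        "cab": (mat(d, d), mat(d, d), mat(d, d)),
    }

def forward(video_feats, text_feats, params):
    v = video_feats @ params["proj_v"]                  # (n, d)
    t = text_feats @ params["proj_t"]                   # (m, d)
    for wq, wk, wv in params["acb"]:                    # step 2: S one-way layers
        t = t + attend(t, v, wq, wk, wv)                # only the text stream is updated
    joint = np.concatenate([v, t], axis=0)              # step 3: (n + m, d)
    return joint + attend(joint, joint, *params["cab"]) # single-stream self-attention

rng = np.random.default_rng(0)
params = init(rng, d_video=32, d_text=12, d=16, num_layers=2)
video_feats = rng.standard_normal((20, 32))   # 20 video segments
text_feats = rng.standard_normal((5, 12))     # 5 query words
joint_out = forward(video_feats, text_feats, params)
print(joint_out.shape)  # (25, 16)
```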

Ablation shows that replacing bidirectional or decoupled cross-attention with ACB, especially under momentum-contrastive pre-training, yields a significantly leaner model at equal or better accuracy.

5. Comparative Analysis and Ablation Studies

In the reported ablations, ACB-based fusion matches or outperforms non-adaptive concatenation, summation, and symmetric fusion in both CNN and transformer architectures.

Image super-resolution (DIN; Set5, ×2 SR, 50 epochs):

  • Baseline (no ACB): 37.61 dB PSNR
  • +ACB alone: 37.65 dB
  • +ACB + WRDB weighted connections: 37.74 dB
  • Full DIN: 37.77 dB
  • Compared with sum or concat fusion, ACB delivers the largest improvement and the most stable convergence. The same qualitative trends are observed in restoration benchmarks (Li et al., 2020).

Video moment retrieval (TACoS):

  • ACB alone: marginal or no improvement in isolation
  • ACB + momentum contrastive loss + CAB: R@1 (at a fixed IoU threshold) improves from 48.79 to 49.77 with 35% fewer parameters
  • ACB allows significant model compression while maintaining SOTA or near-SOTA accuracy

The single fused output and per-channel softmax have been found to increase model discriminability and stabilize training.

6. Design Choices and Computational Considerations

Key design choices in ACB implementations include:

  • Channel dimension: C=64 in typical DIN branches
  • Reduction ratio (MLP bottleneck): one of two values of $r$, chosen according to depth
  • All convolutions: 1×1, no normalization layers; ReLU as the only activation
  • Pooling: Average across spatial or sequence dimension only
  • Parameter efficiency: Only two 1×1 convolutions and two attention vectors per fusion point; multi-head attention uses standard transformer parametrization in token-based ACBs
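  • Cost scaling: Channel-based ACB uses only 1×1 convolutions and pooled channel statistics, so its per-position cost depends only on the channel count; transformer-based one-way ACB forms a single $m\times n$ attention map (text queries over video keys) instead of attending in both directions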

These design choices ensure ACB imposes negligible computational or memory overhead relative to the main architecture.
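
As a rough worked example of the attention-cost point, the snippet below counts only the multiply-accumulates needed to form the query-key score matrix for three layouts: one-way (text queries over video keys), two-way cross-attention, and joint self-attention over the concatenated sequence. The sequence lengths and width are illustrative values chosen here, not figures from the cited papers, and projections, value aggregation, and feed-forward layers are ignored.

```python
def score_macs(num_queries, num_keys, dim):
    """Multiply-accumulates needed to form a (num_queries x num_keys) score matrix."""
    return num_queries * num_keys * dim

def compare_attention_cost(n, m, d):
    """n video tokens, m text tokens, feature width d."""
    return {
        "one_way": score_macs(m, n, d),                        # text -> video only
        "two_way": score_macs(m, n, d) + score_macs(n, m, d),  # both directions
        "joint_self": score_macs(n + m, n + m, d),             # concatenated sequence
    }

costs = compare_attention_cost(n=128, m=16, d=256)
print(costs)
# {'one_way': 524288, 'two_way': 1048576, 'joint_self': 5308416}
```

In this toy setting the one-way layout needs half the score computations of two-way cross-attention and about a tenth of joint self-attention, and the gap to joint self-attention widens as the video sequence grows relative to the text.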

7. Contextual Significance and Distinctive Attributes

ACB mechanisms differ fundamentally from classical co-attention:

  • No reciprocal mapping: Inputs do not cross-attend in both directions.
  • Softmax competition occurs per-channel or per-token pair, not spatially or sequence-wide.
  • Only one fused output is produced.
  • Directional selectivity enables the network to adaptively filter and combine incoming streams according to their relevance.

This asymmetry is particularly advantageous in architectures where information balance or discriminative fusion is non-trivial. Qualitative studies indicate better high-frequency detail restoration and more robust cross-modal alignment.

A plausible implication is that this class of block is especially beneficial in settings with severe sequence length or feature abstraction disparities, and is likely to persist as an architectural default in multidirectional/multimodal deep learning.
