SeqCoAttn: Sequential Cross-Attention
- SeqCoAttn is a sequential cross-attention mechanism that decomposes attention across tasks, scales, and modalities to enable adaptive feature fusion.
- It integrates CTAM and CSAM modules, allowing refined task-specific and cross-scale feature propagation while reducing computational overhead.
- Empirical evaluations in visual scene understanding and speech enhancement demonstrate that SeqCoAttn improves accuracy and robustness with state-of-the-art efficiency.
Sequential Cross-Attention (SeqCoAttn) refers to a class of attention-based neural architectures that sequentially apply cross-attention across multiple interaction axes—such as between tasks, between spatial scales, or between a signal and its context—enabling selective, content-adaptive feature transfer with reduced computational overhead. The term encompasses mechanisms for visual multi-task learning as described in "Sequential Cross Attention Based Multi-task Learning" (Kim et al., 2022), and for context modeling in speech enhancement as exemplified by the cross-attention Conformer architecture (Narayanan et al., 2021). These designs exploit the sequential decomposition of attention into modular cross-task, cross-scale, or cross-modal operations to enhance information flow while maintaining tractable complexity.
1. SeqCoAttn in Multi-Task Visual Scene Understanding
SeqCoAttn in multi-task learning is instantiated to enable effective feature sharing across tasks (e.g., segmentation, depth, surface normals) while minimizing harmful interference. The architecture consists of the following main components:
- Feature Extraction Backbone: A convolutional network produces multi-scale feature outputs $F_s$ at resolutions $s \in \{1/4, 1/8, 1/16, 1/32\}$. Each scale is augmented by two Swin-Transformer self-attention blocks; the resulting features are then fused with the original CNN features via a convolution, strengthening both local and long-range dependencies.
- Task-Specific Heads: Each scale's features are processed by task-specific heads, yielding features $F_{t,s}$ for task $t$ at scale $s$.
- Cross-Task Attention Module (CTAM): Applies cross-attention at each scale $s$, letting each task's features attend to those of the remaining tasks at the same scale. Query, key, and value projections are learned convolutions; the attended features for a target task $t$ from a source task $t' \neq t$ are

$$A_{t \leftarrow t'} = \mathrm{softmax}\!\left(\frac{Q_t K_{t'}^{\top}}{\sqrt{d}}\right) V_{t'}.$$

Refined task features are produced by concatenating the attended features and fusing them with convolutional layers, with a residual connection for stability.
- Cross-Scale Attention Module (CSAM): For each task $t$, CSAM fuses information across scales, letting features at a fine scale $s$ attend to coarser-scale features $F_{t,s'}$ at scales $s'$ of lower resolution. The output at each scale is the concatenation of the current and attended coarser features, further refined by a convolution.
- Prediction Layer: The refined per-scale task features are upsampled, concatenated, and projected to task outputs via a final convolution.
The roles of the two core modules are summarized below:

| Module | Input Features | Operation Scope |
|---|---|---|
| CTAM | Task features $\{F_{t,s}\}_{t}$ at a fixed scale $s$ | Across tasks, per scale |
| CSAM | CTAM-refined features $\{\hat{F}_{t,s}\}_{s}$ for a fixed task $t$ | Across scales, per task |
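In code form, the forward pass described above reduces to a fixed module ordering. The stand-ins below only record the call sequence to make that ordering explicit; the stage names are ours, not identifiers from the authors' code:

```python
# Schematic of the SeqCoAttn forward-pass ordering, with identity stand-ins
# for each module; each stage just records its name and passes features through.
calls = []

def stage(name):
    def f(x):
        calls.append(name)
        return x
    return f

backbone   = stage("backbone")    # multi-scale CNN + Swin self-attention blocks
task_heads = stage("task_heads")  # per-task, per-scale heads
ctam       = stage("CTAM")        # cross-task attention, per scale
fpma       = stage("FPMA")        # attention-based coarse-to-fine propagation
csam       = stage("CSAM")        # cross-scale attention, per task
predict    = stage("predict")     # upsample, concatenate, final convolution

out = predict(csam(fpma(ctam(task_heads(backbone("image"))))))
# calls == ["backbone", "task_heads", "CTAM", "FPMA", "CSAM", "predict"]
```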
2. Mathematical Formulation and Attention Flow
SeqCoAttn sequentially composes attention as follows:
- Cross-Task: At each scale, all task features compute queries, keys, and values using task-specific projections. For a target task $t$, cross-attention to every other task $t' \neq t$ yields aggregated features.
- Feature Propagation: After CTAM, feature maps from coarser to finer scales are upsampled with the Feature Propagation Module with Attention (FPMA) and injected into subsequent scales.
- Cross-Scale: Each task fuses its CTAM-refined representations using cross-scale attention, with queries from the finer scale and keys/values from coarser scales, forming task- and content-dependent multi-receptive field features.
Both CTAM and CSAM utilize softmax-scaled dot-product attention:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V.$$

Attended features from all sources are concatenated and fused via convolution, with residual connections to retain the original content.
Self-attention precedes the cross modules and takes the same form, with $Q$, $K$, and $V$ all projected from the same feature map:

$$\mathrm{SA}(F) = \mathrm{softmax}\!\left(\frac{Q_F K_F^{\top}}{\sqrt{d_k}}\right) V_F.$$

The result is concatenated with the original CNN feature.
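The scaled dot-product form used throughout can be sketched in a few lines of dependency-free Python. The toy feature maps, dimensions, and additive residual fusion below are illustrative stand-ins, not the authors' learned convolutional projections:

```python
# Minimal scaled dot-product cross-attention on toy 2-position, 2-channel
# feature maps: task 0 queries attend to task 1 keys/values (a CTAM-style step).
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(M):
    out = []
    for row in M:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def attention(Q, K, V):
    d = len(K[0])
    K_T = [list(c) for c in zip(*K)]
    scores = [[v / math.sqrt(d) for v in row] for row in matmul(Q, K_T)]
    return matmul(softmax_rows(scores), V)

F0 = [[1.0, 0.0], [0.0, 1.0]]   # task-0 features (queries)
F1 = [[0.5, 0.5], [1.0, -1.0]]  # task-1 features (keys and values)
attended = attention(F0, F1, F1)
refined = [[a + b for a, b in zip(r, s)] for r, s in zip(F0, attended)]  # toy residual
```

Each attended row is a convex combination of the value rows, which is what makes the fusion content-adaptive.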
3. Computational Complexity
A joint attention approach across all tasks and scales quickly becomes intractable, scaling as $O\big((T \sum_s N_s)^2\big)$ for $T$ tasks, where $N_s$ is the number of spatial locations at scale $s$.
SeqCoAttn achieves significant reduction:
- CTAM: $O\big(\sum_s T(T-1) N_s^2\big)$, since attention for each task at each scale is computed over the $T-1$ other tasks at that scale.
- CSAM: $O\big(\sum_s T(S-1) N_s^2\big)$ as an upper bound, from each scale attending to up to $S-1$ coarser feature maps per task.
- Total: roughly $O\big(ST(T+S)\bar{N}^2\big)$, where $\bar{N}$ is the average spatial element count per scale, versus $O\big(T^2 S^2 \bar{N}^2\big)$ for joint attention.
This sequential decomposition yields a substantial efficiency advantage over naïve full cross-attention.
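A rough operation count makes the gap concrete. The sketch below compares joint attention over all task/scale tokens with a sequential cross-task plus cross-scale decomposition; the toy values of $T$, $S$, $N$ and the simplified counts (constants and feature dimensions dropped) are our own illustration:

```python
# Back-of-envelope cost comparison: joint attention over all task/scale tokens
# vs. a sequential CTAM + CSAM decomposition. T, S, N are illustrative values.
T, S, N = 3, 4, 1024             # tasks, scales, avg. spatial elements per scale

joint = (T * S * N) ** 2         # all tokens attend to all tokens
ctam = S * T * (T - 1) * N ** 2  # per scale, each task attends to the T-1 others
csam = T * S * (S - 1) * N ** 2  # per task, up to S-1 coarser maps per scale
sequential = ctam + csam

ratio = joint / sequential       # equals T*S / (T + S - 2), grows with T and S
print(ratio)                     # → 2.4 for this toy setting
```

Because the savings factor scales with the number of tasks and scales, the advantage widens exactly where joint attention becomes most expensive.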
4. Empirical Validation
Extensive experiments demonstrate the empirical superiority of the SeqCoAttn mechanism in multi-task visual scene understanding:
- NYUD-v2 (Depth & Segmentation):
- Single-task baseline: RMSE=0.644, mIoU=35.04
- Multi-task baseline: RMSE=0.674, mIoU=35.03
- ATRC prior SOTA: RMSE=0.613, mIoU=40.99
- SeqCoAttn: RMSE=0.604, mIoU=41.33, Δm=+12.07%
- NYUD-v2 (Depth, Segmentation, Normals):
- SeqCoAttn: RMSE=0.584, Seg mIoU=40.50, Normal μErr=20.59, Δm=+4.82%
- PASCAL-Context (Segmentation & Normals):
- Single-task: mIoU=57.33, μErr=14.87
- SeqCoAttn: mIoU=60.09, μErr=13.89, Δm=+5.69%
- Ablation on NYUD-v2: Incremental addition of self-attention, FPMA, and CTAM+CSAM yields cumulative mIoU improvements, with CTAM+CSAM responsible for +6.10% over baseline (mIoU=41.33, RMSE=0.604).
These results indicate that the sequential application of cross-attention modules for task and scale provides state-of-the-art accuracy, with particularly notable gains in multi-task setups where feature transfer is non-trivial (Kim et al., 2022).
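The reported multi-task gain can be sanity-checked from the numbers above, assuming the standard definition of the metric: the mean relative improvement over the single-task baseline, with the sign flipped for lower-is-better metrics such as RMSE:

```python
# Sanity check of the NYUD-v2 depth + segmentation multi-task gain, assuming
# the usual definition: average per-task relative improvement over the
# single-task baseline (sign-flipped for lower-is-better metrics).
single = {"rmse": 0.644, "miou": 35.04}  # single-task baselines from above
seqco = {"rmse": 0.604, "miou": 41.33}   # SeqCoAttn results from above

gain_depth = (single["rmse"] - seqco["rmse"]) / single["rmse"]  # lower is better
gain_seg = (seqco["miou"] - single["miou"]) / single["miou"]    # higher is better
delta_m = 100 * (gain_depth + gain_seg) / 2

print(round(delta_m, 2))  # → 12.08, in line with the reported +12.07%
```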
5. SeqCoAttn for Contextual Modeling in Speech Enhancement
SeqCoAttn is also instantiated in the cross-attention Conformer for speech enhancement, where the context (e.g., a noise segment) may differ in length and content from the target signal. Key aspects include:
- Feature Streams: Input ($x$) and context ($c$) features are projected to queries, keys, and values ($Q$, $K$, $V$), with queries drawn from one stream and keys/values from the other, followed by scaled dot-product attention.
- Layer Structure: The standard Conformer, consisting of FFN, convolution, self-attention, and FFN, is modified. The self-attention block is replaced with two cross-attention blocks separated by a Feature-wise Linear Modulation (FiLM) merging function.
- Network Topology: Speech and noise encoders process their respective streams, which are then integrated via two cross-attention Conformer layers. The full system totals 19M parameters.
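The FiLM merging step between the two cross-attention blocks computes an affine modulation of one stream by the other, $\mathrm{FiLM}(x; c) = \gamma(c) \odot x + \beta(c)$. Below is a toy per-feature sketch; in the actual model $\gamma$ and $\beta$ are learned projections of the context stream, whereas the affine maps here are fixed stand-ins of our own:

```python
# Toy FiLM merge: context-derived scale (gamma) and shift (beta) modulate the
# input-stream features. The 0.1/0.05 coefficients are arbitrary stand-ins for
# learned projections.
def film(x, context):
    gamma = [1.0 + 0.1 * c for c in context]  # toy scale from context
    beta = [0.05 * c for c in context]        # toy shift from context
    return [g * xi + b for xi, g, b in zip(x, gamma, beta)]

x = [1.0, -2.0, 0.5]    # input-stream features after the first cross-attention
c = [0.2, 0.4, -0.1]    # context-stream features (e.g., a noise summary)
merged = film(x, c)     # fed to the second cross-attention block
```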
Performance on ASR robustness benchmarks:
| Condition | No Enhancement | Baseline (No Context) | SeqCoAttn Conformer |
|---|---|---|---|
| LibriSpeech (–5/0/5 dB, WER %) | 36.5 / 22.5 / 14.0 | 33.6 / 20.4 / 13.3 | 31.8 / 19.3 / 12.7 |
| Vendor Noise (0/6/12 dB, WER %) | 46.3 / 19.0 / 8.6 | – | 34.6 / 15.0 / 6.4 |
| Unseen Multi-Talker (–5/0/5 dB) | 69.2 / 46.4 / 29.3 | – | 50.4 / 33.6 / 22.8 |
SeqCoAttn achieves up to 28% relative WER reduction in challenging conditions without degrading clean speech recognition (Narayanan et al., 2021).
6. Key Characteristics and Modular Design Principles
SeqCoAttn architectures are defined by:
- Sequential application of cross-attention along interaction axes (tasks, scales, modalities).
- Modularity, allowing incremental feature refinement and selective information flow.
- Efficiency, as decoupling attention over axes (rather than over all pairs) significantly reduces computation without sacrificing representational capacity.
- Compatibility with transformer backbones (e.g., Swin-Transformer for vision, Conformer for speech) and downstream outputs.
- Residual connections and convolutional bottlenecks, which stabilize training and enable integration with standard neural feature pipelines.
A plausible implication is that this architectural paradigm can be extended to other domains requiring structured multi-source and multi-context information fusion.
7. Summary and Perspectives
Sequential Cross-Attention mechanisms provide a scalable, content-adaptive, and experimentally validated approach for structured information transfer in multi-task, multi-scale, and context-aware learning scenarios. By decomposing cross-attention across axes—first by task, then by scale, or by signal and context—these models achieve substantial empirical gains in both visual scene understanding (Kim et al., 2022) and speech enhancement (Narayanan et al., 2021), while managing computational resource requirements. The modular composition of attention operations offers both architectural flexibility and a template for future research in disentangled multi-source information integration.