SeqCoAttn: Sequential Cross-Attention
- SeqCoAttn is a sequential cross-attention mechanism that decomposes attention across tasks, scales, and modalities to enable adaptive feature fusion.
- It integrates CTAM and CSAM modules, allowing refined task-specific and cross-scale feature propagation while reducing computational overhead.
- Empirical evaluations in visual scene understanding and speech enhancement demonstrate that SeqCoAttn improves accuracy and robustness with state-of-the-art efficiency.
Sequential Cross-Attention (SeqCoAttn) refers to a class of attention-based neural architectures that sequentially apply cross-attention across multiple interaction axes—such as between tasks, between spatial scales, or between a signal and its context—enabling selective, content-adaptive feature transfer with reduced computational overhead. The term encompasses mechanisms for visual multi-task learning as described in "Sequential Cross Attention Based Multi-task Learning" (Kim et al., 2022), and for context modeling in speech enhancement as exemplified by the cross-attention Conformer architecture (Narayanan et al., 2021). These designs exploit the sequential decomposition of attention into modular cross-task, cross-scale, or cross-modal operations to enhance information flow while maintaining tractable complexity.
1. SeqCoAttn in Multi-Task Visual Scene Understanding
SeqCoAttn in multi-task learning is instantiated to enable effective feature sharing across tasks (e.g., segmentation, depth, surface normals) while minimizing harmful interference. The architecture consists of the following main components:
- Feature Extraction Backbone: A convolutional network produces multi-scale feature outputs $F_s$ at resolutions $s \in \{1/4, 1/8, 1/16, 1/32\}$. Each scale is augmented by two Swin-Transformer self-attention blocks; the resulting features are then fused with the original CNN features via a convolution, strengthening both local and long-range dependencies.
- Task-Specific Heads: Each scale's features are processed by task-specific heads, yielding features $F_{t,s}$ for task $t$ at scale $s$.
- Cross-Task Attention Module (CTAM): Applies cross-attention at each scale $s$, letting each task's features attend to those of the remaining tasks at the same scale. Query, key, and value projections are learned convolutions; the attended features for a target task $t$ from a source task $t' \neq t$ are

$$A_{t \leftarrow t'} = \mathrm{softmax}\!\left(\frac{Q_t K_{t'}^{\top}}{\sqrt{d}}\right) V_{t'}.$$

Refined task features are produced by concatenating the attended features and fusing them with convolutional layers, with a residual connection for stability.
- Cross-Scale Attention Module (CSAM): For each task $t$, CSAM fuses information across scales, letting features at a fine scale $s$ attend to coarser-scale features $F_{t,s'}$ at scales $s'$ of lower resolution. The output at each scale is the concatenation of the current and attended coarser features, further refined by a convolution.
- Prediction Layer: The refined per-scale task features are upsampled, concatenated, and projected to task outputs via a final convolution.
The roles of the two core modules are summarized below:

| Module | Input Features | Operation Scope |
|---|---|---|
| CTAM | Task features $\{F_{t,s}\}_{t}$ at a fixed scale $s$ | Across tasks, per scale |
| CSAM | CTAM-refined features $\{\hat{F}_{t,s}\}_{s}$ for a fixed task $t$ | Across scales, per task |
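In code form, the forward pass described above reduces to a fixed module ordering. The stand-ins below only record the call sequence to make that ordering explicit; the stage names are ours, not identifiers from the authors' code:

```python
# Schematic of the SeqCoAttn forward-pass ordering, with identity stand-ins
# for each module; each stage just records its name and passes features through.
calls = []

def stage(name):
    def f(x):
        calls.append(name)
        return x
    return f

backbone   = stage("backbone")    # multi-scale CNN + Swin self-attention blocks
task_heads = stage("task_heads")  # per-task, per-scale heads
ctam       = stage("CTAM")        # cross-task attention, per scale
fpma       = stage("FPMA")        # attention-based coarse-to-fine propagation
csam       = stage("CSAM")        # cross-scale attention, per task
predict    = stage("predict")     # upsample, concatenate, final convolution

out = predict(csam(fpma(ctam(task_heads(backbone("image"))))))
# calls == ["backbone", "task_heads", "CTAM", "FPMA", "CSAM", "predict"]
```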
2. Mathematical Formulation and Attention Flow
SeqCoAttn sequentially composes attention as follows:
- Cross-Task: At each scale, all task features compute queries, keys, and values using task-specific projections. For a target task $t$, cross-attention to every other task $t' \neq t$ yields aggregated features.
- Feature Propagation: After CTAM, feature maps from coarser to finer scales are upsampled with the Feature Propagation Module with Attention (FPMA) and injected into subsequent scales.
- Cross-Scale: Each task fuses its CTAM-refined representations using cross-scale attention, with queries from the finer scale and keys/values from coarser scales, forming task- and content-dependent multi-receptive field features.
Both CTAM and CSAM utilize softmax-scaled dot-product attention:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V.$$

Attended features from all sources are concatenated and fused via convolution, with residual connections to retain the original content.
Self-attention precedes the cross modules and takes the same form, with $Q$, $K$, and $V$ all projected from the same feature map:

$$\mathrm{SA}(F) = \mathrm{softmax}\!\left(\frac{Q_F K_F^{\top}}{\sqrt{d_k}}\right) V_F.$$

The result is concatenated with the original CNN feature.
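The scaled dot-product form used throughout can be sketched in a few lines of dependency-free Python. The toy feature maps, dimensions, and additive residual fusion below are illustrative stand-ins, not the authors' learned convolutional projections:

```python
# Minimal scaled dot-product cross-attention on toy 2-position, 2-channel
# feature maps: task 0 queries attend to task 1 keys/values (a CTAM-style step).
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(M):
    out = []
    for row in M:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def attention(Q, K, V):
    d = len(K[0])
    K_T = [list(c) for c in zip(*K)]
    scores = [[v / math.sqrt(d) for v in row] for row in matmul(Q, K_T)]
    return matmul(softmax_rows(scores), V)

F0 = [[1.0, 0.0], [0.0, 1.0]]   # task-0 features (queries)
F1 = [[0.5, 0.5], [1.0, -1.0]]  # task-1 features (keys and values)
attended = attention(F0, F1, F1)
refined = [[a + b for a, b in zip(r, s)] for r, s in zip(F0, attended)]  # toy residual
```

Each attended row is a convex combination of the value rows, which is what makes the fusion content-adaptive.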
3. Computational Complexity
A joint attention approach across all tasks and scales quickly becomes intractable, scaling as $O\big((T \sum_s N_s)^2\big)$ for $T$ tasks, where $N_s$ is the number of spatial locations at scale $s$.
SeqCoAttn achieves significant reduction:
- CTAM: $O\big(\sum_s T(T-1) N_s^2\big)$, since attention for each task at each scale is computed over the $T-1$ other tasks at that scale.
- CSAM: $O\big(\sum_s T(S-1) N_s^2\big)$ as an upper bound, from each scale attending to up to $S-1$ coarser feature maps per task.
- Total: roughly $O\big(ST(T+S)\bar{N}^2\big)$, where $\bar{N}$ is the average spatial element count per scale, versus $O\big(T^2 S^2 \bar{N}^2\big)$ for joint attention.
This sequential decomposition yields a substantial efficiency advantage over naïve full cross-attention.
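A rough operation count makes the gap concrete. The sketch below compares joint attention over all task/scale tokens with a sequential cross-task plus cross-scale decomposition; the toy values of $T$, $S$, $N$ and the simplified counts (constants and feature dimensions dropped) are our own illustration:

```python
# Back-of-envelope cost comparison: joint attention over all task/scale tokens
# vs. a sequential CTAM + CSAM decomposition. T, S, N are illustrative values.
T, S, N = 3, 4, 1024             # tasks, scales, avg. spatial elements per scale

joint = (T * S * N) ** 2         # all tokens attend to all tokens
ctam = S * T * (T - 1) * N ** 2  # per scale, each task attends to the T-1 others
csam = T * S * (S - 1) * N ** 2  # per task, up to S-1 coarser maps per scale
sequential = ctam + csam

ratio = joint / sequential       # equals T*S / (T + S - 2), grows with T and S
print(ratio)                     # → 2.4 for this toy setting
```

Because the savings factor scales with the number of tasks and scales, the advantage widens exactly where joint attention becomes most expensive.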
4. Empirical Validation
Extensive experiments demonstrate the empirical superiority of the SeqCoAttn mechanism in multi-task visual scene understanding:
- NYUD-v2 (Depth & Segmentation):
- Single-task baseline: RMSE=0.644, mIoU=35.04
- Multi-task baseline: RMSE=0.674, mIoU=35.03
- ATRC prior SOTA: RMSE=0.613, mIoU=40.99
- SeqCoAttn: RMSE=0.604, mIoU=41.33, Δm=+12.07%
- NYUD-v2 (Depth, Segmentation, Normals):
- SeqCoAttn: RMSE=0.584, Seg mIoU=40.50, Normal μErr=20.59, Δm=+4.82%
- PASCAL-Context (Segmentation & Normals):
- Single-task: mIoU=57.33, μErr=14.87
- SeqCoAttn: mIoU=60.09, μErr=13.89, Δm=+5.69%
- Ablation on NYUD-v2: Incremental addition of self-attention, FPMA, and CTAM+CSAM yields cumulative mIoU improvements, with CTAM+CSAM responsible for +6.10% over baseline (mIoU=41.33, RMSE=0.604).
These results indicate that the sequential application of cross-attention modules for task and scale provides state-of-the-art accuracy, with particularly notable gains in multi-task setups where feature transfer is non-trivial (Kim et al., 2022).
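The reported multi-task gain can be sanity-checked from the numbers above, assuming the standard definition of the metric: the mean relative improvement over the single-task baseline, with the sign flipped for lower-is-better metrics such as RMSE:

```python
# Sanity check of the NYUD-v2 depth + segmentation multi-task gain, assuming
# the usual definition: average per-task relative improvement over the
# single-task baseline (sign-flipped for lower-is-better metrics).
single = {"rmse": 0.644, "miou": 35.04}  # single-task baselines from above
seqco = {"rmse": 0.604, "miou": 41.33}   # SeqCoAttn results from above

gain_depth = (single["rmse"] - seqco["rmse"]) / single["rmse"]  # lower is better
gain_seg = (seqco["miou"] - single["miou"]) / single["miou"]    # higher is better
delta_m = 100 * (gain_depth + gain_seg) / 2

print(round(delta_m, 2))  # → 12.08, in line with the reported +12.07%
```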
5. SeqCoAttn for Contextual Modeling in Speech Enhancement
SeqCoAttn is also instantiated in the cross-attention Conformer for speech enhancement, where the context (e.g., a noise segment) may differ in length and content from the target signal. Key aspects include:
- Feature Streams: Input ($x$) and context ($c$) features are projected to queries, keys, and values ($Q$, $K$, $V$), with queries drawn from one stream and keys/values from the other, followed by scaled dot-product attention.
- Layer Structure: The standard Conformer, consisting of FFN, convolution, self-attention, and FFN, is modified. The self-attention block is replaced with two cross-attention blocks separated by a Feature-wise Linear Modulation (FiLM) merging function.
- Network Topology: Speech and noise encoders process their respective streams, which are then integrated via two cross-attention Conformer layers. The full system totals 19M parameters.
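The FiLM merging step between the two cross-attention blocks computes an affine modulation of one stream by the other, $\mathrm{FiLM}(x; c) = \gamma(c) \odot x + \beta(c)$. Below is a toy per-feature sketch; in the actual model $\gamma$ and $\beta$ are learned projections of the context stream, whereas the affine maps here are fixed stand-ins of our own:

```python
# Toy FiLM merge: context-derived scale (gamma) and shift (beta) modulate the
# input-stream features. The 0.1/0.05 coefficients are arbitrary stand-ins for
# learned projections.
def film(x, context):
    gamma = [1.0 + 0.1 * c for c in context]  # toy scale from context
    beta = [0.05 * c for c in context]        # toy shift from context
    return [g * xi + b for xi, g, b in zip(x, gamma, beta)]

x = [1.0, -2.0, 0.5]    # input-stream features after the first cross-attention
c = [0.2, 0.4, -0.1]    # context-stream features (e.g., a noise summary)
merged = film(x, c)     # fed to the second cross-attention block
```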
Performance on ASR robustness benchmarks:
| Condition | No Enhancement | Baseline (No Context) | SeqCoAttn Conformer |
|---|---|---|---|
| LibriSpeech (–5/0/5 dB, WER %) | 36.5 / 22.5 / 14.0 | 33.6 / 20.4 / 13.3 | 31.8 / 19.3 / 12.7 |
| Vendor Noise (0/6/12 dB, WER %) | 46.3 / 19.0 / 8.6 | – | 34.6 / 15.0 / 6.4 |
| Unseen Multi-Talker (–5/0/5 dB) | 69.2 / 46.4 / 29.3 | – | 50.4 / 33.6 / 22.8 |
SeqCoAttn achieves up to 28% relative WER reduction in challenging conditions without degrading clean speech recognition (Narayanan et al., 2021).
6. Key Characteristics and Modular Design Principles
SeqCoAttn architectures are defined by:
- Sequential application of cross-attention along interaction axes (tasks, scales, modalities).
- Modularity, allowing incremental feature refinement and selective information flow.
- Efficiency, as decoupling attention over axes (rather than over all pairs) significantly reduces computation without sacrificing representational capacity.
- Compatibility with transformer backbones (e.g., Swin-Transformer for vision, Conformer for speech) and downstream outputs.
- Residual connections and convolutional bottlenecks, which stabilize training and enable integration with standard neural feature pipelines.
A plausible implication is that this architectural paradigm can be extended to other domains requiring structured multi-source and multi-context information fusion.
7. Summary and Perspectives
Sequential Cross-Attention mechanisms provide a scalable, content-adaptive, and experimentally validated approach for structured information transfer in multi-task, multi-scale, and context-aware learning scenarios. By decomposing cross-attention across axes—first by task, then by scale, or by signal and context—these models achieve substantial empirical gains in both visual scene understanding (Kim et al., 2022) and speech enhancement (Narayanan et al., 2021), while managing computational resource requirements. The modular composition of attention operations offers both architectural flexibility and a template for future research in disentangled multi-source information integration.