
AFD: Decoupling Attention and FFN

Updated 28 July 2025
  • AFD is a strategy that decouples attention and FFN components, enabling targeted optimization and specialized hardware utilization in Transformer models.
  • It employs techniques ranging from model-internal reparameterization for few-shot object detection to system-level designs in large-scale language model serving.
  • AFD enhances interpretability, debiasing, and cost-efficiency while boosting task-specific performance across vision, language, and speech applications.

Attention-FFN Disaggregation (AFD) refers to architectural and/or system-level strategies that explicitly decouple the attention and feed-forward network (FFN) components within Transformer-based models or other multi-stage neural architectures. In conventional designs these two components are tightly coupled within each processing block, but recent research shows that disaggregating them enables targeted optimization, efficiency gains, enhanced interpretability, and improved task- or system-level performance. Approaches to AFD range from model-internal reparameterizations and pipeline specialization to distributed inference system design, with relevance to vision, language, speech, and system deployment contexts.

1. Conceptual Foundation and Motivation

The attention and FFN submodules in Transformers have fundamentally different computational profiles. Attention layers require intensive memory access and key–value (KV) cache management while holding relatively few parameters, whereas FFNs, particularly when implemented as (sparse) Mixtures-of-Experts (MoE), are heavily parameter- and compute-bound but stateless across tokens. AFD is motivated by the observation that treating these modules as a monolithic block forfeits opportunities for hardware specialization and introduces computational redundancy and sub-optimal model dynamics.
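To make this asymmetry concrete, the following back-of-the-envelope Python sketch compares the per-token decode-time FLOP and memory-traffic profiles of an attention layer (reading a KV cache) and a dense FFN layer. All dimensions, the FP16 element size, and the batch size are illustrative assumptions rather than values from any specific model.

```python
# Rough per-token decode profiles; all numbers are illustrative assumptions.

def attention_profile(n_kv_heads: int, head_dim: int, context_len: int,
                      bytes_per_elem: int = 2):
    """Score/value part of one attention layer with a KV cache (per token)."""
    # q·K^T and attn·V matmuls over the cached context.
    flops = 2 * 2 * n_kv_heads * head_dim * context_len
    # Keys and values for the whole context must be read every decode step,
    # and this traffic is per-sequence, so batching does not amortize it.
    kv_bytes = 2 * n_kv_heads * head_dim * context_len * bytes_per_elem
    return flops, kv_bytes

def ffn_profile(d_model: int, d_ff: int, batch_tokens: int,
                bytes_per_elem: int = 2):
    """One dense FFN layer, up- and down-projection (per token)."""
    flops = 2 * 2 * d_model * d_ff
    # FFN weights are shared by every token in the batch, so the weight read
    # is amortized across batch_tokens: the layer is batch-friendly.
    weight_bytes = 2 * d_model * d_ff * bytes_per_elem / batch_tokens
    return flops, weight_bytes

attn_flops, attn_bytes = attention_profile(n_kv_heads=8, head_dim=128,
                                           context_len=4096)
ffn_flops, ffn_bytes = ffn_profile(d_model=4096, d_ff=14336, batch_tokens=128)

# Attention ends up with far lower arithmetic intensity (FLOPs per byte),
# favoring bandwidth-rich hardware; the FFN favors compute-rich hardware.
print(f"attention: {attn_flops / attn_bytes:.1f} FLOPs/byte")
print(f"ffn:       {ffn_flops / ffn_bytes:.1f} FLOPs/byte")
```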

Disaggregation aims to exploit this asymmetry by:

  • Scheduling attention and FFN operations on different hardware or subsystems with tailored parallelism and batching strategies.
  • Enabling separate feature extraction and processing pipelines for multiple object detection subtasks, as in the case of classification and localization branches (Liu et al., 2020).
  • Reducing redundancy and resource utilization inefficiency in distributed inference and serving systems by task-specific decomposition (StepFun et al., 25 Jul 2025, Liang et al., 26 Mar 2025).
  • Enhancing modularity for interpretability, debiasing, and expert parameter-sharing (Zhou et al., 31 May 2024, Yang et al., 12 May 2025).

2. Model-Internal Disaggregation in Few-Shot Object Detection

In the Adaptive Fully-Dual Network (AFD-Net) for few-shot object detection, disaggregation is encoded directly in the feature-processing and aggregation paths (Liu et al., 2020):

  • Dual Query Encoder (DQE): Produces separate region-of-interest (RoI) feature vectors $\{r_i^{cls}, r_i^{reg}\}_{i=1}^n = \mathcal{E}(\mathcal{R}(\mathcal{B}(Q)))$, splitting processing for classification and regression.
  • Dual Attention Generator (DAG): Yields subtask-specific class-attentive support features via convolutional and fully-connected encoders, adaptively fused by learnable weights $\lambda^t_{conv}$, $\lambda^t_{fc}$ for $t \in \{\text{cls}, \text{reg}\}$.
  • Dual Aggregator (DA): Jointly aggregates these representations for each subtask through separate elementwise operations and FC layers, enhancing task-specific meta-feature activation.
  • Adaptive Fusion Mechanism (AFM): Fuses information from both task branches to improve feature richness for both classification and localization.

Each subtask thus operates on separate features and feeds a distinct downstream detector, yielding enhanced performance, especially in data-constrained regimes, and demonstrating the utility of disaggregation for task specialization and feature expressiveness. A minimal sketch of the aggregation step follows.
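The PyTorch-style sketch below illustrates the Dual Aggregator computation $r_{i,j}^t = [f_m(r_i^t \odot a_j^t), f_s(r_i^t - a_j^t), r_i^t]$. The module layout, layer sizes, and the assumption that support features have already been broadcast to match the RoI features are illustrative choices, not the exact AFD-Net implementation.

```python
import torch
import torch.nn as nn

class DualAggregator(nn.Module):
    """Sketch of r^t = [f_m(r^t * a^t), f_s(r^t - a^t), r^t] for t in {cls, reg}."""

    def __init__(self, dim: int):
        super().__init__()
        # Separate FC heads per subtask and per aggregation path.
        self.f_m = nn.ModuleDict({t: nn.Linear(dim, dim) for t in ("cls", "reg")})
        self.f_s = nn.ModuleDict({t: nn.Linear(dim, dim) for t in ("cls", "reg")})

    def forward(self, r: dict, a: dict) -> dict:
        # r[t]: RoI features, a[t]: class-attentive support features,
        # both assumed (n, dim), with support features already paired per RoI.
        out = {}
        for t in ("cls", "reg"):
            mul = self.f_m[t](r[t] * a[t])    # element-wise product path
            sub = self.f_s[t](r[t] - a[t])    # element-wise difference path
            out[t] = torch.cat([mul, sub, r[t]], dim=-1)  # keep identity path
        return out

# Each aggregated feature then feeds its own classification or regression head.
agg = DualAggregator(dim=256)
r = {"cls": torch.randn(8, 256), "reg": torch.randn(8, 256)}
a = {"cls": torch.randn(8, 256), "reg": torch.randn(8, 256)}
features = agg(r, a)   # features["cls"].shape == (8, 768)
```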

3. System-Level AFD in Distributed Serving and Cost-Efficient Decoding

In large-scale LLM serving (e.g., Step-3 (StepFun et al., 25 Jul 2025) and Adrenaline (Liang et al., 26 Mar 2025)), AFD manifests as an explicit systems co-design for throughput and resource utilization:

  • Subsystem Specialization: The serving architecture places attention computations (memory-intensive, key–value cache–dependent) and FFN/MoE computations (compute-intensive, batch-friendly) on separate pools or groups of GPUs.
  • Pipelined Token Processing: Each decoded token transits sequentially through specialized hardware stages: attention, (network) communication, and FFN, with interleaved execution to minimize time-per-output-token (TPOT) latency.
  • Mathematical Isolation of Costs: Per-token decoding cost is split as follows (a worked numerical sketch appears after this list):

$$\text{Attention Cost} = \max(\text{FLOP}_{\text{Attn}} \times U_{\text{FLOP}},\ \text{Byte}_{\text{KV}} \times U_{\text{byte}}) + \text{FLOP}_{\text{Linear}} \times U_{\text{FLOP}}$$

$$\text{FFN Cost} = \text{FLOP}_{\text{FFN}} \times U_{\text{FLOP}}$$

  • Resource Reallocation: Hardware for attention is selected for memory bandwidth, while FFN hardware is optimized for arithmetic intensity and batch throughput.
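As a hedged illustration of how this cost split drives hardware choices, the sketch below plugs hypothetical utilization constants and per-token workload figures into the two formulas; none of the numbers are taken from Step-3 or any published system.

```python
# Illustrative per-token decode cost split; every constant is an assumption.
U_FLOP = 1 / 400e12   # seconds per FLOP on a hypothetical 400 TFLOP/s accelerator
U_BYTE = 1 / 3.0e12   # seconds per byte on a hypothetical 3 TB/s memory system

flop_attn   = 5.0e9   # attention score/value FLOPs over the current context
byte_kv     = 2.0e9   # KV-cache bytes read for the current context
flop_linear = 1.0e9   # attention-side linear projections (q, k, v, o)
flop_ffn    = 40.0e9  # FFN / routed MoE FLOPs for this token

# Attention cost: the larger of its compute time and its KV-cache read time,
# plus the linear projections; FFN cost is purely compute-bound.
attention_cost = max(flop_attn * U_FLOP, byte_kv * U_BYTE) + flop_linear * U_FLOP
ffn_cost = flop_ffn * U_FLOP

# With these numbers attention is bandwidth-bound (0.67 ms of KV reads vs.
# 0.0125 ms of compute), which is exactly the asymmetry AFD exploits by
# placing attention and FFN stages on differently provisioned hardware.
print(f"attention: {attention_cost*1e3:.3f} ms, ffn: {ffn_cost*1e3:.3f} ms")
```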

Empirical results show that AFD enables Step-3 to reach a decoding throughput of up to 4,039 tokens/s/GPU under a 50 ms TPOT SLA for 4K context, significantly surpassing strong baselines like DeepSeek-V3 and Qwen3 MoE 235B, with only 32–48 GPUs required (StepFun et al., 25 Jul 2025).

4. Analytical and Mathematical Formulations

Attention-FFN Disaggregation is not limited to hardware allocation but is often formalized at the level of architectural design:

| Component | Mathematical Formulation | Disaggregation Role |
|---|---|---|
| Dual Query Encoder | $\{r_i^{cls}, r_i^{reg}\} = \mathcal{E}(\mathcal{R}(\mathcal{B}(Q)))$ | Feature splitting per task |
| Dual Attention Gen. | $\{a_j^{cls}, a_j^{reg}\} = \mathcal{G}(\mathcal{B}([S_j, M_j]))$ | Subtask-aware attention |
| Dual Aggregator | $r_{i,j}^t = [f_m(r_i^t \odot a_j^t), f_s(r_i^t - a_j^t), r_i^t]$ | Task-specific aggregation |
| System pipeline cost | $\text{Total Cost} = \text{Attention Cost} + \text{FFN Cost}$ | Resource-targeted deployment |

This explicit separation in math and compute mapping underpins both the accuracy and throughput improvements observed.

5. Architectural and Algorithmic Innovations

AFD enables the implementation of advanced architectural strategies:

  • Expert Parameter Sharing: In UMoE, attention and FFN modules are unified as a set of shared experts dispatched via a top-$k$ router, leveraging the insight that token mixing (attention weights) and per-token processing (FFN) are amenable to a unified two-matrix structure (Yang et al., 12 May 2025); a generic router sketch follows this list.
  • Sparsity and Redundancy Control: Statistical top-$k$ operators let both FFN and attention layers restrict active computation to a small fraction of tokens/neurons, as in the Spark Transformer (You et al., 7 Jun 2025).
  • Interpretable Disaggregation for Debiasing: Masking attention heads and FFN vectors by their contribution to output bias in LLMs enables targeted mitigation strategies without retraining (Zhou et al., 31 May 2024).
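The sketch below shows a generic top-$k$ dispatch of the kind such expert-sharing and sparsification schemes rely on. The gating network, expert count, and normalization here are generic placeholders, not the specific UMoE or Spark Transformer mechanisms.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Generic top-k router: each token is dispatched to k of n_experts experts."""

    def __init__(self, dim: int, n_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (tokens, dim) -> mixing weights and indices of the k chosen experts.
        logits = self.gate(x)                                  # (tokens, n_experts)
        top_logits, top_experts = logits.topk(self.k, dim=-1)  # (tokens, k)
        weights = F.softmax(top_logits, dim=-1)                # normalize over the k picks
        return weights, top_experts

# A single pool of such experts could, in principle, back both the token-mixing
# (attention) and per-token (FFN) computations, which is the parameter-sharing
# idea described above.
router = TopKRouter(dim=512, n_experts=16, k=2)
tokens = torch.randn(32, 512)
w, idx = router(tokens)   # w: (32, 2) mixing weights, idx: (32, 2) expert indices
```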

6. Comparative Performance and Impact

Empirical studies across modalities demonstrate that AFD leads to concrete improvements:

  • In object detection, explicit dual-branch processing yields stronger novel-class generalization and allows the effective fusion of class- and localization-specific features, achieving higher mAP, especially in challenging few-shot regimes (Liu et al., 2020).
  • In distributed LLM inference, Step-3’s AFD underpins new cost-efficiency frontiers (up to 4,039 tokens/s/GPU) by optimizing hardware utilization per stage and context length (StepFun et al., 25 Jul 2025).
  • Under AFD, parameter counts, FLOPs, or inference wall-time may be reduced without sacrificing (and sometimes even improving) predictive performance, as is the case in EfficientASR (Wang et al., 30 Apr 2024) and Spark Transformer (You et al., 7 Jun 2025).

7. Broader Implications and Future Research Directions

Attention-FFN Disaggregation represents a paradigm shift from monolithic Transformer design toward modular, task- or hardware-aligned processing and opens several research directions:

  • Adaptive modularity: Dynamic routing and expert sharing based on task context, token characteristics, or resource profiling.
  • Interpretable control: Identification and manipulation of internal model components for transparency, debiasing, or reliability without retraining (Zhou et al., 31 May 2024).
  • Scalable system co-design: Joint optimization of model architecture and serving systems via pipeline parallelism, heterogeneous hardware deployment, and disaggregated compute mapping (StepFun et al., 25 Jul 2025).
  • Generalization across domains: Application of AFD principles in vision, language, speech, and multimodal frameworks, leveraging its ability to reconcile efficiency and expressiveness.

Potential limitations include increased system complexity, the need for precise communication scheduling and synchronization, and initial partitioning/prioritization overheads, which require careful empirical tuning and profiling. Nonetheless, the growing body of research demonstrates that AFD is indispensable for achieving state-of-the-art efficiency and flexibility in large-scale model deployment and advanced task specialization.