Parallel Attention-FFN (PAF)
- The paper introduces PAF, highlighting the simultaneous execution of self-attention and FFN sub-blocks, which reduces normalization overhead and boosts parallelism.
- PAF is defined as a transformer architectural variant where both attention and FFN operate in parallel without extra parameters, preserving isotropy in token embeddings.
- Empirical results demonstrate that PAF maintains near-equivalent benchmark performance while enabling flexible GPU disaggregation to meet low-latency and large-context requirements.
Parallel Attention-FFN (PAF) encompasses both a transformer architectural motif defined by the concurrent computation of self-attention and feed-forward sub-blocks (at the layer or kernel level), and a large-scale inference system primitive in which attention and FFN operations execute on disjoint accelerator pools (operator-level disaggregation). Two independent lines of research have developed PAF: (1) as a layer architectural variant for efficient model training with reduced normalization overhead and improved parallelism (Sonkar et al., 2023); and (2) as an MoE serving system primitive (often termed Attention–FFN Disaggregation or AFD) aimed at harnessing the distinct memory and compute characteristics of attention and MoE-FFN sub-blocks for optimal GPU allocation and cluster efficiency at scale (Wu et al., 27 May 2026, Liu et al., 10 Feb 2026). The following description synthesizes both developments, their operational principles, empirical results, and implications for large-model deployment and system co-design.
1. Formal Definition and Architectural Variants
The Parallel Attention–FFN layer replaces the canonical series arrangement of attention and FFN sub-blocks (with their own intervening layer-normalizations) as follows. Let $X_l \in \mathbb{R}^{n \times d}$ be the input to transformer layer $l$, with $n$ tokens and hidden dimension $d$. In the Series Attention–FFN (SAF) paradigm: $Y_l = \mathrm{LN}(X_l + \mathrm{Attn}(X_l))$, $X_{l+1} = \mathrm{LN}(Y_l + \mathrm{FFN}(Y_l))$, where $\mathrm{Attn}$ is multi-head self-attention, $\mathrm{FFN}$ the two-layer FFN, and $\mathrm{LN}$ denotes layer normalization.
In the PAF formulation, both attention and FFN operate in parallel on $X_l$, yielding: $X_{l+1} = \mathrm{LN}(X_l + \mathrm{Attn}(X_l) + \mathrm{FFN}(X_l))$. This reduces normalization overhead (from two layer normalizations to one per layer) and, critically, allows true parallel execution of the attention and FFN modules. No new trainable parameters or projections are introduced. The same input seeds both $\mathrm{Attn}$ and $\mathrm{FFN}$, whose outputs are summed and normalized once (Sonkar et al., 2023).
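The contrast between the series and parallel arrangements can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes single-head attention, a ReLU FFN, and a layer normalization without learned scale or shift, and the helper names (`matmul`, `init_params`, `saf_layer`, `paf_layer`) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two row-major nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def add(*mats):
    """Element-wise sum of any number of equally shaped matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*mats)]

def layer_norm(x, eps=1e-5):
    """Per-token normalization; learned scale/shift omitted (assumption)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention (simplifying assumption)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    k_t = [list(col) for col in zip(*k)]
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    return matmul([softmax(row) for row in scores], v)

def ffn(x, w1, w2):
    """Two-layer feed-forward network with ReLU."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def saf_layer(x, p):
    """Series: the FFN consumes the normalized attention output (two LNs)."""
    y = layer_norm(add(x, attention(x, p["wq"], p["wk"], p["wv"])))
    return layer_norm(add(y, ffn(y, p["w1"], p["w2"])))

def paf_layer(x, p):
    """Parallel: attention and FFN both read x; one LN over the summed result."""
    a = attention(x, p["wq"], p["wk"], p["wv"])
    f = ffn(x, p["w1"], p["w2"])  # independent of `a`, so it can run concurrently
    return layer_norm(add(x, a, f))

def init_params(d, d_ff, seed=0):
    """Random toy weights; shapes only, no training implied."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"wq": mat(d, d), "wk": mat(d, d), "wv": mat(d, d),
            "w1": mat(d, d_ff), "w2": mat(d_ff, d)}

params = init_params(d=4, d_ff=8)
x = [[0.5, -1.0, 2.0, 0.1], [1.5, 0.3, -0.7, 0.9], [-0.2, 0.8, 0.4, -1.1]]
y_saf = saf_layer(x, params)
y_paf = paf_layer(x, params)
```

Both functions take the same parameter dictionary, which mirrors the point that the parallel form adds no weights; the only structural differences are the FFN's input and the number of normalizations.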
At the system level, PAF/AFD denotes breaking each transformer block into two distinct operator groups, one responsible for attention variants (memory-bound KV-cache lookups, context-dependent matmuls) and the other for MoE-FFN kernels (compute-bound dense or expert GEMMs), each assigned to separate GPU pools to reflect their heterogeneous resource demands (Wu et al., 27 May 2026). This operator-level split allows independent microbatching, pipelining, and resource allocation.
2. Theoretical Rationale and Empirical Support
Two central findings underlie the viability of the PAF layer architecture (Sonkar et al., 2023):
- FFN maintains isotropy and prevents degeneration: Without an FFN, transformer blocks rapidly collapse token embeddings to a near rank-1 manifold. The isotropy measure of the token-embedding matrix remains low and stable across all layers in both SAF and PAF, but trends toward its degenerate extreme in pure-attention stacks.
- Attention residuals are small compared to the input: Denoting the layer input by $X_l$ and the attention residual by $\mathrm{Attn}(X_l)$, empirical results show $\|\mathrm{Attn}(X_l)\| \ll \|X_l\|$ across layers. Thus, the parallel-input FFN receives nearly the same statistical signal as it would in the serial case.
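Both findings correspond to quantities that are cheap to compute on a layer's activations. The sketch below is illustrative only: `residual_ratio` compares Frobenius norms of the attention output and the layer input, and `mean_pairwise_cosine` is a stand-in collapse indicator chosen here as an assumption (it approaches 1 when all token embeddings point the same way). It is not necessarily the isotropy measure used by Sonkar et al.

```python
import math

def frobenius_norm(m):
    """Frobenius norm of a row-major nested-list matrix."""
    return math.sqrt(sum(v * v for row in m for v in row))

def residual_ratio(x, attn_out):
    """||Attn(X)|| / ||X||; small values mean attention barely perturbs X."""
    return frobenius_norm(attn_out) / frobenius_norm(x)

def mean_pairwise_cosine(embeddings):
    """Mean cosine similarity over distinct token pairs.

    Hypothetical collapse proxy: values near 1 indicate near rank-1 embeddings.
    Assumes at least two tokens and no all-zero rows.
    """
    norms = [math.sqrt(sum(v * v for v in row)) for row in embeddings]
    total, pairs = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            total += dot / (norms[i] * norms[j])
            pairs += 1
    return total / pairs

collapsed = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]   # all rows are parallel
spread = [[1.0, 0.0], [0.0, 1.0]]                  # orthogonal rows
collapse_score = mean_pairwise_cosine(collapsed)
spread_score = mean_pairwise_cosine(spread)
ratio = residual_ratio([[3.0, 4.0]], [[0.3, 0.4]])
```

Tracking these two numbers per layer is one way to check whether a given model satisfies the assumptions that make the parallel arrangement a close substitute for the serial one.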
Extensive pre-training and fine-tuning experiments (on GLUE, with RoBERTa-large and BERT-large-uncased) show negligible performance drops with PAF: for BERT-large-uncased, the mean GLUE score falls by only 0.1 points (89.6 SAF vs 89.5 PAF), and for RoBERTa-large by 0.6 points (92.8 SAF vs 92.2 PAF), with PAF trained on much less data (Sonkar et al., 2023).
In inference-disaggregation contexts, operator-level PAF/AFD is motivated by the dichotomy that attention compute becomes memory-bandwidth-bound as context increases, while MoE-FFN modules are compute-bound and parallelizable across experts. Empirical studies reveal that only AFD configurations can achieve strict latency (TTFT, TPOT) SLOs for large-context, low-latency workloads at system scale; non-AFD schemes fail to reach feasibility even with maximal GPU allocations (Wu et al., 27 May 2026).
3. Disaggregation Levels, Design Space, and Scheduling
There are three principal levels of disaggregation for LLM inference pipelines (Wu et al., 27 May 2026):
- Chunked prefill aggregation: Prefill operations are broken into chunks to smooth build and drain overhead, but attention and FFN are still co-executed as a monolithic stage.
- Prefill–decode (P/D) disaggregation: The prefill (building static prefix KV-cache) and decode (generating new tokens) phases are scheduled on different GPU groups to exploit distinct resource footprints.
- Operator-level PAF/AFD: Attention and (MoE-)FFN operations are split at the operator level, each assigned to dedicated GPU pools, allowing independent optimization and pipelined overlap.
Within the operator-level disaggregation design space, key parameters include:
- Workload: input/output sequence length, KV-cache reuse (e.g., agentic coding workloads with 524K-token prefixes), per-user TTFT/TPOT constraints.
- Resource allocation: total GPUs per replica, attention vs FFN split ratios (A:F), prefill vs decode group sizes, and microbatch pipeline depth.
- Interconnect topology: binding high-frequency per-layer AFD traffic (e.g., MoE dispatch/combines) to highest-bandwidth NVLink (scale-up), routing rare KV-cache transfers over InfiniBand (scale-out).
Scheduling strategies apply the “rate-matching” principle: select A:F such that $A{:}F \approx c_A{:}c_F$, where $c_A$ and $c_F$ are per-token attention and FFN costs, respectively, so that neither pool stalls the other. For most models, cheap attention variants require only a minimal attention pool, but dense, high-KV attention necessitates higher attention allocations (e.g., Qwen3 chat, with an attention share in the tens of percent, and Nemotron3 agentic at 128 GPUs). Microbatch overlap is tuned for maximal duplex throughput.
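Rate matching can be illustrated with a small search over integer splits. The sketch assumes that each pool's per-token time scales as cost divided by GPU count (ideal scaling, ignoring communication), which is a simplification introduced here; `rate_matched_split` is a hypothetical helper, not an API from the cited work.

```python
def rate_matched_split(total_gpus, attn_cost, ffn_cost):
    """Pick (n_attn, n_ffn) that minimizes the slower pool's per-token time.

    Assumes ideal scaling: per-token time of a pool = cost / number of GPUs.
    """
    if total_gpus < 2:
        raise ValueError("need at least one GPU per pool")
    best = None
    for n_attn in range(1, total_gpus):
        n_ffn = total_gpus - n_attn
        bottleneck = max(attn_cost / n_attn, ffn_cost / n_ffn)
        if best is None or bottleneck < best[0]:
            best = (bottleneck, n_attn, n_ffn)
    return best[1], best[2]

# FFN is three times as expensive per token as attention: expect a 1:3 split.
split_small = rate_matched_split(8, attn_cost=1.0, ffn_cost=3.0)
# Equal costs on a 128-GPU replica: expect an even split.
split_even = rate_matched_split(128, attn_cost=1.0, ffn_cost=1.0)
```

Because GPU counts are integers, the chosen split only approximates the ideal cost ratio; the gap between the two pools' per-token times is the imbalance that discrete allocation leaves behind.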
4. System Performance Modeling and Empirical Results
The pipelined latency model for microbatched AFD serving expresses per-pass latency in terms of the number of microbatches and the latency of the slowest pipeline stage. TTFT and TPOT are computed by summing across prefill and decode, accounting for cross-microbatch overlap (Wu et al., 27 May 2026).
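A common way to model such a pipeline is the textbook fill-and-drain approximation: the first microbatch traverses every stage, and each further microbatch adds one slot of the slowest stage. The sketch below uses that form as an assumption; the exact expression in the cited work may differ, and `pipelined_latency` is a hypothetical name.

```python
def pipelined_latency(stage_times, num_microbatches):
    """Fill-and-drain estimate: sum of stages + (M - 1) * slowest stage.

    stage_times: per-microbatch latency of each pipeline stage
                 (e.g. attention, dispatch, FFN, combine), in any time unit.
    num_microbatches: M, the number of microbatches in flight.
    """
    if not stage_times or num_microbatches < 1:
        raise ValueError("need at least one stage and one microbatch")
    return sum(stage_times) + (num_microbatches - 1) * max(stage_times)

def serial_latency(stage_times, num_microbatches):
    """No-overlap baseline: every microbatch runs all stages back to back."""
    return num_microbatches * sum(stage_times)

stages = [2.0, 3.0]                     # illustrative attention and FFN times
overlapped = pipelined_latency(stages, 4)
unoverlapped = serial_latency(stages, 4)
```

Under this approximation, balancing the stages (so that the slowest stage is no slower than necessary) matters more as the number of microbatches grows, which is the same pressure that motivates rate matching.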
On 128×NVIDIA B200 clusters, AFD consistently sustains high system throughput (reported in K tokens/s) for DeepSeek-V3.2, with strict (TTFT, TPOT) SLOs met for diverse workloads (chat: TTFT < 50 ms; coding: TTFT < 100 ms; agentic coding: TTFT < 150 ms). For large-context and agentic workloads, non-AFD layouts exceeded per-GPU memory limits (>180 GiB), while AFD/PAF reduced per-GPU memory to ~165 GiB and enabled feasible serving. Pareto analyses show that chunked prefill aggregation wins on raw throughput when latency is unconstrained, but only disaggregation+AFD architectures fall on the low-latency efficiency frontier (Wu et al., 27 May 2026).
5. Communication, Hardware Utilization, and Scalability Limits
A detailed roofline-based analysis extending compute–memory–communication constraints clarifies the AFD/PAF scaling limits (Liu et al., 10 Feb 2026). For an MoE model, hardware FLOPS utilization (HFU) is bounded by both peak compute and effective communication bandwidth, the latter determined by the scale-out (InfiniBand) and scale-up (NVLink) links; which of the two bounds binds depends on the operator's arithmetic intensity.
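The shape of such a bound can be seen from the classic roofline relation, in which attainable throughput is the smaller of peak compute and arithmetic intensity times bandwidth. The sketch below applies that generic relation to communication bandwidth as an assumption; it is not the extended model from the cited analysis, and the parameter names are illustrative.

```python
def hfu_upper_bound(peak_flops, eff_bandwidth, arithmetic_intensity):
    """Generic roofline ceiling on hardware FLOPS utilization.

    peak_flops:           peak compute, FLOP/s
    eff_bandwidth:        effective communication bandwidth, bytes/s
    arithmetic_intensity: FLOPs performed per byte communicated
    Returns a fraction in (0, 1].
    """
    if min(peak_flops, eff_bandwidth, arithmetic_intensity) <= 0:
        raise ValueError("all inputs must be positive")
    attainable = min(peak_flops, arithmetic_intensity * eff_bandwidth)
    return attainable / peak_flops

# Communication-bound case: intensity too low to saturate compute.
bound_low = hfu_upper_bound(peak_flops=1000.0, eff_bandwidth=10.0,
                            arithmetic_intensity=50.0)
# Compute-bound case: intensity high enough that bandwidth no longer limits.
bound_high = hfu_upper_bound(peak_flops=1000.0, eff_bandwidth=10.0,
                             arithmetic_intensity=200.0)
```

In this simplified view, fine-grained experts lower the FLOPs done per dispatched byte, which pushes an operator into the communication-bound region unless the interconnect bandwidth rises with it.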
An inherent “dead-zone” arises: increasing the FFN instance count does not boost HFU once it outpaces the token dispatch bandwidth ceiling; the operator's active time is squeezed by rising communication latency, especially with fine-grained expert partitioning and limited interconnect. On standard clusters (e.g., H800), AFD HFU tops out at ~33% (vs ~60% for EP), never outperforming expert parallelism. With superpod-class hardware (GB300), coarse experts, and lowered sparsity, AFD can reach HFU ≈ 66%, outperforming EP’s ≈60%. Thus, AFD’s system-level viability is contingent on model granularity and interconnect class (Liu et al., 10 Feb 2026).
Compared to EP, AFD suffers additional imbalance penalties from the discrete allocation of attention/FFN nodes under dynamic loads, with analytic penalties demonstrated in (Liu et al., 10 Feb 2026, Eq. 22–23); flexible pod partitioning can partially alleviate this.
6. Design Principles, Limitations, and Deployment Guidance
The following design principles and constraints emerge from comprehensive study:
- When to use operator-level PAF/AFD: Employ operator-level disaggregation when (a) per-request latency (TTFT/TPOT) is a binding constraint, or (b) model, context, or prefix size push per-GPU memory beyond feasible limits for monolithic or phase-disaggregated deployments.
- Attention–FFN split: Allocate attention and FFN GPU pools via rate-matching per-token costs, expanding the attention slice proportionally as attention cost or context grows.
- Network-aware topology: Map frequent AFD/MoE dispatches to intra-node NVLink; route infrequent large KV-cache transfers over scale-out fabric. Co-locating prefill–decode slices reduces KV transfer latency (an improvement over disjoint-node setups).
- Limitations: Operator-level AFD/PAF increases per-replica GPU consumption (decreasing data-parallel concurrency under fixed resources) and performs optimally only on fast, full-duplex interconnects. For throughput-centric or loose-latency workloads, chunked or P/D aggregation may suffice or outperform. Discrete partitioning for attention/FFN pools can introduce additional resource imbalance relative to EP.
- Model/hardware trade-offs: AFD/PAF is preferred for models with coarse expert granularity, low sparsity, and access to superpod-scale bandwidth. For fine-grained MoE or bandwidth-limited cluster deployments, large-scale EP remains superior (Liu et al., 10 Feb 2026, Wu et al., 27 May 2026).
7. Research Directions and Open Questions
Open areas include:
- Scaling and generalization: Current PAF studies are primarily on English models up to 800M parameters; the scaling behavior on multi-billion parameter or multilingual architectures is an open topic (Sonkar et al., 2023).
- FFN alternatives: Given that the FFN’s principal empirically validated utility is isotropy preservation, there is scope for replacement with lighter “isotropy-enforcing” modules (e.g., orthogonal transformations, noise injection) (Sonkar et al., 2023).
- Serial vs parallel sub-block utility: Under what circumstances, especially in generation or code-modeling tasks, does the additional signal from feeding attention-modulated tokens into FFN (as in SAF) materially benefit downstream performance?
- Broader microarchitecture co-design: The insight that compute/memory bandwidth heterogeneity should drive operator placement, resource partition, and pipeline orchestration can inspire more radical modular decompositions as new model and hardware generations emerge.
In summary, Parallel Attention–FFN (PAF) represents both a transformer architectural adaptation for efficient parallelization and an inference-disaggregation primitive for Mixture-of-Experts LLM serving. Its efficacy and gains versus alternative approaches depend crucially on model detail, workload, and cluster interconnect, and it serves as a blueprint for system-model codesign targeting low-latency, large-context, high-throughput deployments (Sonkar et al., 2023, Wu et al., 27 May 2026, Liu et al., 10 Feb 2026).