Series Attention–FFN (SAF)
- Series Attention–FFN (SAF) is a transformer block where a self-attention sublayer is sequentially followed by a feed-forward network, each with residual connections and normalization.
- It maintains representational diversity through small-norm attention residuals and an FFN that preserves isotropy by re-spreading token embeddings at each layer.
- SAF enables disaggregated LLM inference by assigning memory-bound attention and compute-bound FFN tasks to specialized hardware, optimizing overall system throughput.
Series Attention–Feed-Forward (SAF), also known as Attention–FFN, refers both to a canonical architectural block in stacked transformer models and to an emerging paradigm for disaggregating LLM serving workloads across specialized hardware resources. In its classical form, SAF denotes the ordered sequence in which a self-attention sublayer is followed by a feed-forward network (FFN) sublayer, each encapsulated by a residual connection and layer normalization. As a distribution strategy for inference acceleration, SAF (or Attention–FFN Disaggregation) separates memory-bound KV-cache-dominated attention computation from stateless compute-intensive FFN computation, enabling independent scaling and optimization of hardware resources.
1. Formal Architecture of the SAF Layer
Let $X^{(l)}$ denote the matrix of token embeddings at the input of layer $l$. The Series Attention–Feed-Forward (SAF) layer is composed of two ordered subcomponents: a multi-headed self-attention mechanism ($\mathrm{Attn}$), and a position-wise two-layer FFN ($\mathrm{FFN}$), each followed by addition with the input ("residual"), then layer normalization (LN):
$$Y^{(l)} = \mathrm{LN}\big(X^{(l)} + \mathrm{Attn}(X^{(l)})\big), \qquad X^{(l+1)} = \mathrm{LN}\big(Y^{(l)} + \mathrm{FFN}(Y^{(l)})\big)$$
Here, $\mathrm{Attn}$ is the self-attention function; $\mathrm{FFN}$ is the FFN. The update is strictly sequential: first, self-attention contextualizes input tokens; second, the FFN reprojects the resulting representations. This sequential structure is preserved in all major transformer variants, including RoBERTa-large and BERT-large-uncased (Sonkar et al., 2023).
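In pseudo-code, a single SAF block computes the attention output, adds it to the block input, and normalizes; it then applies the FFN to that result, adds, and normalizes again. The residual boundaries therefore fall around each sublayer in turn.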
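The block described above can be sketched in pure Python. This is a minimal illustration, assuming a single attention head, a ReLU activation, and layer normalization without learned gain or bias; the helper names (`saf_block`, `self_attention`, `ffn`, `rand_matrix`) and the random toy weights are hypothetical and not taken from the source.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    """Normalize each row (token) to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]          # transpose of k
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def ffn(x, w1, w2):
    """Position-wise two-layer FFN; ReLU is an assumed activation."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def saf_block(x, wq, wk, wv, w1, w2):
    """Attention sublayer, then FFN sublayer, each with residual + LN."""
    y = layer_norm(add(x, self_attention(x, wq, wk, wv)))
    return layer_norm(add(y, ffn(y, w1, w2)))

rng = random.Random(0)  # fixed seed so the toy run is repeatable

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

x = rand_matrix(3, 4)  # 3 tokens, model width 4
out = saf_block(x, rand_matrix(4, 4), rand_matrix(4, 4), rand_matrix(4, 4),
                rand_matrix(4, 8), rand_matrix(8, 4))
```

The output keeps the input shape, and because the final operation is layer normalization, every output row has approximately zero mean and unit variance.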
2. Theoretical Basis: Isotropy and Residual Norms
2.1 Role of FFN: Isotropy Preservation
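Deep stacks of self-attention, when deployed without FFN or residual additions, exhibit a collapse of token embeddings into near-uniform directions (loss of isotropy). This is formally measured by an isotropy score, the average pairwise cosine similarity of the token embeddings:
$$\mathcal{I}(X) = \frac{1}{n(n-1)} \sum_{i \neq j} \frac{x_i^\top x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}$$
where $x_i$ is the $i$-th token embedding and $n$ is the number of tokens. $\mathcal{I}(X) \approx 1$ indicates collapse. SAF's FFN re-spreads embeddings at each layer, maintaining low $\mathcal{I}(X)$, while omitting the FFN yields rapid loss of isotropy (Sonkar et al., 2023).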
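As an illustration of such a collapse measure, the sketch below scores a set of embeddings by their mean pairwise cosine similarity. That choice of score is an assumption made for this example, and `isotropy_score` and the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def isotropy_score(embeddings):
    """Mean cosine similarity over all ordered pairs i != j."""
    n = len(embeddings)
    total = sum(cosine(embeddings[i], embeddings[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

# All tokens point in the same direction: the score is approximately 1.
collapsed = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
# Tokens spread evenly around the origin: the score is approximately -1/3.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

collapsed_score = isotropy_score(collapsed)
spread_score = isotropy_score(spread)
```

Under this convention a score near 1 signals collapsed embeddings, while lower values indicate that token directions remain spread out.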
2.2 Residual Norm in Attention
The residual $\mathrm{Attn}(X)$ introduced by self-attention typically has much lower norm than the input $X$; for RoBERTa-large, empirically $\lVert \mathrm{Attn}(X) \rVert \ll \lVert X \rVert$ across layers. Thus, each attention step constitutes a small perturbation ("nudge"), leaving most representational diversity maintenance to the FFN (Sonkar et al., 2023).
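The size of such a nudge can be checked with a norm ratio. The sketch below assumes the Frobenius norm as the matrix norm; `residual_ratio` is a hypothetical helper and the matrices are made-up toy values, not measurements from any model.

```python
import math

def frobenius_norm(m):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in m for v in row))

def residual_ratio(x, residual):
    """Ratio of residual norm to input norm; well below 1 means a small nudge."""
    return frobenius_norm(residual) / frobenius_norm(x)

x = [[3.0, 4.0], [0.0, 0.0]]          # toy input with norm 5
residual = [[0.3, 0.4], [0.0, 0.0]]   # toy attention output with norm 0.5
ratio = residual_ratio(x, residual)   # approximately 0.1
```

In practice the same ratio would be computed per layer on real activations to see whether attention outputs stay small relative to their inputs.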
3. SAF in Disaggregated LLM Inference
In transformer inference workloads, SAF—under the name Attention–FFN Disaggregation—denotes the explicit allocation of attention computation and FFN computation to separate hardware resources (Song et al., 29 Jan 2026). The motivation arises from divergent resource profiles: attention is stateful and memory-bound (KV cache operations), while FFN is stateless and FLOP-bound (intensive MLPs), particularly when batched.
3.1 System Topology
A standard deployment structure is the $r$A–1F topology: $r$ parallel attention workers (A-instances) stream data into a single FFN worker (F-instance). The decode cycle for one step comprises:
- Each A-worker computes attention over its microbatch by reading its current KV cache.
- All A-workers transmit activations to the FFN node.
- FFN processes the aggregated batch.
- Results are returned to original attention workers.
The slowest of the attention, communication, and FFN phases is the bottleneck and determines system throughput.
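The data flow of the four steps above can be sketched as a gather and scatter around a single FFN call. This is a schematic only: `decode_step` is a hypothetical function, and `attend` and `ffn` are stand-in callables rather than real attention or MLP kernels.

```python
def decode_step(microbatches, attend, ffn):
    """One decode step: attend per worker, gather, run FFN, scatter back."""
    # 1. Each A-worker runs attention on its own microbatch.
    attended = [[attend(item) for item in mb] for mb in microbatches]
    # 2. All A-workers send activations to the F-worker as one aggregated batch.
    sizes = [len(mb) for mb in attended]
    aggregated = [act for mb in attended for act in mb]
    # 3. The single F-worker processes the aggregated batch.
    processed = [ffn(act) for act in aggregated]
    # 4. Results are split and returned to the originating A-workers.
    results, start = [], 0
    for size in sizes:
        results.append(processed[start:start + size])
        start += size
    return results

# Three A-workers with microbatches of different sizes, using toy stand-ins.
result = decode_step([[1, 2], [3], [4, 5, 6]],
                     attend=lambda v: v + 1,
                     ffn=lambda v: v * 2)
# result == [[4, 6], [8], [10, 12, 14]]
```

Keeping the per-worker sizes is what allows each result to return to the worker that holds the matching KV cache.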
4. Analytical Framework for Sizing and Throughput
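The system's efficiency depends on the provisioning ratio $r$ (A-instances per F-instance), which balances memory (attention) and compute (FFN) resources:
- Service times per decode step are modeled for three phases:
  - Attention: $t_A$, the time for an A-worker to attend over its microbatch's KV cache.
  - Communication: $t_C$, the time to move activations between the A-workers and the F-worker.
  - FFN: $t_F$, the time for the F-worker to process the aggregated batch.
- Each request has prefill length $P$ (mean $\mu_P$) and decode length $D$ (geometric, mean $\mu_D$).
- Batch size per A-worker is $B$; total context load per step is $T$.
- Average token load over a horizon of $H$ steps is denoted $\bar{T}_H$.
Throughput per bundle (tokens per unit time per instance):
$$\mathrm{Thr}(r) = \frac{rB}{(r+1)\,\tau(r)}$$
where $\tau(r) = \max\{t_A,\ t_C,\ t_F\}$ is the step time set by the slowest phase.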
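To make the throughput calculation concrete, the sketch below assumes service times that are linear in load and a step time equal to the slowest phase. The linear forms, the coefficient names (`a_A`, `b_A`, and so on), and all numeric values are illustrative assumptions, not values from the source.

```python
def throughput_per_instance(r, batch, token_load, coef):
    """Tokens per unit time per instance for a bundle of r A-workers and 1 F-worker.

    Assumed (hypothetical) linear service times:
      attention      a_A * token_load + b_A   (per A-worker)
      communication  a_C * r * batch + b_C
      FFN            a_F * r * batch + b_F
    The step time is taken to be the slowest of the three phases.
    """
    t_attn = coef["a_A"] * token_load + coef["b_A"]
    t_comm = coef["a_C"] * r * batch + coef["b_C"]
    t_ffn = coef["a_F"] * r * batch + coef["b_F"]
    step_time = max(t_attn, t_comm, t_ffn)
    # r * batch tokens are produced per step by r + 1 instances.
    return r * batch / ((r + 1) * step_time)

# Made-up coefficients: attention takes 1.0 per step, FFN takes 0.01 per request.
toy_coef = {"a_A": 0.001, "b_A": 0.0, "a_C": 0.0, "b_C": 0.1, "a_F": 0.01, "b_F": 0.0}
thr_small = throughput_per_instance(4, batch=10, token_load=1000, coef=toy_coef)
thr_large = throughput_per_instance(20, batch=10, token_load=1000, coef=toy_coef)
```

With these toy numbers, four A-workers leave the bundle attention-bound (about 8 tokens per unit time per instance), while twenty make it FFN-bound and lower per-instance throughput (about 4.76).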
4.1 Closed-Form Optimum
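Three regimes yield stationary attention/FFN ratios, one for each phase that can be the bottleneck:
| Regime | Throughput Maximizer |
|---|---|
| Attention–bound | $r_A^*$ |
| Comm–bound | $r_C^*$ |
| FFN–bound | $r_F^*$ |
The overall optimum $r^*$ follows in closed form from these regime-wise maximizers (Song et al., 29 Jan 2026).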
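When closed-form expressions are not at hand, the optimum ratio can also be located numerically by sweeping integer ratios. The sketch below is a brute-force stand-in for the closed-form rule, not a reproduction of it; `best_ratio`, `toy_throughput`, and every constant are hypothetical.

```python
def best_ratio(throughput, candidates):
    """Return the candidate ratio with the highest throughput."""
    return max(candidates, key=throughput)

def toy_throughput(r, batch=10, t_attn=1.0, t_comm=0.1, ffn_per_request=0.01):
    """Toy model in which only the FFN time grows with r (illustrative numbers)."""
    step_time = max(t_attn, t_comm, ffn_per_request * r * batch)
    return r * batch / ((r + 1) * step_time)

r_star = best_ratio(toy_throughput, range(1, 41))
```

In this toy setting the sweep selects a ratio of 10, the point where the FFN time catches up with the attention time: below it the step is attention-bound, and above it the FFN phase sets the step time.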
4.2 Blocking and Idle Ratios
If $r < r^*$, FFN idles; if $r > r^*$, attention idles, where $r$ is the provisioned attention-to-FFN ratio and $r^*$ its optimum. Empirical simulation confirms that tuning $r$ near $r^*$ minimizes wasted cycles; excessive parallelism on attention increases straggler-induced stalls.
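Idle time on each side can be estimated from the phase times. The sketch below assumes the phases overlap perfectly so that each step lasts as long as its slowest phase; `idle_ratios` and the sample timings are hypothetical.

```python
def idle_ratios(t_attn, t_comm, t_ffn):
    """Fraction of each step that attention and FFN workers spend waiting."""
    step_time = max(t_attn, t_comm, t_ffn)
    return {
        "attention_idle": 1 - t_attn / step_time,
        "ffn_idle": 1 - t_ffn / step_time,
    }

under_provisioned = idle_ratios(t_attn=1.0, t_comm=0.1, t_ffn=0.4)  # too few A-workers
over_provisioned = idle_ratios(t_attn=1.0, t_comm=0.1, t_ffn=2.0)   # too many A-workers
```

In the first case the F-worker waits for about 60% of each step; in the second the A-workers wait for half of it.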
5. Empirical Comparison: SAF versus Parallel Designs
Large-scale experiments on RoBERTa-large and BERT-large-uncased pretraining followed by GLUE fine-tuning demonstrate that SAF and PAF (Parallel Attention–FFN, which applies attention and FFN in parallel and merges outputs) achieve nearly indistinguishable performance, with average scores differing by at most about 0.6 points across six GLUE tasks (Sonkar et al., 2023). The following table summarizes representative results:
| Model | MRPC | STS-B | SST-2 | QNLI | QQP | MNLI | Avg. |
|---|---|---|---|---|---|---|---|
| RoBERTa-large (SAF) | 90.9 | 92.4 | 96.4 | 94.7 | 92.2 | 90.2 | 92.8 |
| RoBERTa-large (PAF) | 90.5 | 91.0 | 96.2 | 94.3 | 91.7 | 89.3 | 92.2 |
| BERT-large (SAF) | 85.0 | 89.2 | 93.5 | 92.2 | 91.4 | 86.6 | 89.6 |
| BERT-large (PAF) | 86.8 | 88.8 | 93.5 | 91.4 | 91.2 | 85.5 | 89.5 |
This suggests that the sequential attention→FFN ordering is not strictly required, provided the FFN continues to maintain isotropy and attention residuals remain small.
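The size of the SAF and PAF gap can be rechecked directly from the per-task scores in the table. The snippet below copies those scores; only the helper names (`mean`, `average_gap`) are introduced here.

```python
# Per-task GLUE scores in table order: MRPC, STS-B, SST-2, QNLI, QQP, MNLI.
scores = {
    "RoBERTa-large": {"SAF": [90.9, 92.4, 96.4, 94.7, 92.2, 90.2],
                      "PAF": [90.5, 91.0, 96.2, 94.3, 91.7, 89.3]},
    "BERT-large": {"SAF": [85.0, 89.2, 93.5, 92.2, 91.4, 86.6],
                   "PAF": [86.8, 88.8, 93.5, 91.4, 91.2, 85.5]},
}

def mean(values):
    return sum(values) / len(values)

def average_gap(model):
    """SAF average minus PAF average, in GLUE points."""
    return mean(scores[model]["SAF"]) - mean(scores[model]["PAF"])

gaps = {model: average_gap(model) for model in scores}
```

Both average gaps come out below one point (roughly 0.63 for RoBERTa-large and 0.12 for BERT-large), in line with the averages reported in the table.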
6. Scaling Laws and Operational Guidelines
System-level scaling recommendations for SAF in disaggregated serving include (Song et al., 29 Jan 2026):
- Batch size $B$: Increasing $B$ favors FFN efficiency; the optimal attention-to-FFN ratio $r^*$ typically decreases sublinearly with $B$.
- Context length: Longer contexts demand higher $r^*$, and thus more attention-side resources.
- Model size: Both attention and FFN time constants must be measured empirically for each model.
- Benchmark shape constants on production hardware and dynamically tune the ratio $r$ to match demand and reduce idle cycles.
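One simple way to obtain such constants from benchmark runs, assuming a phase's service time is roughly linear in its load, is an ordinary least-squares fit. The linear form is an assumption, `fit_linear` is a hypothetical helper, and the timings are synthetic.

```python
def fit_linear(loads, times):
    """Least-squares fit of times against loads; returns (slope, intercept)."""
    n = len(loads)
    mean_x = sum(loads) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in loads)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(loads, times))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic timings that follow time = 2.0 * load + 0.5 exactly.
loads = [1.0, 2.0, 3.0, 4.0]
times = [2.5, 4.5, 6.5, 8.5]
slope, intercept = fit_linear(loads, times)
```

The fitted slope and intercept play the role of per-unit and fixed costs for one phase; repeating the fit for attention, communication, and FFN gives the inputs a throughput model needs.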
Practical operation requires that attention and FFN provisioning stay within 10–20% of the optimal ratio; large deviations can halve system throughput.
7. Implications and Prospects
The SAF organization in transformer models enforces a dynamic interplay: attention sublayers make minimal contextual shifts ("nudges") per token, while FFN sublayers maintain the representational diversity essential for dense information flow across layers. The empirical equivalence of SAF and PAF underscores that what is fundamental is the combination of small-norm attention perturbation and a sufficiently expressive, spreading FFN, not the specific ordering.
In inference infrastructure, SAF-style disaggregation unlocks tractable analytical throughput optimization, empirically validated with trace-calibrated simulation (Song et al., 29 Jan 2026). A plausible implication is that future LLM serving systems will increasingly adopt such microarchitectural separation, coordinated by real-time scheduling to guarantee resource balance and efficiency.
Overall, the SAF paradigm delineates both the foundational logic of transformer blocks and an operationally significant strategy for efficiently deploying LLMs at scale, with analytical tooling now available for system tuning and performance assurance.