Series Attention–FFN (SAF)

Updated 11 June 2026

Series Attention–FFN (SAF) is a transformer block where a self-attention sublayer is sequentially followed by a feed-forward network, each with residual connections and normalization.
It maintains representational diversity through small-norm attention residuals and an FFN that preserves isotropy by re-spreading token embeddings at each layer.
SAF enables disaggregated LLM inference by assigning memory-bound attention and compute-bound FFN tasks to specialized hardware, optimizing overall system throughput.

Series Attention–Feed-Forward (SAF), also known as Attention–FFN, refers both to a canonical architectural block in stacked transformer models and to an emerging paradigm for disaggregating LLM serving workloads across specialized hardware resources. In its classical form, SAF denotes the ordered sequence in which a self-attention sublayer is followed by a feed-forward network (FFN) sublayer, each encapsulated by a residual connection and layer normalization. As a distribution strategy for inference acceleration, SAF (or Attention–FFN Disaggregation) separates memory-bound KV-cache-dominated attention computation from stateless compute-intensive FFN computation, enabling independent scaling and optimization of hardware resources.

1. Formal Architecture of the SAF Layer

Let $X_l\in\mathbb{R}^{n\times d}$ denote the matrix of $n$ token embeddings at the input of layer $l$. The Series Attention–Feed-Forward (SAF) layer is composed of two ordered subcomponents: a multi-head self-attention mechanism ($A_l$) and a position-wise two-layer FFN ($F_l$), each followed by addition with its input (the residual connection) and then layer normalization (LN):

$\begin{aligned} Y_l &= \mathrm{LayerNorm}\bigl(X_l + A_l(X_l)\bigr) \\ X_{l+1} &= \mathrm{LayerNorm}\bigl(Y_l + F_l(Y_l)\bigr) \end{aligned}$

Here, $A_l : \mathbb{R}^{n\times d}\to\mathbb{R}^{n\times d}$ is the self-attention function; $F_l: \mathbb{R}^{n\times d}\to\mathbb{R}^{n\times d}$ is the FFN. The update is strictly sequential: first, self-attention contextualizes input tokens; second, the FFN reprojects the resulting representations. This sequential structure is preserved in all major transformer variants, including RoBERTa-large and BERT-large-uncased (Sonkar et al., 2023).

In pseudo-code, a single SAF block computes the attention output, adds it to the block input, and normalizes; it then repeats the same add-and-normalize pattern with the FFN. The residual boundaries fall immediately before each LayerNorm.
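The two-step update can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: it assumes a single attention head, no masking, no biases, a ReLU FFN, and a LayerNorm without learned gain or bias. The helper names (`matmul`, `saf_block`, `random_matrix`) and the toy dimensions are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices (the residual add)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, wq, wk, wv, wo):
    """A_l: single-head scaled dot-product self-attention (assumed form)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    k_t = [list(col) for col in zip(*k)]            # transpose of k
    scale = math.sqrt(len(q[0]))
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(matmul(weights, v), wo)

def ffn(y, w1, w2):
    """F_l: position-wise two-layer FFN with ReLU (biases omitted)."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(y, w1)]
    return matmul(hidden, w2)

def saf_block(x, params):
    y = layer_norm(add(x, attention(x, *params["attn"])))  # Y_l
    return layer_norm(add(y, ffn(y, *params["ffn"])))      # X_{l+1}

def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d, d_ff = 4, 8, 32                      # toy sizes, chosen arbitrarily
params = {
    "attn": [random_matrix(d, d, rng) for _ in range(4)],
    "ffn": [random_matrix(d, d_ff, rng), random_matrix(d_ff, d, rng)],
}
x = random_matrix(n, d, rng, scale=1.0)
out = saf_block(x, params)
print(len(out), len(out[0]))               # output has the same shape as the input
```

The FFN reads `y`, not `x`, which is what makes the block "series": the FFN always sees the attention-updated representation.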

2. Theoretical Basis: Isotropy and Residual Norms

2.1 Role of FFN: Isotropy Preservation

Deep stacks of self-attention, when deployed without FFN or residual additions, exhibit a collapse of token embeddings into near-uniform directions (loss of isotropy). This is formally measured by isotropy:

$I(E) = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \frac{E_i^T E_j}{\|E_i\|\|E_j\|}\in [-1,1]$

where $E_i$ is the $i$th token embedding. $I(E)\approx 1$ indicates collapse. SAF’s FFN re-spreads embeddings at each layer, keeping $I(E)$ low, while omitting the FFN leads to a rapid loss of isotropy (Sonkar et al., 2023).
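The isotropy measure is a mean pairwise cosine similarity, so it is straightforward to compute. A minimal sketch follows; the function name `isotropy` and the two toy embedding sets are illustrative assumptions, not data from the cited work.

```python
import math

def isotropy(embeddings):
    """Mean cosine similarity over all ordered pairs (i, j), including i == j."""
    n = len(embeddings)
    norms = [math.sqrt(sum(v * v for v in e)) for e in embeddings]
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(embeddings[i], embeddings[j]))
            total += dot / (norms[i] * norms[j])
    return total / (n * n)

# All tokens point in the same direction: collapsed representation.
collapsed = [[1.0, 2.0], [2.0, 4.0], [0.5, 1.0]]
# Tokens spread evenly around the origin: well-spread representation.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

print(isotropy(collapsed), isotropy(spread))
```

Because the double sum includes the $i = j$ terms, $I(E)$ equals the squared norm of the mean unit-normalized embedding, so computed values fall in $[0, 1]$ even though the cosine terms individually range over $[-1, 1]$.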

2.2 Residual Norm in Attention

The residual $A_l(X_l)$ introduced by self-attention typically has a much lower norm than the input $X_l$; for RoBERTa-large, the ratio $\|A_l(X_l)\| / \|X_l\|$ is empirically small across layers. Thus, each attention step constitutes a small perturbation ("nudge"), leaving most of the maintenance of representational diversity to the FFN (Sonkar et al., 2023).
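One way to check the "small nudge" claim on a given model is to compare the norm of the attention output with the norm of its input at each layer. The sketch below assumes the Frobenius norm over the whole token matrix; the source does not specify which norm is used, and the toy matrices are invented purely to exercise the function.

```python
import math

def frobenius_norm(m):
    """Square root of the sum of squared entries of a matrix (list of rows)."""
    return math.sqrt(sum(v * v for row in m for v in row))

def residual_ratio(x, attn_out):
    """Ratio of the attention residual's norm to the input's norm."""
    return frobenius_norm(attn_out) / frobenius_norm(x)

# Toy values only, not measurements from any model.
x = [[3.0, 4.0], [0.0, 0.0]]
attn_out = [[0.3, 0.4], [0.0, 0.0]]
ratio = residual_ratio(x, attn_out)
print(ratio)
```

A ratio well below 1 at every layer would be consistent with attention acting as a small perturbation of the residual stream.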

3. SAF in Disaggregated LLM Inference

In transformer inference workloads, SAF—under the name Attention–FFN Disaggregation—denotes the explicit allocation of attention computation and FFN computation to separate hardware resources (Song et al., 29 Jan 2026). The motivation arises from divergent resource profiles: attention is stateful and memory-bound (KV cache operations), while FFN is stateless and FLOP-bound (intensive MLPs), particularly when batched.

3.1 System Topology

A standard deployment structure is the $r$A–1F topology: $r$ parallel attention workers (A-instances) stream data into a single FFN worker (F-instance). One decode step proceeds as follows:

- Each A-worker computes attention over its microbatch by reading its current KV cache.
- All A-workers transmit their activations to the FFN node.
- The FFN node processes the aggregated batch.
- Results are returned to the originating A-workers.

Whichever of the attention, communication, and FFN phases is slowest determines system throughput.
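The decode cycle above can be summarized as a small bottleneck calculation. This sketch assumes the phases are pipelined so that the slowest phase sets the step time, and that the attention phase ends only when the slowest A-worker finishes. Both are modeling assumptions for illustration, and the timing values are made up.

```python
def decode_step_time(attention_times, comm_time, ffn_time):
    """Return (step_time, bottleneck_phase) for one decode step.

    attention_times: per-A-worker attention time for this step.
    Assumption: the step is as long as its slowest phase.
    """
    phases = {
        "attention": max(attention_times),  # slowest A-worker gates the phase
        "communication": comm_time,
        "ffn": ffn_time,
    }
    bottleneck = max(phases, key=phases.get)
    return phases[bottleneck], bottleneck

# Hypothetical timings (arbitrary units) for three A-workers and one F-instance.
step, phase = decode_step_time([1.8, 2.1, 1.9], comm_time=0.4, ffn_time=1.5)
print(step, phase)
```

Taking the maximum over A-workers also shows why stragglers matter: one slow attention worker lengthens the step for every worker in the bundle.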

4. Analytical Framework for Sizing and Throughput

The system’s efficiency depends on the provisioning ratio $r$ (attention workers per FFN worker), which balances memory-side (attention) and compute-side (FFN) resources. The model rests on the following ingredients:

- Service times are modeled separately for the attention, communication, and FFN phases of a decode step.
- Each request has a random prefill length and a geometrically distributed decode length, each characterized by its mean.
- Each A-worker holds a batch of $B$ requests, and the total context load per step is tracked for that batch.
- The token load is averaged over a finite horizon of decode steps.

From these quantities the framework derives the throughput per bundle, measured in tokens per unit time per instance.

4.1 Closed-Form Optimum

Three regimes (attention-bound, communication-bound, and FFN-bound) each yield a stationary attention/FFN ratio in closed form that maximizes throughput within that regime. The overall optimum $r^*$ is determined from these three regime-wise maximizers (Song et al., 29 Jan 2026).
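As a numerical companion to the closed-form result, the optimum ratio can also be located by brute-force sweep. The sketch below is not the cited paper's model. It assumes, for illustration only, that each phase time is linear in its load (attention in the per-worker token load, communication and FFN in the aggregated batch of $r$ workers times the per-worker batch), that the step time is the slowest phase, and that a bundle of $r$ attention workers plus one FFN worker counts as $r + 1$ instances. All constants are hypothetical.

```python
def throughput_per_instance(r, batch, token_load, c):
    """Tokens per unit time per instance for r attention workers and 1 FFN worker."""
    t_attn = c["a_attn"] * token_load + c["b_attn"]   # per-worker KV-cache reads
    t_comm = c["a_comm"] * r * batch + c["b_comm"]    # activations for r * batch tokens
    t_ffn = c["a_ffn"] * r * batch + c["b_ffn"]       # FFN over the aggregated batch
    step = max(t_attn, t_comm, t_ffn)                 # slowest phase sets the step time
    return r * batch / ((r + 1) * step)               # r * batch tokens per step

def best_ratio(batch, token_load, c, r_max=64):
    """Integer r in [1, r_max] with the highest throughput per instance."""
    return max(range(1, r_max + 1),
               key=lambda r: throughput_per_instance(r, batch, token_load, c))

# Hypothetical constants; real values must be measured on the target hardware.
consts = {"a_attn": 1e-5, "b_attn": 0.5,
          "a_comm": 1e-4, "b_comm": 0.1,
          "a_ffn": 2e-3, "b_ffn": 0.2}
batch, token_load = 32, 200_000

r_star = best_ratio(batch, token_load, consts)
print(r_star, throughput_per_instance(r_star, batch, token_load, consts))
```

Under these assumptions, throughput rises with $r$ while attention is the bottleneck and falls once the FFN becomes the bottleneck, so the sweep peaks near the crossover between the two regimes.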

4.2 Blocking and Idle Ratios

Let $r$ be the provisioned attention/FFN ratio and $r^*$ the optimum. If $r < r^*$, the FFN idles; if $r > r^*$, attention idles. Empirical simulation confirms that tuning $r$ near $r^*$ minimizes wasted cycles; excessive parallelism on the attention side increases straggler-induced stalls.
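Idle ratios can be estimated from per-phase times under a simple assumption: each side is busy only during its own phase, and the step lasts as long as the slowest phase. The sketch below encodes that assumption; the function name and timings are hypothetical.

```python
def idle_fractions(t_attn, t_comm, t_ffn):
    """Fraction of each step that the attention and FFN sides spend waiting.

    Assumption: step time equals the slowest phase, and each side works
    only during its own phase.
    """
    step = max(t_attn, t_comm, t_ffn)
    return {
        "attention_idle": 1.0 - t_attn / step,
        "ffn_idle": 1.0 - t_ffn / step,
    }

# Too few attention workers: the FFN finishes early and waits.
under = idle_fractions(t_attn=2.0, t_comm=0.5, t_ffn=1.0)
# Too many attention workers: the aggregated FFN batch is the bottleneck.
over = idle_fractions(t_attn=2.0, t_comm=0.5, t_ffn=4.0)
print(under, over)
```

In this toy model both idle fractions reach zero only when the attention and FFN phase times match, which is the balance the provisioning ratio is meant to achieve.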

5. Empirical Comparison: SAF versus Parallel Designs

Large-scale experiments on RoBERTa-large and BERT-large-uncased pretraining followed by GLUE fine-tuning show that SAF and PAF (Parallel Attention–FFN, which applies attention and FFN in parallel and merges their outputs) achieve nearly indistinguishable performance across six GLUE tasks: the average scores differ by 0.6 points for RoBERTa-large and 0.1 points for BERT-large (Sonkar et al., 2023). The following table summarizes representative results:

Model	MRPC	STS-B	SST-2	QNLI	QQP	MNLI	Avg.
RoBERTa-large (SAF)	90.9	92.4	96.4	94.7	92.2	90.2	92.8
RoBERTa-large (PAF)	90.5	91.0	96.2	94.3	91.7	89.3	92.2
BERT-large (SAF)	85.0	89.2	93.5	92.2	91.4	86.6	89.6
BERT-large (PAF)	86.8	88.8	93.5	91.4	91.2	85.5	89.5

These results indicate that the sequential attention→FFN ordering is not strictly required, provided the FFN continues to maintain isotropy and the attention residuals remain small.
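The per-task and average gaps can be recomputed directly from the table. The sketch below only restates the table's numbers; the dictionary layout and the `summarize` helper are illustrative.

```python
TASKS = ["MRPC", "STS-B", "SST-2", "QNLI", "QQP", "MNLI"]

# Scores copied from the table above, in TASKS order.
SCORES = {
    "RoBERTa-large": {"SAF": [90.9, 92.4, 96.4, 94.7, 92.2, 90.2],
                      "PAF": [90.5, 91.0, 96.2, 94.3, 91.7, 89.3]},
    "BERT-large":    {"SAF": [85.0, 89.2, 93.5, 92.2, 91.4, 86.6],
                      "PAF": [86.8, 88.8, 93.5, 91.4, 91.2, 85.5]},
}

def summarize(model):
    """Return per-task SAF-minus-PAF gaps and the gap between mean scores."""
    saf, paf = SCORES[model]["SAF"], SCORES[model]["PAF"]
    gaps = {task: round(s - p, 1) for task, s, p in zip(TASKS, saf, paf)}
    mean_gap = round(sum(saf) / len(saf) - sum(paf) / len(paf), 2)
    return gaps, mean_gap

for model in SCORES:
    gaps, mean_gap = summarize(model)
    print(model, gaps, mean_gap)
```

The per-task gaps are not uniformly in SAF's favor: on MRPC, BERT-large scores higher with PAF than with SAF.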

6. Scaling Laws and Operational Guidelines

System-level scaling recommendations for SAF in disaggregated serving include (Song et al., 29 Jan 2026):

- Batch size ($B$): Increasing $B$ favors FFN efficiency; the optimum attention/FFN ratio $r^*$ typically decreases sublinearly with $B$.
- Context length: Longer contexts demand a higher ratio $r$, and thus more attention-side resources.
- Model size: Both attention and FFN time constants must be measured empirically for each model.
- Calibration: Benchmark the shape constants on production hardware and dynamically tune $r$ to match demand and reduce idle cycles.

In practice, attention and FFN provisioning should be kept within 10–20% of the optimal ratio; large deviations can halve system throughput.

7. Implications and Prospects

The SAF organization in transformer models enforces a dynamic interplay: attention sublayers make minimal contextual shifts ("nudges") per token, while FFN sublayers maintain the representational diversity essential for dense information flow across layers. The empirical equivalence of SAF and PAF underscores that what is fundamental is the combination of a small-norm attention perturbation and a sufficiently expressive, spreading FFN, not the specific ordering.

In inference infrastructure, SAF-style disaggregation unlocks tractable analytical throughput optimization, empirically validated with trace-calibrated simulation (Song et al., 29 Jan 2026). A plausible implication is that future LLM serving systems will increasingly adopt such microarchitectural separation, coordinated by real-time scheduling to guarantee resource balance and efficiency.

Overall, the SAF paradigm delineates both the foundational logic of transformer blocks and an operationally significant strategy for efficiently deploying LLMs at scale, with analytical tooling now available for system tuning and performance assurance.


References

Sonkar et al. (2023). Investigating the Role of Feed-Forward Networks in Transformers Using Parallel Attention and Feed-Forward Net Design.

Song et al. (2026). Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving.
