AlphaFold EvoFormer Stack
- EvoFormer stack is the core transformer backbone in AlphaFold2 that iteratively updates MSA and pair representations through customized self-attention and communication modules.
- It employs deep serial and parallel block architectures, integrating MSA, pair branches, and outer-product mean operations to optimize prediction accuracy.
- Innovations like Parallel EvoFormer and Branch Parallelism accelerate training by enabling concurrent computation across independent tracks while preserving model performance.
The EvoFormer stack is the core transformer-based architecture within AlphaFold2, facilitating bi-directional information flow between multiple sequence alignment (MSA) embeddings and residue-pairwise features to achieve near-experimental protein structure prediction accuracy. EvoFormer integrates customized attention, non-linear transformations, and sophisticated inter-track communications repeated in deep serial stacks, forming the “trunk” of the AlphaFold2 model. Advances such as the Parallel EvoFormer and Branch Parallelism have introduced architectural modifications and training strategies that significantly accelerate end-to-end model optimization while preserving predictive performance (Wang et al., 2022).
1. Serial EvoFormer Stack in AlphaFold2
The canonical EvoFormer stack in AlphaFold2 consists of 52 sequential blocks (48 for trunk MSA + pair encoding and 4 for "extra-MSA" features) applied in each forward pass. Input representations are the MSA tensor $m \in \mathbb{R}^{s \times r \times c_m}$ and the pair tensor $z \in \mathbb{R}^{r \times r \times c_z}$, where $s$ is the number of sequences, $r$ the number of residues, $c_m$ the MSA embedding size, and $c_z$ the pair embedding size.
Each block iteratively applies three broad stages:
- MSA track update: row-wise gated self-attention (with pair bias), column-wise gated self-attention, and an MSA transition FFN.
- Pair track update: triangle-multiplicative updates (outgoing and incoming), triangle self-attention, and a pair transition FFN.
- Cross-track communication: the outer-product mean injects MSA information into the pair representation, while the pair representation biases the MSA row-wise attention.
In the original model, the cross-track “outer-product mean” operation occurs at the start of each block, tightly coupling the two computational tracks.
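This coupling can be illustrated with a toy sketch of one serial block; the branch bodies (`msa_branch`, `pair_branch`) are placeholder updates rather than the real attention modules, and only the ordering of operations mirrors the description above.

```python
import numpy as np

# Toy dimensions: s sequences, r residues, channel sizes c_m / c_z.
s, r, c_m, c_z = 4, 8, 16, 12
rng = np.random.default_rng(0)
msa = rng.standard_normal((s, r, c_m))
pair = rng.standard_normal((r, r, c_z))

def outer_product_mean(msa, pair):
    # MSA -> pair: mean over sequences of per-residue outer products
    # (the learned projections of the real module are omitted here).
    upd = np.einsum("sic,sjd->ijcd", msa, msa).reshape(r, r, -1) / s
    return pair + 0.01 * upd[..., :c_z]   # scaled stand-in for the output projection

def msa_branch(msa, pair):
    # Placeholder for row-attn (pair-biased) -> col-attn -> MSA transition.
    return msa + 0.1 * np.tanh(msa + pair.mean())

def pair_branch(pair):
    # Placeholder for triangle updates -> triangle attention -> pair transition.
    return pair + 0.1 * np.tanh(pair)

def serial_block(msa, pair):
    pair = outer_product_mean(msa, pair)  # OPM at block start: couples the tracks
    msa = msa_branch(msa, pair)           # MSA branch reads the freshly updated pair
    pair = pair_branch(pair)
    return msa, pair

for _ in range(3):                        # a short serial stack
    msa, pair = serial_block(msa, pair)
```

Because the OPM sits at the block start, the pair branch cannot begin until the MSA tensor from the previous block is available, which is exactly the dependency the Parallel EvoFormer removes.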
2. Core Block Architecture and Operations
Each EvoFormer block consists of three major components: the MSA branch, the pair branch, and an exchange operator.
MSA branch:
- Row-wise self-attention: for each sequence $t$ and position $i$, gated multi-head self-attention over positions $j$ with an additive pair bias derived from $z_{ij}$:
  $q^h_{ti} = W^h_Q m_{ti}$, $\;k^h_{tj} = W^h_K m_{tj}$, $\;v^h_{tj} = W^h_V m_{tj}$, $\;b^h_{ij} = w^h_B \cdot z_{ij}$,
  $o^h_{ti} = g^h_{ti} \odot \sum_j \operatorname{softmax}_j\!\big(q^h_{ti} \cdot k^h_{tj} / \sqrt{c} + b^h_{ij}\big)\, v^h_{tj}$
- Column-wise self-attention: per-residue communication across sequences; for a fixed position $i$, attention runs over the sequence index:
  $o^h_{ti} = g^h_{ti} \odot \sum_u \operatorname{softmax}_u\!\big(q^h_{ti} \cdot k^h_{ui} / \sqrt{c}\big)\, v^h_{ui}$
- MSA-Transition: position-wise feed-forward network (FFN) with a 4× hidden expansion:
  $m_{ti} \leftarrow m_{ti} + W_2 \operatorname{relu}\!\big(W_1 \operatorname{LN}(m_{ti})\big)$
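The pair-biased, gated row-wise attention above can be sketched in NumPy; random matrices stand in for the learned projections, and the pair channel count (5) and weight names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
s, r, c, h = 4, 6, 8, 2                      # sequences, residues, head dim, heads
msa = rng.standard_normal((s, r, h * c))
pair = rng.standard_normal((r, r, 5))        # toy pair representation

Wq, Wk, Wv = (rng.standard_normal((h, h * c, c)) for _ in range(3))
Wb = rng.standard_normal((5, h))             # pair channels -> per-head bias
Wg = rng.standard_normal((h * c, h * c))     # gating projection

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def row_attention(msa, pair):
    q = np.einsum("sic,hcd->shid", msa, Wq) / np.sqrt(c)
    k = np.einsum("sjc,hcd->shjd", msa, Wk)
    v = np.einsum("sjc,hcd->shjd", msa, Wv)
    b = np.einsum("ijc,ch->hij", pair, Wb)   # pair bias, shared across all rows
    a = softmax(np.einsum("shid,shjd->shij", q, k) + b[None], axis=-1)
    o = np.einsum("shij,shjd->shid", a, v)
    o = o.transpose(0, 2, 1, 3).reshape(s, r, h * c)
    gate = 1.0 / (1.0 + np.exp(-(msa @ Wg)))  # sigmoid gate per channel
    return gate * o

out = row_attention(msa, pair)
```

Note that the bias `b` depends only on `pair`, so it is shared across every MSA row; this is the pair→MSA direction of the cross-track communication.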
Outer-Product Mean (MSA→pair):
Applied after the branches (at the block end) in the Parallel variant, at the block start in the original:
$z_{ij} \leftarrow z_{ij} + W_O \operatorname{flatten}\!\Big(\tfrac{1}{s} \sum_t a_{ti} \otimes b_{tj}\Big)$, with projections $a_{ti} = W_A m_{ti}$ and $b_{tj} = W_B m_{tj}$.
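A direct NumPy sketch of the outer-product mean follows, with random stand-ins for the learned projections (AlphaFold2 uses a small projection size of 32; a tiny value is used here):

```python
import numpy as np

rng = np.random.default_rng(0)
s, r, c_m, c, c_z = 5, 7, 16, 4, 8
msa = rng.standard_normal((s, r, c_m))
Wa, Wb = rng.standard_normal((2, c_m, c))    # stand-ins for the learned projections
Wo = rng.standard_normal((c * c, c_z))       # flattened outer product -> pair channels

def outer_product_mean(msa):
    a = msa @ Wa                              # (s, r, c)
    b = msa @ Wb                              # (s, r, c)
    # Outer product over channels for every residue pair (i, j), averaged over sequences t
    op = np.einsum("tic,tjd->ijcd", a, b) / s
    return op.reshape(r, r, c * c) @ Wo       # project flattened product to pair channels

z_update = outer_product_mean(msa)
```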
Pair branch:
- Triangle-multiplicative updates: outgoing and incoming variants. With gated edge projections $a_{ij} = \sigma(W_{a1} z_{ij}) \odot W_{a2} z_{ij}$ and $b_{ij} = \sigma(W_{b1} z_{ij}) \odot W_{b2} z_{ij}$, the outgoing update is
  $z_{ij} \leftarrow z_{ij} + g_{ij} \odot W_O \operatorname{LN}\!\Big(\sum_k a_{ik} \odot b_{jk}\Big)$,
  and the incoming variant sums $\sum_k a_{ki} \odot b_{kj}$ instead.
- Triangle attention: incoming or outgoing; e.g., around a fixed starting node $i$, queries vary with $j$ and attention runs over the third node $k$, biased by the edge $z_{jk}$:
  $o^h_{ij} = g^h_{ij} \odot \sum_k \operatorname{softmax}_k\!\big(q^h_{ij} \cdot k^h_{ik} / \sqrt{c} + b^h_{jk}\big)\, v^h_{ik}$
- Pair-Transition: $z_{ij} \leftarrow z_{ij} + W_2 \operatorname{relu}\!\big(W_1 \operatorname{LN}(z_{ij})\big)$
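The outgoing triangle-multiplicative update can be sketched as follows (random matrices stand in for learned weights; the incoming variant would change the einsum to sum $a_{ki} \odot b_{kj}$ over $k$):

```python
import numpy as np

rng = np.random.default_rng(0)
r, c_z, c = 6, 8, 4
z = rng.standard_normal((r, r, c_z))
Wa1, Wa2, Wb1, Wb2 = (rng.standard_normal((c_z, c)) for _ in range(4))
Wg = rng.standard_normal((c_z, c_z))
Wo = rng.standard_normal((c, c_z))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def triangle_mult_outgoing(z):
    a = sigmoid(z @ Wa1) * (z @ Wa2)          # gated edge features a_ik
    b = sigmoid(z @ Wb1) * (z @ Wb2)          # gated edge features b_jk
    t = np.einsum("ikc,jkc->ijc", a, b)       # sum over the third node k ("outgoing" edges)
    gate = sigmoid(z @ Wg)
    return z + gate * (layer_norm(t) @ Wo)    # gated, normalized residual update

z_new = triangle_mult_outgoing(z)
```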
All blocks operate in Pre-LayerNorm style: $x' = \operatorname{LN}(x)$, then $x \leftarrow x + f(x')$.
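The Pre-LN pattern is the same thin wrapper around every sub-module; a minimal sketch:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the channel (last) axis; learned scale/offset omitted.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def residual(f, x):
    # Pre-LayerNorm residual: x' = LN(x), then x + f(x').
    return x + f(layer_norm(x))

x = np.random.default_rng(0).standard_normal((2, 3, 4))
y = residual(lambda t: 0.5 * t, x)
```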
3. Parallel EvoFormer and Pipeline Decoupling
In the Parallel EvoFormer architecture, the “outer-product mean” (OPM) is repositioned from the start to the end of each block, which leaves the MSA and pair branches fully independent within each block. This architectural reordering enables their parallel execution, removing intra-block data dependencies between MSA and pair computations. A plausible implication is that this also enhances opportunities for pipelined and distributed computation, while preserving numerical and training characteristics (Wang et al., 2022).
Block ordering:
| Original AF2 Stack | Parallel EvoFormer |
|---|---|
| OPM → MSA branch → Pair branch (serial) | MSA branch ∥ Pair branch (concurrent) → OPM at block end |
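The reordering can be sketched with two concurrently executed branch functions; the bodies are placeholders, and only the dependency structure (both branches read block inputs, OPM runs last) reflects the Parallel EvoFormer:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def msa_branch(msa, pair):
    # Placeholder MSA track; row attention still reads the *input* pair as bias.
    return msa + 0.1 * np.tanh(msa + pair.mean())

def pair_branch(pair):
    # Placeholder pair track (triangle updates, triangle attention, transition).
    return pair + 0.1 * np.tanh(pair)

def outer_product_mean(msa, pair):
    s = msa.shape[0]
    upd = np.einsum("sic,sjc->ij", msa, msa)[..., None] / s
    return pair + 0.01 * upd                  # toy MSA -> pair injection

def parallel_block(msa, pair):
    # Branches depend only on the block inputs, so they can run concurrently;
    # the OPM is deferred to the block end.
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_msa = pool.submit(msa_branch, msa, pair)
        f_pair = pool.submit(pair_branch, pair)
        msa, pair = f_msa.result(), f_pair.result()
    return msa, outer_product_mean(msa, pair)

rng = np.random.default_rng(0)
msa, pair = parallel_block(rng.standard_normal((4, 8, 16)),
                           rng.standard_normal((8, 8, 12)))
```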
4. Branch Parallelism: Distributed Training Strategy
Branch Parallelism (BP) is a distributed execution scheme exploiting the decoupling introduced by the Parallel EvoFormer. The two independent computational branches are mapped onto two separate devices or ranks:
- Device 0: processes the full MSA track (row-attn → col-attn → transition), computes the outer-product-mean update, and broadcasts it to Device 1.
- Device 1: performs the full pair track (triangle-mult out/in → triangle-attn out/in → transition), receives and applies the outer-product-mean update, then broadcasts the updated pair representation $z$ back to Device 0.
A final broadcast synchronizes the updated $m$ and $z$ across both devices.
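A single-process simulation of one BP step follows; `broadcast` is a plain copy standing in for an inter-device collective (e.g., an NCCL broadcast), and the branch bodies are placeholders:

```python
import numpy as np

def broadcast(x):
    return x.copy()    # stand-in for a device-to-device collective

def bp_step(msa, pair):
    # --- Device 0: full MSA track, then the outer-product-mean update ---
    msa = msa + 0.1 * np.tanh(msa)                        # row-attn -> col-attn -> transition
    s = msa.shape[0]
    opm = np.einsum("sic,sjc->ij", msa, msa)[..., None] / s
    opm_on_dev1 = broadcast(opm)                          # Device 0 -> Device 1

    # --- Device 1 (runs concurrently with Device 0 in the real scheme) ---
    pair = pair + 0.1 * np.tanh(pair)                     # triangle mult/attn -> transition
    pair = pair + 0.01 * opm_on_dev1                      # apply the received OPM update
    pair_on_dev0 = broadcast(pair)                        # Device 1 -> Device 0

    return broadcast(msa), pair_on_dev0                   # final sync of both tracks

rng = np.random.default_rng(0)
msa, pair = bp_step(rng.standard_normal((4, 8, 16)),
                    rng.standard_normal((8, 8, 12)))
```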
Key properties of BP:
- Achieves exact parallelism of both branches, with only minor overhead from communication (broadcasts).
- In the forward pass, the per-block cost is that of the two branches executing concurrently, i.e., roughly the slower branch plus broadcast latency rather than the sum of both.
- In the backward pass, gradient broadcasts/AllReduce are inserted to maintain correct data dependencies.
- Scalability is limited by the number of independent branches (maximal factor of 2 in this design).
5. Integration with Hybrid Parallelism and Empirical Performance
Branch Parallelism can be further combined with Dynamic Axial Parallelism (DAP; tensor-axis splitting) and Data Parallelism (DP; batch splitting). For instance, with 8 GPUs per node one may set BP = 2 and DAP = 4 within each node, replicating across nodes with DP = 8 for large fine-tuning tasks.
Empirical hyperparameters:
- MSA channel size $c_m = 256$
- Pair channel size $c_z = 128$
- Number of attention heads = 8 (head dimension 32)
- $s$ (MSA depth): 128 (initial training) to 512 (fine-tuning)
- $r$ (residue crop size): 256 (initial training) to 384 (fine-tuning)
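These settings can be captured as a small configuration sketch; the dictionary keys are illustrative names (the values are the standard AlphaFold2 choices), and the assertion ties the head count and head dimension to the MSA channel size:

```python
# Illustrative EvoFormer hyperparameter record (key names are ours).
cfg = {
    "c_m": 256,                                   # MSA channel size
    "c_z": 128,                                   # pair channel size
    "num_heads": 8,
    "head_dim": 32,
    "msa_depth": {"initial": 128, "finetune": 512},
    "crop_size": {"initial": 256, "finetune": 384},
}

# heads x head_dim must recover the MSA channel size
assert cfg["num_heads"] * cfg["head_dim"] == cfg["c_m"]
```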
Notable empirical results (BFloat16, 256 × A100):
| Model/Phase | DP only Proteins/s | DP+BP Proteins/s | Speedup (%) |
|---|---|---|---|
| UniFold initial | 30.76 | 42.38 | +37.7 |
| UniFold finetune | 8.52 | 11.96 | +40.4 |
| HelixFold initial | 26.01 | 36.05 | +38.6 |
| HelixFold finetune | 7.78 | 10.41 | +33.8 |
End-to-end training time is reduced accordingly: UniFold drops from 5.80 days with DP only to 4.18 days with BP (a 38.7% speedup), and HelixFold from 6.69 days to 4.88 days (a 36.9% speedup) (Wang et al., 2022).
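As a quick check, the speedups can be recomputed from the rounded figures above (small differences versus the quoted percentages are rounding artifacts):

```python
# Throughput (proteins/s) from the table: DP only vs DP + BP.
rows = {
    "UniFold initial":    (30.76, 42.38),
    "UniFold finetune":   (8.52, 11.96),
    "HelixFold initial":  (26.01, 36.05),
    "HelixFold finetune": (7.78, 10.41),
}
speedups = {k: round((bp / dp - 1) * 100, 1) for k, (dp, bp) in rows.items()}

# End-to-end training days, DP only vs DP + BP, expressed as a speedup.
unifold_days_speedup = round((5.80 / 4.18 - 1) * 100, 1)    # ~38.8
helixfold_days_speedup = round((6.69 / 4.88 - 1) * 100, 1)  # ~37.1
```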
6. Summary and Broader Implications
The AlphaFold2 EvoFormer stack, consisting of 52 serial blocks integrating self-attention, triangle operations, and cross-track communication, is computationally intensive and occupies the majority of the model’s end-to-end runtime. Architectural innovations including the Parallel EvoFormer and Branch Parallelism introduce significant efficiency improvements (∼40% in step-time speedup on 2 × A100 GPUs) by decoupling the MSA and pair tracks and co-processing them on separate devices. These modifications retain predictive accuracy on benchmarks such as CASP14 and CAMEO. A plausible implication is that such decoupling and parallel execution strategies can provide a general paradigm for speeding up other multi-branch deep learning architectures with entangled data dependencies (Wang et al., 2022).