
AlphaFold EvoFormer Stack

Updated 11 January 2026
  • The EvoFormer stack is the core transformer backbone in AlphaFold2, iteratively updating MSA and pair representations through customized self-attention and communication modules.
  • It employs deep serial and parallel block architectures, integrating MSA, pair branches, and outer-product mean operations to optimize prediction accuracy.
  • Innovations like Parallel EvoFormer and Branch Parallelism accelerate training by enabling concurrent computation across independent tracks while preserving model performance.

The EvoFormer stack is the core transformer-based architecture within AlphaFold2, facilitating bi-directional information flow between multiple sequence alignment (MSA) embeddings and residue-pairwise features to achieve near-experimental protein structure prediction accuracy. EvoFormer integrates customized attention, non-linear transformations, and sophisticated inter-track communications repeated in deep serial stacks, forming the “trunk” of the AlphaFold2 model. Advances such as the Parallel EvoFormer and Branch Parallelism have introduced architectural modifications and training strategies that significantly accelerate end-to-end model optimization while preserving predictive performance (Wang et al., 2022).

1. Serial EvoFormer Stack in AlphaFold2

The canonical EvoFormer stack in AlphaFold2 consists of 52 sequential blocks—48 for trunk MSA + pair encoding and 4 for "extra-MSA" features—applied in each forward pass. Input representations are $M^{(0)} \in \mathbb{R}^{S \times N \times C_m}$ (MSA) and $Z^{(0)} \in \mathbb{R}^{N \times N \times C_z}$ (pair), where $S$ is the number of sequences, $N$ the number of residues, $C_m$ the MSA embedding size, and $C_z$ the pair embedding size.

Each block iteratively applies three broad stages:

  1. MSA track update: $M^{(\ell-1/2)} \leftarrow \text{EvoFormer\_MSA}(M^{(\ell-1)}, Z^{(\ell-1)})$
  2. Pair track update: $Z^{(\ell)} \leftarrow \text{EvoFormer\_PAIR}(M^{(\ell-1/2)}, Z^{(\ell-1)})$
  3. Cross-track communication: $M^{(\ell)}, Z^{(\ell)} \leftarrow \text{communicate}(M^{(\ell-1/2)}, Z^{(\ell)})$

In the original model, the cross-track “outer-product mean” operation occurs at the start of each block, tightly coupling the two computational tracks.
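The per-block control flow above can be sketched in a few lines. This is a minimal NumPy illustration in which the shapes follow the definitions in Section 1, but the branch bodies are simple placeholders, not the real AlphaFold2 modules, and `W_proj` is a hypothetical projection matrix:

```python
import numpy as np

S, N, Cm, Cz = 4, 8, 16, 12          # MSA depth, residues, channel sizes
rng = np.random.default_rng(0)
M = rng.standard_normal((S, N, Cm))  # MSA representation
Z = rng.standard_normal((N, N, Cz))  # pair representation
W_proj = rng.standard_normal((Cm, Cz)) * 0.01  # hypothetical OPM projection

def evoformer_msa(M, Z):
    """Stand-in for the MSA branch; the real one uses row/column attention."""
    pair_bias = np.tanh(Z.mean(axis=(1, 2)))       # (N,) summary of Z
    return M + 0.1 * pair_bias[None, :, None]

def evoformer_pair(M_half, Z):
    """Stand-in for the pair branch (triangle ops in the real model)."""
    return Z + 0.1 * np.tanh(Z)

def communicate(M_half, Z):
    """Outer-product-mean-style MSA -> pair communication."""
    opm = np.einsum('sic,sjc->ijc', M_half, M_half) / S   # (N, N, Cm)
    return M_half, Z + opm @ W_proj

for _ in range(3):                    # a 3-block toy stack (AF2 uses 48 + 4)
    M_half = evoformer_msa(M, Z)
    Z_new = evoformer_pair(M_half, Z)
    M, Z = communicate(M_half, Z_new)
```

The point of the sketch is the data dependency: stage 2 consumes the half-updated MSA from stage 1, and stage 3 couples the two tracks, which is exactly the coupling that the Parallel EvoFormer later removes.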

2. Core Block Architecture and Operations

Each EvoFormer block consists of three major components: the MSA branch, the pair branch, and an exchange operator.

MSA branch:

  • Row-wise self-attention: For each sequence $s$ and position $i$, multi-head self-attention with pair bias $b_{ij}^h$ derived from $Z$.

$Q_{shi} = W_m^Q m_{shi}$, $K_{shj} = W_m^K m_{shj}$, $V_{shj} = W_m^V m_{shj}$

$A_{shi,hj} = \text{Softmax}_j\left( \frac{Q_{shi} \cdot K_{shj}}{\sqrt{d_k}} + b_{ij}^h \right)$

$(\text{AttnRowOutput})_{shi} = \sum_j A_{shi,hj} V_{shj}$
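A single-head, unbatched transcription of these three equations can be written directly in NumPy (weight shapes are illustrative; the real module is multi-head with per-head gating):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def msa_row_attention(M, pair_bias, Wq, Wk, Wv):
    """Row-wise MSA self-attention with pair bias (single head).
    M: (S, N, C) MSA embeddings; pair_bias: (N, N) bias b_ij from Z;
    Wq/Wk/Wv: (C, d_k) projections."""
    Q, K, V = M @ Wq, M @ Wk, M @ Wv                  # each (S, N, d_k)
    d_k = Q.shape[-1]
    logits = np.einsum('sid,sjd->sij', Q, K) / np.sqrt(d_k) + pair_bias[None]
    A = softmax(logits, axis=-1)                      # attend over positions j
    return np.einsum('sij,sjd->sid', A, V)            # (S, N, d_k)

rng = np.random.default_rng(1)
S, N, C, d_k = 3, 5, 8, 4
M = rng.standard_normal((S, N, C))
bias = rng.standard_normal((N, N))
out = msa_row_attention(M, bias, *(rng.standard_normal((C, d_k)) for _ in range(3)))
```

Note that the same bias $b_{ij}$ is added for every sequence $s$, which is how pairwise structural information feeds back into the MSA track.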

  • Column-wise self-attention: Per-residue communication across sequences.

$Q_{shi} = W_c^Q m_{shi}$, with attention applied over the sequence index $s$, analogously to the row-wise case above.

  • MSA-Transition: Gated feed-forward network (FFN):

$m \leftarrow m + \text{GatedFFN}(\text{LN}(m))$, where

$\text{GatedFFN}(x) = W_2 \left(\text{GELU}(W_1 x) \odot \sigma(W_g x)\right)$
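A direct NumPy transcription of the gated transition; the weight shapes and the 4× hidden expansion are conventional choices here, not prescribed by the text:

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gated_ffn(x, W1, Wg, W2):
    """GatedFFN(x) = W2 (GELU(W1 x) * sigmoid(Wg x))."""
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))     # sigma(Wg x)
    return (gelu(x @ W1) * gate) @ W2

rng = np.random.default_rng(2)
C, hidden = 8, 32                              # 4x expansion, as is conventional
m = rng.standard_normal((3, 5, C))
W1, Wg = rng.standard_normal((C, hidden)), rng.standard_normal((C, hidden))
W2 = rng.standard_normal((hidden, C)) * 0.1
m = m + gated_ffn(layer_norm(m), W1, Wg, W2)   # residual transition update
```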

Outer-Product Mean (MSA→pair):

After the branches (at the block end in the Parallel variant):

$\Delta Z_{ij,d} = \text{Linear\_proj}\left[\frac{1}{S}\sum_{s=1}^S \left(m_{s,i} \otimes m_{s,j}\right)\right]_d$

$Z \leftarrow Z + \text{LayerNorm}_p(\Delta Z)$
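The outer-product mean reduces to a single einsum followed by a projection; the flattened weight matrix `W_proj` is a hypothetical stand-in for $\text{Linear\_proj}$:

```python
import numpy as np

def outer_product_mean(M, W_proj):
    """M: (S, N, C) MSA; W_proj: (C*C, Cz) projection of the flattened
    outer product. Returns delta-Z of shape (N, N, Cz)."""
    S, N, C = M.shape
    outer = np.einsum('sic,sjd->ijcd', M, M) / S   # mean over the S sequences
    return outer.reshape(N, N, C * C) @ W_proj

rng = np.random.default_rng(3)
S, N, C, Cz = 4, 6, 5, 7
M = rng.standard_normal((S, N, C))
dZ = outer_product_mean(M, rng.standard_normal((C * C, Cz)) * 0.1)
```

This is the only MSA-to-pair channel in the block, which is why repositioning it (Section 3) is enough to decouple the two tracks.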

Pair branch:

  • Triangle-multiplicative updates (outgoing variant shown; the incoming variant reads the transposed edges $z_{ki}, z_{kj}$):

$a_{ik} = W_f z_{ik}$, $b_{jk} = W_b z_{jk}$, $m_{ij} = \sum_k a_{ik} \odot b_{jk}$, $z_{ij} \leftarrow z_{ij} + W_o m_{ij}$
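In NumPy the outgoing variant is one einsum over the shared index $k$; gating and the incoming variant are omitted for brevity, and the weight shapes are illustrative:

```python
import numpy as np

def triangle_multiply_outgoing(Z, Wf, Wb, Wo):
    """Z: (N, N, C) pair features; Wf, Wb: (C, H); Wo: (H, C).
    Edge (i, j) accumulates products over its two 'outgoing' edges
    (i, k) and (j, k) for every third node k."""
    a = Z @ Wf                              # a[i, k] = Wf z_ik
    b = Z @ Wb                              # b[j, k] = Wb z_jk
    m = np.einsum('ikh,jkh->ijh', a, b)     # sum_k a_ik * b_jk (elementwise in h)
    return Z + m @ Wo                       # residual update

rng = np.random.default_rng(4)
N, C, H = 6, 8, 8
Z = rng.standard_normal((N, N, C))
Z = triangle_multiply_outgoing(Z, *(rng.standard_normal((C, H)) * 0.1 for _ in range(3)))
```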

  • Triangle attention: Incoming or outgoing; e.g., for the incoming case, fixing $i$, the queries $z_{ij}$ vary with $j$ and attend over keys $z_{ik}$.

$Q_{ijh} = W^Q z_{ij}$, $K_{ikh} = W^K z_{ik}$, $V_{ikh} = W^V z_{ik}$

$A_{ijh,ikh} = \text{Softmax}_k\left( \frac{Q_{ijh} \cdot K_{ikh}}{\sqrt{d_k}} \right)$

$(\text{TriAttnIncoming})_{ij} = \sum_k A_{ijh,ikh} V_{ikh}$
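A single-head transcription of this attention pattern, in which the query at $(i, j)$ attends over the keys $(i, k)$ for each fixed $i$:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def triangle_attention(Z, Wq, Wk, Wv):
    """Z: (N, N, C); Wq/Wk/Wv: (C, d). For each row i, the query at
    (i, j) attends over keys at (i, k), summing over the third node k."""
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    d = Q.shape[-1]
    A = softmax(np.einsum('ijd,ikd->ijk', Q, K) / np.sqrt(d), axis=-1)
    return np.einsum('ijk,ikd->ijd', A, V)

rng = np.random.default_rng(5)
N, C, d = 6, 8, 4
Z = rng.standard_normal((N, N, C))
out = triangle_attention(Z, *(rng.standard_normal((C, d)) for _ in range(3)))
```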

  • Pair-Transition: $z \leftarrow z + \text{GatedFFN}_p(\text{LayerNorm}_p(z))$

All blocks operate in Pre-LayerNorm style: $\overline{X} = \text{LayerNorm}(X)$, then $X \leftarrow X + \text{Block}(\overline{X})$.

3. Parallel EvoFormer and Pipeline Decoupling

In the Parallel EvoFormer architecture, the “outer-product mean” (OPM) is repositioned from the start to the end of each block, which leaves the MSA and pair branches fully independent within each block. This architectural reordering enables their parallel execution, removing intra-block data dependencies between MSA and pair computations. A plausible implication is that this also enhances opportunities for pipelined and distributed computation, while preserving numerical and training characteristics (Wang et al., 2022).

Block Ordering:

  • Original AF2 block: OPM → MSA branch → Pair branch (serial, cross-coupled at the block start)
  • Parallel EvoFormer block: MSA branch ∥ Pair branch (concurrent), with OPM at the block end
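The reordering can be made concrete with two toy block functions; the branch bodies are placeholders, since what matters is the dependency structure, not the arithmetic:

```python
import numpy as np

rng = np.random.default_rng(6)
S, N, C = 3, 5, 4
W = rng.standard_normal((C, C)) * 0.1

def msa_branch(M, Z):                 # placeholder MSA update (reads Z as bias)
    return M + 0.1 * np.tanh(Z.mean(axis=(0, 1)))[None, None, :]

def pair_branch(Z):                   # placeholder pair update
    return Z + 0.1 * np.tanh(Z)

def opm(M):                           # outer-product-mean stand-in
    return (np.einsum('sic,sjc->ijc', M, M) / M.shape[0]) @ W

def serial_block(M, Z):               # original AF2 ordering: OPM first
    Z = Z + opm(M)                    # cross-track coupling at block start
    return msa_branch(M, Z), pair_branch(Z)

def parallel_block(M, Z):             # Parallel EvoFormer: OPM moved to the end
    M_new = msa_branch(M, Z)          # independent of the next line, so the
    Z_new = pair_branch(Z)            # two branches can run concurrently
    return M_new, Z_new + opm(M_new)

M, Z = rng.standard_normal((S, N, C)), rng.standard_normal((N, N, C))
M1, Z1 = parallel_block(M, Z)
```

In `serial_block`, the pair branch cannot start until `opm(M)` has been applied; in `parallel_block`, both branches read only the block's inputs, so they can be dispatched to different devices.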

4. Branch Parallelism: Distributed Training Strategy

Branch Parallelism (BP) is a distributed execution scheme exploiting the decoupling introduced by the Parallel EvoFormer. The two independent computational branches are mapped onto two separate devices or ranks:

  • Device 0: Processes the full MSA track (row-attn → col-attn → transition), computes $\Delta Z = \text{OuterProdMean}(M)$, and broadcasts $\Delta Z$ to Device 1.
  • Device 1: Performs the full pair track (triangle-mult out/in → triangle-attn out/in → transition), receives and applies $\Delta Z$, then broadcasts the updated $Z$ back to Device 0.

A final broadcast synchronizes the updated $M'$ and $Z'$ across both devices.
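This device mapping can be simulated on one host with two threads standing in for the two ranks, a `Queue` playing the role of the $\Delta Z$ broadcast. This is a sketch of the communication pattern only, not a distributed implementation:

```python
import threading
import queue
import numpy as np

def branch_parallel_block(M, Z, msa_branch, pair_branch, opm):
    """Run the MSA track ('device 0') and pair track ('device 1')
    concurrently; device 0 sends delta-Z to device 1 via a queue."""
    dz_channel, result = queue.Queue(), {}

    def device0():                       # MSA track
        M_new = msa_branch(M, Z)
        dz_channel.put(opm(M_new))       # 'broadcast' delta-Z to device 1
        result['M'] = M_new

    def device1():                       # pair track
        Z_new = pair_branch(Z)
        result['Z'] = Z_new + dz_channel.get()   # receive and apply delta-Z

    threads = [threading.Thread(target=device0), threading.Thread(target=device1)]
    for t in threads: t.start()
    for t in threads: t.join()
    return result['M'], result['Z']      # final sync: both sides see M', Z'

# Toy branch functions make the result easy to check by hand.
M, Z = np.ones((2, 3, 4)), np.ones((3, 3, 4))
M_out, Z_out = branch_parallel_block(
    M, Z,
    msa_branch=lambda M, Z: M + 1.0,
    pair_branch=lambda Z: 2.0 * Z,
    opm=lambda M: np.einsum('sic,sjc->ijc', M, M) / M.shape[0],
)
```

With these toy branches, `M_out` is all 2.0 and `Z_out` is all 6.0 (2.0 from the pair branch plus 4.0 from the outer-product mean of `M_out`), so the communication ordering is easy to verify.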

Key properties of BP:

  • Achieves exact parallelism of both branches, with only minor communication overhead from the broadcasts.
  • In the forward pass, the MSA and pair branches of each block execute concurrently on their respective devices, so per-block latency approaches that of a single branch.
  • In the backward pass, gradient broadcasts/AllReduce operations are inserted to preserve correct data dependencies.
  • Scalability is limited by the number of independent branches (a maximal factor of 2 in this design).

5. Integration with Hybrid Parallelism and Empirical Performance

Branch Parallelism can be further combined with Dynamic Axial Parallelism (DAP; tensor axis splitting) and Data Parallelism (DP; batch splitting). For instance, a large fine-tuning run may use BP=2 and DAP=4 within each 8-GPU node, with DP=8 replicating the model across nodes.

Empirical hyperparameters:

  • MSA channels $C_m = 256$
  • Pair channels $C_z = 128$
  • Number of attention heads = 8 (head dimension 32)
  • $S$ (MSA depth): 128 (initial training) to 512 (fine-tuning)
  • $N$ (residues): 256 (initial training) to 384 (fine-tuning)

Notable empirical results (BFloat16, 256 × A100):

| Model/Phase | DP only (proteins/s) | DP+BP (proteins/s) | Speedup |
|---|---|---|---|
| UniFold initial | 30.76 | 42.38 | +37.7% |
| UniFold finetune | 8.52 | 11.96 | +40.4% |
| HelixFold initial | 26.01 | 36.05 | +38.6% |
| HelixFold finetune | 7.78 | 10.41 | +33.8% |

End-to-end training time is similarly reduced: UniFold drops from 5.80 days with DP only to 4.18 days with BP (a 38.7% throughput speedup), and HelixFold from 6.69 days to 4.88 days (36.9%) (Wang et al., 2022).

6. Summary and Broader Implications

The AlphaFold2 EvoFormer stack, consisting of 52 serial blocks integrating self-attention, triangle operations, and cross-track communication, is computationally intensive and occupies the majority of the model’s end-to-end runtime. Architectural innovations including the Parallel EvoFormer and Branch Parallelism introduce significant efficiency improvements (∼40% in step-time speedup on 2 × A100 GPUs) by decoupling the MSA and pair tracks and co-processing them on separate devices. These modifications retain predictive accuracy on benchmarks such as CASP14 and CAMEO. A plausible implication is that such decoupling and parallel execution strategies can provide a general paradigm for speeding up other multi-branch deep learning architectures with entangled data dependencies (Wang et al., 2022).
