
Gemma-style Transformer Block

Updated 25 November 2025
  • Gemma-style Transformer blocks are a variant of the decoder-only Transformer that interleaves local and global attention with grouped-query projections and gated MLPs for scalability.
  • They employ dual RMSNorm and rotary position embeddings to stabilize deep architectures and optimize long-context language modeling in Gemma 2 and Gemma 3 models.
  • Empirical and theoretical studies demonstrate that these blocks can reduce computational complexity by up to 25% while enabling precise in-context parameter updates.

A Gemma-style Transformer block is an architectural variant of the decoder-only Transformer, introduced for scalable, efficient, and high-context language modeling in the Gemma family. These blocks are characterized by systematic architectural modifications including interleaved local-global attention, grouped-query attention (GQA), GeGLU or GLU-MLP nonlinearities, and pervasive RMSNorm-based normalization. The block structure supports both computational efficiency and robust empirical performance across diverse large model settings, and underpins both the Gemma 2 and Gemma 3 model series (Team et al., 31 Jul 2024, Team et al., 25 Mar 2025). Recent theory establishes a closed-form, blockwise equivalence between the contextual effect in Gemma-style blocks and explicit low-rank updates to their MLP and normalization parameters (Goldwaser et al., 22 Nov 2025).

1. Formal Structure of Gemma-Style Transformer Blocks

A Gemma-style block consists of a normalized attention sublayer (alternating local and global context across layers), followed by a normalized two-layer MLP with a gated nonlinearity; residual connections surround both sublayers. RMS normalization is applied before, and in Gemma 2 and Gemma 3 also after, each major submodule (pre-norm and post-norm, respectively).

The canonical block-level computation is:

  • Input X \in \mathbb{R}^{L \times d} (sequence length L, hidden dimension d)
  • Pre-RMSNorm: \hat{X} = \text{RMSNorm}(X)
  • Self-attention, alternating local and global sparsity across layers
  • Residual: X' = X + \text{SelfAttn}(\hat{X})
  • Post-RMSNorm: \widetilde{X} = \text{RMSNorm}(X')
  • Feed-forward: M = W_2 \cdot \left(\text{GeLU}(W_{1,a}\widetilde{X}) \odot (W_{1,b}\widetilde{X})\right)
  • Output residual: X'' = X' + M (Team et al., 31 Jul 2024)

In Gemma 3, GLU/GeGLU variants and gating are prominent (Team et al., 25 Mar 2025); all versions apply rotary position embeddings to attention projections.
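
As a concrete rendering of the dataflow listed above, the following PyTorch sketch follows the canonical single pre-/post-norm arrangement (Gemma 2 and 3 additionally normalize each sublayer's output, which is omitted here for clarity). Class and parameter names such as GemmaStyleBlock, w_gate, and w_up are illustrative rather than the reference implementation, and the attention sublayer is injected as a callable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMS normalization with a learned scale gamma (see Section 4)."""
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.gamma

class GemmaStyleBlock(nn.Module):
    """Illustrative block following the canonical computation listed above.
    `attn` is any attention sublayer (local sliding-window or global causal)."""
    def __init__(self, d, d_ff, attn):
        super().__init__()
        self.attn = attn
        self.pre_norm = RMSNorm(d)                      # applied before self-attention
        self.post_norm = RMSNorm(d)                     # applied to the post-attention residual
        self.w_gate = nn.Linear(d, d_ff, bias=False)    # W_{1,a}
        self.w_up = nn.Linear(d, d_ff, bias=False)      # W_{1,b}
        self.w_down = nn.Linear(d_ff, d, bias=False)    # W_2

    def forward(self, x):                               # x: (batch, L, d)
        x_hat = self.pre_norm(x)
        x_prime = x + self.attn(x_hat)                  # attention residual
        x_tilde = self.post_norm(x_prime)
        m = self.w_down(F.gelu(self.w_gate(x_tilde)) * self.w_up(x_tilde))  # GeGLU MLP
        return x_prime + m                              # output residual

# Example usage with an identity stand-in for the attention sublayer:
# block = GemmaStyleBlock(d=256, d_ff=2048, attn=lambda x: x)
```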

2. Attention Schemes: Interleaved Local/Global, Grouped-Query

Gemma-style blocks alternate between local sliding-window and global self-attention patterns across layers. For a local layer, each query attends only to tokens within a window w (e.g., w = 4096 in Gemma 2, w = 1024 in Gemma 3). Global layers use standard causal attention over all L tokens. This pattern dramatically reduces computational complexity and memory overhead, especially for very long contexts. The ratio r of local to global layers is tunable (e.g., r = 5 in Gemma 3).
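
A minimal sketch of the two mask types, using boolean masks in which True marks an allowed query-key pair; the function names are illustrative.

```python
import torch

def causal_mask(L: int) -> torch.Tensor:
    """Global layer: position i may attend to all positions j <= i."""
    return torch.tril(torch.ones(L, L, dtype=torch.bool))

def sliding_window_mask(L: int, w: int) -> torch.Tensor:
    """Local layer: position i attends only to the last w tokens, j in (i - w, i]."""
    i = torch.arange(L).unsqueeze(1)
    j = torch.arange(L).unsqueeze(0)
    return (j <= i) & (j > i - w)

# Example: in a repeating 6-layer pattern with r = 5 (Gemma 3-style), five local
# layers use sliding_window_mask(L, 1024) and one global layer uses causal_mask(L).
```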

Grouped-Query Attention (GQA) projects queries via h heads but keys/values via g < h groups, reducing projection parameters and attention computation. Typically, Gemma models use g = h/2 (Team et al., 31 Jul 2024).
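
A minimal sketch of the grouped-query projection pattern, assuming the common implementation in which each of the g key/value heads is shared by h/g query heads; dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

d, h, g, L = 512, 8, 4, 16                        # hidden dim, query heads, KV groups, seq len
d_head = d // h

q_proj = nn.Linear(d, h * d_head, bias=False)     # h query heads
k_proj = nn.Linear(d, g * d_head, bias=False)     # only g key heads
v_proj = nn.Linear(d, g * d_head, bias=False)     # only g value heads

x = torch.randn(1, L, d)
q = q_proj(x).view(1, L, h, d_head).transpose(1, 2)   # (1, h, L, d_head)
k = k_proj(x).view(1, L, g, d_head).transpose(1, 2)   # (1, g, L, d_head)
v = v_proj(x).view(1, L, g, d_head).transpose(1, 2)

# Each group of h/g query heads shares one key/value head:
k = k.repeat_interleave(h // g, dim=1)                # (1, h, L, d_head)
v = v.repeat_interleave(h // g, dim=1)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                                      # torch.Size([1, 8, 16, 64])
```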

The following table summarizes attention attributes:

| Attribute | Local Layer | Global Layer |
|---|---|---|
| Attention mask | Sliding window of size w | Causal |
| Complexity | O(Lw) | O(L^2) |
| RoPE base frequency | \alpha_{loc} = 10^4 | \alpha_{glo} = 10^6 |
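
To make the RoPE row concrete, the sketch below applies a generic rotate-half rotary embedding whose base frequency is a parameter, so local and global layers can simply pass different bases; this is an illustrative implementation, not the Gemma reference code.

```python
import torch

def rope(x: torch.Tensor, base: float) -> torch.Tensor:
    """Rotate-half rotary embedding over the last two dims (L, d_head),
    with a configurable base frequency (e.g., 1e4 local, 1e6 global)."""
    L, d_head = x.shape[-2], x.shape[-1]
    half = d_head // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(L, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (L, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(1, 8, 16, 64)       # (batch, heads, L, d_head)
q_local = rope(q, base=1e4)         # local-attention layers
q_global = rope(q, base=1e6)        # global-attention layers
```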

3. Nonlinearity and MLP: GeGLU/GLU, Rank-1 Patchability

The MLP sublayer employs gated activations of the form

f(u, g) = \text{GeLU}(u) \odot g,

where u = W_{gate} z, g = W_{up} z, and z is the output of layer normalization (Goldwaser et al., 22 Nov 2025). The GeGLU or variant GLU structure increases expressivity. Typically, d_{ff} is a large multiple of d (e.g., d_{ff}/d \approx 8).

A key theoretical result proves that, for a given context C and input x, this block can be exactly matched by applying explicit, token-dependent, rank-1 patches \Delta W_{gate}, \Delta W_{up} and a vector patch \Delta m to the block’s MLP and RMSNorm scaling, such that the output computed on x with no context is identical to the context-induced output. The closed-form updates are

\begin{align*}
\Delta W_{gate} &= [W_{gate}(z_C - z)]\, z^\top / \|z\|^2 \\
\Delta W_{up} &= [W_{up}(z_C - z)]\, z^\top / \|z\|^2 \\
\Delta m &= (v_C - v) \oslash f(W_{gate} z_C,\, W_{up} z_C)
\end{align*}

where z_C denotes the normalized post-attention residual for context C, and z the same quantity for the empty context (Goldwaser et al., 22 Nov 2025). The patching requires only that z \neq 0 and f(W_{gate} z_C, W_{up} z_C)_i \neq 0 for all i.
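
The rank-1 weight patches can be checked numerically: applying the patched W_{gate} and W_{up} to the context-free input z reproduces the context-conditioned pre-activations W_{gate} z_C and W_{up} z_C, and hence the same gated output. The sketch below verifies exactly this identity (the \Delta m step, which depends on the cited paper's definitions of v and v_C, is omitted); all tensor names and dimensions are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, d_ff = 64, 256

W_gate = torch.randn(d_ff, d)
W_up = torch.randn(d_ff, d)
z = torch.randn(d)        # normalized post-attention residual, empty context
z_C = torch.randn(d)      # same quantity with context C

def f(u, g):
    """Gated nonlinearity f(u, g) = GeLU(u) ⊙ g (Section 3)."""
    return F.gelu(u) * g

# Rank-1 patches from the closed form above.
dW_gate = torch.outer(W_gate @ (z_C - z), z) / z.dot(z)
dW_up = torch.outer(W_up @ (z_C - z), z) / z.dot(z)

# Patched weights applied to the context-free input reproduce the
# context-conditioned pre-activations, hence the same gated output.
patched = f((W_gate + dW_gate) @ z, (W_up + dW_up) @ z)
target = f(W_gate @ z_C, W_up @ z_C)
print(torch.allclose(patched, target, atol=1e-4))   # True
```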

4. Normalization and Residual Pathways

Gemma-style blocks standardize on RMS normalization (RMSNorm) before the attention and MLP sublayers and often also after each sublayer (dual norm or post-norm). This stabilizes optimization, especially in deep stacks up to 46 layers, and interacts favorably with residual connections.

The normalization is defined as

\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2 + \epsilon}} \cdot \gamma

with \gamma a learned scaling vector (Team et al., 31 Jul 2024). In-context effects on the RMSNorm scale vector can be captured precisely by the vector patch \Delta m, as detailed in Section 3 (Goldwaser et al., 22 Nov 2025).
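
As a quick numerical illustration of the formula (with the learned scale \gamma left at its initialization of ones), the normalized output has approximately unit root-mean-square:

```python
import torch

d, eps = 8, 1e-6
x = torch.randn(d)
gamma = torch.ones(d)                     # learned scale, here left at its init
y = x / torch.sqrt(x.pow(2).mean() + eps) * gamma
print(torch.sqrt(y.pow(2).mean()))        # ≈ 1.0: unit RMS before the learned scale
```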

5. Computational and Memory Efficiency

Gemma-style blocks deliver architectural and resource efficiency relative to vanilla transformers:

  • Attention sparsity: alternating local/global attention reduces the average per-layer attention cost to 0.75 L^2 (L = context length), a \sim 25\% reduction in FLOPs (Team et al., 31 Jul 2024).
  • Grouped projections: GQA reduces attention parameter count and compute by \sim 25\% by shrinking the K/V projections.
  • KV-cache optimization: local layers retain only the w most recent tokens in the key/value cache, whereas global layers must cache all L past tokens. In Gemma 3, the typical overall KV-cache memory for r = 5, L = 32{,}768, w = 1024 is approximately 19\% of that of a fully global architecture (Team et al., 25 Mar 2025); a worked computation follows this list.
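
A back-of-the-envelope reproduction of the two ratios quoted above. The FLOP figure assumes the Gemma 2 setting of 1:1 local/global alternation with w = 4096 and an 8,192-token context (so w = L/2); the cache figure uses the stated Gemma 3 values r = 5, L = 32,768, w = 1024.

```python
# Attention FLOPs: 1:1 local/global alternation (Gemma 2-style),
# assuming a context of L = 8192 with window w = 4096 (= L/2).
L, w = 8192, 4096
avg_attn = (L * w + L * L) / 2          # mean attention-score entries per layer
print(avg_attn / (L * L))               # 0.75 -> ~25% fewer attention FLOPs

# KV-cache entries: 5:1 local/global interleaving (Gemma 3-style),
# L = 32768, w = 1024; local layers cache only the last w tokens.
r, L, w = 5, 32768, 1024
avg_cache = (r * w + L) / (r + 1)       # cached tokens per layer on average
print(avg_cache / L)                    # ≈ 0.19 -> ~19% of a full-global cache
```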

6. Applications and Empirical Evaluation

These blocks underpin the core stack of Gemma 2 and Gemma 3 models deployed for language, multimodal, and scientific tasks (Team et al., 31 Jul 2024, Team et al., 25 Mar 2025). The Gemma 3 models, with these blocks, demonstrate effective handling of long contexts (up to 128K tokens), robust performance across languages and tasks, and improved inference efficiency. Downstream modular applications, such as freezing mid-layer Gemma blocks for data-efficient transfer in custom tabular tasks (e.g., wildfire prediction), further illustrate their utility as transferable “internal world” representations (Jadouli et al., 20 Apr 2025).

Empirical evaluation in standard benchmarks reveals that Gemma-style models with these blocks match or surpass prior open models at equivalent or larger scales, both in absolute accuracy and efficiency. Ablations consistently confirm that both attention sparsity and GQA components deliver gains without cost to model capability (Team et al., 31 Jul 2024).

7. Theoretical Implications and Generalization

The exact analytical mapping between contextual effects and MLP/rank-1 parameter updates, demonstrated for Gemma-style blocks, generalizes broadly to models with input-controllable inner and output-controllable outer functions in their MLP blocks (Goldwaser et al., 22 Nov 2025). This result provides a unifying lens for interpreting in-context computation as implicit weight patching and extends to diverse architectures using gating, pre-/post-norm, and mixture-of-experts routing.

A plausible implication is that the expressivity and flexibility of Gemma-style blocks are tightly linked to the controllability afforded by RMSNorm and gated MLPs, explaining both their empirical robustness and theoretical tractability in the context of prompt-based model adaptation.
