
Macaron-Style Half-Step FFNs

Updated 8 February 2026
  • Macaron-style half-step FFNs are a transformer modification that splits the conventional FFN into two lighter sublayers (a standard MLP and a SwiGLU block) surrounding self-attention.
  • The design employs residual scaling (½ factor) with Pre-LayerNorm to enhance gradient flow and stability in deep network stacks.
  • Empirical results on AudioMAE++ demonstrate consistent performance gains, with improvements of +3.7 to +6.8 across varied audio tasks such as ESC-50 and SpeechCommands.

Macaron-style half-step FFNs are a transformer architectural modification that decomposes the conventional feedforward sublayer into two distinct "half-step" FFNs positioned symmetrically before and after the multi-head self-attention (MHA) operation. The structure originates in the "macaron" transformer block of Lu et al. (ICLR 2020); its instantiation in AudioMAE++ adapts it to masked audio representation learning. In this construction, each transformer layer sandwiches MHA between a standard MLP-based FFN and a gated SwiGLU FFN, each applied through a residual path with ½ scaling, achieving greater depth and nonlinearity without destabilizing optimization. When applied to masked audio autoencoders, this layered scheme produces consistent downstream gains in audio classification and other benchmarks (Yadav et al., 14 Jul 2025).

1. Layer Structure and Data Flow

The AudioMAE++ encoder and decoder layers adopt a macaron-style variant with the following processing sequence, for an input token representation x \in \mathbb{R}^{d_m} at layer \ell:

  1. First half-step FFN (a standard one-layer MLP with GeLU activation) is applied to the normalized input, scaled by ½, then added residually:

x' = x + \tfrac{1}{2}\,\mathrm{FFN}_1(\mathrm{LN}(x))

  2. Multi-head self-attention (MHA) is applied to the normalized result and added residually:

x'' = x' + \mathrm{MHA}(\mathrm{LN}(x'))

  3. Second half-step FFN (SwiGLU block) is applied to the normalized result, scaled by ½, then added residually:

y = \mathrm{LN}\big(x'' + \tfrac{1}{2}\,\mathrm{FFN}_2(\mathrm{LN}(x''))\big)

  4. Final LayerNorm is applied before output.

This architecture contrasts with the standard transformer "Post-LN" or "Pre-LN" block, which applies a single full-strength FFN after attention. Macaron-style half-step FFNs thus "sandwich" MHA between two lighter FFN sublayers, promoting better flow of gradients and richer expressiveness (Yadav et al., 14 Jul 2025).
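Concretely, the four-step data flow above can be sketched in NumPy (an illustrative reimplementation, not the authors' code; `ffn1`, `mha`, and `ffn2` are shape-preserving stand-ins for the real sublayers):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Per-token normalization to zero mean / unit variance
    # (learned scale and shift omitted for brevity).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def macaron_block(x, ffn1, mha, ffn2):
    """One macaron-style block; x has shape (n_tokens, d_m)."""
    x = x + 0.5 * ffn1(layer_norm(x))                 # 1. first half-step FFN, residual scaled by 1/2
    x = x + mha(layer_norm(x))                        # 2. self-attention, full-strength residual
    return layer_norm(x + 0.5 * ffn2(layer_norm(x)))  # 3+4. second half-step FFN, then final LN

# Toy stand-ins: tanh as an FFN surrogate, identity as an "attention" surrogate.
d_m = 8
x = np.random.default_rng(0).normal(size=(4, d_m))
out = macaron_block(x, ffn1=np.tanh, mha=lambda z: z, ffn2=np.tanh)
print(out.shape)  # (4, 8)
```

Note that the final LayerNorm guarantees each output token has zero mean and unit variance, regardless of what the sublayers produce.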

2. Mathematical Formulation

Each half-step FFN adopts distinct mathematical forms:

  • First half-step FFN (FFN₁):

\mathrm{FFN}_1(u) = \mathrm{GeLU}(u W_1 + b_1)\, W_2 + b_2

with W_1 \in \mathbb{R}^{d_m \times d_{ff}}, W_2 \in \mathbb{R}^{d_{ff} \times d_m}, b_1 \in \mathbb{R}^{d_{ff}}, b_2 \in \mathbb{R}^{d_m} (row-vector convention, consistent with these shapes and with FFN₂ below).

  • Second half-step FFN (FFN₂, SwiGLU):

\mathrm{FFN}_2(u) = \big[\,\mathrm{Swish}(u W_g + b_g) \odot (u V + b_v)\,\big]\, O

with W_g, V \in \mathbb{R}^{d_m \times d_{ff}}, b_g, b_v \in \mathbb{R}^{d_{ff}}, O \in \mathbb{R}^{d_{ff} \times d_m}.

Activation functions are \mathrm{GeLU}(z) and \mathrm{Swish}(z) = z \cdot \sigma(z), with \odot denoting elementwise multiplication. Both sublayers map \mathbb{R}^{d_m} inputs to \mathbb{R}^{d_m} outputs.
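Under the row-vector convention implied by the stated weight shapes, both sublayers admit a short NumPy rendering (random weights, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
d_m, d_ff = 8, 32

def gelu(z):
    # tanh approximation of GeLU
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def swish(z):
    return z / (1 + np.exp(-z))  # z * sigmoid(z)

# FFN1: standard GeLU MLP with W1 in R^{d_m x d_ff}, W2 in R^{d_ff x d_m}
W1, b1 = rng.normal(size=(d_m, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_m)), np.zeros(d_m)
def ffn1(u):
    return gelu(u @ W1 + b1) @ W2 + b2

# FFN2: SwiGLU -- gated elementwise product, projected back by O in R^{d_ff x d_m}
Wg, bg = rng.normal(size=(d_m, d_ff)), np.zeros(d_ff)
V, bv = rng.normal(size=(d_m, d_ff)), np.zeros(d_ff)
O = rng.normal(size=(d_ff, d_m))
def ffn2(u):
    return (swish(u @ Wg + bg) * (u @ V + bv)) @ O

u = rng.normal(size=(4, d_m))
print(ffn1(u).shape, ffn2(u).shape)  # (4, 8) (4, 8)
```

The gating in FFN₂ is the key structural difference: the Swish branch modulates the linear branch elementwise before the output projection, whereas FFN₁ applies a single pointwise nonlinearity between its two projections.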

3. Distinctions from Vanilla and Classic Macaron Blocks

Three critical distinctions arise in AudioMAE++ relative to earlier designs:

  • Asymmetric FFN composition: The classic macaron transformer employs two identical standard FFNs; AudioMAE++ replaces the second half-step with a SwiGLU-gated FFN, introducing stronger nonlinearity and gating.
  • Pre-LayerNorm everywhere and uniform residual scaling: Each sublayer is preceded by LayerNorm ("Pre-LN"); each half-step FFN's output is added via a residual connection scaled by ½ to prevent training instability in deep stacks.
  • Tokenization and patching for audio: Audio inputs are 2 s mel-spectrograms (200×80), patched into non-overlapping 4×16 bins, then flattened and linearly projected to model dimension d_m, echoing Vision Transformer (ViT) patching but aligned to the time-frequency domain.
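The patch-grid arithmetic implied by the last bullet is worth making explicit (derived here from the stated sizes, not quoted from the paper):

```python
time_frames, mel_bins = 200, 80  # 2 s mel-spectrogram
patch_t, patch_f = 4, 16         # non-overlapping patch size (time x frequency)

# 50 patches along time x 5 along frequency = 250 tokens per clip.
n_patches = (time_frames // patch_t) * (mel_bins // patch_f)
# Each patch flattens to 4*16 = 64 values before the linear projection to d_m.
patch_dim = patch_t * patch_f

print(n_patches, patch_dim)  # 250 64
```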

4. Empirical Impact and Performance

The adoption of macaron-style half-step FFNs with SwiGLU yields measurable improvements on task-averaged downstream performance:

  • On a 10-task HEAR suite (including ESC-50, SpeechCommands, pitch and instrument classification), the AudioMAE++-Base model (macaron + SwiGLU, 141.9M parameters) achieves an aggregated normalized score s(m) = 91.8, compared to 88.1 for the standard single-FFN MAE-Base (85.1M parameters), a +3.7 absolute improvement.
  • On smaller models (~8.9M parameters), macaron+SwiGLU scores 78.8 vs 77.6 for the analogous MAE-Tiny.
  • At larger scale, improvement is approximately +6.8 (from 86.9 to 93.7).
  • These gains are present across diverse tasks (music, environmental sound, speech keyword spotting, pitch inference) (Yadav et al., 14 Jul 2025).

5. Data Flow and Optimization Strategies

The following describes the sequence within each macaron-style transformer++ block:

  1. Incoming vector xx is LayerNorm’d, passed through a lightweight GeLU-activated FFN, halved, and added back.
  2. Self-attention is applied to the next LayerNorm’d output, then added residually at full strength.
  3. The output is LayerNorm’d, processed by the SwiGLU FFN, halved, and added back.
  4. Final LayerNorm is performed prior to the next layer.

Residual scaling (½ factor) is used only for the half-step FFNs and not for attention, contributing to stability in deep stacked models. No dropout is explicitly used in FFNs or attention. The decoder mirrors the encoder's macaron-style block but employs a lower model dimension.

6. Hyperparameters and Training

AudioMAE++ instantiates macaron-style half-step FFNs with the following configurations:

Model Variant | d_m (model dim) | d_ff (inner dim) | Encoder Layers | Decoder Layers (d_dec)
Tiny          | 192             | 768              | 12             | 4 (384)
Base          | 768             | 3072             | 12             | 4 (384)
Large         | 1024            | 4096             | 24             | 4 (512)
  • In pretraining, 80% of spectrogram patches are masked in the encoder.
  • AdamW optimizer is used (weight decay 0.05, batch size 1024, 100 epochs; learning-rate schedule: linear warmup over 10 epochs followed by cosine decay).
  • Both encoder and decoder layers use macaron+SwiGLU blocks, with the decoder at reduced dimensionality (d_{dec}).
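The 80% masking step can be sketched as follows (an illustrative random sampling; the 250-patch count is derived from the 200×80 input with 4×16 patches, not quoted from the paper):

```python
import random

n_patches = 250   # (200/4) x (80/16) patches per 2 s clip
mask_ratio = 0.8
n_masked = int(mask_ratio * n_patches)

random.seed(0)
masked = set(random.sample(range(n_patches), n_masked))
visible = [i for i in range(n_patches) if i not in masked]

# The encoder processes only the visible 20% of tokens;
# the decoder reconstructs the masked 80%.
print(len(visible), len(masked))  # 50 200
```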

7. Significance and Applicability

By splitting the pointwise FFN into two half-step updates bracketing self-attention, and integrating a SwiGLU-gated block post-attention, the macaron-style half-step FFN architecture supports deeper, more expressive transformers for masked audio autoencoding. Empirical gains are consistent and amplify at scale, indicating that such modifications are especially advantageous for large models tackling self-supervised audio representation learning. Adoption of heterogeneous FFNs, systematic Pre-LN, and residual scaling distinguishes this methodology from both vanilla and classical macaron transformer variants, with direct applicability to time-frequency-patch input domains (Yadav et al., 14 Jul 2025).

