
Macaron-Style Half-Step FFNs

Updated 8 February 2026
  • Macaron-style half-step FFNs are a transformer modification that splits the conventional FFN into two lighter sublayers (a standard MLP and a SwiGLU block) surrounding self-attention.
  • The design employs residual scaling (½ factor) with Pre-LayerNorm to enhance gradient flow and stability in deep network stacks.
  • Empirical results on AudioMAE++ demonstrate consistent performance gains, with improvements of +3.7 to +6.8 across varied audio tasks such as ESC-50 and SpeechCommands.

Macaron-style half-step FFNs are a transformer architectural modification that decomposes the conventional feedforward sublayer into two distinct "half-step" FFNs positioned symmetrically before and after the multi-head self-attention (MHA) operation. The structure originates in the "macaron" transformer block of Lu et al. (ICLR 2020); its instantiation in AudioMAE++ adapts it to masked audio representation learning. In this construction, each transformer layer sandwiches MHA between a standard MLP-based FFN and a gated SwiGLU FFN, each applied through a residual path with ½ scaling, achieving greater depth and nonlinearity without destabilizing optimization. When applied to masked audio autoencoders, this layered scheme produces consistent downstream gains in audio classification and other benchmarks (Yadav et al., 14 Jul 2025).

1. Layer Structure and Data Flow

The AudioMAE++ encoder and decoder layers adopt a macaron-style variant with the following processing sequence, for an input token representation x \in \mathbb{R}^{d_m} at layer \ell:

  1. First half-step FFN (a standard one-layer MLP with GeLU activation) is applied to the normalized input, scaled by ½, then added residually:

x' = x + \tfrac{1}{2}\,\mathrm{FFN}_1(\mathrm{LN}(x))

  2. Multi-head self-attention (MHA) is applied to the normalized result and added residually:

x'' = x' + \mathrm{MHA}(\mathrm{LN}(x'))

  3. Second half-step FFN (SwiGLU block) is applied to the normalized result, scaled by ½, then added residually:

y = \mathrm{LN}\big(x'' + \tfrac{1}{2}\,\mathrm{FFN}_2(\mathrm{LN}(x''))\big)

  4. Final LayerNorm is applied before output.

This architecture contrasts with the standard transformer "Post-LN" or "Pre-LN" block, which applies a single full-strength FFN after attention. Macaron-style half-step FFNs thus "sandwich" MHA between two lighter FFN sublayers, promoting better flow of gradients and richer expressiveness (Yadav et al., 14 Jul 2025).
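Concretely, the four-step data flow above can be sketched in NumPy (an illustrative reimplementation, not the authors' code; `ffn1`, `mha`, and `ffn2` are shape-preserving stand-ins for the real sublayers):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Per-token normalization to zero mean / unit variance
    # (learned scale and shift omitted for brevity).
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def macaron_block(x, ffn1, mha, ffn2):
    """One macaron-style block; x has shape (n_tokens, d_m)."""
    x = x + 0.5 * ffn1(layer_norm(x))                 # 1. first half-step FFN, residual scaled by 1/2
    x = x + mha(layer_norm(x))                        # 2. self-attention, full-strength residual
    return layer_norm(x + 0.5 * ffn2(layer_norm(x)))  # 3+4. second half-step FFN, then final LN

# Toy stand-ins: tanh as an FFN surrogate, identity as an "attention" surrogate.
d_m = 8
x = np.random.default_rng(0).normal(size=(4, d_m))
out = macaron_block(x, ffn1=np.tanh, mha=lambda z: z, ffn2=np.tanh)
print(out.shape)  # (4, 8)
```

Note that the final LayerNorm guarantees each output token has zero mean and unit variance, regardless of what the sublayers produce.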

2. Mathematical Formulation

Each half-step FFN adopts distinct mathematical forms:

  • First half-step FFN (FFN₁):

\mathrm{FFN}_1(u) = \mathrm{GeLU}(u W_1 + b_1)\, W_2 + b_2

with W_1 \in \mathbb{R}^{d_m \times d_{ff}}, W_2 \in \mathbb{R}^{d_{ff} \times d_m}, b_1 \in \mathbb{R}^{d_{ff}}, b_2 \in \mathbb{R}^{d_m} (row-vector convention, consistent with these shapes and with FFN₂ below).

  • Second half-step FFN (FFN₂, SwiGLU):

\mathrm{FFN}_2(u) = \big[\,\mathrm{Swish}(u W_g + b_g) \odot (u V + b_v)\,\big]\, O

with W_g, V \in \mathbb{R}^{d_m \times d_{ff}}, b_g, b_v \in \mathbb{R}^{d_{ff}}, O \in \mathbb{R}^{d_{ff} \times d_m}.

Activation functions are \mathrm{GeLU}(z) and \mathrm{Swish}(z) = z \cdot \sigma(z), with \odot denoting elementwise multiplication. Both sublayers map \mathbb{R}^{d_m} inputs to \mathbb{R}^{d_m} outputs.
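Under the row-vector convention implied by the stated weight shapes, both sublayers admit a short NumPy rendering (random weights, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
d_m, d_ff = 8, 32

def gelu(z):
    # tanh approximation of GeLU
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

def swish(z):
    return z / (1 + np.exp(-z))  # z * sigmoid(z)

# FFN1: standard GeLU MLP with W1 in R^{d_m x d_ff}, W2 in R^{d_ff x d_m}
W1, b1 = rng.normal(size=(d_m, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_m)), np.zeros(d_m)
def ffn1(u):
    return gelu(u @ W1 + b1) @ W2 + b2

# FFN2: SwiGLU -- gated elementwise product, projected back by O in R^{d_ff x d_m}
Wg, bg = rng.normal(size=(d_m, d_ff)), np.zeros(d_ff)
V, bv = rng.normal(size=(d_m, d_ff)), np.zeros(d_ff)
O = rng.normal(size=(d_ff, d_m))
def ffn2(u):
    return (swish(u @ Wg + bg) * (u @ V + bv)) @ O

u = rng.normal(size=(4, d_m))
print(ffn1(u).shape, ffn2(u).shape)  # (4, 8) (4, 8)
```

The gating in FFN₂ is the key structural difference: the Swish branch modulates the linear branch elementwise before the output projection, whereas FFN₁ applies a single pointwise nonlinearity between its two projections.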

3. Distinctions from Vanilla and Classic Macaron Blocks

Three critical distinctions arise in AudioMAE++ relative to earlier designs:

  • Asymmetric FFN composition: The classic macaron transformer employs two identical standard FFNs; AudioMAE++ replaces the second half-step with a SwiGLU-gated FFN, introducing stronger nonlinearity and gating.
  • Pre-LayerNorm everywhere and uniform residual scaling: Each sublayer is preceded by LayerNorm ("Pre-LN"); each half-step FFN's output is added via a residual connection scaled by ½ to prevent training instability in deep stacks.
  • Tokenization and patching for audio: Audio inputs are 2 s mel-spectrograms (200×80), patched into non-overlapping 4×16 bins, then flattened and linearly projected to model dimension d_m, echoing Vision Transformer (ViT) patching but aligned to the time-frequency domain.
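The patch-grid arithmetic implied by the last bullet is worth making explicit (derived here from the stated sizes, not quoted from the paper):

```python
time_frames, mel_bins = 200, 80  # 2 s mel-spectrogram
patch_t, patch_f = 4, 16         # non-overlapping patch size (time x frequency)

# 50 patches along time x 5 along frequency = 250 tokens per clip.
n_patches = (time_frames // patch_t) * (mel_bins // patch_f)
# Each patch flattens to 4*16 = 64 values before the linear projection to d_m.
patch_dim = patch_t * patch_f

print(n_patches, patch_dim)  # 250 64
```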

4. Empirical Impact and Performance

The adoption of macaron-style half-step FFNs with SwiGLU yields measurable improvements on task-averaged downstream performance:

  • On a 10-task HEAR suite (including ESC-50, SpeechCommands, pitch and instrument classification), the AudioMAE++-Base model (macaron + SwiGLU, 141.9M parameters) achieves an aggregated normalized score s(m) = 91.8, compared to 88.1 for the standard single-FFN MAE-Base (85.1M parameters), a +3.7 absolute improvement.
  • On smaller models (~8.9M parameters), macaron+SwiGLU scores 78.8 vs 77.6 for the analogous MAE-Tiny.
  • At larger scale, improvement is approximately +6.8 (from 86.9 to 93.7).
  • These gains are present across diverse tasks (music, environmental sound, speech keyword spotting, pitch inference) (Yadav et al., 14 Jul 2025).

5. Data Flow and Optimization Strategies

The following describes the sequence within each macaron-style transformer++ block:

  1. Incoming vector xx is LayerNorm’d, passed through a lightweight GeLU-activated FFN, halved, and added back.
  2. Self-attention is applied to the next LayerNorm’d output, then added residually at full strength.
  3. The output is LayerNorm’d, processed by the SwiGLU FFN, halved, and added back.
  4. Final LayerNorm is performed prior to the next layer.

Residual scaling (½ factor) is used only for the half-step FFNs and not for attention, contributing to stability in deep stacked models. No dropout is explicitly used in FFNs or attention. The decoder mirrors the encoder's macaron-style block but employs a lower model dimension.

6. Hyperparameters and Training

AudioMAE++ instantiates macaron-style half-step FFNs with the following configurations:

Model Variant | d_m (model dim) | d_ff (inner dim) | Encoder Layers | Decoder Layers (d_dec)
Tiny          | 192             | 768              | 12             | 4 (384)
Base          | 768             | 3072             | 12             | 4 (384)
Large         | 1024            | 4096             | 24             | 4 (512)
  • In pretraining, 80% of spectrogram patches are masked in the encoder.
  • AdamW optimizer is used (weight decay 0.05, batch size 1024, 100 epochs; learning-rate schedule: linear warmup over 10 epochs followed by cosine decay).
  • Both encoder and decoder layers use macaron+SwiGLU blocks, with the decoder at reduced dimensionality (d_{dec}).
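The 80% masking step can be sketched as follows (an illustrative random sampling; the 250-patch count is derived from the 200×80 input with 4×16 patches, not quoted from the paper):

```python
import random

n_patches = 250   # (200/4) x (80/16) patches per 2 s clip
mask_ratio = 0.8
n_masked = int(mask_ratio * n_patches)

random.seed(0)
masked = set(random.sample(range(n_patches), n_masked))
visible = [i for i in range(n_patches) if i not in masked]

# The encoder processes only the visible 20% of tokens;
# the decoder reconstructs the masked 80%.
print(len(visible), len(masked))  # 50 200
```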

7. Significance and Applicability

By splitting the pointwise FFN into two half-step updates bracketing self-attention, and integrating a SwiGLU-gated block post-attention, the macaron-style half-step FFN architecture supports deeper, more expressive transformers for masked audio autoencoding. Empirical gains are consistent and amplify at scale, indicating that such modifications are especially advantageous for large models tackling self-supervised audio representation learning. Adoption of heterogeneous FFNs, systematic Pre-LN, and residual scaling distinguishes this methodology from both vanilla and classical macaron transformer variants, with direct applicability to time-frequency-patch input domains (Yadav et al., 14 Jul 2025).

