WavLM Base+ Architecture

Updated 11 April 2026

WavLM Base+ is a self-supervised speech model that integrates masked speech prediction and denoising to enhance multi-speaker awareness.
It uses a convolutional feature encoder and a 12-layer Transformer with a gated relative position bias mechanism to capture complex sequential dependencies.
Pre-trained on a diverse Mix-94k h dataset, the model delivers significant improvements in speech recognition and paralinguistic tasks.

WavLM Base+ is a self-supervised pre-trained speech model designed to unify content modeling, denoising, and multi-speaker awareness within a single architecture. Building upon the HuBERT framework, it introduces novel masked speech prediction and denoising objectives, employs a gated relative position bias mechanism in its Transformer encoder, and leverages large-scale, diverse training data to advance full-stack speech processing tasks (Chen et al., 2021).

1. Core Architectural Components

WavLM Base+ comprises a convolutional feature encoder and a Transformer-based encoder. The convolutional frontend consists of 7 temporal-convolutional blocks, each with 512 output channels. The kernel widths are [10, 3, 3, 3, 3, 2, 2], and the strides are [5, 2, 2, 2, 2, 2, 2]. The strides multiply to a hop of 320 samples and the stack has a receptive field of 400 samples, so at a 16 kHz sampling rate the encoder emits one 512-dimensional feature every 20 ms, each covering approximately 25 ms of input. After each convolution, LayerNorm and GELU activation are applied.
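The 20 ms frame rate and 25 ms coverage follow directly from the kernel and stride lists. The sketch below recomputes them; it assumes 16 kHz input and unpadded convolutions, and the function names are illustrative rather than taken from any WavLM codebase.

```python
# Kernel widths and strides of the 7 convolutional blocks, as listed in the text.
KERNELS = [10, 3, 3, 3, 3, 2, 2]
STRIDES = [5, 2, 2, 2, 2, 2, 2]
SAMPLE_RATE = 16_000  # Hz; assumed sampling rate


def frame_geometry(kernels, strides):
    """Return (hop, receptive_field) in input samples for a stack of 1-D convs."""
    hop = 1              # spacing between consecutive output frames
    receptive_field = 1  # input samples seen by a single output frame
    for k, s in zip(kernels, strides):
        receptive_field += (k - 1) * hop
        hop *= s
    return hop, receptive_field


def num_frames(num_samples, kernels, strides):
    """Number of output frames, assuming no padding in any convolution."""
    n = num_samples
    for k, s in zip(kernels, strides):
        n = (n - k) // s + 1
    return n


hop, rf = frame_geometry(KERNELS, STRIDES)
print(hop, 1000 * hop / SAMPLE_RATE)  # hop in samples and in ms
print(rf, 1000 * rf / SAMPLE_RATE)    # receptive field in samples and in ms
print(num_frames(SAMPLE_RATE, KERNELS, STRIDES))  # frames for 1 s of audio
```

Because the convolutions are unpadded, one second of audio yields slightly fewer than 50 frames.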

The Transformer encoder in WavLM Base+ comprises 12 layers, each with a model dimension $d_{model} = 768$, a feed-forward inner dimension $d_{ff} = 3072$, and 8 attention heads ($d_k = 768 / 8 = 96$ per head). Dropout of 0.1 is applied to both attention weights and FFN output, mirroring HuBERT's configuration. The model adopts pre-norm LayerNorm (preceding each sub-layer) and includes a final LayerNorm after the last block. The overall parameter count for WavLM Base+ is approximately 94.7 million.

2. Gated Relative Position Bias Mechanism

In each Transformer block, WavLM replaces conventional absolute or convolutional relative biases with a gated relative position bias as introduced in XLM-E. For each pair of positions $(i, j)$, a "bucketed" relative index $|i-j|$ is mapped via T5-style logarithmic bucketing ($n=320$ buckets, $m=800$ maximum distance) to a shared embedding $d_{i-j}$. Two scalar gates per query ($g_i^{up}$ and $g_i^{res}$) are computed:

$$g_i^{up} = \sigma(q_i \cdot \mathbf{u}), \qquad g_i^{res} = \sigma(q_i \cdot \mathbf{w})$$

where $q_i$ is the query projection of the $i$-th hidden state, $\mathbf{u}, \mathbf{w}$ are learned vectors, and $\sigma$ is the logistic sigmoid.

An intermediate bias is defined as

$$\tilde{r}_{i-j} = w \, g_i^{res} \, d_{i-j}$$

where $w$ is a learned scalar. The final gated bias is

$$r_{i-j} = d_{i-j} + g_i^{up} \, d_{i-j} + (1 - g_i^{up}) \, \tilde{r}_{i-j}$$

The self-attention logits incorporate this bias: $e_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}} + r_{i-j}$. The attention weights are computed as usual, by a softmax over $j$. This approach enables content-dependent modulation of positional bias, enhancing the model's capacity to encode complex sequential dependencies in speech.
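The bucketing and gating steps can be sketched for a single attention head in plain Python. This is an illustration under stated assumptions, not the reference implementation: the bidirectional (signed) form of the T5 bucketing is assumed, all function and parameter names are hypothetical, and `bias_table` is a toy stand-in for the learned values $d_{i-j}$.

```python
import math


def relative_bucket(rel, num_buckets=320, max_distance=800):
    """T5-style bucket for a signed offset rel = j - i (bidirectional form assumed)."""
    half = num_buckets // 2      # half of the buckets per direction
    bucket = half if rel > 0 else 0
    dist = abs(rel)
    max_exact = half // 2        # small offsets each get their own bucket
    if dist < max_exact:
        return bucket + dist
    # Larger offsets share logarithmically spaced buckets up to max_distance.
    log_pos = max_exact + int(
        math.log(dist / max_exact) / math.log(max_distance / max_exact)
        * (half - max_exact)
    )
    return bucket + min(log_pos, half - 1)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_bias(q_i, d, u, w_vec, w_scalar):
    """Gated bias r_{i-j} for one query q_i and one raw bias value d = d_{i-j}."""
    g_up = sigmoid(dot(q_i, u))       # update gate
    g_res = sigmoid(dot(q_i, w_vec))  # reset gate
    r_tilde = w_scalar * g_res * d    # intermediate bias
    return d + g_up * d + (1.0 - g_up) * r_tilde


def attention_weights(queries, keys, bias_table, u, w_vec, w_scalar):
    """Softmax over logits q_i.k_j / sqrt(d_k) + r_{i-j}, one row per query."""
    d_k = len(queries[0])
    rows = []
    for i, q in enumerate(queries):
        logits = []
        for j, k in enumerate(keys):
            d = bias_table[relative_bucket(j - i)]
            r = gated_bias(q, d, u, w_vec, w_scalar)
            logits.append(dot(q, k) / math.sqrt(d_k) + r)
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows


# Toy usage: 3 positions, d_k = 2, made-up parameters.
bias_table = [0.01 * b for b in range(320)]
queries = [[0.1, 0.2], [0.0, -0.3], [0.5, 0.5]]
keys = [[0.3, 0.1], [-0.2, 0.4], [0.0, 0.1]]
weights = attention_weights(queries, keys, bias_table,
                            u=[0.5, -0.5], w_vec=[0.1, 0.2], w_scalar=1.0)
print([round(sum(row), 6) for row in weights])
```

Because both gates depend on $q_i$, two queries at the same relative offset can receive different biases, which is the content-dependent behavior described above.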

3. Dual Pre-Training Objectives: Masked Speech Prediction and Denoising

WavLM Base+ extends the masked prediction paradigm established by HuBERT by integrating a denoising objective. Approximately 20% of pre-training utterances are artificially corrupted by mixing the input speech $x$ (over a random 50%-length segment) either with another utterance (SNR $\sim$ Uniform(-5, 5) dB) or with a DNS noise clip (SNR $\sim$ Uniform(-5, 20) dB). In all cases, 8% of time steps (in spans of 10 frames) are masked as in HuBERT.
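The corruption step can be sketched as follows. This is a minimal illustration assuming that SNR is the energy ratio between the overlapped part of the primary utterance and the scaled interferer, that the overlap covers exactly half of the primary utterance, and that speech and noise interferers are chosen with equal probability; the last point and all function names are assumptions not stated in the text.

```python
import math
import random


def energy(x):
    """Mean squared amplitude of a list of samples."""
    return sum(v * v for v in x) / max(len(x), 1)


def mix_at_snr(primary, interferer, snr_db, start):
    """Add `interferer` to `primary` from `start`, scaled to the target SNR in dB."""
    seg = primary[start:start + len(interferer)]
    e_p, e_i = energy(seg), energy(interferer)
    if e_p == 0.0 or e_i == 0.0:
        return list(primary)  # nothing meaningful to scale against
    scale = math.sqrt(e_p / (e_i * 10 ** (snr_db / 10)))
    mixed = list(primary)
    for n, v in enumerate(interferer[:len(seg)]):
        mixed[start + n] += scale * v
    return mixed


def corrupt(primary, other_speech, noise, rng, p_mix=0.2):
    """With probability p_mix, overlap half of `primary` with speech or noise."""
    if rng.random() >= p_mix:
        return list(primary)
    if rng.random() < 0.5:  # assumed: equal chance of speech vs. noise
        source, snr_db = other_speech, rng.uniform(-5, 5)
    else:
        source, snr_db = noise, rng.uniform(-5, 20)
    length = min(len(primary) // 2, len(source))
    if length == 0:
        return list(primary)
    start = rng.randint(0, len(primary) - length)
    return mix_at_snr(primary, source[:length], snr_db, start)


rng = random.Random(0)
clean = [math.sin(0.1 * n) for n in range(200)]
other = [math.sin(0.37 * n) for n in range(200)]
noise = [rng.uniform(-1, 1) for _ in range(200)]
noisy = corrupt(clean, other, noise, rng, p_mix=1.0)
print(len(noisy), sum(a != b for a, b in zip(noisy, clean)))
```

A lower SNR gives the interferer more energy, so the (-5, 5) dB range makes overlapping speech comparable in level to the primary speaker.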

During training, the corrupted signal $\tilde{x}$ serves as input while the model is tasked with predicting the original pseudo-label cluster index $z_t$ at each masked position $t$, using a cross-entropy loss over $C$ cluster codes: $\mathcal{L} = -\sum_{t \in M} \log p(z_t \mid \tilde{x}, t)$, where $M$ is the set of masked positions. No additional denoising head is added; denoising capability emerges from the model's forced recovery of clean pseudo-labels given corrupted inputs.
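A small sketch of the masking and the loss may help make the objective concrete. It assumes the HuBERT convention in which 8% of frames are drawn as span starts and each span covers 10 frames, and it takes per-frame logits over the cluster codes as given; the function names are illustrative.

```python
import math
import random


def span_mask(num_frames, rng, p_start=0.08, span=10):
    """Draw span starts with probability p_start and mask `span` frames from each."""
    masked = set()
    for t in range(num_frames):
        if rng.random() < p_start:
            masked.update(range(t, min(t + span, num_frames)))
    return sorted(masked)


def masked_cross_entropy(logits, targets, masked):
    """Sum of -log softmax(logits[t])[targets[t]] over the masked frames t."""
    loss = 0.0
    for t in masked:
        m = max(logits[t])  # log-sum-exp with max subtraction for stability
        log_z = m + math.log(sum(math.exp(v - m) for v in logits[t]))
        loss += log_z - logits[t][targets[t]]
    return loss


rng = random.Random(0)
num_frames, num_codes = 100, 4  # toy sizes, not the model's real values
masked = span_mask(num_frames, rng)
logits = [[rng.gauss(0, 1) for _ in range(num_codes)] for _ in range(num_frames)]
targets = [rng.randrange(num_codes) for _ in range(num_frames)]  # stand-in for z_t
print(len(masked), masked_cross_entropy(logits, targets, masked))
```

The targets come from the clean utterance while the logits are computed from the corrupted one, which is why no separate denoising head is needed.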

4. Comparative Configurations: Base, Base+, and Large

WavLM’s architectural variants are distinguished by scale and data used in pre-training:

Variant	Layers	$d_{model}$	Heads	Params (M)	Pre-training Data	Denoising
Base	12	768	8	94.7	LibriSpeech 960 h	No
Base+	12	768	8	94.7	Mix 94k h	Yes
Large	24	1024	12	316.6	Mix 94k h	Yes

Base+ matches Base in architecture but is pre-trained on the larger, more diverse Mix-94k corpus (LibriLight, GigaSpeech, VoxPopuli) for 1M steps, with noisy/overlapped-speech simulation applied to 20% of utterances. The Large variant scales to 24 layers, $d_{model} = 1024$, and 316.6 million parameters.

5. Layer-Wise Component Summary

The organization of WavLM Base+ structural stages is shown below:

Stage	Output Dim	Kernel/Heads	$d_{ff}$	Layers	Notes
Conv encoder (7 blocks)	512	[10,3,3,3,3,2,2] / [5,2,…]	—	7	LayerNorm + GELU after each conv
Pos-bias embedding	n=320, m=800 (bucketed)	—	—	shared	Before 1st Transformer
Transformer block (×12)	768	8 heads; $d_k$=96	3072	12	Pre-LN; dropout 0.1; gated relative position bias
Final LayerNorm	768	—	—	1	Applied post last block

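An approximate parameter count per Transformer layer, counting weight matrices only, is $4 d_{model}^2 + 2 d_{model} d_{ff} \approx 7.1$M, or about 85M across the 12 layers. The remainder of the $\approx 94.7$M parameters of the Base+ model lies outside the Transformer blocks, for example in the convolutional encoder.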
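The arithmetic can be checked with a few lines of Python. This is a rough count under simplifying assumptions: only weight matrices are counted (no biases, LayerNorm, or position-bias parameters), and the helper names are illustrative.

```python
D_MODEL, D_FF, NUM_LAYERS = 768, 3072, 12
KERNELS = [10, 3, 3, 3, 3, 2, 2]
CONV_CHANNELS = 512


def transformer_layer_weights(d_model, d_ff):
    """Weights in one block: four attention projections plus two FFN matrices."""
    attention = 4 * d_model * d_model  # query, key, value, and output projections
    feed_forward = 2 * d_model * d_ff  # expansion and contraction matrices
    return attention + feed_forward


def conv_encoder_weights(kernels, channels):
    """Weights in the convolutional stack, starting from a 1-channel waveform."""
    total, in_ch = 0, 1
    for k in kernels:
        total += in_ch * channels * k
        in_ch = channels
    return total


per_layer = transformer_layer_weights(D_MODEL, D_FF)
stack = NUM_LAYERS * per_layer
conv = conv_encoder_weights(KERNELS, CONV_CHANNELS)
print(f"{per_layer / 1e6:.2f}M per layer")
print(f"{stack / 1e6:.1f}M in the Transformer stack")
print(f"{conv / 1e6:.1f}M in the convolutional encoder")
```

The Transformer stack dominates the total; the gap to the reported 94.7M is made up by the convolutional encoder and the components this rough count leaves out.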

6. Distinguishing Features and Research Significance

WavLM Base+'s principal innovations are (1) the integration of gated relative position bias, which modulates positional encoding by content, and (2) joint learning from both masked-prediction and denoising (noisy/overlapped-speech) objectives. Its extensive pre-training using the Mix-94k h dataset further improves downstream generalizability and robustness. These design choices collectively "bring significant improvements for various speech processing tasks on their representative benchmarks," broadening the operational reach of self-supervised speech models beyond automatic speech recognition to encompass paralinguistic and multi-speaker phenomena (Chen et al., 2021).

References

Chen, S., et al. (2021). WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing.
