WavLM Base+ Architecture

Updated 11 April 2026
  • WavLM Base+ is a self-supervised speech model that integrates masked speech prediction and denoising to enhance multi-speaker awareness.
  • It uses a convolutional feature encoder and a 12-layer Transformer with a gated relative position bias mechanism to capture complex sequential dependencies.
  • Pre-trained on a diverse Mix-94k h dataset, the model delivers significant improvements in speech recognition and paralinguistic tasks.

WavLM Base+ is a self-supervised pre-trained speech model designed to unify content modeling, denoising, and multi-speaker awareness within a single architecture. Building upon the HuBERT framework, it introduces novel masked speech prediction and denoising objectives, employs a gated relative position bias mechanism in its Transformer encoder, and leverages large-scale, diverse training data to advance full-stack speech processing tasks (Chen et al., 2021).

1. Core Architectural Components

WavLM Base+ comprises a convolutional feature encoder and a Transformer-based encoder. The convolutional frontend consists of 7 temporal-convolutional blocks, each with 512 output channels. The kernel widths are [10, 3, 3, 3, 3, 2, 2] and the strides are [5, 2, 2, 2, 2, 2, 2], giving a total stride of 320 samples and a receptive field of 400 samples. At a 16 kHz sampling rate, the encoder therefore emits one 512-dimensional feature frame every 20 ms, each covering approximately 25 ms of input. After each convolution, LayerNorm and GELU activation are applied.
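The frame rate and receptive field quoted above follow directly from the kernel and stride lists. The sketch below works through that arithmetic; the 16 kHz sampling rate and the use of unpadded convolutions are assumptions made for illustration, and the function names are hypothetical.

```python
# Sketch: temporal geometry of the 7-block convolutional frontend.
# Assumptions (not stated in the text): 16 kHz audio, unpadded convolutions.
KERNELS = [10, 3, 3, 3, 3, 2, 2]
STRIDES = [5, 2, 2, 2, 2, 2, 2]
SAMPLE_RATE_HZ = 16_000  # assumed


def frontend_geometry(kernels, strides):
    """Return (receptive_field, hop) of the conv stack, in input samples."""
    receptive_field, hop = 1, 1
    for k, s in zip(kernels, strides):
        receptive_field += (k - 1) * hop  # each layer widens the input window
        hop *= s                          # and coarsens the output frame rate
    return receptive_field, hop


def num_frames(num_samples, kernels, strides):
    """Number of output frames when every convolution is unpadded."""
    length = num_samples
    for k, s in zip(kernels, strides):
        length = (length - k) // s + 1
    return length


rf, hop = frontend_geometry(KERNELS, STRIDES)
print(rf, hop)                                    # 400 320 (samples)
print(1000 * rf / SAMPLE_RATE_HZ,                 # 25.0 ms receptive field
      1000 * hop / SAMPLE_RATE_HZ)                # 20.0 ms hop
print(num_frames(SAMPLE_RATE_HZ, KERNELS, STRIDES))  # 49 frames for 1 s of audio
```

Under these assumptions, one second of audio yields 49 frames rather than 50, because the unpadded convolutions cannot produce a frame for the final partial window.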

The Transformer encoder in WavLM Base+ comprises 12 layers, each with a model dimension $d_{model} = 768$, a feed-forward inner dimension $d_{ff} = 3072$, and 8 attention heads ($d_k = 96$ per head). Dropout of 0.1 is applied to both attention weights and FFN output, mirroring HuBERT's configuration. The model adopts pre-norm LayerNorm (preceding each sub-layer) and includes a final LayerNorm after the last block. The overall parameter count for WavLM Base+ is approximately 94.7 million.

2. Gated Relative Position Bias Mechanism

In each Transformer block, WavLM replaces conventional absolute or convolutional relative position embeddings with a gated relative position bias, as introduced in XLM-E. For each pair of positions $(i, j)$, a "bucketed" relative index $|i-j|$ is mapped via T5-style logarithmic bucketing ($n = 320$ buckets, maximum distance $m = 800$) to a shared embedding $d_{i-j}$. Two scalar gates per query ($g^{up}_i$ and $g^{res}_i$) are computed:

$$g^{up}_i = \sigma(q_i \cdot u), \qquad g^{res}_i = \sigma(q_i \cdot w)$$

where $q_i$ is the query projection of the $i$-th hidden state and $u, w \in \mathbb{R}^{d_k}$ are learned vectors.

An intermediate bias is defined as

$$\tilde{r}_{i-j} = s \, g^{res}_i \, d_{i-j}$$

where $s$ is a learned scalar. The final gated bias is

$$r_{i-j} = d_{i-j} + g^{up}_i \, d_{i-j} + (1 - g^{up}_i) \, \tilde{r}_{i-j}$$

The self-attention logits incorporate this bias:

$$e_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}} + r_{i-j}$$

The attention weights are computed as usual, by a softmax over $j$. This approach enables content-dependent modulation of positional bias, enhancing the model's capacity to encode complex sequential dependencies in speech.
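The gating described in this section can be illustrated with a small single-head sketch in plain Python. It is not the reference implementation: the gate formulas follow the form published for WavLM (Chen et al., 2021), the bucketing assumes the bidirectional T5 scheme in which half of the buckets serve each direction, and the names `relative_bucket`, `gated_bias`, `attention_weights`, and the scalar `s` are hypothetical.

```python
import math
import random

NUM_BUCKETS, MAX_DISTANCE = 320, 800  # n and m from the text


def relative_bucket(rel, num_buckets=NUM_BUCKETS, max_distance=MAX_DISTANCE):
    """T5-style log bucketing of a signed offset (bidirectional variant, assumed)."""
    half = num_buckets // 2            # buckets available per direction
    offset = half if rel > 0 else 0
    dist = abs(rel)
    max_exact = half // 2              # small distances get one bucket each
    if dist < max_exact:
        return offset + dist
    scaled = math.log(dist / max_exact) / math.log(max_distance / max_exact)
    log_bucket = max_exact + int(scaled * (half - max_exact))
    return offset + min(log_bucket, half - 1)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_bias(q_i, d, u, w, s):
    """Gate the bucketed bias d with two query-dependent scalar gates."""
    g_up = sigmoid(dot(q_i, u))
    g_res = sigmoid(dot(q_i, w))
    r_tilde = s * g_res * d                      # intermediate bias
    return d + g_up * d + (1.0 - g_up) * r_tilde


def attention_weights(queries, keys, bias_table, u, w, s):
    """Softmax over j of q_i.k_j / sqrt(d_k) + gated bias, for one head."""
    d_k = len(queries[0])
    rows = []
    for i, q in enumerate(queries):
        logits = [
            dot(q, k) / math.sqrt(d_k)
            + gated_bias(q, bias_table[relative_bucket(i - j)], u, w, s)
            for j, k in enumerate(keys)
        ]
        peak = max(logits)                       # subtract max for stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


def random_vector(rng, size):
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


rng = random.Random(0)
D_K, T = 4, 6                                    # toy sizes, not the model's
queries = [random_vector(rng, D_K) for _ in range(T)]
keys = [random_vector(rng, D_K) for _ in range(T)]
bias_table = [rng.gauss(0.0, 0.1) for _ in range(NUM_BUCKETS)]
weights = attention_weights(queries, keys, bias_table,
                            random_vector(rng, D_K), random_vector(rng, D_K), s=1.0)
print([round(sum(row), 6) for row in weights])   # each row sums to 1
```

Because both gates depend on the query $q_i$, two frames at the same relative offset can receive different effective biases, which is the content-dependent behaviour the section describes.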

3. Dual Pre-Training Objectives: Masked Speech Prediction and Denoising

WavLM Base+ extends the masked prediction paradigm established by HuBERT by integrating a denoising objective. Approximately 20% of pre-training utterances are artificially corrupted by mixing the input speech $x$ (over a random 50%-length segment) either with another utterance (SNR $\sim$ Uniform(–5, 5) dB) or with a noise clip from the Deep Noise Suppression (DNS) challenge data (SNR $\sim$ Uniform(–5, 20) dB). In all cases, masking follows HuBERT: 8% of time steps are selected as span starts, and each span covers 10 frames.

During training, the corrupted signal $\tilde{x}$ serves as input, and the model is tasked with predicting the original pseudo-label cluster index $z_t$ at each masked position $t$ using a cross-entropy loss over $C$ cluster codes:

$$\mathcal{L} = -\sum_{t \in M} \log p(z_t \mid \tilde{x}, t)$$

where $M$ is the set of masked positions. No additional denoising head is added; denoising capability emerges from the model's forced recovery of clean pseudo-labels given corrupted inputs.
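A minimal sketch of the input corruption and span masking is given below, using only the Python standard library. Several details are assumptions rather than facts from the text: the SNR is measured over the mixed region, speech and noise interferers are chosen with equal probability, and the 8% figure is treated as the probability that a frame starts a 10-frame span, as in HuBERT's masking recipe. All function names are hypothetical.

```python
import math
import random


def mean_power(x):
    return sum(v * v for v in x) / len(x)


def mix_at_snr(primary, interferer, snr_db, start):
    """Add `interferer` to `primary` from index `start`, scaled to `snr_db`."""
    region = primary[start:start + len(interferer)]
    p_sig = mean_power(region)
    p_int = mean_power(interferer[:len(region)])
    mixed = list(primary)
    if p_sig == 0.0 or p_int == 0.0:
        return mixed                               # nothing sensible to scale
    scale = math.sqrt(p_sig / (p_int * 10 ** (snr_db / 10)))
    for n in range(len(region)):
        mixed[start + n] += scale * interferer[n]
    return mixed


def corrupt(primary, other_utterance, noise_clip, rng, p_corrupt=0.2):
    """With probability p_corrupt, overlay speech or noise on half the utterance."""
    if rng.random() >= p_corrupt:
        return list(primary)
    seg_len = len(primary) // 2                    # 50%-length segment
    start = rng.randrange(len(primary) - seg_len + 1)
    if rng.random() < 0.5:                         # assumed 50/50 choice
        source, snr_db = other_utterance, rng.uniform(-5.0, 5.0)
    else:
        source, snr_db = noise_clip, rng.uniform(-5.0, 20.0)
    src_start = rng.randrange(len(source) - seg_len + 1)
    segment = source[src_start:src_start + seg_len]
    return mix_at_snr(primary, segment, snr_db, start)


def span_mask(num_frames, rng, p_start=0.08, span=10):
    """Mark frames covered by spans whose starts are drawn with prob. p_start."""
    mask = [False] * num_frames
    for t in range(num_frames):
        if rng.random() < p_start:
            for k in range(t, min(t + span, num_frames)):
                mask[k] = True
    return mask


rng = random.Random(0)
speech = [math.sin(0.01 * n) for n in range(1600)]     # stand-in waveforms
other = [math.sin(0.013 * n) for n in range(1600)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(1600)]
noisy = corrupt(speech, other, noise, rng, p_corrupt=1.0)
mask = span_mask(49, rng)
print(len(noisy), sum(mask))
```

The training targets are unchanged by `corrupt`: the pseudo-labels still come from the clean utterance, so the loss at the masked frames rewards recovering clean content from the degraded input.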

4. Comparative Configurations: Base, Base+, and Large

WavLM’s architectural variants are distinguished by scale and data used in pre-training:

| Variant | Layers | $d_{model}$ | Heads | Params (M) | Pre-training Data | Denoising |
|---|---|---|---|---|---|---|
| Base | 12 | 768 | 8 | 94.7 | LibriSpeech 960 h | No |
| Base+ | 12 | 768 | 8 | 94.7 | Mix 94k h | Yes |
| Large | 24 | 1024 | 12 | 316.6 | Mix 94k h | Yes |

Base+ matches Base in architecture but distinguishes itself by pre-training on a larger, more diverse Mix-94k corpus (LibriLight, GigaSpeech, VoxPopuli), for 1M steps, and by using 20% noisy/overlapped-speech simulation. The Large variant scales to 24 layers, $d_{model} = 1024$, and 316.6 million parameters.

5. Layer-Wise Component Summary

The structural stages of WavLM Base+ are summarized below:

| Stage | Output Dim | Kernel/Heads | $d_{ff}$ | Layers | Notes |
|---|---|---|---|---|---|
| Conv encoder (7 blocks) | 512 | [10,3,3,3,3,2,2] / [5,2,…] | - | 7 | LayerNorm + GELU after each conv |
| Pos-bias embedding | - | n=320, m=800 (bucketed) | - | shared | Before 1st Transformer |
| Transformer block (×12) | 768 | 8 heads; $d_k = 96$ | 3072 | 12 | Pre-LN; dropout 0.1; gated relative position |
| Final LayerNorm | 768 | - | - | 1 | Applied post last block |

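An approximate parameter count per Transformer layer is $4 d_{model}^2 + 2 d_{model} d_{ff} \approx 7.1$M (attention and feed-forward weight matrices only), so the 12 layers account for about 85M of the roughly 94.7M parameters of the Base+ model.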
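As a rough cross-check of the parameter budget, the sketch below counts the dominant weight matrices from the dimensions given in this article. It assumes standard Transformer layers with four attention projections and a two-matrix feed-forward block, and bias-free convolutions in the frontend; the function names are hypothetical, and smaller components are deliberately left out.

```python
def transformer_layer_params(d_model, d_ff, include_small_terms=True):
    """Approximate parameter count of one Transformer encoder layer."""
    attention = 4 * d_model * d_model      # Q, K, V and output projections
    feed_forward = 2 * d_model * d_ff      # two feed-forward matrices
    small = 0
    if include_small_terms:
        small += 4 * d_model               # attention projection biases
        small += d_ff + d_model            # feed-forward biases
        small += 2 * 2 * d_model           # two LayerNorms (scale and shift)
    return attention + feed_forward + small


def conv_encoder_params(kernels, channels=512, in_channels=1):
    """Weights of the conv frontend, assuming bias-free convolutions."""
    total, c_in = 0, in_channels
    for k in kernels:
        total += c_in * channels * k
        c_in = channels
    return total


per_layer = transformer_layer_params(768, 3072, include_small_terms=False)
encoder = 12 * transformer_layer_params(768, 3072)
frontend = conv_encoder_params([10, 3, 3, 3, 3, 2, 2])

print(per_layer)            # 7077888, about 7.1M per layer
print(encoder)              # 85054464, about 85M for 12 layers
print(frontend)             # 4199424, about 4.2M for the conv frontend
print((encoder + frontend) / 1e6)
```

The total from this sketch falls a few million short of the quoted 94.7M; the remainder belongs to components the sketch does not model.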

6. Distinguishing Features and Research Significance

WavLM Base+'s principal innovations are (1) the integration of gated relative position bias, which modulates positional encoding by content, and (2) joint learning from both masked-prediction and denoising (noisy/overlapped-speech) objectives. Its extensive pre-training using the Mix-94k h dataset further improves downstream generalizability and robustness. These design choices collectively "bring significant improvements for various speech processing tasks on their representative benchmarks," broadening the operational reach of self-supervised speech models beyond automatic speech recognition to encompass paralinguistic and multi-speaker phenomena (Chen et al., 2021).
