WavLM Base+ Architecture
- WavLM Base+ is a self-supervised speech model that integrates masked speech prediction and denoising to enhance multi-speaker awareness.
- It uses a convolutional feature encoder and a 12-layer Transformer with a gated relative position bias mechanism to capture complex sequential dependencies.
- Pre-trained on a diverse Mix-94k h dataset, the model delivers significant improvements in speech recognition and paralinguistic tasks.
WavLM Base+ is a self-supervised pre-trained speech model designed to unify content modeling, denoising, and multi-speaker awareness within a single architecture. Building upon the HuBERT framework, it introduces novel masked speech prediction and denoising objectives, employs a gated relative position bias mechanism in its Transformer encoder, and leverages large-scale, diverse training data to advance full-stack speech processing tasks (Chen et al., 2021).
1. Core Architectural Components
WavLM Base+ comprises a convolutional feature encoder and a Transformer-based encoder. The convolutional frontend consists of 7 temporal-convolutional blocks, each with 512 output channels. The kernel widths are [10, 3, 3, 3, 3, 2, 2], and the strides are [5, 2, 2, 2, 2, 2, 2], yielding 512-dimensional features every 20 ms, covering approximately 25 ms of input. After each convolution, LayerNorm and GELU activation are applied.
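The 20 ms frame shift and 25 ms frame span follow directly from the kernel widths and strides. The sketch below works through that arithmetic; the 16 kHz sample rate and the helper name `conv_stack_geometry` are assumptions for illustration, not details taken from the text above.

```python
def conv_stack_geometry(kernels, strides):
    """Return (total_stride, receptive_field) in input samples for a 1-D conv stack."""
    jump = 1    # spacing, in input samples, between neighbouring outputs so far
    field = 1   # number of input samples that one output frame depends on
    for k, s in zip(kernels, strides):
        field += (k - 1) * jump  # each extra kernel tap reaches `jump` samples further
        jump *= s                # striding widens the spacing for the next layer
    return jump, field


KERNELS = [10, 3, 3, 3, 3, 2, 2]
STRIDES = [5, 2, 2, 2, 2, 2, 2]
SAMPLE_RATE_HZ = 16_000  # assumption: 16 kHz input audio

total_stride, receptive_field = conv_stack_geometry(KERNELS, STRIDES)
frame_shift_ms = 1000 * total_stride / SAMPLE_RATE_HZ
frame_span_ms = 1000 * receptive_field / SAMPLE_RATE_HZ
print(total_stride, receptive_field)   # 320 400
print(frame_shift_ms, frame_span_ms)   # 20.0 25.0
```

Under that assumption, one output frame is produced every 320 samples and sees 400 samples of waveform, which matches the 20 ms and 25 ms figures.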
The Transformer encoder in WavLM Base+ comprises 12 layers, each with a model dimension $d_{\text{model}} = 768$, a feed-forward inner dimension $d_{\text{ff}} = 3072$, and 8 attention heads ($d_k = 96$ per head). Dropout of 0.1 is applied to both attention weights and FFN output, mirroring HuBERT's configuration. The model adopts pre-norm LayerNorm (preceding each sub-layer) and includes a final LayerNorm after the last block. The overall parameter count for WavLM Base+ is approximately 94.7 million.
2. Gated Relative Position Bias Mechanism
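In each Transformer block, WavLM replaces conventional absolute or convolutional relative biases with a gated relative position bias as introduced in XLM-E. For each pair of positions $(i, j)$, a "bucketed" relative index is mapped via T5-style logarithmic bucketing ($n = 320$ buckets, $m = 800$ maximum distance) to a shared embedding $d_{i-j}$. Two scalar gates per query ($g_i^{(\text{update})}$ and $g_i^{(\text{reset})}$) are computed:
$$g_i^{(\text{update})} = \sigma(q_i \cdot u), \qquad g_i^{(\text{reset})} = \sigma(q_i \cdot w),$$
where $q_i$ is the query projection of the $i$-th hidden state and $u, w$ are learned vectors.
An intermediate bias is defined as
$$\tilde{r}_{i-j} = s \, g_i^{(\text{reset})} \, d_{i-j},$$
where $s$ is a learned scalar. The final gated bias is
$$r_{i-j} = d_{i-j} + g_i^{(\text{update})} \, d_{i-j} + \bigl(1 - g_i^{(\text{update})}\bigr) \, \tilde{r}_{i-j}.$$
The self-attention logits incorporate this bias: $$e_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}} + r_{i-j},$$ where $k_j$ is the key projection at position $j$ and $d_k$ is the per-head dimension. The attention weights are then computed as usual by a softmax over $j$. This approach enables content-dependent modulation of positional bias, enhancing the model's capacity to encode complex sequential dependencies in speech.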
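The mechanism can be sketched for a single attention head in plain Python. This is an illustration under stated assumptions, not the reference implementation: the bucketing follows the usual T5-style bidirectional scheme (half the buckets per direction, exact buckets for short distances, logarithmic ones beyond), the gate form is the update/reset combination described above, and all function names, the sign convention for the offset, and the toy values are hypothetical.

```python
import math


def relative_bucket(offset, num_buckets=320, max_distance=800):
    """Map a signed offset to a bucket id in [0, num_buckets) (T5-style, assumed)."""
    half = num_buckets // 2              # buckets available per direction
    base = half if offset > 0 else 0     # assumption: positive offsets use the upper half
    dist = abs(offset)
    max_exact = half // 2                # short distances each get their own bucket
    if dist < max_exact:
        return base + dist
    # longer distances share logarithmically spaced buckets up to max_distance
    log_pos = max_exact + int(
        math.log(dist / max_exact) / math.log(max_distance / max_exact) * (half - max_exact)
    )
    return base + min(log_pos, half - 1)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gated_bias(q_i, d, u, w, s):
    """Gated bias for one query vector q_i and one bucket embedding value d."""
    g_update = sigmoid(dot(q_i, u))
    g_reset = sigmoid(dot(q_i, w))
    r_tilde = s * g_reset * d                        # intermediate bias
    return d + g_update * d + (1.0 - g_update) * r_tilde


def attention_row(q_i, keys, i, bias_table, u, w, s):
    """Softmax attention weights of query i over all keys, with the gated bias added."""
    d_k = len(q_i)
    logits = []
    for j, k_j in enumerate(keys):
        d = bias_table[relative_bucket(i - j)]
        logits.append(dot(q_i, k_j) / math.sqrt(d_k) + gated_bias(q_i, d, u, w, s))
    peak = max(logits)                               # subtract the max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage with made-up numbers (2-dimensional head, 3 positions).
bias_table = [0.01 * b for b in range(320)]
weights = attention_row(
    q_i=[0.5, -0.25], keys=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], i=1,
    bias_table=bias_table, u=[0.3, 0.1], w=[-0.2, 0.4], s=1.0,
)
print(weights)
```

Because both gates depend on the query, two frames at the same relative distance can receive different positional biases, which is the content-dependent behaviour the section describes.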
3. Dual Pre-Training Objectives: Masked Speech Prediction and Denoising
WavLM Base+ extends the masked prediction paradigm established by HuBERT by integrating denoising objectives. Approximately 20% of pre-training utterances are artificially corrupted by mixing the input speech $x$ (in a random 50%-length segment) either with another utterance (SNR $\sim$ Uniform(–5, 5) dB) or with a DNS noise clip (SNR $\sim$ Uniform(–5, 20) dB). In all cases, masking follows HuBERT: 8% of time steps are selected as span starts, and each span covers 10 frames.
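A minimal sketch of the corruption and masking steps, using plain Python lists as stand-ins for waveforms and frame sequences. Several details are assumptions made only so the example runs: the even split between utterance mixing and noise mixing, the reading of 8% as the probability that a frame starts a 10-frame span (the HuBERT convention), and all function names.

```python
import math
import random


def mix_at_snr(speech, interference, snr_db, start):
    """Add `interference` into `speech` at `start`, scaled to the requested SNR in dB."""
    seg = speech[start:start + len(interference)]
    p_speech = sum(v * v for v in seg) / len(seg)
    p_noise = sum(v * v for v in interference) / len(interference)
    # choose scale so that 10*log10(p_speech / (scale**2 * p_noise)) == snr_db
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    mixed = list(speech)
    for n, v in enumerate(interference):
        mixed[start + n] += scale * v
    return mixed


def maybe_corrupt(speech, other_utt, noise_clip, rng, p_corrupt=0.2):
    """Corrupt about 20% of utterances over a random segment of half their length."""
    if rng.random() >= p_corrupt:
        return list(speech)
    seg_len = len(speech) // 2
    start = rng.randrange(0, len(speech) - seg_len + 1)
    if rng.random() < 0.5:  # assumption: equal chance of each corruption type
        source, snr_db = other_utt, rng.uniform(-5, 5)
    else:
        source, snr_db = noise_clip, rng.uniform(-5, 20)
    return mix_at_snr(speech, source[:seg_len], snr_db, start)


def span_mask(num_frames, rng, p_start=0.08, span=10):
    """Mark frames as masked; each frame starts a span with probability p_start."""
    mask = [False] * num_frames
    for t in range(num_frames):
        if rng.random() < p_start:
            for k in range(t, min(t + span, num_frames)):
                mask[k] = True
    return mask


rng = random.Random(0)
clean = [math.sin(0.1 * n) for n in range(400)]
other = [math.sin(0.37 * n) for n in range(400)]
noise = [rng.uniform(-1, 1) for _ in range(400)]
noisy = maybe_corrupt(clean, other, noise, rng)
mask = span_mask(num_frames=100, rng=rng)
```

Overlapping spans merge, so the fraction of masked frames ends up well above 8% under this reading.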
During training, the corrupted input $\tilde{x}$ serves as input while the model is tasked with predicting the original pseudo-label cluster index $z_t$ at masked positions using a cross-entropy loss over $C$ cluster codes: $$\mathcal{L} = -\sum_{t \in M} \log p(z_t \mid \tilde{x}, t),$$ where $M$ is the set of masked time steps. No additional denoising head is added; denoising capability emerges from the model's forced recovery of clean pseudo-labels given corrupted inputs.
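The objective reduces to an ordinary cross-entropy summed over masked frames only. The sketch below assumes the model has already produced one score per cluster code for every frame; how those scores are computed is left out, and the function name is hypothetical.

```python
import math


def masked_prediction_loss(logits, targets, mask):
    """Sum of -log p(z_t) over masked frames.

    logits:  per-frame list of C scores (one per cluster code)
    targets: per-frame pseudo-label index z_t, taken from the clean speech
    mask:    per-frame bool, True where the frame was masked
    """
    loss = 0.0
    for frame_logits, z_t, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue  # unmasked frames do not contribute to the loss
        peak = max(frame_logits)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in frame_logits))
        loss += log_norm - frame_logits[z_t]  # equals -log softmax(frame_logits)[z_t]
    return loss


# Toy usage: 3 frames, C = 2 codes, frames 0 and 2 masked.
toy_logits = [[0.0, 0.0], [5.0, 0.0], [0.0, 0.0]]
toy_loss = masked_prediction_loss(toy_logits, targets=[0, 0, 1], mask=[True, False, True])
print(toy_loss)
```

Since the targets come from the clean signal while the scores come from the corrupted one, minimizing this loss pushes the encoder to ignore the added interference without any separate denoising head.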
4. Comparative Configurations: Base, Base+, and Large
WavLM’s architectural variants are distinguished by scale and data used in pre-training:
| Variant | Layers | $d_{\text{model}}$ | Heads | Params (M) | Pre-training Data | Denoising |
|---|---|---|---|---|---|---|
| Base | 12 | 768 | 8 | 94.7 | LibriSpeech 960 h | No |
| Base+ | 12 | 768 | 8 | 94.7 | Mix 94k h | Yes |
| Large | 24 | 1024 | 12 | 316.6 | Mix 94k h | Yes |
Base+ matches Base in architecture but differs in pre-training: it uses the larger, more diverse Mix-94k corpus (LibriLight, GigaSpeech, VoxPopuli), trains for 1M steps, and applies noisy/overlapped-speech simulation to 20% of utterances. The Large variant scales to 24 layers, $d_{\text{model}} = 1024$, and 316.6 million parameters.
5. Layer-Wise Component Summary
The organization of WavLM Base+ structural stages is shown below:
| Stage | Output Dim | Kernel/Heads | $d_{\text{ff}}$ | Layers | Notes |
|---|---|---|---|---|---|
| Conv encoder (7 blocks) | 512 | [10,3,3,3,3,2,2] / [5,2,…] | — | 7 | LayerNorm + GELU after each conv |
| Pos-bias embedding | n=320, m=800 (bucketed) | — | — | shared | Before 1st Transformer |
| Transformer block (×12) | 768 | 8 heads; $d_k$=96 | 3072 | 12 | Pre-LN; dropout 0.1; gated relative position bias |
| Final LayerNorm | 768 | — | — | 1 | Applied post last block |
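An approximate parameter count per Transformer layer, ignoring biases and LayerNorm weights, is $4 d_{\text{model}}^2 + 2 d_{\text{model}} d_{\text{ff}} \approx 7.1$M, so the 12 layers contribute roughly 85M of the approximately 94.7M parameters in the Base+ model.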
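A rough weight count can be checked with a few lines of arithmetic. The sketch assumes the common approximation of four $d_{\text{model}} \times d_{\text{model}}$ attention projections plus two feed-forward matrices per layer, a single-channel waveform input to the convolutional encoder, and no biases or normalization weights; the function names are illustrative.

```python
def transformer_layer_params(d_model, d_ff):
    """Approximate weights in one Transformer layer (biases and LayerNorm ignored)."""
    attention = 4 * d_model * d_model   # query, key, value and output projections
    feed_forward = 2 * d_model * d_ff   # the two linear maps of the FFN
    return attention + feed_forward


def conv_encoder_params(kernels, channels=512, in_channels=1):
    """Approximate weights in the convolutional encoder (biases and norms ignored)."""
    total, c_in = 0, in_channels
    for k in kernels:
        total += c_in * channels * k
        c_in = channels
    return total


per_layer = transformer_layer_params(768, 3072)
stack = 12 * per_layer
conv = conv_encoder_params([10, 3, 3, 3, 3, 2, 2])
print(per_layer)  # 7077888
print(stack)      # 84934656
print(conv)       # 4199424
```

Together these account for about 89M weights; the gap to the reported 94.7M sits in components this sketch does not count, such as projections, biases, normalization, and positional parameters.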
6. Distinguishing Features and Research Significance
WavLM Base+'s principal innovations are (1) the integration of gated relative position bias, which modulates positional encoding by content, and (2) joint learning from both masked-prediction and denoising (noisy/overlapped-speech) objectives. Its extensive pre-training using the Mix-94k h dataset further improves downstream generalizability and robustness. These design choices collectively "bring significant improvements for various speech processing tasks on their representative benchmarks," broadening the operational reach of self-supervised speech models beyond automatic speech recognition to encompass paralinguistic and multi-speaker phenomena (Chen et al., 2021).