MT-HuBERT: Mix-Training for Overlapped KWS
- The paper introduces a mix-training self-supervised framework that utilizes k-hot masked prediction to robustly detect overlapping keywords.
- It integrates multi-source mixing into the HuBERT architecture, using a single sigmoid head to independently detect the constituent units of mixed speech signals.
- Empirical evaluations demonstrate significant performance gains in both clean and multi-keyword conditions, outperforming previous mixture-aware SSL models.
Mix-Training HuBERT (MT-HuBERT) is a self-supervised learning framework designed to address few-shot keyword spotting (KWS) in mixed-speech conditions, where multiple overlapping keywords must be detected within a segment. MT-HuBERT integrates the Mix-Training (MT) principle—simultaneous multi-source data mixing and k-hot supervision—directly into masked prediction-based self-supervised pre-training, extending the HuBERT architecture for robust recognition and disentanglement in highly overlapped, data-scarce speech scenarios. This method demonstrates systematic improvements over baseline systems and previous mixture-aware SSL extensions in both clean and multi-keyword test conditions, notably under low-resource adaptation regimes.
1. Architecture and Model Formulation
MT-HuBERT builds on the HuBERT base model, comprising a convolutional feature encoder and a stack of Transformer layers, but substantially modifies the SSL target and output head:
- Local Feature Encoder ($f$): Processes the input waveform $X$ into frame-level representations $Z = f(X)$.
- Masking Operator ($M$): Randomly masks a subset of frames in $Z$ (typically 10-frame spans, masking 12% of positions), yielding the corrupted sequence $\tilde{Z} = M(Z)$.
- Context Network ($g$): A Transformer stack generates contextual embeddings $H = g(\tilde{Z})$.
- SSL Prediction Head:
- Codebook $\{e_c\}_{c=1}^{C}$ of centroids learned via k-means on HuBERT features.
- Projection matrix $A$ and temperature $\tau$.
- Probability assignment: The frame-unit probability is $p_c(h_t) = \sigma\!\big(\cos(A h_t, e_c)/\tau\big)$, with $\sigma$ the logistic sigmoid.
- Label Construction in Mixed Speech: Given $K$ utterances $x^{(1)}, \dots, x^{(K)}$ with mixing weights $\lambda_k$ sampled uniformly, the mixture is $x = \sum_k \lambda_k x^{(k)}$. Each clean source is tokenized to framewise cluster labels $z_t^{(k)}$, producing k-hot union targets $y_{t,c} = 1$ at each frame $t$ if cluster $c$ is present in any constituent (see the sketch below).
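The two ingredients above (the sigmoid prediction head and the k-hot union targets) can be sketched compactly. The following is an illustrative PyTorch sketch rather than the released implementation: it assumes framewise cluster labels for each clean source are already available, and names such as `khot_targets`, `KHotHead`, and the cosine-similarity scoring are placeholders consistent with the description, not confirmed API.

```python
# Illustrative sketch only: k-hot target construction and a cosine-similarity
# sigmoid head over a cluster codebook, following the description above.
import torch
import torch.nn.functional as F


def khot_targets(source_labels: list[torch.Tensor], num_clusters: int) -> torch.Tensor:
    """Union of per-source framewise cluster labels -> k-hot targets.

    source_labels: list of K tensors of shape (T,) holding cluster indices.
    Returns a (T, num_clusters) float tensor with 1 wherever any source
    contains that cluster at that frame.
    """
    T = source_labels[0].shape[0]
    y = torch.zeros(T, num_clusters)
    for z in source_labels:                      # one entry per clean source
        y.scatter_(1, z.unsqueeze(1), 1.0)       # mark the cluster as present
    return y


class KHotHead(torch.nn.Module):
    """Single sigmoid head: cosine similarity to the codebook, scaled by a temperature."""

    def __init__(self, dim: int, num_clusters: int, tau: float = 0.1):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)                  # projection A (assumed shape)
        self.codebook = torch.nn.Parameter(torch.randn(num_clusters, dim))
        self.tau = tau

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        """h: (T, dim) contextual embeddings -> (T, num_clusters) probabilities."""
        sim = F.cosine_similarity(
            self.proj(h).unsqueeze(1), self.codebook.unsqueeze(0), dim=-1
        )
        return torch.sigmoid(sim / self.tau)     # independent per-cluster detection
```

Because presence is scored independently per cluster, the same head handles one source or several without any architectural change.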
2. Mix-Training Objective and Masked k-Hot Prediction
The MT-HuBERT loss is a binary cross-entropy over the cluster codebook at each masked frame:

$$\mathcal{L}_{\text{MT}} = -\sum_{t \in \mathcal{M}} \sum_{c=1}^{C} \Big[\, y_{t,c} \log p_c(h_t) + (1 - y_{t,c}) \log\big(1 - p_c(h_t)\big) \Big],$$

where $\mathcal{M}$ is the set of masked frame indices and $y_{t,c}$ is the k-hot target defined above. For $K = 1$ (clean speech), $y_t$ reduces to a single 1-hot entry, and the loss collapses to the standard HuBERT cross-entropy. For $K \geq 2$, the k-hot target enforces that all constituent cluster codes of the mixture must be detectable. No permutation-invariant assignment or explicit source separation head is required: a single sigmoid vector output suffices, reflecting presence or absence over the cluster vocabulary.
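A minimal sketch of this objective, assuming sigmoid probabilities from a head like the one sketched in Section 1, k-hot targets, and a boolean frame mask; averaging over masked frames is an assumed convention, not a quoted detail.

```python
# Minimal sketch of the masked k-hot BCE objective described above.
import torch
import torch.nn.functional as F


def mt_hubert_loss(probs: torch.Tensor,      # (T, C) sigmoid probabilities
                   targets: torch.Tensor,    # (T, C) k-hot targets in {0, 1}
                   mask: torch.Tensor) -> torch.Tensor:  # (T,) bool, True = masked frame
    """Binary cross-entropy over the cluster codebook, restricted to masked frames."""
    bce = F.binary_cross_entropy(probs[mask], targets[mask].float(), reduction="sum")
    return bce / mask.sum().clamp(min=1)     # average over masked frames (assumed convention)
```

With a single clean source the target row has one active cluster, recovering the clean-speech case described above; with a 2-mix, both constituent clusters must receive high probability at the masked frame.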
3. Pre-Training Data Simulation and Unlabeled Data Regime
Pre-training leverages large-scale unlabeled data, employing on-the-fly mixture simulation:
- Corpus: LibriSpeech-960h.
- Mixture Construction: With a fixed mixing probability, two utterances are combined with weights sampled from a uniform distribution; otherwise the clean utterance is used. All mixtures are strictly energy-normalized to preserve audibility of all sources (see the simulation sketch below).
- Codebook: k-means on HuBERT layer-9 features yields the cluster centroids that serve as unsupervised targets.
- Optimization: Adam with $32$k warmup steps, a fixed per-GPU batch size measured in frames, and $1.6$M total steps.
- Masking: random span masking with span size $10$, masking 12% of positions as noted above.
This mixing regime exposes the model to both clean and mixed segments throughout SSL, training the backbone to reconstruct constituent speech units from pooled, overlapped observations.
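A sketch of this on-the-fly simulation under the regime above; the mixing probability `p_mix`, the uniform weight range, and RMS matching against the first source are stand-in assumptions for details not reproduced here.

```python
# Sketch of on-the-fly 2-mix simulation with energy normalization.
import torch


def simulate_mixture(wav_a: torch.Tensor,
                     wav_b: torch.Tensor,
                     p_mix: float = 0.5,          # placeholder mixing probability
                     low: float = 0.1,            # assumed uniform weight range
                     high: float = 0.9) -> tuple[torch.Tensor, int]:
    """Return either a clean utterance or an energy-normalized 2-mix.

    The second return value is the number of constituent sources (1 or 2),
    which downstream code uses to build 1-hot or k-hot targets.
    """
    if torch.rand(()) > p_mix:
        return wav_a, 1                          # keep the clean utterance

    lam = torch.empty(()).uniform_(low, high)    # mixing weight for source A
    n = min(wav_a.shape[-1], wav_b.shape[-1])    # align lengths by truncation
    mix = lam * wav_a[..., :n] + (1.0 - lam) * wav_b[..., :n]

    # Rescale so the mixture matches the energy of the first source,
    # keeping every constituent audible to the feature encoder.
    target_rms = wav_a[..., :n].pow(2).mean().sqrt()
    mix_rms = mix.pow(2).mean().sqrt().clamp(min=1e-8)
    return mix * (target_rms / mix_rms), 2
```

Downstream code can use the returned source count to decide between a 1-hot and a k-hot target for the segment.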
4. Few-Shot Fine-Tuning and Adaptation Strategies
After pre-training, the feature encoder and Transformer backbone are frozen. Adaptation for KWS is performed by adding a two-layer linear classifier atop the contextual embeddings and optimizing via BCE on the support set; three fine-tuning protocols are investigated:
- Clean: Support set consists of clean single-keyword utterances, without mixtures.
- Mixup: Each support batch induces synthetic examples via interpolation, mixing pairs with $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ and soft labels $\lambda\, y^{(1)} + (1-\lambda)\, y^{(2)}$.
- Mix-Training (MT): Emulates the pre-training regime: 2-mix support examples, energy-normalized mixing, k-hot target union. Training mirrors pre-training loss but adapts only the classifier head (HuBERT backbone frozen).
Each support utterance is encoded by the frozen backbone $g \circ f$; framewise outputs are mean-pooled, passed through the classifier, and evaluated with sigmoid activations. Hyperparameters include a batch size of $64$ per GPU, $50$ epochs, the Adam optimizer, and averaging of the last $10$ checkpoints.
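The adaptation stage can be sketched as follows, assuming the frozen backbone returns (batch, frames, dim) embeddings; the hidden width, the ReLU between the two linear layers, and the helper names are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of MT adaptation: frozen backbone, mean-pooled frames, BCE on a small head.
import torch
import torch.nn.functional as F


class KWSClassifier(torch.nn.Module):
    """Two-layer head over mean-pooled contextual embeddings."""

    def __init__(self, dim: int, num_keywords: int, hidden: int = 256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden),
            torch.nn.ReLU(),                      # nonlinearity assumed, not specified
            torch.nn.Linear(hidden, num_keywords),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        pooled = frames.mean(dim=1)              # (B, T, D) -> (B, D)
        return self.net(pooled)                  # keyword logits


def adaptation_step(backbone, head, optimizer, wav, targets):
    """One BCE update on the classifier head; the SSL backbone stays frozen."""
    with torch.no_grad():
        frames = backbone(wav)                   # (B, T, D), no gradients
    logits = head(frames)
    loss = F.binary_cross_entropy_with_logits(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

For the Clean protocol `targets` is 1-hot, for Mixup it is the interpolated soft label, and for MT it is the k-hot union, so the same update step serves all three protocols.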
5. Experimental Evaluation in Few-Shot Mixed Speech KWS
Key experiments are conducted on Google Speech Commands v2 (GSC v2), under different adaptation and evaluation scenarios:
- Support regime: 15, 5, or 3 examples per class (35 training keywords), with 5 random seeds per shot count.
- Test suite:
- Clean speech: 10-keyword official test set, Top-1 accuracy (ACC).
- 2-mix: Random pairwise mixtures (1:1 energy ratio), Top-2 ACC.
- 3-mix: Triplet mixtures (1:1:1), Top-3 ACC (not seen in training).
- Results (15-shot, MT adaptation):
- Clean: MT-HuBERT attains 93.80% ACC, 2.95% EER.
- 2-mix: 79.78% ACC, 8.98% EER.
- 3-mix: 65.91% ACC, 15.99% EER.
MT adaptation consistently outperforms both Clean and Mixup, with greater gains in lower-shot and high-overlap conditions. MT-HuBERT outpaces previous mixture-aware SSL models such as Cocktail HuBERT in all such scenarios.
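To make the evaluation protocol concrete, the sketch below assumes a k-mix trial counts as correct only when all k ground-truth keywords appear among the k highest-scoring classes; this scoring convention is an assumption rather than a quoted definition.

```python
# Sketch of Top-k accuracy over k-hot labels for the n-mix test sets.
import torch


def top_k_mix_accuracy(scores: torch.Tensor,       # (B, num_keywords) sigmoid scores
                       khot_labels: torch.Tensor   # (B, num_keywords) in {0, 1}
                       ) -> float:
    k = int(khot_labels[0].sum().item())           # keywords per trial (fixed per test set)
    topk = scores.topk(k, dim=1).indices           # (B, k) predicted keyword ids
    pred = torch.zeros_like(khot_labels).scatter_(1, topk, 1)
    correct = (pred * khot_labels).sum(dim=1) == k  # all ground-truth keywords recovered
    return correct.float().mean().item()
```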
6. Empirical Insights, Ablations, and Implementation Details
Ablation studies and implementation observations reveal several technical considerations:
- Mixing weights: Sampling weights uniformly keeps every source above the perceptual threshold and avoids over-sparsification; weights near the extremes of $0$ or $1$ degrade source attribution.
- Dataset scaling: Full 960h of unlabelled data is essential for backbone quality; reductions yield linearly diminished KWS generalization.
- Model scale: All experiments use the MT-HuBERT Base configuration; scaling to "Large" architectures is predicted (but not yet validated) to further enhance both clean and overlapped KWS.
- K-hot loss: The fundamental gain of MT-HuBERT over interpolation-based Mixup approaches lies in treating each class as an independent binary detection (k-hot), boosting weak-source keyword recovery; the two target constructions are contrasted after this list.
- Clean/mixed trade-off: There is a modest absolute reduction in clean-speech accuracy (compared to clean-only HuBERT), but substantial relative gains in 2-mix and 3-mix performance.
- Adaptation overhead: For frozen SSL backbones, MT adaptation on the classifier head alone is nearly as effective as full MT pre-training, streamlining training in practical deployments.
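To make the k-hot versus Mixup contrast above explicit, the two adaptation targets for a 2-mix with constituent labels $y^{(1)}$ and $y^{(2)}$ can be written side by side (schematic notation, not taken from the paper):

$$\text{Mixup: } \tilde{y} = \lambda\, y^{(1)} + (1-\lambda)\, y^{(2)} \qquad\text{vs.}\qquad \text{MT (k-hot): } y = y^{(1)} \vee y^{(2)}$$

Mixup scales the weaker source's label by its mixing coefficient, so a low-energy keyword receives a proportionally weak training signal, whereas the k-hot union keeps every constituent at full strength; this is the mechanism behind the weak-source recovery gains noted above.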
7. Context, Comparisons, and Applications
MT-HuBERT originates from the broader lineage of HuBERT and mixture-robust SSL frameworks such as Cocktail HuBERT (Fazel-Zarandi et al., 2023). It differs from multi-head source reconstruction approaches by using a single sigmoid head with a multi-hot target, which simplifies training and generalizes directly to arbitrary numbers of overlapping sources. Relative to previous mixture-aware pre-training, MT-HuBERT’s k-hot masked prediction outperforms both permutation-invariant multi-head losses and Mixup-interpolation baselines in generalization to few-shot, highly overlapped, and unseen mixture conditions.
Practical applications include modern voice assistants, ambient device keyword detection, and low-resource KWS in noisy real-world settings, where overlapped speech is the norm rather than the exception. MT-HuBERT's data efficiency (e.g., 93.80% Top-1 ACC on clean speech and 79.78% Top-2 ACC on 2-mix with only 15 support samples per class, as reported above) positions it as a directly deployable solution for real-world speech interfaces demanding both low resource usage and robust overlap resilience.
A plausible implication is that, because MT-HuBERT's central mechanism is k-hot masked prediction, it may inform broader SSL designs for multi-label, multi-source recognition problems beyond speech.