- The paper introduces AdaKWS, which conditions audio features using keyword-specific Adaptive Instance Normalization for dynamic and accurate spotting.
- It pairs a lightweight LSTM text encoder, which generates the normalization parameters, with a frozen Whisper transformer encoder that processes the audio input.
- The method achieves efficient training and fast inference while outperforming larger ASR baselines, demonstrating strong multilingual and low-resource capabilities.
The paper "Open-vocabulary Keyword-spotting with Adaptive Instance Normalization" (2309.08561) introduces AdaKWS, a novel method for open-vocabulary keyword spotting (KWS) that addresses the limitations of traditional KWS approaches requiring predefined keywords or audio examples for new keywords.
The core idea of AdaKWS is to condition the audio processing pipeline on the target keyword using Adaptive Instance Normalization (AdaIN). Unlike previous open-vocabulary KWS methods that aim to align audio and text embeddings in a shared latent space, AdaKWS uses a text encoder to generate keyword-specific normalization parameters.
The architecture consists of two main components:
- Text Encoder: A lightweight character-based LSTM (4 layers, 256 hidden dimensions) takes the target keyword as input and outputs the mean ($\mu_v$) and standard deviation ($\sigma_v$) parameters required by the AdaIN layers.
- Audio Classifier:
- Uses a frozen pre-trained Whisper transformer encoder to process the input audio into representations.
- These audio representations are then passed through two sequential keyword-adaptive modules. Each adaptive module is a standard transformer encoder block where the Layer Normalization layers are replaced by AdaIN layers.
- The AdaIN layers use the keyword-conditioned parameters ($\mu_v$, $\sigma_v$) generated by the text encoder to normalize and rescale the audio features, dynamically adapting the audio processing to the specific keyword being spotted. The AdaIN transformation is $\mathrm{AdaIN}(z, v) = \sigma_v \left( \frac{z - \mu_z}{\sigma_z} \right) + \mu_v$, where $z$ is the audio representation with per-instance statistics $\mu_z, \sigma_z$, and $v$ is the keyword (see the sketch after this list).
- The keyword-conditioned audio representation is then max-pooled and fed into a linear classifier that predicts the probability of the keyword being present.
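To make the conditioning mechanism concrete, here is a minimal PyTorch sketch of a keyword-conditioned AdaIN layer and a character-level LSTM head that produces its parameters. The tensor shapes, the single shared parameter head, and the use of the last LSTM hidden state with an exponential to keep $\sigma_v$ positive are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AdaIN(nn.Module):
    """Adaptive Instance Normalization conditioned on the keyword.

    Normalizes each audio feature sequence with its own statistics
    (mu_z, sigma_z over time), then rescales with the keyword-derived
    parameters (mu_v, sigma_v) produced by the text encoder.
    """

    def __init__(self, eps: float = 1e-5):
        super().__init__()
        self.eps = eps

    def forward(self, z, mu_v, sigma_v):
        # z: (batch, time, dim); mu_v, sigma_v: (batch, dim)
        mu_z = z.mean(dim=1, keepdim=True)
        sigma_z = z.std(dim=1, keepdim=True)
        z_norm = (z - mu_z) / (sigma_z + self.eps)
        return sigma_v.unsqueeze(1) * z_norm + mu_v.unsqueeze(1)


class KeywordEncoder(nn.Module):
    """Character-level LSTM (4 layers, 256 hidden units, per the paper)
    mapping a keyword to AdaIN parameters."""

    def __init__(self, vocab_size: int, dim: int = 256, num_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=num_layers, batch_first=True)
        self.to_params = nn.Linear(dim, 2 * dim)

    def forward(self, char_ids):
        # char_ids: (batch, num_chars) integer-encoded characters
        hidden, _ = self.lstm(self.embed(char_ids))
        mu_v, log_sigma_v = self.to_params(hidden[:, -1]).chunk(2, dim=-1)
        # Exponentiating to keep sigma_v positive is our assumption,
        # not a detail taken from the paper.
        return mu_v, log_sigma_v.exp()
```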
AdaKWS is trained end-to-end. The trainable parameters are the text encoder parameters $\phi$ and the shared audio classifier parameters $\theta$; the keyword-specific AdaIN parameters are a function of the keyword and $\phi$. The model is trained using a cross-entropy loss on positive examples (audio containing the keyword) and negative examples (audio not containing the keyword).
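As a rough sketch of this objective, assuming a hypothetical `model(audio, keywords)` callable that returns one detection logit per (audio, keyword) pair:

```python
import torch
import torch.nn.functional as F


def kws_loss(model, audio_feats, pos_keywords, neg_keywords):
    """Cross-entropy over positive and negative (audio, keyword) pairs.

    `model` is a hypothetical callable returning the logit that the given
    keyword occurs in the audio; gradients flow only into the text encoder
    and the keyword-adaptive modules, since the Whisper encoder is frozen.
    """
    pos_logits = model(audio_feats, pos_keywords)  # audio contains keyword
    neg_logits = model(audio_feats, neg_keywords)  # audio lacks keyword
    logits = torch.cat([pos_logits, neg_logits])
    labels = torch.cat([torch.ones_like(pos_logits),
                        torch.zeros_like(neg_logits)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```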
A crucial aspect of AdaKWS's training is the use of hard negative sampling techniques to improve the model's ability to distinguish acoustically similar words. The paper introduces several methods for generating hard negatives per batch, including:
- Character Substitution: Altering characters in a positive keyword.
- Keyword Concatenation: Combining a positive keyword with a random keyword.
- Nearest Keyword (NK): Selecting the keywords in the current batch with the smallest cosine distance to the positive keyword, measured on text embeddings from the text encoder's last hidden layer.
The combination of these negative sampling strategies is shown to be significantly more effective than simple random negative sampling; each strategy is sketched below.
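The exact substitution rules and batch bookkeeping in the paper may differ, so the function names and details here are assumptions made for illustration:

```python
import random
import string

import torch
import torch.nn.functional as F


def char_substitution(keyword: str) -> str:
    """Swap one random character to get an acoustically close negative."""
    i = random.randrange(len(keyword))
    repl = random.choice([c for c in string.ascii_lowercase if c != keyword[i]])
    return keyword[:i] + repl + keyword[i + 1:]


def keyword_concatenation(keyword: str, batch_keywords: list) -> str:
    """Concatenate the positive keyword with another keyword from the batch."""
    other = random.choice([k for k in batch_keywords if k != keyword])
    return keyword + " " + other


def nearest_keywords(text_embs: torch.Tensor, k: int = 1) -> torch.Tensor:
    """Indices of the batch keywords with the smallest cosine distance
    (i.e. highest cosine similarity) between text-encoder embeddings."""
    sims = F.cosine_similarity(text_embs.unsqueeze(1),
                               text_embs.unsqueeze(0), dim=-1)  # (B, B)
    sims.fill_diagonal_(-float("inf"))  # a keyword is not its own negative
    return sims.topk(k, dim=-1).indices  # (B, k) hard-negative indices
```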
The practical implications of AdaKWS are significant:
- Open-Vocabulary: It can spot keywords not seen during training by using the text encoder to generate adaptation parameters for any given keyword at inference time (see the inference sketch after this list).
- Multilingual Capability: By training on diverse multilingual data like VoxPopuli and leveraging a pre-trained multilingual audio encoder (Whisper), AdaKWS demonstrates strong performance across many languages and generalizes well to unseen, low-resource languages without fine-tuning.
- No Audio Examples Needed: Unlike query-by-example methods, it only requires the text of the keyword.
- Efficiency: AdaKWS models achieve competitive or superior performance compared to much larger ASR baselines (like Whisper-Large-V2) while having significantly fewer parameters and providing substantially faster inference times (e.g., AdaKWS-Small is about 160x faster than Whisper-Large-V2 according to the paper's experiments).
- Training Data Utilization: Training on entire sentences (up to 30 seconds) avoids the need for precise word-level alignments and increases the amount of usable training data.
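Putting the pieces together, inference reduces to encoding the query text and a single forward pass over the audio. The wrapper class and feature loader below are purely hypothetical glue around the earlier sketches, not an API from the paper:

```python
import torch

# Hypothetical wrapper bundling the frozen Whisper encoder, the
# KeywordEncoder, the AdaIN-based adaptive modules, and the classifier.
model = AdaKWSModel.load_pretrained("adakws-small")  # illustrative API
model.eval()

audio_feats = load_log_mel("utterance.wav")  # hypothetical feature loader

with torch.no_grad():
    # Any keyword string works here, including words never seen in
    # training -- no enrollment audio is required, only the text.
    logit = model(audio_feats, keywords=["sustainability"])
    prob = torch.sigmoid(logit)

print(f"P(keyword present) = {prob.item():.3f}")
```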
Experimental results on the VoxPopuli, LibriPhrase, Multilingual LibriSpeech, and Fleurs datasets demonstrate AdaKWS's effectiveness. It sets a new state of the art on the challenging LibriPhrase Hard split and shows impressive zero-shot generalization to novel languages and datasets compared to strong baselines. The ablation studies confirm the importance of the proposed hard negative sampling methods.
In summary, AdaKWS offers a practical and effective approach to open-vocabulary, multilingual KWS by using adaptive instance normalization controlled by a text encoder, enabling dynamic adaptation to new keywords at inference time with high accuracy and efficiency, especially in low-resource settings.