Weakly Labelled Source Separation
- The paper demonstrates that weak labelling enables models to learn time-frequency masks and reconstruct individual source signals without detailed annotations.
- Key methodologies include a joint separation-classification framework with specialized pooling schemes (GMP, GAP, GWRP) to optimize source reconstruction.
- Practical applications span music transcription and environmental sound detection, though challenges remain in handling overlapping sources and limited supervision.
Source separation with weakly labelled data refers to approaches that decompose audio mixtures into individual source signals (e.g., musical instruments, environmental events, speech) using only coarse tag information about which sources are present in a recording, without access to strong supervision such as time-aligned isolated source tracks or fine-grained temporal annotations. This paradigm leverages generalized weak labels, such as clip-level class assignments or coarse onset/offset timing, to bypass the resource-intensive annotation requirements of fully supervised separation protocols. Recent deep learning advances demonstrate that source separation under these constraints is attainable through specialized network architectures and objective functions that use weak supervision to induce event-specific representations and time-frequency masks.
1. Weakly Labelled Data and Source Separation Paradigms
Weakly labelled data consists of audio recordings where only high-level tags (e.g., “babycry” or “gunshot” present) are available, and the exact temporal or spectral locations of the events are unknown (Kong et al., 2017). This sparsity of annotation contrasts with strong supervision where both isolated source waveforms and fine temporal boundaries are available. The objective is to estimate, for each source class present in the mixture, all relevant time-frequency bins or time frames that correspond to that source and reconstruct its waveform.
Approaches to handling such data generally fall into two categories:
- Clip-level weak supervision: Only multi-hot source presence/absence is provided for the entire audio clip.
- Frame-level weak supervision: The time intervals of source activity are annotated, but not their time-frequency (TF) locations.
Network architectures for weakly labelled separation are forced to learn internal latent representations (e.g., TF masks, embeddings, or segmentation maps) that effectively localize and reconstruct individual sources based solely on global or coarse event tags.
2. Key Methodologies: Mask-based and Conditional Models
The dominant architecture for source separation with weak labelling is the joint separation-classification (JSC) model (Kong et al., 2017). It comprises two distinct mappings:
- Separation mapping (φ): A fully convolutional network operating on the time-frequency representation (e.g., log-mel spectrogram) generates one segmentation mask M_k(t,f) ∈ [0,1] for each of the K classes.
- Classification mapping (ψ): Each mask is reduced to a presence probability p_k=ψ(M_k) via pooling operators, such as global max pooling (GMP), global average pooling (GAP), or global weighted rank pooling (GWRP).
The system is trained with a binary cross-entropy over the clip-level labels; mask-level supervision is absent. At inference, masks M_k are directly applied to the mixture’s magnitude spectrum to reconstruct separated waveforms, and their time-averaged projections inform sound event detection.
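The inference path described above can be sketched in a few lines of NumPy. The shapes, random stand-ins for the separator output, and variable names here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: T time frames, F frequency bins, K source classes.
T, F, K = 100, 64, 4
mixture_mag = np.abs(rng.standard_normal((T, F)))  # stand-in for |STFT| of the mixture

# Stand-in for the separator output phi(x): K segmentation masks in [0, 1].
masks = rng.uniform(0.0, 1.0, size=(K, T, F))

# Separation: each class mask gates the mixture magnitude spectrogram;
# a waveform would then be recovered with the mixture phase and an inverse STFT.
separated_mags = masks * mixture_mag[None, :, :]   # shape (K, T, F)

# SED: averaging each mask over frequency yields a per-frame activity curve.
sed_curves = masks.mean(axis=2)                    # shape (K, T)
```

Because the masks lie in [0, 1], each separated magnitude is bounded above by the mixture magnitude in every time-frequency bin.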
Conditional U-Net frameworks, especially for large-vocabulary separation (Kong et al., 2020), utilize a Sound Event Detection (SED) network to locate anchor segments highly likely to contain a single source class. A mixture is synthesized from disjoint anchor segments, and a conditional separation network is trained to extract each anchor when provided with the corresponding SED-derived soft condition vector.
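A minimal sketch of the training-pair construction for the conditional framework follows. The segment length, class indices, and synthetic soft condition vectors are assumptions standing in for real SED network outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 10            # vocabulary size (number of source classes)
seg_len = 16000   # assumed one-second anchor segments at 16 kHz

# Stand-ins for two anchor segments that an SED model flagged as each
# containing (mostly) a single, different source class.
anchor_a = rng.standard_normal(seg_len)
anchor_b = rng.standard_normal(seg_len)

# Stand-ins for the SED network's soft class posteriors on each anchor.
cond_a = np.full(K, 0.02); cond_a[3] = 0.9   # class 3 dominant in anchor_a
cond_b = np.full(K, 0.02); cond_b[7] = 0.9   # class 7 dominant in anchor_b

# Training mixture: the sum of the two disjoint anchors.
mixture = anchor_a + anchor_b

# The conditional separator f(mixture, cond) is trained so that
# f(mixture, cond_a) approximates anchor_a and f(mixture, cond_b) anchor_b.
pairs = [(mixture, cond_a, anchor_a), (mixture, cond_b, anchor_b)]
```

Because the mixing is synthetic, the anchors themselves serve as regression targets even though no isolated source stems were ever annotated.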
Spectrum Energy Preserved Wasserstein learning (Zhang et al., 2017) represents a complementary direction, combining a generator network for separation (a U-Net-style encoder-decoder) with per-class discriminators. Its loss fuses per-source distribution matching (a WGAN objective) with a global energy-preservation penalty: the estimated per-source spectra are matched to unpaired real samples of each class (the weak label), while the penalty enforces that the separated sources sum back to the mixture's spectral energy.
3. Pooling Schemes and Attention Mechanisms
The efficacy of separating sources from weak labels hinges on the choice of pooling schemes in the classification mapping ψ. Empirical analysis shows:
- Global Max Pooling (GMP): Highly selective, but can under-segment event masks, yielding poor SDR (≈0 dB).
- Global Average Pooling (GAP): Decent mask coverage, but prone to over-segmentation (SDR ≈6 dB).
- Global Weighted Rank Pooling (GWRP): Applies descending geometric weights to rank-sorted mask activations; it best balances mask coverage and selectivity (SDR ≈8 dB) (Kong et al., 2017).
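The three pooling operators above can be implemented in a few lines of NumPy. This is a sketch under the usual definitions (the decay rate `r` and the toy mask are illustrative choices); GWRP interpolates between GAP (r = 1) and GMP (r → 0):

```python
import numpy as np

def gmp(mask):
    """Global max pooling: clip probability is the single largest activation."""
    return float(mask.max())

def gap(mask):
    """Global average pooling: clip probability is the mean activation."""
    return float(mask.mean())

def gwrp(mask, r=0.9):
    """Global weighted rank pooling: sort activations in descending order
    and weight them by a decaying geometric series r**i."""
    x = np.sort(mask.ravel())[::-1]   # activations, descending
    w = r ** np.arange(x.size)        # geometric rank weights
    return float((w * x).sum() / w.sum())

# Toy mask: a small active region inside a mostly silent T-F map.
mask = np.zeros((10, 10))
mask[2:4, 3:6] = 0.9
print(gmp(mask), gap(mask), gwrp(mask))
```

On this toy mask, GMP fires on the single peak, GAP is diluted by the silent bins, and GWRP lands in between, which is why it tends to balance coverage against selectivity.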
Retention of full time-frequency resolution in the separator (no downsampling/pooling in φ) is essential for mask quality and separation performance.
4. Training Protocols, Loss Functions, and Evaluation Metrics
Training is conducted exclusively on weakly labelled data. The loss function is typically a multi-label binary cross-entropy over the predicted class presence probabilities, ℓ = −Σ_k [y_k log p_k + (1 − y_k) log(1 − p_k)], where y_k ∈ {0, 1} is the clip-level tag for class k and p_k = ψ(M_k) is the pooled presence probability.
No mask-level or isolated source supervision is imposed; the network learns by optimizing the masking and classification mappings to maximize correspondence with the global presence tags.
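The clip-level objective can be sketched directly; this is a minimal NumPy version of the standard multi-label BCE, with an illustrative four-class example (the probabilities are made up):

```python
import numpy as np

def clip_bce(p, y, eps=1e-7):
    """Multi-label binary cross-entropy between predicted clip-level
    presence probabilities p_k and multi-hot ground-truth tags y_k."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean())

# Example: four classes, with classes 0 and 2 present in the clip.
y = np.array([1.0, 0.0, 1.0, 0.0])
p = np.array([0.9, 0.1, 0.8, 0.2])
loss = clip_bce(p, y)
```

Gradients of this scalar flow back through the pooling operator ψ into every mask activation, which is the only signal the separator φ receives.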
Evaluation is performed with standard source separation metrics (SDR, SIR, SAR from BSS Eval) and sound event detection metrics (Equal Error Rate, F1, mAP). Empirical benchmarks (Kong et al., 2017) show:
| Mapping | Separation SDR (dB) | SED EER |
|---|---|---|
| GMP | 0.03 | 0.30 |
| GAP | 6.06 | 0.14 |
| GWRP | 8.08 | 0.14 |
GWRP is superior for both separation and SED.
5. Practical Applications and Limitations
Weakly labelled separation has found application in diverse domains including environmental sound event detection, music transcription, and rare-event SED (Kong et al., 2017). By obviating the requirement for isolated source stems, these systems facilitate large-scale deployment using readily available audio corpora with coarse tagging.
However, limitations persist. Separation quality remains dependent on the separator architecture and pooling paradigm; highly polyphonic or overlapping events can degrade performance. The absence of time-aligned strong supervision precludes perfect mask estimation; empirical mask F1 scores are still far from ideal binary masks. Techniques such as GWRP and strict avoidance of downsampling are necessary to push mask quality toward optimality.
Pooling variants in ψ represent the main architectural ablation considered; further innovations may include alternate attention mechanisms or conditioning pipelines. Mask-level supervision, if available, would likely yield further gains but is outside the weak label framework.
6. Future Directions and Open Challenges
Current methods demonstrate that weakly labelled source separation—using only global tags—can produce competitive separation and event detection results. Improvements may include:
- Integration of additional structural constraints (e.g., spectral energy consistency, adversarial objectives).
- Development of advanced conditioning methods for hierarchical or universal separation frameworks.
- Extension to more complex acoustic scenes with multiple highly overlapping sources.
- Enhancement of mask learning via semi-supervised mixture strategies or self-supervised approaches.
Achieving separation equivalent to strongly supervised models using only weak labels remains an open challenge, motivating further research into attention mechanisms, probabilistic pooling, and unsupervised mask refinement.
References:
- "A joint separation-classification model for sound event detection of weakly labelled data" (Kong et al., 2017)
- "Weakly Supervised Audio Source Separation via Spectrum Energy Preserved Wasserstein Learning" (Zhang et al., 2017)