Probabilistic Language-Audio Pre-Training
- The paper introduces a novel framework that models language and audio as Gaussian distributions to capture many-to-many relations and inherent uncertainty.
- It employs hierarchical inclusion and mask repulsive losses to enforce semantic containment and prevent embedding collapse in the joint space.
- The method achieves state-of-the-art performance in audio-text retrieval and hierarchical reasoning on modest datasets without relying on large-scale external corpora.
Probabilistic Language-Audio Pre-Training (ProLAP) is a joint representation learning framework that encodes language and audio as Gaussian distributions in a shared embedding space. This probabilistic framework is designed to capture the intrinsic many-to-many relationships between audio recordings and textual descriptions, as well as the natural uncertainty and hierarchical inclusions found in real-world data. ProLAP introduces two key objectives—hierarchical inclusion loss and mask repulsive loss—to enable fine-grained modeling of semantic hierarchy and uncertainty, and achieves state-of-the-art performance on audio-text retrieval and hierarchical reasoning tasks, without requiring large-scale external data (Manabe et al., 21 Oct 2025).
1. Motivation and Problem Setting
Most existing language–audio embedding models (e.g., CLAP) employ deterministic mappings, assigning each input a single point in the joint embedding space. This forces every input into one fixed representation and fails to account for the many-to-many correspondences that occur in practice: an audio clip may have multiple equally valid captions of varying specificity (such as “string instrument”, “guitar”, or “acoustic guitar”), and captions themselves admit many paraphrases. Deterministic point embeddings cannot capture this multiplicity or represent the associated epistemic uncertainty.
To address this, ProLAP models each input as a probability distribution (specifically, a Gaussian) in the joint space. This probabilistic representation allows explicit modeling of the spread (uncertainty due to ambiguous or multiple descriptions) and enables encoding of hierarchical semantic relations (e.g., the inclusion of specific in general concepts). This framework is particularly important in settings where the language–audio relationship is inherently uncertain and hierarchical (Manabe et al., 21 Oct 2025).
2. Probabilistic Embedding Design
2.1 Gaussian Embedding Parameterization
Each audio or text input is mapped to a Gaussian random variable

$$Z \sim \mathcal{N}(\mu, \Sigma),$$

where $\mu \in \mathbb{R}^d$ is the mean embedding and $\Sigma = \mathrm{diag}(\sigma^2)$ is a diagonal covariance matrix.
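As a concrete illustration, the sketch below shows one way to realize this parameterization in PyTorch, assuming a simple linear head on pooled encoder features; `GaussianHead` and its layer layout are illustrative, not the paper's exact design.

```python
# Minimal sketch of a Gaussian embedding head (illustrative, not the paper's exact head).
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    def __init__(self, in_dim: int, embed_dim: int):
        super().__init__()
        self.mean = nn.Linear(in_dim, embed_dim)     # predicts mu
        self.log_var = nn.Linear(in_dim, embed_dim)  # predicts log(sigma^2) (diagonal)

    def forward(self, h: torch.Tensor):
        # Parameterize the log-variance so exp() keeps sigma^2 strictly positive.
        return self.mean(h), self.log_var(h)

head = GaussianHead(in_dim=768, embed_dim=512)
mu, log_var = head(torch.randn(4, 768))  # a batch of 4 pooled encoder features
```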
2.2 Corrected Similarity Measure
Affinity between two Gaussian embeddings $Z_1 = \mathcal{N}(\mu_1, \Sigma_1)$ and $Z_2 = \mathcal{N}(\mu_2, \Sigma_2)$ is measured with a variance-corrected similarity, the negative expected squared distance between samples:

$$s(Z_1, Z_2) = -\,\mathbb{E}_{z_1 \sim Z_1,\; z_2 \sim Z_2}\!\left[\|z_1 - z_2\|_2^2\right] = -\left(\|\mu_1 - \mu_2\|_2^2 + \operatorname{tr}(\Sigma_1) + \operatorname{tr}(\Sigma_2)\right).$$

This measure penalizes high uncertainty through the trace terms and encourages mean alignment in the joint space.
2.3 Distance Interpretation
Instead of KL or Wasserstein divergences, ProLAP employs the closed-form sampled distance (CSD) embedded in the similarity function above. This choice aligns means and controls variances in closed form, which is efficient and well suited to gradient-based optimization.
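Under the diagonal parameterization above, the CSD has a two-line implementation; this is a sketch with illustrative names:

```python
# Closed-form sampled distance between diagonal Gaussians:
# E||z1 - z2||^2 = ||mu1 - mu2||^2 + sum(sigma1^2 + sigma2^2).
import torch

def csd(mu1, log_var1, mu2, log_var2):
    mean_term = (mu1 - mu2).pow(2).sum(dim=-1)
    var_term = (log_var1.exp() + log_var2.exp()).sum(dim=-1)
    return mean_term + var_term  # the similarity is the negative of this distance
```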
3. Loss Functions and Hierarchical Learning
3.1 Probabilistic Pairwise Contrastive Loss (PPCL)
For audio-text pairs, ProLAP uses a probabilistic variant of the sigmoid contrastive loss:

$$\mathcal{L}_{\mathrm{PPCL}} = -\frac{1}{|\mathcal{B}|} \sum_{i,j} \log \sigma\!\left(y_{ij}\left(-a\, d(Z_i, Z_j) + b\right)\right),$$

where $d$ is the CSD, $y_{ij} = 1$ for positive pairs and $y_{ij} = -1$ otherwise, with $a, b$ as learnable parameters.
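A hedged PyTorch sketch of this loss, assuming a square matrix of pairwise CSD values whose diagonal holds the positive pairs:

```python
# Sketch of a probabilistic pairwise (sigmoid) contrastive loss; `a`, `b` are
# the learnable scale and bias from the formula above.
import torch
import torch.nn.functional as F

def ppcl(dist: torch.Tensor, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """dist: [B, B] CSD between audio i and caption j; diagonal entries are positives."""
    labels = 2.0 * torch.eye(dist.size(0), device=dist.device) - 1.0  # +1 pos, -1 neg
    logits = -a * dist + b                       # smaller distance -> larger logit
    return F.softplus(-labels * logits).mean()   # -log(sigmoid(y * logit))

loss = ppcl(torch.rand(8, 8), a=torch.tensor(1.0), b=torch.tensor(0.0))
```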
3.2 Inclusion Loss
To enforce semantic containment (e.g., "acoustic guitar" inside "guitar"), ProLAP introduces an inclusion statistic $m_{\mathrm{incl}}(Z_1, Z_2)$, a closed-form score that is large when the probability mass of $Z_1$ lies within $Z_2$. The loss is computed as a logit-linked negative log-likelihood:

$$\mathcal{L}_{\mathrm{incl}} = -\log \sigma\!\left(a'\, m_{\mathrm{incl}}(Z_1, Z_2) + b'\right),$$

with $\sigma$ the logistic sigmoid and $a', b'$ learnable. Cross-modal inclusion encourages audio distributions to be less uncertain than their captions; intra-modal inclusion applies between raw and masked variants of the same input.
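The paper defines the precise inclusion statistic; the sketch below substitutes a negative KL divergence as a stand-in score (it grows toward zero as $Z_1$'s mass falls inside $Z_2$) purely to illustrate the logit-linked negative log-likelihood:

```python
# Illustrative only: the inclusion statistic here is a stand-in (-KL), not the
# paper's definition; the loss wrapper matches the logit-linked NLL above.
import torch
import torch.nn.functional as F

def kl_diag_gauss(mu1, log_var1, mu2, log_var2):
    # Closed-form KL(N(mu1, diag) || N(mu2, diag)), summed over dimensions.
    var1, var2 = log_var1.exp(), log_var2.exp()
    return 0.5 * (log_var2 - log_var1 + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0).sum(-1)

def inclusion_loss(mu1, log_var1, mu2, log_var2, a, b):
    score = -kl_diag_gauss(mu1, log_var1, mu2, log_var2)  # stand-in inclusion statistic
    return F.softplus(-(a * score + b)).mean()            # -log(sigmoid(a*score + b))
```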
3.3 Hierarchical Inclusion Loss
A chain of nested random masks $M_1 \subset M_2 \subset \cdots \subset M_L$ is constructed to expose multi-level inclusion structure, and the inclusion loss is applied between consecutive levels:

$$\mathcal{L}_{\mathrm{hier}} = \sum_{\ell=0}^{L-1} \mathcal{L}_{\mathrm{incl}}\!\left(Z^{(\ell)}, Z^{(\ell+1)}\right),$$

where $Z^{(0)}$ denotes the unmasked embedding and $Z^{(\ell)}$ the embedding under mask level $\ell$, so that more heavily masked (more generic) inputs contain less masked (more specific) ones. This incentivizes a consistent containment hierarchy in latent space.
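A minimal sketch of building such a nested chain; the per-level mask ratios below are placeholders, not the paper's schedule:

```python
# Nested random masks: each level masks a strict superset of the previous one.
import torch

def nested_masks(num_feats: int, levels: int, base_ratio: float = 0.25):
    order = torch.randperm(num_feats)        # one random order shared by all levels
    masks = []
    for l in range(1, levels + 1):
        k = int(num_feats * base_ratio * l)  # deeper levels mask more features
        m = torch.zeros(num_feats, dtype=torch.bool)
        m[order[:k]] = True
        masks.append(m)
    return masks

masks = nested_masks(num_feats=64, levels=3)
assert all((masks[i] & ~masks[i + 1]).sum() == 0 for i in range(2))  # nesting holds
```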
3.4 Mask Repulsive Loss
To prevent collapse of masked embeddings, a repulsive loss pushes masked versions of inputs apart using the same pairwise sigmoid form as PPCL:

$$\mathcal{L}_{\mathrm{rep}} = -\frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \log \sigma\!\left(y_{ij}\left(-a\, d(\tilde{Z}_i, \tilde{Z}_j) + b\right)\right),$$

where the label $y_{ij}$ takes one value when $\tilde{Z}_i$ and $\tilde{Z}_j$ are masked views of the same sample and another otherwise. Gradients with respect to the variances are stopped so the repulsion cannot be satisfied trivially by inflating uncertainty.
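A hedged sketch of the repulsive term, detaching the variances to mirror the stop-gradient described above:

```python
# Repel two masked embeddings of the same sample; variances are detached so
# the repulsion acts on the means only (a sketch, not the paper's exact form).
import torch
import torch.nn.functional as F

def mask_repulsive(mu_a, log_var_a, mu_b, log_var_b, a, b):
    d = (mu_a - mu_b).pow(2).sum(-1) + (log_var_a.detach().exp()
                                        + log_var_b.detach().exp()).sum(-1)
    return F.softplus(-a * d + b).mean()  # y = -1 branch: loss falls as d grows
```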
3.5 Variational Information Bottleneck Regularization
A small variational information bottleneck (VIB) regularizer,

$$\mathcal{L}_{\mathrm{VIB}} = \mathrm{KL}\!\left(\mathcal{N}(\mu, \Sigma)\,\middle\|\,\mathcal{N}(0, I)\right),$$

prevents variance collapse and supports consistent uncertainty modeling.
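Under the diagonal parameterization, this term has the familiar closed form of a KL divergence to a standard normal prior (a standard VIB-style regularizer; the sketch assumes that prior):

```python
# KL(N(mu, diag(sigma^2)) || N(0, I)) in closed form, averaged over the batch.
import torch

def vib_kl(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    return 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1.0).sum(-1).mean()
```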
3.6 Full Training Objective
The complete objective for a batch is

$$\mathcal{L} = \mathcal{L}_{\mathrm{PPCL}} + \lambda_{\mathrm{incl}}\, \mathcal{L}_{\mathrm{incl}} + \lambda_{\mathrm{intra}}\, \mathcal{L}_{\mathrm{intra}} + \lambda_{\mathrm{VIB}}\, \mathcal{L}_{\mathrm{VIB}},$$

with intra-modal loss

$$\mathcal{L}_{\mathrm{intra}} = \mathcal{L}_{\mathrm{hier}} + \mathcal{L}_{\mathrm{rep}}.$$

ProLAP trains effectively with small regularization weights and a modest number of mask levels.
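Combining the pieces is then a weighted sum; the weights below are hypothetical placeholders, not the paper's values:

```python
# Sketch of the full objective; w_incl, w_intra, w_vib are placeholder weights.
def total_loss(l_ppcl, l_incl, l_hier, l_rep, l_vib,
               w_incl=1e-4, w_intra=1.0, w_vib=1e-4):
    l_intra = l_hier + l_rep  # intra-modal term: hierarchical inclusion + repulsion
    return l_ppcl + w_incl * l_incl + w_intra * l_intra + w_vib * l_vib
```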
4. Training Protocol and Dataset Considerations
ProLAP fine-tunes pretrained CLAP weights for 50 epochs with batch size 256, using the Adam optimizer and cosine learning-rate decay from the peak value with a one-epoch warm-up. For intra-modal learning, masking is applied to 75% of the features in 12.5% of each batch.
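A plausible PyTorch rendering of this schedule; the peak learning rate (elided above) and the exact warm-up shape are assumptions:

```python
# One-epoch linear warm-up followed by cosine decay (schedule details assumed).
import torch

model = torch.nn.Linear(8, 8)           # stand-in for the ProLAP model
steps_per_epoch, epochs = 200, 50       # 50 epochs, batch size 256 per the text
opt = torch.optim.Adam(model.parameters(), lr=1e-5)  # peak LR is a placeholder
sched = torch.optim.lr_scheduler.SequentialLR(
    opt,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.01,
                                          total_iters=steps_per_epoch),
        torch.optim.lr_scheduler.CosineAnnealingLR(
            opt, T_max=steps_per_epoch * (epochs - 1)),
    ],
    milestones=[steps_per_epoch],
)
```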
Datasets:
- AudioCaps: 51,308 clips (1 caption each)
- ClothoV2: 5,930 clips (5 captions each)
No large-scale external corpus is used; hierarchical structure emerges from these relatively small sets.
Feature Encoders:
- Audio: HTS-AT (Swin-Transformer-based), with a learnable [MASK] token head.
- Text: GPT-2, extracting the “[CLS]” token embedding for the mean and a special “[UNC]” token for variance.
This design enables ProLAP to learn robust hierarchical uncertainty in the language–audio domain even at small data scales, a marked contrast with prior probabilistic models in vision requiring orders of magnitude more data.
5. Empirical Results and Evaluation
5.1 Audio–Text Retrieval
Tasks measured include text-to-audio and audio-to-text retrieval using Recall@1/5/10 and mAP@10. Baselines considered are deterministic CLAP with InfoNCE, CLAP+Sigmoid (SigLIP), and a ProLIP-style probabilistic variant. ProLAP consistently outperforms these alternatives, especially under cross-dataset (out-of-domain) evaluation.
5.2 Uncertainty Estimation
Text length vs. predicted uncertainty: ProLAP, with hierarchical inclusion and mask repulsive objectives, produces variance that is strongly and negatively correlated with caption length—longer (more specific) captions yield lower uncertainty. Baseline models, in contrast, show negligible variance trends with specificity.
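This trend is straightforward to check given per-caption predicted variances; the sketch below uses SciPy's rank correlation on toy numbers:

```python
# Correlate caption length with total predicted variance (toy data shown).
from scipy.stats import spearmanr

caption_lengths = [3, 7, 12, 20, 4, 15]             # tokens per caption (toy)
total_variance = [0.9, 0.6, 0.35, 0.2, 0.85, 0.3]   # sum of predicted sigma^2 (toy)
rho, p = spearmanr(caption_lengths, total_variance)
print(f"Spearman rho = {rho:.2f}  (negative => longer caption, lower uncertainty)")
```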
Audio embedding visualization: Under ProLAP, masked and unmasked embeddings remain well-separated in the latent space, and their inclusion ordering reflects semantic containment, unlike the baseline where masking effects collapse.
5.3 Audio Traversal Task
A new “audio traversal” task is introduced: for each AudioCaps clip, four levels of increasingly abstract captions are generated by an LLM. A root ([ROOT]) embedding is defined, and query embeddings are linearly interpolated from the audio embedding to the root in 50 steps, with text retrieval performed at each step (see the sketch after the metric list below).
Metrics:
- Precision: Fraction of retrieved captions matching any hierarchical level.
- Recall@1: Likelihood of recovering the original caption at varying abstraction.
- R@1 at most abstract level: Capturing top-level generalization.
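A hedged sketch of the traversal loop, assuming retrieval over caption mean embeddings by cosine similarity; the paper's exact retrieval rule and [ROOT] construction may differ:

```python
# Interpolate from the audio mean embedding toward the root in 50 steps,
# retrieving the nearest caption at each step (illustrative implementation).
import torch

def traverse(audio_mu, root_mu, caption_mus, steps: int = 50):
    retrieved = []
    for t in torch.linspace(0.0, 1.0, steps):
        q = (1.0 - t) * audio_mu + t * root_mu  # interpolated query embedding
        sims = torch.nn.functional.cosine_similarity(q.unsqueeze(0), caption_mus)
        retrieved.append(sims.argmax().item())  # index of the nearest caption
    return retrieved

hits = traverse(torch.randn(512), torch.randn(512), torch.randn(100, 512))
# The random root above stands in for the paper's [ROOT] embedding.
```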
ProLAP (hierarchical inclusion + repulsion) significantly improves precision (~27.3% vs. 12.8% for CLAP/SigLIP) and recall performance.
5.4 Ablation and Analysis
- Hierarchical inclusion alone: Yields substantial gains in traversal precision (13.5% → 23.3%) and inclusion accuracy (level-1 includes level-4: 63% → 83%).
- Mask repulsive alone: Modest or negative effect.
- Both losses: Achieve the highest hierarchical and retrieval performance (precision ~27.3%, inclusion-test ~89.5%).
6. Analysis, Limitations, and Open Questions
Probabilistic embeddings with hierarchical inclusion loss robustly capture the many-to-many mapping and semantic containment between audio and text. The mask repulsive objective prevents collapse under masking, making multi-scale inclusion feasible. ProLAP demonstrates notable data efficiency, achieving meaningful hierarchical uncertainty structure with tens of thousands of training pairs—unlike probabilistic vision models trained on billions of samples.
Open questions and limitations:
- Only diagonal-covariance Gaussian distributions are explored; richer mixture models may further improve uncertainty representation.
- The cross-modal inclusion loss weight must remain very small to avoid harming retrieval, indicating sensitivity and the need for balanced regularization.
- Scalability to massive audio–text corpora and generalization to tri-modal setups (audio, text, video) remain open research directions.
7. Implications, Extensions, and Future Work
ProLAP systematically extends the deterministic CLAP encoder architecture by encoding inputs as distributions, allowing explicit modeling of many-to-many semantic relations and fine-grained hierarchical containment. Two training objectives, the hierarchical inclusion loss and the mask repulsive loss, are central to learning both semantic uncertainty and hierarchy.
Empirical outcomes indicate consistent improvements in retrieval and hierarchical tasks over deterministic baselines, and demonstrate that the learned uncertainties are meaningful in practical evaluation (e.g., audio traversal, inclusion tests). Future work may explore non-Gaussian or multimodal embedding families, alternative divergence measures (e.g., Wasserstein), and scaling to richer benchmarks such as WavCaps and Auto-ACD, as well as the extension to unified audio–video–language pre-training (Manabe et al., 21 Oct 2025).