T-CLAP: Temporal Contrastive Audio-Text Model
- T-CLAP is a temporal modeling framework that integrates explicit ordering supervision into contrastive language-audio pretraining architectures.
- It uses synthetic negatives via mixed-up audio sampling and LLM-synthesized captions to enforce accurate event sequencing in retrieval and classification tasks.
- The approach combines standard CLAP loss with a temporal-focused contrastive loss, yielding significant performance improvements in metrics like R@1 and zero-shot accuracy.
Temporal Modeling (T-CLAP) refers to the integration of explicit temporal ordering supervision into contrastive language-audio pretraining architectures, primarily aiming to resolve the inability of standard CLAP models to capture the sequential relationship of events within audio-text representations. T-CLAP incorporates synthetic temporally contrasting audio–text pairs and a dedicated temporal-focused contrastive loss, resulting in significantly improved performance for tasks sensitive to event order, including retrieval, zero-shot classification, and text-to-audio generation (Yuan et al., 2024).
1. Architectural Foundation
T-CLAP preserves the two-tower architecture characteristic of Contrastive Language-Audio Pretraining (CLAP) models. The architecture consists of:
- Audio Branch: Utilizes HTSAT as the encoder, followed by a small multilayer perceptron (MLP) head that projects the representation into a shared $d$-dimensional embedding space.
- Text Branch: Uses RoBERTa as the text encoder, likewise followed by an MLP head projecting its output into the same $d$-dimensional space.
No additional temporal-specific transformer layers are introduced. Temporal comprehension is instead achieved by fine-tuning the encoders and projection heads so that distinctions in event order are reflected in the cosine similarity structure between audio and text embeddings. HTSAT absorbs local time–frequency information, while RoBERTa is exposed to captions with varied event orderings. This minimalist intervention supports temporal reasoning without a wholesale redesign of the backbone encoder structures.
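The two-tower structure described above can be sketched as follows. This is a minimal illustration, not T-CLAP's implementation: the feature dimensions, head width, and random stand-ins for HTSAT/RoBERTa pooled features are all assumptions for the sake of a runnable example.

```python
import numpy as np

# Assumed dimensions for illustration only; the real values come from
# HTSAT / RoBERTa and T-CLAP's projection heads.
AUDIO_FEAT_DIM, TEXT_FEAT_DIM, EMBED_DIM = 768, 1024, 512

rng = np.random.default_rng(0)

def mlp_head(x, w1, b1, w2, b2):
    """Small MLP projection head: linear -> ReLU -> linear, then L2-normalize."""
    h = np.maximum(x @ w1 + b1, 0.0)
    z = h @ w2 + b2
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Random stand-ins for pooled HTSAT (audio) and RoBERTa (text) features.
audio_feat = rng.normal(size=(4, AUDIO_FEAT_DIM))
text_feat = rng.normal(size=(4, TEXT_FEAT_DIM))

# Independent projection heads for each tower.
wa1, wa2 = rng.normal(size=(AUDIO_FEAT_DIM, 512)) * 0.02, rng.normal(size=(512, EMBED_DIM)) * 0.02
wt1, wt2 = rng.normal(size=(TEXT_FEAT_DIM, 512)) * 0.02, rng.normal(size=(512, EMBED_DIM)) * 0.02

a = mlp_head(audio_feat, wa1, np.zeros(512), wa2, np.zeros(EMBED_DIM))
t = mlp_head(text_feat, wt1, np.zeros(512), wt2, np.zeros(EMBED_DIM))

# Cosine-similarity matrix between every audio/text pair in the batch;
# temporal supervision shapes this matrix rather than adding new layers.
sim = a @ t.T
print(sim.shape)  # (4, 4)
```

Because both towers end in L2-normalized embeddings, `sim` contains cosine similarities in [-1, 1], which is the structure the contrastive losses operate on.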
2. Temporal-Contrastive Caption Generation
T-CLAP’s learning of temporal relationships is driven by specially constructed audio–caption pairs that distinguish “A then B” from “B then A”. Two primary strategies for producing temporally contrasting negatives are used:
- Mixed-up Audio Sampling (ESC-mixed-up):
- Pairs of 5 s ESC-50 audio clips, each representing a single event, are concatenated into a 10 s clip.
- The corresponding “positive” caption adopts a format such as “A followed by B”, and the negative caption swaps order (“B followed by A”).
- LLM-Synthesized Negatives:
- For human-written captions from AudioCaps or Clotho, an LLM (e.g., ChatGPT) is prompted to identify two sound events and rewrite the caption with the order inverted. This produces linguistically varied order-flipped negatives, ranging from changes in connectives (“first”, “then”, commas) to more elaborate grammatical structures.
Combining these pipelines ensures the model encounters both syntactically controlled and semantically diverse order inversions, mitigating overfitting to specific phrasing patterns.
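The mixed-up pipeline above can be sketched directly: concatenate two single-event clips, then build the order-correct caption and its order-swapped negative. The sample rate, caption template wording, and event labels here are illustrative placeholders, not the paper's exact values.

```python
import numpy as np

SR = 16000  # assumed sample rate for illustration

def make_temporal_pair(clip_a, clip_b, label_a, label_b):
    """Concatenate two 5 s clips into a 10 s mixed-up clip and build
    order-correct / order-swapped captions for it."""
    audio = np.concatenate([clip_a, clip_b])       # "A then B" audio
    positive = f"{label_a} followed by {label_b}"  # correct event order
    negative = f"{label_b} followed by {label_a}"  # swapped order (hard negative)
    return audio, positive, negative

# Silent placeholders standing in for 5 s ESC-50 clips.
clip_a = np.zeros(5 * SR)
clip_b = np.zeros(5 * SR)
audio, pos, neg = make_temporal_pair(clip_a, clip_b, "a dog barking", "rain falling")
print(pos)  # a dog barking followed by rain falling
print(neg)  # rain falling followed by a dog barking
```

The LLM-synthesized pipeline plays the same role for natural captions, but produces the swapped-order negative by prompting a language model rather than by template substitution.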
3. Temporal-Focused Contrastive Loss
The training objective is augmented to reward correct temporal alignment and penalize reversed event order:
- Standard CLAP contrastive loss ($\mathcal{L}_{\text{CLAP}}$) enforces global matching between audio and caption pairs across the batch:

$$
\mathcal{L}_{A \to T} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\!\big(\mathrm{sim}(a_i, t_i)/\tau\big)}{\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(a_i, t_j)/\tau\big)},
$$

with a symmetric term for text-to-audio.

- Temporal-focused contrastive loss ($\mathcal{L}_{\text{temp}}$) introduces a per-pair ordering margin:

$$
\mathcal{L}_{\text{temp}} = \frac{1}{N} \sum_{i=1}^{N} \max\!\big(0,\; m - \mathrm{sim}(a_i, t_i) + \mathrm{sim}(a_i, t_i^{-})\big),
$$

where $t_i^{-}$ represents the embedding of the negative (order-swapped) caption for the same audio. The overall fine-tuning objective is

$$
\mathcal{L} = \mathcal{L}_{\text{CLAP}} + \lambda \, \mathcal{L}_{\text{temp}}.
$$

$\mathcal{L}_{\text{temp}}$ drives embeddings for correct-order captions closer to the audio than those for reversed-order captions, providing explicit supervision over event sequences.
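A numerical sketch of the combined objective, assuming an InfoNCE form for the CLAP term and a hinge margin for the ordering term; the margin value, temperature, and loss weighting below are placeholders, not the paper's hyperparameters:

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clap_loss(sim, tau=0.07):
    """Symmetric InfoNCE over an (N, N) audio-text cosine-similarity matrix."""
    logits = sim / tau
    a2t = -np.mean(np.diag(log_softmax(logits, axis=1)))  # audio -> text
    t2a = -np.mean(np.diag(log_softmax(logits, axis=0)))  # text -> audio
    return 0.5 * (a2t + t2a)

def temporal_loss(sim_pos, sim_neg, margin=0.25):
    """Per-pair ordering margin: the correct-order caption must score higher
    than the order-swapped caption of the same audio by at least `margin`."""
    return np.mean(np.maximum(0.0, margin - sim_pos + sim_neg))

# Toy values: 3 audio-text pairs plus their order-swapped negatives.
rng = np.random.default_rng(0)
sim = rng.uniform(-1, 1, size=(3, 3))  # batch similarity matrix
sim_pos = np.array([0.9, 0.8, 0.7])    # sim(audio_i, correct caption_i)
sim_neg = np.array([0.6, 0.7, 0.65])   # sim(audio_i, swapped caption_i)

lam = 1.0  # assumed loss weighting; the paper's value may differ
total = clap_loss(sim) + lam * temporal_loss(sim_pos, sim_neg)
print(total)
```

Note that the temporal term vanishes once every correct-order caption beats its swapped counterpart by the margin, so it only pushes on pairs whose ordering is still ambiguous in the embedding space.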
4. Training Pipeline and Dataset Construction
T-CLAP’s fine-tuning is performed on approximately 2.04 million audio–text pairs aggregated from several datasets:
- Auto-ACD: 1.9 M LLM-generated captions (non-flipped).
- AudioCaps: 50 k human captions + 50 k LLM-flipped negatives.
- Clotho: 5 k human captions + 5 k LLM-flipped negatives.
- ESC-mixed-up: 50 k mixed-up positives + 50 k order-flipped negatives.
Each batch (size 512) is composed in a 4 : 1 ratio of Auto-ACD (used only in the standard CLAP loss) to the three temporally supervised datasets (used in both the CLAP loss and the temporal loss). Training runs for 30 k steps with a batch size of 512, a linear warm-up over the first 10 k steps, and a temperature shared between the two loss terms. No specialized curriculum strategy is used beyond batch composition.
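The 4 : 1 batch composition can be sketched with a simple sampler. The dataset ID pools and the boolean "apply temporal loss" flag are illustrative; only the batch size and ratio come from the description above.

```python
import random

BATCH = 512
N_AUTOACD = BATCH * 4 // 5       # 4 parts Auto-ACD: CLAP loss only (409 items)
N_TEMPORAL = BATCH - N_AUTOACD   # 1 part temporally supervised (103 items)

def sample_batch(autoacd_ids, temporal_ids, rng):
    """Draw one 512-item batch in a 4:1 Auto-ACD : temporal-supervised ratio.
    Each item is (id, apply_temporal_loss)."""
    batch = [(i, False) for i in rng.sample(autoacd_ids, N_AUTOACD)]
    batch += [(i, True) for i in rng.sample(temporal_ids, N_TEMPORAL)]
    rng.shuffle(batch)  # interleave the two sources within the batch
    return batch

rng = random.Random(0)
batch = sample_batch(list(range(10000)), list(range(10000, 11000)), rng)
print(len(batch))                      # 512
print(sum(flag for _, flag in batch))  # 103 temporal-supervised items
```

Only the items flagged `True` would contribute to the temporal loss term; all 512 items contribute to the standard CLAP loss.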
5. Performance and Empirical Analysis
Significant empirical gains are reported across multiple downstream tasks:
| Task | Metric | CLAP (baseline) | T-CLAP |
|---|---|---|---|
| AudioCaps Text→Audio | R@1 | 34.2% | 39.7% |
| AudioCaps Audio→Text | R@1 | 43.1% | 49.8% |
| ESC-50 Zero-Shot | Accuracy | 91.0% | 96.5% |
| Urbansound8K Zero-Shot | Accuracy | 75.8% | 78.4% |
| VGGSound Zero-Shot | Accuracy | 23.8% | 42.8% |
| T-Classify Text→Audio | Score | 56.2% | 87.2% |
| T-Classify Audio→Text | Score | 53.2% | 72.0% |
In text-to-audio generation with AudioLDM (evaluated via KL divergence, FAD, and MOS), T-CLAP serves as a superior guidance model for capturing temporal alignment, as evidenced by an improved FAD (1.8 vs 2.5) and a higher temporal MOS.
Qualitative analysis demonstrates that T-CLAP retrieves captions correctly reflecting event order, while baseline CLAP often confuses sequential relationships.
6. Ablation and Analysis
When isolating T-CLAP components:
- Mixed-up only: Provides consistent gains for datasets with controlled wording but lacks generalization to diverse language patterns.
- LLM-only negatives: Enhances diversity but risks overfitting to semantic cues (“first”, “then”).
- Combined pipeline: Achieves optimal performance by balancing coverage and linguistic variability.
This suggests that a blend of both controlled (syntactic consistency) and open-ended (semantic diversity) temporal negatives is necessary for robust temporal alignment modeling.
7. Limitations and Prospective Directions
T-CLAP is principally evaluated on environmental sound event tasks; coverage does not extend to temporally rich domains like speech or music. All negative audio samples are synthetically generated through mixing and concatenation, which may not fully represent natural temporal overlaps. Future directions include:
- Extension to multi-event segmentation and alignment tasks.
- Application to fine-grained phoneme or musical phrase ordering.
- Integration of more sophisticated curriculum learning based on event complexity.
- Utilization of real multi-event audio recordings to enhance robustness.
A plausible implication is that temporal contrastive learning methods like T-CLAP could be adapted to other domains where sequence modeling is critical, provided suitable negative generation schemes and loss functions are developed. The approach demonstrates that introducing temporally-focused supervision can substantially enhance the temporal discrimination capacity of joint language–audio representations (Yuan et al., 2024).