FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pretraining

Published 1 Apr 2026 in cs.SD | (2604.01155v1)

Abstract: Contrastively pretrained audio-LLMs (e.g., CLAP) excel at clip-level understanding but struggle with frame-level tasks. Existing extensions fail to exploit the varying granularity of real-world audio-text data, where massive clip-level textual descriptions coexist with limited frame-level annotations. This paper proposes Fine-grained Language-Audio Pretraining (FineLAP), a novel training paradigm that advances both clip- and frame-level alignment in CLAP with heterogeneous data. FineLAP introduces a dual-stream sigmoid loss with a cluster-based sampling strategy to jointly learn from clip- and frame-level supervision. To capture both global semantics and local details, FineLAP uses a decoupled audio projector on top of a self-supervised encoder. To alleviate the scarcity of temporally annotated data, we present FineLAP-100k, a large-scale synthetic SED dataset constructed through a scalable curation pipeline. Extensive experiments demonstrate that FineLAP achieves SOTA performance across multiple audio understanding tasks, including retrieval, classification, sound event detection, and text-to-audio grounding. Ablation studies further show that coarse- and fine-grained alignment are mutually beneficial, providing insights for building better audio-LLMs (ALMs).

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper introduces a dual-stream training paradigm that decouples global and fine-grained audio representations via heterogeneous supervision.
It employs a specialized audio adapter with cluster-based negative sampling, boosting performance in retrieval, classification, and temporal grounding.
The method constructs a scalable synthetic dataset (FineLAP-100k) to mitigate sparse frame-level annotations, enhancing real-world applicability.

FineLAP: Fine-grained Language-Audio Pretraining with Heterogeneous Supervision

Introduction

Fine-grained alignment between audio signals and natural language is essential for a broad spectrum of multimodal tasks, including open-vocabulary sound event detection (SED), text-to-audio retrieval, and temporal audio grounding. While contrastively pretrained audio-LLMs (CLAP) have demonstrated strong results in clip-level tasks through global alignment, their efficacy diminishes in frame-level tasks that require local, temporally resolved audio-text correspondence. The core challenge lies in leveraging ubiquitous coarse-grained clip-level textual descriptions alongside scarce frame-level annotations. The FineLAP framework systematically addresses this by introducing a joint training paradigm on heterogeneous data with tailored loss formulations, an architecture to decouple global and fine-grained learning, and a scalable synthetic SED dataset.

Model Architecture and Training Paradigm

FineLAP extends the CLAP bi-encoder paradigm by incorporating frame-level aligned supervision in conjunction with abundant clip-level annotated data streams. The key architectural advancement is the decoupled audio adapter, which differentiates pathways for extracting global (clip-level) and dense (frame-level) audio representations. The global branch operates via a linear two-layer projection for holistic semantics, while the frame-level pathway leverages pooled and temporally processed embeddings refined through additional Transformer layers for precise alignment. The text branch is anchored by RoBERTa and is similarly mapped into the shared embedding space, supporting both caption and short-phrase representations.

Figure 1: Schematic of FineLAP’s architecture, highlighting the decoupled adapter for multi-granular alignment and the dual-stream supervision paradigm.

The training objective comprises a dual-stream sigmoid loss: a clip-level sigmoid loss aligns audio clips and captions irrespective of batch normalization, while a frame-level sigmoid loss with cluster-based negative sampling aligns temporal audio frames with event phrases. Phrase sampling is strongly regularized using clusters defined over the semantic space, enhancing negative diversity and generalization.

Synthetic Dataset Construction: FineLAP-100k

To offset the deficit of frame-level annotated data, FineLAP introduces FineLAP-100k: a large-scale synthetic SED corpus generated by mixing clean event-centric clips (extracted via a window-based dynamic energy thresholding over FSD50K) and high-quality ambiance backgrounds. Foreground events are randomly repeated, temporally positioned with independently controlled SNRs, and combined using a formalized mixing procedure. Captions are systematically produced using LLM-driven template prompting over event descriptions, offering coverage and variability.

Figure 2: The window-based audio clipping process detects high-energy regions, removing silence and ensuring single-event purity for SED dataset synthesis.

Experimental Results

FineLAP was benchmarked on multiple audio-language understanding datasets spanning retrieval, classification, SED, and temporal grounding. The architecture achieves state-of-the-art (SOTA) performance at both global and local alignment levels.

Figure 3: FineLAP achieves SOTA across both clip- and frame-level tasks, based on standardized benchmarks.

Retrieval and Classification

On the AudioCaps evaluation set, FineLAP outperforms all baselines, with Recall@1 of 45.7 (text-to-audio, T2A) and 62.5 (audio-to-text, A2T). Performance on Clotho is also on par with or better than leading models, demonstrating robust global alignment. In audio classification, FineLAP attains leading accuracy on UrbanSound8K (84.9%) and competitive scores on ESC-50 and VGGSound.

Sound Event Detection and Temporal Grounding

On SED benchmarks, FineLAP achieves 0.344 on DESED, 0.474 on AudioSet-Strong, 0.446 on UrbanSED, and 0.649 on AudioGrounding (TAG), all outperforming both closed- and open-vocabulary SED baselines. Notably, these improvements corroborate the advantage of heterogeneous multi-granularity supervision.

Figure 4: Visual analysis of sound event detection on DESED evidences FineLAP’s precise temporal boundary detection for overlapping events.

Figure 5: FineLAP’s performance on AudioSet-Strong demonstrates effective detection and temporal localization across diverse events.

Ablation Analysis

A series of ablations quantify the contribution of each design aspect. Removing frame-level or clip-level supervision sharply degrades respective and overall performance, demonstrating the necessity and synergy of heterogeneous objectives. The decoupled projection design (rather than a single shared projector) is also validated, yielding consistent improvements in both retrieval and detection. The inclusion of FineLAP-100k synthetic data significantly enhances frame-level metrics, particularly in highly compositional scenarios.

Theoretical and Practical Implications

FineLAP decisively demonstrates that multi-granularity, heterogeneous supervision yields mutually reinforcing benefits for audio-text models, contrasting with pipeline or staged approaches that treat global and local alignment as disjoint. The dual-stream sigmoid loss demonstrates better stability and downstream transfer compared to InfoNCE objectives, particularly in low-signal frame-level regimes. Furthermore, the adoption of cluster-based negative sampling enhances open-vocabulary discrimination, essential for compositional and long-tail sound event coverage.

Practically, FineLAP's flexible, open-vocabulary design is superior for event detection, semantic retrieval, and temporal grounding across general and specialized domains. The scalable synthetic pipeline for SED data generation may serve as a template for future multimodal training regimens where annotation scarcity is prevalent.

Future Directions

While FineLAP advances the state-of-the-art, several avenues remain open. The framework currently processes fixed-length (short to medium duration) clips; extending it to unconstrained long-form or streaming audio presents both architectural and optimization challenges. Expansion of temporally grounded tasks – beyond SED, towards temporal question answering or retrieval with dense event annotations – would further stress-test FineLAP’s alignment fidelity. Finally, dynamic and continual curriculum learning over synthetic and real frame-level datasets may yield further improvements in real-world distribution shifts.

Conclusion

FineLAP introduces a systematic, technically rigorous framework for multi-granularity language-audio pretraining under heterogeneous supervision. Through a dual-stream loss formulation, a decoupled audio-text architecture, and the construction of FineLAP-100k, it sets a new baseline across a range of clip-level and frame-level audio-language benchmarks. The architecture and training paradigm concretely demonstrate the value of unifying coarse and fine alignment, and the methods for phrase-level negative sampling and synthetic frame-level data creation will likely inform multimodal learning regimes for years to come.

Markdown Report Issue