
FlexSED: Open-Vocabulary SED System

Updated 25 September 2025
  • FlexSED is an open-vocabulary sound event detection system that combines pretrained audio SSL models with a CLAP text encoder for scalable, free-text query handling.
  • It employs an encoder–decoder architecture with an adaptive layer normalization mechanism (AdaLN) to dynamically fuse semantic text cues with fine-grained audio features.
  • Evaluations on AudioSet-Strong show superior PSDS1 performance, robust zero-shot and few-shot generalization, and promising applications in audio annotation and assistive technologies.

FlexSED is an open-vocabulary sound event detection (SED) system designed to address limitations in conventional multi-class SED frameworks, notably their inability to process free-text sound queries and poor adaptability to zero-shot and few-shot scenarios. By combining a pretrained audio self-supervised learning (SSL) model with a CLAP text encoder in an encoder–decoder framework, FlexSED enables precise temporal localization and scalable detection over large and diverse sound vocabularies. Its architecture incorporates an adaptive fusion strategy for efficient modality integration and leverages LLMs to mitigate challenges associated with missing labels during training. Empirical evaluations on AudioSet-Strong demonstrate FlexSED’s superior performance compared to vanilla SED models, along with robust generalization to unseen classes.

1. System Architecture

FlexSED is built on a unified encoder–decoder composition that processes both audio and text modalities. Its architecture contains the following primary components:

  • Pretrained Audio SSL Encoder: The audio frontend utilizes a frame-based spectrogram transformer pretrained via SSL, producing fine-grained acoustic representations at a latent rate of 25 Hz. The pretrained transformer is split in two: its earlier blocks serve as an encoder that extracts prompt-independent audio features, while the later blocks are repurposed as the prompt-conditioned decoder described below.
  • CLAP Text Encoder: Text prompts (e.g., “A sound of {class}”) are transformed into semantically rich embeddings via the CLAP encoder. These embeddings capture inter-class semantic relationships, which are essential for flexible query handling.
  • Encoder–Decoder Composition: The encoder performs audio-only feature extraction, while the decoder integrates text prompt embeddings with cached audio features. This separation facilitates prompt-aware evaluation and rapid inference across a multitude of candidate queries.
  • Adaptive Fusion Mechanism: The decoder employs an adaptive fusion strategy (discussed in Section 2) that conditions audio representations on input text queries, thereby enabling open-vocabulary detection.

This configuration allows FlexSED to efficiently process large candidate prompt sets, providing a foundation for flexible sound event localization across a broad semantic space.
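
The practical benefit of this split is that the expensive audio pass runs once per clip, while each additional query costs only a decoder pass over cached features. The following minimal PyTorch sketch illustrates the idea; the module names, dimensions, layer counts, and the simple additive conditioning (standing in for the AdaLN fusion of Section 2) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class FlexSEDSketch(nn.Module):
    """Illustrative sketch of FlexSED's encoder-decoder split
    (not the released implementation; all sizes are assumptions)."""

    def __init__(self, dim=768, text_dim=512, n_enc=8, n_dec=4):
        super().__init__()
        # Stand-in for the pretrained frame-level SSL transformer
        # (prompt-independent; runs once per audio clip).
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_enc)
        # Projects CLAP text embeddings into the audio feature space.
        self.text_proj = nn.Linear(text_dim, dim)
        # Prompt-conditioned decoder. FlexSED fuses text via AdaLN
        # (Section 2); plain additive conditioning is used here for brevity.
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=n_dec)
        self.head = nn.Linear(dim, 1)  # per-frame presence logit

    def encode_audio(self, audio_feats):
        # audio_feats: (batch, frames, dim) at a 25 Hz latent rate.
        return self.encoder(audio_feats)

    def detect(self, cached_audio, text_emb):
        # text_emb: (batch, text_dim), e.g. CLAP("A sound of {class}").
        cond = cached_audio + self.text_proj(text_emb).unsqueeze(1)
        return self.head(self.decoder(cond)).squeeze(-1)  # (batch, frames)

# Encode once, then score many free-text queries against cached features.
model = FlexSEDSketch()
cached = model.encode_audio(torch.randn(1, 250, 768))  # 10 s of features
for prompt_emb in torch.randn(3, 1, 512):              # 3 candidate queries
    frame_logits = model.detect(cached, prompt_emb)     # shape (1, 250)
```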

2. Adaptive Fusion Strategy

FlexSED’s decoder implements an adaptive fusion mechanism—inspired by fusion techniques from diffusion transformers—to inject text prompt information into the audio processing stream without diluting the pretrained audio encoder’s capacity.

  • AdaLN-Based Fusion: FlexSED adopts adaptive layer normalization (AdaLN) in select transformer blocks, specifically an AdaLN-One variant in which the residual scaling gate is initialized to one, so that the pretrained acoustic pathway is preserved at the start of training.
  • Mathematical Formulation: For audio features $x$ and text embedding $p$, the AdaLN block applies:

$$y = x + g(p) \cdot \mathrm{Layer}\big((1 + \gamma(p)) \cdot \mathrm{LN}(x) + \beta(p)\big)$$

where $\mathrm{LN}(\cdot)$ denotes layer normalization, $\mathrm{Layer}(\cdot)$ refers to an attention or feedforward sublayer, and $\gamma(p)$, $\beta(p)$, $g(p)$ are modulation parameters derived from $p$. Initially, the block outputs the standard residual computation ($y = x + \mathrm{Layer}(\mathrm{LN}(x))$), transitioning gradually to fused behavior as the learned modulation grows.

  • Contextual Role: This mechanism allows FlexSED to maintain the pretrained audio representation while incrementally learning to leverage semantic cues from text prompts. By dynamically modulating audio features, FlexSED distinguishes among an open-ended set of candidate sound events, thereby operationalizing open-vocabulary SED.
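
The block below is a minimal PyTorch sketch of one AdaLN-One layer implementing the formula above. Zero-initializing the modulation projection (so that $\gamma = \beta = 0$ and $g = 1$ at the start) is an assumption chosen to reproduce the stated initial behavior; the feedforward sublayer and sizes are likewise illustrative.

```python
import torch
import torch.nn as nn

class AdaLNOneBlock(nn.Module):
    """y = x + g(p) * Layer((1 + gamma(p)) * LN(x) + beta(p)).

    With the modulation projection zero-initialized, gamma = beta = 0 and
    g = 1, so the block initially computes the plain pre-norm residual
    y = x + Layer(LN(x)) and preserves the pretrained acoustic pathway.
    """

    def __init__(self, dim, text_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # "Layer" stands for the block's attention or feedforward sublayer;
        # a feedforward is used here for brevity.
        self.layer = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                   nn.Linear(4 * dim, dim))
        # Produces (gamma, beta, g_offset) from the text embedding p.
        self.modulation = nn.Linear(text_dim, 3 * dim)
        nn.init.zeros_(self.modulation.weight)
        nn.init.zeros_(self.modulation.bias)

    def forward(self, x, p):
        # x: (batch, frames, dim) audio features; p: (batch, text_dim).
        gamma, beta, g_off = self.modulation(p).unsqueeze(1).chunk(3, dim=-1)
        g = 1.0 + g_off  # gate starts at one (the "AdaLN-One" variant)
        return x + g * self.layer((1.0 + gamma) * self.norm(x) + beta)

# At initialization the block is exactly a plain residual layer:
blk = AdaLNOneBlock(dim=768, text_dim=512)
x, p = torch.randn(2, 250, 768), torch.randn(2, 512)
assert torch.allclose(blk(x, p), x + blk.layer(blk.norm(x)), atol=1e-6)
```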

3. Training Methodology

FlexSED’s training regime ensures stable integration of audio and text modalities while addressing annotation sparsity in large-scale datasets.

  • Continuous Training from Pretrained Weights: The audio encoder is initialized with pretrained SSL weights (e.g., Dasheng-based) and kept mostly static early in training. AdaLN fusion gates ($g(p)$) are initialized to unity, preserving the pretrained feature pathway and enabling gradual prompt integration.
  • Prompt-Aware Decoder Fine-Tuning: The decoder responsible for fusion is fine-tuned with a higher learning rate ($1 \times 10^{-4}$), facilitating rapid adaptation to prompt-conditioned localization tasks while minimizing encoder disruption (see the optimizer sketch after this list).
  • LLM-Assisted Negative Query Selection: Training on AudioSet-Strong is complicated by missing labels; some events may be present but unlabeled. FlexSED employs LLMs (e.g., GPT‑4) to select “safe negatives” by analyzing semantic relations (parent–child, synonyms, frequent co-occurrence), minimizing conflicting supervision from erroneous negative labels (sketched at the end of this section).
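
A minimal sketch of this two-speed optimization in PyTorch follows; the module shapes and the encoder's learning rate are assumptions, since the text specifies only the decoder's $1 \times 10^{-4}$ rate.

```python
import torch
import torch.nn as nn

# Dummy stand-ins for the pretrained SSL encoder and prompt-aware decoder.
def block():
    return nn.TransformerEncoderLayer(768, nhead=8, batch_first=True)

encoder = nn.TransformerEncoder(block(), num_layers=8)
decoder = nn.TransformerEncoder(block(), num_layers=4)

# Two-speed optimization: the decoder adapts quickly at the stated 1e-4
# rate, while the pretrained encoder moves slowly to avoid disrupting its
# SSL features. The encoder rate here is an illustrative assumption.
optimizer = torch.optim.AdamW([
    {"params": decoder.parameters(), "lr": 1e-4},
    {"params": encoder.parameters(), "lr": 1e-5},
])
```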

This training approach achieves robust query alignment and leverages external semantic knowledge to overcome incomplete supervision.
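
The safe-negative idea can be sketched without committing to a particular LLM API. In the toy sketch below, the hard-coded relation sets stand in for judgments that FlexSED obtains from an LLM such as GPT‑4; the class names and relations are made-up example data.

```python
# Toy illustration of LLM-assisted safe-negative selection. In FlexSED the
# relation judgments come from an LLM (e.g., GPT-4); here they are
# hard-coded to keep the sketch self-contained.
RELATED = {
    # parent-child, synonym, or frequent co-occurrence relations per class
    "dog": {"animal", "bark", "puppy"},
    "speech": {"human voice", "conversation", "talking"},
}

def safe_negatives(labeled_events, candidate_classes):
    """Keep only candidates unrelated to every labeled event, so a sound
    that is present but unlabeled is unlikely to be supervised as absent."""
    unsafe = set(labeled_events)
    for event in labeled_events:
        unsafe |= RELATED.get(event, set())
    return [c for c in candidate_classes if c not in unsafe]

# A clip labeled only with "dog" should not use "bark" as a negative
# query, since barking is likely present even though it was never
# annotated; unrelated classes remain usable as negatives.
print(safe_negatives(["dog"], ["bark", "speech", "rain", "animal"]))
# -> ['speech', 'rain']
```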

4. Performance Metrics

FlexSED’s efficacy is established through empirical results using the AudioSet‑Strong protocol, employing the PSDS1 metric in two variants:

  • PSDS1₍ₐ₎ assesses localization and classification across all candidate events (target + non-target).
  • PSDS1₍ₜ₎ restricts evaluation to known present classes, emphasizing temporal accuracy.

Comparative experiments show that FlexSED exceeds the performance of ATST‑Frame‑SED and Dasheng‑SED on both metrics. Notably, ablation studies confirm that omitting LLM-guided negative query filtering reduces performance, highlighting the necessity of advanced negative selection.

FlexSED demonstrates strong zero-shot and few-shot generalization:

  • Zero-Shot: When 20 classes are excluded during training, FlexSED retains approximately 65% of original PSDS1₍ₐ₎ performance for these classes.
  • Few-Shot: Additional fine-tuning on 5–20 labeled samples per new class can nearly restore full performance, evidencing the system’s rapid adaptability and the practical value of its open-vocabulary architecture.

5. Applications and Future Research

FlexSED’s open-source code and pretrained models facilitate multiple downstream applications and suggest avenues for further investigation.

  • Applications:
    • Text-to-Audio Evaluation: Automatically verifies the presence of queried events in generated audio.
    • AudioQA and Data Annotation: Enables annotation of weakly labeled datasets and supports audio-driven question answering.
    • Assistive and Surveillance Technologies: Provides robust detection and localization for smart home monitoring, environmental sensing, and hearing-assistive devices.
  • Research Directions:
    • Multimodal Fusion: Extending FlexSED to fuse with visual cues for robust detection in complex scenes.
    • Knowledge Distillation: Further optimizing cross-modal fusion and efficiency using distilled knowledge across modalities.
    • Dataset Scaling: Applying methodology to broader, more diverse corpora for enhanced real-world generalization.
    • Augmentation Strategies: Adapting mixup or mean-teacher augmentation frameworks to prompt-aware architectures for improved training efficiency and generalization.

A plausible implication is that continued refinement of cross-modal adaptation and scaling techniques could further increase FlexSED’s effectiveness in real-world open-vocabulary tasks.

6. Technical Significance

FlexSED represents an advancement in sound event detection by unifying pretrained SSL audio representations and CLAP-based semantic text embeddings through an encoder–decoder framework, adaptive fusion (AdaLN-One), and LLM-based supervision enhancement. This integration enables flexible query processing, fine-grained temporal localization, and scalability to new sound classes without retraining from scratch. Its competitive PSDS1 performance, generalization capacity, and demonstrated utility on established audio benchmarks position FlexSED as a technically robust approach for next-generation SED systems.
