SAGANet: Segmentation-Aware Audio Generation

Updated 3 July 2026

SAGANet is a multimodal model that explicitly conditions audio generation on segmentation masks to achieve object-specific Foley synthesis.
It integrates dual visual streams and CLIP-based text encoding, leveraging both global and localized context for improved temporal alignment.
Experimental results show significant improvements in Frechet Distance, IB-score, and synchronization error, with LoRA fine-tuning further enhancing performance.

SAGANet is a multimodal generative model for video object segmentation-aware audio generation, enabling controllable, high-fidelity Foley synthesis by explicitly conditioning audio synthesis on object-level visual segmentation masks. By leveraging both global and localized visual context alongside optional textual prompts, SAGANet provides fine-grained, visually localized user control that outperforms prior state-of-the-art models in precision, synchronization, and semantic alignment within the domain of audiovisual generation (Viertola et al., 30 Sep 2025).

1. Task Definition and Motivation

Video object segmentation-aware audio generation is defined by its explicit input–output structure:

Input: A sequence of video frames $V \in \mathbb{R}^{T_v \times H \times W \times 3}$, corresponding binary segmentation masks $M \in \{0,1\}^{T_v \times H \times W \times 1}$ that specify the object of interest in each frame, and optionally a textual prompt $t$.
Output: An audio waveform $A \in \mathbb{R}^{T_a}$ (e.g., a 5 s clip at 44.1 kHz) containing sound predominantly attributable to the masked object.
Formal Objective: $A = G_M(V, M, t)$, where $G_M$ is trained to focus its auditory generation on the spatiotemporal region defined by $M$ while preserving contextual scene information from $V$.

Prior video-conditioned audio generation models operate at the scene level and cannot produce sound for a single specified object, which leads to uncontrolled sound-source mixing and poor temporal alignment. These models also often require prohibitive training resources, restricting reproducibility. SAGANet addresses these limitations with explicit mask-based object control: a lightweight segmentation-aware module and dual visual streams provide precise object-level control, efficient training, and strong generalization to multi-source scenes even when the model is trained exclusively on single-instrument data (Viertola et al., 30 Sep 2025).

2. Model Architecture

SAGANet's architecture integrates segmentation-based conditioning through the following modules:

Visual Encoder (Segmentation-Aware Control): Dual spatiotemporal streams process both full-frame video with global masks and tightly cropped (minimum $48 \times 48$ px) focal regions around mask bounding boxes. A shared ViT backbone (Synchformer) utilizes gated cross-attention adapters to fuse global and focal features. Mask embeddings are constructed via a learnable linear 3D convolutional patch embedding, initialized to zero, and summed with video patch embeddings and positional encodings.
Textual Encoder: A frozen CLIP-based encoder extracts text features $F_t$ from the prompt $t$.
Audio Encoder and Decoder: Audio is represented in either short-time Fourier or latent diffusion embedding space. The audio decoder, a DiT model operating within a Conditional Flow Matching (CFM) framework, predicts the velocity direction for each reverse diffusion step, conditioning on the segmentation-aware fused visual features, the global CLIP video features, the textual features $F_t$, and the noise input.
Optional LoRA Fine-Tuning: Low-rank adapters augment the attention projections associated specifically with segmentation features in the DiT, providing parameter-efficient adaptation during fine-tuning.
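
The focal crop used by the second visual stream can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing: it assumes the crop is the mask's bounding box, grown to at least 48 × 48 px and shifted to stay inside the frame. The function name `focal_crop_box` and the centring and clamping policy are hypothetical.

```python
from typing import List, Optional, Tuple

MIN_SIDE = 48  # minimum crop side in pixels, as stated for the focal stream


def focal_crop_box(mask: List[List[int]], min_side: int = MIN_SIDE
                   ) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right), end-exclusive, around the mask.

    Returns None when the mask contains no foreground pixel.
    """
    height, width = len(mask), len(mask[0])
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(width) if any(row[x] for row in mask)]
    if not rows:
        return None

    def expand(lo: int, hi: int, limit: int) -> Tuple[int, int]:
        # Grow the [lo, hi) interval to at least min_side (capped at the
        # frame size), centred on the object, then shift it into [0, limit).
        size = min(max(hi - lo, min_side), limit)
        start = lo - (size - (hi - lo)) // 2
        start = max(0, min(start, limit - size))
        return start, start + size

    top, bottom = expand(rows[0], rows[-1] + 1, height)
    left, right = expand(cols[0], cols[-1] + 1, width)
    return top, left, bottom, right


# A 100 x 120 frame with a 10 x 20 object near the top-left corner.
frame_mask = [[1 if 2 <= y < 12 and 5 <= x < 25 else 0 for x in range(120)]
              for y in range(100)]
box = focal_crop_box(frame_mask)
```

Small objects are padded up to the minimum side so the focal stream always receives a usable patch grid, while larger objects keep their full bounding box.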

The cross-attention fusion mechanism gradually integrates segmentation-relevant context into the focal visual stream after each transformer block. Its two gating scalars are initialized to zero, so the adapters initially leave the pretrained features unchanged and their contribution grows during training, which gives a staged training effect.
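
The effect of a zero-initialized gate can be illustrated with a toy residual fusion. This is a sketch under assumptions: the text only states that the gating scalars start at zero, so the plain scalar multiplication used here (rather than, say, a tanh-squashed gate) and the name `gated_fuse` are illustrative choices, and `attended_context` stands in for a real cross-attention output.

```python
from typing import List


def gated_fuse(focal: List[float], context: List[float], gate: float) -> List[float]:
    """Residual fusion: focal + gate * context, element-wise."""
    if len(focal) != len(context):
        raise ValueError("feature vectors must have the same length")
    return [f + gate * c for f, c in zip(focal, context)]


focal_tokens = [0.5, -1.0, 2.0]
attended_context = [0.3, 0.3, -0.6]  # stand-in for a cross-attention output

# At initialization the gate is zero, so the focal features pass through unchanged.
at_init = gated_fuse(focal_tokens, attended_context, gate=0.0)

# Once training has moved the gate away from zero, context is mixed in.
later = gated_fuse(focal_tokens, attended_context, gate=0.5)
```

Because the fused output equals the input at initialization, the pretrained backbone's behaviour is preserved at the start of training and the new conditioning path is introduced gradually.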

3. Objective Functions and Training Protocol

SAGANet is trained using a combination of distributional, control-specific, and perceptual objectives, leveraging the following:

Conditional Flow Matching (CFM) Loss: The main generative objective, which regresses the audio velocity field predicted by the decoder.

Mask-Consistency Loss: A control-specific term that encourages the audio output to be sensitive to mask manipulations.

Perceptual Audio Loss: An optional term that uses learned audio embeddings, such as PANN, to minimize perceptual differences between generated and reference audio.

Total Objective: A weighted combination of the terms above, with hyperparameter weights balancing the individual losses.

Optimization: AdamW, with a training budget of 4 × A100 GPUs (40 GB) for approximately 40 epochs.
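
For reference, a commonly used form of the conditional flow matching objective is given below. It is a sketch under assumptions, not necessarily the exact equation used for SAGANet: it assumes a linear interpolation path between Gaussian noise $x_0$ and the target audio representation $x_1$, writes the flow time as $\tau$ to avoid clashing with the text prompt $t$, and collects all conditioning inputs (segmentation-aware visual, global video, and textual features) into a single symbol $C$.

```latex
\mathcal{L}_{\mathrm{CFM}}
  = \mathbb{E}_{\tau,\, x_0,\, (x_1, C)}
    \left\| v_\theta(x_\tau, \tau, C) - (x_1 - x_0) \right\|_2^2,
\qquad
x_\tau = (1 - \tau)\, x_0 + \tau\, x_1,
\quad \tau \sim \mathcal{U}[0, 1],
\quad x_0 \sim \mathcal{N}(0, I).
```

Under this parametrization, sampling amounts to integrating $\mathrm{d}x/\mathrm{d}\tau = v_\theta(x, \tau, C)$ from noise at $\tau = 0$ to $\tau = 1$.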

4. Segmented Music Solos Dataset

The Segmented Music Solos benchmark underpins segmentation-aware evaluation by furnishing high-quality, temporally aligned solos with dense mask annotation:

Data Split	Clips	Duration	Classes	Notes
Train	5,395	5 s	25	Solo Instruments
Validation	665	5 s	25	Solo Instruments
Test	745	5 s	25	URMP multi-source

Dataset construction follows a multistage pipeline: (1) instrument solo extraction from MUSIC21/AVSBench/Solos, (2) automatic visual verification via classifier and MPNet matching, (3) auditory verification using 5 s audio windows and AST+MPNet matching, (4) clip extraction with object persistence, and (5) mask annotation using GroundedSAM2, Florence-2, and manual refinement for test splits. All clips are 5 s, with video at 25 FPS (125 frames/clip), and audio sampled at 44.1 kHz (220,500 samples/clip) (Viertola et al., 30 Sep 2025).
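
The clip dimensions quoted above follow directly from the duration and the two sampling rates. The short sketch below recomputes them; the `ClipSpec` class is a hypothetical helper introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSpec:
    duration_s: float = 5.0      # clip length stated in the text
    video_fps: int = 25          # video frame rate stated in the text
    audio_rate_hz: int = 44_100  # audio sample rate stated in the text

    @property
    def num_frames(self) -> int:
        """Number of video frames (and masks) per clip."""
        return round(self.duration_s * self.video_fps)

    @property
    def num_samples(self) -> int:
        """Number of audio samples per clip."""
        return round(self.duration_s * self.audio_rate_hz)

    def samples_per_frame(self) -> float:
        """Audio samples spanned by one video frame."""
        return self.audio_rate_hz / self.video_fps


spec = ClipSpec()
print(spec.num_frames, spec.num_samples, spec.samples_per_frame())
```

With these defaults each video frame spans 1,764 audio samples (40 ms), which sets the time scale on which frame-level synchronization can be judged.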

5. Experimental Evaluation and Results

Performance is measured across distributional, semantic, and temporal metrics:

Distribution Matching: Frechet Distance (FD) on VGGish, PANNs, and PaSST embeddings; KL divergence on codebooks.
Audio Quality: Inception Score (IS via PANNs).
Semantic Alignment: ImageBind cosine similarity (IB-score).
Temporal Synchronization: DeSync, computed as Synchformer offset error in seconds.
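
For reference, the Frechet Distance between two sets of embeddings is conventionally computed by fitting a Gaussian to each set. The formula below is that standard definition, with $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ denoting the mean and covariance of the reference and generated embeddings; these symbols are chosen here for illustration, and the text does not state whether the reported values apply any additional scaling.

```latex
\mathrm{FD} = \left\| \mu_r - \mu_g \right\|_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g
      - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

The same formula is applied in each embedding space (VGGish, PANNs, PaSST), so the resulting values are not comparable across embedding models.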

Main Results

Metric	MMAudio Base	SAGANet	SAGANet+LoRA
FD_PaSST (↓)	480.2	390.7	372.9
FD_PANNs (↓)	22.5	17.8	16.4
FD_VGG (↓)	12.8	11.1	10.5
KL_PANNs (↓)	1.12	0.81	0.75
KL_PaSST (↓)	0.93	0.67	0.62
IS (↑)	2.30	2.55	2.62
IB-score (↑)	36.0	42.5	44.1
DeSync (s) (↓)	0.96	0.42	0.35

Relative to the MMAudio base model, SAGANet reduces Frechet Distance by roughly 13–21% depending on the embedding space, increases IB-score by about 18%, and more than halves the temporal synchronization error (DeSync, from 0.96 s to 0.42 s). LoRA fine-tuning further improves all metrics, notably reducing DeSync to 0.35 s.
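
The relative changes can be recomputed directly from the Main Results table. The sketch below copies the table values and prints the signed percentage change against the MMAudio base model; the helper name `relative_change` is introduced here for illustration. For metrics marked (↓) a negative change is an improvement, and for metrics marked (↑) a positive change is.

```python
from typing import Dict, Tuple

# (MMAudio base, SAGANet, SAGANet+LoRA), copied from the Main Results table.
RESULTS: Dict[str, Tuple[float, float, float]] = {
    "FD_PaSST": (480.2, 390.7, 372.9),
    "FD_PANNs": (22.5, 17.8, 16.4),
    "FD_VGG": (12.8, 11.1, 10.5),
    "KL_PANNs": (1.12, 0.81, 0.75),
    "KL_PaSST": (0.93, 0.67, 0.62),
    "IS": (2.30, 2.55, 2.62),
    "IB-score": (36.0, 42.5, 44.1),
    "DeSync": (0.96, 0.42, 0.35),
}


def relative_change(base: float, new: float) -> float:
    """Signed percentage change from base to new."""
    return (new - base) / base * 100.0


for metric, (base, saganet, lora) in RESULTS.items():
    print(f"{metric:10s} SAGANet {relative_change(base, saganet):+6.1f}%   "
          f"+LoRA {relative_change(base, lora):+6.1f}%")
```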

In ablations, combining global and local visual context with the mask channel yields the best quality and timing: focal crops alone improve synchronization at the expense of overall audio quality, while full-frame inputs alone underperform in synchronization.

Qualitatively, spectrograms demonstrate SAGANet’s note onset alignment within ±50 ms of visual events (vs. ±300 ms for base), and professional audio evaluators rate its object-level audio focus substantially higher (4.3/5 vs. 2.1/5 for base). In multi-instrument test settings, SAGANet maintains segmentation fidelity whereas the base model merges multiple audio sources.

6. Limitations and Future Research Directions

SAGANet is currently trained on single-source videos, and fully unsupervised mask discovery for multi-source scenes remains unsolved. Expanding to more complex environments with object interactions (e.g., collisions, overlapping motions) may necessitate hierarchical or graph-based mask encoding architectures. Additional research directions include real-time inference, user-interactive mask manipulation for live Foley workflows, and domain adaptation to environmental or speech sounds via segmentation of non-rigid or amorphous objects (Viertola et al., 30 Sep 2025).

7. Significance in Controllable Audio Generation

SAGANet introduces video object segmentation-aware audio generation, setting a new empirical benchmark for object-level Foley synthesis. By conditioning audio on explicit segmentation masks, it empowers users with precise, visually routed generative control in professional workflows and establishes the Segmented Music Solos dataset as a foundation for future research. These contributions and the demonstrated performance advances move the field toward artist-friendly, object-centric multimodal generation (Viertola et al., 30 Sep 2025).


References

Viertola et al. Video Object Segmentation-Aware Audio Generation (30 Sep 2025)
