
SAGANet: Segmentation-Aware Audio Generation

Updated 3 July 2026
  • SAGANet is a multimodal model that explicitly conditions audio generation on segmentation masks to achieve object-specific Foley synthesis.
  • It integrates dual visual streams and CLIP-based text encoding, leveraging both global and localized context for improved temporal alignment.
  • Experimental results show significant improvements in Frechet Distance, IB-score, and synchronization error, with LoRA fine-tuning further enhancing performance.

SAGANet is a multimodal generative model for video object segmentation-aware audio generation, enabling controllable, high-fidelity Foley synthesis by explicitly conditioning audio synthesis on object-level visual segmentation masks. By leveraging both global and localized visual context alongside optional textual prompts, SAGANet provides fine-grained, visually localized user control that outperforms prior state-of-the-art models in precision, synchronization, and semantic alignment within the domain of audiovisual generation (Viertola et al., 30 Sep 2025).

1. Task Definition and Motivation

Video object segmentation-aware audio generation is defined by its explicit input–output structure:

  • Input: A sequence of video frames $V \in \mathbb{R}^{T_v \times H \times W \times 3}$, corresponding binary segmentation masks $M \in \{0,1\}^{T_v \times H \times W \times 1}$ that specify the object of interest in each frame, and optionally a textual prompt $t$.
  • Output: An audio waveform $A \in \mathbb{R}^{T_a}$ (e.g., a 5 s clip at 44.1 kHz) containing sound predominantly attributable to the masked object.
  • Formal Objective: $A = G_M(V, M, t)$, where $G_M$ is trained to focus its audio generation on the spatiotemporal region defined by $M$ while preserving contextual scene information from $V$.
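
The input and output shapes above can be made concrete with a small sketch. This is an illustration only: the frame size (96×96), the rectangular mask, and the function names `make_dummy_inputs` and `generate_audio_stub` are hypothetical, and the stub returns silence rather than calling any real model. The 25 FPS frame rate is the one stated for the dataset clips.

```python
import numpy as np

FPS = 25               # video frame rate stated for the dataset clips
SAMPLE_RATE = 44_100   # audio sample rate (44.1 kHz)
CLIP_SECONDS = 5       # clip length used throughout the article


def make_dummy_inputs(height=96, width=96, seed=0):
    """Build random stand-ins for V and M with the shapes from the task definition.

    height and width are hypothetical; the article does not fix a resolution.
    """
    rng = np.random.default_rng(seed)
    t_v = FPS * CLIP_SECONDS
    video = rng.random((t_v, height, width, 3), dtype=np.float32)   # V
    masks = np.zeros((t_v, height, width, 1), dtype=np.uint8)       # M
    masks[:, 20:60, 30:70, :] = 1  # a static rectangular "object", for illustration
    return video, masks


def generate_audio_stub(video, masks, prompt=None):
    """Placeholder with the signature of G_M(V, M, t); returns silence of length T_a."""
    if video.shape[:3] != masks.shape[:3]:
        raise ValueError("video and masks must share (T_v, H, W)")
    t_a = SAMPLE_RATE * CLIP_SECONDS
    return np.zeros(t_a, dtype=np.float32)


if __name__ == "__main__":
    V, M = make_dummy_inputs()
    A = generate_audio_stub(V, M, prompt="violin")
    print(V.shape, M.shape, A.shape)
```

The point of the stub is the contract: one mask per frame with a single channel, and a mono waveform whose length depends only on clip duration and sample rate.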

Prior video-conditioned audio generation models operate at the scene level and lack the ability to provide sound solely for a specified object, leading to uncontrolled sound-source mixing and poor temporal alignment. These models also often require prohibitive training resources, restricting reproducibility. SAGANet addresses these limitations by introducing mask-based explicit object control, utilizing a lightweight segmentation-aware module and dual visual streams to achieve precise object-level control, efficient training, and strong generalization to multi-source scenes when trained exclusively on single-instrument data (Viertola et al., 30 Sep 2025).

2. Model Architecture

SAGANet's architecture integrates segmentation-based conditioning through the following modules:

  • Visual Encoder (Segmentation-Aware Control): Dual spatiotemporal streams process both full-frame video with global masks and tightly cropped (minimum $48 \times 48$ px) focal regions around mask bounding boxes. A shared ViT backbone (Synchformer) uses gated cross-attention adapters to fuse global and focal features. Mask embeddings are constructed via a learnable linear 3D convolutional patch embedding, initialized to zero, and summed with video patch embeddings and positional encodings.
  • Textual Encoder: A frozen CLIP-based encoder extracts text features $F_t$ from the prompt $t$.
  • Audio Encoder and Decoder: Audio is represented in either short-time Fourier or latent diffusion embedding space. The audio decoder, a DiT model operating within a Conditional Flow Matching (CFM) framework, predicts the velocity direction for each reverse diffusion step, conditioning on segmentation-aware fused visual features, global CLIP video features, textual features $F_t$, and noise.
  • Optional LoRA Fine-Tuning: Low-rank adapters augment attention projections associated specifically with segmentation features in DiT, providing efficient parameter adaptation during fine-tuning.
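
As a rough illustration of the focal stream's input, the sketch below derives a crop box from a single-frame mask and enforces the 48 px minimum side mentioned above. The centering and clamping strategy and the name `focal_crop_box` are assumptions made for this example; the article only states that crops are taken tightly around mask bounding boxes with a minimum size.

```python
import numpy as np

MIN_CROP = 48  # minimum focal crop side in pixels, as stated in the text


def focal_crop_box(mask, min_size=MIN_CROP):
    """Return (top, bottom, left, right) of a crop around the mask's bounding box.

    mask is a 2D binary array (H, W). A box smaller than min_size is grown
    around its center and shifted, if needed, to stay inside the frame.
    This padding policy is an assumption, not the published procedure.
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty")
    top, bottom = ys.min(), ys.max() + 1
    left, right = xs.min(), xs.max() + 1

    def expand(lo, hi, limit):
        # Keep the tight box if it is large enough, otherwise grow to min_size
        # (or to the full frame when the frame itself is smaller).
        size = max(hi - lo, min(min_size, limit))
        center = (lo + hi) // 2
        lo = center - size // 2
        lo = max(0, min(lo, limit - size))  # clamp inside [0, limit]
        return lo, lo + size

    top, bottom = expand(top, bottom, h)
    left, right = expand(left, right, w)
    return int(top), int(bottom), int(left), int(right)


if __name__ == "__main__":
    frame_mask = np.zeros((100, 100), dtype=np.uint8)
    frame_mask[50:54, 2:6] = 1          # a tiny object near the left edge
    print(focal_crop_box(frame_mask))
```

In a video setting the same box computation would be applied per frame or over the union of masks across frames; which of the two the model uses is not specified here.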

The cross-attention fusion mechanism allows gradual integration of segmentation-relevant context into the focal visual stream after each transformer block, with two gating scalars initialized to zero for a staged training effect.
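
The effect of zero-initialized gates can be seen in a toy version of the fusion step. The sketch below is a simplification under stated assumptions: single-head attention without learned projections, a plain scalar gate applied to a residual branch, and hypothetical names (`cross_attention`, `gated_fusion`). The actual adapter design may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head scaled dot-product attention without learned projections."""
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores) @ context


def gated_fusion(focal_tokens, global_tokens, gate):
    """Residual update: focal + gate * CrossAttn(focal, global).

    With gate == 0 the focal stream passes through unchanged, so training
    starts from the behavior of the unmodified backbone.
    """
    return focal_tokens + gate * cross_attention(focal_tokens, global_tokens)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    focal = rng.standard_normal((4, 8))    # 4 focal tokens, 8-dim features
    global_ = rng.standard_normal((6, 8))  # 6 global tokens
    print(np.array_equal(gated_fusion(focal, global_, 0.0), focal))
```

As the gate moves away from zero during training, global context is blended into the focal tokens in proportion to its value, which is what produces the staged integration described above.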

3. Objective Functions and Training Protocol

SAGANet is trained using a combination of distributional, control-specific, and perceptual objectives:

  • Conditional Flow Matching (CFM) Loss: The primary objective, which regresses the audio velocity field predicted by the decoder.

  • Mask-Consistency Loss: A control-specific term that encourages the audio output to be sensitive to mask manipulations.

  • Perceptual Audio Loss: An optional term that uses learned audio embeddings, such as PANNs, to minimize perceptual differences between generated and reference audio.

  • Total Objective: A weighted combination of the CFM, mask-consistency, and perceptual terms, balanced by hyperparameter weights.

  • Optimization: AdamW, with a training budget of 4×A100 GPUs (40 GB) for approximately 40 epochs.
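
To make the CFM term more tangible, here is a minimal sketch of a flow-matching regression loss using the common linear (rectified-flow style) path between noise and data. This is the generic textbook form, not a confirmed transcription of SAGANet's loss: the path choice, the absence of conditioning inputs, and the names `cfm_pair` and `cfm_loss` are all assumptions. In the real model the velocity predictor would also receive the visual and textual features.

```python
import numpy as np


def cfm_pair(x0, x1, t):
    """Point on the linear path from noise x0 to data x1, plus its velocity.

    x_t = (1 - t) * x0 + t * x1, and d x_t / d t = x1 - x0.
    """
    t = t.reshape(-1, *([1] * (x1.ndim - 1)))  # broadcast t over feature dims
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return x_t, target_velocity


def cfm_loss(predict_velocity, x1, rng):
    """Monte Carlo estimate of the flow-matching loss for one batch.

    predict_velocity(x_t, t) stands in for the decoder network.
    """
    x0 = rng.standard_normal(x1.shape)   # noise sample per data point
    t = rng.random(x1.shape[0])          # one time in [0, 1) per data point
    x_t, target = cfm_pair(x0, x1, t)
    pred = predict_velocity(x_t, t)
    return float(np.mean((pred - target) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal((8, 16))  # stand-in for audio latents
    untrained = lambda x_t, t: np.zeros_like(x_t)
    print(cfm_loss(untrained, data, rng))
```

A predictor that recovers `x1 - x0` exactly drives this loss to zero, which is the sense in which the decoder "regresses a velocity field".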

4. Segmented Music Solos Dataset

The Segmented Music Solos benchmark underpins segmentation-aware evaluation by furnishing high-quality, temporally aligned solos with dense mask annotation:

Data Split | Clips | Duration | Classes | Notes
Train      | 5,395 | 5 s      | 25      | Solo instruments
Validation | 665   | 5 s      | 25      | Solo instruments
Test       | 745   | 5 s      | 25      | URMP multi-source

Dataset construction follows a multistage pipeline: (1) instrument solo extraction from MUSIC21/AVSBench/Solos, (2) automatic visual verification via classifier and MPNet matching, (3) auditory verification using 5 s audio windows and AST+MPNet matching, (4) clip extraction with object persistence, and (5) mask annotation using GroundedSAM2, Florence-2, and manual refinement for test splits. All clips are 5 s, with video at 25 FPS (125 frames/clip), and audio sampled at 44.1 kHz (220,500 samples/clip) (Viertola et al., 30 Sep 2025).
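
The per-clip figures and the overall dataset size follow directly from the numbers quoted above. The short sketch below recomputes them; the dictionary layout and function names are hypothetical, and the total duration is a derived quantity rather than a figure reported in the article.

```python
FPS = 25
SAMPLE_RATE = 44_100
CLIP_SECONDS = 5

# Clip counts per split, copied from the dataset table.
SPLIT_CLIPS = {"train": 5395, "validation": 665, "test": 745}


def per_clip_sizes(seconds=CLIP_SECONDS, fps=FPS, sample_rate=SAMPLE_RATE):
    """Return (video frames, audio samples) for one clip."""
    return seconds * fps, seconds * sample_rate


def dataset_totals(split_clips=SPLIT_CLIPS, seconds=CLIP_SECONDS):
    """Return (total clips, total duration in hours) across all splits."""
    total_clips = sum(split_clips.values())
    total_hours = total_clips * seconds / 3600.0
    return total_clips, total_hours


if __name__ == "__main__":
    frames, samples = per_clip_sizes()
    clips, hours = dataset_totals()
    print(f"{frames} frames and {samples} samples per clip")
    print(f"{clips} clips, about {hours:.2f} hours of audio-visual data")
```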

5. Experimental Evaluation and Results

Performance is measured across distributional, semantic, and temporal metrics:

  • Distribution Matching: Frechet Distance (FD) on VGGish, PANNs, and PaSST embeddings; KL divergence on codebooks.
  • Audio Quality: Inception Score (IS via PANNs).
  • Semantic Alignment: ImageBind cosine similarity (IB-score).
  • Temporal Synchronization: DeSync, computed as Synchformer offset error in seconds.
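
For readers unfamiliar with the distributional metric, the sketch below computes a Frechet Distance between two sets of embeddings by fitting a Gaussian to each. It is a generic NumPy implementation under stated assumptions: the embeddings here are random stand-ins rather than VGGish, PANNs, or PaSST features, and evaluation toolkits may differ in numerical details such as how the matrix square root is computed.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two embedding sets of shape (N, D).

    FD = ||mu_a - mu_b||^2 + Tr(C_a) + Tr(C_b) - 2 * Tr((C_a C_b)^{1/2})
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((C_a C_b)^{1/2}) equals the sum of square roots of the eigenvalues of
    # C_a @ C_b, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.standard_normal((500, 4))        # stand-in for real-audio embeddings
    generated = reference + np.array([3.0, 4.0, 0.0, 0.0])  # same spread, shifted mean
    print(frechet_distance(reference, generated))
```

When two sets share the same covariance, the distance reduces to the squared distance between their means, which makes the shifted example easy to check by hand.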

Main Results

Metric         | MMAudio Base | SAGANet | SAGANet+LoRA
FD_PaSST (↓)   | 480.2        | 390.7   | 372.9
FD_PANNs (↓)   | 22.5         | 17.8    | 16.4
FD_VGG (↓)     | 12.8         | 11.1    | 10.5
KL_PANNs (↓)   | 1.12         | 0.81    | 0.75
KL_PaSST (↓)   | 0.93         | 0.67    | 0.62
IS (↑)         | 2.30         | 2.55    | 2.62
IB-score (↑)   | 36.0         | 42.5    | 44.1
DeSync (s) (↓) | 0.96         | 0.42    | 0.35

Relative to the MMAudio base model, SAGANet reduces Frechet Distance by roughly 13–21% depending on the embedding (18–27% with LoRA), raises IB-score by about 18%, and more than halves temporal synchronization error (DeSync). LoRA fine-tuning further improves all metrics, notably reducing DeSync to 0.35 s.
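
The relative changes can be recomputed directly from the Main Results table. The snippet below does only that arithmetic; the values are copied from the table, and the helper names `relative_gain` and `summarize` are made up for this example.

```python
# (MMAudio base, SAGANet, SAGANet+LoRA, lower_is_better), copied from the table.
RESULTS = {
    "FD_PaSST": (480.2, 390.7, 372.9, True),
    "FD_PANNs": (22.5, 17.8, 16.4, True),
    "FD_VGG": (12.8, 11.1, 10.5, True),
    "KL_PANNs": (1.12, 0.81, 0.75, True),
    "KL_PaSST": (0.93, 0.67, 0.62, True),
    "IS": (2.30, 2.55, 2.62, False),
    "IB-score": (36.0, 42.5, 44.1, False),
    "DeSync": (0.96, 0.42, 0.35, True),
}


def relative_gain(base, new, lower_is_better):
    """Percentage improvement over the base model; positive means better."""
    change = (base - new) if lower_is_better else (new - base)
    return 100.0 * change / base


def summarize(results=RESULTS):
    """Map each metric to (gain of SAGANet, gain of SAGANet+LoRA) in percent."""
    return {
        name: (relative_gain(base, saga, low), relative_gain(base, lora, low))
        for name, (base, saga, lora, low) in results.items()
    }


if __name__ == "__main__":
    for name, (saga_gain, lora_gain) in summarize().items():
        print(f"{name:9s} SAGANet {saga_gain:5.1f}%   +LoRA {lora_gain:5.1f}%")
```

Expressing every metric as a signed improvement makes the lower-is-better and higher-is-better rows directly comparable.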

In ablations, combining global and local visual context with the mask channel yields the best quality and timing: focal crops alone improve synchronization at the expense of overall audio quality, while full-frame inputs alone underperform in synchronization.

Qualitatively, spectrograms show SAGANet aligning note onsets within ±50 ms of visual events (vs. ±300 ms for the base model), and professional audio evaluators rate its object-level audio focus substantially higher (4.3/5 vs. 2.1/5 for the base model). In multi-instrument test settings, SAGANet maintains segmentation fidelity, whereas the base model merges multiple audio sources.

6. Limitations and Future Research Directions

SAGANet is currently trained on single-source videos, and fully unsupervised mask discovery for multi-source scenes remains unsolved. Expanding to more complex environments with object interactions (e.g., collisions, overlapping motions) may necessitate hierarchical or graph-based mask encoding architectures. Additional research directions include real-time inference, user-interactive mask manipulation for live Foley workflows, and domain adaptation to environmental or speech sounds via segmentation of non-rigid or amorphous objects (Viertola et al., 30 Sep 2025).

7. Significance in Controllable Audio Generation

SAGANet introduces video object segmentation-aware audio generation, setting a new empirical benchmark for object-level Foley synthesis. By conditioning audio on explicit segmentation masks, it empowers users with precise, visually routed generative control in professional workflows and establishes the Segmented Music Solos dataset as a foundation for future research. These contributions and the demonstrated performance advances move the field toward artist-friendly, object-centric multimodal generation (Viertola et al., 30 Sep 2025).
