SAGANet: Segmentation-Aware Audio Generation
- SAGANet is a multimodal model that explicitly conditions audio generation on segmentation masks to achieve object-specific Foley synthesis.
- It integrates dual visual streams and CLIP-based text encoding, leveraging both global and localized context for improved temporal alignment.
- Experimental results show significant improvements in Frechet Distance, IB-score, and synchronization error, with LoRA fine-tuning further enhancing performance.
SAGANet is a multimodal generative model for video object segmentation-aware audio generation, enabling controllable, high-fidelity Foley synthesis by explicitly conditioning audio synthesis on object-level visual segmentation masks. By leveraging both global and localized visual context alongside optional textual prompts, SAGANet provides fine-grained, visually localized user control that outperforms prior state-of-the-art models in precision, synchronization, and semantic alignment within the domain of audiovisual generation (Viertola et al., 30 Sep 2025).
1. Task Definition and Motivation
Video object segmentation-aware audio generation is defined by its explicit input–output structure:
- Input: A sequence of video frames, corresponding binary segmentation masks that specify the object of interest in each frame, and optionally a textual prompt.
- Output: An audio waveform (e.g., a 5 s clip at 44.1 kHz) containing sound predominantly attributable to the masked object.
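- Formal Objective: Learn a generator that maps the frames, masks, and optional prompt to the waveform. The generator is trained to focus its audio generation on the spatiotemporal region defined by the masks while preserving contextual scene information from the full frames.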
Prior video-conditioned audio generation models operate at the scene level and cannot produce sound for a single specified object, which leads to uncontrolled mixing of sound sources and poor temporal alignment. These models also often require prohibitive training resources, restricting reproducibility. SAGANet addresses these limitations with explicit mask-based object control: a lightweight segmentation-aware module and dual visual streams provide precise object-level control, efficient training, and strong generalization to multi-source scenes even when the model is trained exclusively on single-instrument data (Viertola et al., 30 Sep 2025).
2. Model Architecture
SAGANet's architecture integrates segmentation-based conditioning through the following modules:
- Visual Encoder (Segmentation-Aware Control): Dual spatiotemporal streams process both the full-frame video with global masks and tightly cropped focal regions around the mask bounding boxes, subject to a minimum crop size. A shared ViT backbone (Synchformer) uses gated cross-attention adapters to fuse global and focal features. Mask embeddings are constructed via a learnable linear 3D convolutional patch embedding, initialized to zero, and summed with the video patch embeddings and positional encodings.
- Textual Encoder: A frozen CLIP-based encoder extracts text features from the optional prompt.
- Audio Encoder and Decoder: Audio is represented in either short-time Fourier or latent diffusion embedding space. The audio decoder, a DiT model operating within a Conditional Flow Matching (CFM) framework, predicts the velocity direction for each reverse diffusion step, conditioning on the segmentation-aware fused visual features, global CLIP video features, textual features, and noise.
- Optional LoRA Fine-Tuning: Low-rank adapters augment the attention projections associated specifically with segmentation features in the DiT, providing parameter-efficient adaptation during fine-tuning.
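The low-rank adaptation mentioned in the last item can be illustrated with a minimal NumPy sketch. It follows the generic LoRA formulation (a frozen weight plus a scaled low-rank update), not SAGANet's actual implementation; the class name `LoRALinear`, the layer size, and the `rank` and `alpha` values are all illustrative assumptions.

```python
import numpy as np


class LoRALinear:
    """Frozen linear map W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                  # frozen pretrained projection
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))   # trainable, small random init
        self.B = np.zeros((out_dim, rank))                    # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (tokens, in_dim) -> (tokens, out_dim)
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        return self.A.size + self.B.size


# Illustrative sizes only; the source does not state them.
rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))
layer = LoRALinear(W, rank=4, alpha=8.0)
x = rng.normal(size=(10, 64))

# Because B starts at zero, the adapted layer initially matches the frozen one.
print(np.allclose(layer(x), x @ W.T), layer.num_trainable(), W.size)
```

Because `B` is zero at initialization, fine-tuning starts from the pretrained behaviour, and only the two small factor matrices need gradients.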
The cross-attention fusion mechanism gradually integrates segmentation-relevant context into the focal visual stream after each transformer block. Its gating scalars are initialized to zero, so the adapters contribute nothing at the start of training and are phased in as the gates are learned.
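A zero-initialized gate of this kind can be sketched as follows. This is a simplified single-head version under stated assumptions: the focal tokens act as queries over the global tokens, a single scalar gate scales the residual update, and all function names and dimensions are hypothetical rather than taken from the paper.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention from `queries` over `context`."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def gated_fusion(focal, global_tokens, params, gate):
    """Residual update: focal + gate * CrossAttn(focal -> global)."""
    return focal + gate * cross_attention(focal, global_tokens, *params)


rng = np.random.default_rng(0)
dim = 16                                       # illustrative feature width
focal = rng.normal(size=(10, dim))             # tokens from the cropped stream
global_tokens = rng.normal(size=(20, dim))     # tokens from the full-frame stream
params = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(3)]

fused_closed = gated_fusion(focal, global_tokens, params, gate=0.0)
fused_open = gated_fusion(focal, global_tokens, params, gate=0.5)
print(np.allclose(fused_closed, focal), np.allclose(fused_open, focal))
```

With the gate at zero the focal stream passes through unchanged, which is what lets a pretrained backbone be extended without disturbing its initial behaviour.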
3. Objective Functions and Training Protocol
SAGANet is trained using a combination of distributional, control-specific, and perceptual objectives, leveraging the following:
- Conditional Flow Matching (CFM) Loss: Regresses the audio velocity field, matching the decoder's predicted velocity to the target velocity along the path from noise to audio.
- Mask-Consistency Loss: Encourages sensitivity of the audio output to mask manipulations.
- Perceptual Audio Loss: An optional term that uses learned audio embeddings, such as PANNs, to minimize perceptual differences between generated and reference audio.
- Total Objective: A weighted combination of the terms above, with the weights treated as hyperparameters.
- Optimization: AdamW, with a training budget of 4× A100 GPUs (40 GB) for approximately 40 epochs.
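For orientation, the commonly used form of the CFM objective with linear interpolation paths is written out below, followed by a weighted total of the three terms listed above. This is the generic textbook formulation, not necessarily the paper's exact one: the symbols ($x_0$ for noise, $x_1$ for the audio latent, $c$ for the conditioning signals, $v_\theta$ for the decoder) and the weight names $\lambda_{\mathrm{mask}}$ and $\lambda_{\mathrm{perc}}$ are assumptions introduced here.

```latex
% Generic CFM with linear paths; notation is assumed, not taken from the paper.
x_t = (1 - t)\,x_0 + t\,x_1,
\qquad x_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1]

\mathcal{L}_{\mathrm{CFM}}
  = \mathbb{E}_{t,\,x_0,\,x_1,\,c}
    \bigl\| v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\|_2^2

% Hypothetical weighted total of the terms in the list above.
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{CFM}}
  + \lambda_{\mathrm{mask}}\,\mathcal{L}_{\mathrm{mask}}
  + \lambda_{\mathrm{perc}}\,\mathcal{L}_{\mathrm{perc}}
```

Under this form the regression target $x_1 - x_0$ is constant along each path, so training needs no simulation of the flow.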
4. Segmented Music Solos Dataset
The Segmented Music Solos benchmark underpins segmentation-aware evaluation by furnishing high-quality, temporally aligned solos with dense mask annotation:
| Data Split | Clips | Duration | Classes | Notes |
|---|---|---|---|---|
| Train | 5,395 | 5 s | 25 | Solo Instruments |
| Validation | 665 | 5 s | 25 | Solo Instruments |
| Test | 745 | 5 s | 25 | URMP multi-source |
Dataset construction follows a multistage pipeline: (1) instrument solo extraction from MUSIC21/AVSBench/Solos, (2) automatic visual verification via classifier and MPNet matching, (3) auditory verification using 5 s audio windows and AST+MPNet matching, (4) clip extraction with object persistence, and (5) mask annotation using GroundedSAM2, Florence-2, and manual refinement for test splits. All clips are 5 s, with video at 25 FPS (125 frames/clip), and audio sampled at 44.1 kHz (220,500 samples/clip) (Viertola et al., 30 Sep 2025).
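The clip arithmetic above can be checked with a short Python sketch. The `ClipSpec` class and `SPLIT_SIZES` dictionary are hypothetical helpers written for this illustration; the durations, rates, and split sizes are the ones stated in the text and table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSpec:
    duration_s: float = 5.0
    video_fps: int = 25
    audio_sr: int = 44_100

    @property
    def num_frames(self) -> int:
        return round(self.duration_s * self.video_fps)

    @property
    def num_samples(self) -> int:
        return round(self.duration_s * self.audio_sr)

    @property
    def samples_per_frame(self) -> float:
        # Audio samples that elapse during one video frame.
        return self.audio_sr / self.video_fps


SPLIT_SIZES = {"train": 5_395, "validation": 665, "test": 745}

spec = ClipSpec()
total_clips = sum(SPLIT_SIZES.values())
total_hours = total_clips * spec.duration_s / 3600

print(spec.num_frames, spec.num_samples, spec.samples_per_frame)
print(total_clips, round(total_hours, 2))
```

Each video frame spans 1,764 audio samples, and the three splits together hold 6,805 clips, or roughly 9.5 hours of audio.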
5. Experimental Evaluation and Results
Performance is measured across distributional, semantic, and temporal metrics:
- Distribution Matching: Frechet Distance (FD) on VGGish, PANNs, and PaSST embeddings; KL divergence on codebooks.
- Audio Quality: Inception Score (IS via PANNs).
- Semantic Alignment: ImageBind cosine similarity (IB-score).
- Temporal Synchronization: DeSync, computed as Synchformer offset error in seconds.
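As an illustration of the distribution-matching metric, the Frechet Distance between two embedding sets is usually computed by fitting a Gaussian to each set and comparing means and covariances. The sketch below shows that standard formula in NumPy with random vectors standing in for VGGish, PANNs, or PaSST embeddings; it is not the evaluation code used for the reported results.

```python
import numpy as np


def frechet_distance(x, y):
    """Frechet distance between Gaussians fitted to two embedding sets (rows = samples)."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # Tr((cov_x cov_y)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_x @ cov_y, which are real and non-negative in exact arithmetic.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = ((mu_x - mu_y) ** 2).sum()
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))                      # stand-in for reference embeddings
shifted = real + np.array([1.0, 0.0, 0.0, 0.0])       # same covariance, shifted mean

print(frechet_distance(real, real), frechet_distance(real, shifted))
```

Identical sets should give a value near zero. Shifting every sample by a unit vector leaves the covariance unchanged, so only the squared mean difference remains and the distance should be close to 1.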
Main Results
| Metric | MMAudio Base | SAGANet | SAGANet+LoRA |
|---|---|---|---|
| FD_PaSST (↓) | 480.2 | 390.7 | 372.9 |
| FD_PANNs (↓) | 22.5 | 17.8 | 16.4 |
| FD_VGG (↓) | 12.8 | 11.1 | 10.5 |
| KL_PANNs (↓) | 1.12 | 0.81 | 0.75 |
| KL_PaSST (↓) | 0.93 | 0.67 | 0.62 |
| IS (↑) | 2.30 | 2.55 | 2.62 |
| IB-score (↑) | 36.0 | 42.5 | 44.1 |
| DeSync (s) (↓) | 0.96 | 0.42 | 0.35 |
SAGANet achieves a reduction in Frechet Distance of 18–25%, an approximately 18% increase in IB-score, and a halving of temporal synchronization error (DeSync). LoRA fine-tuning further improves all metrics, notably reducing DeSync to 0.35 s.
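The relative changes behind these summary figures can be recomputed directly from the results table. The sketch below copies the table values verbatim; the `RESULTS` dictionary and `relative_change` helper are illustrative names only.

```python
# (MMAudio base, SAGANet, SAGANet+LoRA), copied from the results table.
RESULTS = {
    "FD_PaSST": (480.2, 390.7, 372.9),
    "FD_PANNs": (22.5, 17.8, 16.4),
    "FD_VGG": (12.8, 11.1, 10.5),
    "KL_PANNs": (1.12, 0.81, 0.75),
    "KL_PaSST": (0.93, 0.67, 0.62),
    "IS": (2.30, 2.55, 2.62),
    "IB-score": (36.0, 42.5, 44.1),
    "DeSync": (0.96, 0.42, 0.35),
}


def relative_change(base, new):
    """Signed fractional change of `new` with respect to `base`."""
    return (new - base) / base


for metric, (base, saganet, lora) in RESULTS.items():
    print(
        f"{metric:9s} SAGANet {relative_change(base, saganet):+.1%}   "
        f"+LoRA {relative_change(base, lora):+.1%}"
    )
```

The IB-score gain for SAGANet without LoRA works out to about 18%, and the DeSync reduction to a little over half. The per-embedding Frechet Distance reductions differ from one another, so they need not all fall inside the rounded range quoted in the summary.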
In ablations, combining global and local visual context with the mask channel yields the best quality and timing: focal crops alone improve synchronization at the expense of overall audio quality, while full-frame inputs alone underperform in synchronization.
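The focal crops compared in this ablation are taken around the mask bounding box. One plausible way to derive such a crop from a binary mask is sketched below; the function name `focal_crop_box`, the centering and clamping strategy, and the `min_size` value are assumptions made for illustration, since the source does not give the exact cropping procedure.

```python
import numpy as np


def focal_crop_box(mask, min_size):
    """Return (top, left, bottom, right) around the mask, at least min_size per side.

    Returns None when the mask is empty (object absent in the frame).
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    height, width = mask.shape

    def expand(lo, hi, limit):
        # Grow the interval to the minimum size (capped by the image extent),
        # keep it centred on the object, then clamp it inside the image.
        size = max(hi - lo, min(min_size, limit))
        centre = (lo + hi) // 2
        lo = max(0, min(centre - size // 2, limit - size))
        return int(lo), int(lo + size)

    top, bottom = expand(ys.min(), ys.max() + 1, height)
    left, right = expand(xs.min(), xs.max() + 1, width)
    return top, left, bottom, right


mask = np.zeros((100, 100), dtype=bool)
mask[40:44, 60:70] = True                  # a small object of 4 x 10 pixels
box = focal_crop_box(mask, min_size=32)    # min_size chosen arbitrarily here
print(box)
```

This version does not preserve the aspect ratio and treats each frame independently; a practical pipeline would likely also smooth the box over time to avoid jitter in the focal stream.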
Qualitatively, spectrograms show SAGANet aligning note onsets within ±50 ms of visual events (vs. ±300 ms for the base model), and professional audio evaluators rate its object-level audio focus substantially higher (4.3/5 vs. 2.1/5 for the base model). In multi-instrument test settings, SAGANet maintains segmentation fidelity, whereas the base model merges multiple audio sources.
6. Limitations and Future Research Directions
SAGANet is currently trained on single-source videos, and fully unsupervised mask discovery for multi-source scenes remains unsolved. Expanding to more complex environments with object interactions (e.g., collisions, overlapping motions) may necessitate hierarchical or graph-based mask encoding architectures. Additional research directions include real-time inference, user-interactive mask manipulation for live Foley workflows, and domain adaptation to environmental or speech sounds via segmentation of non-rigid or amorphous objects (Viertola et al., 30 Sep 2025).
7. Significance in Controllable Audio Generation
SAGANet introduces video object segmentation-aware audio generation, setting a new empirical benchmark for object-level Foley synthesis. By conditioning audio on explicit segmentation masks, it empowers users with precise, visually routed generative control in professional workflows and establishes the Segmented Music Solos dataset as a foundation for future research. These contributions and the demonstrated performance advances move the field toward artist-friendly, object-centric multimodal generation (Viertola et al., 30 Sep 2025).