- The paper introduces a novel timestamp matrix that enables precise temporal control in text-to-audio synthesis, resolving limitations in previous models.
- It employs a dual data approach by converting temporally coarse captions into detailed, timestamped annotations using both real and synthetic datasets.
- Experiments show that PicoAudio2 outperforms existing methods with improved metrics (FD, IS, MOS, and Seg-F1), demonstrating its effectiveness in producing high-quality, temporally accurate audio.
PicoAudio2: Advancements in Text-to-Audio Generation with Temporal Control
Introduction
The paper presents PicoAudio2, a framework designed for text-to-audio (TTA) generation with enhanced temporal controllability. Unlike previous models, which rely primarily on synthetic datasets, PicoAudio2 trains on a combination of real and simulated data, improving both audio quality and flexibility in processing free-text queries. This addresses significant limitations of models like Make-An-Audio 2 (MAA2) and AudioComposer, which either lack fine temporal control or suffer in quality due to dataset constraints. The introduction of a timestamp matrix aligned with free-text inputs represents a notable architectural enhancement for achieving precise audio generation.
Temporally-Aligned Data Curation
A central challenge in TTA models is aligning audio with natural language descriptions. Traditionally, datasets like AudioCaps provide audio-caption pairs without precise temporal annotations, limiting the temporal control of generated audio. PicoAudio2 tackles this by developing robust pipelines for both synthetic and real data, converting temporally coarse captions (TCC) into temporally detailed captions (TDC) and thereby producing temporally strong datasets.
Simulation Data
For synthetic data, AudioTime's approach is extended by mapping categorical labels to free-text via LLMs and then aligning these labels with timestamps (Section 2). The conversion from structured categories to free text enhances the flexibility and richness of the descriptions compatible with PicoAudio2's learning objectives.
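To make the label-to-caption step concrete, the sketch below assembles a temporally detailed caption from (free-text phrase, onset, offset) triples. It is a simplified template stand-in for the LLM-based rewriting the paper describes; the exact caption wording and function name are assumptions for illustration.

```python
def labels_to_tdc(events):
    """Assemble a temporally detailed caption (TDC) from
    (free-text phrase, onset_s, offset_s) triples.

    A hypothetical template in place of the paper's LLM rewriting;
    the caption format is an assumption, not the paper's exact output.
    """
    parts = [f"{phrase} from {onset:.1f} to {offset:.1f} seconds"
             for phrase, onset, offset in events]
    return ", then ".join(parts) + "."

caption = labels_to_tdc([("a dog barking", 0.5, 2.0),
                         ("a siren wailing", 1.0, 3.0)])
# e.g. "a dog barking from 0.5 to 2.0 seconds, then a siren wailing from 1.0 to 3.0 seconds."
```

In practice an LLM would paraphrase such templates into richer free text, which is what gives PicoAudio2 its flexibility with natural-language queries.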
Real Data
In real data processing, PicoAudio2 utilizes the Text-to-Audio Grounding (TAG) model to extract timestamps, which are further curated to avoid overlap and omissions, ensuring high-quality dataset creation. This dual approach not only bridges the gap between real and simulated dataset distributions but also enriches the model's versatility in handling natural language inputs.
Figure 1: The data curation pipeline.
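The curation step that resolves overlapping detections can be sketched as a simple segment-merging pass over per-event timestamps. The `min_gap` threshold and the assumption that TAG outputs per-event (onset, offset) pairs are illustrative, not the paper's exact procedure.

```python
def merge_overlapping_segments(segments, min_gap=0.1):
    """Merge overlapping or near-adjacent (onset_s, offset_s) segments
    detected for a single sound event.

    An illustrative curation step; `min_gap` (seconds) is a hypothetical
    threshold for joining segments separated by a tiny silence.
    """
    merged = []
    for onset, offset in sorted(segments):
        if merged and onset <= merged[-1][1] + min_gap:
            # Overlaps (or nearly touches) the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], offset))
        else:
            merged.append((onset, offset))
    return merged

print(merge_overlapping_segments([(0.0, 1.0), (0.9, 2.0), (3.0, 4.0)]))
# → [(0.0, 2.0), (3.0, 4.0)]
```

Collapsing fragmented detections like this avoids contradictory annotations (e.g., an event "ending" and "starting" at nearly the same time) before the data reaches training.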
PicoAudio2 Framework
The framework is built around a Diffusion Transformer backbone, a timestamp encoder, and a VAE. A key innovation is the introduction of a timestamp matrix, which extracts detailed temporal cues from TDC, while allowing temporally weak data to be augmented strategically. By separating temporal cues from textual semantics, this design achieves effective control over event timing within generated audio, a significant improvement over prior methods that lacked this granularity.
Figure 2: PicoAudio2 framework.
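The timestamp matrix described above can be pictured as a binary frames-by-events grid marking when each event is active. The sketch below builds such a matrix from parsed TDC timestamps; the frame rate and matrix layout are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def build_timestamp_matrix(events, num_events, duration_s, frames_per_s=25):
    """Build a binary [frames x events] matrix marking when each event is active.

    `events` is a list of (event_index, onset_s, offset_s) tuples parsed
    from a temporally detailed caption (TDC). The frame rate and layout
    here are illustrative assumptions, not the paper's exact values.
    """
    num_frames = int(round(duration_s * frames_per_s))
    matrix = np.zeros((num_frames, num_events), dtype=np.float32)
    for idx, onset, offset in events:
        start = max(int(round(onset * frames_per_s)), 0)
        end = min(int(round(offset * frames_per_s)), num_frames)
        matrix[start:end, idx] = 1.0  # mark frames where the event sounds
    return matrix

# Event 0 active from 0.5-2.0 s, event 1 from 1.0-3.0 s, in a 4 s clip.
m = build_timestamp_matrix([(0, 0.5, 2.0), (1, 1.0, 3.0)],
                           num_events=2, duration_s=4.0)
```

Because the matrix carries only timing while the text encoder carries semantics, the two conditioning signals stay disentangled, which is what lets temporally weak data still contribute to training.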
Experiments and Evaluation
Quantitative evaluations demonstrate PicoAudio2's advantages over leading models. It delivers superior temporal accuracy and audio quality, validated through objective metrics such as FD and IS as well as subjective evaluation via MOS (Section 5). Temporal controllability improves markedly thanks to the timestamp matrix, as evidenced by Seg-F1 scores that outperform existing benchmarks.
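For reference, segment-based F1 compares reference and predicted event activity frame by frame, as in standard sound event detection evaluation. The sketch below shows the computation for binary per-segment activity; the segment length used in the paper is an assumption here.

```python
def segment_f1(ref, pred):
    """Frame-level (segment-based) F1 between binary activity sequences.

    `ref` and `pred` are 0/1 lists, one entry per fixed-length segment.
    This follows the standard segment-based F1 from sound event detection;
    the paper's exact segment length is not assumed known here.
    """
    tp = sum(1 for r, p in zip(ref, pred) if r == 1 and p == 1)
    fp = sum(1 for r, p in zip(ref, pred) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(ref, pred) if r == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(segment_f1([1, 1, 0, 0], [1, 0, 0, 1]))  # → 0.5
```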
Results
Figure 3: Temporal control transformation during inference.
Conclusion
PicoAudio2 addresses the critical issue of temporal alignment in TTA by integrating timestamped data annotations with a flexible model architecture that handles free-text inputs. The framework's reliance on both simulated and real datasets yields a robust audio generation pipeline capable of delivering enhanced audio quality and precise temporal control. The proposed methodology opens avenues for future work on improving annotations for overlapping events, thereby enhancing real-world applicability.
In conclusion, PicoAudio2 represents a significant step forward in synthesizing temporally accurate and high-quality audio outputs from natural language inputs, offering both theoretical contributions to the field and practical solutions adaptable to audio-based applications.