AudioStory: End-to-End Narrative Audio Synthesis
- AudioStory is a research-driven generation paradigm combining LLMs with TTS and diffusion models for long-form narrative audio.
- The flagship system decomposes high-level instructions into timed events and couples an LLM planner to a diffusion audio generator through semantic and residual bridges, maintaining global coherence and emotional consistency.
- Authoring tools leverage multimodal alignment and node-based editing to seamlessly synchronize narration, sound effects, and music.
AudioStory is a research-driven paradigm and the title of several recent systems dedicated to the end-to-end generation, authoring, and evaluation of long-form narrative audio, typically driven by structured text, prompts, and multimodal context. Systems under the AudioStory umbrella leverage advances in neural text-to-speech (TTS), diffusion-based audio generation, and LLMs to synthesize spoken narration, sound design, music, and environmental soundscapes, tightly synchronized to complex story graphs or event sequences. AudioStory research focuses on compositional generation, consistent prosody and speaker characteristics, multimodal alignment, and author-centric tools for interactive refinement.
1. AudioStory: Core Architectures and Algorithms
Recent implementations of AudioStory unify LLM-based planning with modern TTA (text-to-audio) generative models to produce long narrative audio exhibiting global coherence, intrasegmental semantic alignment, and emotional consistency. The most architecturally explicit approach is presented in "AudioStory: Generating Long-Form Narrative Audio with LLMs" (Guo et al., 27 Aug 2025).
The system operates by iteratively decomposing a high-level story instruction into a sequence of temporally distributed events, each described by a text caption, a duration, and an emotional tone. For each step $i$, the LLM's hidden embeddings are mapped to a semantic bridge token $s_i$ (capturing intra-event semantics) and a residual bridge token $r_i$ (capturing inter-event coherence and unmodeled acoustic context).
The diffusion TTA model is conditioned on $(s_i, r_i)$ and trained with a flow-matching loss of the standard conditional form, $\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,x_0,\,x_1}\big[\lVert v_\theta(x_t, t, s_i, r_i) - (x_1 - x_0)\rVert^2\big]$, where $x_t = (1-t)\,x_0 + t\,x_1$ interpolates between Gaussian noise $x_0$ and the target audio latent $x_1$.
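For concreteness, the following PyTorch sketch illustrates a conditional flow-matching objective of this form; the network shape, dimensions, and module names are illustrative assumptions rather than the published AudioStory implementation.

```python
import torch
import torch.nn as nn

class BridgedFlowMatcher(nn.Module):
    """Illustrative velocity predictor conditioned on semantic + residual bridge tokens."""
    def __init__(self, latent_dim=64, bridge_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 2 * bridge_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, x_t, t, s_i, r_i):
        # Concatenate noisy latent, timestep, and both bridge conditions.
        h = torch.cat([x_t, t.unsqueeze(-1), s_i, r_i], dim=-1)
        return self.net(h)

def flow_matching_loss(model, x1, s_i, r_i):
    """Rectified-flow style objective: regress the velocity (x1 - x0) along a straight path."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1]
    x_t = (1 - t[:, None]) * x0 + t[:, None] * x1  # linear interpolation between noise and target
    v_target = x1 - x0
    v_pred = model(x_t, t, s_i, r_i)
    return ((v_pred - v_target) ** 2).mean()
```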
Training is end-to-end, combining MSE regression on semantic tokens, next-token loss on event text, and diffusion flow loss. Causal context and tone labels are passed between events to support emotional consistency and continuity.
A key innovation is the explicit decomposition of LLM–generator interaction into semantic and residual bridges, facilitating both local semantic fidelity and global narrative flow. The entire process is interleaved (i.e., each event's audio is generated while updating the global context token state), closing the loop between LLM reasoning and acoustic realization (Guo et al., 27 Aug 2025).
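The interleaved loop can be summarized as the sketch below; the callables (event planner, bridge heads, TTA sampler) are hypothetical stand-ins for the paper's components, not a released API.

```python
def generate_story_audio(llm, bridge_heads, tta_decoder, instruction, max_events=10):
    """Illustrative interleaved loop: plan event -> derive bridges -> synthesize -> update context.
    All callables are hypothetical stand-ins for the components described in the text."""
    context, clips = [], []
    for _ in range(max_events):
        event = llm.plan_next_event(instruction, context)   # caption, duration, tone
        if event is None:
            break
        s_i = bridge_heads.semantic(event.hidden_states)    # intra-event semantics
        r_i = bridge_heads.residual(event.hidden_states)    # inter-event / acoustic residue
        audio = tta_decoder.sample(s_i, r_i, duration=event.duration)
        clips.append(audio)
        # Causal context and tone labels are carried forward to the next planning step.
        context.append({"caption": event.caption, "tone": event.tone})
    return clips
```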
2. Node-Based AudioStory Authoring and Multimodal Consistency
The node-based paradigm, exemplified in "Node-Based Editing for Multimodal Generation of Text, Audio, Image, and Video" (Kyaw et al., 5 Nov 2025), frames stories as directed graphs $G = (\mathcal{V}, \mathcal{E})$, with each node $v$ containing a text segment $T_v$ and optional image $I_v$, video $V_v$, and audio $A_v$. Nodes are rendered by dispatching $T_v$ (plus rolling context) to each media generator:
- TTS produces $A_v$ from $T_v$
- image and video generators produce $I_v$ and $V_v$
- $A_v$, $I_v$, and $V_v$ all remain tightly aligned with $T_v$, ensuring multimodal consistency
Audio is always coupled to text, never forming independent graph branches. Style control is managed interactively via natural-language prompts paired with the text at the API call (e.g., API.call("gpt-4o-tts", text=T_v, style="mysterious") → A_v). Editing $T_v$ directly and triggering audio regeneration closes the loop between script authorship and speech output. The system does not expose fine-grained prosody controls or speaker embeddings: voice characteristics are determined entirely by the provider/voice selection per node or branch.
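A minimal sketch of this node-and-regenerate workflow, assuming a simplified node record and a hypothetical API client; field names mirror the $T_v$/$A_v$ notation above, and prepending the rolling context to the text is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class StoryNode:
    """Illustrative node record; field names mirror the T_v / A_v notation above."""
    text: str                                  # T_v
    style: str = "neutral"                     # natural-language style prompt
    audio: Optional[bytes] = None              # A_v
    children: List["StoryNode"] = field(default_factory=list)

def render_audio(node: StoryNode, rolling_context: str, api) -> None:
    """Regenerate A_v from T_v; `api` is a hypothetical client wrapping the provider call
    shown in the text. Re-running this after editing node.text closes the authoring loop."""
    prompt = rolling_context + "\n" + node.text
    node.audio = api.call("gpt-4o-tts", text=prompt, style=node.style)
```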
Reported observations are qualitative—audio “instantly attaches to a node” and “follows the text segment”—with no quantitative metrics (e.g., MOS, alignment scores) or discussion of error modes. Workflow heuristics include keeping segments short, synchronizing style and provider across branches, and leveraging the rolling context for name and motif stability (Kyaw et al., 5 Nov 2025).
3. Event-Level Decomposition, Alignment, and Synchronization
Modern AudioStory architectures, including MM-StoryAgent (Xu et al., 7 Mar 2025) and WavJourney (Liu et al., 2023), stress compositionality: transforming stories into temporally aligned segments across narration, SFX, and music.
- In MM-StoryAgent (Xu et al., 7 Mar 2025), story text is decomposed into pages or sentence blocks. Voice narration (CosyVoice), SFX (AudioLDM 2), and music (MusicGen) are batch-generated for each segment.
- Sound effect events are scheduled at absolute times $t = T_p + \delta_p$, where $T_p$ is the cumulative narration time before page $p$ and $\delta_p$ is the intra-page offset; all assets are stretched or padded to fit their display intervals. Background music is looped or trimmed to the total narration length. No machine-learned temporal alignment is used; all synchronization is deterministic (see the scheduling sketch below).
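A minimal sketch of this deterministic scheduling rule, assuming a simple in-memory representation of page durations and sound-effect events (the data layout is an assumption of this sketch, not the MM-StoryAgent code).

```python
def schedule_sfx(pages, sfx_events):
    """Deterministic scheduling: absolute start time = cumulative narration time before
    the page + intra-page offset; events are trimmed to their page interval."""
    # pages: list of narration durations (seconds), indexed by page
    # sfx_events: list of (page_index, offset_s, duration_s, asset)
    cumulative = [0.0]
    for dur in pages:
        cumulative.append(cumulative[-1] + dur)

    timeline = []
    for page, offset, duration, asset in sfx_events:
        start = cumulative[page] + offset                   # absolute start time
        end = min(start + duration, cumulative[page + 1])   # pad/trim to the page interval
        timeline.append((start, end, asset))
    return timeline, cumulative[-1]                         # events plus total narration length
```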
WavJourney (Liu et al., 2023) leverages LLMs to emit structured JSON "AudioScripts," enumerating foreground/background events, audio types, start/end times, and attributes (e.g., speech: character, text, volume; sound effects/music: description, length, mixing parameters). Compilation produces a sequence of function calls, executed in order to synthesize and mix the complete audio.
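An illustrative AudioScript-like structure and its compilation into ordered synthesis calls is shown below; the field names approximate the attributes described in the paper, and the exact schema may differ.

```python
# Illustrative AudioScript-style structure (field names approximate the attributes
# described for WavJourney; the exact schema is an assumption of this sketch).
audio_script = [
    {"audio_type": "speech", "layout": "foreground", "character": "Narrator",
     "text": "The storm rolled in over the harbor.", "volume": -6},
    {"audio_type": "sound_effect", "layout": "background", "description": "distant thunder",
     "begin_time": 0.0, "end_time": 8.0, "volume": -18},
    {"audio_type": "music", "layout": "background", "description": "slow, ominous strings",
     "length": 12.0, "volume": -20},
]

def compile_script(script):
    """Compile the script into an ordered list of (synthesizer, event) calls to execute and mix."""
    calls = []
    for event in script:
        target = {"speech": "tts", "sound_effect": "tta", "music": "ttm"}[event["audio_type"]]
        calls.append((target, event))
    return calls
```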
This script-based or event-decomposed architecture affords precise control, supports user edits at the event level, and creates a compositional interface between text-based planning and low-level waveform synthesis.
4. Prosody, Expressiveness, and Speaker Consistency
Generating expressively narrated stories remains an open challenge. The "StoryTTS" corpus (Liu et al., 2024) demonstrates that detailed, multi-dimensional expressiveness annotations (sentence pattern, rhetorical device, scene, imitated character, emotional color) can be leveraged for TTS conditioning. By incorporating embeddings for each label category and emotional keyword (via BERT/Sentence-BERT), a TTS model trained on StoryTTS achieves a MOS of 4.09 (vs. 3.88 for baseline), significant F0 RMSE reduction, and increases in pitch dynamics and role-playing variance.
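A sketch of this kind of label conditioning, assuming five categorical label dimensions and a 768-dimensional keyword embedding; the sizes and module structure are assumptions, not the StoryTTS recipe.

```python
import torch
import torch.nn as nn

class ExpressivenessConditioner(nn.Module):
    """Illustrative conditioning on StoryTTS-style labels (sentence pattern, rhetorical
    device, scene, imitated character, emotional color); sizes are assumptions."""
    def __init__(self, num_labels_per_dim=(8, 10, 6, 20, 12), dim=256):
        super().__init__()
        self.tables = nn.ModuleList([nn.Embedding(n, dim) for n in num_labels_per_dim])
        self.keyword_proj = nn.Linear(768, dim)   # e.g. a Sentence-BERT keyword embedding

    def forward(self, label_ids, keyword_emb):
        # label_ids: (batch, 5) integer labels; keyword_emb: (batch, 768)
        cond = sum(table(label_ids[:, i]) for i, table in enumerate(self.tables))
        return cond + self.keyword_proj(keyword_emb)   # added to the TTS text encoding
```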
Speaker consistency across long stories is addressed via character persona embeddings, as in MultiActor-Audiobook (Park et al., 19 May 2025). The system builds a multimodal persona embedding for each character, combining LLM-extracted textual descriptions, face images (Stable Diffusion), and voice exemplars (FleSpeech). Each sentence is paired with an LLM-generated instruction (specifying prosodic and emotional directives) and attributed to a persona, and the synthesis step maintains global voice continuity.
Ablation studies in MultiActor-Audiobook reveal that removing persona conditioning or instruction generation degrades speaker-consistent expressiveness (Char-Con and MOS-E metrics), confirming the necessity of both modules (Park et al., 19 May 2025).
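A schematic of the persona-conditioned, per-sentence synthesis loop; all class and method names here are hypothetical stand-ins, not the MultiActor-Audiobook code.

```python
from dataclasses import dataclass

@dataclass
class Persona:
    """Illustrative multimodal persona record (field names are assumptions)."""
    name: str
    text_description: str   # LLM-extracted character description
    face_image: bytes       # e.g. a Stable Diffusion render
    voice_prompt: bytes     # reference audio / voice exemplar

def synthesize_audiobook(sentences, personas, llm, tts):
    """Per-sentence loop: attribute a speaker, generate a prosody/emotion instruction,
    and condition the zero-shot TTS on the same persona throughout for voice continuity."""
    clips = []
    for sent in sentences:
        speaker = llm.attribute_speaker(sent, personas)        # hypothetical call: returns a name
        instruction = llm.generate_instruction(sent, speaker)  # prosodic/emotional directive
        clips.append(tts.synthesize(text=sent,
                                    voice_prompt=personas[speaker].voice_prompt,
                                    instruction=instruction))
    return clips
```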
5. Environmental Soundscapes and Multimodal Audio Generation
AudioStory systems increasingly support joint narration and soundscape generation. The "Sound of Story" dataset (Bae et al., 2023) introduces a large-scale, tri-modal corpus for background sound and music. Non-speech audio is extracted from movie clips via speech separation and aligned with key images and captions. Benchmarks include retrieval (e.g., audio-to-video, audio-to-text) and diffusion-based conditional generation (from text and/or images). Cross-modal contrastive loss and Fréchet Audio Distance (FAD) serve as evaluation metrics. Multi-condition diffusion with both text and image yields the best FAD (9.099), outperforming Riffusion and MusicGen baselines.
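Such cross-modal contrastive objectives typically take the form of a symmetric InfoNCE between paired embeddings; the minimal PyTorch version below shows that standard form, not necessarily the exact loss used in the paper.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(audio_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE between audio embeddings and text (or image) embeddings."""
    a = F.normalize(audio_emb, dim=-1)
    b = F.normalize(other_emb, dim=-1)
    logits = a @ b.t() / temperature                    # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```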
From the system perspective, WavJourney (Liu et al., 2023) and the immersive audiobook framework (Selvamani et al., 8 May 2025) mix narration, SFX, and environmental audio using volume control, spatialization, and precise offset scheduling defined by script structure or NLP-driven temporal tags. In the multi-agent framework, spatial cues are generated from input text using scene parsing and GPT-4–driven instruction, with downstream asset synthesis via diffusion-based generative models and higher-order ambisonic representations.
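As a minimal illustration of offset-scheduled mixing with per-event gain (spatialization and ambisonics are out of scope), assuming mono numpy waveforms and gains in dB:

```python
import numpy as np

def mix_tracks(events, total_len_s, sr=24000):
    """Minimal offset-scheduled mixdown: place each (start_s, gain_db, waveform) on a
    mono timeline, apply gain, sum, and clip."""
    out = np.zeros(int(total_len_s * sr), dtype=np.float32)
    for start_s, gain_db, wav in events:
        gain = 10.0 ** (gain_db / 20.0)
        i0 = int(start_s * sr)
        if i0 >= len(out):
            continue                                   # event starts past the timeline end
        i1 = min(i0 + len(wav), len(out))
        out[i0:i1] += gain * np.asarray(wav, dtype=np.float32)[: i1 - i0]
    return np.clip(out, -1.0, 1.0)
```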
6. Evaluation Metrics, Datasets, and Reported Results
AudioStory research utilizes a spectrum of objective and subjective metrics.
- Instruction following: human or automated rating (0–5), CLAP text–audio similarity, Gemini AI scoring (Guo et al., 27 Aug 2025).
- Consistency: timbre/entity persistence, event coherence scores (Guo et al., 27 Aug 2025), Char-Con, MOS-E (expressiveness), MOS-S (speaker ID) (Park et al., 19 May 2025).
- Generation quality: Fréchet distance (FD), Fréchet Audio Distance (FAD), mel-cepstral distortion (MCD), log-F0 RMSE (Liu et al., 2024, Guo et al., 27 Aug 2025).
- Subjective tests: mean opinion scores (MOS), ABX preference, listening immersion, emotional appropriateness (Xu et al., 7 Mar 2025, Kyaw et al., 5 Nov 2025, Selvamani et al., 8 May 2025).
- Objective retrieval: Recall@K in tri-modal retrieval (Bae et al., 2023).
- Benchmark datasets: AudioStory-10K (10k stories; natural and animated sounds) (Guo et al., 27 Aug 2025), StoryTTS (61 h Mandarin, fine-grained expressiveness) (Liu et al., 2024), SoS (Sound of Story, 984 h movie-based audio) (Bae et al., 2023), MultiActor-Audiobook (TED-derived faces, stories) (Park et al., 19 May 2025).
In long-form narrative generation, AudioStory (Guo et al., 27 Aug 2025) reports an instruction-following CLAP score of 4.1, a consistency score of 4.1, a lower FAD than competing systems, and coherent segments of up to 150 s, outperforming AudioLDM2, TangoFlux, and hybrid pipelines.
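The CLAP-based instruction-following metric reduces to a text–audio embedding similarity; the sketch below uses the Hugging Face port of CLAP, where the model id and method names reflect recent transformers releases and should be treated as assumptions.

```python
import torch
from transformers import ClapModel, ClapProcessor

# Model id and API names as of recent Hugging Face `transformers` releases (assumption).
model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

def clap_similarity(caption: str, audio, sampling_rate: int = 48000) -> float:
    """Cosine similarity between a caption and a mono waveform (numpy array)."""
    text_in = processor(text=[caption], return_tensors="pt", padding=True)
    audio_in = processor(audios=[audio], sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        t = model.get_text_features(**text_in)
        a = model.get_audio_features(**audio_in)
    return torch.nn.functional.cosine_similarity(t, a).item()
```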
7. Current Limitations and Prospects
Key challenges remain:
- Absence of deep, speaker-specific training leads to occasional inconsistency in voice timbre or prosody upon node regeneration or across long branches (Kyaw et al., 5 Nov 2025, Park et al., 19 May 2025).
- High-level style prompting suffices for coarse emotion but does not yield fine-grained prosodic control (e.g., pitch contours, pauses, emphasis) (Kyaw et al., 5 Nov 2025).
- Lack of learned alignment or spatialization models in most frameworks; synchronization is typically rule-based unless specialized (e.g., DTW, spatial diffusion) (Xu et al., 7 Mar 2025, Selvamani et al., 8 May 2025).
- Integration of environmental and diegetic sound remains primitive outside of advanced spatial audio agent systems (Selvamani et al., 8 May 2025), and joint learning of narration and SFX is rare.
- Quantitative evaluation is underdeveloped in some interfaces, with limited MOS, listening tests, or error analysis (Kyaw et al., 5 Nov 2025).
Future research directions include incorporating cross-modal attention/alignment modules, multi-dimensional expressiveness labeling, human-in-the-loop revision cycles, emotion-planning modules, and integration of fine-grained spatial and acoustic controls. The modular, script-like structure (see WavJourney) and graph-based authoring interfaces (see (Kyaw et al., 5 Nov 2025)) allow rapid iteration and facilitate user-centered creative workflows.
Selected Reference Table
| System / Dataset | Key Innovations | Reference |
|---|---|---|
| AudioStory | LLM–diffusion bridge, joint E2E long-form audio | (Guo et al., 27 Aug 2025) |
| Node-Graph UI | Authoring/editing, multimodal node consistency | (Kyaw et al., 5 Nov 2025) |
| StoryTTS | Expressiveness annotation + conditioning | (Liu et al., 2024) |
| MultiActor-ABook | Persona/Emotion planners, zero-shot TTS | (Park et al., 19 May 2025) |
| SoS | Large-scale tri-modal BGM dataset | (Bae et al., 2023) |
| MM-StoryAgent | Multi-agent, SFX/music+voice, open APIs | (Xu et al., 7 Mar 2025) |
| WavJourney | LLM-scripted audio composition | (Liu et al., 2023) |
| Immersive Audiobook | 3D spatial audio, multi-agent composition | (Selvamani et al., 8 May 2025) |