AudioStory: End-to-End Narrative Audio Synthesis

Updated 8 February 2026
  • AudioStory is a research-driven generation paradigm combining LLMs with TTS and diffusion models for long-form narrative audio.
  • The system decomposes a high-level instruction into timed events and conditions a diffusion text-to-audio generator through semantic and residual bridges, supporting global coherence and emotional alignment.
  • Authoring tools leverage multimodal alignment and node-based editing to synchronize narration, sound effects, and music.

AudioStory is a research-driven paradigm and the title of several recent systems dedicated to the end-to-end generation, authoring, and evaluation of long-form narrative audio, typically driven by structured text, prompts, and multimodal context. Systems under the AudioStory umbrella leverage advances in neural text-to-speech (TTS), diffusion-based audio generation, and LLMs to synthesize spoken narration, sound design, music, and environmental soundscapes, tightly synchronized to complex story graphs or event sequences. AudioStory research focuses on compositional generation, consistent prosody and speaker characteristics, multimodal alignment, and author-centric tools for interactive refinement.

1. AudioStory: Core Architectures and Algorithms

Recent implementations of AudioStory unify LLM-based planning with modern TTA (text-to-audio) generative models to produce long narrative audio exhibiting global coherence, intrasegmental semantic alignment, and emotional consistency. The most architecturally explicit approach is presented in "AudioStory: Generating Long-Form Narrative Audio with LLMs" (Guo et al., 27 Aug 2025).

The system operates by iteratively decomposing a high-level story instruction into a sequence of temporally distributed events, each described by text (caption), duration, and emotional tone. For step $t$, LLM embeddings $z_t$ are mapped to a semantic bridge $B_t \in \mathbb{R}^{d \times N}$ (capturing intra-event semantics) and a residual bridge $R_t \in \mathbb{R}^{d \times M}$ (capturing inter-event coherence and unmodeled acoustic context):

$$T_\text{semantic}^t = W_\text{sem}(z_t \,\|\, c_{t-1}) \qquad T_\text{residual}^t = W_\text{res}(z_t \,\|\, h_{t-1})$$

$$B_t = \operatorname{CrossAttn}\left(Q = T_\text{semantic}^t,\; K = T_\text{residual}^t,\; V = T_\text{residual}^t\right)$$
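How the two bridges might be computed can be sketched in PyTorch as follows; the projection shapes, token counts $N$ and $M$, and the attention configuration are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class DualBridge(nn.Module):
    """Sketch of the semantic/residual bridging step (dimensions are assumptions).

    W_sem / W_res are modeled as linear projections over the concatenation of
    the current LLM embedding z_t with the previous caption state c_{t-1}
    (semantic path) or hidden state h_{t-1} (residual path); the semantic
    tokens then cross-attend over the residual tokens to form B_t.
    """

    def __init__(self, d_llm=1024, d=768, n_sem=32, n_res=8, n_heads=8):
        super().__init__()
        self.n_sem, self.n_res, self.d = n_sem, n_res, d
        self.w_sem = nn.Linear(2 * d_llm, n_sem * d)   # W_sem
        self.w_res = nn.Linear(2 * d_llm, n_res * d)   # W_res
        self.cross_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, z_t, c_prev, h_prev):
        # z_t, c_prev, h_prev: (batch, d_llm)
        t_sem = self.w_sem(torch.cat([z_t, c_prev], dim=-1)).view(-1, self.n_sem, self.d)
        t_res = self.w_res(torch.cat([z_t, h_prev], dim=-1)).view(-1, self.n_res, self.d)
        # B_t = CrossAttn(Q = T_semantic, K = V = T_residual)
        b_t, _ = self.cross_attn(query=t_sem, key=t_res, value=t_res)
        return b_t   # conditioning tokens for the diffusion TTA model

# usage
bridge = DualBridge()
z = torch.randn(2, 1024); c = torch.randn(2, 1024); h = torch.randn(2, 1024)
print(bridge(z, c, h).shape)   # torch.Size([2, 32, 768])
```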

The diffusion TTA model is conditioned on $B_t$, minimizing a flow-matching loss:

$$L_\text{flow} = \mathbb{E}_{x_0, \tau} \left\| u(x_\tau, \tau;\, c = B_t) - v_\tau \right\|^2$$

Training is end-to-end, combining MSE regression on semantic tokens, next-token loss on event text, and diffusion flow loss. Causal context and tone labels are passed between events to support emotional consistency and continuity.
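A schematic of the flow-matching term, assuming a rectified-flow path $x_\tau = (1-\tau)\,x_0 + \tau\,\epsilon$ with target velocity $v_\tau = \epsilon - x_0$; the paper's exact probability path and schedule may differ, and in training this term would be summed with the semantic-token MSE and next-token losses described above:

```python
import torch

def flow_matching_loss(model, x0, b_t):
    """Conditional flow-matching loss (schematic, assumed rectified-flow path).

    `model(x_tau, tau, cond)` stands in for the diffusion TTA network
    u(., .; c = B_t) conditioned on the bridge tokens B_t.
    """
    eps = torch.randn_like(x0)                                   # noise sample
    tau = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    x_tau = (1.0 - tau) * x0 + tau * eps                         # interpolated latent
    v_target = eps - x0                                          # target velocity
    v_pred = model(x_tau, tau.flatten(), b_t)                    # u(x_tau, tau; c = B_t)
    return torch.mean((v_pred - v_target) ** 2)                  # L_flow
```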

A key innovation is the explicit decomposition of LLM–generator interaction into semantic and residual bridges, facilitating both local semantic fidelity and global narrative flow. The entire process is interleaved (i.e., each event's audio is generated while updating the global context token state), closing the loop between LLM reasoning and acoustic realization (Guo et al., 27 Aug 2025).

2. Node-Based AudioStory Authoring and Multimodal Consistency

The node-based paradigm, exemplified in "Node-Based Editing for Multimodal Generation of Text, Audio, Image, and Video" (Kyaw et al., 5 Nov 2025), frames stories as directed graphs $G = (V, E)$, with each node $v \in V$ containing a text segment $T_v$ and optional image $I_v$, video $V_v$, and audio $A_v$. Nodes are rendered by dispatching $T_v$ (plus rolling context $C$) to each media generator:

  • $T_v \rightarrow$ TTS $\rightarrow A_v$
  • $T_v, C \rightarrow$ image and video generators
  • All $A_v$, $I_v$, $V_v$ remain tightly aligned with $T_v$, ensuring multimodal consistency

Audio is always coupled to text, never forming independent graph branches. Style control is managed interactively via natural-language prompts, which are paired with the text at the API call (e.g., API.call("gpt-4o-tts", text=T_v, style="mysterious") → A_v). Editing $T_v$ directly and triggering audio regeneration closes the loop between script authorship and speech output. The system does not expose fine-grained prosody controls or speaker embeddings: voice characteristics are determined by the provider/voice selection per node or branch.
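A minimal sketch of the node-render loop under these conventions; the `StoryNode` fields and the `tts_api` callable are illustrative stand-ins, not the authors' actual interface:

```python
from dataclasses import dataclass, field

@dataclass
class StoryNode:
    """One node of the story graph: a text segment plus generated media."""
    node_id: str
    text: str                      # T_v, the authored script segment
    style: str = "neutral"         # natural-language style prompt
    audio: bytes | None = None     # A_v, regenerated whenever the text changes
    children: list["StoryNode"] = field(default_factory=list)

def render_audio(node: StoryNode, context: str, tts_api) -> None:
    """Dispatch the node's text (plus rolling context) to a TTS provider.

    `tts_api` is a hypothetical callable wrapping a provider call such as
    API.call("gpt-4o-tts", ...); audio is always derived from the node's
    text, never authored as an independent branch.
    """
    node.audio = tts_api(text=node.text, style=node.style, context=context)

def edit_and_regenerate(node: StoryNode, new_text: str, context: str, tts_api) -> None:
    """Editing T_v triggers audio regeneration, closing the authoring loop."""
    node.text = new_text
    render_audio(node, context, tts_api)
```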

Reported observations are qualitative—audio “instantly attaches to a node” and “follows the text segment”—with no quantitative metrics (e.g., MOS, alignment scores) or discussion of error modes. Workflow heuristics include keeping segments short, synchronizing style and provider across branches, and leveraging the rolling context for name and motif stability (Kyaw et al., 5 Nov 2025).

3. Event-Level Decomposition, Alignment, and Synchronization

Modern AudioStory architectures, including MM-StoryAgent (Xu et al., 7 Mar 2025) and WavJourney (Liu et al., 2023), stress compositionality: transforming stories into temporally aligned segments across narration, SFX, and music.

  • In MM-StoryAgent (Xu et al., 7 Mar 2025), story text is decomposed into pages or sentence blocks. Voice narration (CosyVoice), SFX (AudioLDM 2), and music (MusicGen) are batch-generated for each segment.
  • Sound-effect events $s_k$ are scheduled at absolute time $\tau_k = T_{p-1} + \Delta_k$ (where $T_{p-1}$ is the cumulative narration time before page $p$ and $\Delta_k$ is the intra-page offset), with all assets stretched or padded to fit their display intervals. Background music is looped or trimmed to the total narration length $T_N$. No machine-learned temporal alignment is used; all synchronization is deterministic (a minimal sketch follows this list).
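A minimal sketch of this deterministic scheduling, using illustrative page durations and offsets:

```python
def schedule_sfx(pages):
    """Compute absolute SFX start times tau_k = T_{p-1} + delta_k.

    `pages` is a list of dicts, each with a page's narration duration and its
    sound-effect events as (intra-page offset, description) pairs. Timings
    are purely deterministic: cumulative narration time plus offset.
    """
    schedule, t_cum = [], 0.0              # t_cum plays the role of T_{p-1}
    for page in pages:
        for delta_k, desc in page["sfx"]:
            schedule.append((t_cum + delta_k, desc))   # tau_k
        t_cum += page["narration_s"]
    return schedule, t_cum                 # events and total narration length T_N

# Illustrative input: two pages with narration lengths and SFX offsets.
pages = [
    {"narration_s": 12.4, "sfx": [(3.0, "door creak")]},
    {"narration_s": 9.8, "sfx": [(1.5, "thunder"), (6.0, "rain loop")]},
]
events, total_len = schedule_sfx(pages)
print(events)     # [(3.0, 'door creak'), (~13.9, 'thunder'), (~18.4, 'rain loop')]
print(total_len)  # ~22.2 -> background music is looped or trimmed to this length
```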

WavJourney (Liu et al., 2023) leverages LLMs to emit structured JSON "AudioScripts," enumerating foreground/background events, audio types, start/end times, and attributes (e.g., speech: character, text, volume; sound effects/music: description, length, mixing parameters). Compilation produces a sequence of function calls, executed in order to synthesize and mix the complete audio.
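The shape of such a script, and its compilation into ordered synthesis calls, can be sketched as follows; the field names follow the paper's description, but the exact JSON schema and the synthesis backends are simplified stand-ins:

```python
# Illustrative AudioScript: foreground/background layers, audio types, timing,
# and per-event attributes. Field names and values are assumptions.
audio_script = [
    {"layer": "foreground", "type": "speech", "character": "Narrator",
     "text": "The storm rolled in at dusk.", "volume": 0.0},
    {"layer": "foreground", "type": "sound_effect",
     "description": "distant thunder", "length_s": 4.0, "volume": -6.0},
    {"layer": "background", "type": "music",
     "description": "slow ominous strings", "length_s": 20.0, "volume": -12.0},
]

def compile_script(script, backends):
    """Turn the script into an ordered list of synthesis calls for later mixing.

    `backends` maps each audio type to a synthesis callable
    (e.g. a TTS model, a TTA model, a music generator).
    """
    tracks = []
    for event in script:
        synth = backends[event["type"]]
        tracks.append((event["layer"], event.get("volume", 0.0), synth(event)))
    return tracks   # downstream: offset to start times, then mix the layers
```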

This script-based or event-decomposed architecture affords precise control, supports user edits at the event level, and creates a compositional interface between text-based planning and low-level waveform synthesis.

4. Prosody, Expressiveness, and Speaker Consistency

Generating expressively narrated stories remains an open challenge. The "StoryTTS" corpus (Liu et al., 2024) demonstrates that detailed, multi-dimensional expressiveness annotations (sentence pattern, rhetorical device, scene, imitated character, emotional color) can be leveraged for TTS conditioning. By incorporating embeddings for each label category and emotional keyword (via BERT/Sentence-BERT), a TTS model trained on StoryTTS achieves a MOS of 4.09 (vs. 3.88 for baseline), significant F0 RMSE reduction, and increases in pitch dynamics and role-playing variance.
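A hedged sketch of this style of label conditioning: one embedding table per annotation category plus a projection of a precomputed keyword embedding (e.g., from BERT/Sentence-BERT), summed into a single conditioning vector for the acoustic model. Dimensions, category sizes, and the fusion rule are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn as nn

class ExpressivenessConditioner(nn.Module):
    """Fuses StoryTTS-style expressiveness labels into one conditioning vector.

    One embedding table per annotation category (sentence pattern, rhetorical
    device, scene, imitated character, emotional color) plus a projection of
    a precomputed keyword embedding; category sizes are placeholders.
    """

    def __init__(self, d=256, sizes=(8, 12, 16, 30, 10), d_keyword=384):
        super().__init__()
        self.tables = nn.ModuleList(nn.Embedding(n, d) for n in sizes)
        self.keyword_proj = nn.Linear(d_keyword, d)

    def forward(self, label_ids, keyword_emb):
        # label_ids: (batch, 5) integer labels; keyword_emb: (batch, d_keyword)
        cond = sum(table(label_ids[:, i]) for i, table in enumerate(self.tables))
        return cond + self.keyword_proj(keyword_emb)   # added to TTS encoder states

cond = ExpressivenessConditioner()
labels = torch.randint(0, 8, (2, 5))
print(cond(labels, torch.randn(2, 384)).shape)   # torch.Size([2, 256])
```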

Speaker consistency across long stories is addressed via character persona embeddings, as in MultiActor-Audiobook (Park et al., 19 May 2025). The system builds a multimodal persona embedding $z_i$ for each character, combining LLM-extracted textual descriptions, face images (Stable Diffusion), and voice exemplars (FleSpeech). Each sentence is paired with an LLM-generated instruction $u_t$ (specifying prosodic and emotional directives) and attributed to a persona. The synthesis step $\hat{x}_t = \mathrm{FleSpeech}(\text{text}=s_t,\ \text{instruction}=u_t,\ \text{persona}=z_{i(t)})$ maintains global voice continuity.
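The per-sentence synthesis loop can be sketched as below; `attribute_speaker`, `generate_instruction`, and `flespeech_synthesize` are placeholders for the paper's LLM prompts and FleSpeech backend, not its actual interfaces:

```python
def synthesize_audiobook(sentences, personas, attribute_speaker,
                         generate_instruction, flespeech_synthesize):
    """Persona-consistent narration loop (schematic).

    For each sentence s_t: attribute it to a character i(t), obtain an
    LLM-generated prosody/emotion instruction u_t, and synthesize
    x_t = FleSpeech(text=s_t, instruction=u_t, persona=z_{i(t)}).
    All callables and the `personas` mapping are stand-ins.
    """
    audio = []
    for s_t in sentences:
        char_id = attribute_speaker(s_t)        # i(t)
        u_t = generate_instruction(s_t)         # prosodic/emotional directive
        z_i = personas[char_id]                 # multimodal persona embedding
        audio.append(flespeech_synthesize(text=s_t, instruction=u_t, persona=z_i))
    return audio
```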

Ablation studies in MultiActor-Audiobook reveal that removing persona conditioning or instruction generation degrades speaker-consistent expressiveness (Char-Con and MOS-E metrics), confirming the necessity of both modules (Park et al., 19 May 2025).

5. Environmental Soundscapes and Multimodal Audio Generation

AudioStory systems increasingly support joint narration and soundscape generation. The "Sound of Story" dataset (Bae et al., 2023) introduces a large-scale tri-modal corpus for background sound and music: non-speech audio is extracted from movie clips via speech separation and aligned with key images and captions. Benchmarks include retrieval (e.g., audio-to-video, audio-to-text) and diffusion-based conditional generation from text and/or images. Cross-modal contrastive loss and Fréchet Audio Distance (FAD) serve as evaluation metrics. Multi-condition diffusion with both text and image yields the best FAD (9.099), outperforming Riffusion and MusicGen baselines.

From the system perspective, WavJourney (Liu et al., 2023) and the immersive audiobook framework (Selvamani et al., 8 May 2025) mix narration, SFX, and environmental audio using volume control, spatialization, and precise offset scheduling defined by script structure or NLP-driven temporal tags. In the multi-agent framework, spatial cues are generated from input text using scene parsing and GPT-4–driven instruction, with downstream asset synthesis via diffusion-based generative models and higher-order ambisonic representations.

6. Evaluation Metrics, Datasets, and Reported Results

AudioStory research utilizes a spectrum of objective and subjective metrics.

In long-form narrative generation, AudioStory (Guo et al., 27 Aug 2025) reports 4.1 CLAP (cosine, instruction-following), 4.1 consistency, FAD = 3.00, and coherent segments of up to 150 s, outperforming AudioLDM2, TangoFlux, and hybrid pipelines.
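For context on the FAD numbers above: FAD is the Fréchet distance between Gaussians fit to embedding sets of generated and reference audio (commonly VGGish embeddings). A minimal computation, assuming the embeddings have already been extracted:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_gen, emb_ref):
    """Fréchet distance between Gaussians fit to two audio-embedding sets.

    emb_gen, emb_ref: (num_clips, dim) arrays of embeddings (e.g. VGGish);
    embedding extraction is assumed to have been done upstream.
    """
    mu_g, mu_r = emb_gen.mean(axis=0), emb_ref.mean(axis=0)
    cov_g = np.cov(emb_gen, rowvar=False)
    cov_r = np.cov(emb_ref, rowvar=False)
    covmean = sqrtm(cov_g @ cov_r)
    if np.iscomplexobj(covmean):     # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(cov_g + cov_r - 2.0 * covmean))
```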

7. Current Limitations and Prospects

Key challenges remain:

  • Absence of deep, speaker-specific training leads to occasional inconsistency in voice timbre or prosody upon node regeneration or across long branches (Kyaw et al., 5 Nov 2025, Park et al., 19 May 2025).
  • High-level style prompting suffices for coarse emotion but does not yield fine-grained prosodic control (e.g., pitch contours, pauses, emphasis) (Kyaw et al., 5 Nov 2025).
  • Lack of learned alignment or spatialization models in most frameworks; synchronization is typically rule-based unless specialized components (e.g., DTW, spatial diffusion) are employed (Xu et al., 7 Mar 2025, Selvamani et al., 8 May 2025).
  • Integration of environmental and diegetic sound remains primitive outside of advanced spatial audio agent systems (Selvamani et al., 8 May 2025), and joint learning of narration and SFX is rare.
  • Quantitative evaluation is underdeveloped in some interfaces, with limited MOS, listening tests, or error analysis (Kyaw et al., 5 Nov 2025).

Future research directions include incorporating cross-modal attention/alignment modules, multi-dimensional expressiveness labeling, human-in-the-loop revision cycles, emotion-planning modules, and integration of fine-grained spatial and acoustic controls. The modular, script-like structure (see WavJourney) and graph-based authoring interfaces (see (Kyaw et al., 5 Nov 2025)) allow rapid iteration and facilitate user-centered creative workflows.


Selected Reference Table

System / Dataset | Key Innovations | Reference
AudioStory | LLM–diffusion bridge, joint E2E long-form audio | (Guo et al., 27 Aug 2025)
Node-Graph UI | Authoring/editing, multimodal node consistency | (Kyaw et al., 5 Nov 2025)
StoryTTS | Expressiveness annotation + conditioning | (Liu et al., 2024)
MultiActor-ABook | Persona/emotion planners, zero-shot TTS | (Park et al., 19 May 2025)
SoS | Large-scale tri-modal BGM dataset | (Bae et al., 2023)
MM-StoryAgent | Multi-agent, SFX/music + voice, open APIs | (Xu et al., 7 Mar 2025)
WavJourney | LLM-scripted audio composition | (Liu et al., 2023)
Immersive Audiobook | 3D spatial audio, multi-agent composition | (Selvamani et al., 8 May 2025)
