- The paper introduces live music models, a class of generative models that enable real-time, interactive music generation built on codec language models and dynamic style embeddings.
- It employs SpectroStream for audio tokenization and MusicCoCa for joint audio-text style control, producing high-quality audio as an indefinitely long stream.
- Experiments show Magenta RT surpasses comparable open models in audio fidelity despite using fewer parameters, while offering low latency and flexible user control for live music production.
Live Music Models
Introduction
The paper introduces a novel class of generative AI models for real-time music creation, termed "live music models." These models, Magenta RealTime (Magenta RT) and Lyria RealTime (Lyria RT), emphasize human-in-the-loop interaction, allowing users to steer music generation continuously as it is produced. The paper details the resulting technological advances, including first-of-their-kind live generation capabilities delivered both as an open-weights model (Magenta RT) and as an API-based system (Lyria RT), with flexible controls for artists and users.
Figure 1: Magenta RealTime is a live music model that generates an uninterrupted stream of music and responds continuously to user input. It generates audio in two-second chunks using a pipeline with three components: SpectroStream (a neural audio codec), MusicCoCa (a style embedding model), and a codec language model.
Methodology
Magenta RT employs a codec language model (LLM) architecture that generates high-fidelity stereo audio conditioned on user-specified acoustic styles. Music is encoded into discrete tokens by SpectroStream, a neural audio codec, while style embeddings are computed by MusicCoCa, which maps text and audio into a shared embedding space that serves as the control signal for the generative model.
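To make the flow of data concrete, the sketch below wires the three components together as placeholder classes. All class and method names (SpectroStreamCodec, MusicCoCaEmbedder, CodecLM, generate_chunk) and the token shapes are assumptions for illustration, not the actual Magenta RT API.

```python
# Hedged sketch of the three-component pipeline described above.
# Names and shapes are illustrative placeholders, not the real API.
import numpy as np

class SpectroStreamCodec:
    """Stand-in neural audio codec: waveform <-> discrete RVQ tokens."""
    def encode(self, waveform: np.ndarray) -> np.ndarray:
        # A real codec would return integer codes of shape [frames, rvq_depth].
        return np.zeros((len(waveform) // 1920, 16), dtype=np.int32)

    def decode(self, tokens: np.ndarray) -> np.ndarray:
        return np.zeros(tokens.shape[0] * 1920, dtype=np.float32)

class MusicCoCaEmbedder:
    """Stand-in joint audio-text style encoder."""
    def embed_text(self, prompt: str) -> np.ndarray:
        return np.random.default_rng(abs(hash(prompt)) % 2**32).standard_normal(768)

    def embed_audio(self, waveform: np.ndarray) -> np.ndarray:
        return np.random.default_rng(0).standard_normal(768)

class CodecLM:
    """Stand-in codec language model conditioned on a style embedding."""
    def generate_chunk(self, context_tokens: np.ndarray, style: np.ndarray) -> np.ndarray:
        return np.zeros((50, 16), dtype=np.int32)  # ~2 s of new tokens

# One generation step: style embedding + recent token context -> next audio chunk.
codec, styler, lm = SpectroStreamCodec(), MusicCoCaEmbedder(), CodecLM()
style = styler.embed_text("warm analog synth groove")
context = np.zeros((250, 16), dtype=np.int32)          # ~10 s of token context
audio_chunk = codec.decode(lm.generate_chunk(context, style))
print(audio_chunk.shape)
```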
Audio Tokenization via SpectroStream
SpectroStream serves as Magenta RT's discrete audio codec, transforming raw audio into residual vector quantization (RVQ) tokens while preserving audio quality. Operating at a reduced bitrate suited to streaming, the codec yields compact audio representations that make real-time generation and the model's infinite streaming capability practical.
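The toy example below illustrates the residual vector quantization idea behind such a codec: each frame's latent vector is approximated by a stack of codebook indices, with each level quantizing the residual left by the previous one. The depth, codebook size, and latent dimension are toy values, not SpectroStream's published configuration.

```python
# Minimal residual vector quantization (RVQ) sketch with toy hyperparameters.
import numpy as np

rng = np.random.default_rng(0)
depth, codebook_size, dim = 4, 256, 64          # toy RVQ hyperparameters
codebooks = rng.standard_normal((depth, codebook_size, dim))

def rvq_encode(latent: np.ndarray) -> list[int]:
    """Quantize one latent frame into `depth` integer codes."""
    residual, codes = latent.copy(), []
    for level in range(depth):
        # Pick the nearest codeword at this level, then quantize what is left.
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - codebooks[level][idx]
    return codes

def rvq_decode(codes: list[int]) -> np.ndarray:
    """Sum the selected codewords to reconstruct the latent frame."""
    return sum(codebooks[level][idx] for level, idx in enumerate(codes))

frame = rng.standard_normal(dim)
codes = rvq_encode(frame)
recon = rvq_decode(codes)
print(codes, np.linalg.norm(frame - recon))      # error shrinks as depth grows
```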
Style Embeddings with MusicCoCa
MusicCoCa constructs joint audio-text embeddings that enable detailed control over musical style. These embeddings, learned from diverse textual annotations paired with audio, are mapped to quantized style tokens, allowing fine-grained stylistic manipulation of the generated audio.
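The sketch below illustrates the shared-space idea with random stand-in encoders: a text prompt and a reference audio clip both map to unit-norm vectors in one space, so they can be compared or blended into a single control embedding. The encoder functions and the 768-dimensional width are assumptions for illustration, not MusicCoCa itself.

```python
# Hedged sketch of a shared audio-text style space with placeholder "towers".
import numpy as np

EMBED_DIM = 768  # assumed embedding width, for illustration only

def embed_text(prompt: str) -> np.ndarray:
    """Placeholder text tower: deterministic unit-norm vector per prompt."""
    v = np.random.default_rng(abs(hash(prompt)) % 2**32).standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

def embed_audio(waveform: np.ndarray) -> np.ndarray:
    """Placeholder audio tower: deterministic unit-norm vector per clip."""
    v = np.random.default_rng(int(waveform.sum() * 1e6) % 2**32).standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

def blend(embeddings: list[np.ndarray], weights: list[float]) -> np.ndarray:
    """Weighted mix of style embeddings, renormalized to unit length."""
    mix = sum(w * e for w, e in zip(weights, embeddings))
    return mix / np.linalg.norm(mix)

text_style = embed_text("lo-fi hip hop, mellow keys")
audio_style = embed_audio(np.sin(np.linspace(0, 440, 48000)))
control = blend([text_style, audio_style], [0.7, 0.3])
print(float(control @ text_style))  # cosine similarity to the text prompt
```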
Figure 2: Overall architecture of Magenta RT. Coarse acoustic tokens and quantized style tokens corresponding to 10 seconds of audio context are concatenated and fed to the model's encoder.
Model Framework
Magenta RT's framework builds on established codec LM principles but introduces adaptations for live performance: a single-stream LLM for efficient operation and chunk-based autoregression for predicting an unbounded audio stream. These adjustments keep throughput high and latency low, both essential for real-time applications. A rough illustration of the encoder input from Figure 2 follows below.
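The snippet concatenates quantized style tokens with flattened coarse acoustic tokens covering roughly 10 seconds of context. The frame rate, RVQ depth, and style-token count are assumed values, not the model's actual configuration.

```python
# Illustrative sketch of the encoder input: [style tokens | coarse acoustic context].
import numpy as np

FRAMES_PER_SEC = 25      # assumed coarse acoustic frame rate
COARSE_DEPTH = 4         # assumed number of coarse RVQ levels per frame
CONTEXT_SECONDS = 10
N_STYLE_TOKENS = 12      # assumed number of quantized style tokens

def build_encoder_input(style_tokens: np.ndarray, context_tokens: np.ndarray) -> np.ndarray:
    """Flatten the acoustic context and prepend the quantized style tokens."""
    assert style_tokens.shape == (N_STYLE_TOKENS,)
    assert context_tokens.shape == (CONTEXT_SECONDS * FRAMES_PER_SEC, COARSE_DEPTH)
    return np.concatenate([style_tokens, context_tokens.reshape(-1)])

style = np.zeros(N_STYLE_TOKENS, dtype=np.int32)
context = np.zeros((CONTEXT_SECONDS * FRAMES_PER_SEC, COARSE_DEPTH), dtype=np.int32)
print(build_encoder_input(style, context).shape)  # style tokens + flattened context
```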
Chunk-based Autoregression
The model predicts audio chunks continuously, conditioning only on a limited window of recent acoustic history and retaining no state beyond that context window. This stateless inference limits error accumulation and supports flexible real-time control, and it lets the model generate audio with a real-time factor (RTF) of at least one, i.e., at least one second of audio per second of wall-clock time.
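A minimal sketch of this loop follows, with a placeholder in place of the codec LM: each step sees only the most recent window of tokens plus the current style embedding, emits a roughly two-second chunk, and reports an RTF computed as audio seconds generated per wall-clock second. Chunk length, context length, and token rate are illustrative assumptions.

```python
# Sketch of stateless chunk-based autoregression with a bounded token window.
import time
import numpy as np

CHUNK_SECONDS, CONTEXT_SECONDS, TOKENS_PER_SEC = 2.0, 10.0, 25

def generate_chunk(context: np.ndarray, style: np.ndarray) -> np.ndarray:
    """Placeholder for one codec-LM decode of ~2 s worth of tokens."""
    return np.zeros((int(CHUNK_SECONDS * TOKENS_PER_SEC), 4), dtype=np.int32)

def stream(style: np.ndarray, n_chunks: int = 5):
    history = np.zeros((0, 4), dtype=np.int32)
    max_context = int(CONTEXT_SECONDS * TOKENS_PER_SEC)
    for _ in range(n_chunks):
        start = time.perf_counter()
        chunk = generate_chunk(history[-max_context:], style)   # bounded context only
        history = np.concatenate([history, chunk])[-max_context:]
        elapsed = time.perf_counter() - start
        rtf = CHUNK_SECONDS / max(elapsed, 1e-9)  # audio seconds per wall-clock second
        yield chunk, rtf  # rtf >= 1 means generation keeps up with playback

for chunk, rtf in stream(np.zeros(768)):
    print(chunk.shape, f"RTF={rtf:.1f}")
```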
Experiments
Audio Quality Evaluation
Empirical analysis shows that Magenta RT surpasses existing models such as Stable Audio Open and MusicGen Large on quality metrics including FD_openl3 (Fréchet distance computed on OpenL3 embeddings) and KL_passt (KL divergence over PaSST classifier outputs). Despite using fewer parameters, Magenta RT achieves higher audio fidelity, adheres closely to text prompts, and supports arbitrary-length audio outputs.
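For reference, Fréchet-distance metrics of this kind fit a Gaussian to embeddings of the reference set and of the generated set and measure the distance between the two distributions. The sketch below shows that computation on random stand-in arrays rather than real OpenL3 features.

```python
# Hedged sketch of a Fréchet distance between two embedding sets.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(ref: np.ndarray, gen: np.ndarray) -> float:
    """FD between two embedding sets, each of shape [n_samples, dim]."""
    mu_r, mu_g = ref.mean(axis=0), gen.mean(axis=0)
    cov_r = np.cov(ref, rowvar=False)
    cov_g = np.cov(gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

rng = np.random.default_rng(0)
ref_embeds = rng.standard_normal((200, 32))          # stand-in reference embeddings
gen_embeds = rng.standard_normal((200, 32)) + 0.1    # slightly shifted "generated" set
print(round(frechet_distance(ref_embeds, gen_embeds), 3))  # lower is better
```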
Figure 3: Evaluation demonstrates transitions between prompt embeddings via linear interpolation, indicating strong model adherence to sequential style changes.
Musical Transitions Experiment
Magenta RT transitions smoothly between musical styles under dynamic prompting, preserving coherence as the style evolves and giving users seamless blends between prompts, a capability central to live interaction.
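A minimal sketch of this setup: the control embedding passed with each chunk is a linear interpolation between two prompt embeddings, ramping from style A to style B over the course of the transition. The embeddings here are random placeholders for MusicCoCa outputs, and the renormalization step is an assumption.

```python
# Sketch of per-chunk style interpolation between two prompt embeddings.
import numpy as np

rng = np.random.default_rng(0)
style_a = rng.standard_normal(768)        # e.g. embedding of prompt A
style_b = rng.standard_normal(768)        # e.g. embedding of prompt B

def interpolated_styles(n_chunks: int) -> np.ndarray:
    """One control embedding per chunk, ramping linearly from A to B."""
    alphas = np.linspace(0.0, 1.0, n_chunks)[:, None]
    mix = (1.0 - alphas) * style_a + alphas * style_b
    return mix / np.linalg.norm(mix, axis=1, keepdims=True)   # keep unit norm

for i, style in enumerate(interpolated_styles(8)):
    # Each row would be fed to the codec LM as the conditioning signal for one chunk.
    closer = "A" if style @ style_a > style @ style_b else "B"
    print(f"chunk {i}: closer to style {closer}")
```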
Applications and Implications
The models bring transformative potential to live music production, offering musicians interactive creative tools and producing continuous, adaptable soundscapes in real time. By enhancing human-machine musical experiences, they can redefine live performance paradigms and open novel artistic possibilities in AI music generation.
Figure 4: Steering with a live audio stream illustrates user-controlled audio progression via the audio injection method, allowing real-time modification of the output.
Conclusion
Live music models, realized in systems like Magenta RT and Lyria RT, advance generative AI for music by providing real-time interaction and control. Their approach to codec language modeling and style embedding lays a foundation for future AI-centric music performance, enabling more engaging, responsive, and personalized musical expression. Further reductions in latency will expand their potential as interactive musical partners and sophisticated synthesizers.