Lyria RealTime: Live Music Generation API
- Lyria RealTime is an API-based generative music model that produces continuous, high-fidelity audio using a codec language modeling architecture.
- It supports advanced user steering via text and audio prompts, along with descriptor conditioning, allowing it to adapt dynamically to live performance inputs.
- The system minimizes latency by generating audio in 2-second chunks served from a scalable cloud deployment, ensuring seamless, interactive streaming.
Lyria RealTime is an API-based generative music model designed for continuous, real-time music creation with extended user controls. It represents a shift in music AI towards live, human-in-the-loop performance, augmenting the foundational techniques demonstrated in open-weights models like Magenta RealTime with additional functionality, broader input modalities, and advanced control schemes, all deployed via scalable cloud infrastructure (Team et al., 6 Aug 2025).
1. Model Framework and Architecture
Lyria RealTime is constructed on a codec language modeling architecture, specifically leveraging SpectroStream, a discrete audio codec that encodes 48 kHz stereo audio into residual vector quantization (RVQ) tokens. The generative model is structured as a conditional language model that predicts sequences of discrete audio tokens given "acoustic style" embeddings. The generation at step $t$ can be formalized as:

$$a_t \sim p_\theta\left(a_t \mid \tilde{a}_{<t},\, s_t\right)$$

where $a_t$ comprises the audio tokens for a $2$-second segment, $\tilde{a}_{<t}$ denotes the sequence of coarsened (lower-rate) context tokens from the previous chunks (representing a 10-second context window), and $s_t$ is the conditioning embedding encapsulating style controls.
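This chunked formulation can be made concrete with a short sketch. The code below is illustrative only: `model`, `generate_chunk`, and `coarsen` are hypothetical stand-ins for the token decoder interface, and only the chunk and context sizes follow the figures quoted above.

```python
from collections import deque

CHUNK_SECONDS = 2                                   # each step emits a 2-second token segment
CONTEXT_SECONDS = 10                                # the model conditions on the previous 10 seconds
CONTEXT_CHUNKS = CONTEXT_SECONDS // CHUNK_SECONDS   # = 5 prior chunks

def generate_stream(model, style_embedding, num_chunks):
    """Illustrative chunk-by-chunk generation loop over a hypothetical `model` API."""
    context = deque(maxlen=CONTEXT_CHUNKS)           # rolling 10-second token context
    for _ in range(num_chunks):
        # a_t ~ p(a_t | coarsened previous chunks, style embedding)
        tokens = model.generate_chunk(context=list(context), style=style_embedding)
        yield tokens                                 # handed to the codec decoder / audio output
        context.append(model.coarsen(tokens))        # keep only the lower-rate tokens as context
```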
The model utilizes an encoder–decoder Transformer backbone following the T5 architecture, with both Base and Large configurations referenced in the implementation. Notably, Lyria RealTime includes a refinement module—a dedicated MLP—which operates over decoder outputs to predict fine-resolution tokens, enhancing temporal and spectral fidelity beyond the base generation.
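The refinement step can likewise be sketched as a small MLP head that maps decoder hidden states to logits for the fine RVQ levels. The PyTorch module below is schematic, under assumed dimensions (`d_model`, `n_fine_levels`, `codebook_size` are illustrative), not the production implementation.

```python
import torch.nn as nn

class RefinementMLP(nn.Module):
    """Schematic refinement head: decoder states -> logits for fine RVQ levels."""
    def __init__(self, d_model=1024, n_fine_levels=12, codebook_size=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, n_fine_levels * codebook_size),
        )
        self.n_fine_levels = n_fine_levels
        self.codebook_size = codebook_size

    def forward(self, decoder_states):               # (batch, time, d_model)
        logits = self.net(decoder_states)
        # one categorical distribution per fine RVQ level and time step
        return logits.view(*decoder_states.shape[:2], self.n_fine_levels, self.codebook_size)
```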
2. Conditioning Mechanisms and User Controls
Lyria RealTime supports multifaceted user steering through both text and audio prompts. The central conditioning vector is computed as a weighted sum of prompt embeddings:

$$s = \sum_{i} w_i \, E(p_i)$$

where each $p_i$ represents an input prompt (textual or auditory), $w_i$ is the user-specified weight, and $E(\cdot)$ is the prompt encoder, utilizing MuLan in Lyria RT (compared to MusicCoCa in Magenta RT).
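In code, this weighted blend reduces to a single sum over encoded prompts. In the sketch below, `encode_prompt` is a hypothetical stand-in for the joint text/audio encoder (e.g., MuLan), and the normalization of user weights is an assumption for numerical convenience.

```python
import numpy as np

def blend_prompts(prompts, weights, encode_prompt):
    """Weighted sum of prompt embeddings, s = sum_i w_i * E(p_i) (sketch).

    `encode_prompt` is a placeholder for a joint text/audio encoder;
    `prompts` may mix text strings and audio buffers, as the encoder allows.
    """
    embeddings = np.stack([encode_prompt(p) for p in prompts])   # (n_prompts, d)
    weights = np.asarray(weights, dtype=np.float32)
    weights = weights / weights.sum()                            # assumed normalization
    return (weights[:, None] * embeddings).sum(axis=0)           # (d,)
```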
Beyond direct prompts, Lyria RT incorporates descriptor-based conditioning. Standard Music Information Retrieval (MIR) methods extract features including (see the sketch after this list):
- Brightness and density: quantified via spectral centroid and onsets, respectively.
- Key and tempo: inferred through specialized beat and key detection models.
- Stem-specific controls: extracted by stem separation networks, enabling isolated manipulation of components such as bass, drums, or vocals.
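These descriptors can be approximated with off-the-shelf MIR tooling. The librosa-based sketch below estimates brightness, onset density, and tempo from a mono audio buffer; it illustrates the kind of features involved rather than Lyria RT's internal extractors, and key detection and stem separation are omitted for brevity.

```python
import librosa
import numpy as np

def extract_descriptors(y, sr):
    """Approximate descriptor features from mono audio `y` at sample rate `sr`."""
    # Brightness: mean spectral centroid (higher = brighter timbre)
    brightness = float(np.mean(librosa.feature.spectral_centroid(y=y, sr=sr)))

    # Density: onsets per second as a rough note-density proxy
    onset_times = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    density = len(onset_times) / (len(y) / sr)

    # Tempo: global beat-tracking estimate in BPM
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)

    return {"brightness_hz": brightness, "onset_density": density, "tempo_bpm": float(tempo)}
```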
A notable extension is “audio injection”, a mechanism allowing live user audio input to be intermixed with model-generated audio; the resulting mix, after tokenization, is recursively used as context for subsequent predictions. The model thus operates as an adaptive accompanist, dynamically responding to human musical actions.
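The feedback loop behind audio injection can be sketched as mixing the live input into the generated chunk before tokenization. In the sketch, `tokenize` and `mix_gain` are illustrative assumptions; only the mix-then-tokenize-then-feed-back ordering reflects the mechanism described above.

```python
import numpy as np

def inject_and_feed_back(generated_chunk, live_chunk, tokenize, mix_gain=0.5):
    """Mix live user audio into the generated chunk, then tokenize the mix so it
    becomes part of the context for the next generation step (sketch only)."""
    n = min(len(generated_chunk), len(live_chunk))
    mix = (1.0 - mix_gain) * generated_chunk[:n] + mix_gain * live_chunk[:n]
    mix = np.clip(mix, -1.0, 1.0)   # avoid clipping after summation
    return tokenize(mix)            # these tokens re-enter the rolling context
```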
3. Real-Time Streaming and Latency Management
The architecture is optimized for uninterrupted live music streaming. Each generative step operates on 2-second audio chunks, with context windows that cover the immediate past (10 seconds) to maintain temporal continuity and musical coherence. The chunked structure minimizes perceived latency and supports seamless integration into live performance pipelines.
Live streaming is realized through the model's API-based deployment: audio tokens are generated, decoded, and streamed without interruption, with real-time user controls processed and reflected in the generated output with minimal latency. This infrastructure is designed for demanding live music production and performance environments.
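A minimal client-side view of this loop, assuming a hypothetical `session` object that yields decoded 2-second chunks: as long as each chunk is produced faster than real time, the playback buffer never underruns, and control changes submitted between chunks are reflected on the next one.

```python
import queue
import threading

def stream_audio(session, playback_queue: queue.Queue):
    """Pull decoded chunks from a (hypothetical) generation session and keep a
    playback buffer filled; an audio callback drains `playback_queue` in real time."""
    for pcm_chunk in session.chunks():        # each item: 2 s of decoded PCM
        playback_queue.put(pcm_chunk)         # blocks when the buffer is full

# Usage sketch: generation runs in the background while controls are updated live.
# buffer = queue.Queue(maxsize=3)               # roughly 6 s of lookahead audio
# threading.Thread(target=stream_audio, args=(session, buffer), daemon=True).start()
# session.update_controls({"brightness": 0.8})  # hypothetical call; applied to the next chunk
```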
4. Quantitative and Qualitative Performance
Evaluation in the source report emphasizes Magenta RealTime, which achieves superior music quality on automatic metrics despite a comparatively small parameter count. The referenced metrics include:
- Fréchet Distance on OpenL3 (FD_openl3)
- KL divergence to past music token distributions (KL_past)
- CLAP score (Contrastive Language-Audio Pretraining metric)
While detailed Lyria RT metrics are not separately reported, it is based on an extended architecture that includes additional controls and a refinement module, and is deployed on more powerful cloud hardware. This suggests that Lyria RealTime could achieve equal or superior subjective and objective quality in live musical contexts.
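For reference, the Fréchet distance behind FD_openl3 is computed between Gaussians fitted to reference and generated embedding sets. The sketch below shows that calculation given two embedding matrices; computing the OpenL3 embeddings themselves is out of scope here.

```python
import numpy as np
from scipy import linalg

def frechet_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) embedding sets."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):     # discard negligible imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```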
5. Use Cases and Intended Applications
Lyria RealTime is explicitly designed for real-time, interactive music creation, accommodating a spectrum of live use cases:
- Live Performance: Musicians leverage dynamic prompt steering and descriptor-based controls to co-create music with the system during concerts or improvisational settings.
- Generative Accompaniment: The audio injection and feedback design enables Lyria RT to function as a real-time accompanist, adapting to user audio and guiding ensemble-like collaboration.
- Creative Production Tools: Through its API access, producers and artists can automate or semi-automate music generation, integrating human taste with large-scale, high-fidelity generative modeling (see the sketch after this list).
- Installations and Interactive Media: The model’s flexible control interface supports integration into interactive sound art, gaming, or educational exhibits that respond in real time to the environment or to users.
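As an illustration of how such API-driven automation might look, the sketch below drives a hypothetical session from a timed prompt schedule; none of the object, method, or parameter names are taken from the actual Lyria RealTime API and should be read as placeholders.

```python
import time

def automate_prompt_schedule(session, schedule):
    """Drive a (hypothetical) generation session from a timed prompt schedule.

    `schedule` is a list of (seconds_from_start, prompts, weights) tuples;
    `session.set_prompts` is a placeholder for the real control call.
    """
    start = time.monotonic()
    for at, prompts, weights in schedule:
        time.sleep(max(0.0, at - (time.monotonic() - start)))
        session.set_prompts(prompts, weights)   # reflected in the next generated chunk

# Example: gradually shift from ambient pads to a drum break over one minute.
# automate_prompt_schedule(session, [
#     (0,  ["warm ambient pads"],               [1.0]),
#     (30, ["warm ambient pads", "drum break"], [0.6, 0.4]),
#     (60, ["drum break"],                      [1.0]),
# ])
```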
6. Future Directions
Enhancements proposed for the broader “live music models” class directly inform the trajectory for Lyria RealTime:
- Latency Reduction: Research is ongoing to minimize input-to-output latency, targeting ultra-low-latency real-time scenarios that may encompass direct MIDI or live audio stream steering.
- Multi-Stem Training: Forthcoming work aims to train on annotated multitrack (stem-separated) audio, expanding the model’s capacity to function as a full collaborative partner or live ensemble component.
- Improved Alignment: Refinement of the alignment between text and audio embeddings via self-conditioning or latent constraints is planned, with the goal of increasing control precision and generative responsiveness to a wider variety of input modalities.
A plausible implication is that these directions, once realized, will make Lyria RealTime a central component in hybrid human–AI musical systems, expanding both the practical possibilities for live AI-generated music and the theoretical understanding of symbolic and continuous musical control at scale.
7. Comparative Summary of Controls and Features
The following table summarizes core aspects of Lyria RealTime in the context of related models described in the data:
| Feature | Magenta RealTime | Lyria RealTime |
|---|---|---|
| Deployment | Open-weights, on-device | API-based, cloud inference |
| User Control | Text/Audio style prompts | Text/Audio prompts, descriptors |
| Audio Codec | SpectroStream | SpectroStream |
| Refinement Stage | No | Yes (refinement MLP) |
| Descriptor Conditioning | Standard (basic) | Advanced (MIR, stem-aware) |
| Audio Injection | Yes | Yes (extended) |
This comparison highlights Lyria RealTime’s expanded control scheme, API/cloud deployment model, and advances in both fidelity and interactivity for live music applications.