JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment (2507.20880v1)

Published 28 Jul 2025 in cs.SD and cs.AI

Abstract: Diffusion and flow-matching models have revolutionized automatic text-to-audio generation in recent times. These models are increasingly capable of generating high-quality and faithful audio outputs capturing speech and acoustic events. However, there is still much room for improvement in creative audio generation that primarily involves music and songs. Recent open lyrics-to-song models, such as DiffRhythm, ACE-Step, and LeVo, have set an acceptable standard in automatic song generation for recreational use. However, these models lack the fine-grained word-level controllability often desired by musicians in their workflows. To the best of our knowledge, our flow-matching-based JAM is the first effort toward endowing word-level timing and duration control in song generation, allowing fine-grained vocal control. To enhance the quality of generated songs to better align with human preferences, we implement aesthetic alignment through Direct Preference Optimization, which iteratively refines the model using a synthetic dataset, eliminating the need for manual data annotations. Furthermore, we aim to standardize the evaluation of such lyrics-to-song models through our public evaluation dataset JAME. We show that JAM outperforms existing models in terms of music-specific attributes.

Summary

  • The paper introduces a compact flow-matching model that enables word- and phoneme-level control for improved lyrical intelligibility and precise vocal timing.
  • It employs token-level duration control and temporally-aware phoneme alignment to maintain natural prosody and strict timing boundaries.
  • Experimental results demonstrate state-of-the-art performance in audio fidelity and aesthetic alignment, confirmed by both objective metrics and expert evaluations.

JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment

Introduction and Motivation

JAM introduces a compact, flow-matching-based architecture for lyrics-to-song generation, targeting the limitations of prior models in controllability, efficiency, and alignment with human musical preferences. Unlike previous large-scale diffusion and autoregressive models, JAM is designed to provide fine-grained, word- and phoneme-level control over vocal timing and duration, while maintaining high audio fidelity and stylistic flexibility. The model is further enhanced through iterative aesthetic alignment using Direct Preference Optimization (DPO), leveraging synthetic preference data to optimize for human-like musicality and enjoyment.

Model Architecture and Training Pipeline

JAM employs a 530M-parameter conditional flow-matching model, utilizing 16 LLaMA-style Transformer layers as its Diffusion Transformer (DiT) backbone. The model is conditioned on three axes: lyrics (with explicit word/phoneme timing), target duration, and style prompt (audio or text). The audio is encoded using the Stable Audio Open VAE encoder and a DiffRhythm-initialized decoder, operating at 44.1 kHz for up to 3 minutes and 50 seconds.

Figure 1: A depiction of the JAM architecture and training pipeline, highlighting the integration of lyric, style, and duration conditioning.
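For concreteness, the reported architectural figures can be collected into a small configuration sketch. Only the values below come from the text; the field names and the idea of a config object are illustrative, not part of the paper.

```python
from dataclasses import dataclass, field

@dataclass
class JamConfig:
    """Illustrative summary of JAM's reported hyperparameters (names are hypothetical)."""
    n_transformer_layers: int = 16          # LLaMA-style DiT backbone layers
    total_params: str = "530M"              # reported model size
    sample_rate_hz: int = 44_100            # Stable Audio Open VAE operating rate
    max_duration_s: int = 3 * 60 + 50       # up to 3 minutes 50 seconds of audio
    conditioning_axes: tuple = field(default_factory=lambda: (
        "lyrics_with_word_phoneme_timing",
        "target_duration",
        "style_prompt_audio_or_text",
    ))
```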

The training pipeline consists of three stages:

  1. Pre-training on 90-second clips with random crops.
  2. Supervised fine-tuning (SFT) on full-length songs.
  3. Iterative DPO-based preference alignment using SongEval-generated synthetic preference labels.

The flow-matching approach directly regresses a time-dependent vector field, enabling efficient and stable training compared to score-based diffusion models. During inference, an ODE solver integrates the learned vector field to generate the latent representation, which is then decoded to audio.
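As a point of reference, a generic conditional flow-matching objective with a linear interpolation path, together with Euler-style ODE integration at inference, can be sketched as follows. This is the standard recipe rather than JAM's exact implementation; the network `v_theta`, the conditioning tensor `cond`, and the step count are placeholders.

```python
import torch

def flow_matching_loss(v_theta, x1, cond):
    """Generic conditional flow-matching loss on a straight path from noise to data.
    x1: clean VAE latents; cond: fused lyric/style/duration conditioning."""
    x0 = torch.randn_like(x1)                        # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)    # random time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1.0 - t_) * x0 + t_ * x1                   # point on the straight path
    target = x1 - x0                                 # constant target velocity
    pred = v_theta(xt, t, cond)                      # regress the vector field
    return torch.mean((pred - target) ** 2)

@torch.no_grad()
def sample(v_theta, shape, cond, steps=32, device="cpu"):
    """Euler integration of the learned ODE from noise (t=0) to latents (t=1)."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + dt * v_theta(x, t, cond)
    return x  # decode with the VAE decoder afterwards
```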

Conditioning and Fine-grained Temporal Control

JAM's conditioning pipeline fuses lyric, style, and duration embeddings at the latent level. Lyric conditioning leverages word-level and phoneme-level timing, transformed into upsampled phoneme sequences and embedded via a convolutional network. Style conditioning uses MuQ-MuLan embeddings, and duration conditioning combines global duration embeddings with token-level duration control (TDC). The TDC mechanism introduces a learnable bias for padding tokens, enabling the model to sharply distinguish between valid content and silence, thus achieving precise temporal boundaries.
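The paper specifies only that a learnable bias marks padding tokens; the minimal sketch below shows one way such a mechanism could be wired into the frame embeddings. The class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class TokenDurationControl(nn.Module):
    """Hypothetical sketch of token-level duration control (TDC): frames beyond
    the requested duration receive a learnable padding bias so the model can
    sharply separate valid content from silence."""
    def __init__(self, dim: int):
        super().__init__()
        self.pad_bias = nn.Parameter(torch.zeros(dim))   # learnable padding bias

    def forward(self, frame_emb: torch.Tensor, valid_mask: torch.Tensor):
        # frame_emb: (batch, frames, dim); valid_mask: (batch, frames) bool,
        # True for frames inside the target song duration.
        pad = (~valid_mask).unsqueeze(-1).float()        # 1.0 on padding frames
        return frame_emb + pad * self.pad_bias           # bias only the padding
```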

A key innovation is the temporally-aware word-level phoneme alignment algorithm, which distributes phoneme tokens within each word's temporal span, supporting both natural prosody and explicit user control. The model supports both continuous and quantized (beat-aligned) timestamp inputs, facilitating practical user interaction and downstream integration.
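To make the alignment idea concrete, the sketch below spreads each word's phonemes evenly over the frames covered by its time span and leaves the remaining frames as silence. The frame rate, the even-split rule, and the silence id are assumptions; the paper's actual algorithm may distribute tokens differently.

```python
def align_phonemes(words, frame_rate_hz=21.5, total_frames=None):
    """Hypothetical sketch of temporally-aware word-level phoneme alignment.
    words: list of (phoneme_ids, start_s, end_s)."""
    if total_frames is None:
        total_frames = int(max(end for _, _, end in words) * frame_rate_hz) + 1
    frames = [0] * total_frames                          # 0 = silence / padding id
    for phonemes, start_s, end_s in words:
        lo = int(start_s * frame_rate_hz)
        hi = max(lo + 1, int(end_s * frame_rate_hz))
        span = hi - lo
        for i in range(span):                            # even split of the word span
            frames[lo + i] = phonemes[min(i * len(phonemes) // span,
                                          len(phonemes) - 1)]
    return frames

# Example: a word with two phoneme ids (5, 6) sung from 0.0 s to 0.4 s.
print(align_phonemes([([5, 6], 0.0, 0.4)], frame_rate_hz=10, total_frames=8))
# -> [5, 5, 6, 6, 0, 0, 0, 0]
```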

Aesthetic Alignment via Direct Preference Optimization

To address the gap between technical intelligibility and human musical preference, JAM applies DPO using SongEval as the reward signal. This process iteratively refines the model by generating candidate outputs, scoring them, and optimizing the model to prefer higher-scoring samples. The DPO loss is adapted for flow-matching objectives, and a ground-truth reconstruction term is optionally included to regularize against overfitting to synthetic preferences.
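One common way to adapt DPO to flow matching (in the spirit of Diffusion-DPO) is to replace log-likelihoods with per-sample flow-matching errors, measured against a frozen reference model. The sketch below follows that pattern under assumptions of my own (the pairing scheme, the value of beta, and all function names); it is not JAM's exact loss.

```python
import torch
import torch.nn.functional as F

def fm_loss_per_sample(v_theta, x1, x0, t, cond):
    """Per-sample flow-matching error for a fixed (noise, time) draw."""
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1.0 - t_) * x0 + t_ * x1
    err = v_theta(xt, t, cond) - (x1 - x0)
    return err.flatten(1).pow(2).mean(dim=1)

def dpo_flow_loss(v_theta, v_ref, x_win, x_lose, cond, beta=2000.0):
    """Sketch of a DPO-style objective for flow matching: prefer the higher-scoring
    (winner) latents by comparing FM errors against a frozen reference model."""
    x0 = torch.randn_like(x_win)
    t = torch.rand(x_win.shape[0], device=x_win.device)
    with torch.no_grad():                                # reference model is frozen
        ref_win = fm_loss_per_sample(v_ref, x_win, x0, t, cond)
        ref_lose = fm_loss_per_sample(v_ref, x_lose, x0, t, cond)
    d_win = fm_loss_per_sample(v_theta, x_win, x0, t, cond) - ref_win
    d_lose = fm_loss_per_sample(v_theta, x_lose, x0, t, cond) - ref_lose
    return -F.logsigmoid(-beta * (d_win - d_lose)).mean()
```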

This alignment strategy yields measurable improvements in both objective and subjective metrics, including content enjoyment, musicality, and vocal naturalness, as well as reductions in word and phoneme error rates.

Experimental Results

Objective Evaluation

JAM is evaluated on JAME, a newly curated, genre-diverse, contamination-free benchmark. Compared to strong baselines (LeVo, YuE, DiffRhythm, ACE-Step), JAM achieves:

  • Lowest WER (0.151) and PER (0.101), indicating superior lyric intelligibility and alignment.
  • Highest MuQ-MuLan similarity (0.759) and genre classification accuracy (0.704), reflecting strong style adherence.
  • Best content enjoyment (CE = 7.423) and lowest FAD (0.204), demonstrating high subjective appeal and audio fidelity.
  • Consistent state-of-the-art or second-best performance across all other SongEval and aesthetic metrics.

Subjective Evaluation

Human annotators with musical expertise rated JAM highest in enjoyment, musicality, and song structure clarity, and comparable to the best models in audio quality and vocal naturalness. The model's fine-grained timing control and aesthetic alignment are directly linked to these improvements.

Ablation and Analysis

  • Token-level duration control is critical for suppressing unwanted audio beyond the target duration, achieving sub-1% RMS amplitude leakage versus >18% for models without TDC (see the leakage sketch after this list).
  • Phoneme assignment strategy impacts long-term musicality: the "Average Sparse" method yields better FAD and SongEval scores than "Pad Right," despite slightly higher PER.
  • Iterative DPO consistently improves musicality and aesthetic metrics, though excessive DPO can increase FAD, indicating a trade-off between alignment and perceptual realism.
  • Duration prediction experiments show that naive timestamp prediction (e.g., GPT-4o) degrades musicality and intelligibility, while beat-aligned quantization offers a practical compromise with minimal performance loss.
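The leakage figures above can be read as a ratio of RMS amplitude outside versus inside the requested duration. The paper does not give the exact formula, so the function below is my illustrative interpretation of such a metric.

```python
import numpy as np

def rms_leakage(audio: np.ndarray, sample_rate: int, target_duration_s: float) -> float:
    """Illustrative leakage metric: RMS amplitude of audio past the requested
    duration, relative to the RMS inside it (e.g., < 0.01 with TDC)."""
    cut = int(target_duration_s * sample_rate)
    inside, outside = audio[:cut], audio[cut:]
    rms = lambda x: float(np.sqrt(np.mean(np.square(x)))) if len(x) else 0.0
    return rms(outside) / max(rms(inside), 1e-8)
```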

Practical Implications and Limitations

JAM's architecture enables efficient, controllable, and high-fidelity lyrics-to-song generation suitable for both professional and research applications. The model's compact size (530M parameters) allows for faster inference and lower resource requirements compared to billion-parameter baselines. The explicit support for word- and phoneme-level timing control is particularly valuable for composers and music technologists seeking precise prosodic and rhythmic manipulation.

However, the requirement for accurate word-level duration annotations limits usability for non-expert users. Experiments with duration predictors highlight the need for robust, musically-aware timestamp generation, ideally integrated into the model's training loop. The current system also lacks phoneme-level duration control, which could further enhance expressiveness and pronunciation accuracy.

Future Directions

  • End-to-end duration prediction: Jointly training a duration predictor with the song generator to enable robust, user-friendly inference without manual timing annotations.
  • Phoneme-level control: Incorporating phoneme-level alignment and duration modeling for finer expressive granularity.
  • Broader language and style support: Extending the model to handle non-English lyrics and a wider range of musical genres and vocal styles.
  • Real-time and interactive applications: Leveraging the model's efficiency for live composition, editing, and music production workflows.

Conclusion

JAM demonstrates that a compact, flow-matching-based model with explicit fine-grained temporal control and iterative aesthetic alignment can achieve state-of-the-art performance in lyrics-to-song generation. The model's design and evaluation set new standards for controllability, efficiency, and musicality in AI-driven music generation. Future work on integrated duration prediction and phoneme-level control will further enhance the model's robustness and applicability in real-world creative contexts.