HeartMuLa: A Family of Open Sourced Music Foundation Models

Published 15 Jan 2026 in cs.SD | (2601.10547v1)

Abstract: We present a family of open-source Music Foundation Models designed to advance large-scale music understanding and generation across diverse tasks and modalities. Our framework consists of four major components: (1) HeartCLAP, an audio-text alignment model; (2) HeartTranscriptor, a robust lyric recognition model optimized for real-world music scenarios; (3) HeartCodec, a low-frame-rate (12.5 Hz) yet high-fidelity music codec tokenizer that captures long-range musical structure while preserving fine-grained acoustic details and enabling efficient autoregressive modeling; and (4) HeartMuLa, an LLM-based song generation model capable of synthesizing high-fidelity music under rich, user-controllable conditions (e.g., textual style descriptions, lyrics, and reference audio). In addition, it provides two specialized modes: (i) fine-grained musical attribute control, which allows users to specify the style of different song sections (e.g., intro, verse, chorus) using natural language prompts; and (ii) short, engaging music generation, which is suitable as background music for short videos. Lastly, HeartMuLa improves significantly when scaled to 7B parameters. For the first time, we show that a Suno-level, commercial-grade system can be reproduced using academic-scale data and GPU resources. We expect these foundation models to serve as strong baselines for future research and to facilitate practical applications in multimodal content production.

Summary

The paper presents a unified, open-source music model that integrates audio-text alignment, lyric recognition, and controllable generation.
It employs innovative multi-stage tokenization and hierarchical transformers for minute-scale, high-fidelity synthesis with fine-grained control.
Empirical evaluations show that HeartMuLa outperforms competitors in lyric intelligibility, structural coherence, and style adherence across multiple languages.

HeartMuLa: An Open Music Foundation Model System for Multimodal Understanding and Generation

Framework Overview and Motivation

HeartMuLa (2601.10547) introduces a unified family of open-sourced music foundation models targeting scalable music understanding and conditional generation across multiple modalities. The framework consists of four principal modules: HeartCLAP (audio-text alignment), HeartTranscriptor (lyric recognition), HeartCodec (low-frame-rate music codec tokenizer), and HeartMuLa (LLM-based controllable music generator). This system addresses reproducibility constraints and fine-grained control limitations persistent in commercial systems (e.g., Suno, Udio), enabling high-fidelity, conditional music synthesis using only academic-scale resources and open datasets.

The key design objectives are: (1) Extensible cross-modal system design—integrating music, lyric, and style conditions with bidirectional retrieval and alignment; (2) Scalable tokenization for long-form structure—enabling minute-scale, coherent song generation; and (3) Robust, high-fidelity reconstruction even under ultra-low bitrate constraints. HeartMuLa also supports global music structure control, fine-grained user specification of song sections via textual prompts, and efficient batch or streaming inference pipelines.

Figure 1: HeartMuLa-oss-3B achieves competitive and stable performance across multiple music generation baselines and foundation models.

HeartCodec: Semantic-Rich, Ultra-Low Frame-Rate Music Tokenizer

Architectural Innovations

HeartCodec is central to the system, offering a multi-stage pipeline for semantic-rich music tokenization and high-quality reconstruction:

Figure 2: HeartCodec architecture—semantic-rich encoder, ultra-low frame rate compressor, high-fidelity reconstruction decoder.

Multi-encoder feature fusion extracting complementary phonetic (Whisper, WavLM), music semantic (MuEncoder), and acoustic details.
Query-based quantization compresses features to 12.5 Hz, yielding compact RVQ token streams with K=8 codebooks, V=8192 vocab per book.
Flow matching with reflow distillation maps the quantized discrete tokens to a continuous latent space and enables efficient waveform reconstruction via an SQ-Codec.

This setup delivers both semantic expressiveness and computational efficiency, leveraging hierarchical loss objectives (feature alignment, commitment, adversarial) and multi-stage training (pretrain, reflow, fine-tune). HeartCodec achieves state-of-the-art VISQOL, FAD, and FD scores at 1.3 kbps, outperforming prior codecs (SemantiCodec, XCodec, MuCodec, LeVo) in global reconstruction and aesthetic quality. STOI, PESQ, SPK_SIM, and WER scores demonstrate preservation of vocal intelligibility and identity. Subjective MOS scores consistently favor SQ-Codec over alternative latent designs.
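The 1.3 kbps figure is consistent with the token layout described above (12.5 Hz frames, K=8 codebooks, V=8192 entries per codebook). The following is a minimal arithmetic sketch; the function name `rvq_bitrate_bps` is illustrative and not taken from the paper or its code release.

```python
import math


def rvq_bitrate_bps(frame_rate_hz: float, num_codebooks: int, vocab_size: int) -> float:
    """Bits per second of an RVQ token stream with equally sized codebooks."""
    bits_per_token = math.log2(vocab_size)  # 8192 entries -> 13 bits per token
    return frame_rate_hz * num_codebooks * bits_per_token


# Values quoted in the summary: 12.5 Hz, K = 8, V = 8192.
bitrate = rvq_bitrate_bps(frame_rate_hz=12.5, num_codebooks=8, vocab_size=8192)
print(f"{bitrate / 1000:.1f} kbps")  # prints "1.3 kbps"
```

Each frame carries 8 tokens of 13 bits, so 12.5 frames per second gives 1300 bits per second.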

Ablation and Training Impacts

Ablation studies show that reflow distillation and SQ-Codec fine-tuning each improve downstream music generation quality, and that applying both yields the best performance on aesthetic and alignment metrics. SQ-Codec latents are selected for their trade-off between computational efficiency (RTF), reconstruction fidelity, and human-perceived musicality.

HeartMuLa: Hierarchical, Conditioned Music Generation Model

Hierarchical Token Modeling

Figure 3: HeartMuLa—global-local architecture for hierarchical, efficient, and fine-grained song synthesis conditioned on lyrics, styles, and audio prompts.

HeartMuLa uses HeartCodec’s token sequences within a hierarchical global-local transformer setup. The global transformer models long-term structure (layer-0 RVQ token prediction), while the local transformer captures residual acoustic detail given the global context.

Input conditions include structured lyrics (with section markers), categorical style tags (priority-weighted selection), and optional reference audio embeddings (MuQ-MuLan).
Conditioning information is prepended or interleaved according to section annotations, enabling coherent, long-form music generation and fine-grained intra-song attribute control.
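To illustrate why the global-local split helps with long songs, the sketch below compares the number of time steps the global transformer would see (one per codec frame) with the length of a fully flattened token stream (one position per RVQ token). It assumes the 12.5 Hz frame rate and 8 codebooks quoted for HeartCodec; the function name and the 360-second example duration are illustrative, and the paper's exact sequence packing is not described here.

```python
def sequence_lengths(duration_s: float, frame_rate_hz: float = 12.5, num_codebooks: int = 8) -> dict:
    """Compare per-frame (global) steps with a flattened all-codebook sequence."""
    frames = int(duration_s * frame_rate_hz)
    return {
        "global_steps": frames,                      # one step per codec frame
        "flattened_tokens": frames * num_codebooks,  # if every RVQ token were its own position
    }


# Hypothetical example: a six-minute (360 s) song.
lengths = sequence_lengths(360)
print(lengths)  # {'global_steps': 4500, 'flattened_tokens': 36000}
```

Under these assumptions the long-range model attends over 4,500 positions rather than 36,000, while the per-frame codebook detail is delegated to the smaller local model.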

Progressive Training and Optimization

Figure 4: Four-stage progressive training paradigm—warmup (local contexts), pretraining (full song, complete conditions), supervised finetuning (filtered high-quality subset), reinforcement alignment via Direct Preference Optimization (DPO).

Training progresses through warmup, data-intensive pretraining (100k hours of music), supervised finetuning, and DPO-based RL for multi-preference alignment. DPO enhances semantic fidelity, vocal clarity, and style adherence by optimizing log-probability ratios over preference pairs, integrating cross-entropy objectives over hierarchical codes.
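As a rough illustration of "optimizing log-probability ratios over preference pairs", here is the standard DPO loss for a single pair, written with the Python standard library only. This is a sketch of the generic DPO objective, not the authors' implementation: how sequence log-probabilities are accumulated over the hierarchical RVQ codes is not specified in this summary, and `beta=0.1` is an illustrative value, not one reported for HeartMuLa.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # Log-probability ratios of the trained policy against the frozen reference model.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen sample's likelihood relative to the rejected one lowers the loss.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
print(baseline > improved)  # prints "True"
```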

Evaluation and Results

HeartMuLa’s objective metrics consistently match or surpass both open and closed-source baselines (LeVo, DiffRhythm2, Ace-Step, Suno-v5/v4.5, MiniMax, Udio), particularly in lyric intelligibility (lowest PER across all languages), structural and naturalness ratings (SongEval), style adherence (Tag-Sim), and overall audio quality (AudioBox).

It demonstrates strong geographical and linguistic generalization with stable performance for English, Chinese, Japanese, Korean, and Spanish benchmarks.
Subjective evaluation confirms high MOS scores in musicality, harmony, structure, fidelity, creativity, and text alignment.
Ablations show additive improvements, with the SFT and DPO stages each enhancing specific dimensions (semantic, intelligibility, audio quality).
Inference optimizations (KV-cache, FlashAttention, CUDA Graph) together yield a 5.4× speedup for long-form generation without degrading musical quality.

HeartCLAP and HeartTranscriptor Modules

HeartCLAP: Robust Audio-Text Alignment

HeartCLAP employs contrastive InfoNCE objectives over a shared embedding space (initialized from MuQ-MuLan). Attribute-level and tag-level masking during multi-format training allow for generalization across prompt types and incomplete conditions. Empirical benchmarks (WikiMT-X) yield the best recall and mAP in both text-to-music and music-to-text retrieval versus Laion-CLAP and MuQ-MuLan, validating discriminative, robust cross-modal alignment.
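For readers unfamiliar with the contrastive objective mentioned above, the following sketch computes a symmetric InfoNCE loss over a small batch of paired audio and text embeddings using plain Python lists. It shows the generic CLIP/CLAP-style formulation only; the temperature of 0.07 and all function names are assumptions, and HeartCLAP's masking strategy and actual hyperparameters are not reproduced here.

```python
import math


def _unit(v):
    """L2-normalize a vector (assumes a non-zero norm)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _diag_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(audio_embs, text_embs, temperature=0.07):
    """Average of audio-to-text and text-to-audio InfoNCE over matched pairs."""
    audio = [_unit(v) for v in audio_embs]
    text = [_unit(v) for v in text_embs]
    logits = [[sum(a * t for a, t in zip(av, tv)) / temperature for tv in text]
              for av in audio]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


# Toy batch: matched pairs point in the same direction, mismatched pairs do not.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # prints "True"
```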

HeartTranscriptor: Lyric Recognition for Music-Specific ASR

HeartTranscriptor adapts Whisper (full fine-tuning) to music-specific vocal signals, employing a high SNR dataset curated using Demucs separation and hierarchical error filtering. It consistently outperforms Whisper, SongPrep, FireRed-ASR, and Qwen3-Omni models on both full-song and short-slice benchmarks for English, Chinese, Japanese, Korean, and Spanish, as measured by WER and CER.
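Since the comparison above is reported in WER and CER, a minimal word error rate computation is sketched below using word-level Levenshtein distance. The function name and the lyric strings are made up for illustration; real evaluations typically also normalize text (case, punctuation) first, which is omitted here. Applying the same routine to character lists instead of word lists gives CER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row dynamic programming for edit distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical lyric line with one substituted word out of four.
print(word_error_rate("hold me close tonight", "hold me closer tonight"))  # prints 0.25
```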

Data Foundation, Annotation, and Benchmarks

The model’s dataset aggregation procedure encompasses structural lyrics annotation (SongFormer), fine-grained style annotation (Qwen-2.5-Omni), and professional expert curation (HeartBeats-Benchmark). Inputs are diversified by dimension dropout, and all training data passes automated and manual filtration to ensure semantic and aesthetic validity.

Theoretical and Practical Implications

HeartMuLa establishes strong baselines and open protocols for the academic and developer communities. The combined architectural and training advances permit scalable, reproducible, and controllable music synthesis, supporting both minute-scale generation and section-level style manipulation. The system demonstrates cross-modal retrieval robustness (CLAP), domain-optimized ASR under complex musical signals (Transcriptor), and inference acceleration critical for practical deployment.

In theoretical terms, the model leverages hierarchical factorization, conditioning, and tokenization strategies that may generalize to broader multimodal generative tasks. The success of progressive alignment with DPO suggests wider applicability for RL-style alignment in sequence modeling beyond natural language.

Conclusion

HeartMuLa offers a comprehensive open-source solution for music understanding and generation, integrating state-of-the-art modules for cross-modal alignment, tokenization, ASR, and controllable synthesis. Its empirical results show superior objective and subjective performance relative to competing open and commercial models, particularly in fine-grained controllability and structural generalization. The open dissemination of weights, protocols, and benchmarks will facilitate future research into scalable multimodal foundation models and creative AI music production.

Explain it Like I'm 14

What is this paper about?

This paper introduces HeartMuLa, a family of open-source AI models that can understand and create music. The goal is to make high-quality, controllable music generation possible for researchers and creators using tools that are free to use and study. The system can match text to music, recognize lyrics, turn audio into compact “tokens,” and then generate full songs (up to six minutes) with user controls like style, mood, and section-by-section guidance (intro, verse, chorus). The authors show their open models can reach commercial-level quality using academic-scale data and GPUs.

What questions does the paper try to answer?

The paper focuses on simple but big questions:

How can we build an open, powerful music AI that others can reproduce and improve?
How can a model keep long songs coherent (not just short loops) while still sounding detailed and natural?
How can users easily control the music’s style, structure, and lyrics?
Can we compress music into fewer, smarter pieces so models can handle long songs efficiently without losing sound quality?

How did they do it? (Methods in everyday language)

The system has four main pieces that work together, like a band with different instruments:

HeartCLAP: Think of this like a “matchmaker” between music and text. If you give it a description (e.g., “upbeat pop with female vocals”), it can find music that fits, and vice versa. This helps align words and sounds.
HeartTranscriptor: This is a lyrics listener. It’s trained to hear and write down the words sung in a song, even in real-world, noisy music.
HeartCodec: This is the “music tokenizer”—it turns a song into a sequence of tiny tokens (like musical letters) and back again. The trick:
- It listens with multiple “ears” at once: some hear detailed sound texture, some hear musical structure, and some focus on vocals and pronunciation.
- It compresses the music into tokens very slowly (only 12.5 snapshots per second), which makes long songs easier to model, but still keeps rich detail using a layered “Lego-like” code called RVQ (Residual Vector Quantization).
- To rebuild the sound from tokens, it uses a “flow” model that starts from noise and “sculpts” it into clean audio. They also speed this up with a technique called Reflow distillation and choose a continuous representation (SQ-Codec) that balances quality and speed well.
HeartMuLa (the music maker): This is the generator that writes songs from tokens, guided by your inputs. It has a two-part brain:
- A global planner that decides the big picture over time (structure, main musical ideas).
- A local detailer that fills in fine-grained sound (timbre, texture).

It can take three types of guidance:
- Lyrics (with labels like [intro], [verse], [chorus] to shape song structure)
- Style tags (genre, mood, instruments, etc.)
- Reference audio (a short clip to hint at overall style, using a representation that avoids copying a singer’s voice)

It’s trained in four stages:
1) Warmup on short clips to learn clear sounds fast
2) Pretraining on full songs to learn long-range structure
3) Supervised finetuning on the best-quality data to polish quality
4) Preference training (DPO): the model sees pairs of outputs and learns to prefer the better one (clearer vocals, better style match, nicer sound)

What did they find, and why is it important?

Here are the key results in plain terms:

The new tokenizer (HeartCodec) sets a high bar. Even though it uses very few snapshots per second (12.5 Hz), it reconstructs music with excellent quality and preserves singer identity and lyrics clarity better than many alternatives. This is important because fewer tokens mean the generator can handle much longer songs reliably.
A specific choice of “continuous latent” (SQ-Codec) gave the best mix of audio quality and speed compared to other options they tried.
Fine-tuning steps mattered: speeding up the flow model (Reflow) and then fine-tuning the final decoder (SQ finetune) improved both measured quality and how good the music sounds to human listeners.
For generation, a “guidance scale” of about 1.25 sounded most natural to human ears (higher values can push clarity but may sound harsh).
HeartMuLa can:
- Generate long songs (up to six minutes) that stay coherent
- Let users control different sections (intro/verse/chorus) with natural language
- Produce short, catchy tracks for short videos
- Improve markedly when scaled up (e.g., 7B parameters)
The system is open-source and reaches “commercial-grade” quality using academic-scale data and GPUs, which makes high-level music AI more accessible to the research community.

What’s the bigger impact?

This work could change how people build and use music AI:

For researchers: It provides strong, open baselines—clear starting points that others can reproduce, test, and improve. This speeds up progress and makes results more trustworthy.
For creators: It offers fine control (by lyrics, style, and section) and can generate music that’s both long-form and high-fidelity, useful for songwriting, demos, and content creation (like background music for videos).
For industry and tools: The efficient token design means models can handle long, structured music without massive compute costs, opening doors for real-world applications and potentially on-device or streaming scenarios.
For ethics and safety: The style reference avoids carrying singer timbre information, helping reduce the risk of copying someone’s voice.

In short, HeartMuLa shows that powerful, controllable, and high-quality music generation can be built openly and efficiently, setting a solid foundation for the next generation of music AI.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, focused list of what remains missing, uncertain, or unexplored in the paper. Each item is phrased to be concrete and actionable for future research.

Data transparency and licensing: The “internal dataset” of ~600k songs and the 100,000-hour corpus are not described (sources, genres, languages, rights/permissions, geographic/cultural coverage); release plans and licensing constraints for reproducing training remain unclear.
Benchmark disclosure: The HeartBeats-Benchmark (English) is referenced but not specified (size, splits, genres, annotation protocol, licensing), hindering reproducibility and fair comparison.
Component coverage gaps: HeartCLAP and HeartTranscriptor are introduced but lack architecture details, training procedures, benchmarks, and head-to-head comparisons against CLAP or state-of-the-art lyric/ASR models in musical conditions.
Suno-level claim validation: The claim of “Suno-level, commercial-grade reproduction” is not supported by direct, blinded comparative human studies or objective benchmarks against commercial systems (e.g., Suno, Udio) or widely used academic baselines (MusicGen/MusicLM).
Long-form generation evaluation: The assertion of up to 6-minute generation lacks quantitative and qualitative evaluation—no metrics for structural coherence across sections, repetition avoidance, transitions, and overall narrative form for lengthy songs.
End-to-end latency and scalability: RTF is reported only for HeartCodec; end-to-end generation latency, throughput, and memory footprint (global-local transformer, streaming mode) on typical inference hardware remain unmeasured.
Controllability assessment: No rigorous metrics or user studies quantify adherence to fine-grained section-level style controls (intro/verse/chorus), nor handling of conflicting conditions among lyrics, tags, and reference audio.
Music theory control: The system does not provide or evaluate control over tempo, time signature, key, chord progression, or structure duration; how to add and evaluate these constraints remains open.
Multilingual generalization: Training/evaluation is effectively English-centric; performance on non-English lyrics (alignment, pronunciation, prosody) is unknown despite Whisper’s multilingual capability.
Lyric-to-audio synchronization: There is no metric or analysis of temporal alignment between generated vocals and provided lyrics (e.g., forced alignment accuracy, syllable timing, prosodic fit to meter/rhythm).
Robustness and OOD behavior: Generalization to noisy, live, lo-fi recordings, non-stereo inputs, extreme vocal techniques (rap, growl, screams), dense mixes, and atypical genres is not tested.
Safety and ethics: Memorization/plagiarism risk, style imitation boundaries, and voice timbre leakage are not empirically audited (the claim that MuQ-MuLan excludes speaker timbre is unverified); content safety filters/policies are not specified.
Reference audio conditioning: The relative weighting and interactions between reference audio embeddings vs. text tags/lyrics are not ablated; failure modes under conflicting conditions and generalization to user-supplied external references are unknown.
Automated metric validity: Heavy reliance on AudioBox and SongEval (learned metrics) lacks correlation studies with large-scale human judgments across genres/cultures; statistical significance testing is absent.
Human evaluation scale: Codec subjective tests involve only five expert listeners; no large, diverse listener study for HeartMuLa’s generative quality, controllability, lyric intelligibility, and musicality.
Query-based downsampling and RVQ design: No ablation on downsampling ratio, codebook count K, vocabulary size V, and per-layer weighting; sensitivity analyses and trade-offs (bitrate vs. fidelity vs. controllability) are missing.
Transient/percussive fidelity: The 12.5 Hz frame rate may blur fast transients; systematic analyses on drums, sharp attacks, and microtiming accuracy are not provided.
DPO training specifics: Construction of preference pairs (data size, criteria, noise handling), the choice of β temperature, reference policy selection, and the impact on global vs. local layers are not documented or ablated; human-in-the-loop preferences are absent.
CFG in generation: Classifier-Free Guidance is tuned for the codec; CFG dropout/scale for HeartMuLa generation and its perceptual impact are not explored.
Tag selection probabilities: Heuristic tag-category probabilities (e.g., genre 0.95, topic 0.1) are not justified or learned; their effect on style adherence and potential bias should be ablated or learned adaptively.
HeartCLAP integration: The audio-text alignment model is not integrated into generation (conditioning or evaluation); how HeartCLAP could improve style adherence or retrieval-augmented generation remains unexplored.
HeartTranscriptor utility: While used to compute WER/PER, there is no independent benchmark (lyrics in music vs. clean speech) or analysis of error types (coarticulation, singing prosody); its reliability for evaluating lyric intelligibility is uncertain.
SQ-Codec dependence: The decoder is adapted to SQ-Codec latents; portability to other continuous tokenizers, licensing constraints, and generalization across codecs are not examined.
Model scaling evidence: The paper claims substantial improvement at 7B parameters but presents only 3B+300M results; concrete scaling laws, costs, and gains are missing.
Training compute and reproducibility: Training requires up to 64 A100s; exact recipes (epochs, schedules, curriculum, checkpointing) are partly truncated and may be insufficient for independent reproduction on modest academic resources.
Stem-level control: The system does not offer or evaluate separate control over stems (vocals, drums, bass, instruments) or mixing parameters, despite reporting mixing similarity in codec tests.
Streaming pipeline: Although HeartCodec is designed for streaming, the end-to-end streaming generation pipeline (chunking, lookahead, latency, artifacts at chunk boundaries) is not evaluated.
Fair bitrate comparisons: HeartCodec is compared against baselines with differing bitrates/frame rates; controlled experiments at matched bitrates/frame rates are needed to isolate architectural advantages.
Data filtering effects: Use of AudioBox/SongEval for filtering “high quality” subsets may cause distribution shifts or metric overfitting; impact on generalization and bias is not studied.

Glossary

Ablation study: A controlled experiment removing or altering components to assess their impact on performance. "ablation studies on design choices."
AdamW optimizer: An optimization algorithm that decouples weight decay from the gradient update for better generalization. "We use the AdamW optimizer with a base learning rate of $1\times10^{-4}$,"
Adversarial loss: A GAN-based objective that trains a generator against a discriminator to improve realism. "The adversarial loss $\mathcal{L}_{\mathrm{adv}}$ is computed based on the discriminator outputs."
AudioBox: An automated evaluation framework providing aesthetic and perceptual audio metrics. "including Content Evaluation (CE), Content Understanding (CU), and Perceptual Quality (PQ) from AudioBox \cite{tjandra2025metaaudioboxaestheticsunified};"
Autoregressive modeling: Generating sequences by predicting each token conditioned on previous ones. "enabling efficient autoregressive modeling;"
Binary mask: An element-wise mask used to select or hide parts of a tensor during training/inference. "we apply a binary mask $m$"
Blind listening test: A subjective evaluation where listeners do not know which system produced each sample. "We conducted a blind listening test with five expert listeners with musical backgrounds."
Classifier-Free Guidance (CFG): A technique to control conditioning strength in generative models without an explicit classifier. "the classifier-free guidance (CFG) scale is set to 1.25."
Codebook: A finite set of vectors used to quantize continuous features into discrete tokens. "with $K = 8$ codebooks of vocabulary size $V = 8192$"
Content Evaluation (CE): An AudioBox metric assessing the content-related quality of audio. "including Content Evaluation (CE), Content Understanding (CU), and Perceptual Quality (PQ) from AudioBox"
Content Understanding (CU): An AudioBox metric gauging how well audio content is semantically conveyed. "including Content Understanding (CU)"
Cosine learning rate scheduler: A schedule that decays the learning rate following a cosine curve, often with warmup. "together with a cosine learning rate scheduler that includes a warm-up period of the first $3\%$ steps."
Cross-modal retrieval: Retrieving items across different modalities, such as matching text to audio. "enabling accurate music tagging and cross-modal retrieval,"
Diffusion Transformer: A transformer-based backbone for diffusion/flow models in generative tasks. "We utilize a Diffusion Transformer \cite{dit} backbone based on LLaMA architecture \cite{llama3}"
Diffusion-based music synthesis: Music generation using diffusion models that iteratively denoise signals. "as well as diffusion-based music synthesis~\cite{diffrhythm}"
Direct Preference Optimization (DPO): A preference-based alignment method that optimizes policies without explicit reward modeling. "Direct Preference Optimization (DPO) \cite{rafailov2023direct}"
ECAPA-TDNN: A neural architecture for robust speaker embeddings and verification. "Speaker Similarity (SPK_SIM) computed by ECAPA-TDNN model \cite{desplanques2020ecapa}"
Flow matching: A generative modeling approach learning a vector field to transport noise to data. "The latent distribution is modeled using flow matching \cite{rectifiedflow_reflow}"
Fréchet Audio Distance (FAD): A distributional metric measuring distance between sets of audio embeddings. "Fréchet Audio Distance (FAD)"
Global-Local architecture: A hierarchical design with a global model for structure and a local model for details. "Given HeartMuLa's Global-Local architecture, where a lightweight Local Transformer predicts a multi-layer RVQ,"
HiFi-GAN: A high-fidelity neural vocoder for waveform synthesis from spectral features. "followed by waveform synthesis via HiFi-GAN \cite{kong2020hifi}"
LLaMA architecture: Meta’s LLM transformer backbone used as a base. "based on LLaMA architecture \cite{llama3}"
Llama-3.2 tokenizer: The tokenization system associated with the Llama 3.2 model family. "Llama-3.2 tokenizer \cite{llama3}"
Mel VAE: A variational autoencoder operating on Mel-spectrogram representations. "Mel VAE, 1D VAE, and SQ-Codec"
MuEncoder: A pretrained (and fine-tuned) music encoder providing semantic/acoustic representations. "we use our training data to fine-tune the MuEncoder with the BEST-RQ \cite{best-rq,musicfm} loss."
MuQ-MuLan: A model for music tagging/embedding used for style and tag similarity features. "extracted via the MuQ-MuLan model \cite{zhu2025muq}."
Perceptual Evaluation of Speech Quality (PESQ): An objective speech quality metric correlating with human judgment. "Perceptual Evaluation of Speech Quality (PESQ)"
Perceptual Quality (PQ): An AudioBox metric assessing overall perceived quality of audio. "Perceptual Quality (PQ)"
Phoneme Error Rate (PER): An intelligibility metric measuring errors at the phoneme level. "measured by the phoneme error rate (PER)"
Query-based quantization: A tokenization method inserting learnable queries to summarize frames before quantization. "via a query-based quantization strategy \cite{almtokenizer}"
Real-Time Factor (RTF): The ratio of processing time to audio duration, indicating efficiency. "Real-Time Factor (RTF)"
Reflow distillation: A distillation technique for flow models that reduces sampling steps while preserving quality. "We perform reflow distillation on top of HeartCodec (Pt. {paper_content} Ft.)."
Residual Vector Quantization (RVQ): A multi-stage quantization scheme using residual codebooks for compact discrete tokens. "residual vector quantization (RVQ) module \cite{soundstream_rvq}"
Short-Time Objective Intelligibility (STOI): An objective intelligibility metric for speech (and vocals). "Short-Time Objective Intelligibility (STOI)"
SongEval: An automated evaluation suite measuring musical attributes like coherence and musicality. "SongEval \cite{yao2025songevalbenchmarkdatasetsong}"
Speaker Similarity (SPK_SIM): A metric quantifying how similar two speaker identities are. "Speaker Similarity (SPK_SIM)"
SQ-Codec: A continuous audio tokenizer (SimpleSpeech SQ-Codec) used for latent reconstruction. "select a 25 Hz SQ-Codec \cite{simplespeech_sqcodec}"
STFT spectrogram: A time-frequency representation computed via the Short-Time Fourier Transform. "an MSE loss computed on the STFT spectrogram."
Tag similarity (Tag-Sim.): Cosine similarity between audio and prompt tag embeddings to assess style adherence. "Tag Similarity (Tag-Sim.) defined as the cosine similarity between the embeddings of the generated audio and the prompt style tags,"
Vector field: A function mapping points to velocities; in flow matching, it transports noise to data. "a vector field $v_{\theta}$ transforms Gaussian noise $z_0 \sim \mathcal{N}(0, I)$ "
Virtual Speech Quality Objective Listener (VISQOL): An objective measure of perceived audio quality. "Virtual Speech Quality Objective Listener (VISQOL)"
WavLM: A pretrained speech model used to extract phonetic features. "WavLM phonetic features"
Weighted CrossEntropy Loss: A loss that assigns different weights to cross-entropy terms (e.g., across RVQ layers). "Weighted CrossEntropy Loss"
Whisper: A pretrained speech recognition encoder used for embeddings in the pipeline. "Whisper embeddings"
Word Error Rate (WER): An objective metric quantifying transcription errors at the word level. "Word Error Rate (WER)"
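
To make the Residual Vector Quantization entry above concrete, here is a toy sketch in which each stage quantizes whatever error the previous stages left behind. It uses scalar values and three tiny hand-written codebooks purely for illustration; HeartCodec operates on learned vector codebooks (K=8, V=8192), and none of the names or numbers below come from the paper.

```python
def nearest_index(codebook, value):
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))


def rvq_encode(value, codebooks):
    """Quantize value stage by stage, each stage encoding the remaining residual."""
    residual = value
    indices = []
    for codebook in codebooks:
        idx = nearest_index(codebook, residual)
        indices.append(idx)
        residual -= codebook[idx]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    return sum(codebook[i] for codebook, i in zip(codebooks, indices))


# Hypothetical coarse-to-fine codebooks (illustrative values only).
toy_codebooks = [[-1.0, 0.0, 1.0], [-0.25, 0.0, 0.25], [-0.05, 0.0, 0.05]]
codes = rvq_encode(0.8, toy_codebooks)
print(codes)  # prints "[2, 0, 2]"
print(round(rvq_decode(codes, toy_codebooks), 6))  # prints "0.8"
```

With only the first codebook the reconstruction would be 1.0 (error 0.2); each additional stage shrinks the error, which is the sense in which later RVQ layers carry finer detail.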

Practical Applications

Immediate Applications

These applications can be deployed with the current HeartMuLa suite (HeartCLAP, HeartTranscriptor, HeartCodec, HeartMuLa), given standard GPU inference and the released open-source assets.

Short-video background music generation at scale
- Sectors: media/entertainment, social platforms, advertising
- What: Generate short, engaging BGM tailored to tags and brief prompts using HeartMuLa’s “short-music” mode; batch-generate variants for A/B testing of ads.
- Tools/products/workflows: “BGM Autopilot” API for TikTok/Reels/Shorts editors; plug-ins for Adobe Premiere/CapCut/DaVinci to render multiple 10–30s options from text tags (genre, mood, instruments).
- Assumptions/dependencies: GPU-backed inference; prompt and tag curation pipeline; platform policy compliance for synthetic audio disclosures.
Creator-focused song prototyping with section-level control
- Sectors: music production, independent creators, advertising
- What: Draft complete song structures (intro/verse/chorus/bridge) with natural-language prompts for each section; condition on lyrics and a reference audio clip for stylistic guidance.
- Tools/products/workflows: DAW plug-in (VST/AU) that renders section-by-section drafts; “Jingle Generator” for ad agencies, templated by structure and slogan-like lyrics.
- Assumptions/dependencies: Clear user prompts/lyrics; DAW integration wrappers; legal guidance on commercial release of synthetic works.
Text-to-music retrieval, tagging, and recommendation
- Sectors: streaming, catalog management, music discovery
- What: Use HeartCLAP embeddings to power cross-modal search and auto-tagging of catalogs; enable “describe what you want” music search for editors or end users.
- Tools/products/workflows: Indexing pipelines that embed tracks and text tags; retrieval APIs for editorial teams and consumers.
- Assumptions/dependencies: Domain adaptation for specific catalogs; governance for tag ontologies and bias mitigation.
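
Cross-modal search of the kind described above typically reduces to nearest-neighbor ranking in a shared embedding space. The sketch below assumes track and text embeddings have already been computed by some audio-text model; the toy 3-dimensional vectors, the track IDs, and the function names are placeholders, and no HeartCLAP API is shown or implied.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_tracks(query_vec, track_vecs, top_k=3):
    """Return the top_k (track_id, score) pairs, most similar first."""
    scored = [(cosine(query_vec, vec), track_id) for track_id, vec in track_vecs.items()]
    scored.sort(reverse=True)
    return [(track_id, score) for score, track_id in scored[:top_k]]


# Toy vectors standing in for precomputed embeddings (placeholders).
catalog = {
    "track_a": [0.9, 0.1, 0.0],
    "track_b": [0.1, 0.9, 0.1],
    "track_c": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]  # stands in for an embedded text description
results = rank_tracks(query, catalog, top_k=2)
print(results)
```

For large catalogs, the linear scan would normally be replaced by an approximate nearest-neighbor index; the ranking logic stays the same.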
Lyric transcription for subtitling, karaoke, and rights workflows
- Sectors: media localization, streaming, UGC moderation, publishing
- What: HeartTranscriptor extracts lyrics from mixed music for subtitles, karaoke timing, lyric synchronization, and metadata enrichment; supports quality checks via PER/WER.
- Tools/products/workflows: “Auto-Sub Lyric” batch service for UGC platforms; QC dashboards using intelligibility metrics.
- Assumptions/dependencies: Language coverage/accuracy on target markets; human-in-the-loop for quality-critical releases.
Efficient music tokenization for dataset compression and training
- Sectors: MLOps, research labs, music tech startups
- What: HeartCodec’s 12.5 Hz tokens reduce sequence lengths and storage costs for training/generation, enabling longer-context modeling and faster AR inference.
- Tools/products/workflows: Dataset tokenization pipelines; caching/token stores; streaming generation servers with low-latency token decoding.
- Assumptions/dependencies: Adoption of SQ-Codec backend; reproducible preprocessing; licensing for pretrained components (Whisper/WavLM).
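
The sequence-length saving from a 12.5 Hz frame rate can be checked with simple arithmetic. In this sketch, the 12.5 Hz rate comes from the paper; the 50 Hz comparison rate and the codebook count of 8 are illustrative assumptions, not figures from the paper.

```python
import math

FRAME_RATE_HZ = 12.5  # HeartCodec frame rate stated in the paper


def num_frames(duration_s, frame_rate_hz=FRAME_RATE_HZ):
    """Number of codec frames needed to cover duration_s seconds."""
    return math.ceil(duration_s * frame_rate_hz)


def num_tokens(duration_s, num_codebooks, frame_rate_hz=FRAME_RATE_HZ):
    """Total discrete tokens if every frame carries num_codebooks codes."""
    return num_frames(duration_s, frame_rate_hz) * num_codebooks


song_s = 180  # a three-minute song
print(num_frames(song_s))          # frames at 12.5 Hz
print(num_frames(song_s, 50.0))    # frames at a hypothetical 50 Hz codec
print(num_tokens(song_s, 8))       # assumes 8 codebooks (illustrative only)
```

Under these assumptions, a three-minute song needs 2,250 frames at 12.5 Hz versus 9,000 at 50 Hz, a 4x reduction in the temporal sequence an autoregressive model must step through.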
Quality assurance and alignment gating for generative outputs
- Sectors: platform safety, production QA, enterprise media
- What: Use Tag Similarity (MuQ-MuLan), AudioBox, SongEval, and PER/WER to automatically filter low-quality or off-spec generations before delivery.
- Tools/products/workflows: “Gen-Audio QA Gate” that scores generations and enforces thresholds for style adherence, intelligibility, and aesthetics.
- Assumptions/dependencies: Access to evaluation models; thresholds tuned per use case; acceptance of proxy metrics in production SLAs.
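
A gating step like the one described above can be sketched as a threshold check over precomputed scores. Everything specific here is hypothetical: the score keys, the threshold defaults, and the failure labels are placeholders to be tuned per use case, and the scores themselves are assumed to come from separately run evaluation models.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GateThresholds:
    # Illustrative placeholder values, not thresholds from the paper.
    min_tag_similarity: float = 0.25
    min_aesthetic_score: float = 6.0
    max_per: float = 0.20


def gate_failures(
    scores: Dict[str, float],
    thresholds: GateThresholds = GateThresholds(),
) -> List[str]:
    """Return the list of failed checks; an empty list means the clip passes."""
    failures = []
    if scores["tag_similarity"] < thresholds.min_tag_similarity:
        failures.append("style adherence")
    if scores["aesthetic_score"] < thresholds.min_aesthetic_score:
        failures.append("aesthetics")
    if scores["per"] > thresholds.max_per:
        failures.append("intelligibility")
    return failures


good = {"tag_similarity": 0.31, "aesthetic_score": 7.2, "per": 0.08}
bad = {"tag_similarity": 0.10, "aesthetic_score": 7.2, "per": 0.35}
print(gate_failures(good))
print(gate_failures(bad))
```

Returning the reasons rather than a bare pass/fail flag makes it easier to route rejected generations to regeneration or human review.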
Educational tools for songwriting and arrangement practice
- Sectors: education, edtech, community music programs
- What: Students input lyrics and structural markers to hear multiple stylistic realizations; compare versions to learn about form, arrangement, and mood.
- Tools/products/workflows: Classroom web app with section-by-section prompts; teacher dashboards to manage assignments and references.
- Assumptions/dependencies: GPU or managed inference; content safety filters for classrooms; licensing for public demos.
Game and app soundtracks that fit genre/mood briefs
- Sectors: gaming, mobile apps, indie studios
- What: Generate loopable, stylistically consistent tracks and stingers from tag prompts, with long-form coherence up to minutes for level themes.
- Tools/products/workflows: Build-time asset generation; runtime selection of pre-generated variants by level/mood.
- Assumptions/dependencies: Looping support handled at DAW/post step; creative review for brand fit.
Dataset curation and benchmarking for music ML
- Sectors: academia, industrial research, startups
- What: Reuse the published training/eval pipeline (AudioBox, SongEval, Tag-Sim, WER/PER) for model comparisons and ablation studies; attempt to reproduce the paper's reported Suno-level results on academic-scale hardware.
- Tools/products/workflows: Reproducible training scripts; benchmark suites; demo notebooks with HeartMuLa-oss-3B and 7B variants.
- Assumptions/dependencies: Access to multi-GPU nodes (A100-class) or cloud; data licensing for curated corpora.
Compliance and moderation aids for UGC audio
- Sectors: platforms, compliance, trust & safety
- What: Use HeartTranscriptor to detect explicit lyrical content; HeartCLAP to flag mismatched/unsafe tags; route flagged items for review.
- Tools/products/workflows: Moderation queues augmented with PER/Tag-Sim signals; policy-based routing.
- Assumptions/dependencies: Policy definitions; evaluator calibration; regional legal requirements.

Long-Term Applications

These require further research, scaling, or productization beyond the current release (e.g., model distillation/quantization, additional modalities, expanded datasets, regulatory alignment).

Real-time, interactive co-creation and live performance
- Sectors: live music, streaming, creator tools
- What: Low-latency token streaming and conditional control for improvisation with human performers; dynamic prompt changes per song section.
- Dependencies: Further latency reductions, on-device inference or fast edge serving, robust beat/tempo synchronization.
Adaptive, context-aware game and XR soundtracks
- Sectors: gaming, AR/VR
- What: Runtime music that adapts to gameplay state, narrative arcs, and player emotion using fine-grained section control and style tags.
- Dependencies: Stable real-time generation, state-to-tag mapping pipelines, fail-safe fallbacks.
On-device/mobile music generation
- Sectors: consumer software, hardware OEMs
- What: Local generation using quantized/distilled 3B or smaller models leveraging HeartCodec’s low-frame-rate tokens to conserve compute and battery.
- Dependencies: Aggressive model compression, hardware acceleration (NPUs/GPUs), memory-optimized decoding.
Personalized wellness and music therapy tools
- Sectors: healthcare, wellness apps
- What: Generate calming/energizing music tailored to preferences or biofeedback; DPO-like preference alignment for clinically meaningful outcomes.
- Dependencies: Clinical evaluation, ethical frameworks, robust safety filters; data consent for personalization.
Multilingual lyric generation and transcription at parity
- Sectors: global media, localization
- What: End-to-end multi-language lyric conditioning and accurate transcription across languages for karaoke/subtitling and generation.
- Dependencies: Training on multilingual corpora, evaluation datasets beyond English, cultural/linguistic guardrails.
DAW-native, professional-grade production workflows
- Sectors: studio production, post-production
- What: Deep integration for arrangement drafts, style iteration, and revision control; eventual multi-track/stem-aware generation for mixing workflows.
- Dependencies: Model extensions for stem/multi-track control, latency improvements, IP and crediting standards.
Automated audio ad and sonic branding optimization
- Sectors: advertising, marketing tech
- What: Generate many on-brief variants; optimize via preference learning and brand/style adherence metrics; maintain structural templates for campaigns.
- Dependencies: Brand-safe style constraint systems, human feedback loops, measurement integrations.
Open evaluation standards for synthetic music quality and safety
- Sectors: policy, standards bodies, platforms
- What: Formalize metric suites (AudioBox, SongEval, PER/Tag-Sim) as auditing criteria for deployed generative music systems.
- Dependencies: Cross-industry consensus, dataset governance, transparent reporting and appeals processes.
Accessibility-first audio generation
- Sectors: accessibility, education, productivity
- What: Tailored music/soundscapes to support focus, reading, or communication needs (e.g., tempo/mood constraints, predictable structure).
- Dependencies: User studies with diverse populations, safe defaults, optional clinician oversight.
Research on controllability, safety, and provenance
- Sectors: academia, policy, platform safety
- What: Expand controllable attributes (tempo maps, key changes), watermarking/provenance, and misappropriation safeguards (e.g., voice timbre protection).
- Dependencies: New control tokens/architectures, standardized watermarking, legal and ethical guidelines.

Notes on Feasibility, Assumptions, and Dependencies

Compute and latency: While HeartMuLa-oss-3B and 7B demonstrate strong results, production use typically requires GPU-backed inference or further model compression; real-time use cases depend on additional latency reductions.
Data licensing and IP: Commercial deployment must ensure training and generation align with licensing and copyright policies; reference-audio conditioning should be restricted to lawful inputs.
Safety and identity: The paper deliberately uses MuQ-MuLan embeddings that avoid speaker timbre; products should preserve this constraint or implement safeguards to prevent voice cloning.
Evaluation metrics: Automated scores from AudioBox, SongEval, Tag-Sim, and PER are proxies for perceived quality; human review remains critical for high-stakes releases.
Language and domain coverage: HeartTranscriptor and generation quality may vary across languages and genres not well represented in training; domain adaptation may be needed.
Integration effort: DAW plug-ins, SDKs, and platform APIs will require engineering beyond the core models (UX, caching, batching, monitoring, and observability).
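
To make the compute note above concrete, the following back-of-the-envelope sketch estimates the memory needed just to hold model weights at different precisions. It assumes nominal parameter counts of exactly 3e9 and 7e9 and ignores activations, KV cache, the codec, and framework overhead, so real requirements will be higher.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params, dtype):
    """Approximate memory (GiB) for the weights alone at the given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for label, params in [("3B", 3e9), ("7B", 7e9)]:  # nominal sizes (assumption)
    for dtype in ("fp16", "int8", "int4"):
        print(f"{label} {dtype}: {weight_memory_gib(params, dtype):.1f} GiB")
```

Under these assumptions, the 7B weights take roughly 13 GiB in fp16 and about 3.3 GiB at 4-bit precision, which is why quantization or distillation is listed as a prerequisite for on-device use.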

