JEN-1 Composer: Multi-Track Music Generation
- JEN-1 Composer is a unified framework for high-fidelity, controllable multi-track music generation using audio latent diffusion, extending single-track methods to jointly synthesize tracks.
- It employs a progressive curriculum training strategy that incrementally teaches the model to generate from marginal, conditional, and joint multi-track distributions.
- The system supports interactive inference with cross-attention for text guidance, allowing iterative track refinement, and reports state-of-the-art results on text-to-music generation benchmarks.
JEN-1 Composer is a unified framework for high-fidelity, controllable multi-track music generation based on audio latent diffusion modeling. It extends single-track diffusion-based paradigms (notably JEN-1) to a multi-track setting, enabling marginal, conditional, and joint synthesis of aligned musical tracks within one model. By employing a progressive curriculum training strategy, JEN-1 Composer supports flexible Human–AI co-composition workflows and achieves state-of-the-art performance on text-to-music multi-track generation tasks (Yao et al., 2023). The system integrates tightly with text guidance via cross-attention and supports granular, iterative track manipulation, accommodating professional compositional use cases.
1. Model Architecture and Diffusion Foundation
JEN-1 Composer generalizes the single-track latent diffusion backbone of JEN-1 to multi-track music generation. Each of the $N$ input tracks (e.g., bass, drums, melody, instrument) is encoded into a shared latent space using a pre-trained encoder, and the resulting latent tensors are concatenated channel-wise. The backbone is a 1D U-Net architecture augmented with cross-attention to text features (from a frozen FLAN-T5 encoder) and per-track timestep embeddings.
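The channel-wise concatenation described above can be sketched on toy data. This is a minimal illustration, not the model's code: latents are plain nested lists shaped `[channels][frames]`, and the names `concat_tracks`, `split_tracks`, and the two-channel toy latents are assumptions introduced here.

```python
from typing import Dict, List

Latent = List[List[float]]  # toy latent: [channels][frames]


def concat_tracks(latents: Dict[str, Latent], order: List[str]) -> Latent:
    """Stack per-track latents along the channel axis in a fixed track order."""
    frames = len(latents[order[0]][0])
    stacked: Latent = []
    for name in order:
        for channel in latents[name]:
            if len(channel) != frames:
                raise ValueError(f"track {name!r} is not time-aligned")
            stacked.append(list(channel))
    return stacked


def split_tracks(stacked: Latent, order: List[str], channels: int) -> Dict[str, Latent]:
    """Inverse of concat_tracks when every track has `channels` channels."""
    return {
        name: stacked[i * channels:(i + 1) * channels]
        for i, name in enumerate(order)
    }


# Hypothetical toy latents: 4 tracks, 2 channels each, 2 frames.
order = ["bass", "drums", "instrument", "melody"]
toy = {
    name: [[float(i), float(i)], [float(-i), float(-i)]]
    for i, name in enumerate(order)
}
joint = concat_tracks(toy, order)  # 4 tracks x 2 channels -> 8 channels
```

Keeping a fixed track order is what lets the network associate each channel group with one track, and lets the output be split back into per-track latents for decoding.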
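The forward diffusion process for each latent $z_0$ is:

$$q(z_t \mid z_0) = \mathcal{N}\left(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1 - \bar{\alpha}_t)\, I\right)$$

where $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$ is the cumulative noise schedule. The reverse (denoising) model $\epsilon_\theta$ is trained with an L2 loss on the injected noise $\epsilon$, given text conditioning $c$:

$$\mathcal{L} = \mathbb{E}_{z_0,\, \epsilon,\, t}\left[\left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert_2^2\right]$$

At inference, denoising adopts a DDPM-style update, with the mean and variance determined by the predicted noise and the schedule.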
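The forward noising step and the L2 noise-prediction loss can be sketched as follows, assuming the standard DDPM parameterization $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. The linear beta schedule, the step count `T = 1000`, and all function names are assumptions for illustration; the source does not state the schedule actually used.

```python
import math
import random
from typing import List, Tuple


def cumulative_alphas(betas: List[float]) -> List[float]:
    """alpha_bar[t] = product of (1 - beta_s) for all s <= t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(z0: List[float], t: int, alpha_bar: List[float],
             rng: random.Random) -> Tuple[List[float], List[float]]:
    """Draw z_t from q(z_t | z_0) and return it with the noise that was used."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    z_t = [signal * z + noise * e for z, e in zip(z0, eps)]
    return z_t, eps


def l2_loss(eps: List[float], eps_pred: List[float]) -> float:
    """Mean squared error between true and predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)


# Assumed linear schedule; the real schedule is not given in the text.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar = cumulative_alphas(betas)
z_t, eps = q_sample([0.5, -1.0, 0.25], t=500, alpha_bar=alpha_bar,
                    rng=random.Random(0))
```

A perfect noise predictor would give `l2_loss(eps, eps_pred) == 0`, and the clean latent can then be recovered from `z_t` by inverting the noising formula.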
To accommodate marginal, conditional, and joint multi-track distributions within a single network, each track $i$ is assigned its own timestep $t_i \in \{0, \dots, T\}$:
- $t_i = 0$: Track $i$ is unperturbed, used as a conditioning signal.
- $0 < t_i < T$: Track $i$ is partially noised and jointly reconstructed.
- $t_i = T$: Track $i$ is fully noised, treated as a target for marginal generation.
Sampling the vector $(t_1, \dots, t_N)$ during training enables the model to learn the entire spectrum of marginals, conditionals, and joint distributions (Yao et al., 2023).
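One way to picture the per-track timestep vector is as a mapping from each track's role to a timestep. The sketch below is an assumption-laden illustration: the role names (`"condition"`, `"joint"`, `"marginal"`), the helper names, and `T = 1000` are all hypothetical and not taken from the paper.

```python
import random
from typing import Dict, List

T = 1000  # assumed number of diffusion steps


def timestep_vector(roles: Dict[str, str], t: int, T: int = T) -> Dict[str, int]:
    """Map each track's role to its own timestep.

    condition -> 0 (clean), joint -> shared t, marginal -> T (fully noised).
    """
    if not 0 < t < T:
        raise ValueError("shared timestep t must satisfy 0 < t < T")
    table = {"condition": 0, "joint": t, "marginal": T}
    return {track: table[role] for track, role in roles.items()}


def sample_roles(tracks: List[str], rng: random.Random) -> Dict[str, str]:
    """Randomly assign a role to every track, as one might during training."""
    return {t: rng.choice(["condition", "joint", "marginal"]) for t in tracks}


rng = random.Random(0)
roles = {"bass": "condition", "drums": "joint",
         "melody": "joint", "instrument": "marginal"}
vec = timestep_vector(roles, t=rng.randrange(1, T))
```

Here the bass is a clean conditioning signal, drums and melody share one intermediate timestep and are denoised together, and the instrument track sits at pure noise.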
2. Progressive Curriculum Training Strategy
To enhance generalization and controllability, JEN-1 Composer employs a progressive curriculum training approach. For $N$ tracks, training proceeds in $N$ stages:
- Stage 1: Randomly select 1 track for generation, conditioning on the $N-1$ clean tracks.
- Stage 2: Generate 2 tracks, conditioning on the remaining $N-2$, and so on.
- Stage N: Generate all tracks jointly from noise.
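Stage selection at each iteration is governed by a non-decreasing probability schedule; harder stages become more frequent over time. A typical schedule has a single hyperparameter that controls the sharpness of the curriculum. The training loss for the subset $S$ of tracks to be generated is:

$$\mathcal{L}_S = \mathbb{E}\left[\left\lVert \epsilon_S - \epsilon_\theta\left(z_t^{S}, z_0^{\bar{S}}, \mathbf{t}, c\right) \right\rVert_2^2\right]$$

where $z_t^{S}$ are the noised latents for tracks in $S$, $z_0^{\bar{S}}$ are the clean latents of the remaining tracks, $\mathbf{t}$ aggregates all per-track timesteps, $c$ is the text conditioning, and $\epsilon_S$ is the noise added to the tracks in $S$.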
Maintaining a nonzero selection probability for all curriculum stages preserves the model's ability on easier (conditional) tasks while emphasizing joint generation as training progresses.
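The stage-selection idea can be sketched as below. The schedule in `stage_probs` is an illustrative stand-in chosen for this example, not the formula from the paper: it is a softmax over stage indices whose tilt moves from easy to hard stages as training progress goes from 0 to 1, and it keeps every stage at nonzero probability. The names `stage_probs`, `sample_task`, `progress`, and `sharpness` are hypothetical.

```python
import math
import random
from typing import List, Tuple


def stage_probs(num_tracks: int, progress: float,
                sharpness: float = 3.0) -> List[float]:
    """Illustrative stand-in schedule; NOT the formula from the paper.

    progress in [0, 1]: 0 favors stage 1 (easiest), 1 favors the joint stage.
    """
    span = max(num_tracks - 1, 1)
    tilt = sharpness * (2.0 * progress - 1.0)
    weights = [math.exp(tilt * k / span) for k in range(num_tracks)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_task(tracks: List[str], progress: float,
                rng: random.Random) -> Tuple[List[str], List[str]]:
    """Pick a stage k, then choose k tracks to generate; the rest stay clean."""
    probs = stage_probs(len(tracks), progress)
    stage = rng.choices(range(1, len(tracks) + 1), weights=probs, k=1)[0]
    generate = rng.sample(tracks, stage)
    condition = [t for t in tracks if t not in generate]
    return generate, condition


tracks = ["bass", "drums", "instrument", "melody"]
early = stage_probs(len(tracks), progress=0.0)
late = stage_probs(len(tracks), progress=1.0)
generate, condition = sample_task(tracks, progress=0.8, rng=random.Random(0))
```

Because the weights are exponentials, no stage ever drops to zero probability, which mirrors the point above about retaining the easier conditional tasks late in training.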
3. Interactive Inference and Human–AI Co-composition
JEN-1 Composer supports an interactive workflow that aligns with professional music production practice.
Sampling Procedure (Pseudocode Overview)
Tracks whose timestep is set to 0 are held fixed, and only tracks designated for regeneration are denoised. Classifier-free guidance can be applied to tune the strength of the text conditioning during this process.
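A minimal sketch of such a sampling loop is given below, assuming a standard DDPM ancestral update and the usual classifier-free guidance combination. It is not the paper's procedure: the `predict_noise` callable, its signature, the `regenerate` function, and the zero-output `zero_denoiser` used for the demonstration are all hypothetical stand-ins for the trained U-Net.

```python
import math
import random
from typing import Callable, Dict, List, Optional

Latents = Dict[str, List[float]]
# Hypothetical denoiser signature: (latents, per-track timesteps, prompt) -> noise
Denoiser = Callable[[Latents, Dict[str, int], Optional[str]], Latents]


def regenerate(fixed: Latents, targets: List[str], length: int,
               betas: List[float], predict_noise: Denoiser, prompt: str,
               guidance: float = 1.0, seed: int = 0) -> Latents:
    rng = random.Random(seed)
    alphas = [1.0 - b for b in betas]
    alpha_bar, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bar.append(prod)

    # Targets start from Gaussian noise; fixed tracks keep their clean latents.
    state: Latents = {k: list(v) for k, v in fixed.items()}
    for name in targets:
        state[name] = [rng.gauss(0.0, 1.0) for _ in range(length)]

    for step in reversed(range(len(betas))):
        # Per-track timesteps: 0 for held tracks, the current step for targets.
        ts = {n: (step + 1 if n in targets else 0) for n in state}
        eps_c = predict_noise(state, ts, prompt)  # text-conditioned
        eps_u = predict_noise(state, ts, None)    # unconditional
        coef = betas[step] / math.sqrt(1.0 - alpha_bar[step])
        sigma = math.sqrt(betas[step]) if step > 0 else 0.0
        for name in targets:  # only regenerated tracks are updated
            new = []
            for z, ec, eu in zip(state[name], eps_c[name], eps_u[name]):
                eps = eu + guidance * (ec - eu)  # classifier-free guidance
                mean = (z - coef * eps) / math.sqrt(alphas[step])
                new.append(mean + sigma * rng.gauss(0.0, 1.0))
            state[name] = new
    return state


def zero_denoiser(latents: Latents, ts: Dict[str, int],
                  prompt: Optional[str]) -> Latents:
    """Placeholder model that predicts zero noise; for demonstration only."""
    return {k: [0.0] * len(v) for k, v in latents.items()}


fixed = {"drums": [0.1, -0.2, 0.3], "bass": [0.0, 0.5, -0.5]}
betas = [0.01 * (i + 1) for i in range(10)]
out = regenerate(fixed, ["melody"], 3, betas, zero_denoiser, "jazz trio",
                 guidance=2.0)
```

An iterative session would call `regenerate` repeatedly, moving accepted tracks into `fixed` and listing only the tracks still being revised in `targets`.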
This co-compositional approach allows users to iteratively refine, fix, or regenerate subcomponents (e.g., drum, bass, melody), supporting granular creative workflows (Yao et al., 2023).
4. Datasets, Metrics, and Experimental Results
JEN-1 Composer was trained and evaluated on 800 hours of private multi-track studio recordings (5 aligned tracks: bass, drums, instrument, melody, mix), with splits of 80% train and 20% test. Each sample includes rich text metadata describing genre, tempo, mood, and instrumentation.
Metrics:
- CLAP alignment score: similarity between the text prompt and the generated audio, reported per track and for the mixed sum.
- Human Relative Preference Ratio (RPR): percentage of pairwise preference wins over other models.
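Both metrics reduce to simple arithmetic once the inputs exist. The sketch below assumes the CLAP score is the cosine similarity between a text embedding and an audio embedding (the common convention), and that RPR is the share of pairwise comparisons a model wins. The embeddings and votes are made-up toy values, and the function names are hypothetical; producing real embeddings requires a CLAP model, which is outside this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relative_preference_ratio(votes: Sequence[bool]) -> float:
    """Percentage of pairwise comparisons in which the model was preferred."""
    return 100.0 * sum(votes) / len(votes)


# Toy stand-ins for CLAP text/audio embeddings (not real model outputs).
text_emb = [1.0, 0.0]
audio_emb = [1.0, 0.0]
score = cosine_similarity(text_emb, audio_emb)

# Toy listener votes: True means the evaluated model won the comparison.
rpr = relative_preference_ratio([True, False, False, False])
```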
Baselines: MusicLM, MusicGen, JEN-1 (single-track).
Quantitative Results
| Methods | Bass | Drums | Inst. | Melody | Mixed | RPR |
|---|---|---|---|---|---|---|
| MusicLM | 0.16 | 0.17 | 0.23 | 0.28 | 0.28 | 27% |
| MusicGen | 0.17 | 0.15 | 0.25 | 0.33 | 0.35 | 36% |
| JEN-1 | 0.19 | 0.16 | 0.29 | 0.32 | 0.36 | 40% |
| JEN-1 Composer | 0.21 | 0.18 | 0.29 | 0.36 | 0.39 | — |
Ablation studies isolate the effect of per-track timestep vectors, curriculum training, and interactive inference, showing incremental improvements in CLAP and RPR scores.
Qualitative analyses indicate effective co-creation in genres such as jazz (iterative drum-bass-melody refinement), electronic dance (percussive anchors and dynamic synth shaping), and orchestral mock-up (piano motif to strings/brass generation). Real-time demos on A100 GPUs indicate ~5 s latency for 30-second, 4-track generations.
5. Relationship to JEN-1 and Prior Work
JEN-1 Composer is a direct descendant of the JEN-1 model (Li et al., 2023), which demonstrated universal text-to-music generation via omnidirectional diffusion over audio latents. JEN-1 employs a masked autoencoder for end-to-end latent representation, a 1D U-Net with cross-attention to FLAN-T5 embeddings, and multi-task heads for text-to-music, inpainting, and continuation. Key formulations, such as the $\epsilon$-matching (noise-prediction) diffusion objective and classifier-free guidance, are maintained and extended in JEN-1 Composer.
Unlike JEN-1, which performs single-track generation and conditioning via latent masking, JEN-1 Composer generalizes this approach to $N$ tracks, employing track-specific noising schedules, channel-wise latent concatenation, and curriculum-based training. Together these enable variable-conditional, joint, and marginal multi-track synthesis in a unified framework (Yao et al., 2023).
6. Limitations, Future Directions, and Open Questions
JEN-1 Composer does not explicitly model music theory constructs such as chord progressions or voice leading, which occasionally results in timbral drift or structural inconsistencies across tracks and refinement iterations. The system's user interface sometimes leaves novices uncertain of optimal track fixation strategies.
Suggested paths for improvement include:
- Integrating symbolic-audio hybrid conditioning (e.g., using chord labels or MIDI prompts).
- Introducing music-theoretic objectives (e.g., chord recognition loss) into curriculum stages.
- Extending the generative temporal context for larger structural coherence (verse/chorus).
Open research questions highlighted include the seamless integration of score-level editing (MIDI) with raw-audio diffusion, learning disentangled latent spaces for granular control (timbre, melody, rhythm), and the design of intuitive user interfaces for non-expert guided co-composition (Yao et al., 2023).
7. Significance and Research Impact
JEN-1 Composer marks a substantial advance in controllable, high-fidelity, multi-track music generation and co-creative workflows. Its unified latent-diffusion methodology, paired with curriculum training and track-wise interactive inference, bridges a longstanding gap between direct multi-track synthesis and the professional practices of DAW-based composition. By attaining state-of-the-art performance in CLAP-based text-to-audio alignment and strong human preference scores, it establishes a foundational framework for future AI-assisted music creation systems, affecting both research exploration and practical music production paradigms (Yao et al., 2023).