
JEN-1 Composer: Multi-Track Music Generation

Updated 1 May 2026
  • JEN-1 Composer is a unified framework for high-fidelity, controllable multi-track music generation using audio latent diffusion, extending single-track methods to jointly synthesize tracks.
  • It employs a progressive curriculum training strategy that incrementally teaches the model to generate from marginal, conditional, and joint multi-track distributions.
  • The system supports interactive inference with cross-attention for text guidance, allowing iterative track refinement and achieving state-of-the-art results on text-to-music generation benchmarks.

JEN-1 Composer is a unified framework for high-fidelity, controllable multi-track music generation based on audio latent diffusion modeling. It extends single-track diffusion-based paradigms (notably JEN-1) to a multi-track setting, enabling marginal, conditional, and joint synthesis of aligned musical tracks within one model. By employing a progressive curriculum training strategy, JEN-1 Composer supports flexible Human–AI co-composition workflows and achieves state-of-the-art performance on text-to-music multi-track generation tasks (Yao et al., 2023). The system integrates tightly with text guidance via cross-attention and supports granular, iterative track manipulation, accommodating professional compositional use cases.

1. Model Architecture and Diffusion Foundation

JEN-1 Composer generalizes the single-track latent diffusion backbone of JEN-1 to multi-track music generation. Each of the $N$ input tracks (e.g., bass, drums, melody, instrument) is encoded into a shared latent space using a pre-trained encoder $f_\phi$, and the resulting latent tensors are concatenated channel-wise. The backbone is a 1D U-Net architecture augmented with cross-attention to text features (from a frozen FLAN-T5 encoder) and per-track timestep embeddings.

The forward diffusion process for each latent $z_0 \in \mathbb{R}^{c \times T}$ is:

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$

where $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$ is the cumulative noise schedule. The reverse (denoising) model $\epsilon_\theta(z_t, t)$ is trained with an L2 loss:

$$L(\theta) = \mathbb{E}_{t, z_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(z_t, t)\|_2^2\right].$$
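The forward noising step and the noise-matching loss above can be sketched in a few lines. This is a minimal illustration in plain Python, with flat lists standing in for latent tensors; the linear beta schedule and the helper names `make_alpha_bar`, `q_sample`, and `squared_l2` are assumptions made for the example and are not taken from the paper.

```python
import math
import random


def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return alpha_bar[t] for t = 0..num_steps, with alpha_bar[0] = 1.

    The linear beta schedule is an assumption for illustration only.
    """
    alpha_bar = [1.0]
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        alpha_bar.append(alpha_bar[-1] * (1.0 - beta))
    return alpha_bar


def q_sample(z0, t, alpha_bar, rng):
    """Draw z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bar[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(z0, eps)]
    return zt, eps


def squared_l2(eps, eps_pred):
    """Squared L2 distance between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


rng = random.Random(0)
alpha_bar = make_alpha_bar(1000)
z0 = [0.5, -1.0, 0.25, 2.0]           # toy flattened latent
zt, eps = q_sample(z0, 500, alpha_bar, rng)
loss_for_zero_guess = squared_l2(eps, [0.0] * len(eps))
```

In training, the second argument of `squared_l2` would be the network output $\epsilon_\theta(z_t, t)$; the all-zero guess here only keeps the sketch self-contained.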

At inference, denoising uses a DDPM-style update whose mean $\mu_\theta(z_t, t)$ and variance are determined by the predicted noise and the noise schedule.
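For reference, the standard DDPM ancestral update has the following form, written with the same symbols as the forward process. The source does not state which variance is used, so the choice $\sigma_t^2 = 1 - \alpha_t$ below is one common option and should be read as an assumption:

$$\mu_\theta(z_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(z_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(z_t, t)\right), \qquad z_{t-1} = \mu_\theta(z_t, t) + \sigma_t\,\xi, \quad \xi \sim \mathcal{N}(0, I),$$

with $\sigma_t^2 = 1 - \alpha_t$ and the noise term $\xi$ omitted on the final step.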

To accommodate marginal, conditional, and joint multi-track distributions within a single network, each track $i$ is assigned its own timestep $t_i \in \{0, 1, \ldots, T\}$:

  • $t_i = 0$: Track $i$ is unperturbed and is used as a conditioning signal.
  • $0 < t_i < T$: Track $i$ is partially noised and is jointly reconstructed.
  • $t_i = T$: Track $i$ is fully noised and is treated as a target for marginal generation.

Sampling the timestep vector $(t_1, \ldots, t_N)$ during training enables the model to learn the entire spectrum of marginals, conditionals, and joint distributions (Yao et al., 2023).
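A small sketch may help make the per-track timestep idea concrete. It noises each track at its own level and then concatenates the tracks channel-wise, as described for the network input. The function name `noise_tracks`, the nested-list layout (tracks, then channels, then time), and the toy linear $\bar{\alpha}$ schedule are assumptions for illustration only.

```python
import math
import random


def noise_tracks(latents, timesteps, alpha_bar, rng):
    """Noise every track at its own timestep, then concatenate channel-wise.

    latents:   list of N tracks; each track is a list of c channels (lists of floats)
    timesteps: list of N ints; t_i = 0 leaves track i clean (conditioning signal)
    """
    noised, noises = [], []
    for track, t in zip(latents, timesteps):
        a = alpha_bar[t]
        eps = [[rng.gauss(0.0, 1.0) for _ in channel] for channel in track]
        noised.append([
            [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(ch, ech)]
            for ch, ech in zip(track, eps)
        ])
        noises.append(eps)
    # Channel-wise concatenation: N tracks of c channels become N * c channels.
    stacked = [channel for track in noised for channel in track]
    return stacked, noises


max_t = 10
alpha_bar = [1.0 - t / max_t for t in range(max_t + 1)]  # toy schedule (assumption)
rng = random.Random(0)
latents = [[[1.0, 2.0, 3.0]], [[-1.0, 0.0, 1.0]], [[0.5, 0.5, 0.5]]]  # 3 tracks, c = 1
timesteps = [0, 4, max_t]  # clean condition, partially noised, fully noised
stacked, noises = noise_tracks(latents, timesteps, alpha_bar, rng)
```

With this toy schedule, the first track passes through unchanged and the last one is replaced entirely by Gaussian noise, which mirrors the three cases listed above.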

2. Progressive Curriculum Training Strategy

To enhance generalization and controllability, JEN-1 Composer employs a progressive curriculum training approach. For $N$ tracks, training proceeds in $N$ stages:

  • Stage 1: Randomly select 1 track for generation and condition on the remaining $N-1$ clean tracks.
  • Stage 2: Generate 2 tracks and condition on the remaining $N-2$ clean tracks, and so on.
  • Stage $N$: Generate all tracks jointly from noise.

Stage selection at each iteration is governed by a probability schedule over stages in which the probability of the harder stages is non-decreasing over training, so harder stages become more frequent over time. A typical schedule has a single hyperparameter that controls the sharpness of the curriculum.

The training loss for the subset $S$ of tracks to be generated is:

$$L_S(\theta) = \mathbb{E}_{\mathbf{t}, z_0, \epsilon}\left[\left\|\epsilon^{S} - \epsilon_\theta\left(z_t^{S}, z_0^{\bar{S}}, \mathbf{t}\right)\right\|_2^2\right],$$

where $z_t^{S}$ are the noised latents for tracks in $S$ (with $\epsilon^{S}$ the noise added to them), $z_0^{\bar{S}}$ are the clean latents of the remaining tracks, and $\mathbf{t}$ aggregates all per-track timesteps.

Maintaining a nonzero selection probability for all curriculum stages preserves the model's ability on easier (conditional) tasks while emphasizing joint generation as training progresses.
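The stage-selection logic can be sketched as follows. The exponential weighting in `stage_probabilities` is a hypothetical schedule chosen only because it keeps every stage at nonzero probability while shifting mass toward harder stages as training progresses; it is not the schedule from the paper. The shared noise level for all generated tracks and the names `progress` and `sharpness` are likewise assumptions.

```python
import math
import random


def stage_probabilities(num_tracks, progress, sharpness=4.0):
    """Hypothetical schedule: p_k proportional to exp(sharpness * progress * k).

    progress runs from 0.0 (start of training) to 1.0 (end of training).
    """
    weights = [math.exp(sharpness * progress * k) for k in range(1, num_tracks + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_training_timesteps(num_tracks, progress, max_t, rng, sharpness=4.0):
    """Pick a stage k, choose k tracks to generate, and build the timestep vector."""
    probs = stage_probabilities(num_tracks, progress, sharpness)
    k = rng.choices(range(1, num_tracks + 1), weights=probs)[0]
    targets = set(rng.sample(range(num_tracks), k))
    t = rng.randint(1, max_t)  # assumption: one shared noise level for generated tracks
    timesteps = [t if i in targets else 0 for i in range(num_tracks)]
    return timesteps, k


rng = random.Random(0)
early = stage_probabilities(4, progress=0.0)   # uniform over the 4 stages
late = stage_probabilities(4, progress=1.0)    # mass concentrated on stage 4
timesteps, stage = sample_training_timesteps(4, 0.5, 1000, rng)
```

Tracks outside the sampled subset receive $t_i = 0$ and therefore act as clean conditioning, which matches the stage definitions above.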

3. Interactive Inference and Human–AI Co-composition

JEN-1 Composer supports an interactive workflow, aligning with professional music production:

Sampling Procedure (Pseudocode Overview)

Tracks with $t_i = 0$ are held fixed as conditioning, and only the tracks designated for regeneration are denoised. Classifier-free guidance can be applied to adjust the strength of the text conditioning during this process.
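A minimal sketch of this kind of selective regeneration is given below, assuming a DDPM-style ancestral sampler with variance $\beta_t$. It is not the paper's pseudocode: `regenerate_tracks` and `zero_predictor` are hypothetical names, and `predict_noise` is a stand-in for the trained network (a classifier-free-guided predictor could be wrapped inside it).

```python
import math
import random


def regenerate_tracks(latents, regenerate, predict_noise, betas, rng):
    """Denoise only the tracks in `regenerate`; every other track stays fixed.

    latents:       list of N flat lists; fixed tracks hold clean latents
    regenerate:    set of track indices to resample from noise
    predict_noise: callable(latents, timesteps) -> per-track noise estimates
    betas:         noise schedule beta_1..beta_T stored at indices 0..T-1
    """
    current = [list(track) for track in latents]  # copy; the input is not modified
    for i in regenerate:
        current[i] = [rng.gauss(0.0, 1.0) for _ in current[i]]  # start from noise

    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)

    for step in range(len(betas), 0, -1):
        beta, a_bar = betas[step - 1], alpha_bars[step - 1]
        # Fixed tracks are presented with t_i = 0, regenerated ones with t_i = step.
        timesteps = [step if i in regenerate else 0 for i in range(len(current))]
        eps_hat = predict_noise(current, timesteps)
        sigma = math.sqrt(beta) if step > 1 else 0.0  # no noise on the final step
        for i in regenerate:
            current[i] = [
                (z - beta / math.sqrt(1.0 - a_bar) * e) / math.sqrt(1.0 - beta)
                + sigma * rng.gauss(0.0, 1.0)
                for z, e in zip(current[i], eps_hat[i])
            ]
    return current


def zero_predictor(latents, timesteps):
    """Placeholder network that always predicts zero noise."""
    return [[0.0] * len(track) for track in latents]


rng = random.Random(0)
tracks = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]  # e.g. bass and drums fixed, melody redone
result = regenerate_tracks(tracks, {2}, zero_predictor, [0.01] * 50, rng)
```

Calling the function again with a different `regenerate` set, while passing the previous `result` as `latents`, gives the iterative keep-or-redo loop described in this section.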

This co-compositional approach allows users to iteratively refine, fix, or regenerate subcomponents (e.g., drum, bass, melody), supporting granular creative workflows (Yao et al., 2023).

4. Datasets, Metrics, and Experimental Results

JEN-1 Composer was trained and evaluated on 800 hours of private multi-track studio recordings (5 aligned tracks: bass, drums, instrument, melody, mix), with splits of 80% train and 20% test. Each sample includes rich text metadata describing genre, tempo, mood, and instrumentation.

Metrics:

  • CLAP alignment score: similarity between the text prompt and the generated audio, computed per track and for the mixed sum of tracks.
  • Human Relative Preference Ratio (RPR): percentage of pairwise preference wins over other models.
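Both metrics reduce to simple computations once embeddings and human judgments are available. The sketch below assumes that the CLAP score is the cosine similarity between a text embedding and an audio embedding, and that RPR is the share of pairwise comparisons a model wins; the exact evaluation protocol is not given in the source, and the function names and the judgment tuple format are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_preference_ratio(judgments, model):
    """Percentage of pairwise comparisons involving `model` that it won.

    judgments: list of (model_a, model_b, winner) tuples
    """
    involved = [j for j in judgments if model in (j[0], j[1])]
    if not involved:
        raise ValueError(f"no comparisons involve {model!r}")
    wins = sum(1 for j in involved if j[2] == model)
    return 100.0 * wins / len(involved)


# Toy data: the embeddings and judgments are made up for the example.
aligned = cosine_similarity([1.0, 0.0], [1.0, 0.0])
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])
judgments = [
    ("A", "B", "A"),
    ("A", "C", "A"),
    ("B", "A", "B"),
    ("A", "C", "A"),
    ("B", "C", "C"),
]
rpr_a = relative_preference_ratio(judgments, "A")
```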

Baselines: MusicLM, MusicGen, JEN-1 (single-track).

Quantitative Results

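| Method | Bass | Drums | Instrument | Melody | Mixed | RPR |
|---|---|---|---|---|---|---|
| MusicLM | 0.16 | 0.17 | 0.23 | 0.28 | 0.28 | 27% |
| MusicGen | 0.17 | 0.15 | 0.25 | 0.33 | 0.35 | 36% |
| JEN-1 | 0.19 | 0.16 | 0.29 | 0.32 | 0.36 | 40% |
| JEN-1 Composer | 0.21 | 0.18 | 0.29 | 0.36 | 0.39 | |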

Ablation studies isolate the effect of per-track timestep vectors, curriculum training, and interactive inference, showing incremental improvements in CLAP and RPR scores.

Qualitative analyses indicate effective co-creation in genres such as jazz (iterative drum-bass-melody refinement), electronic dance (percussive anchors and dynamic synth shaping), and orchestral mock-up (piano motif to strings/brass generation). Real-time demos on A100 GPUs indicate ~5 s latency for 30-second, 4-track generations.

5. Relationship to JEN-1 and Prior Work

JEN-1 Composer is a direct descendant of the JEN-1 model (Li et al., 2023), which demonstrated universal text-to-music generation via omnidirectional diffusion over audio latents. JEN-1 employs a masked autoencoder for end-to-end latent representation, a 1D U-Net with cross-attention to FLAN-T5 embeddings, and multi-task heads for text-to-music, inpainting, and continuation. Key formulations, such as the $\epsilon$-matching diffusion objective and classifier-free guidance, are maintained and extended in JEN-1 Composer.

Unlike JEN-1, which performs single-track generation and conditioning via latent masking, JEN-1 Composer generalizes this approach to $N$ tracks, employing track-specific noising schedules, channel-wise latent concatenation, and curriculum-based training. Together these enable variable-conditional, joint, and marginal multi-track synthesis in a unified framework (Yao et al., 2023).

6. Limitations, Future Directions, and Open Questions

JEN-1 Composer does not explicitly model music theory constructs such as chord progressions or voice leading, which occasionally results in timbral drift or structural inconsistencies across tracks and refinement iterations. The system's user interface sometimes leaves novices uncertain of optimal track fixation strategies.

Suggested paths for improvement include:

  • Integrating symbolic-audio hybrid conditioning (e.g., using chord labels or MIDI prompts).
  • Introducing music-theoretic objectives (e.g., chord recognition loss) into curriculum stages.
  • Extending the generative temporal context for larger structural coherence (verse/chorus).

Open research questions highlighted include the seamless integration of score-level editing (MIDI) with raw-audio diffusion, learning disentangled latent spaces for granular control (timbre, melody, rhythm), and the design of intuitive user interfaces for non-expert guided co-composition (Yao et al., 2023).

7. Significance and Research Impact

JEN-1 Composer marks a substantial advance in controllable, high-fidelity, multi-track music generation and co-creative workflows. Its unified latent-diffusion methodology, paired with curriculum training and track-wise interactive inference, bridges a longstanding gap between direct multi-track synthesis and the professional practices of DAW-based composition. By attaining state-of-the-art performance in CLAP-based text-to-audio alignment and strong human preference scores, it establishes a foundational framework for future AI-assisted music creation systems, affecting both research exploration and practical music production paradigms (Yao et al., 2023).
