TinyTTM: Compressed Models
- TinyTTM refers to two similarly named model families: a compressed text-to-music generator and a lightweight time-series forecasting model, both designed for resource-constrained scenarios.
- In the text-to-music application, the compressed model reduces the parameter count of MusicGen-Small from 557.6M to 89.2M using knowledge distillation and selective architectural minimization while maintaining competitive audio quality.
- The methodology employs a composite loss combining cross-entropy, KL divergence, and MSE to effectively transfer knowledge from larger models to their compressed counterparts.
TinyTTM denotes two distinct but similarly named model families in recent machine learning research: (1) a highly compressed transformer-based text-to-music generation system (Moschopoulos et al., 2024), and (2) a lightweight universal time-series forecasting model known as Tiny Time Mixers (TTM) (Ekambaram et al., 2024). Both approaches are motivated by the need to maximize performance while minimizing model capacity and computational demands, with particular emphasis on resource-constrained or real-time deployment scenarios.
1. Compressed Text-to-Music Generation: TinyTTM
In generative AI for music synthesis, TinyTTM is presented as a comprehensive model compression study targeting transformer-based text-to-music (TTM) systems. The reference implementation compresses MusicGen-Small, one of the state-of-the-art transformer architectures for this modality, from 557.6M to 89.2M parameters using knowledge distillation and structural reduction. Audio generation quality remains competitive as measured by Fréchet Audio Distance (FAD) and Kullback–Leibler divergence (KL) on the MusicBench evaluation set. No explicit pruning, quantization, or low-rank factorization is applied beyond distillation and selective architectural minimization (Moschopoulos et al., 2024).
Architecture Comparison
| Component | MusicGen-Small | TinyTTM (V2) | Parameter Count (TinyTTM) |
|---|---|---|---|
| Encoder | T5-base (109.6M params) | T5-tiny (4L, fine-tuned) | 11.3M |
| Generative Model | 24-layer AR Transformer (1024d) | 7L Transformer (720d) | 70.5M |
| Decoder | EnCodec (4×conv + 2×LSTM + conv) | Distilled EnCodec | 7.43M |
| Total | 557.6M | 89.2M | - |
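As a quick consistency check on the table, the short Python sketch below sums the three TinyTTM component counts and derives the overall compression ratio. The figures are copied from the table; the variable names are illustrative only.

```python
# Parameter counts in millions, copied from the architecture comparison table.
teacher_total_m = 557.6  # MusicGen-Small total
student_parts_m = {
    "encoder": 11.3,           # T5-tiny
    "generative_model": 70.5,  # 7-layer transformer
    "decoder": 7.43,           # distilled EnCodec
}

student_total_m = sum(student_parts_m.values())
compression_ratio = teacher_total_m / student_total_m
reduction_pct = 100.0 * (1.0 - student_total_m / teacher_total_m)

print(f"student total: {student_total_m:.2f}M")           # 89.23M
print(f"compression ratio: {compression_ratio:.2f}x")     # 6.25x
print(f"parameter reduction: {reduction_pct:.1f}%")       # 84.0%
```

The component counts add up to the reported 89.2M total, which corresponds to roughly a 6.25x reduction relative to MusicGen-Small.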
The TinyTTM model pipeline consists of a T5-tiny encoder, a 7-layer, 8-head transformer language model (LM) for latent sequence modeling, and a distilled version of the EnCodec neural codec for audio waveform reconstruction. The encoder is fine-tuned on MusicBench using span-based masked language modeling with a cross-entropy objective. Distillation from the fine-tuned MusicGen-Small is performed using cross-entropy, KL divergence, and intermediate MSE losses with dynamic loss-weight scheduling.
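The combination of cross-entropy, KL divergence, and intermediate MSE described above can be illustrated with a minimal pure-Python sketch for a single token position. This is not the authors' implementation. The temperature value, the linear weight schedule in `loss_weights`, and all function names are assumptions made for illustration. The sketch also assumes the teacher and student hidden vectors have already been projected to the same size (the teacher uses 1024 dimensions and the student 720), and it omits the temperature-squared scaling that some distillation setups apply to the KL term.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def cross_entropy(student_logits, target_index):
    """Hard-label cross-entropy against the ground-truth token."""
    return -log_softmax(student_logits)[target_index]

def kl_teacher_student(teacher_logits, student_logits, temperature):
    """KL(teacher || student) on temperature-softened distributions."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mse(teacher_hidden, student_hidden):
    """Mean squared error between same-sized hidden vectors."""
    diffs = [(t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)]
    return sum(diffs) / len(diffs)

def loss_weights(step, total_steps):
    """Hypothetical linear schedule (an assumption, not from the source):
    weight shifts from the teacher-driven terms toward hard-label CE."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    w_ce = 0.25 + 0.5 * progress
    w_kl = 0.5 - 0.25 * progress
    w_mse = 0.25 - 0.25 * progress
    return w_ce, w_kl, w_mse  # always sums to 1.0

def composite_loss(student_logits, teacher_logits, target_index,
                   student_hidden, teacher_hidden, step, total_steps,
                   temperature=2.0):
    """Weighted sum of the three distillation terms."""
    w_ce, w_kl, w_mse = loss_weights(step, total_steps)
    return (w_ce * cross_entropy(student_logits, target_index)
            + w_kl * kl_teacher_student(teacher_logits, student_logits, temperature)
            + w_mse * mse(teacher_hidden, student_hidden))

# Toy example with a 3-token vocabulary and 2-dimensional hidden states.
loss = composite_loss(
    student_logits=[1.5, 0.7, -0.5],
    teacher_logits=[2.0, 0.5, -1.0],
    target_index=0,
    student_hidden=[0.1, -0.2],
    teacher_hidden=[0.0, -0.1],
    step=100,
    total_steps=1000,
)
print(f"composite loss: {loss:.4f}")
```

In a real training loop these terms would be averaged over all token positions and codebooks, and the schedule would follow whatever weighting the training recipe specifies.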
2. Knowledge Distillation and Compression Methodology
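The TTM LM student is trained using a composite distillation loss with scheduled weights $\lambda_{\mathrm{CE}}$, $\lambda_{\mathrm{KL}}$, and $\lambda_{\mathrm{MSE}}$:

$$\mathcal{L} = \lambda_{\mathrm{CE}}\,\mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{KL}}\,\mathcal{L}_{\mathrm{KL}} + \lambda_{\mathrm{MSE}}\,\mathcal{L}_{\mathrm{MSE}}$$

where $\mathcal{L}_{\mathrm{CE}}$ is standard cross-entropy with ground truth targets, $\mathcal{L}_{\mathrm{KL}}$ is a softened cross-entropy (KL divergence) with the teacher's outputs, and $\mathcal{L}_{\mathrm{MSE}}$ is an intermediate MSE computed between selected teacher and student hidden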