HeartMuLa Music: Adaptive Biofeedback System

Updated 14 April 2026
  • HeartMuLa Music System is a unified framework integrating advanced music foundation models with biofeedback-based control for adaptive music generation.
  • It features a modular architecture with specialized modules (HeartCLAP, HeartTranscriptor, HeartCodec, and HeartMuLa) that enable generative, transcription, and retrieval tasks.
  • The models are trained with staged protocols that are reproducible on academic-scale hardware, while real-time PPG and respiration feedback modulates audio dynamically.

The HeartMuLa Music System is a unified, open-source technological framework for large-scale music understanding, controllable music generation, and interactive physiological feedback, integrating advanced music foundation models and biofeedback-based audio control. The HeartMuLa ecosystem encompasses dual pillars: (i) a family of foundation models—HeartCLAP, HeartTranscriptor, HeartCodec, and HeartMuLa—for generative, transcription, and retrieval tasks; and (ii) real-time physiological feedback pipelines, notably harnessing photoplethysmogram (PPG) heart-rate signals and respiration for dynamic music modulation and interactive biofeedback applications. This system is distinguished by modular architecture, empirically validated physiological effects, and reproducibility on academic hardware budgets (Yang et al., 15 Jan 2026, Easthope, 5 May 2025, Leslie et al., 2019).

1. System Architecture and Model Family

HeartMuLa’s machine-learning subsystem is organized around four tightly integrated modules:

  • HeartCLAP: A music–text dual encoder aligning audio and textual modalities (style tags, lyrics) within a 1024-dimensional embedding space, trained with symmetric InfoNCE loss. It underpins style-tag extraction, zero-shot text–audio retrieval, and model conditioning.
  • HeartTranscriptor: A lyric recognition model built atop Whisper–Medium, fine-tuned on voice-isolated music clips (Demucs separation, WER/CER filtering), providing phoneme-level alignments and intelligibility metrics. Its outputs constitute text-conditioning for generative processes.
  • HeartCodec: A high-fidelity, low-frame-rate “music codec tokenizer” mapping 48 kHz waveforms into 8 residual vector-quantization codebooks per frame (V=8192; 12.5 Hz), capturing both long-range structure and fine acoustic details. It relies on multi-encoder fusion (MuEncoder, WavLM, Whisper), query-based Transformer compression, and rigorous commitment/alignment objectives.
  • HeartMuLa: An LLM-centric autoregressive music generator (Llama 3.2 global backbone, 3B parameters; local decoder, 300M) that, conditioned on style annotations, lyrics, and optional reference audio, predicts HeartCodec token sequences decodable back to audio. The factorization across frames and codebooks enables hierarchical generation, with explicit probability decomposition.
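
The symmetric InfoNCE objective named for HeartCLAP can be sketched as follows. This is a minimal pure-Python illustration, not the authors' training code: the temperature of 0.07 and the function names are assumptions, and toy 2-dimensional vectors stand in for the 1024-dimensional embeddings.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax_row(row):
    # Stable log-softmax: subtract the row maximum before exponentiating.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_info_nce(audio_emb, text_emb, temperature=0.07):
    """Mean of audio->text and text->audio cross-entropy over a batch.

    Row i of audio_emb is assumed to be paired with row i of text_emb;
    all other rows in the batch act as negatives.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)
    # Cosine-similarity logits scaled by the (assumed) temperature.
    logits = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]
    logits_t = [list(col) for col in zip(*logits)]
    a2t = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    t2a = -sum(log_softmax_row(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (a2t + t2a)
```

Correctly paired embeddings give a loss near zero, while swapping the text rows pushes the loss up, which is the behaviour that makes the shared space usable for zero-shot retrieval.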

Data flow is multiplexed: [Text/Tags] → HeartCLAP; [Lyrics] → HeartTranscriptor; [Reference Audio] → MuQ–MuLan (style embedding); all conditions combine into HeartMuLa, which outputs discrete tokens for HeartCodec to reconstruct the waveform (Yang et al., 15 Jan 2026).
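
To give a sense of the size of the token stream HeartMuLa has to predict, the HeartCodec figures quoted above (12.5 Hz frames, 8 codebooks, V=8192) can be turned into a per-clip token count and a nominal bitrate. The sketch assumes every codebook index costs log2(V) bits with no entropy coding, which the source does not state; `codec_budget` is a hypothetical helper.

```python
import math

FRAME_RATE_HZ = 12.5   # HeartCodec frame rate, as stated in the text
NUM_CODEBOOKS = 8      # residual VQ codebooks per frame
CODEBOOK_SIZE = 8192   # entries per codebook (V)

def codec_budget(duration_s):
    """Frames, tokens, and nominal bitrate for a clip of the given length."""
    frames = math.ceil(duration_s * FRAME_RATE_HZ)
    tokens = frames * NUM_CODEBOOKS
    bits_per_token = math.log2(CODEBOOK_SIZE)  # assumes fixed-length indices
    bitrate_bps = FRAME_RATE_HZ * NUM_CODEBOOKS * bits_per_token
    return {"frames": frames, "tokens": tokens, "bitrate_bps": bitrate_bps}
```

Under these assumptions a 30 s clip corresponds to 375 frames and 3,000 tokens at a nominal 1.3 kbit/s, with the frame axis handled by the global backbone and the 8 codebooks per frame by the local decoder.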

2. Training Data, Protocol, and Scalability

Model components are pretrained and fine-tuned on curated datasets:

  • HeartCLAP: 1M music–text pairs; trained with attribute/tag-level masking on 8×A100 GPUs, batch 1024 for 50 epochs.
  • HeartTranscriptor: 100k songs (7k hours), voice-filtered; Whisper–Medium fine-tuned on 8×A100 GPUs.
  • HeartCodec: ~600k songs (20–30 s segments); pretraining (8×A100), ReFlow distillation (8×A100), and SQ-Codec finetuning (4×A100) across segment lengths 20.48–29.76 s.
  • HeartMuLa: Warmup (10k h, 30 s clips, 8×A100), pretraining (100k h, 64×A100), SFT (15k h, 8×A100), Direct Preference Optimization (DPO) using MuQ-sim, PER, AudioBox+SongEval metrics, 8×A100.
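
The final HeartMuLa stage uses Direct Preference Optimization. A minimal sketch of the standard DPO loss for a single (preferred, rejected) pair is shown below; the value β=0.1 and the way pairs would be ranked from the MuQ-sim, PER, and AudioBox/SongEval scores are assumptions, not details taken from the source.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * margin)) for one preference pair.

    Inputs are sequence log-probabilities under the trained policy and
    under a frozen reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = -beta * margin
    # Numerically stable softplus(z) = log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy and reference agree the loss equals log 2; it falls as the policy assigns relatively more probability to the preferred sample than the reference does.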

The authors report generation and understanding quality comparable to Suno using academic-scale hardware and open data; code and pretrained weights are available at https://github.com/HeartMuLa/heartlib (Yang et al., 15 Jan 2026).

3. Control Mechanisms and Generation Capabilities

HeartMuLa exposes granular, multimodal control:

  • Style conditioning: Natural-language style tags (genre, mood, instrument, scene, singer timbre, topic, region), encoded within special delimiters, e.g., <tag>pop, guitar, upbeat</tag>.
  • Lyric structure: Lyrics tokenized with the Llama 3.2 tokenizer and annotated with section markers ([intro], [verse], [chorus], etc.).
  • Reference audio: 10 s audio fragments encoded as style embeddings (MuQ–MuLan), augmenting conditioning vectors.
  • Fine-grained section control: Automated section-wise style annotation (Qwen-2.5-Omni) applied as bracketed descriptors per structure unit, permitting orthogonal specification of dynamics, energy, and timbre.
  • Specialized modes: Short-form generation (10–30 s clips suited to video) by flag-based training/inference scheduling; fine-grained attribute control for per-section variety.
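
The tag delimiters and section markers described above suggest a simple textual conditioning format. The helper below is a hypothetical sketch of how such a prompt might be assembled; only the `<tag>…</tag>` wrapper and the bracketed section markers come from the text, while the function name, line layout, and ordering are assumptions.

```python
def build_prompt(tags, sections):
    """Assemble a conditioning string.

    tags:     list of style strings, e.g. ["pop", "guitar", "upbeat"]
    sections: list of (marker, lyric_text) pairs, e.g. ("verse", "...")
    """
    tag_block = "<tag>" + ", ".join(t.strip() for t in tags) + "</tag>"
    lines = [tag_block]
    for marker, text in sections:
        lines.append(f"[{marker}]")
        if text:  # instrumental sections may carry no lyrics
            lines.append(text.strip())
    return "\n".join(lines)
```

For example, `build_prompt(["pop", "guitar", "upbeat"], [("intro", ""), ("verse", "City lights")])` yields the tag line followed by `[intro]`, `[verse]`, and the verse text on separate lines.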

Classifier-free guidance is implemented by random dropout of conditions during training phases; global and local Transformers use rotary embeddings and FlashAttention (Yang et al., 15 Jan 2026).
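
Condition dropout and the usual inference-time combination can be sketched as follows. The drop probability, the guidance scale, and the choice to drop each condition independently are assumptions; the source only states that conditions are randomly dropped during training.

```python
import random

def drop_conditions(conditions, p_drop=0.1, rng=random):
    """Training-time dropout: replace each condition with None (the
    'null' condition) with probability p_drop."""
    return {k: (None if rng.random() < p_drop else v)
            for k, v in conditions.items()}

def guided_logits(cond_logits, uncond_logits, scale=1.5):
    """Standard classifier-free guidance: move the unconditional logits
    toward the conditional ones by a factor of `scale`."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]
```

A scale of 1.0 reproduces the conditional logits; larger values push sampling harder toward the conditioning signal at some cost in diversity.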

4. Physiological Feedback Integration: Real-Time Cardio-Music Interfacing

HeartMuLa extends beyond static generation, incorporating a real-time, heart-driven biofeedback loop:

4.1 PPG Sensor and Signal Processing

  • Sensor: Wearable PPG (e.g., Polar Verity Sense); dual-LED (660/850 nm), sample rate 55 Hz, radial artery placement, 16-bit ADC.
  • Preprocessing: Zero-phase 4th-order Butterworth IIR (0.5–5 Hz), optional DC removal, and motion artifact suppression.
  • Peak/Rate Extraction: Local maxima detection ($x[n] > x[n-1]$ and $x[n] > x[n+1]$), thresholded via $\mu + k\sigma$ over a 2 s window with $k \approx 1.2$; peak timestamping; and instantaneous heart rate $HR_k = 60/(t_k - t_{k-1})$, interpolated at 55 Hz.
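
The peak-picking and rate-extraction step can be sketched in pure Python as below. It assumes the signal has already been band-pass filtered, uses a trailing 2 s window for the mean-plus-k-sigma threshold, and adds a 0.3 s refractory period that is not mentioned in the source; resampling of the heart-rate series to 55 Hz is omitted.

```python
import statistics

FS_HZ = 55.0  # PPG sample rate stated in the text

def detect_peaks(x, fs=FS_HZ, window_s=2.0, k=1.2, refractory_s=0.3):
    """Indices of local maxima exceeding mean + k*std of a trailing window."""
    win = int(window_s * fs)
    refractory = int(refractory_s * fs)  # assumed guard against double peaks
    peaks = []
    for n in range(1, len(x) - 1):
        if not (x[n] > x[n - 1] and x[n] > x[n + 1]):
            continue
        seg = x[max(0, n - win):n + 1]
        if len(seg) < 2:
            continue
        threshold = statistics.fmean(seg) + k * statistics.pstdev(seg)
        if x[n] > threshold and (not peaks or n - peaks[-1] >= refractory):
            peaks.append(n)
    return peaks

def instantaneous_hr(peaks, fs=FS_HZ):
    """HR_k = 60 / (t_k - t_{k-1}) in beats per minute."""
    times = [p / fs for p in peaks]
    return [60.0 / (t1 - t0) for t0, t1 in zip(times, times[1:])]
```

At 55 Hz each peak timestamp is quantized to about 18 ms, so a steady 72 bpm pulse produces estimates that jitter by a beat or two per minute, which is one reason smoothing is applied downstream.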

4.2 Wireless HCI and Latency

  • BLE Profile: Standard Heart Rate Service (UUID 0x180D/0x2A37), custom PPG service for raw samples; low-latency (≤10 ms per link) over Bluetooth LE; ~43 ms end-to-end sensor-to-audio pipeline.
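
On the receiving side, notifications from the standard Heart Rate Measurement characteristic (0x2A37) can be decoded as sketched below. The layout follows the Bluetooth Heart Rate Service definition (flags byte, 8- or 16-bit heart rate, optional energy-expended field, optional RR intervals in units of 1/1024 s); the function name is hypothetical, and the custom raw-PPG service mentioned above is not covered because its format is not given.

```python
import struct

def parse_heart_rate_measurement(payload: bytes):
    """Decode one Heart Rate Measurement (0x2A37) notification."""
    flags = payload[0]
    offset = 1
    if flags & 0x01:  # bit 0: heart rate is uint16, little-endian
        (bpm,) = struct.unpack_from("<H", payload, offset)
        offset += 2
    else:             # heart rate is uint8
        bpm = payload[offset]
        offset += 1
    if flags & 0x08:  # bit 3: energy expended present, skipped here
        offset += 2
    rr_intervals_s = []
    if flags & 0x10:  # bit 4: one or more RR intervals follow
        while offset + 2 <= len(payload):
            (rr,) = struct.unpack_from("<H", payload, offset)
            rr_intervals_s.append(rr / 1024.0)
            offset += 2
    return {"bpm": bpm, "rr_intervals_s": rr_intervals_s}
```

The RR intervals, when the sensor provides them, give beat-to-beat timing directly and can stand in for peak detection on the raw PPG stream.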

4.3 Audio Time-Scaling Engine

  • Warp factor: $f(HR(t)) = \frac{HR(t)}{HR_{rest}}$, clipped to $[0.8, 1.5]$, where $HR_{rest}$ is a 30 s average baseline.
  • Time Warping Algorithms: (i) Buffer-slice with zero-order hold, or (ii) phase vocoder (STFT-based), with preference for phase-vocoding at $f > 1.2$ for perceptual transparency.
  • Software: Modular architecture (sensor acquisition, DSP, control-mapping, real-time callback via PyAudio/PortAudio/JUCE).

Feedback-attenuating smoothing is applied to avoid instability: $HR_{smooth}(t) = \alpha\, HR(t) + (1-\alpha)\, HR_{smooth}(t-1)$, with $\alpha \approx 0.1$ (Easthope, 5 May 2025).
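
The smoothing recursion and the clipped warp factor combine into a short control-mapping step, sketched below. Seeding the filter with the first sample is an assumption, as are the function names; the clip range and the smoothing coefficient follow the values quoted in this section.

```python
def smooth_hr(hr_samples, alpha=0.1):
    """Exponential smoothing: s(t) = alpha * HR(t) + (1 - alpha) * s(t-1)."""
    smoothed = []
    prev = None
    for hr in hr_samples:
        # Assumed initialisation: the first output equals the first input.
        prev = hr if prev is None else alpha * hr + (1.0 - alpha) * prev
        smoothed.append(prev)
    return smoothed

def warp_factor(hr, hr_rest, lo=0.8, hi=1.5):
    """Playback-rate factor HR / HR_rest, clipped to [lo, hi]."""
    return min(hi, max(lo, hr / hr_rest))
```

With a resting rate of 60 bpm, a smoothed reading of 75 bpm maps to a warp factor of 1.25, while readings of 120 bpm or 30 bpm saturate at 1.5 and 0.8 respectively.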

4.4 Generative Latent-Space Extension

  • HR_norm features are embedded into latent vectors ($z \in \mathbb{R}^D$) and injected into VAE/GAN latent codes, enabling higher-dimensional control over generative synthesis pipelines (Easthope, 5 May 2025).

5. Interactive Music, Cardiorespiratory Modulation, and Physiological Outcomes

Integrating real-time physiological data into musical interaction extends both expressive and neuroscience applications:

  • Tempo and Envelope Modulation: HeartMuLa can actively modulate music amplitude/loudness or temporal features at rates derived from baseline or live physiological signals (Leslie et al., 2019).
  • Effects on Physiology: Interactive music constructed with personalized tempo modulation (at a rate set relative to baseline respiration) significantly slows breathing (ANOVA), increases breath-to-breath variability, reduces electrodermal activity, and increases EEG CNV amplitude at Cz (Leslie et al., 2019).
  • Coherence Feedback: HeartMuLa supports future extension to closed-loop cardio-respiratory coherence, e.g., maximizing respiratory sinus arrhythmia via cross-modal modulation rates.
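
One way to picture amplitude modulation at a rate derived from baseline respiration is a slow raised-cosine gain envelope. The sketch below is illustrative only: `rate_ratio` (the modulation rate as a proportion of the baseline breathing rate) and `depth` are hypothetical parameters, and their values are not taken from the cited study.

```python
import math

def modulation_envelope(n_samples, sample_rate, baseline_bpm,
                        rate_ratio, depth=0.3):
    """Gain envelope in [1 - depth, 1] oscillating at
    rate_ratio * baseline breathing rate (breaths per minute)."""
    mod_hz = rate_ratio * baseline_bpm / 60.0
    return [1.0 - depth * 0.5 * (1.0 - math.cos(
                2.0 * math.pi * mod_hz * n / sample_rate))
            for n in range(n_samples)]
```

Multiplying an audio buffer sample-by-sample with this envelope produces a loudness swell at the target rate; with a baseline of 12 breaths per minute and a ratio of 0.5, one full cycle lasts 10 s.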

A plausible implication is that these physiological couplings open pathways for adaptive compositional systems and may enhance well-being with minimal cognitive load.

6. Evaluation, Benchmarking, and Reproduction

HeartMuLa is supported by objective and subjective benchmarking:

  • Objective metrics: AudioBox (CE, CU, PQ), SongEval (coherence, musicality, clarity, naturalness), Tag-Sim (cosine similarity of style-tag and audio embedding), PER (lyric intelligibility).
  • Subjective evaluation: 5-point MOS on musical criteria, using blind expert listeners.
  • Multilingual and transcription benchmarks: HeartBeats (EN, ZH, JP, KO, ES), WikiMT-X (CLAP retrieval), SSLD-200 & HeartBeats-ASR-Bench (transcription).
  • Reproducibility: All substantial results are obtained with academic-scale hardware (≤64×A100 for pretraining), with public code, weights, and recipes (Yang et al., 15 Jan 2026).
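
Two of the objective metrics listed above reduce to short computations. The sketch assumes that PER denotes a phoneme error rate computed as edit distance over reference length, and that Tag-Sim is a plain cosine similarity between a tag embedding and an audio embedding; the embedding models and the phoneme recognizer themselves are outside its scope.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def phoneme_error_rate(reference, hypothesis):
    """Levenshtein distance between phoneme sequences, divided by the
    number of reference phonemes."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        cur = [i]
        for j, h in enumerate(hypothesis, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(reference)
```

For instance, recognizing `HH EH L` against the reference `HH AH L OW` costs one substitution and one deletion, giving a PER of 0.5.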

A consolidated inference pipeline is available, demonstrating prompt→code→waveform workflows and preference-guided fine-tuning (DPO).

7. Limitations and Practical Recommendations

  • Artifact handling: The physiological subsystem employs adaptive thresholds, exponential smoothing, and robust artifact mitigation for sensor-derived signals (Easthope, 5 May 2025).
  • Modulation constraints: Audio time-scaling is perceptually transparent for factors <1.2; control ranges are clipped to ±25% to reduce listener fatigue.
  • Physiological metrics: Traditional HRV indices are less informative at manipulated, slow breathing rates; spectral refinement is suggested for advanced coherence quantification.
  • Ecological validity: Effects demonstrated in controlled settings should be replicated in field deployments (e.g., ambulatory, workplace) (Leslie et al., 2019).
  • User awareness: Future deployments ought to include subjective assessment to distinguish implicit from volitional physiological adaptation.

HeartMuLa’s design, validation, and open licensing position it as a reference system for music AI and biologically coupled musical HCI, extensible to both creative and health-related applications (Yang et al., 15 Jan 2026, Easthope, 5 May 2025, Leslie et al., 2019).
