HeartCodec: ECG & Music Compression
- HeartCodec is a dual-domain codec that applies advanced transform, quantization, and encoding techniques for both ECG and music signal compression.
- In ECG, it restructures quasi-periodic signals into 2D arrays and uses mixed wavelet and cosine transforms to nearly double the compression ratio over traditional 1D methods.
- For music, it tokenizes audio at 12.5 Hz using residual vector quantization and a diffusion transformer decoder, enabling efficient modeling of long and detailed sequences.
HeartCodec encompasses two distinct codec architectures targeting efficient and high-fidelity compression in different domains: 2D ECG signal processing for biomedical telemetry (Chagnon et al., 2019) and ultra-low-frame-rate audio tokenization for music generation in foundation model settings (Yang et al., 15 Jan 2026). The term denotes a class of codecs that apply advanced transform, quantization, and encoding techniques to achieve significant reductions in bit rate while preserving semantic and perceptual fidelity. This entry focuses on both the original 2D mixed-transform codec for electrocardiogram (ECG) compression and the HeartMuLa music codec tokenizer, providing a comparative technical exposition.
1. Origins and Conceptual Motivation
HeartCodec in ECG Compression. The 2D HeartCodec was developed to address the inefficiency in compressing quasi-periodic 1D ECG records. Traditional 1D compression fails to exploit the cross-beat structural redundancy inherent in ECG signals. By realigning heartbeats as rows to form a 2D array, the codec captures both intra-beat waveform details and inter-beat morphology, supporting transform-based joint compression in orthogonal dimensions (Chagnon et al., 2019).
HeartCodec in Generative Music Systems. The eponymous component within the HeartMuLa foundation model suite generalizes the codec concept to the music domain. Here, the need arises from the requirement to model lengthy musical sequences with both long-range structure and fine-grained acoustic detail. HeartCodec tokenizes audio at an ultra-low frame rate (12.5 Hz), providing efficient discrete representations suitable for large-scale autoregressive LLM generation (Yang et al., 15 Jan 2026).
2. Technical Architecture: ECG Domain
2.1 Pre-Processing and 2D Construction
- Beat Segmentation: R-peak detection (e.g., Pan–Tompkins algorithm) is used to locate heartbeats, which are then extracted and zero-padded to uniform length, forming a 2D array (rows: individual beats, columns: aligned time samples per beat).
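The 2D construction step can be sketched compactly. As a simplification, SciPy's generic peak picker stands in for the Pan–Tompkins detector named above, and a synthetic spike train stands in for a real ECG record; the beat-to-row reshaping itself follows the description:

```python
import numpy as np
from scipy.signal import find_peaks

def beats_to_2d(ecg, fs=360, min_rr_s=0.4):
    """Stack detected heartbeats as rows of a zero-padded 2D array.

    Simplified front end: scipy's generic peak picker replaces the
    Pan-Tompkins detector used in the paper.
    """
    # Crude R-peak detection: prominent peaks at least min_rr_s apart.
    peaks, _ = find_peaks(ecg, distance=int(min_rr_s * fs),
                          prominence=np.ptp(ecg) * 0.5)
    # Slice the signal at midpoints between consecutive R-peaks.
    bounds = [0] + [(a + b) // 2 for a, b in zip(peaks, peaks[1:])] + [len(ecg)]
    beats = [ecg[s:e] for s, e in zip(bounds, bounds[1:])]
    width = max(len(b) for b in beats)
    # Zero-pad every beat to the longest one -> rows: beats, cols: samples.
    arr = np.zeros((len(beats), width))
    for i, b in enumerate(beats):
        arr[i, :len(b)] = b
    return arr, peaks

# Synthetic quasi-periodic "ECG": a narrow spike train plus baseline wander.
t = np.arange(0, 8, 1 / 360)
ecg = np.exp(-((t % 0.8) - 0.05) ** 2 / 2e-4) + 0.1 * np.sin(2 * np.pi * 0.3 * t)
arr, peaks = beats_to_2d(ecg)
print(arr.shape)  # one row per detected beat
```

Splitting at inter-peak midpoints keeps each QRS complex roughly centered in its row, which is what makes the column-wise (inter-beat) transform effective.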
2.2 Mixed-Transform Coding Pipeline
- Row-wise Discrete Wavelet Transform (DWT): Each row is decomposed using the Cohen–Daubechieu–Feauveau 9/7 biorthogonal wavelet (cdf97) in either a 4- or 6-level critically sampled scheme. Each level passes the approximation signal $a_j$ through the analysis filter pair $(h, g)$ with downsampling by two:

$$a_{j+1}[n] = \sum_k h[k - 2n]\, a_j[k], \qquad d_{j+1}[n] = \sum_k g[k - 2n]\, a_j[k].$$

These coefficients yield compact representations of heartbeat morphologies.
- Column-wise Discrete Cosine Transform (DCT-II): After the row-wise DWT, a length-$N$ DCT-II is applied to each column $x$:

$$X[k] = \alpha(k) \sum_{n=0}^{N-1} x[n] \cos\!\left(\frac{\pi (2n+1) k}{2N}\right), \qquad k = 0, \ldots, N-1,$$

with normalization $\alpha(0) = \sqrt{1/N}$ and $\alpha(k) = \sqrt{2/N}$ for $k \geq 1$.
- Separable Joint Transform: Because the two stages act on orthogonal dimensions, the complete 2D transform is separable and can be written as $Y = C\,X\,W^{\mathsf T}$, where $W$ applies the row-wise DWT and $C$ the column-wise DCT-II.
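The separable row/column structure (and its exact invertibility, used in reconstruction) can be demonstrated in a few lines. For brevity, a one-level orthonormal Haar DWT stands in for the multi-level cdf97 filter bank; the row-transform-then-column-DCT layout matches the pipeline above:

```python
import numpy as np
from scipy.fft import dct, idct

# One-level orthonormal Haar DWT on each row -- a simple stand-in for the
# 4-6 level cdf97 decomposition described above (same separable structure).
def haar_rows(X):
    a = (X[:, 0::2] + X[:, 1::2]) / np.sqrt(2)   # approximation coefficients
    d = (X[:, 0::2] - X[:, 1::2]) / np.sqrt(2)   # detail coefficients
    return np.hstack([a, d])

def ihaar_rows(Y):
    n = Y.shape[1] // 2
    a, d = Y[:, :n], Y[:, n:]
    X = np.empty((Y.shape[0], 2 * n))
    X[:, 0::2] = (a + d) / np.sqrt(2)
    X[:, 1::2] = (a - d) / np.sqrt(2)
    return X

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))          # toy 2D beat array (rows = beats)

# Forward: DWT along rows, then DCT-II along columns.
Y = dct(haar_rows(X), type=2, norm="ortho", axis=0)
# Inverse: IDCT-II along columns, then inverse DWT along rows.
X_rec = ihaar_rows(idct(Y, type=2, norm="ortho", axis=0))

print(np.allclose(X, X_rec))  # True: the separable transform is invertible
```

All distortion in the actual codec therefore comes from the quantization stage, not from the transforms themselves.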
2.3 Quantization and Compression
- Mid-tread uniform quantization with step size $\Delta$ is applied to the flattened transform coefficients, mapping each coefficient $c$ to the index $q = \operatorname{round}(c / \Delta)$ with reconstruction $\hat{c} = q\,\Delta$.
- Sparse-index representation: Nonzero quantized coefficients are collected as triplets (absolute value, sign, position/difference).
- Entropy coding: Huffman coding is performed over all value, sign, and positional difference streams, along with required side information (array dimensions, mean, quantizer parameters).
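The quantize/sparsify steps above can be sketched directly; as a simplification, the empirical Shannon entropy stands in for the actual Huffman tables (it lower-bounds the Huffman rate), and synthetic Laplacian coefficients with a few large spikes mimic the energy compaction of the transform stage:

```python
import numpy as np

def quantize_mid_tread(coeffs, delta):
    """Mid-tread uniform quantizer: index q = round(c / delta)."""
    return np.round(coeffs / delta).astype(int)

def sparse_triplets(q):
    """Nonzero coefficients as (magnitude, sign, positional difference)."""
    pos = np.flatnonzero(q)
    mags = np.abs(q[pos])
    signs = (q[pos] < 0).astype(int)
    dpos = np.diff(pos, prepend=-1)       # gaps between nonzero positions
    return mags, signs, dpos

def empirical_entropy_bits(symbols):
    """Shannon entropy in bits/symbol -- a lower bound on the Huffman rate."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Toy transform coefficients: a few large values in a sea of small ones.
rng = np.random.default_rng(1)
c = rng.laplace(scale=0.2, size=1024)
c[rng.choice(1024, 20, replace=False)] += 10.0

q = quantize_mid_tread(c, delta=1.0)
mags, signs, dpos = sparse_triplets(q)
rate = empirical_entropy_bits(mags) + 1 + empirical_entropy_bits(dpos)
print(len(mags), "nonzeros;", round(rate, 2), "bits per nonzero (approx.)")
```

Coding position *differences* rather than absolute positions is what makes the sparse-index representation cheap: after transform concentration the gaps are small and highly repetitive, so they entropy-code well.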
2.4 Reconstruction
- Inverse transforms (IDCT-II, then cdf97 inverse DWT) are applied, followed by reassembly of 2D segments into the 1D ECG record.
3. Technical Architecture: Music Foundation Models
3.1 Semantic-Rich Multi-Encoder Stack
- Inputs: Stereo waveforms at 48 kHz, processed by four frozen pretrained feature extractors.
- Feature fusion: All extracted representations are temporally resampled to 25 Hz, concatenated, then linearly projected.
3.2 Ultra-Low-Frame-Rate Tokenization
- Query-based quantization: Introduces a learnable “[Q]” token after every two input frames, processes the interleaved sequence with a Transformer encoder, and retains only the query positions, downsampling the 25 Hz features to a 12.5 Hz sequence.
- Residual Vector Quantization (RVQ): Eight stacked codebooks (vocabulary 8192 each) encode the discrete trajectory. Each stage quantizes the residual left by the previous stages:

$$r_0 = z, \qquad r_i = r_{i-1} - \operatorname{VQ}_i(r_{i-1}), \qquad \hat{z} = \sum_{i=1}^{8} \operatorname{VQ}_i(r_{i-1}),$$

where $\operatorname{VQ}_i$ selects the nearest codeword in the $i$-th codebook. Commitment and alignment losses ensure stability and semantic retention.
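The RVQ mechanics can be illustrated with toy sizes (16-dim latents, 64-entry codebooks instead of the 8×8192 used by HeartCodec, and untrained random codebooks with a zero "skip" codeword so a stage can never increase the error):

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Residual vector quantization: each stage quantizes what the
    previous stages left over; the selected codewords sum to z_hat."""
    residual = z.copy()
    codes, z_hat = [], np.zeros_like(z)
    for C in codebooks:                       # C: (vocab, dim)
        # Nearest-codeword search per frame.
        d = ((residual[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(1)
        codes.append(idx)
        z_hat += C[idx]
        residual = z - z_hat                  # what is left to encode
    return np.stack(codes), z_hat

rng = np.random.default_rng(0)
dim, vocab, stages, frames = 16, 64, 8, 32    # toy sizes (paper: 8 x 8192)
codebooks = []
for _ in range(stages):
    C = 0.5 * rng.standard_normal((vocab, dim))
    C[0] = 0.0                                # "skip" codeword: a stage can
    codebooks.append(C)                       # never make the residual worse

z = rng.standard_normal((frames, dim))        # toy 12.5 Hz latent trajectory
codes, z_hat = rvq_encode(z, codebooks)
err0, err = np.linalg.norm(z), np.linalg.norm(z - z_hat)
print(codes.shape, bool(err <= err0))  # (8, 32) True
```

Each frame thus costs `stages * log2(vocab)` bits; with trained codebooks the residual shrinks sharply at every stage, which is what lets eight small indices stand in for a continuous embedding.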
3.3 High-Fidelity Reconstruction Decoder
- Target latents: Continuous representations from a 25 Hz SQ-Codec tokenizer.
- Flow-matching module: Diffusion Transformer (LLaMA-3 backbone, 1.5B parameters) learns to map Gaussian noise to latent space, conditioned on quantized embeddings and masked latents.
- ReFlow distillation: Reduces sampling steps and accelerates inference.
- SQ-Codec fine-tuning: Final waveform reconstruction is supervised with L1 and STFT losses plus adversarial regularization.
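The flow-matching objective driving the decoder can be written down without the network. Assuming the standard rectified-flow linear path (noise-to-latent interpolation with a constant-velocity target), a minimal sketch of how training pairs and the regression loss are formed:

```python
import numpy as np

rng = np.random.default_rng(0)
frames, dim = 64, 32
x1 = rng.standard_normal((frames, dim))   # target latents (toy stand-in for
x0 = rng.standard_normal((frames, dim))   # SQ-Codec); x0 is Gaussian noise
t = rng.uniform(size=(frames, 1))         # per-example time in [0, 1]

x_t = (1 - t) * x0 + t * x1               # linear interpolation path
v_target = x1 - x0                        # constant velocity along the path

def fm_loss(v_pred, v_target):
    """Flow-matching regression loss: MSE to the path velocity."""
    return float(((v_pred - v_target) ** 2).mean())

# The real predictor is the conditioned Diffusion Transformer evaluated at
# (x_t, t); a trivial zero predictor just shows where it plugs in.
print(round(fm_loss(np.zeros_like(v_target), v_target), 3))
```

Because the learned velocity field is (approximately) straight under this path, ReFlow distillation can collapse the ODE integration into very few sampling steps, which is the acceleration noted above.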
4. Performance and Comparative Benchmarks
The following table summarizes HeartCodec’s performance relative to prior audio codecs (selected metrics from (Yang et al., 15 Jan 2026)):
| Model | Frame Rate (Hz) | Bitrate (kbps) | VISQOL↑ | FAD↓ | FD↓ | STOI↑ | PESQ↑ | WER↓ |
|---|---|---|---|---|---|---|---|---|
| SemantiCodec | 25 | 0.375 | 2.24 | 2.32 | 22.38 | 0.40 | 1.14/1.44 | 0.91 |
| XCodec (8×1024) | 50 | 4.00 | 2.35 | 0.70 | 14.78 | 0.74 | 1.87/2.62 | 0.27 |
| MuCodec | 25 | 0.35 | 3.07 | 1.02 | 14.73 | 0.45 | 1.12/1.36 | 0.54 |
| LeVo (dual) | 25 | 0.70 | 3.26 | 1.45 | 19.96 | 0.56 | 1.21/1.61 | 0.35 |
| HeartCodec (SQ Ft) | 12.5 | 1.30 | 3.72 | 0.27 | 11.06 | 0.66 | 1.52/2.10 | 0.26 |
At a ~1.3 kbps bitrate and 12.5 Hz frame rate, HeartCodec achieves the best VISQOL and FAD (closest to ground truth), with intelligibility and word error rate on par with or better than much higher-bitrate baselines (Yang et al., 15 Jan 2026).
In ECG, at typical clinical distortion (PRD≈1%), the 2D HeartCodec achieves a mean compression ratio (CR) of 85±50, nearly doubling the 1D baseline (CR=42±12) (Chagnon et al., 2019).
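The distortion metric behind these ECG numbers, the percentage RMS difference (PRD), and the compression ratio are both simple to compute; a sketch with synthetic data (a noise-corrupted reconstruction calibrated to 1% relative energy) illustrates both:

```python
import numpy as np

def prd_percent(x, x_hat):
    """Percentage RMS difference, the standard ECG distortion metric.

    (Some variants first subtract the signal mean from x; this is the
    plain un-normalized form.)
    """
    return 100.0 * np.linalg.norm(x - x_hat) / np.linalg.norm(x)

def compression_ratio(n_samples, bits_per_sample, compressed_bits):
    """Original size over compressed size, in bits."""
    return n_samples * bits_per_sample / compressed_bits

# Toy check: additive noise at 1% of the signal's RMS gives PRD ~ 1%.
rng = np.random.default_rng(0)
x = rng.standard_normal(650_000)              # one MIT-BIH record length
noise_scale = 0.01 * np.linalg.norm(x) / np.sqrt(x.size)
x_hat = x + noise_scale * rng.standard_normal(x.size)
print(round(prd_percent(x, x_hat), 2))        # ~ 1.0
```

At CR = 85, an 11-bit-per-sample record shrinks to roughly 0.13 bits per sample, which is why the positional side information in the sparse-index representation must itself be entropy-coded tightly.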
5. Gains, Trade-Offs, and Limitations
Compression Gains
- ECG Domain: Combining wavelet (non-stationary beat shape) and cosine (inter-beat redundancy) transforms yields coefficient concentration superior to full 2D DWT. This results in significantly higher CR at a fixed PRD compared to 1D wavelet or previous 2D SPIHT codecs (Chagnon et al., 2019).
- Music Domain: Ultra-low token rates enable efficient long-context modeling while preserving semantic and spectral fidelity via multi-level encoder features and RVQ, despite a drastic reduction in sequence length (Yang et al., 15 Jan 2026).
Trade-Offs
- ECG: Pre- and post-processing (segmentation, reshaping, dimensionality handling) and more complex entropy coding increase overhead, but CR gains dominate at clinical PRDs.
- Highly Irregular Signals: In the ECG setting, clinical irregularity reduces 2D compression advantage due to decreased inter-beat redundancy.
- Music: The transition to ultra-low frame rate must balance autoregressive modeling tractability against temporal granularity.
Limitations
- For PRD < 0.4% (ECG): Efficiency loss arises from encoding low-amplitude baseline noise, negating transform concentration.
- Dataset and Training Cost (Music): Reproduction of HeartCodec requires extensive data and GPU resources, though code and checkpoints are public.
6. Implementation, Training, and Reproducibility
- ECG HeartCodec: Operates over MIT-BIH Arrhythmia records (48 records, 360 Hz, 650,000 samples/record), with QRS detection via Matlab Pan–Tompkins. Quantization, sparse-indexing, and Huffman coding comprise the compression backend; all workflow steps and transformations are described in (Chagnon et al., 2019).
- Music HeartCodec: Pretraining uses ~600,000 songs (20.5 s segments), with model parameters as follows: batch size 160 (8×A100 GPUs), AdamW optimizer (lr=1e-4, cosine schedule), and a ~1.5B parameter diffusion Transformer as decoder. Open-source code and checkpoints are available, enabling reproduction on substitute datasets, subject to hardware availability (Yang et al., 15 Jan 2026).
7. Extensions and Outlook
For ECG, plausible extensions include adaptive wavelet-level selection, refined bit-plane coding, and post-processing of residuals using learned models. For music, further decreasing token rate, enhancing transformer context windows, or incorporating joint visual or textual conditioning may be promising. Both HeartCodecs exemplify the integration of signal-structure exploitation (ECG: heartbeat alignment; Music: multi-level semantic encoding) with advanced quantization, making them templates for compression and autoregressive modeling in high-value biomedical and audio domains (Chagnon et al., 2019, Yang et al., 15 Jan 2026).