ChordFormer Architecture for Audio Chord Recognition
- ChordFormer is a conformer-based deep learning architecture designed for transcribing polyphonic music into structured chord labels by decomposing chords into six musically interpretable components.
- It employs a hybrid local-global sequence modeling approach with Conformer blocks, adaptive reweighting loss, and CRF decoding to address class imbalance and maintain temporal coherence.
- The model utilizes a Constant-Q Transform for high-resolution spectral analysis and achieves state-of-the-art performance in both frame-wise and class-wise chord recognition on benchmark datasets.
ChordFormer is a conformer-based deep learning architecture designed for large-vocabulary audio chord recognition, emphasizing structured chord decomposition, hybrid local-global sequence modeling, and mitigation of class imbalance. The model targets transcription of polyphonic music audio into detailed, musically meaningful chord labels, addressing the challenges posed by the long-tail distribution of chord types and the inherent need to capture both fine spectral structure and extended harmonic context (Akram et al., 17 Feb 2025).
1. Design Objectives and Core Challenges
ChordFormer was developed to transcribe audio into structured chord labels encompassing root+triad, bass, seventh, ninth, eleventh, and thirteenth extensions. The architecture addresses several key challenges:
- Long-tail chord distribution: Many rare chord types and extensions are sparsely represented in datasets, exacerbating class imbalance and limiting recognition performance.
- Contextual modeling: Accurate chord recognition depends on capturing both fine-grained local spectral features (e.g., chord partials, voicing) and long-range harmonic dependencies (e.g., progressions, modulations).
- Structured chord representation: Using a musically meaningful decomposition enhances interpretability and allows for effective parameter sharing and cross-family generalization.
- Class imbalance: Handled explicitly via a re-weighted loss, allowing robust learning even for underrepresented chord types.
2. Input Pipeline and Feature Extraction
ChordFormer processes audio sampled at 22,050 Hz. The primary feature input is a Constant-Q Transform (CQT) spectrogram spanning C1–C8, with 36 bins per octave, resulting in 252 frequency bins per frame. The spectrogram is converted to a decibel scale (librosa’s amplitude_to_db) and normalized to the per-track maximum. The hop length of 512 samples yields a temporal resolution of approximately 23.2 ms per frame. Data augmentation is performed via pitch-shifting each training sample by –5 to +6 semitones, with both spectrograms and chord labels shifted accordingly (Akram et al., 17 Feb 2025).
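The following sketch reproduces this feature pipeline with librosa. The function names are illustrative, and the bin-rolling augmentation is an assumption about how label-consistent pitch shifts could be approximated directly on the CQT, not necessarily the authors' exact procedure.

```python
import librosa
import numpy as np

def extract_cqt_features(audio_path):
    """CQT feature pipeline as described above (illustrative sketch)."""
    y, sr = librosa.load(audio_path, sr=22050)           # resample to 22,050 Hz
    C = librosa.cqt(
        y, sr=sr,
        hop_length=512,                                  # ~23.2 ms per frame
        fmin=librosa.note_to_hz("C1"),                   # spectrum spans C1-C8
        n_bins=252,                                      # 7 octaves x 36 bins
        bins_per_octave=36,
    )
    # Decibel scale, referenced to the per-track maximum (0 dB at the peak).
    C_db = librosa.amplitude_to_db(np.abs(C), ref=np.max)
    return C_db.T                                        # (frames, 252)

def pitch_shift_cqt(C_db, semitones):
    """Augmentation sketch: with 36 bins/octave, one semitone = 3 CQT bins,
    so a pitch shift can be approximated by rolling the frequency axis
    (edge bins wrap around). Chord labels would be transposed by the same
    number of semitones (omitted here)."""
    return np.roll(C_db, semitones * 3, axis=1)
```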
3. Structured Chord Output Representation
Each time frame is annotated with a 6-dimensional vector

$$\mathbf{y}_t = \left(y_t^{(1)}, y_t^{(2)}, y_t^{(3)}, y_t^{(4)}, y_t^{(5)}, y_t^{(6)}\right),$$

where each element encodes a musically interpretable component:
- $y^{(1)}$: root+triad (13 roots × 7 triads + no-chord "N")
- $y^{(2)}$: bass pitch (12 chromas + "N")
- $y^{(3)}$: seventh extension
- $y^{(4)}$: ninth extension
- $y^{(5)}$: eleventh extension
- $y^{(6)}$: thirteenth extension
This structured, one-hot-encoded representation allows the problem of large-vocabulary chord recognition to be decomposed into six smaller multiclass classification tasks, reflecting music theory hierarchies and enabling parameter sharing across related chord types (Akram et al., 17 Feb 2025).
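As a minimal illustration of this factorization, the sketch below encodes a single annotation into the six component targets; the triad inventory and extension indices are assumed for illustration and may differ from the paper's exact vocabularies.

```python
# Illustrative component vocabularies; the paper's exact inventories may differ.
TRIADS = ["maj", "min", "dim", "aug", "sus2", "sus4", "N"]  # 7 triad slots (assumed)
CHROMAS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def encode_chord(root, triad, bass, seventh, ninth, eleventh, thirteenth):
    """Map one chord annotation to the 6-dimensional structured target.
    Each component becomes an index into its own small vocabulary, so the
    large chord vocabulary factorizes into six multiclass problems."""
    root_triad = CHROMAS.index(root) * len(TRIADS) + TRIADS.index(triad)
    return [
        root_triad,              # y1: root+triad
        CHROMAS.index(bass),     # y2: bass pitch
        seventh,                 # y3: seventh extension index (0 = none)
        ninth,                   # y4: ninth extension index
        eleventh,                # y5: eleventh extension index
        thirteenth,              # y6: thirteenth extension index
    ]

# Example: C:maj7 with C in the bass -> seventh component active, others "none".
target = encode_chord("C", "maj", "C", seventh=1, ninth=0, eleventh=0, thirteenth=0)
```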
4. ChordFormer Model Architecture
4.1 Conformer Blocks
The core of ChordFormer lies in its stack of Conformer blocks, which hybridize convolutional and attention-based sequence modeling. The initial CQT frame is linearly projected from 252 to 256 dimensions. The architecture contains four stacked Conformer blocks, each comprising:
- First half-step feed-forward module (FFN):

$$\tilde{\mathbf{x}}_t = \mathbf{x}_t + \tfrac{1}{2}\,\mathrm{FFN}(\mathbf{x}_t),$$

utilizing pre-layer normalization, Swish activation, dropout, and residual connections.
- Multi-Head Self-Attention (MHSA): employs relative sinusoidal positional encoding and pre-norm. Each head $h$ computes queries, keys, and values:

$$\mathbf{Q}_h = \tilde{\mathbf{X}}\mathbf{W}_h^{Q}, \quad \mathbf{K}_h = \tilde{\mathbf{X}}\mathbf{W}_h^{K}, \quad \mathbf{V}_h = \tilde{\mathbf{X}}\mathbf{W}_h^{V},$$

$$\mathrm{head}_h = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_h \mathbf{K}_h^{\top}}{\sqrt{d_k}}\right)\mathbf{V}_h.$$

The head outputs are concatenated and projected. The block state updates as

$$\mathbf{x}'_t = \tilde{\mathbf{x}}_t + \mathrm{MHSA}(\tilde{\mathbf{x}}_t).$$

- Convolutional module: involves pre-norm, pointwise convolution (followed by GLU gating), depthwise 1D convolution (kernel size 31), batch normalization, Swish activation, and dropout. The module output is

$$\mathbf{x}''_t = \mathbf{x}'_t + \mathrm{Conv}(\mathbf{x}'_t).$$

- Second half-step FFN and LayerNorm:

$$\mathbf{y}_t = \mathrm{LayerNorm}\!\left(\mathbf{x}''_t + \tfrac{1}{2}\,\mathrm{FFN}(\mathbf{x}''_t)\right).$$
Schematic descriptions of these modules correspond to Figure 1 in (Akram et al., 17 Feb 2025).
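A compact PyTorch sketch of one such block is given below. It follows the equations above under the configuration of Section 4.2, but simplifies the attention module (standard multi-head attention rather than relative sinusoidal positional encoding) and is illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """One Conformer block following the equations above (sketch)."""

    def __init__(self, d_model=256, n_heads=4, ffn_dim=1024,
                 kernel_size=31, dropout=0.1):
        super().__init__()
        self.ffn1 = self._ffn(d_model, ffn_dim, dropout)
        self.ffn2 = self._ffn(d_model, ffn_dim, dropout)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, 1),          # pointwise, expansion 2
            nn.GLU(dim=1),                               # GLU gating
            nn.Conv1d(d_model, d_model, kernel_size,
                      padding=kernel_size // 2, groups=d_model),  # depthwise
            nn.BatchNorm1d(d_model),
            nn.SiLU(),                                   # Swish activation
            nn.Conv1d(d_model, d_model, 1),              # pointwise projection
            nn.Dropout(dropout),
        )
        self.final_norm = nn.LayerNorm(d_model)

    @staticmethod
    def _ffn(d_model, ffn_dim, dropout):
        # Pre-norm FFN: LayerNorm -> expand -> Swish -> project back.
        return nn.Sequential(
            nn.LayerNorm(d_model), nn.Linear(d_model, ffn_dim),
            nn.SiLU(), nn.Dropout(dropout),
            nn.Linear(ffn_dim, d_model), nn.Dropout(dropout),
        )

    def forward(self, x):                                # x: (batch, T, d_model)
        x = x + 0.5 * self.ffn1(x)                       # first half-step FFN
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # MHSA residual
        x = x + self.conv(self.conv_norm(x).transpose(1, 2)).transpose(1, 2)
        return self.final_norm(x + 0.5 * self.ffn2(x))   # second half-step FFN
```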
4.2 Global Sequence Configuration
The overall architectural configuration is:
- Input: linear 252→256 projection, followed by four Conformer blocks
- Embedding dimension: $d_{\text{model}} = 256$
- Attention: 4 heads, $d_k = 64$ per head
- FFN inner dimension: 1024
- Convolutional kernel: 31, expansion factor 2
- Activation: Swish, with softmax output
- Dropout: 0.1 after each sublayer
- Residual pre-normalization throughout
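Under these settings, the encoder can be wired up from the ConformerBlock sketch in Section 4.1 (illustrative assembly only):

```python
import torch
import torch.nn as nn

# Assumes the ConformerBlock sketch from Section 4.1 is in scope.
encoder = nn.Sequential(
    nn.Linear(252, 256),                     # CQT projection 252 -> 256
    *[ConformerBlock(d_model=256, n_heads=4, ffn_dim=1024,
                     kernel_size=31, dropout=0.1) for _ in range(4)],
)

spec = torch.randn(8, 1000, 252)             # (batch, frames, CQT bins)
states = encoder(spec)                       # -> (8, 1000, 256)
```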
4.3 Output Projection and CRF Decoding
The final network state (shape $T \times 256$) is linearly mapped into six logit vectors $\mathbf{z}_t^{(c)}$ for $c = 1, \ldots, 6$. Softmax normalization yields per-component probabilities,

$$P\!\left(y_t^{(c)} = k \mid \mathbf{x}\right) = \mathrm{softmax}\!\left(\mathbf{z}_t^{(c)}\right)_k.$$

Decoding is performed not by simple per-frame argmax but via a linear-chain conditional random field (CRF) imposing temporal smoothness. The probability of a chord label sequence $\mathbf{y}_{1:T}$ given input $\mathbf{x}_{1:T}$ is modeled as:

$$P(\mathbf{y}_{1:T} \mid \mathbf{x}_{1:T}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \psi_{\mathrm{em}}(y_t, \mathbf{x}_t)\,\psi_{\mathrm{tr}}(y_{t-1}, y_t),$$

with emission potential

$$\psi_{\mathrm{em}}(y_t, \mathbf{x}_t) = P(y_t \mid \mathbf{x}_t)$$

and transition potential

$$\psi_{\mathrm{tr}}(y_{t-1}, y_t) = \exp\!\left(\lambda\,\mathbb{1}[y_{t-1} = y_t]\right),$$

where $\mathbb{1}[\cdot]$ is the indicator function and $\lambda$ controls transition penalties.
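A minimal NumPy sketch of this decoding step for a single chord component is shown below; it assumes per-frame log emission scores and the single-parameter self-transition model above, with λ left as a free parameter.

```python
import numpy as np

def crf_viterbi(log_probs, lam=2.0):
    """Viterbi decoding for a linear-chain CRF whose transition potential
    adds a bonus `lam` for staying on the same label (temporal smoothing).
    log_probs: (T, K) array of per-frame log emission scores."""
    T, K = log_probs.shape
    score = log_probs[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # trans[i, j]: score of moving from label i to label j; staying on
        # the same label earns the self-transition bonus lam.
        trans = score[:, None] + lam * np.eye(K)
        backptr[t] = trans.argmax(axis=0)
        score = trans.max(axis=0) + log_probs[t]
    # Backtrace the best label sequence.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```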
5. Class Imbalance Mitigation
ChordFormer introduces a weighted cross-entropy objective over all frames $t$ and chord components $c$:

$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{c=1}^{6} w^{(c)}_{y_t^{(c)}} \log P\!\left(y_t^{(c)} \mid \mathbf{x}\right).$$

Weights $w_k^{(c)}$ are computed as:

$$w_k^{(c)} = \min\!\left(\left(\frac{N^{(c)}}{n_k^{(c)}}\right)^{\gamma},\; w_{\max}\right),$$

where $n_k^{(c)}$ is the count of training samples for label $k$ in component $c$ and $N^{(c)}$ is the total count for that component; $\gamma$ controls the balancing tradeoff, and $w_{\max}$ caps the largest class weight. Empirical tuning of $\gamma$ and $w_{\max}$ amplifies gradient signals for rare chords, improving class-level accuracy while controlling overemphasis (Akram et al., 17 Feb 2025).
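One plausible implementation of these weights is sketched below; the inverse-frequency base and the γ and w_max values are illustrative, as the paper's exact settings are not reproduced here.

```python
import numpy as np

def class_weights(label_counts, gamma=0.5, w_max=10.0):
    """Capped inverse-frequency weights for one chord component.
    label_counts: array of training-sample counts n_k per label
    (assumes every label is observed at least once)."""
    n = np.asarray(label_counts, dtype=float)
    w = (n.sum() / n) ** gamma          # rarer labels get larger weights
    return np.minimum(w, w_max)         # cap to avoid over-emphasis

# Example: a long-tailed component with one dominant and one rare label.
print(class_weights([90_000, 9_000, 1_000, 50]))
```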
6. Training Protocol and Optimization
ChordFormer is optimized using AdamW, with a plateau scheduler that decays the learning rate after 5 non-improving epochs and early stopping once the learning rate falls below a minimum threshold. During training, each epoch for a given song randomly extracts a 1000-frame segment (≈23.2 s), with batch size 24 (24,000 frames per batch). Regularization includes dropout (rate 0.1), pre-norm residuals, and batch normalization within convolutional modules. Augmentation is performed as described, with pitch shifts of –5 to +6 semitones (Akram et al., 17 Feb 2025).
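The segment sampling and learning-rate schedule can be sketched with standard PyTorch utilities; the initial learning rate and decay factor below are placeholder values, not the paper's exact settings.

```python
import numpy as np
import torch

def sample_segment(spec, labels, seg_len=1000):
    """Randomly crop one 1000-frame (~23.2 s) training segment from a song.
    spec: (T, 252) CQT frames; labels: (T, 6) structured chord targets."""
    start = np.random.randint(0, max(1, spec.shape[0] - seg_len + 1))
    return spec[start:start + seg_len], labels[start:start + seg_len]

# Optimizer and plateau scheduler; lr and factor are placeholders, not the
# paper's exact values.
model = torch.nn.Linear(252, 256)            # stand-in for the full network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)   # decay after 5 flat epochs
# Per epoch: scheduler.step(val_loss); training stops early once the LR
# falls below a minimum threshold, as described above.
```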
7. Empirical Performance and Module Impact
On the Humphrey–Bello 1,217-song corpus (5-fold cross-validation, 60/20/20 split), ChordFormer attained:
- Frame-wise accuracy: 78.77% (vs. CNN+BLSTM 76.76%, +2.01 pp)
- Class-wise accuracy: 38.84% (vs. CNN+BLSTM 33.15%, +5.69 pp)
- MIREX score: 83.62% (vs. CNN+BLSTM 81.52%)
- Breakdown: Root 84.69%, Maj/Min 84.09%, Triads 77.55%, Sevenths 72.28%
Ablation studies revealed:
- Transformer-only: improved global modeling but weaker local spectral detail; triad accuracy 67.8%
- CNN-only: robust local pattern modeling but lacking long-range context; seventh/extension recall 67.3%
- CNN+BLSTM: incremental improvements over either backbone individually but behind Conformer hybrid
- ChordFormer-R (with reweighted loss): best addresses rare-class prediction, with class-wise accuracy peaking at 44.71% under specific weight settings; the MIREX score improves a further 0.8% relative to the baseline
Increased reweighting (a larger $\gamma$ and a higher $w_{\max}$ cap) improves recall for rare classes (e.g., diminished, augmented, and extended chords) with only a modest trade-off in overall frame accuracy. Confusion matrices demonstrate that hybrid modeling reduces misclassification among chord extensions.
Summary Table: ChordFormer Distinctives
| Component | Feature/Role | Empirical Impact |
|---|---|---|
| Constant-Q spectrogram | Input representation | High spectral resolution |
| Structured 6-part chord output | Semantic decomposition | Improved interpretability |
| 4-layer Conformer stack | Hybrid local/global context | SOTA accuracy, balanced recall |
| CRF decoder | Temporal coherence | Smoothed predictions |
| Reweighted loss | Class imbalance mitigation | Raised rare-class recall |
Contextually, ChordFormer advances the field of large-vocabulary chord recognition by successfully combining conformer-based sequence modeling, structured chord interpretation, adaptive loss weighting, and temporal CRF smoothing, achieving leading results on benchmark datasets and robust performance across all chord types (Akram et al., 17 Feb 2025).