ChordFormer Architecture for Audio Chord Recognition
- ChordFormer is a conformer-based deep learning architecture designed for transcribing polyphonic music into structured chord labels by decomposing chords into six musically interpretable components.
- It employs a hybrid local-global sequence modeling approach with Conformer blocks, adaptive reweighting loss, and CRF decoding to address class imbalance and maintain temporal coherence.
- The model utilizes a Constant-Q Transform for high-resolution spectral analysis and achieves state-of-the-art performance in both frame-wise and class-wise chord recognition on benchmark datasets.
ChordFormer is a conformer-based deep learning architecture designed for large-vocabulary audio chord recognition, emphasizing structured chord decomposition, hybrid local-global sequence modeling, and mitigation of class imbalance. The model targets transcription of polyphonic music audio into detailed, musically meaningful chord labels, addressing the challenges posed by the long-tail distribution of chord types and the inherent need to capture both fine spectral structure and extended harmonic context (Akram et al., 17 Feb 2025).
1. Design Objectives and Core Challenges
ChordFormer was developed to transcribe audio into structured chord labels encompassing root+triad, bass, seventh, ninth, eleventh, and thirteenth extensions. The architecture addresses several key challenges:
- Long-tail chord distribution: Many rare chord types and extensions are sparsely represented in datasets, exacerbating class imbalance and limiting recognition performance.
- Contextual modeling: Accurate chord recognition depends on capturing both fine-grained local spectral features (e.g., chord partials, voicing) and long-range harmonic dependencies (e.g., progressions, modulations).
- Structured chord representation: Using a musically meaningful decomposition enhances interpretability and allows for effective parameter sharing and cross-family generalization.
- Class imbalance: Handled explicitly via a re-weighted loss, allowing robust learning even for underrepresented chord types.
2. Input Pipeline and Feature Extraction
ChordFormer processes audio sampled at 22,050 Hz. The primary feature input is a Constant-Q Transform (CQT) spectrogram spanning C1–C8, with 36 bins per octave, resulting in 252 frequency bins per frame. The spectrogram is converted to a decibel scale (librosa’s amplitude_to_db) and normalized to the per-track maximum. The hop length of 512 samples yields a temporal resolution of approximately 23.2 ms per frame. Data augmentation is performed via pitch-shifting each training sample by –5 to +6 semitones, with both spectrograms and chord labels shifted accordingly (Akram et al., 17 Feb 2025).
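The following sketch reproduces this feature pipeline with librosa. The function names are illustrative, and the bin-rolling augmentation is an assumption about how label-consistent pitch shifts could be approximated directly on the CQT, not necessarily the authors' exact procedure.

```python
import librosa
import numpy as np

def extract_cqt_features(audio_path):
    """CQT feature pipeline as described above (illustrative sketch)."""
    y, sr = librosa.load(audio_path, sr=22050)           # resample to 22,050 Hz
    C = librosa.cqt(
        y, sr=sr,
        hop_length=512,                                  # ~23.2 ms per frame
        fmin=librosa.note_to_hz("C1"),                   # spectrum spans C1-C8
        n_bins=252,                                      # 7 octaves x 36 bins
        bins_per_octave=36,
    )
    # Decibel scale, referenced to the per-track maximum (0 dB at the peak).
    C_db = librosa.amplitude_to_db(np.abs(C), ref=np.max)
    return C_db.T                                        # (frames, 252)

def pitch_shift_cqt(C_db, semitones):
    """Augmentation sketch: with 36 bins/octave, one semitone = 3 CQT bins,
    so a pitch shift can be approximated by rolling the frequency axis
    (edge bins wrap around). Chord labels would be transposed by the same
    number of semitones (omitted here)."""
    return np.roll(C_db, semitones * 3, axis=1)
```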
3. Structured Chord Output Representation
Each time frame is annotated with a 6-dimensional vector

$$\mathbf{y}_t = \left(y_t^{(1)}, y_t^{(2)}, y_t^{(3)}, y_t^{(4)}, y_t^{(5)}, y_t^{(6)}\right),$$

where each element encodes a musically interpretable component:
- $y^{(1)}$: root+triad (13 roots × 7 triads + no-chord "N")
- $y^{(2)}$: bass pitch (12 chromas + "N")
- $y^{(3)}$: seventh extension
- $y^{(4)}$: ninth extension
- $y^{(5)}$: eleventh extension
- $y^{(6)}$: thirteenth extension
This structured, one-hot-encoded representation allows the problem of large-vocabulary chord recognition to be decomposed into six smaller multiclass classification tasks, reflecting music theory hierarchies and enabling parameter sharing across related chord types (Akram et al., 17 Feb 2025).
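As a minimal illustration of this factorization, the sketch below encodes a single annotation into the six component targets; the triad inventory and extension indices are assumed for illustration and may differ from the paper's exact vocabularies.

```python
# Illustrative component vocabularies; the paper's exact inventories may differ.
TRIADS = ["maj", "min", "dim", "aug", "sus2", "sus4", "N"]  # 7 triad slots (assumed)
CHROMAS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def encode_chord(root, triad, bass, seventh, ninth, eleventh, thirteenth):
    """Map one chord annotation to the 6-dimensional structured target.
    Each component becomes an index into its own small vocabulary, so the
    large chord vocabulary factorizes into six multiclass problems."""
    root_triad = CHROMAS.index(root) * len(TRIADS) + TRIADS.index(triad)
    return [
        root_triad,              # y1: root+triad
        CHROMAS.index(bass),     # y2: bass pitch
        seventh,                 # y3: seventh extension index (0 = none)
        ninth,                   # y4: ninth extension index
        eleventh,                # y5: eleventh extension index
        thirteenth,              # y6: thirteenth extension index
    ]

# Example: C:maj7 with C in the bass -> seventh component active, others "none".
target = encode_chord("C", "maj", "C", seventh=1, ninth=0, eleventh=0, thirteenth=0)
```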
4. ChordFormer Model Architecture
4.1 Conformer Blocks
The core of ChordFormer lies in its stack of Conformer blocks, which hybridize convolutional and attention-based sequence modeling. The initial CQT frame is linearly projected from 252 to 256 dimensions. The architecture contains four stacked Conformer blocks, each comprising:
- First half-step feed-forward module (FFN):

$$\tilde{\mathbf{x}}_t = \mathbf{x}_t + \tfrac{1}{2}\,\mathrm{FFN}(\mathbf{x}_t),$$

utilizing pre-layer normalization, Swish activation, dropout, and residual connections.
- Multi-Head Self-Attention (MHSA): employs relative sinusoidal positional encoding and pre-norm. Each head $h$ computes queries, keys, and values:

$$\mathbf{Q}_h = \tilde{\mathbf{X}}\mathbf{W}_h^{Q}, \quad \mathbf{K}_h = \tilde{\mathbf{X}}\mathbf{W}_h^{K}, \quad \mathbf{V}_h = \tilde{\mathbf{X}}\mathbf{W}_h^{V},$$

$$\mathrm{head}_h = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_h \mathbf{K}_h^{\top}}{\sqrt{d_k}}\right)\mathbf{V}_h.$$

The head outputs are concatenated and projected. The block state updates as

$$\mathbf{x}'_t = \tilde{\mathbf{x}}_t + \mathrm{MHSA}(\tilde{\mathbf{x}}_t).$$

- Convolutional module: involves pre-norm, pointwise convolution (followed by GLU gating), depthwise 1D convolution (kernel size 31), batch normalization, Swish activation, and dropout. The module output is

$$\mathbf{x}''_t = \mathbf{x}'_t + \mathrm{Conv}(\mathbf{x}'_t).$$

- Second half-step FFN and LayerNorm:

$$\mathbf{y}_t = \mathrm{LayerNorm}\!\left(\mathbf{x}''_t + \tfrac{1}{2}\,\mathrm{FFN}(\mathbf{x}''_t)\right).$$
Schematic descriptions of these modules correspond to Figure 1 in (Akram et al., 17 Feb 2025).
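A compact PyTorch sketch of one such block is given below. It follows the equations above under the configuration of Section 4.2, but simplifies the attention module (standard multi-head attention rather than relative sinusoidal positional encoding) and is illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """One Conformer block following the equations above (sketch)."""

    def __init__(self, d_model=256, n_heads=4, ffn_dim=1024,
                 kernel_size=31, dropout=0.1):
        super().__init__()
        self.ffn1 = self._ffn(d_model, ffn_dim, dropout)
        self.ffn2 = self._ffn(d_model, ffn_dim, dropout)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, 1),          # pointwise, expansion 2
            nn.GLU(dim=1),                               # GLU gating
            nn.Conv1d(d_model, d_model, kernel_size,
                      padding=kernel_size // 2, groups=d_model),  # depthwise
            nn.BatchNorm1d(d_model),
            nn.SiLU(),                                   # Swish activation
            nn.Conv1d(d_model, d_model, 1),              # pointwise projection
            nn.Dropout(dropout),
        )
        self.final_norm = nn.LayerNorm(d_model)

    @staticmethod
    def _ffn(d_model, ffn_dim, dropout):
        # Pre-norm FFN: LayerNorm -> expand -> Swish -> project back.
        return nn.Sequential(
            nn.LayerNorm(d_model), nn.Linear(d_model, ffn_dim),
            nn.SiLU(), nn.Dropout(dropout),
            nn.Linear(ffn_dim, d_model), nn.Dropout(dropout),
        )

    def forward(self, x):                                # x: (batch, T, d_model)
        x = x + 0.5 * self.ffn1(x)                       # first half-step FFN
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # MHSA residual
        x = x + self.conv(self.conv_norm(x).transpose(1, 2)).transpose(1, 2)
        return self.final_norm(x + 0.5 * self.ffn2(x))   # second half-step FFN
```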
4.2 Global Sequence Configuration
The overall architectural configuration is:
- Input: linear 252→256 projection, followed by four Conformer blocks
- Embedding dimension: $d_{\text{model}} = 256$
- Attention: 4 heads, $d_k = 64$ per head
- FFN inner dimension: 1024
- Convolutional kernel: 31, expansion factor 2
- Activation: Swish, with softmax output
- Dropout: 0.1 after each sublayer
- Residual pre-normalization throughout
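Under these settings, the encoder can be wired up from the ConformerBlock sketch in Section 4.1 (illustrative assembly only):

```python
import torch
import torch.nn as nn

# Assumes the ConformerBlock sketch from Section 4.1 is in scope.
encoder = nn.Sequential(
    nn.Linear(252, 256),                     # CQT projection 252 -> 256
    *[ConformerBlock(d_model=256, n_heads=4, ffn_dim=1024,
                     kernel_size=31, dropout=0.1) for _ in range(4)],
)

spec = torch.randn(8, 1000, 252)             # (batch, frames, CQT bins)
states = encoder(spec)                       # -> (8, 1000, 256)
```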
4.3 Output Projection and CRF Decoding
The final network state (shape $T \times 256$) is linearly mapped into six logit vectors $\mathbf{z}_t^{(c)}$ for $c = 1, \ldots, 6$. Softmax normalization yields per-component probabilities,

$$P\!\left(y_t^{(c)} = k \mid \mathbf{x}\right) = \mathrm{softmax}\!\left(\mathbf{z}_t^{(c)}\right)_k.$$

Decoding is performed not by simple per-frame argmax but via a linear-chain conditional random field (CRF) imposing temporal smoothness. The probability of a chord label sequence $\mathbf{y}_{1:T}$ given input $\mathbf{x}_{1:T}$ is modeled as:

$$P(\mathbf{y}_{1:T} \mid \mathbf{x}_{1:T}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \psi_{\mathrm{em}}(y_t, \mathbf{x}_t)\,\psi_{\mathrm{tr}}(y_{t-1}, y_t),$$

with emission potential

$$\psi_{\mathrm{em}}(y_t, \mathbf{x}_t) = P(y_t \mid \mathbf{x}_t)$$

and transition potential

$$\psi_{\mathrm{tr}}(y_{t-1}, y_t) = \exp\!\left(\lambda\,\mathbb{1}[y_{t-1} = y_t]\right),$$

where $\mathbb{1}[\cdot]$ is the indicator function and $\lambda$ controls transition penalties.
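A minimal NumPy sketch of this decoding step for a single chord component is shown below; it assumes per-frame log emission scores and the single-parameter self-transition model above, with λ left as a free parameter.

```python
import numpy as np

def crf_viterbi(log_probs, lam=2.0):
    """Viterbi decoding for a linear-chain CRF whose transition potential
    adds a bonus `lam` for staying on the same label (temporal smoothing).
    log_probs: (T, K) array of per-frame log emission scores."""
    T, K = log_probs.shape
    score = log_probs[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # trans[i, j]: score of moving from label i to label j; staying on
        # the same label earns the self-transition bonus lam.
        trans = score[:, None] + lam * np.eye(K)
        backptr[t] = trans.argmax(axis=0)
        score = trans.max(axis=0) + log_probs[t]
    # Backtrace the best label sequence.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```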
5. Class Imbalance Mitigation
ChordFormer introduces a weighted cross-entropy objective over all frames $t$ and chord components $c$:

$$\mathcal{L} = -\sum_{t=1}^{T} \sum_{c=1}^{6} w^{(c)}_{y_t^{(c)}} \log P\!\left(y_t^{(c)} \mid \mathbf{x}\right).$$

Weights $w_k^{(c)}$ are computed as:

$$w_k^{(c)} = \min\!\left(\left(\frac{N^{(c)}}{n_k^{(c)}}\right)^{\gamma},\; w_{\max}\right),$$

where $n_k^{(c)}$ is the count of training samples for label $k$ in component $c$ and $N^{(c)}$ is the total count for that component; $\gamma$ controls the balancing tradeoff, and $w_{\max}$ caps the largest class weight. Empirical tuning of $\gamma$ and $w_{\max}$ amplifies gradient signals for rare chords, improving class-level accuracy while controlling overemphasis (Akram et al., 17 Feb 2025).
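One plausible implementation of these weights is sketched below; the inverse-frequency base and the γ and w_max values are illustrative, as the paper's exact settings are not reproduced here.

```python
import numpy as np

def class_weights(label_counts, gamma=0.5, w_max=10.0):
    """Capped inverse-frequency weights for one chord component.
    label_counts: array of training-sample counts n_k per label
    (assumes every label is observed at least once)."""
    n = np.asarray(label_counts, dtype=float)
    w = (n.sum() / n) ** gamma          # rarer labels get larger weights
    return np.minimum(w, w_max)         # cap to avoid over-emphasis

# Example: a long-tailed component with one dominant and one rare label.
print(class_weights([90_000, 9_000, 1_000, 50]))
```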
6. Training Protocol and Optimization
ChordFormer is optimized using AdamW, with a plateau scheduler that decays the learning rate after 5 non-improving epochs and early stopping once the learning rate falls below a minimum threshold. During training, each epoch for a given song randomly extracts a 1000-frame segment (≈23.2 s), with batch size 24 (24,000 frames per batch). Regularization includes dropout (rate 0.1), pre-norm residuals, and batch normalization within convolutional modules. Augmentation is performed as described, with pitch shifts of –5 to +6 semitones (Akram et al., 17 Feb 2025).
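The segment sampling and learning-rate schedule can be sketched with standard PyTorch utilities; the initial learning rate and decay factor below are placeholder values, not the paper's exact settings.

```python
import numpy as np
import torch

def sample_segment(spec, labels, seg_len=1000):
    """Randomly crop one 1000-frame (~23.2 s) training segment from a song.
    spec: (T, 252) CQT frames; labels: (T, 6) structured chord targets."""
    start = np.random.randint(0, max(1, spec.shape[0] - seg_len + 1))
    return spec[start:start + seg_len], labels[start:start + seg_len]

# Optimizer and plateau scheduler; lr and factor are placeholders, not the
# paper's exact values.
model = torch.nn.Linear(252, 256)            # stand-in for the full network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)   # decay after 5 flat epochs
# Per epoch: scheduler.step(val_loss); training stops early once the LR
# falls below a minimum threshold, as described above.
```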
7. Empirical Performance and Module Impact
On the Humphrey–Bello 1,217-song corpus (5-fold cross-validation, 60/20/20 split), ChordFormer attained:
- Frame-wise accuracy: 78.77% (vs. CNN+BLSTM 76.76%, +2.01 pp)
- Class-wise accuracy: 38.84% (vs. CNN+BLSTM 33.15%, +5.69 pp)
- MIREX score: 83.62% (vs. CNN+BLSTM 81.52%)
- Breakdown: Root 84.69%, Maj/Min 84.09%, Triads 77.55%, Sevenths 72.28%
Ablation studies revealed:
- Transformer-only: improved global modeling but weaker local spectral detail; triad accuracy 67.8%
- CNN-only: robust local pattern modeling but lacking long-range context; seventh/extension recall 67.3%
- CNN+BLSTM: incremental improvements over either backbone individually but behind Conformer hybrid
- ChordFormer-R (with reweighted loss): best addresses rare-class prediction, with class-wise accuracy peaking at 44.71% under specific weight settings; the MIREX score improves a further 0.8% relative to the baseline
Increased reweighting (a larger $\gamma$ and a higher $w_{\max}$ cap) improves recall for rare classes (e.g., diminished, augmented, and extended chords) with only a modest trade-off in overall frame accuracy. Confusion matrices demonstrate that hybrid modeling reduces misclassification among chord extensions.
Summary Table: ChordFormer Distinctives
| Component | Feature/Role | Empirical Impact |
|---|---|---|
| Constant-Q spectrogram | Input representation | High spectral resolution |
| Structured 6-part chord output | Semantic decomposition | Improved interpretability |
| 4-layer Conformer stack | Hybrid local/global context | SOTA accuracy, balanced recall |
| CRF decoder | Temporal coherence | Smoothed predictions |
| Reweighted loss | Class imbalance mitigation | Raised rare-class recall |
Contextually, ChordFormer advances the field of large-vocabulary chord recognition by successfully combining conformer-based sequence modeling, structured chord interpretation, adaptive loss weighting, and temporal CRF smoothing, achieving leading results on benchmark datasets and robust performance across all chord types (Akram et al., 17 Feb 2025).