
ChordFormer Architecture for Audio Chord Recognition

Updated 12 May 2026
  • ChordFormer is a conformer-based deep learning architecture designed for transcribing polyphonic music into structured chord labels by decomposing chords into six musically interpretable components.
  • It employs a hybrid local-global sequence modeling approach with Conformer blocks, adaptive reweighting loss, and CRF decoding to address class imbalance and maintain temporal coherence.
  • The model utilizes a Constant-Q Transform for high-resolution spectral analysis and achieves state-of-the-art performance in both frame-wise and class-wise chord recognition on benchmark datasets.

ChordFormer is a conformer-based deep learning architecture designed for large-vocabulary audio chord recognition, emphasizing structured chord decomposition, hybrid local-global sequence modeling, and mitigation of class imbalance. The model targets transcription of polyphonic music audio into detailed, musically meaningful chord labels, addressing the challenges posed by the long-tail distribution of chord types and the inherent need to capture both fine spectral structure and extended harmonic context (Akram et al., 17 Feb 2025).

1. Design Objectives and Core Challenges

ChordFormer was developed to transcribe audio into structured chord labels encompassing root+triad, bass, seventh, ninth, eleventh, and thirteenth extensions. The architecture addresses several key challenges:

  • Long-tail chord distribution: Many rare chord types and extensions are sparsely represented in datasets, exacerbating class imbalance and limiting recognition performance.
  • Contextual modeling: Accurate chord recognition depends on capturing both fine-grained local spectral features (e.g., chord partials, voicing) and long-range harmonic dependencies (e.g., progressions, modulations).
  • Structured chord representation: Using a musically meaningful decomposition enhances interpretability and allows for effective parameter sharing and cross-family generalization.
  • Class imbalance: Handled explicitly via a re-weighted loss, allowing robust learning even for underrepresented chord types.

2. Input Pipeline and Feature Extraction

ChordFormer processes audio sampled at 22,050 Hz. The primary feature input is a Constant-Q Transform (CQT) spectrogram spanning C1–C8, with 36 bins per octave, resulting in 252 frequency bins per frame. The spectrogram is converted to a decibel scale (librosa’s amplitude_to_db) and normalized to the per-track maximum. The hop length of 512 samples yields a temporal resolution of approximately 23.2 ms per frame. Data augmentation is performed via pitch-shifting each training sample by –5 to +6 semitones, with both spectrograms and chord labels shifted accordingly (Akram et al., 17 Feb 2025).
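
As a concrete illustration, the following sketch reproduces this feature pipeline with librosa (function names and structure are illustrative, not from the paper's code):

```python
import librosa
import numpy as np

def cqt_features(path, sr=22050, hop=512):
    """ChordFormer-style input: CQT over C1-C8, 36 bins/octave -> 252 bins."""
    y, _ = librosa.load(path, sr=sr)
    C = np.abs(librosa.cqt(y, sr=sr, hop_length=hop,
                           fmin=librosa.note_to_hz("C1"),
                           n_bins=252, bins_per_octave=36))
    # dB scale normalized to the per-track maximum (0 dB = loudest bin)
    C_db = librosa.amplitude_to_db(C, ref=np.max)
    return C_db.T  # (frames, 252); one frame every 512/22050 s ~ 23.2 ms

def pitch_shift_augment(y, n_steps, sr=22050):
    """Augmentation: shift audio by n_steps semitones (-5..+6 in the paper).
    The chord labels must be transposed by the same amount."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
```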

3. Structured Chord Output Representation

Each time frame $t$ is annotated with a 6-dimensional vector

$Z^{(t)} = \left[\, z_1^{(t)},\; z_2^{(t)},\; z_3^{(t)},\; z_4^{(t)},\; z_5^{(t)},\; z_6^{(t)} \,\right]$

where each element encodes a musically interpretable component:

  • $z_1$: root+triad (13 roots × 7 triads + no-chord “N”)
  • $z_2$: bass pitch (12 chromas + “N”)
  • $z_3$: seventh extension ($\in \{\mathrm{N},\, 7,\, \flat 7,\, \flat\flat 7\}$)
  • $z_4$: ninth extension ($\in \{\mathrm{N},\, 9,\, \sharp 9,\, \flat 9\}$)
  • $z_5$: eleventh extension ($\in \{\mathrm{N},\, 11,\, \sharp 11\}$)
  • $z_6$: thirteenth extension ($\in \{\mathrm{N},\, 13,\, \flat 13\}$)

This structured, one-hot-encoded representation allows the problem of large-vocabulary chord recognition to be decomposed into six smaller multiclass classification tasks, reflecting music theory hierarchies and enabling parameter sharing across related chord types (Akram et al., 17 Feb 2025).
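
As an illustration of this decomposition, a minimal frame-label encoder might look as follows; the component vocabularies mirror the sets above, while the interface itself is hypothetical and invented for this sketch:

```python
# Component vocabularies mirroring Section 3 (list indices = class labels).
SEVENTH    = ["N", "7", "b7", "bb7"]
NINTH      = ["N", "9", "#9", "b9"]
ELEVENTH   = ["N", "11", "#11"]
THIRTEENTH = ["N", "13", "b13"]

def encode_frame(root_triad_id, bass_chroma, seventh="N", ninth="N",
                 eleventh="N", thirteenth="N"):
    """Return the 6-dim class-index vector Z^(t) for one frame."""
    return [
        root_triad_id,                  # z1: root x triad class (+ no-chord)
        bass_chroma,                    # z2: 0-11 chroma, 12 = "N"
        SEVENTH.index(seventh),         # z3
        NINTH.index(ninth),             # z4
        ELEVENTH.index(eleventh),       # z5
        THIRTEENTH.index(thirteenth),   # z6
    ]

# e.g. a C major seventh chord over C: some root+triad id, bass chroma 0, "7"
print(encode_frame(root_triad_id=0, bass_chroma=0, seventh="7"))
```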

4. ChordFormer Model Architecture

4.1 Conformer Blocks

The core of ChordFormer lies in its stack of Conformer blocks, which hybridize convolutional and attention-based sequence modeling. The input CQT frames are linearly projected from 252 to 256 dimensions. The architecture contains 4 stacked Conformer blocks, each comprising:

  • First half-step Feed-Forward (FFN):

$\tilde{x} = x + \tfrac{1}{2}\,\mathrm{FFN}(x)$

utilizing pre-layer normalization, Swish activation, dropout, and residual connections.

  • Multi-Head Self-Attention (MHSA): Employs relative sinusoidal positional encoding and pre-norm. Each head $h$ computes queries, keys, and values:

$Q_h = \tilde{x} W_h^{Q}, \quad K_h = \tilde{x} W_h^{K}, \quad V_h = \tilde{x} W_h^{V}$

$\mathrm{head}_h = \mathrm{softmax}\!\left( \frac{Q_h K_h^{\top} + S_{\mathrm{rel}}}{\sqrt{d_k}} \right) V_h$

where $S_{\mathrm{rel}}$ collects the relative-position terms. The head outputs are concatenated and projected. The block state updates as

$x' = \tilde{x} + \mathrm{MHSA}(\tilde{x})$

  • Convolutional module: Involves pre-norm, pointwise convolution (followed by GLU gating), depthwise 1D convolution (kernel size 31), batch normalization, Swish activation, and dropout. The module output is

$x'' = x' + \mathrm{Conv}(x')$

  • Second half-step FFN and LayerNorm:

$y = \mathrm{LayerNorm}\!\left( x'' + \tfrac{1}{2}\,\mathrm{FFN}(x'') \right)$

Schematic descriptions of these modules correspond to Figure 1 in (Akram et al., 17 Feb 2025).
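
A minimal PyTorch sketch of one block follows. It matches the hyperparameters stated in Section 4.2 (model dim 256, FFN inner dim 1024, kernel 31, expansion factor 2, dropout 0.1), but uses absolute-position nn.MultiheadAttention rather than relative positional encoding, and the head count of 4 is an assumption:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Half-step FFN: pre-norm, Swish, dropout (scaled by 1/2 at the call site)."""
    def __init__(self, d_model=256, d_ff=1024, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model), nn.Linear(d_model, d_ff), nn.SiLU(),
            nn.Dropout(dropout), nn.Linear(d_ff, d_model), nn.Dropout(dropout))
    def forward(self, x):
        return self.net(x)

class ConvModule(nn.Module):
    """Pre-norm -> pointwise conv + GLU -> depthwise conv -> BN -> Swish -> pointwise."""
    def __init__(self, d_model=256, kernel=31, expansion=2, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pw1 = nn.Conv1d(d_model, expansion * d_model, 1)  # GLU halves channels
        self.dw = nn.Conv1d(d_model, d_model, kernel,
                            padding=kernel // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.pw2 = nn.Conv1d(d_model, d_model, 1)
        self.drop = nn.Dropout(dropout)
    def forward(self, x):                          # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)           # -> (batch, d_model, time)
        y = nn.functional.glu(self.pw1(y), dim=1)
        y = nn.functional.silu(self.bn(self.dw(y)))
        y = self.drop(self.pw2(y))
        return y.transpose(1, 2)

class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, heads=4, dropout=0.1):  # heads: assumption
        super().__init__()
        self.ff1, self.ff2 = FeedForward(d_model), FeedForward(d_model)
        self.norm_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, heads, dropout=dropout,
                                          batch_first=True)
        self.conv = ConvModule(d_model)
        self.norm_out = nn.LayerNorm(d_model)
    def forward(self, x):
        x = x + 0.5 * self.ff1(x)                  # first half-step FFN
        a = self.norm_attn(x)
        x = x + self.attn(a, a, a)[0]              # MHSA residual branch
        x = x + self.conv(x)                       # convolutional module
        return self.norm_out(x + 0.5 * self.ff2(x))  # second half-step FFN + norm
```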

4.2 Global Sequence Configuration

The overall architectural configuration is:

  • Input: linear 252→256 projection, followed by 4 stacked Conformer blocks
  • Embedding: $d_{\mathrm{model}} = 256$
  • Attention: multi-head self-attention with per-head dimension $d_k = d_{\mathrm{model}}/h$ for $h$ heads
  • FFN inner dim: 1024
  • Convolutional kernel: 31, expansion factor 2
  • Activation: Swish, with Softmax output
  • Dropout: 0.1 after each sublayer
  • Residual pre-normalization throughout
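
Assembled, the encoder is a 252→256 projection feeding the block stack; a short sketch reusing the ConformerBlock (and imports) from the sketch above:

```python
class ChordFormerEncoder(nn.Module):
    def __init__(self, n_bins=252, d_model=256, n_blocks=4):
        super().__init__()
        self.proj = nn.Linear(n_bins, d_model)    # 252 -> 256 input projection
        self.blocks = nn.ModuleList(
            ConformerBlock(d_model) for _ in range(n_blocks))
    def forward(self, x):                         # x: (batch, time, 252)
        x = self.proj(x)
        for block in self.blocks:
            x = block(x)
        return x                                  # (batch, time, 256)
```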

4.3 Output Projection and CRF Decoding

The final network state (shape $T \times 256$ for $T$ frames) is linearly mapped into six output vectors $\hat{y}_k^{(t)}$ for $k = 1, \dots, 6$. Softmax normalization yields per-component probabilities,

$P\!\left(z_k^{(t)} = c \mid X\right) = \mathrm{softmax}\!\left(W_k\, h^{(t)} + b_k\right)_c$

Decoding is performed not by simple per-frame argmax but via a linear-chain conditional random field (CRF) imposing temporal smoothness. The probability of a chord label sequence $Z = \left(z^{(1)}, \dots, z^{(T)}\right)$ given input $X$ is modeled as:

$P(Z \mid X) = \dfrac{1}{\mathcal{Z}(X)} \exp\!\left( \sum_{t=1}^{T} \phi\!\left(z^{(t)}, X\right) + \sum_{t=2}^{T} \psi\!\left(z^{(t-1)}, z^{(t)}\right) \right)$

with emission potential

$\phi\!\left(z^{(t)}, X\right) = \log P\!\left(z^{(t)} \mid X\right)$

and transition potential

$\psi\!\left(z^{(t-1)}, z^{(t)}\right) = \lambda\, \mathbb{1}\!\left[z^{(t-1)} = z^{(t)}\right]$

where $\mathbb{1}[\cdot]$ is the indicator function and $\lambda$ controls transition penalties, rewarding labels that persist across frames.
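
With these potentials, decoding each chord component reduces to Viterbi search; below is a NumPy sketch under the self-transition form above (the value of $\lambda$ is a placeholder):

```python
import numpy as np

def viterbi_smooth(log_probs, lam=2.0):
    """Decode one component. log_probs: (T, C) per-frame log-probabilities;
    lam rewards keeping the same label across consecutive frames."""
    T, C = log_probs.shape
    trans = lam * np.eye(C)              # psi: +lam on self-transitions, else 0
    score = log_probs[0].copy()
    back = np.zeros((T, C), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + trans    # cand[i, j]: come from i, move to j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_probs[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):        # backtrace from the last frame
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```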

5. Class Imbalance Mitigation

ChordFormer introduces a weighted cross-entropy objective over all frames $t$ and chord components $k$:

$\mathcal{L} = -\sum_{t=1}^{T} \sum_{k=1}^{6} w_{k,\, z_k^{(t)}} \log P\!\left(z_k^{(t)} \mid X\right)$

Weights $w_{k,c}$ are computed as:

$w_{k,c} = \min\!\left( \left( \dfrac{N_k}{n_{k,c}} \right)^{\gamma},\; w_{\max} \right)$

where $n_{k,c}$ is the count of training samples for label $c$ in component $k$, $N_k$ is the total count for component $k$, $\gamma$ controls the balancing tradeoff, and $w_{\max}$ caps the largest class weight. Empirical tuning of $\gamma$ and $w_{\max}$ amplifies gradient signals for rare chords, improving class-level accuracy while controlling overemphasis (Akram et al., 17 Feb 2025).
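
A short sketch of this weighting and the resulting loss (the default $\gamma$ and $w_{\max}$ values here are placeholders, not the paper's tuned settings):

```python
import numpy as np
import torch
import torch.nn.functional as F

def class_weights(counts, gamma=0.5, w_max=10.0):
    """counts: training-sample count per label of one chord component."""
    counts = np.asarray(counts, dtype=float)
    w = (counts.sum() / counts) ** gamma     # rarer labels get larger weights
    return np.minimum(w, w_max)              # cap the largest class weight

def chord_loss(logits_per_head, targets_per_head, weights_per_head):
    """Sum of weighted cross-entropies over the six component heads."""
    return sum(
        F.cross_entropy(logits, target,
                        weight=torch.as_tensor(w, dtype=torch.float32))
        for logits, target, w in zip(logits_per_head, targets_per_head,
                                     weights_per_head))
```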

6. Training Protocol and Optimization

ChordFormer is optimized using AdamW, with a plateau learning-rate scheduler (decay after 5 non-improving epochs) and early stopping once the learning rate falls below a minimum threshold. During training, each epoch randomly extracts a 1000-frame segment (≈23.2 s) from each song, with batch size 24 (24,000 frames per batch). Regularization includes dropout (rate 0.1), pre-norm residuals, and batch normalization within convolutional modules. Augmentation is performed as described in Section 2, with pitch shifts of –5 to +6 semitones (Akram et al., 17 Feb 2025).
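
In PyTorch terms the protocol looks roughly as follows (the initial learning rate, decay factor, and stopping threshold are placeholders, since their exact values are not restated here; ChordFormerEncoder is the sketch from Section 4.2):

```python
import numpy as np
import torch

model = ChordFormerEncoder()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)   # lr: placeholder
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
    opt, factor=0.5, patience=5)    # decay after 5 non-improving epochs
MIN_LR = 1e-6                       # placeholder early-stop threshold

def random_segment(features, labels, length=1000):
    """Sample one 1000-frame (~23.2 s) training segment from a song."""
    start = np.random.randint(0, max(1, len(features) - length))
    return features[start:start + length], labels[start:start + length]

# per epoch: sched.step(val_loss); stop once opt.param_groups[0]["lr"] < MIN_LR
```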

7. Empirical Performance and Module Impact

On the Humphrey–Bello 1,217-song corpus (5-fold cross-validation, 60/20/20 split), ChordFormer attained:

  • Frame-wise accuracy: 78.77% (vs. CNN+BLSTM 76.76%, +2.01 pp)
  • Class-wise accuracy: 38.84% (vs. CNN+BLSTM 33.15%, +5.69 pp)
  • MIREX score: 83.62% (vs. CNN+BLSTM 81.52%)
  • Breakdown: Root 84.69%, Maj/Min 84.09%, Triads 77.55%, Sevenths 72.28%

Ablation studies revealed:

  • Transformer-only: improved global modeling, weaker local spectral detail, triad accuracy ≈67.8%
  • CNN-only: robust to local patterns, lacking long-range context, seventh/extension recall ≈67.3%
  • CNN+BLSTM: incremental improvements over either backbone alone, but behind the Conformer hybrid
  • ChordFormer-R (with reweighted loss): optimally addresses rare-class prediction, with class-wise accuracy peaking at 44.71% for specific weight settings; MIREX improves an additional 0.8% relative to baseline

Increased reweighting (larger $\gamma$, higher $w_{\max}$) improves recall for rare classes (e.g., diminished, augmented, extended chords) with only a modest trade-off in overall frame accuracy. Confusion matrices demonstrate that hybrid modeling reduces misclassification among chord extensions.
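
For reference, the two headline metrics weight errors differently: frame-wise accuracy averages over all frames (dominated by common chords), while class-wise accuracy averages per-class accuracies so rare chords count equally. A minimal sketch of the standard definitions (not code from the paper):

```python
import numpy as np

def frame_wise_accuracy(pred, true):
    return float(np.mean(pred == true))

def class_wise_accuracy(pred, true):
    classes = np.unique(true)
    return float(np.mean([np.mean(pred[true == c] == c) for c in classes]))
```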

Summary Table: ChordFormer Distinctives

| Component | Feature/Role | Empirical Impact |
| --- | --- | --- |
| Constant-Q spectrogram | Input representation | High spectral resolution |
| Structured 6-part chord output | Semantic decomposition | Improved interpretability |
| 4-layer Conformer stack | Hybrid local/global context | SOTA accuracy, balanced recall |
| CRF decoder | Temporal coherence | Smoothed predictions |
| Reweighted loss | Class imbalance mitigation | Raised rare-class recall |

Contextually, ChordFormer advances the field of large-vocabulary chord recognition by successfully combining conformer-based sequence modeling, structured chord interpretation, adaptive loss weighting, and temporal CRF smoothing, achieving leading results on benchmark datasets and robust performance across all chord types (Akram et al., 17 Feb 2025).

References

  1. Akram et al., “ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition,” arXiv preprint, 17 Feb 2025.
