
Brain2Music: Neural Music Decoding

Updated 26 November 2025
  • Brain2Music is a research domain that decodes and reconstructs music from neural signals using EEG and fMRI, combining advanced signal processing and deep learning.
  • It employs techniques such as artifact removal, frequency transformation, and cross-frequency coupling alongside CNNs and regression models to achieve high-accuracy song identification and reconstruction.
  • The framework supports applications in music recommendation, creative sonification, and neurofeedback by linking neural biomarkers with personalized auditory experiences.

Brain2Music encompasses a set of methodologies, architectures, and scientific insights dedicated to reconstructing, generating, or decoding musical information from human brain activity. Spanning non-invasive modalities (EEG and fMRI), Brain2Music research targets various applications: identification of musical stimuli, reconstruction of heard or imagined music, personalized recommendations, and generative sonification. This article synthesizes core methodologies, representative models, and empirical outcomes from contemporary Brain2Music research.

1. Neural Modalities and Datasets

Brain2Music leverages both EEG and fMRI to sample neural correlates of auditory experience:

  • EEG paradigms: High-density (64–128 channel) and consumer-grade (1–14 channel) EEG systems are used to record neural dynamics during music listening. Benchmark datasets include NMED-Tempo (NMED-T: 20 subjects, 10 Western songs) and NMED-Hindi (NMED-H: 12 subjects, 4 Hindi pop songs) (Ramirez-Aristizabal et al., 2022), as well as custom EEG corpora with up to 128-channel geodesic nets and real-world stimuli (Postolache et al., 15 May 2024). Sampling rates range from 125 Hz (consumer-grade) up to 1,000 Hz (research-grade).
  • fMRI paradigms: High-resolution BOLD imaging (2×2×2 mm³ voxels, TR = 1.5 s) with stimulus libraries (e.g., GTZAN, 540 excerpts spanning 10 genres) supports large-scale decoding and region-of-interest localization (Denk et al., 2023, Ferrante et al., 21 Jun 2024, Liu et al., 29 May 2024).

Experimental designs typically involve passive listening to temporally aligned music, with epochs ranging from seconds to minutes and strict time-locking between neural and stimulus events.
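
To make the epoching and time-locking step concrete, the sketch below slices a continuous EEG recording into stimulus-locked epochs; the sampling rate, onset times, and array shapes are illustrative placeholders rather than parameters of any cited dataset.

```python
import numpy as np

def epoch_eeg(eeg, onsets_s, fs, epoch_len_s):
    """Slice continuous EEG (channels x samples) into stimulus-locked epochs."""
    n_samples = int(epoch_len_s * fs)
    epochs = []
    for onset in onsets_s:
        start = int(onset * fs)
        epochs.append(eeg[:, start:start + n_samples])
    return np.stack(epochs)                        # (n_epochs, channels, samples)

# Hypothetical example: 64-channel recording at 125 Hz, three song onsets.
fs = 125
eeg = np.random.randn(64, 10 * 60 * fs)            # 10 minutes of data
onsets = [5.0, 125.0, 245.0]                        # stimulus onsets in seconds
epochs = epoch_eeg(eeg, onsets, fs, epoch_len_s=30.0)
print(epochs.shape)                                 # (3, 64, 3750)
```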

2. Signal Processing and Neural Feature Extraction

EEG preprocessing and representation:

  • Artifact removal and band-pass filtering precede segmentation of recordings into stimulus-locked epochs (Sonawane et al., 2020, Ramirez-Aristizabal et al., 2022).
  • Frequency-domain representations (band power, spectrograms arranged as frequency × channel maps) together with cross-frequency coupling, hemispheric asymmetry, and band-energy features serve as inputs to classifiers, regressors, and preference models (Sonawane et al., 2020, Kalaganis et al., 2017, Adamos et al., 2016).

fMRI processing:

  • Spatial normalization: Anatomical alignment (MNI), functional normalization (hyperalignment), and region-of-interest (ROI) selection (e.g., auditory cortex, STG, Heschl’s gyrus) precede voxel-wise or ROI-average analysis (Denk et al., 2023, Liu et al., 29 May 2024, Ferrante et al., 21 Jun 2024).
  • Temporal alignment: BOLD response is HRF-shifted (typically by 3 TRs = 4.5 s) and epoch-averaged to correspond with musical stimuli (Denk et al., 2023).
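
As a minimal illustration of the HRF shift and epoch averaging described above, the following sketch shifts a BOLD time series by three TRs and averages the volumes inside the stimulus window; the shapes and indices are hypothetical.

```python
import numpy as np

def hrf_shift_and_average(bold, stim_onset_tr, stim_len_tr, shift_tr=3):
    """Align BOLD (time x voxels) to a stimulus by shifting 3 TRs (~4.5 s at TR = 1.5 s)
    and averaging the volumes that fall inside the shifted stimulus window."""
    start = stim_onset_tr + shift_tr
    stop = start + stim_len_tr
    return bold[start:stop].mean(axis=0)            # (voxels,)

# Hypothetical example: 300 volumes x 5000 voxels, 10-TR stimulus starting at TR 40.
bold = np.random.randn(300, 5000)
response = hrf_shift_and_average(bold, stim_onset_tr=40, stim_len_tr=10)
print(response.shape)                               # (5000,)
```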

3. Decoding and Reconstruction Architectures

Classification and regression with deep neural networks:

  • EEG-based CNNs: 2D (frequency × channel) or 3D (stacks of consecutive windows) convolutional neural networks achieve high-accuracy within-participant identification of songs from EEG (84.96 % test accuracy) (Sonawane et al., 2020).
  • Spectrogram regression: CNN regressors directly learn EEG-to-audio spectrogram mappings, outperforming feature-based pipelines and achieving 80.8 % accuracy for mel-spectrogram song classification (10-way chance: 10 %) (Ramirez-Aristizabal et al., 2022).
  • Extreme Learning Machines (ELMs): Single-layer architectures map compact EEG feature vectors (CFC, asymmetry, band energy) to continuous preference scores (nRMSE ≈0.06–0.12) for recommendation (Kalaganis et al., 2017).
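
To make the 2D (frequency × channel) CNN approach above concrete, here is a minimal PyTorch sketch of a song-identification classifier. It is not the architecture of Sonawane et al. (2020); the layer widths, input dimensions, and 10-way output are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EEGSongCNN(nn.Module):
    """Small 2D CNN over (frequency x channel) EEG maps for song identification."""
    def __init__(self, n_freq_bins=64, n_channels=64, n_songs=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * (n_freq_bins // 4) * (n_channels // 4), n_songs)

    def forward(self, x):                           # x: (batch, 1, freq, channel)
        h = self.features(x)
        return self.classifier(h.flatten(1))        # logits over songs

# Hypothetical batch of 8 frequency-by-channel EEG maps.
logits = EEGSongCNN()(torch.randn(8, 1, 64, 64))
print(logits.shape)                                 # torch.Size([8, 10])
```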

Latent variable and generative models:

  • Embedding regression: Linear ridge regression is widely used to map fMRI volumes to pre-trained music and text embeddings (MuLan, CLAP) with mean identification accuracy ≈0.876 (chance: 0.017) (Denk et al., 2023, Ferrante et al., 21 Jun 2024).
  • Retrieval vs. generation: Predicted embeddings enable (i) retrieval-based approaches (nearest-neighbor in large music embedding databases), or (ii) generative models (e.g., MusicLM, AudioLDM2), producing novel audio from decoded latent representations (Denk et al., 2023, Postolache et al., 15 May 2024).
  • Coarse-to-fine pipelines: fMRI is first decoded into a semantic embedding (CLAP, 512-D), which then conditions finer-grained acoustic latent representations (AudioMAE) used for mel-spectrogram reconstruction via Latent Diffusion Models (LDMs), yielding lower Fréchet Distance (FD), Fréchet Audio Distance (FAD), and KL divergence than direct or single-stage baselines (Liu et al., 29 May 2024).
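
The embedding-regression and retrieval steps above can be sketched with ridge regression followed by cosine-similarity nearest-neighbor identification. The voxel counts, embedding dimensionality, and random data below are placeholders, not the MuLan/CLAP setup of the cited papers.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical shapes: 480 train / 60 test fMRI volumes, 5000 voxels, 128-D embeddings.
X_train, X_test = np.random.randn(480, 5000), np.random.randn(60, 5000)
Y_train, Y_test = np.random.randn(480, 128), np.random.randn(60, 128)

# Ridge regression from voxel patterns to the pre-trained music-embedding space.
reg = Ridge(alpha=1e3).fit(X_train, Y_train)
Y_pred = reg.predict(X_test)

# Retrieval-style identification: each predicted embedding should be most similar
# to its own ground-truth clip among all test clips.
sims = cosine_similarity(Y_pred, Y_test)
id_acc = (sims.argmax(axis=1) == np.arange(len(Y_test))).mean()
print(f"identification accuracy: {id_acc:.3f}")
```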

Sonification frameworks:

  • MIDI mapping engines: Rule-based EEG-to-MIDI systems convert band-specific EEG power, attention/meditation metrics, and cross-frequency coupling into real-time generative music, with deterministic mappings and limited training (Rincon, 2021).
  • EEG-to-audio via carrier amplitude modulation: Each channel modulates a unique pure tone, rendering neural activity as a polyphonic texture for creative or neurofeedback use (Nag et al., 2017).
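
A minimal sketch of carrier amplitude modulation sonification follows; mapping each channel's envelope onto a harmonic series of carriers is an illustrative choice, not necessarily the exact mapping of Nag et al. (2017).

```python
import numpy as np

def sonify_eeg(eeg, fs_eeg, fs_audio=22050, base_freq=220.0):
    """Amplitude-modulate one pure tone per EEG channel and mix into a polyphonic texture.
    eeg: (channels, samples); each channel's normalized signal drives its carrier."""
    n_ch, n_samp = eeg.shape
    dur = n_samp / fs_eeg
    t = np.linspace(0, dur, int(dur * fs_audio), endpoint=False)
    mix = np.zeros_like(t)
    for ch in range(n_ch):
        # Resample the channel to audio rate and normalize it to [0, 1] as an envelope.
        env = np.interp(t, np.linspace(0, dur, n_samp), eeg[ch])
        env = (env - env.min()) / (np.ptp(env) + 1e-9)
        carrier = np.sin(2 * np.pi * base_freq * (ch + 1) * t)   # harmonic series of carriers
        mix += env * carrier
    return mix / n_ch

# Hypothetical 4-channel, 10-second EEG segment at 125 Hz.
audio = sonify_eeg(np.random.randn(4, 1250), fs_eeg=125)
```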

4. Empirical Performance and Evaluation Metrics

Identification/classification metrics:

  • Song ID from EEG: Frequency-domain CNNs yield 84.96 % test accuracy; time-domain CNNs perform at chance (Sonawane et al., 2020).
  • Spectrogram classification: Reconstructed mel-spectrograms from EEG yield song-name classifier accuracy of 80.8 % (NMED-T), compared to 10 % chance (Ramirez-Aristizabal et al., 2022). Behavioral audio discrimination reaches 85 % in two-alternative forced-choice listening.
  • fMRI-to-embedding retrieval: Brain2Music achieves ≈0.876 identification accuracy in MuLan space (test set, n=60), significantly outperforming baseline embedding models (Denk et al., 2023, Ferrante et al., 21 Jun 2024).

Music reconstruction metrics:

  • Similarity and fidelity: Common metrics are Pearson correlation (PCC) and SSIM on mel-spectrograms, together with Fréchet Distance (FD), Fréchet Audio Distance (FAD), and KL divergence computed on learned audio-feature distributions of reconstructed versus ground-truth music (Liu et al., 29 May 2024, Postolache et al., 15 May 2024).
  • Qualitative agreement: Top-2 agreement for genre, instrument, and mood classifiers run over reconstructed vs. ground-truth audio ranges from 45 % to 68 %, well above chance (Denk et al., 2023).
  • EEG2Mel: SSIM ≈ 0.71 for mel versus ≈ 0.62 for linear spectrograms; mean PSNR ≈10.7 dB for mel (Ramirez-Aristizabal et al., 2022).
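
The spectrogram-level similarity metrics above can be computed as in the sketch below; the mel-spectrogram shapes and data are hypothetical, and in practice PCC and SSIM are evaluated per stimulus and then averaged.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

# Hypothetical reconstructed vs. ground-truth mel spectrograms (mel bins x frames).
gt = np.random.rand(128, 400)
rec = gt + 0.1 * np.random.randn(128, 400)

pcc = np.corrcoef(gt.ravel(), rec.ravel())[0, 1]           # Pearson correlation
ssim_val = ssim(gt, rec, data_range=gt.max() - gt.min())   # structural similarity
print(f"PCC = {pcc:.3f}  SSIM = {ssim_val:.3f}")
```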

5. Applications and Personalization

Recommendation and preference decoding:

  • Single-sensor EEG biomarkers: High-β/low-γ PAC on left prefrontal (AF3) predicts aesthetic music preference, enabling implicit real-time scoring for playlist annotation or dynamic content-based filtering (mean Spearman ρ≈0.82) (Adamos et al., 2016, Kalaganis et al., 2017).
  • Real-time integration: Biomarkers can be computed on-device and streamed to adapt user-item matrices in standard recommendation frameworks (Adamos et al., 2016).
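
As an illustration of a single-sensor PAC biomarker, the sketch below computes a mean-vector-length phase–amplitude coupling estimate between a high-β phase band and a low-γ amplitude band; the band edges, sampling rate, and estimator choice are assumptions, not the exact pipeline of Adamos et al. (2016).

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def pac_mvl(x, fs, phase_band=(20, 30), amp_band=(30, 45)):
    """Mean-vector-length PAC between a high-beta phase band and a low-gamma
    amplitude band on a single EEG channel."""
    def bandpass(sig, lo, hi):
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        return filtfilt(b, a, sig)

    phase = np.angle(hilbert(bandpass(x, *phase_band)))
    amp = np.abs(hilbert(bandpass(x, *amp_band)))
    return np.abs(np.mean(amp * np.exp(1j * phase)))        # higher = stronger coupling

# Hypothetical 30-second single-channel (e.g., AF3) recording at 128 Hz.
fs = 128
pac = pac_mvl(np.random.randn(30 * fs), fs)
print(f"PAC strength: {pac:.3f}")
```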

Generative/neurofeedback art:

  • Sonification and generative engines: EEG-to-MIDI/DAW or direct audio mapping supports real-time neurofeedback, creative composition, and multimodal installations (visual, auditory, haptic) (Rincon, 2021, Nag et al., 2017).
  • Therapeutic potential: Real-time sonified feedback and decoded musical tags may facilitate affect regulation and music therapy protocols (Nag et al., 2017, Adamos et al., 2016).

6. Limitations, Variability, and Future Directions

  • Subject-individuality and generalization: Cross-participant generalization in EEG is severely limited; signature EEG responses to music are highly idiosyncratic, with cross-subject CNN classification near chance (Sonawane et al., 2020, Ramirez-Aristizabal et al., 2022). In fMRI, anatomical/functional alignment and high-dimensional embedding regression enable robust cross-subject decoding (Ferrante et al., 21 Jun 2024).
  • Data limitations: Training on small numbers of participants and limited song libraries constrains the generalizability of both EEG and fMRI models (Postolache et al., 15 May 2024). Large, diverse, and high-fidelity datasets are needed.
  • Temporal/frequency resolution tradeoffs: EEG encodes rhythmic and timbral structure rapidly but lacks spatial resolution; fMRI is spatially detailed but poorly resolves temporal dynamics, limiting recovery of rapid musical transients (Denk et al., 2023, Postolache et al., 15 May 2024).
  • Prompting for semantic priors: Coarse-to-fine decoding pipelines can integrate user/text prompts when semantic decoding is unreliable, although for music where fMRI-to-embedding is already accurate, this may slightly degrade per-clip fidelity (Liu et al., 29 May 2024).
  • Model advances: Future directions include hybrid EEG/fMRI pipelines, generative waveform models (DiffWave, WaveGrad), real-time samplers, and adaptive layers to mitigate inter-subject variability (Postolache et al., 15 May 2024, Liu et al., 29 May 2024, Ferrante et al., 21 Jun 2024). Improved artifact removal, online unsupervised calibration, and closed-loop interactive applications are under exploration.

Key References By Methodological Focus

| Modality | Task | Representative Papers | Metric/Result |
|---|---|---|---|
| EEG | Song ID, preference | Sonawane et al., 2020; Kalaganis et al., 2017; Adamos et al., 2016 | 84.96 % accuracy / nRMSE ≈ 0.06 |
| fMRI | Retrieval/generation | Denk et al., 2023; Ferrante et al., 21 Jun 2024 | id. acc. ≈ 0.876 |
| EEG | Reconstruction | Ramirez-Aristizabal et al., 2022; Postolache et al., 15 May 2024 | SSIM ≈ 0.71 / CLAP ≈ 0.60 |
| fMRI | Coarse-to-fine | Liu et al., 29 May 2024 | FD = 6.1 (C2F-LDM), state of the art |
| EEG | Sonification | Nag et al., 2017 | Δγₓ up to –0.5, strong coupling |
| EEG | Generative art | Rincon, 2021 | Rule-based mapping, qualitative |

7. Conceptual and Scientific Impact

Brain2Music research demonstrates:

  • Neural data are sufficiently information-rich to permit reliable stimulus identification, high-level semantic decoding (genre, mood, instrumentation), and plausible audio/spectrogram reconstruction.
  • Neural representations of music are distributed across auditory cortex, with considerable overlap between text-derived, music-derived, and acoustic-semantic embedding spaces (Denk et al., 2023).
  • The transition from classification to reconstruction requires leveraging pre-trained cross-modal (audio–text) models, ridge-regression alignment, and fine-tuned or semi-supervised generative decoders.
  • Personalized and affect-aware music systems leveraging neural signals are technically feasible with low-latency consumer-grade hardware, albeit with substantial limitations in population transfer and stimulus fidelity.

These findings position Brain2Music at the interface of neural decoding, generative models, and music information retrieval, providing both theoretical insight and technological frameworks for future exploration in neural-audio synthesis and brain–machine communication.
