
Synthetic Music Forensics

Updated 6 December 2025
  • Synthetic music forensics is a field dedicated to detecting and attributing AI-generated music through audio signal processing, machine learning, and digital forensics.
  • It integrates both audio and symbolic modalities with diverse benchmark datasets and advanced deep learning models to identify synthetic fingerprints in musical pieces.
  • Researchers confront challenges such as robustness to audio transformations and inter-generator attribution, spurring efforts towards augmentation-aware and continual learning solutions.

Synthetic music forensics is a research field focused on the detection, attribution, and analysis of music created or manipulated by AI systems. The discipline addresses the challenges introduced by modern generative models capable of producing high-fidelity, stylistically convincing audio, symbolic music, and even entire songs with complex structure, raising concerns in copyright enforcement, artist attribution, fraud prevention, and artistic integrity on digital platforms. The field draws on methodologies from audio signal processing, machine learning, computational musicology, and digital forensics, aiming to distinguish authentic musical works from those produced, altered, or replicated by generative algorithms.

1. Foundations and Problem Formulation

Synthetic music forensics operates on a dual axis: authenticity detection (real vs. AI-generated) and attribution (identifying the tools, models, or specific transformation pipelines responsible).

The core task is to discriminate between samples from the distribution of authentic music, $P_{\rm real}$, and those from $P_{\rm synth}$, where the latter may include end-to-end generated audio, neural reconstructions, or synthetic symbolic music. Robust forensics must further assign samples to the appropriate source (closed-set attribution) or flag them as originating from unknown sources (open-set detection) (Comanducci et al., 16 Sep 2024).
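The decision protocol can be made concrete with a small sketch: a trained multi-class classifier assigns a sample to one of the known generators (closed-set attribution) and rejects low-confidence samples as unknown (open-set detection). The generator names, rejection threshold, and maximum-softmax rule below are illustrative assumptions, not the procedure of the cited work.

```python
import torch
import torch.nn.functional as F

# Hypothetical closed set of known generators; anything below the
# confidence threshold is treated as an unseen (open-set) source.
KNOWN_GENERATORS = ["musicgen", "audioldm", "stable_audio", "mustango", "musicldm"]

def attribute(logits: torch.Tensor, threshold: float = 0.8) -> str:
    """Closed-set attribution with a simple open-set rejection rule.

    logits: raw scores of shape (len(KNOWN_GENERATORS),) from a trained classifier.
    Returns a generator name, or "unknown" when the maximum softmax
    probability falls below `threshold`.
    """
    probs = F.softmax(logits, dim=-1)
    conf, idx = probs.max(dim=-1)
    if conf.item() < threshold:
        return "unknown"                      # open-set: flag as unseen source
    return KNOWN_GENERATORS[idx.item()]

print(attribute(torch.tensor([5.0, 0.1, 0.2, 0.1, 0.3])))   # confident -> "musicgen"
print(attribute(torch.tensor([0.9, 1.0, 1.1, 0.8, 1.0])))   # diffuse   -> "unknown"
```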

2. Benchmark Corpora and Dataset Engineering

Rigorous evaluation in synthetic music forensics requires large, diverse, and well-designed datasets. Several public benchmarks exemplify the data-centric approach:

| Dataset | Scale | Modalities | Generators |
|---|---|---|---|
| FMA (used in (Afchar et al., 7 May 2024)) | 25k real, 225k fake | Audio (44.1 kHz MP3/WAV) | Encodec, DAC, GriffinMel, Musika |
| SONICS (Rahman et al., 26 Aug 2024) | 97k songs, 4.8k hrs | Audio, text-lyrics | Suno, Udio |
| MoM (Batra et al., 29 Nov 2025) | 130k tracks, 6.6k hrs | Audio | Suno (v1–v4), Udio, Riffusion, Diffrythm, Yue |
| FakeMusicCaps (Comanducci et al., 16 Sep 2024) | 27.6k synthetic | Audio (16 kHz, mono) | MusicGen, AudioLDM, Stable Audio, Mustango, MusicLDM |
| JS Fake Chorales (Peracha, 2021) | 500 synthetic MIDI | Symbolic (MIDI) | KS_Chorus deep-RNN |
| POP1K7, POP909 (Ji et al., 17 Sep 2025) | 2.6k piano covers | Symbolic (MIDI) | Various |

Dataset design principles include balanced real/synthetic splits, coverage of diverse genres and styles, inclusion of both short (few seconds) and long (full song) excerpts, and explicit metadata (genre, mood, topic, lyrics).

A plausible implication is that forensic benchmarks must evolve rapidly to remain relevant, as new generators and methods emerge. The construction of OOD (out-of-distribution) splits, as in MoM (Batra et al., 29 Nov 2025), is critical for realistically modeling the "generator shift" encountered in real-world deployments.
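As an illustration of such a split, the sketch below holds out all tracks from selected generators for an OOD test set while the remaining material forms the in-distribution train/validation pools. The track-dictionary schema, split fraction, and generator names are assumptions for exposition, not the MoM construction.

```python
import random

def generator_holdout_split(tracks, holdout_generators, val_frac=0.1, seed=0):
    """Split a forensic corpus into in-distribution train/val sets and an
    out-of-distribution test set whose fakes come only from held-out generators,
    simulating the generator shift encountered at deployment time.

    tracks: list of dicts, e.g. {"path": "...", "label": "real" | "fake",
            "generator": "suno_v3" or None for real tracks}.
    """
    ood_test, in_dist = [], []
    for t in tracks:
        if t["label"] == "fake" and t["generator"] in holdout_generators:
            ood_test.append(t)        # never seen during training
        else:
            in_dist.append(t)
    random.Random(seed).shuffle(in_dist)
    cut = int((1 - val_frac) * len(in_dist))
    return in_dist[:cut], in_dist[cut:], ood_test   # train, val, OOD test
```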

3. Detection Models and Feature Engineering

Supervised Music Deepfake Detectors. Early approaches use compact convolutional neural networks (CNNs) trained on amplitude spectrograms of short audio segments. The core transformation is the (complex) Short-Time Fourier Transform (STFT):

$$S(t, f) = \left| \sum_n x[n] \cdot w[n-t] \cdot e^{-j 2\pi f n} \right|^2$$

Empirically, decibel-scaled amplitude spectrograms produce the best detection performance, achieving up to 99.8% accuracy in closed-set scenarios (Afchar et al., 7 May 2024, Afchar et al., 17 Jan 2025).
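A minimal sketch of this front end, using librosa to compute the decibel-scaled power spectrogram that such a CNN would consume; the sampling rate, FFT size, and excerpt length are illustrative choices, not the exact settings of the cited detectors.

```python
import numpy as np
import librosa

def db_amplitude_spectrogram(path, n_fft=2048, hop_length=512, duration=4.0):
    """Decibel-scaled power spectrogram of a short excerpt, matching the
    |STFT|^2 formulation above; used as the CNN input feature."""
    y, sr = librosa.load(path, sr=44100, mono=True, duration=duration)
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, window="hann")
    power = np.abs(stft) ** 2                        # S(t, f) = |STFT|^2
    return librosa.power_to_db(power, ref=np.max)    # dB scaling

# spec = db_amplitude_spectrogram("excerpt.wav")  # shape: (1 + n_fft//2, n_frames)
```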

Advanced Architectures. To address long-range temporal dependencies and large input sizes, hybrid spectro-temporal transformer architectures (e.g., SpecTTTra (Rahman et al., 26 Aug 2024)) slice mel-spectrograms along time and frequency, tokenize the slices with 1D convolutions, and encode them with full-transformer attention blocks. The resulting reduction in token count and memory footprint enables efficient modeling of entire songs.
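The token-count saving can be illustrated with a toy tokenizer in the spirit of this design: 1D convolutions produce one token per time slice and one per frequency-band slice, instead of one token per time-frequency patch. All dimensions below are placeholders rather than the published SpecTTTra configuration.

```python
import torch
import torch.nn as nn

class SpectroTemporalTokenizer(nn.Module):
    """Toy spectro-temporal tokenizer: temporal tokens summarize all mel bins
    per time slice, spectral tokens summarize all frames per band of mel bins,
    each via a strided 1D convolution. Sizes are illustrative only."""
    def __init__(self, n_mels=128, n_frames=1024, t_patch=16, f_patch=8, dim=256):
        super().__init__()
        # temporal tokens: mel bins act as channels, stride over the time axis
        self.temporal = nn.Conv1d(n_mels, dim, kernel_size=t_patch, stride=t_patch)
        # spectral tokens: frames act as channels, stride over the mel axis
        self.spectral = nn.Conv1d(n_frames, dim, kernel_size=f_patch, stride=f_patch)

    def forward(self, mel):                                   # mel: (B, n_mels, n_frames)
        t_tok = self.temporal(mel).transpose(1, 2)            # (B, n_frames/t_patch, dim)
        f_tok = self.spectral(mel.transpose(1, 2)).transpose(1, 2)  # (B, n_mels/f_patch, dim)
        return torch.cat([t_tok, f_tok], dim=1)               # far fewer tokens than a 2D patch grid

tokens = SpectroTemporalTokenizer()(torch.randn(2, 128, 1024))
print(tokens.shape)   # torch.Size([2, 80, 256]): 1024/16 temporal + 128/8 spectral tokens
```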

Dual-stream encoders (e.g., CLAM (Batra et al., 29 Nov 2025)) employ parallel music-expert (MERT) and speech-expert (Wav2Vec2) pre-trained models, fused via cross-attention layers. This enables modeling of subtle inconsistencies between vocal and instrumental streams and leverages multi-scale representations.
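A minimal sketch of such cross-attention fusion, assuming vocal-stream and instrumental-stream token sequences have already been produced by upstream encoders (e.g., MERT and Wav2Vec2); the layer sizes and pooling are illustrative, not the CLAM implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Dual-stream fusion sketch: each stream queries the other via
    multi-head cross-attention, then pooled features feed a real/fake head."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.v2i = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.i2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 1)                     # real/fake logit

    def forward(self, vocal_tok, instr_tok):                  # (B, T, dim) each
        v, _ = self.v2i(vocal_tok, instr_tok, instr_tok)      # vocals attend to instruments
        i, _ = self.i2v(instr_tok, vocal_tok, vocal_tok)      # instruments attend to vocals
        pooled = torch.cat([v.mean(dim=1), i.mean(dim=1)], dim=-1)
        return self.head(pooled)   # inconsistencies between streams drive the score
```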

Feature engineering also includes cepstral representations (LFCC, MFCC, CQCC) for capturing vocoder fingerprints (Yan et al., 2022), 80-dimensional MFCC vectors for stereo-forgery detection (Liu et al., 2021), and symbolic representations (piano-roll, velocity, structural statistics) for plagiarism assessment (Ji et al., 17 Sep 2025).

Contrastive and Attributional Methods. Contrastive techniques such as dual-loss training (binary cross-entropy plus triplet loss) enforce global separation of real/fake and local alignment of vocal/instrument streams (Batra et al., 29 Nov 2025). Attribution tasks leverage multi-class classifiers or thresholding protocols to identify specific generator fingerprints, using log-magnitude spectrograms or raw waveforms (Comanducci et al., 16 Sep 2024).
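A compact sketch of such a dual objective in PyTorch; the anchor/positive/negative embeddings and the weighting factor are illustrative placeholders rather than the published training recipe.

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()            # global real/fake separation
triplet = nn.TripletMarginLoss(margin=1.0)   # local stream alignment

def dual_loss(logits, labels, anchor_emb, positive_emb, negative_emb, alpha=0.5):
    """Binary cross-entropy on the detector logits plus a triplet term that
    pulls matching vocal/instrument embeddings together and pushes mismatched
    pairs apart. `alpha` is an illustrative weighting hyperparameter."""
    return bce(logits, labels.float()) + alpha * triplet(anchor_emb, positive_emb, negative_emb)
```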

4. Performance, Robustness, and Generalization

In controlled, closed-set settings, amplitude-spectrogram models and transformer-based detectors exhibit high accuracy, e.g., 99.8% in (Afchar et al., 7 May 2024), F1=0.97 on SONICS (Rahman et al., 26 Aug 2024), and F1=1.00 on closed-set attribution in FakeMusicCaps (Comanducci et al., 16 Sep 2024). State-of-the-art dual-stream architectures (CLAM) achieve F1=0.925 on the OOD-focused MoM benchmark—a nearly 6 percentage-point improvement over prior SOTA (Batra et al., 29 Nov 2025).

However, robustness to common audio transformations remains an acute challenge. Pitch shift ($\pm 2$ semitones), low-bitrate encoding (MP3/AAC/Opus at 64 kbps), or additive noise can collapse detector performance to chance, with accuracy often below 20% on pitch-shifted or codec-altered fakes (Afchar et al., 17 Jan 2025, Sroka et al., 7 Jul 2025). Some manipulations (time-stretch, EQ, reverb) are better tolerated, but overall, base detectors show poor out-of-distribution robustness.
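Robustness audits of this kind can be reproduced with a simple transformation suite; the sketch below applies pitch shift, time stretch, and additive noise with librosa and NumPy (codec re-encoding additionally requires an external encoder such as ffmpeg), with parameter values chosen for illustration only.

```python
import numpy as np
import librosa

def perturb(y, sr):
    """Return transformed copies of a waveform for stress-testing a detector."""
    return {
        "pitch_up_2":   librosa.effects.pitch_shift(y, sr=sr, n_steps=2),
        "pitch_down_2": librosa.effects.pitch_shift(y, sr=sr, n_steps=-2),
        "stretch_1.1":  librosa.effects.time_stretch(y, rate=1.1),
        "noise_snr30":  y + 10 ** (-30 / 20) * np.std(y) * np.random.randn(len(y)),
    }

# accuracy = {name: detector(x) for name, x in perturb(y, sr).items()}  # hypothetical detector
```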

Generalization to unseen generators exposes a limit: intra-family transfer rates are high (above 95%), but inter-family transfer rates on synthetic detection collapse to random guessing (Afchar et al., 7 May 2024, Afchar et al., 17 Jan 2025). In OOD-constructed benchmarks (MoM), only dual-stream and cross-modal detectors maintain operational performance (Batra et al., 29 Nov 2025).

Failure modes frequently include over-reliance on high-frequency artifacts, misclassifying silence as synthetic, and high sensitivity to pitch/timbre perturbations.

5. Attribution, Interpretation, and Recourse

Synthetic music forensics extends detection to attribution: determining which vocoder, generator, or processing pipeline produced a given excerpt. Methods include:

  • ResNet-18 classifiers on LFCC/MFCC/CQCC for vocoder attribution, achieving F1 scores above 99.99% on closed sets of eight vocoders (Yan et al., 2022).
  • Multi-class spectrogram classifiers (e.g., ResNet18+Spec) for model attribution (e.g., distinguishing five TTM generators + real, perfect separation in closed-set (Comanducci et al., 16 Sep 2024)).
  • Patch-level saliency and Grad-CAM adapted to spectrograms for localizing the source of synthetic artifacts in time-frequency (Afchar et al., 7 May 2024, Rahman et al., 26 Aug 2024); a minimal sketch follows this list.
  • Token-level attention visualizations and per-frame likelihoods in forensic reporting pipelines.
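The spectrogram Grad-CAM adaptation mentioned above can be sketched generically: pooled gradients of the synthetic-class logit weight the activation maps of a chosen convolutional layer, yielding a time-frequency saliency map. The detector, layer choice, and class index below are assumptions, not a specific published implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam_spectrogram(model, conv_layer, spec, class_idx=1):
    """Minimal Grad-CAM for a spectrogram CNN detector.

    model: CNN returning (1, num_classes) logits; conv_layer: a conv module
    inside it; spec: input of shape (1, 1, n_freq, n_frames); class_idx: the
    "synthetic" class. Returns a (n_freq, n_frames) saliency map in [0, 1].
    """
    store = {}

    def save_act(module, inputs, output):
        store["act"] = output                       # (1, C, h, w) feature maps

    def save_grad(module, grad_in, grad_out):
        store["grad"] = grad_out[0]                 # gradient w.r.t. feature maps

    h1 = conv_layer.register_forward_hook(save_act)
    h2 = conv_layer.register_full_backward_hook(save_grad)
    try:
        logits = model(spec)
        logits[0, class_idx].backward()
        weights = store["grad"].mean(dim=(2, 3), keepdim=True)     # pooled gradients per channel
        cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=spec.shape[-2:], mode="bilinear", align_corners=False)
        return (cam / (cam.max() + 1e-8)).squeeze()
    finally:
        h1.remove()
        h2.remove()
```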

Recourse for flagged authentic samples includes providing saliency-attribution maps as evidence, maintaining transparent detection logs, enabling manual review and appeals, and publishing detector specifications and limitations for user scrutiny (Afchar et al., 7 May 2024).

Singer identification in high-quality vocal deepfakes is addressed by two-stage pipelines: a discriminator (LCNN) prunes low-quality forgeries, and an ECAPA-TDNN identifier assigns singer embeddings, yielding reduced Equal Error Rates on authentic and synthetic content (Salvi et al., 20 Oct 2025).

6. Symbolic Music Forensics and Plagiarism Detection

For symbolic AI-generated music, detection focuses on both generation-fingerprints and potential replication of training data, which constitutes a form of plagiarism. Key contributions:

  • JS Fake Chorales establishes human-in-the-loop benchmarks, with listeners achieving only 3.4% above chance at distinguishing synthetic vs. real Bach chorales, and documents that synthetic data can nearly match real for model training (Peracha, 2021).
  • The SSIMuse framework (Ji et al., 17 Sep 2025) adapts the Structural Similarity Index Measure (SSIM) from image analysis to music, defining SSIMuse-B and SSIMuse-V (for binary and velocity-based piano rolls). These metrics detect exact 1-bar replications with true-positive rates above 95% and false-positive rates below 5%. Controlled experiments on Pop1K7 and POP909 demonstrate quantitative thresholds and systematic performance; a minimal sketch follows this list.
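The sketch below illustrates such a windowed comparison, assuming a binary piano roll on a 16th-note grid and using scikit-image's SSIM; the bar length, grid resolution, and flagging threshold are illustrative, not the published SSIMuse settings.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def bar_replication_scores(gen_roll, ref_roll, bar_len=16):
    """Compare each 1-bar window of a generated binary piano roll
    (pitch x time, 16th-note grid assumed) against every bar of a reference
    piece with SSIM; high scores flag candidate replications."""
    scores = []
    for g0 in range(0, gen_roll.shape[1] - bar_len + 1, bar_len):
        g = gen_roll[:, g0:g0 + bar_len].astype(float)
        best = max(
            ssim(g, ref_roll[:, r0:r0 + bar_len].astype(float), data_range=1.0)
            for r0 in range(0, ref_roll.shape[1] - bar_len + 1, bar_len)
        )
        scores.append(best)
    return np.array(scores)   # e.g. flag bars with score > 0.95 as exact replications

# roll = np.zeros((128, 64), dtype=np.uint8)  # 128 pitches x 4 bars of 16th notes
```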

A plausible implication is that forensic frameworks for symbolic music must couple multi-level similarity metrics (pitch, rhythm, dynamics) with controlled thresholding and domain-aware null distributions to avoid misclassifying quotation, stylistic borrowing, or arrangement as infringement.

7. Limitations, Challenges, and Future Directions

While laboratory accuracy on benchmark datasets is high, several limitations are prominent:

  • Real-world deployment is hampered by fragility to audio augmentation, transformation, and distribution shift (Sroka et al., 7 Jul 2025).
  • Black-box generator diversity forces rapid updating of datasets and retraining of detector architectures (Batra et al., 29 Nov 2025).
  • Closed-set classifier overfitting can yield high false negative rates for unknown, high-quality generators (Comanducci et al., 16 Sep 2024, Yan et al., 2022).
  • Current models may over-rely on low-level or model-dependent artifacts, requiring development of higher-level, musically interpretable detection cues (Afchar et al., 7 May 2024).
  • For symbolic forensics, calibration of replication thresholds is essential to balance detection against legitimate creative practice (Ji et al., 17 Sep 2025).

Future research directions include augmentation-aware training, adversarial robustness, continual learning frameworks, watermarking and provenance strategies, integration with legal, ethical, and policy mechanisms, and multimodal forensics that jointly analyze audio, symbolic, and metadata domains.

