- The paper introduces a quantum-inspired approach that reconstructs spectrograms into information waves, enhancing deepfake speech detection accuracy.
- It introduces QV-CNN and QV-ViT models evaluated on the ASVspoof 2019 dataset, achieving up to 94.57% accuracy and a lower EER than conventional methods.
- The method leverages particle-wave duality to preserve higher-order audio features, resulting in robust feature extraction and improved generalization.
Quantum Vision Theory for Audio Classification: Application to Deepfake Speech Detection
Introduction
"Quantum Vision Theory Applied to Audio Classification for Deepfake Speech Detection" (2604.08104) advances a novel paradigm for audio-based deepfake detection by extending Quantum Vision (QV) theory into the speech forensics domain. The framework reconsiders the foundational representation of sensory data: instead of using fixed, collapsed spectrogram images typical in deep learning, QV theory translates these into a suite of quantum-inspired information waves before classification. By leveraging fundamental concepts such as particle–wave duality from quantum mechanics, the approach enables deep neural architectures to exploit richer, wave-like feature spaces for robust pattern discrimination. The proposed models are validated on the ASVspoof dataset, demonstrating measurable improvements over established architectures in both accuracy and Equal Error Rate (EER).
The QV framework is inspired by quantum mechanics: unobserved quantum systems are described by wave functions encoding the probability distribution of all possible states. Classical deep learning, in contrast, relies on deterministic, collapsed representations (e.g., pixel arrays of images or spectrograms). QV theory asserts that an analogous “information wave” can be constructed from data before observation (classification), preserving information that is lost upon direct collapse.
From the mathematical perspective, a spectrogram I(x,y) is treated as a quantum object. The foundation of the QV block is the systematic generation of basis wave functions by spatially shifting I(x,y) along the x and y axes (via quantum numbers m=±1,±2), followed by difference operations that highlight local transitions, analogous to derivatives. Superpositions are formed using both linear combinations (Eq. 3) and non-linear combinations (Eq. 4, 5, 6) through convolution and nonlinearity (ReLU). Multiple convolutional layers (and filter banks) yield high-dimensional information wave embeddings with rich expressive power, which are then provided as input to standard CNN or Vision Transformer (ViT) architectures—producing QV-CNN and QV-ViT models.
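To make the construction concrete, below is a minimal NumPy sketch of the basis-wave generation and a linear superposition. The circular-shift border handling and the uniform placeholder weights are illustrative assumptions, and the learned non-linear superpositions of Eqs. 4-6 (convolution plus ReLU) are not reproduced here.

```python
import numpy as np

def basis_waves(I, quantum_numbers=(-2, -1, 1, 2)):
    """Generate basis wave functions from a spectrogram I(x, y) by shifting
    it along each axis (quantum numbers m = +-1, +-2) and differencing
    against the original, analogous to discrete directional derivatives."""
    waves = []
    for m in quantum_numbers:
        for axis in (0, 1):                     # axis 0: frequency, axis 1: time
            shifted = np.roll(I, m, axis=axis)  # circular shift; the paper's
                                                # border handling is an assumption
            waves.append(shifted - I)           # difference highlights local transitions
    return np.stack(waves)                      # shape: (8, H, W)

def linear_superposition(waves, weights=None):
    """Linear combination of basis waves (Eq. 3 analogue); uniform placeholder
    weights stand in for whatever coefficients the paper fixes or learns."""
    if weights is None:
        weights = np.full(len(waves), 1.0 / len(waves))
    return np.tensordot(weights, waves, axes=1)  # weighted sum -> (H, W)
```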
Experimental Design
Spectrogram-based representations (STFT, Mel, MFCC) are extracted from ASVspoof 2019. Feature extraction parameters are standardized (16 kHz resampling, 1024-sample window, 256-sample hop). All resulting 2D spectrograms are resized to 32×32 for compatibility.
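A minimal sketch of this extraction pipeline using librosa and OpenCV follows; only the sample rate, window, and hop are fixed above, so the number of mel bands (128) and MFCC coefficients (40) are illustrative assumptions.

```python
import librosa
import numpy as np
import cv2  # used for the 32x32 resize; interpolation choice is an assumption

def extract_features(path, sr=16000, n_fft=1024, hop=256):
    """Extract the three representations (STFT, Mel, MFCC) from one utterance."""
    y, _ = librosa.load(path, sr=sr)  # resample to 16 kHz
    stft = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=128)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
                                hop_length=hop, n_mfcc=40)
    feats = {"stft": librosa.amplitude_to_db(stft),  # log-magnitude
             "mel": librosa.power_to_db(mel),        # log-mel
             "mfcc": mfcc}
    # Resize every 2D map to 32x32 for model compatibility.
    return {k: cv2.resize(v.astype(np.float32), (32, 32))
            for k, v in feats.items()}
```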
- QV-CNN: A deep stack of six convolutional layers processes the 128-channel QV information waves, with standard normalization, pooling, and a fully-connected classification head (a rough sketch follows this list).
- QV-ViT: The transformer backbone consumes 128 QV feature maps as input tokens, with patch size 8×8, eight transformer layers, multi-head self-attention, and MLP classifier head.
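The following PyTorch sketch illustrates the QV-CNN configuration described above; the 3×3 kernels, channel widths, pooling placement, and global average pooling are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class QVCNN(nn.Module):
    """Rough sketch of a six-conv-layer classifier over 128 QV feature maps."""
    def __init__(self, in_channels=128, num_classes=2):
        super().__init__()
        layers, c = [], in_channels
        for i, out_c in enumerate([128, 128, 256, 256, 512, 512]):  # six conv layers
            layers += [nn.Conv2d(c, out_c, kernel_size=3, padding=1),
                       nn.BatchNorm2d(out_c),
                       nn.ReLU(inplace=True)]
            if i % 2 == 1:                  # pool after every second layer (assumption)
                layers.append(nn.MaxPool2d(2))
            c = out_c
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):                   # x: (B, 128, 32, 32) QV information waves
        h = self.features(x)                # -> (B, 512, 4, 4)
        h = h.mean(dim=(2, 3))              # global average pooling
        return self.classifier(h)           # bona fide vs. spoof logits
```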
Training is performed using consistent data splits for rigorous comparison, and evaluated against baselines (conventional CNN and ViT without QV transformation). Metrics include classification accuracy and EER.
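For reference, EER is the operating point at which the false acceptance and false rejection rates coincide. A standard threshold-sweep implementation (not the paper's evaluation code) might look like this:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Compute EER from detection scores (higher = more likely bona fide)
    and binary labels (1 = bona fide, 0 = spoof)."""
    order = np.argsort(scores)
    labels = np.asarray(labels, dtype=float)[order]
    # Sweeping the threshold upward: bona-fide trials below it are false
    # rejections; spoof trials at or above it are false acceptances.
    frr = np.cumsum(labels) / labels.sum()                 # false rejection rate
    far = 1 - np.cumsum(1 - labels) / (1 - labels).sum()   # false acceptance rate
    idx = np.argmin(np.abs(far - frr))                     # crossing point
    return (far[idx] + frr[idx]) / 2
```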
Results
Integrating the QV block yields consistent improvements:
- CNN-based models: QV-CNN surpasses standard CNN on all representations.
  - STFT: 93.26% accuracy, 11.65% EER at batch size 64.
  - Mel-spectrogram: 94.57% accuracy (highest overall), 10.84% EER.
  - MFCC: 94.20% accuracy, 9.04% EER (lowest overall).
- ViT-based models: QV-ViT exhibits stable improvements, lessening the EER gap with CNN-based approaches.
  - Mel and MFCC features benefit strongly (e.g., up to 93.49% accuracy, 9.76% EER).
QV-enhanced models also demonstrate reduced performance variability with changes in batch size, indicating robustness. Mel-spectrogram and MFCC features offer superior discriminative power compared with basic STFT.
Comparative Analysis with Prior Work
A tabulated comparison against published results on ASVspoof 2019 shows that QV-CNN (with MFCC) outperforms recent deep-learning baselines, achieving lower EER and higher classification accuracy than, e.g., ResNet, LCNN, LSTM-CNN, and transformer-based frameworks [altalahin2023unmasking, cheng2023analysis, ulutas2023deepfake, bartusiak2021synthesized].
Numerical highlights:
| Model | Features | Accuracy (%) | EER (%) |
|---|---|---|---|
| QV-CNN | Mel | 94.57 | 10.84 |
| QV-CNN | MFCC | 94.20 | 9.04 |
| Spectrogram+CNN [nosek2019synthesized] | -- | -- | 9.57 |
| MFCC+ResNet [alzantot2019deep] | -- | -- | 9.33 |
| STM+LCNN [cheng2023analysis] | -- | -- | 9.79 |
Theoretical and Practical Implications
The results substantiate the core hypothesis that standard spectrogram representations, as collapsed observations, discard discriminative cues from the generative process of the underlying signal. By constructing a quantum-inspired information wave space via the QV block, the model can learn features akin to higher-order derivatives and spectro-temporal transitions, facilitating improved spoof discrimination.
From a broader theoretical standpoint, QV theory provides a compelling direction for data representation in deep learning by introducing an explicit connection to quantum principles. It can be extended generically to other sensory modalities: computer vision, multimodal perception, and time-series analysis. Practically, the framework does not require quantum hardware; rather, it is a constructive, mathematically motivated pre-processing step applied before classical architectures, yielding immediate benefit with modest computational overhead.
Future Directions
Further research is motivated by the strong empirical gains: avenues include upscaling QV blocks to larger audio corpora, generalization to multilingual or zero-shot settings, multi-modal (audio-visual) fusion, and integration with generative models for enhanced adversarial robustness.
Conclusion
This work demonstrates that quantum-inspired information wave representations, via the QV block, significantly enhance feature learning for deepfake speech detection. By bridging the gap between quantum measurement formalism and contemporary deep feature extraction, the proposed framework outperforms several existing baselines on ASVspoof 2019 benchmarks. QV theory thus constitutes a substantial new paradigm for representation learning in neural audio forensics and beyond (2604.08104).