- The paper introduces a quantum-inspired approach that reconstructs spectrograms into information waves, enhancing deepfake speech detection accuracy.
- It introduces QV-CNN and QV-ViT models evaluated on the ASVspoof 2019 dataset, achieving up to 94.57% accuracy and a lower EER than conventional methods.
- The method leverages particle-wave duality to preserve higher-order audio features, resulting in robust feature extraction and improved generalization.
Quantum Vision Theory for Audio Classification: Application to Deepfake Speech Detection
Introduction
"Quantum Vision Theory Applied to Audio Classification for Deepfake Speech Detection" (2604.08104) advances a novel paradigm for audio-based deepfake detection by extending Quantum Vision (QV) theory into the speech forensics domain. The framework reconsiders the foundational representation of sensory data: instead of using fixed, collapsed spectrogram images typical in deep learning, QV theory translates these into a suite of quantum-inspired information waves before classification. By leveraging fundamental concepts such as particle–wave duality from quantum mechanics, the approach enables deep neural architectures to exploit richer, wave-like feature spaces for robust pattern discrimination. The proposed models are validated on the ASVspoof dataset, demonstrating measurable improvements over established architectures in both accuracy and Equal Error Rate (EER).
The QV framework is inspired by quantum mechanics: unobserved quantum systems are described by wave functions encoding the probability distribution of all possible states. Classical deep learning, in contrast, relies on deterministic, collapsed representations (e.g., pixel arrays of images or spectrograms). QV theory asserts that an analogous “information wave” can be constructed from data before observation (classification), preserving information that is lost upon direct collapse.
From the mathematical perspective, a spectrogram I(x,y) is treated as a quantum object. The foundation of the QV block is the systematic generation of basis wave functions by spatially shifting I(x,y) along the x and y axes (via quantum numbers m=±1,±2), followed by difference operations that highlight local transitions, analogous to derivatives. Superpositions are formed using both linear combinations (Eq. 3) and non-linear combinations (Eq. 4, 5, 6) through convolution and nonlinearity (ReLU). Multiple convolutional layers (and filter banks) yield high-dimensional information wave embeddings with rich expressive power, which are then provided as input to standard CNN or Vision Transformer (ViT) architectures—producing QV-CNN and QV-ViT models.
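To make the construction concrete, below is a minimal NumPy sketch of the basis-wave generation and a linear superposition. The circular-shift border handling and the uniform placeholder weights are illustrative assumptions, and the learned non-linear superpositions of Eqs. 4-6 (convolution plus ReLU) are not reproduced here.

```python
import numpy as np

def basis_waves(I, quantum_numbers=(-2, -1, 1, 2)):
    """Generate basis wave functions from a spectrogram I(x, y) by shifting
    it along each axis (quantum numbers m = +-1, +-2) and differencing
    against the original, analogous to discrete directional derivatives."""
    waves = []
    for m in quantum_numbers:
        for axis in (0, 1):                     # axis 0: frequency, axis 1: time
            shifted = np.roll(I, m, axis=axis)  # circular shift; the paper's
                                                # border handling is an assumption
            waves.append(shifted - I)           # difference highlights local transitions
    return np.stack(waves)                      # shape: (8, H, W)

def linear_superposition(waves, weights=None):
    """Linear combination of basis waves (Eq. 3 analogue); uniform placeholder
    weights stand in for whatever coefficients the paper fixes or learns."""
    if weights is None:
        weights = np.full(len(waves), 1.0 / len(waves))
    return np.tensordot(weights, waves, axes=1)  # weighted sum -> (H, W)
```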
Experimental Design
Spectrogram-based representations (STFT, Mel, MFCC) are extracted from ASVspoof 2019. Feature extraction parameters are standardized (16 kHz resampling, 1024-sample window, 256-sample hop). All resulting 2D spectrograms are resized to 32×32 for compatibility.
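A minimal sketch of this extraction pipeline using librosa and OpenCV follows; only the sample rate, window, and hop are fixed above, so the number of mel bands (128) and MFCC coefficients (40) are illustrative assumptions.

```python
import librosa
import numpy as np
import cv2  # used for the 32x32 resize; interpolation choice is an assumption

def extract_features(path, sr=16000, n_fft=1024, hop=256):
    """Extract the three representations (STFT, Mel, MFCC) from one utterance."""
    y, _ = librosa.load(path, sr=sr)  # resample to 16 kHz
    stft = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=128)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
                                hop_length=hop, n_mfcc=40)
    feats = {"stft": librosa.amplitude_to_db(stft),  # log-magnitude
             "mel": librosa.power_to_db(mel),        # log-mel
             "mfcc": mfcc}
    # Resize every 2D map to 32x32 for model compatibility.
    return {k: cv2.resize(v.astype(np.float32), (32, 32))
            for k, v in feats.items()}
```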
- QV-CNN: A deep stack of six convolutional layers processes the 128-channel QV information waves, with standard normalization, pooling, and a fully-connected classification head (a rough sketch follows this list).
- QV-ViT: The transformer backbone consumes 128 QV feature maps as input tokens, with patch size 8×8, eight transformer layers, multi-head self-attention, and MLP classifier head.
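The following PyTorch sketch illustrates the QV-CNN configuration described above; the 3×3 kernels, channel widths, pooling placement, and global average pooling are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class QVCNN(nn.Module):
    """Rough sketch of a six-conv-layer classifier over 128 QV feature maps."""
    def __init__(self, in_channels=128, num_classes=2):
        super().__init__()
        layers, c = [], in_channels
        for i, out_c in enumerate([128, 128, 256, 256, 512, 512]):  # six conv layers
            layers += [nn.Conv2d(c, out_c, kernel_size=3, padding=1),
                       nn.BatchNorm2d(out_c),
                       nn.ReLU(inplace=True)]
            if i % 2 == 1:                  # pool after every second layer (assumption)
                layers.append(nn.MaxPool2d(2))
            c = out_c
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):                   # x: (B, 128, 32, 32) QV information waves
        h = self.features(x)                # -> (B, 512, 4, 4)
        h = h.mean(dim=(2, 3))              # global average pooling
        return self.classifier(h)           # bona fide vs. spoof logits
```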
Training is performed using consistent data splits for rigorous comparison, and evaluated against baselines (conventional CNN and ViT without QV transformation). Metrics include classification accuracy and EER.
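For reference, EER is the operating point at which the false acceptance and false rejection rates coincide. A standard threshold-sweep implementation (not the paper's evaluation code) might look like this:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Compute EER from detection scores (higher = more likely bona fide)
    and binary labels (1 = bona fide, 0 = spoof)."""
    order = np.argsort(scores)
    labels = np.asarray(labels, dtype=float)[order]
    # Sweeping the threshold upward: bona-fide trials below it are false
    # rejections; spoof trials at or above it are false acceptances.
    frr = np.cumsum(labels) / labels.sum()                 # false rejection rate
    far = 1 - np.cumsum(1 - labels) / (1 - labels).sum()   # false acceptance rate
    idx = np.argmin(np.abs(far - frr))                     # crossing point
    return (far[idx] + frr[idx]) / 2
```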
Results
Integrating the QV block yields consistent improvements:
- CNN-based models: QV-CNN surpasses standard CNN on all representations.
  - STFT: 93.26% accuracy, 11.65% EER at batch size 64.
  - Mel-spectrogram: 94.57% accuracy (highest overall), 10.84% EER.
  - MFCC: 94.20% accuracy, 9.04% EER (lowest overall).
- ViT-based models: QV-ViT exhibits stable improvements, lessening the EER gap with CNN-based approaches.
  - Mel and MFCC features benefit strongly (e.g., up to 93.49% accuracy, 9.76% EER).
QV-enhanced models also demonstrate reduced performance variability with changes in batch size, indicating robustness. Mel-spectrogram and MFCC features offer superior discriminative power compared with basic STFT.
Comparative Analysis with Prior Work
A tabulated comparison against published results on ASVspoof 2019 shows that QV-CNN (with MFCC) outperforms recent deep-learning baselines, achieving lower EER and higher classification accuracy than, e.g., ResNet, LCNN, LSTM-CNN, and transformer-based frameworks [altalahin2023unmasking, cheng2023analysis, ulutas2023deepfake, bartusiak2021synthesized].
Numerical highlights:
| Model | Features | Accuracy (%) | EER (%) |
|---|---|---|---|
| QV-CNN | Mel | 94.57 | 10.84 |
| QV-CNN | MFCC | 94.20 | 9.04 |
| Spectrogram+CNN [nosek2019synthesized] | -- | -- | 9.57 |
| MFCC+ResNet [alzantot2019deep] | -- | -- | 9.33 |
| STM+LCNN [cheng2023analysis] | -- | -- | 9.79 |
Theoretical and Practical Implications
The results substantiate the core hypothesis that standard spectrogram representations, as collapsed observations, discard discriminative cues from the generative process of the underlying signal. By constructing a quantum-inspired information wave space via the QV block, the model can learn features akin to higher-order derivatives and spectro-temporal transitions, facilitating improved spoof discrimination.
From a broader theoretical standpoint, QV theory provides a compelling direction for data representation in deep learning by introducing an explicit connection to quantum principles. It can be extended generically to other sensory modalities: computer vision, multimodal perception, and time-series analysis. Practically, the framework does not require quantum hardware; rather, it is a constructive, mathematically motivated pre-processing step applied before classical architectures, yielding immediate benefit with modest computational overhead.
Future Directions
Further research is motivated by the strong empirical gains: avenues include upscaling QV blocks to larger audio corpora, generalization to multilingual or zero-shot settings, multi-modal (audio-visual) fusion, and integration with generative models for enhanced adversarial robustness.
Conclusion
This work demonstrates that quantum-inspired information wave representations, via the QV block, significantly enhance feature learning for deepfake speech detection. By bridging the gap between quantum measurement formalism and contemporary deep feature extraction, the proposed framework outperforms several existing baselines on ASVspoof 2019 benchmarks. QV theory thus constitutes a substantial new paradigm for representation learning in neural audio forensics and beyond (2604.08104).