- The paper introduces the SONICS dataset for synthetic song detection, comprising over 97,000 songs, roughly 49,000 of them fully AI-generated, to address the limitations of existing deepfake datasets.
- The paper presents the SpecTTTra model, which captures long-range temporal dependencies while being up to 3 times faster and 6 times more memory-efficient than competing models, achieving an F1 score of 0.94 on long (120 s) songs.
- The paper benchmarks both human and AI detection approaches, highlighting practical implications for copyright verification and potential cross-linguistic and multimodal extensions.
SONICS: Synthetic Or Not - Identifying Counterfeit Songs
Summary
This paper addresses a critical gap in current research on AI-generated songs, an area gaining significant importance with the proliferation of AI tools capable of generating entire music tracks. Existing efforts in counterfeit song detection primarily target Singing Voice Deepfake Detection (SVDD), focusing on synthetic vocals overlaid on real instrumental tracks. However, these approaches fall short when faced with contemporary end-to-end AI-generated songs where all elements—vocals, lyrics, music, and style—are synthesized. The authors introduce SONICS, a substantial dataset designed for Synthetic Song Detection (SSD), which includes over 97,000 songs, of which approximately 49,000 are synthetic, sourced from platforms like Suno and Udio.
Key Contributions
- Dataset Introduction: The paper presents SONICS, which alleviates many limitations of existing datasets. It includes a broad diversity of music and lyrics, long-duration songs, and ensures public availability of fake songs, mitigating the "Karaoke effect" and enhancing the practical utility for SSD tasks.
- SpecTTTra Model: The authors propose a novel model, the Spectro-Temporal Tokens Transformer (SpecTTTra), designed to capture long-range temporal dependencies in music. Notably, the model is reported to be up to 3 times faster and 6 times more memory-efficient than current CNN- and Transformer-based models while maintaining competitive performance (see the tokenization sketch after this list).
- Human and AI Benchmarks: The paper includes benchmarks for both AI-based and human evaluation of synthetic song detection, providing a comprehensive analysis of performance across diverse conditions.
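As a rough illustration of the idea behind SpecTTTra, the following PyTorch sketch tokenizes a mel spectrogram separately along the time and frequency axes. The module name, clip sizes, and dimensions are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of spectro-temporal tokenization, assuming a PyTorch-style
# interface; SpecTTTraTokenizer, t_clip, f_clip, and d_model are illustrative
# names and values, not taken from the paper's code.
import torch
import torch.nn as nn

class SpecTTTraTokenizer(nn.Module):
    def __init__(self, n_mels=128, n_frames=1200, t_clip=10, f_clip=2, d_model=256):
        super().__init__()
        assert n_frames % t_clip == 0 and n_mels % f_clip == 0
        # A temporal token summarizes all mel bins over a short time slice.
        self.temporal_proj = nn.Linear(n_mels * t_clip, d_model)
        # A spectral token summarizes all time frames for a narrow mel band.
        self.spectral_proj = nn.Linear(n_frames * f_clip, d_model)
        self.t_clip, self.f_clip = t_clip, f_clip

    def forward(self, spec):                 # spec: (batch, n_mels, n_frames)
        b, f, t = spec.shape
        # Slice along time -> (batch, t // t_clip, n_mels * t_clip)
        temporal = spec.permute(0, 2, 1).reshape(b, t // self.t_clip, -1)
        # Slice along frequency -> (batch, f // f_clip, n_frames * f_clip)
        spectral = spec.reshape(b, f // self.f_clip, -1)
        tokens = torch.cat(
            [self.temporal_proj(temporal), self.spectral_proj(spectral)], dim=1
        )
        return tokens                        # (batch, t//t_clip + f//f_clip, d_model)
```

Each temporal token summarizes a short time slice across all mel bins, and each spectral token summarizes a narrow frequency band across all frames, so the two token sets together cover the full spectrogram with far fewer tokens than exhaustive patching.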
Numerical Results and Performance Insights
The presented results indicate that incorporating long-context relationships significantly enhances fake song detection. When evaluated on long-duration songs (120 s), the proposed SpecTTTra variants performed substantially better than on short-duration songs (5 s). Notably, the SpecTTTra-α variant achieved an F1 score of 0.94 on long songs, only 2% below the top-performing CNN-based model, ConvNeXt, indicating its potential for efficient and accurate long-sequence analysis.
Dataset and Model Analysis
The SONICS dataset distinguishes itself through its scale and diversity. It includes songs generated by several iterations of the Suno and Udio models, covering a wide range of artistic and stylistic compositions. This diversity supports training and evaluating models that must generalize across different types of AI-generated music.
Meanwhile, SpecTTTra's design mitigates the trade-off between capturing long-range dependencies and computational efficiency. Traditional Vision Transformers (ViTs) become computationally impractical for long audio inputs because the number of patches grows rapidly with input length. SpecTTTra circumvents this by tokenizing temporal and spectral information separately, so the token count grows far more slowly and the computational cost stays manageable.
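A back-of-the-envelope comparison makes this concrete; the spectrogram size and the patch/clip sizes below are assumed values chosen for illustration, not figures from the paper:

```python
# Rough token counts for a long clip, assuming a 128-mel spectrogram with
# about 1200 time frames; patch and clip sizes are illustrative assumptions.
n_mels, n_frames = 128, 1200

# ViT-style square patching: tokens grow with the time-frequency product.
patch = 16
vit_tokens = (n_mels // patch) * (n_frames // patch)      # 8 * 75 = 600

# Spectro-temporal tokenization: tokens grow with the sum of the two axes.
t_clip, f_clip = 10, 2
specttra_tokens = n_frames // t_clip + n_mels // f_clip   # 120 + 64 = 184

print(vit_tokens, specttra_tokens)
```

Since self-attention cost grows quadratically with the number of tokens, cutting roughly 600 patches down to under 200 tokens in this setting translates into substantial savings in both compute and memory for long songs.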
Implications and Future Directions
The work presented in this paper has several important implications:
- Enhanced Detection Capabilities: By demonstrating the effectiveness of modeling long-range dependencies, the paper suggests that future work should continue to explore and refine techniques capable of leveraging these relationships in music.
- Practical Applications: The development of robust SSD systems, as outlined in this paper, may lead to practical tools for verifying the authenticity of music tracks. This could be pivotal for platforms dealing with copyright and intellectual property concerns.
- Cross-linguistic and Multimodal Extensions: The current dataset focuses exclusively on English songs. Expanding this research to include multiple languages and integrating multimodal data (e.g., video) could create even more resilient detection systems.
- Real-world Adoption: While the dataset offers a comprehensive benchmark, real-world adoption of these detection models will likely require continuous updates to handle evolving generative technologies.
Conclusion
This paper provides a significant advancement in the field of AI-generated song detection through the introduction of the SONICS dataset and the innovative SpecTTTra model. The thoughtful consideration of long-context modeling and the comprehensive benchmarking establish a strong foundation for future research in this domain, promoting the development of more sophisticated and efficient detection systems to ensure the integrity and authenticity of musical compositions.