DAC-JAX: A JAX Implementation of the Descript Audio Codec (2405.11554v1)

Published 19 May 2024 in cs.SD and eess.AS

Abstract: We present an open-source implementation of the Descript Audio Codec (DAC) using Google's JAX ecosystem of Flax, Optax, Orbax, AUX, and CLU. Our codebase enables the reuse of model weights from the original PyTorch DAC, and we confirm that the two implementations produce equivalent token sequences and decoded audio if given the same input. We provide a training and fine-tuning script which supports device parallelism, although we have only verified it using brief training runs with a small dataset. Even with limited GPU memory, the original DAC can compress or decompress a long audio file by processing it as a sequence of overlapping "chunks." We implement this feature in JAX and benchmark the performance on two types of GPUs. On a consumer-grade GPU, DAC-JAX outperforms the original DAC for compression and decompression at all chunk sizes. However, on a high-performance, cluster-based GPU, DAC-JAX outperforms the original DAC for small chunk sizes but performs worse for large chunks.
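The overlapping-chunk compression described in the abstract can be illustrated with a minimal JAX sketch. This is not the DAC-JAX API: the `encode_chunk` stub, the chunk and hop sizes, and the assumed 512-sample encoder hop and 9 codebooks are illustrative placeholders only.

```python
# Minimal sketch of overlapping-chunk compression for long audio.
# `encode_chunk` is a hypothetical stand-in for the real encoder; all
# sizes below are assumptions chosen for illustration.
import jax
import jax.numpy as jnp

CHUNK = 36_864   # samples processed per call (assumed)
HOP = 32_768     # samples advanced between chunks; CHUNK - HOP overlap

def encode_chunk(chunk: jnp.ndarray) -> jnp.ndarray:
    """Stand-in for the codec's encode step: returns dummy token codes."""
    n_frames = chunk.shape[-1] // 512              # assumed encoder hop
    return jnp.zeros((n_frames, 9), dtype=jnp.int32)  # assumed 9 codebooks

def compress_long_audio(audio: jnp.ndarray) -> jnp.ndarray:
    """Encode a long mono signal chunk by chunk to bound GPU memory use.

    Any tail shorter than a full chunk is dropped for brevity.
    """
    codes = []
    for start in range(0, audio.shape[-1] - CHUNK + 1, HOP):
        chunk = jax.lax.dynamic_slice(audio, (start,), (CHUNK,))
        codes.append(encode_chunk(chunk))
    return jnp.concatenate(codes, axis=0)

if __name__ == "__main__":
    one_minute = jnp.zeros(44_100 * 60)            # silent test signal
    print(compress_long_audio(one_minute).shape)
```

In practice the per-chunk encode step would presumably be jitted once and reused across chunks; the loop above only demonstrates the overlap bookkeeping that keeps peak GPU memory bounded.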
