Spiking Music: Audio Compression with Event Based Auto-encoders (2402.01571v1)

Published 2 Feb 2024 in cs.SD, cs.LG, cs.NE, and eess.AS

Abstract: Neurons in the brain communicate information via punctual events called spikes. The timing of spikes is thought to carry rich information, but it is not clear how to leverage this in digital systems. We demonstrate that event-based encoding is efficient for audio compression. To build this event-based representation we use a deep binary auto-encoder, and under high sparsity pressure, the model enters a regime where the binary event matrix is stored more efficiently with sparse matrix storage algorithms. We test this on the large MAESTRO dataset of piano recordings against vector quantized auto-encoders. Not only does our "Spiking Music compression" algorithm achieve a competitive compression/reconstruction trade-off, but selectivity and synchrony between encoded events and piano key strikes emerge without supervision in the sparse regime.


Summary

  • The paper introduces a novel method using a binary spiking auto-encoder that leverages high sparsity for efficient audio encoding.
  • It replaces conventional VQ mechanisms with an end-to-end trainable model that achieves competitive reconstruction quality on the MAESTRO dataset.
  • The study reveals emergent piano note selectivity, highlighting potential for energy-efficient neuromorphic computing in audio compression.

Event-based Audio Compression Exploiting Sparsity for Efficient Encoding

Introduction to Spiking Music Compression

Recent advances in deep learning have revitalized the exploration of neural network architectures for audio compression. The predominant approach, however, relies on vector quantized variational auto-encoders (VQ-VAE), which do not leverage event-based encoding, a principle deeply rooted in biological neural systems. This paper introduces "Spiking Music compression," an algorithm that applies an event-based auto-encoder to audio compression. By replacing the VQ mechanism with a binary spiking representation and imposing strong sparsity pressure, the model achieves competitive audio reconstruction quality while encoding, storing, and transmitting musical data efficiently. Tested on the MAESTRO dataset of piano recordings, the approach achieves a competitive compression/reconstruction trade-off and offers a fresh perspective on digital compression techniques.
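
To make the storage claim concrete: a dense binary event matrix costs one bit per unit per time step, while an event-list format pays roughly log2(N) + log2(T) bits per event for an N-unit, T-step matrix, so the sparse format wins once the event density falls below about 1/log2(N·T). The back-of-envelope comparison below illustrates this; the matrix dimensions and densities are illustrative assumptions, not figures from the paper.

```python
import math

def dense_bits(n_units: int, n_steps: int) -> int:
    """Dense bitmap: one bit for every (unit, time-step) cell."""
    return n_units * n_steps

def sparse_bits(n_units: int, n_steps: int, n_events: int) -> float:
    """Event list: each spike stored as a (unit, time-step) index pair."""
    return n_events * (math.log2(n_units) + math.log2(n_steps))

# Illustrative sizes (not from the paper): 512 latent units, 75 steps/second.
n_units, n_steps = 512, 75
for density in (0.20, 0.05, 0.01):
    n_events = round(density * n_units * n_steps)
    print(f"density {density:5.1%}: dense {dense_bits(n_units, n_steps)} bits, "
          f"sparse {sparse_bits(n_units, n_steps, n_events):.0f} bits")
```

A production codec would add entropy coding on top of either format, but the crossover logic is the same.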

Novel Concept and Implementation

The cornerstone of the method is a deep binary auto-encoder that transforms audio signals into a sparse binary event matrix. Two variants depart from conventional approaches:

  • Free Model: Forgoes a pre-defined codebook, relying instead on an end-to-end trainable model that outputs binary representations directly. Without an auxiliary sparsity loss, this model competes with existing VQ-VAE techniques in audio fidelity.
  • Sparse Model: Adds a sparsity-inducing loss that pushes the representation toward fewer active bits, making the resulting binary matrix suitable for sparse matrix storage. This model not only demonstrates the viability of extremely low bit-rate compression but also reveals emergent unit selectivity for specific piano notes without explicit supervision; a sketch of the shared binarization mechanism follows this list.
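
The paper's exact architecture is not reproduced here, but the generic mechanism shared by both variants, hard binarization trained with a straight-through estimator plus an optional sparsity penalty, can be sketched in a few lines. The following is a minimal PyTorch sketch; the module name, threshold, and target rate are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class BinaryBottleneck(nn.Module):
    """Binarize encoder activations: hard threshold on the forward pass,
    identity gradient on the backward pass (a straight-through estimator)."""
    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        hard = (logits > 0).float()
        # Forward value is `hard`; the gradient flows to `logits` unchanged.
        return logits + (hard - logits).detach()

def sparsity_penalty(events: torch.Tensor, target_rate: float = 0.01) -> torch.Tensor:
    """Auxiliary loss pushing the mean firing rate below `target_rate`,
    i.e. into the regime where sparse storage pays off."""
    return torch.relu(events.mean() - target_rate)

# Training objective (sketch): the free model uses lambda_sparse = 0,
# the sparse model sets lambda_sparse > 0:
#   loss = reconstruction_loss(decoder(events), audio) \
#          + lambda_sparse * sparsity_penalty(events)
```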

Theoretical Contributions and Practical Outcomes

A key finding is that, in the sparse regime, individual encoder units become selective for and synchronized with piano key strikes, suggesting that the model uncovers underlying musical events without supervision; this property is not observed in the absence of sparsity pressure. The result supports the potential of spiking neural networks (SNNs) for encoding and compressing complex auditory signals, and offers a bridge to understanding how event-based representations can be harnessed effectively in computational models.
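
One simple way to quantify this kind of spike-to-keystroke synchrony is the fraction of a unit's events that land within a small window of a note onset. The metric below is our illustration of the idea, not necessarily the analysis performed in the paper.

```python
import numpy as np

def coincidence_rate(event_times: np.ndarray, onset_times: np.ndarray,
                     tol: float = 0.02) -> float:
    """Fraction of a unit's events within +/- `tol` seconds of a key strike.
    A selective, synchronous unit scores near 1; an unrelated unit scores
    near the chance level set by the onset density."""
    if event_times.size == 0 or onset_times.size == 0:
        return 0.0
    onsets = np.concatenate(([-np.inf], np.sort(onset_times), [np.inf]))
    idx = np.searchsorted(onsets, event_times)           # first onset >= event
    nearest = np.minimum(event_times - onsets[idx - 1],  # gap to previous onset
                         onsets[idx] - event_times)      # gap to next onset
    return float(np.mean(nearest <= tol))
```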

Forward-Looking Perspectives

The implications of this paper extend beyond the domain of audio compression, touching on broader aspects of efficient computing and neural network architecture design:

  • Energy Efficiency: Sparse, event-based encoding mirrors the energy-saving strategies observed in biological systems. As energy consumption becomes a primary constraint on computation, spiking models offer a promising avenue for hardware and algorithms designed for sustainability.
  • Neuromorphic Computing: The investigation highlights the untapped potential of SNNs in practical applications. The demonstrated advantages in audio compression set a precedent for applying spiking models to other tasks, potentially catalyzing advances in neuromorphic hardware and software.
  • Future Research Pathways: While the paper relies on discrete-time models, extending the approach to continuous-time or event-driven frameworks for audio and other data types appears promising. Such approaches could further improve the efficiency and effectiveness of neural encoding systems.

Concluding Remarks

This research charts a new direction in neural audio compression, advocating sparsity and event-based encoding as a route to both efficiency and performance. By demonstrating the practical utility of spiking models in compressing musical audio, and the emergent properties of such models, the paper lays the groundwork for future work on energy-efficient, neuromorphic computing technologies and beyond.