
A Hybrid ANN-SNN Architecture for Low-Power and Low-Latency Visual Perception (2303.14176v2)

Published 24 Mar 2023 in cs.CV and cs.AI

Abstract: Spiking Neural Networks (SNNs) are a class of bio-inspired neural networks that promise to bring low-power and low-latency inference to edge devices through asynchronous and sparse processing. However, being temporal models, SNNs depend heavily on expressive states to generate predictions on par with classical artificial neural networks (ANNs). These states converge only after long transient periods and quickly decay without input data, leading to higher latency, higher power consumption, and lower accuracy. This work addresses this issue by initializing the state with an auxiliary ANN running at a low rate. The SNN then uses the state to generate predictions with high temporal resolution until the next initialization phase. Our hybrid ANN-SNN model thus combines the best of both worlds: it does not suffer from long state transients and state decay thanks to the ANN, and it can generate predictions with high temporal resolution, low latency, and low power thanks to the SNN. We show for the task of event-based 2D and 3D human pose estimation that our method consumes 88% less power with only a 4% decrease in performance compared to its fully ANN counterparts when run at the same inference rate. Moreover, when compared to SNNs, our method achieves a 74% lower error. This research thus provides a new understanding of how ANNs and SNNs can be used to maximize their respective benefits.
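
The scheduling idea in the abstract (a slow auxiliary model sets the state, a fast spiking model runs from that state until the next initialization) can be illustrated with a toy sketch. This is not the authors' implementation: the single leaky integrate-and-fire neuron, the names `lif_step` and `run_hybrid`, the `init_state` callback standing in for the auxiliary ANN, and all parameter values are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence


def lif_step(v: float, x: float, decay: float, threshold: float):
    """One leaky integrate-and-fire update; returns (new_potential, spike)."""
    v = decay * v + x              # leak the old state, integrate the input
    if v >= threshold:
        return v - threshold, 1    # emit a spike and apply a soft reset
    return v, 0


def run_hybrid(
    inputs: Sequence[float],
    init_state: Callable[[Sequence[float]], float],  # hypothetical ANN stand-in
    period: int,
    decay: float = 0.5,
    threshold: float = 1.0,
) -> List[int]:
    """Run the spiking neuron at every step; re-initialize its state
    from `init_state` once every `period` steps (the low-rate branch)."""
    v = 0.0
    spikes = []
    for t, x in enumerate(inputs):
        if t % period == 0:
            # Low-rate branch: overwrite the state from the recent input window.
            window = inputs[max(0, t - period):t]
            v = init_state(window)
        # High-rate branch: cheap state update at every time step.
        v, s = lif_step(v, x, decay, threshold)
        spikes.append(s)
    return spikes


# Toy comparison: the same neuron with and without an informative initial state.
with_init = run_hybrid([0.5] * 4, init_state=lambda w: 1.0, period=2)
without_init = run_hybrid([0.5] * 4, init_state=lambda w: 0.0, period=2)
```

In this toy setting the neuron that starts each period from a useful state responds immediately, while the one that starts from zero is still in its transient; this mirrors, in a very simplified way, the state-transient problem the abstract describes.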

Authors (4)
  1. Asude Aydin
  2. Mathias Gehrig
  3. Daniel Gehrig
  4. Davide Scaramuzza