
Data-efficient Event Camera Pre-training via Disentangled Masked Modeling (2403.00416v1)

Published 1 Mar 2024 in cs.CV

Abstract: In this paper, we present a new data-efficient voxel-based self-supervised learning method for event cameras. Our pre-training overcomes the limitations of previous methods, which either sacrifice temporal information by converting event sequences into 2D images to reuse pre-trained image models, or directly employ paired image data for knowledge distillation to enhance the learning of event streams. To make our pre-training data-efficient, we first design a semantic-uniform masking method that addresses the learning imbalance caused by the varying reconstruction difficulty of different regions in non-uniform data under random masking. In addition, we ease the traditional hybrid masked modeling process by explicitly decomposing it into two branches, namely local spatio-temporal reconstruction and global semantic reconstruction, to encourage the encoder to capture local correlations and global semantics, respectively. This decomposition allows our self-supervised learning method to converge faster with minimal pre-training data. Compared to previous approaches, our self-supervised learning method does not rely on paired RGB images, yet enables simultaneous exploration of spatial and temporal cues at multiple scales. It exhibits excellent generalization performance and demonstrates significant improvements across various tasks with fewer parameters and lower computational costs.
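The two core ideas of the abstract, masking each density region uniformly and splitting masked modeling into a local reconstruction branch and a global semantic branch, can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the density-based binning criterion, the stand-in encoder, and the placeholder decoder outputs are all assumptions introduced here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def semantic_uniform_mask(token_density, mask_ratio=0.6):
    """Mask the same fraction of tokens inside each density bin, so that
    sparse and dense regions contribute equally to the reconstruction
    loss. (Sketch of semantic-uniform masking; the binning criterion is
    a simplifying assumption.)"""
    n = len(token_density)
    edges = np.quantile(token_density, [0.33, 0.66])
    bins = np.digitize(token_density, edges)
    mask = np.zeros(n, dtype=bool)
    for b in np.unique(bins):
        idx = np.flatnonzero(bins == b)
        k = int(round(mask_ratio * len(idx)))
        mask[rng.choice(idx, size=k, replace=False)] = True
    return mask

def encode(tokens, mask):
    # Stand-in "encoder": a fixed linear map over visible tokens only,
    # mimicking a masked autoencoder that never sees masked tokens.
    W = np.eye(tokens.shape[1])
    return tokens[~mask] @ W

def local_branch_loss(tokens, mask, pred):
    # Local spatio-temporal branch: regress raw voxel values of the
    # masked tokens from the decoder's predictions.
    return float(np.mean((pred - tokens[mask]) ** 2))

def global_branch_loss(latent, target):
    # Global semantic branch: match a pooled "semantic" target vector
    # instead of per-voxel values.
    return float(np.mean((latent.mean(axis=0) - target) ** 2))

# Toy event-voxel tokens: (num_tokens, feature_dim).
tokens = rng.normal(size=(30, 8))
density = np.abs(tokens).sum(axis=1)      # proxy for local event density
mask = semantic_uniform_mask(density)

latent = encode(tokens, mask)
pred = rng.normal(size=(int(mask.sum()), 8))  # placeholder decoder output
loss = local_branch_loss(tokens, mask, pred) \
     + global_branch_loss(latent, tokens.mean(axis=0))
```

The point of the decomposition is that the two branches pull the encoder in complementary directions: the local loss forces fine-grained spatio-temporal detail, while the global loss forces scene-level semantics, without entangling the two objectives in one hybrid target.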

Authors (6)
  1. Zhenpeng Huang (5 papers)
  2. Chao Li (429 papers)
  3. Hao Chen (1006 papers)
  4. Yongjian Deng (11 papers)
  5. Yifeng Geng (30 papers)
  6. Limin Wang (221 papers)
Citations (1)
