Segment Any Events via Weighted Adaptation of Pivotal Tokens
Abstract: In this paper, we study how to adapt Segment Anything Models (SAMs) to event data, with the goal of robust and universal object segmentation in the event-centric domain. A central issue is aligning and calibrating the embeddings derived from event data so that they coincide with those derived from RGB images. Capitalizing on the large datasets of paired events and RGB images, we propose to transfer the knowledge encapsulated in the pre-trained SAM framework. To this end, we introduce a multi-scale feature distillation method that optimizes the alignment of token embeddings from event data with their RGB image counterparts, thereby preserving and enhancing the robustness of the overall architecture. Because token embeddings from intermediate layers differ in their significance for higher-level embeddings, our strategy centers on accurately calibrating the pivotal token embeddings. This targeted calibration is aimed at effectively managing the discrepancies between high-level embeddings from the event and image domains. Extensive experiments on different datasets demonstrate the effectiveness of the proposed distillation method. Code is available at http://github.com/happychenpipi/EventSAM.
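The weighted alignment of token embeddings described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the squared-L2 distance, and the externally supplied per-token weights are all assumptions made for this example, since the abstract does not specify how pivotal tokens are identified or weighted.

```python
from typing import Sequence

# One layer of token embeddings: [num_tokens][embed_dim].
Tokens = Sequence[Sequence[float]]


def weighted_token_loss(
    event_tokens: Tokens,
    rgb_tokens: Tokens,
    token_weights: Sequence[float],
) -> float:
    """Weighted mean of per-token squared L2 distances for one layer.

    token_weights are hypothetical importance scores; larger values make
    a token's misalignment count more (a stand-in for "pivotal" tokens).
    """
    if not (len(event_tokens) == len(rgb_tokens) == len(token_weights)):
        raise ValueError("token counts and weights must match")
    total_weight = sum(token_weights)
    if total_weight <= 0:
        raise ValueError("token weights must sum to a positive value")

    loss = 0.0
    for ev, rgb, w in zip(event_tokens, rgb_tokens, token_weights):
        sq_dist = sum((e - r) ** 2 for e, r in zip(ev, rgb))
        loss += (w / total_weight) * sq_dist
    return loss


def multi_scale_distillation_loss(
    event_layers: Sequence[Tokens],
    rgb_layers: Sequence[Tokens],
    weights_per_layer: Sequence[Sequence[float]],
) -> float:
    """Sum the weighted token loss over the selected layers (scales)."""
    if not (len(event_layers) == len(rgb_layers) == len(weights_per_layer)):
        raise ValueError("layer counts must match")
    return sum(
        weighted_token_loss(ev, rgb, w)
        for ev, rgb, w in zip(event_layers, rgb_layers, weights_per_layer)
    )


if __name__ == "__main__":
    # Toy data: 2 layers, 2 tokens per layer, embedding dimension 2.
    event_layers = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 2.0], [1.0, 1.0]]]
    rgb_layers = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [1.0, 1.0]]]
    weights = [[3.0, 1.0], [1.0, 1.0]]
    print(multi_scale_distillation_loss(event_layers, rgb_layers, weights))
```

In this toy setup the RGB embeddings play the role of the frozen teacher and the event embeddings that of the student. In practice the loss would be computed on tensors in a deep learning framework so that gradients flow only into the event branch.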