Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
41 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

EventBind: Learning a Unified Representation to Bind Them All for Event-based Open-world Understanding (2308.03135v6)

Published 6 Aug 2023 in cs.CV

Abstract: In this paper, we propose EventBind, a novel and effective framework that unleashes the potential of vision-LLMs (VLMs) for event-based recognition to compensate for the lack of large-scale event-based datasets. In particular, due to the distinct modality gap with the image-text data and the lack of large-scale datasets, learning a common representation space for images, texts, and events is non-trivial.Intuitively, we need to address two key challenges: 1) how to generalize CLIP's visual encoder to event data while fully leveraging events' unique properties, e.g., sparsity and high temporal resolution; 2) how to effectively align the multi-modal embeddings, i.e., image, text, and events. Accordingly, we first introduce a novel event encoder that subtly models the temporal information from events and meanwhile, generates event prompts for modality bridging. We then design a text encoder that generates content prompts and utilizes hybrid text prompts to enhance EventBind's generalization ability across diverse datasets.With the proposed event encoder, text encoder, and image encoder, a novel Hierarchical Triple Contrastive Alignment (HTCA) module is introduced to jointly optimize the correlation and enable efficient knowledge transfer among the three modalities. We evaluate various settings, including fine-tuning and few-shot on three benchmarks, and our EventBind achieves new state-of-the-art accuracy compared with the previous methods, such as on N-Caltech101 (+5.34% and +1.70%) and N-Imagenet (+5.65% and +1.99%) with fine-tuning and 20-shot settings, respectively. Moreover, our EventBind can be flexibly extended to the event retrieval task using text or image queries, showing plausible performance. Project page:https://vlislab22.github.io/EventBind/.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (4)
  1. Jiazhou Zhou (9 papers)
  2. Xu Zheng (87 papers)
  3. Yuanhuiyi Lyu (25 papers)
  4. Lin Wang (403 papers)
Citations (8)