Group Event Transformer (GET)
- The paper presents GET, which decouples spatial data from temporal and polarity properties using a novel Group Token representation, yielding significant performance improvements.
- It employs an Event Dual Self-Attention block that concurrently leverages spatial and group dimensions to capture hierarchical context with enhanced runtime efficiency.
- GET integrates a Group Token Aggregation module that downscales spatial and group dimensions while doubling channel capacity, setting new standards in event-driven vision tasks.
The Group Event Transformer (GET) is an event-based vision Transformer backbone designed for neuromorphic sensors, notably event cameras. Unlike image-based backbones, GET explicitly decouples spatial information from temporal and polarity properties inherent to event streams. Its core innovations include the Group Token representation, the Event Dual Self-Attention (EDSA) block, and the Group Token Aggregation (GTA) module. These components collectively enable effective hierarchical feature extraction across both spatial and group (time-polarity) domains, delivering state-of-the-art performance on event-driven classification and detection benchmarks (Peng et al., 2023).
1. Event Representation and Group Token Construction
GET operates on the raw asynchronous streams produced by event cameras, in which each event is recorded as a 4-tuple (x, y, t, p), where (x, y) denotes the pixel location, t is the timestamp, and p encodes the polarity (ON or OFF brightness change). To decouple spatial information from temporal–polarity properties explicitly, GET discretizes time into B uniform bins and further partitions events by polarity, forming one group per (time bin, polarity) pair (G = 2B groups in total).
Grouping is formalized as
G(b, p) = { eᵢ = (xᵢ, yᵢ, tᵢ, pᵢ) : tᵢ ∈ [τ_b, τ_{b+1}), pᵢ = p },
where each group G(b, p) collects the events falling in time bin b and carrying polarity p.
Spatial discretization relies on patching: the H × W image plane is divided into non-overlapping K × K patches. Each event is mapped to a unique 1-D bin index computed jointly from its patch coordinates, time bin, and polarity. Two dense measures are accumulated per bin: an unweighted event count and a time-normalized sum. After reshaping and concatenation, the result is a tensor whose axes correspond to the (H/K) × (W/K) patch grid, the G groups, and the accumulated channels.
A group convolution and an MLP further embed these patch-wise group features into the Group Token Embedding: (H/K) × (W/K) tokens, each of dimension G × C (C channels per group).
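The accumulation step above can be sketched in a few lines. The helper below is a hypothetical illustration (the resolution, patch size, bin count, group layout, and function name are assumptions, not the authors' code):

```python
import numpy as np

def group_token_embedding(events, H=128, W=128, K=4, B=6, duration=50e3):
    """Sketch of GET's group-token construction (hypothetical helper).
    `events` is an (N, 4) array of (x, y, t, p) with p in {0, 1};
    timestamps t span [0, duration)."""
    G = 2 * B                       # one group per (time bin, polarity) pair
    Ph, Pw = H // K, W // K         # patch grid
    counts = np.zeros((G, Ph, Pw))  # unweighted event counts per bin
    tsum = np.zeros((G, Ph, Pw))    # time-normalized accumulation per bin

    x, y, t, p = events.T
    tb = np.clip((t / duration * B).astype(int), 0, B - 1)   # time bin
    g = p.astype(int) * B + tb                               # group index
    i, j = (y // K).astype(int), (x // K).astype(int)        # patch coords
    np.add.at(counts, (g, i, j), 1.0)
    np.add.at(tsum, (g, i, j), t / duration)  # normalized timestamps

    # Stack the two measures channel-wise: (2, G, Ph, Pw)
    return np.stack([counts, tsum])
```

In the full model this dense tensor is then reshaped and passed through the group convolution and MLP to produce the token embedding.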
2. Event Dual Self-Attention Block
The EDSA block is central to GET’s hierarchical feature extraction. Each block implements two parallel multi-head self-attention operations—SSA (spatial) and GSA (group)—with dual-path residual architecture.
- Spatial Self-Attention (SSA): Tokens are partitioned into non-overlapping spatial windows of size M × M. Within each window, SSA operates as
Attention(Q, K, V) = Softmax(QKᵀ / √d + B_s) V,
where B_s is a learned relative position bias.
- Group Self-Attention (GSA): Operates on the transposed input, learning self-attention along the group (temporal–polarity) dimension with a learned group bias B_g:
Attention(Q, K, V) = Softmax(QKᵀ / √d + B_g) V.
- Dual Residuals and Fusion: The input X and the outputs of SSA and GSA form two residual streams,
X_s = X + SSA(X) and X_g = X + GSA(X),
which are concatenated and passed through a point-wise MLP and LayerNorm.
- Block Design: EDSA blocks are implemented with multiple parallel heads; each residual is followed by LayerNorm and a two-layer MLP (with GELU nonlinearity) and additional residual links.
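A minimal sketch of the dual-path attention follows, assuming a single head, identity Q/K/V projections, one global window, and no learned biases (all simplifications relative to the paper's EDSA block):

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Toy single-head self-attention with identity projections;
    the real block uses learned Q/K/V weights and bias terms."""
    d = X.shape[-1]
    A = softmax(X @ X.swapaxes(-1, -2) / np.sqrt(d))
    return A @ X

def edsa_block(X):
    """Sketch of the dual-path EDSA idea. X has shape (N, G, C):
    N spatial tokens, G groups, C channels per group."""
    # SSA path: attend across the N spatial tokens, separately per group.
    Xs = X + self_attention(X.transpose(1, 0, 2)).transpose(1, 0, 2)
    # GSA path: attend across the G groups, separately per spatial token.
    Xg = X + self_attention(X)
    # Fuse the two residual streams; a point-wise MLP and LayerNorm
    # would follow in the full block.
    return np.concatenate([Xs, Xg], axis=-1)
```

The key structural point the sketch preserves is that SSA and GSA act on orthogonal axes of the same token tensor, and their residual streams are fused by concatenation.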
3. Group Token Aggregation Module
GTA introduces hierarchical aggregation, reducing both spatial and group dimensions between stages while doubling the per-group channel dimension, preserving representational capacity at manageable computational cost.
- Overlapping Group Convolution: The input (interpreted as “pixels” with group vectors) undergoes group convolution. Each of the new groups is constructed by sliding a kernel across adjacent groups, with overlapping determined by kernel and stride parameters.
- Spatial Downsampling: After group convolution, 2 × 2 max pooling halves the spatial grid. The output therefore has half the spatial resolution, half the groups, and twice the per-group channels of the input, and is followed by LayerNorm.
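The shape bookkeeping of GTA can be illustrated as follows; the kernel size (3), stride (2), zero padding, and the learned projection `Wg` are assumptions made for the sketch, not values confirmed by the source:

```python
import numpy as np

def gta(X, Wg):
    """Sketch of Group Token Aggregation (shapes only; Wg stands in
    for a learned projection). X: (G, Hc, Wc, C) group feature maps;
    returns (G/2, Hc/2, Wc/2, 2C)."""
    G, Hc, Wc, C = X.shape
    # Overlapping group "convolution": a kernel spanning 3 adjacent
    # groups with stride 2 and zero padding, mapping C -> 2C channels.
    Xp = np.pad(X, ((1, 1), (0, 0), (0, 0), (0, 0)))
    windows = np.stack([Xp[s:s + 3] for s in range(0, G, 2)])  # (G/2, 3, Hc, Wc, C)
    Xg = np.einsum('gkhwc,kcd->ghwd', windows, Wg)             # (G/2, Hc, Wc, 2C)
    # 2x2 spatial max pooling halves the grid; LayerNorm would follow.
    return Xg.reshape(G // 2, Hc // 2, 2, Wc // 2, 2, 2 * C).max(axis=(2, 4))
```

The overlap between consecutive group windows is what lets adjacent time-polarity groups exchange information during downsampling.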
4. Backbone Architecture and Parameterization
The GET backbone is modular, structured into three (classification) or four (detection) stages:
- Group Token Embedding (GTE): Converts the raw event stream into (H/K) × (W/K) tokens, each of dimension G × C.
- Stage 1: Two EDSA–MLP–Norm blocks; token resolution unchanged.
- GTA: Halves token and group resolution; doubles per-group channel width.
- Stage 2: Two EDSA blocks.
- GTA: Further halving and channel doubling.
- Stage 3: Eight EDSA blocks; final token processing.
- Classification Head: Global average pooling, followed by a linear classifier.
Typical parameterization yields a model of approximately 4.5M parameters. Computational cost is dominated by the windowed SSA (quadratic in window size but linear in token count) and GSA (quadratic in group count) attention operations.
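Assuming illustrative values for the input resolution, patch size, group count, and channel width (none of these specific numbers are confirmed by the source), the hierarchical shapes through the stages can be tabulated:

```python
def stage_shapes(H=128, W=128, K=4, G=12, C=8, num_stages=3):
    """Hypothetical walk-through of GET's hierarchical shapes:
    each GTA between stages halves the spatial grid and the group
    count and doubles the per-group channel width (all parameter
    values here are assumptions for illustration)."""
    ph, pw = H // K, W // K
    shapes = [(ph * pw, G, C)]       # (tokens, groups, channels) after GTE
    for _ in range(num_stages - 1):  # one GTA between consecutive stages
        ph, pw, G, C = ph // 2, pw // 2, G // 2, C * 2
        shapes.append((ph * pw, G, C))
    return shapes
```

Note that the total feature volume shrinks by a factor of 4 per stage (8× fewer tokens-times-groups, 2× more channels), which is what keeps the deeper, wider stages tractable.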
5. Downstream Integration and Optimization
Classification: A single linear head processes globally pooled features, trained with standard cross-entropy loss.
Detection: GET is employed as backbone within YOLOX, optionally augmented with ConvLSTM layers for temporal memory. The detection objective is the YOLOX compound loss: objectness, classification, and bounding-box regression.
Data Augmentation: Procedures include random horizontal flipping (with polarity swap for gesture tasks), spatio-temporal cropping, and event-based MixUp and CutMix.
Training: The model is trained from scratch for 1,000 epochs (classification) or 400,000 steps (detection) using AdamW, cosine learning-rate scheduling, and weight decay of 0.05.
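A generic cosine learning-rate schedule of the kind described above might look like this (the base rate, warmup length, and floor are assumptions; the source specifies only AdamW, cosine scheduling, and a weight decay of 0.05):

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, warmup=0, min_lr=0.0):
    """Generic cosine learning-rate schedule (hypothetical values).
    Decays from base_lr to min_lr over total_steps, with an optional
    linear warmup."""
    if warmup and step < warmup:
        return base_lr * step / warmup  # linear warmup phase
    t = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))
```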
6. Performance and Empirical Evaluation
GET achieves state-of-the-art results on four event-based classification and two detection datasets. Summary statistics are provided below:
| Dataset | GET Top-1 Accuracy / mAP | Previous SOTA | Metric |
|---|---|---|---|
| CIFAR10-DVS | 84.8% | 78.1% (Nested-T) | Top-1 accuracy |
| N-MNIST | 99.7% | 99.3% | Top-1 accuracy |
| N-CARS | 96.7% | 95.3% | Top-1 accuracy |
| DVS128Gesture | 97.9% | 96.2% (EvT) | Top-1 accuracy |
| Gen1 (detection) | 38.7% / 47.9% (*) | 38.2% / 47.2% (RVT-B) | COCO mAP@50–95 |
| 1Mpx (detection) | 40.6% / 48.4% (*) | 40.3% / 47.4% | COCO mAP@50–95 |
(*) With/without ConvLSTM memory.
Ablation studies indicate that the Group Token Embedding contributes a 2.7–3.1-point accuracy gain; replacing EDSA with SSA alone reduces accuracy by 3.3 points, and omitting GTA reduces detection mAP by 0.9 points while increasing inference speed. The full GET pipeline, including event conversion and inference, runs in ≈16 ms for 50 ms of events on a GTX 1080Ti GPU, outperforming all prior end-to-end approaches.
7. Innovations and Impact
GET introduces three key contributions: (1) Group Token representation for separating time/polarity from spatial domains, (2) Event Dual Self-Attention blocks enabling nuanced, hierarchical context modeling, and (3) Group Token Aggregation to maintain computational tractability while preserving representational power. Its architecture consistently outperforms other event-based backbones in both accuracy and runtime efficiency, establishing a new standard for event-driven vision models (Peng et al., 2023).