
Group Event Transformer (GET)

Updated 7 February 2026
  • The paper presents GET, which decouples spatial data from temporal and polarity properties using a novel Group Token representation, yielding significant performance improvements.
  • It employs an Event Dual Self-Attention block that concurrently leverages spatial and group dimensions to capture hierarchical context with enhanced runtime efficiency.
  • GET integrates a Group Token Aggregation module that downscales spatial and group dimensions while doubling channel capacity, setting new standards in event-driven vision tasks.

The Group Event Transformer (GET) is an event-based vision Transformer backbone designed for neuromorphic sensors, notably event cameras. Unlike image-based backbones, GET explicitly decouples spatial information from temporal and polarity properties inherent to event streams. Its core innovations include the Group Token representation, the Event Dual Self-Attention (EDSA) block, and the Group Token Aggregation (GTA) module. These components collectively enable effective hierarchical feature extraction across both spatial and group (time-polarity) domains, delivering state-of-the-art performance on event-driven classification and detection benchmarks (Peng et al., 2023).

1. Event Representation and Group Token Construction

GET operates on the raw asynchronous streams produced by event cameras, in which each event is recorded as a 4-tuple E_i = (x_i, y_i, t_i, p_i), where (x_i, y_i) denotes pixel location, t_i is the timestamp, and p_i \in \{0, 1\} encodes polarity. To realize explicit decoupling of spatial and temporal–polarity information, GET discretizes time into K uniform bins \{T_1, \dots, T_K\} and further partitions events by polarity, forming one group per bin–polarity pair (G = 2K groups in total).

Grouping is formalized as

\mathcal{E}_j = \{ E_i \mid t_i \in T_j,\ p_i = P_j \},

where each group collects events in one time bin and one polarity.

Spatial discretization relies on patching: the H \times W frame is divided into non-overlapping P \times P patches. Each event is mapped to a unique 1-D bin index using its temporal, intra-patch, and patch components:

\begin{aligned}
d_{t,i} &= \Big\lfloor K \cdot \frac{t_i - t_0}{t_{\mathrm{end}} - t_0 + 1} \Big\rfloor, \\
pr_i &= (x_i \bmod P) + (y_i \bmod P)\, P, \\
pos_i &= \Big\lfloor \frac{x_i}{P} \Big\rfloor + \Big\lfloor \frac{y_i}{P} \Big\rfloor \frac{W}{P}, \\
\ell_i &= (KHW)\, p_i + (HW)\, d_{t,i} + \frac{HW}{P^2}\, pr_i + pos_i.
\end{aligned}

Events are accumulated into two dense vectors per bin: an unweighted count and a time-normalized sum. After reshaping and concatenation, the resulting representation is a tensor of shape

\left(\tfrac{H}{P} \times \tfrac{W}{P}\right) \times (2K \cdot 2P^2).

A 3 \times 3 group convolution and an MLP further embed these patch-prior group features into the Group Token Embedding: \tfrac{H}{P} \times \tfrac{W}{P} tokens, each of dimension G \cdot C (with C channels per group).
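The binning equations above can be sketched in NumPy. This is an illustrative reconstruction, not the authors' code; the helper names (`event_bin_indices`, `group_token_counts`) are assumptions, and only the unweighted count vector is accumulated here (the time-normalized sum would be accumulated analogously with per-event weights).

```python
import numpy as np

H, W, P, K = 128, 128, 4, 6        # sensor size, patch size, time bins

def event_bin_indices(x, y, t, p, t0, t_end):
    """Compute the 1-D bin index l_i for arrays of events (x, y, t, p)."""
    d_t = (K * (t - t0) / (t_end - t0 + 1)).astype(np.int64)  # time bin
    pr  = (x % P) + (y % P) * P                               # position inside patch
    pos = (x // P) + (y // P) * (W // P)                      # patch index
    return (K * H * W) * p + (H * W) * d_t + (H * W // P**2) * pr + pos

def group_token_counts(x, y, t, p, t0, t_end):
    """Accumulate an unweighted event count per bin and reshape to tokens."""
    idx = event_bin_indices(x, y, t, p, t0, t_end)
    counts = np.zeros(2 * K * H * W, dtype=np.float32)
    np.add.at(counts, idx, 1.0)                               # scatter-add events
    # reshape to (H/P * W/P) spatial tokens x (2K * P^2) group features;
    # concatenating the time-normalized sum would double this to 2K * 2P^2
    return counts.reshape(2 * K, P * P, H // P, W // P) \
                 .transpose(2, 3, 0, 1) \
                 .reshape((H // P) * (W // P), 2 * K * P * P)
```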

2. Event Dual Self-Attention Block

The EDSA block is central to GET's hierarchical feature extraction. Each block implements two multi-head self-attention operations—SSA over the spatial dimension, followed by GSA over the group dimension—within a dual-path residual architecture.

  • Spatial Self-Attention (SSA): Tokens are partitioned into non-overlapping spatial windows of size S. Within each window X \in \mathbb{R}^{S \times (GC)},

Q = XW^Q, \quad K = XW^K, \quad V = XW^V,

and SSA operates as

\mathrm{SSA}(X) = \mathrm{Softmax}\left( \tfrac{QK^\top}{\sqrt{d}} + B_p \right) V,

where B_p is a learned relative position bias.

  • Group Self-Attention (GSA): Follows SSA, operating on the transposed input to learn self-attention over the group (temporal–polarity) dimension, with a learned group bias B_g:

\mathrm{GSA}(X) = \mathrm{Softmax}\left( \tfrac{Q_g K_g^\top}{\sqrt{d_g}} + B_g \right) V_g.

  • Dual Residuals and Fusion: The inputs and outputs of SSA and GSA form two residual streams:

Z_s = X + Y_s, \quad Z_{sp} = Y_s + Y_g,

which are concatenated and passed through a point-wise MLP and LayerNorm.

  • Block Design: EDSA blocks are implemented with multiple parallel heads; each residual is followed by LayerNorm and a two-layer MLP (with GELU nonlinearity) and additional residual links.
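The SSA–GSA–fusion pattern above can be condensed into a single-head NumPy sketch. This is a simplification for illustration, not the paper's implementation: it omits the biases B_p and B_g, multi-head splitting, LayerNorm, and the MLP, and all function and parameter names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Plain scaled dot-product self-attention over the first axis of X."""
    Q, K_, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K_.T / np.sqrt(Q.shape[-1])) @ V

def edsa_block(X, G, C, params):
    """X: (S, G*C) window tokens -> fused (S, G*C) features."""
    S = X.shape[0]
    Ys = self_attention(X, *params["spatial"])             # SSA over S tokens
    # transpose so attention runs over the G groups instead of the S tokens
    Xg = Ys.reshape(S, G, C).transpose(1, 0, 2).reshape(G, S * C)
    Yg = self_attention(Xg, *params["group"])              # GSA over G groups
    Yg = Yg.reshape(G, S, C).transpose(1, 0, 2).reshape(S, G * C)
    Zs, Zsp = X + Ys, Ys + Yg                              # dual residual streams
    return np.concatenate([Zs, Zsp], axis=-1) @ params["fuse"]  # point-wise fuse

S, G, C = 49, 12, 8
params = {
    "spatial": [rng.standard_normal((G * C, G * C)) * 0.02 for _ in range(3)],
    "group":   [rng.standard_normal((S * C, S * C)) * 0.02 for _ in range(3)],
    "fuse":    rng.standard_normal((2 * G * C, G * C)) * 0.02,
}
out = edsa_block(rng.standard_normal((S, G * C)), G, C, params)
```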

3. Group Token Aggregation Module

GTA introduces hierarchical aggregation, reducing both the spatial and group dimensions between stages while doubling the per-group channel dimension to preserve representational capacity at low computational cost.

  • Overlapping Group Convolution: The input N \times (GC) tensor (interpreted as N "pixels", each holding G group vectors) undergoes group convolution. Each of the \lfloor G/2 \rfloor new groups is constructed by sliding a kernel across adjacent groups, with overlap determined by the kernel size and stride.
  • Spatial Downsampling: After the group convolution, a 3 \times 3 max pooling halves the spatial grid. The resulting output is (N/2) \times (\lfloor G/2 \rfloor \cdot 2C), followed by LayerNorm.
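The two GTA steps can be sketched as follows. The kernel size, stride, zero padding, and the 2x2 pooling window (vs. the 3x3 window in the text) are assumptions chosen so that G groups yield \lfloor G/2 \rfloor output groups of width 2C; the function and parameter names are illustrative.

```python
import numpy as np

def gta(tokens, h, w, G, C, Wg, k=3, stride=2):
    """tokens: (h*w, G*C) -> ((h//2)*(w//2), (G//2)*2C), the GTA pattern."""
    N = tokens.shape[0]
    x = tokens.reshape(N, G, C)
    # zero-pad one group on each side so k=3, stride=2 yields G//2 groups
    xp = np.concatenate([np.zeros((N, 1, C)), x, np.zeros((N, 1, C))], axis=1)
    # overlapping group "convolution": each output group mixes k adjacent
    # groups and projects them to 2C channels
    new_groups = [xp[:, g:g + k].reshape(N, k * C) @ Wg        # (N, 2C)
                  for g in range(0, G + 2 - k + 1, stride)]
    y = np.stack(new_groups, axis=1)                           # (N, G//2, 2C)
    Gp = y.shape[1]
    # stride-2 spatial max pooling (2x2 window here for simplicity) halves
    # each side of the token grid
    y = y.reshape(h, w, Gp * 2 * C)
    y = np.maximum.reduce([y[0::2, 0::2], y[0::2, 1::2],
                           y[1::2, 0::2], y[1::2, 1::2]])
    return y.reshape((h // 2) * (w // 2), Gp * 2 * C)
```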

4. Backbone Architecture and Parameterization

The GET backbone is modular, structured into three (classification) or four (detection) stages:

  1. Group Token Embedding (GTE): Converts the raw event stream to \tfrac{H}{P} \times \tfrac{W}{P} tokens, each of size G \cdot 48.
  2. Stage 1: Two EDSA–MLP–Norm blocks; token resolution unchanged.
  3. GTA: Halves token and group resolution; doubles per-group channel width.
  4. Stage 2: Two EDSA blocks.
  5. GTA: Further halving and channel doubling.
  6. Stage 3: Eight EDSA blocks; final token processing.
  7. Classification Head: Global average pooling, followed by a linear classifier.

Typical parameterization with H = 128, W = 128, P = 4, K = 6 yields G = 12. The model contains approximately 4.5M parameters. Computational cost is dominated by the local SSA (window size S = 7^2 = 49) and GSA (G \times S) attention operations.
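The stage-by-stage shapes implied by this parameterization can be checked with a short script. It assumes "halved resolution" means each spatial side is halved (so the token count quarters per GTA) and that C = 48 at the embedding, per the stage list above.

```python
# Shape walkthrough for the default configuration: each GTA halves both
# spatial sides and the group count while doubling per-group channels.
H, W, P, K, C = 128, 128, 4, 6, 48
G = 2 * K                                   # 12 groups
tokens, groups, channels = (H // P) * (W // P), G, C
print(f"GTE:   {tokens} tokens, {groups} groups x {channels} ch")
for stage in (1, 2):
    tokens //= 4                            # spatial grid halved per side
    groups //= 2                            # group dimension halved
    channels *= 2                           # per-group channels doubled
    print(f"GTA {stage}: {tokens} tokens, {groups} groups x {channels} ch")
```

Note that the total token width G * C stays constant (12 * 48 = 6 * 96 = 3 * 192 = 576), so capacity shifts from groups to channels as the hierarchy deepens.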

5. Downstream Integration and Optimization

Classification: A single linear head processes globally pooled features, trained with standard cross-entropy loss.

Detection: GET is employed as the backbone within YOLOX, optionally augmented with ConvLSTM layers for temporal memory. The detection objective is the YOLOX compound loss: objectness, classification, and bounding-box regression.

Data Augmentation: Procedures include random horizontal flipping (with polarity swap for gesture tasks), spatio-temporal cropping, and event-based MixUp and CutMix.

Training: The model is trained from scratch for 1,000 epochs (classification) or 400,000 steps (detection) using AdamW, cosine learning-rate scheduling, and weight decay of 0.05.
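The cosine learning-rate schedule might be sketched as below. The peak learning rate, warmup length, and minimum floor are assumed values for illustration and are not given in the text.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-3, warmup=2000, min_lr=1e-6):
    """Cosine decay from base_lr to min_lr, with an (assumed) linear warmup."""
    if step < warmup:
        return base_lr * step / warmup                  # linear warmup phase
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

For the detection setting this would be called once per step with `total_steps=400_000`; AdamW with weight decay 0.05 supplies the update rule itself.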

6. Performance and Empirical Evaluation

GET achieves state-of-the-art results on four event-based classification and two detection datasets. Summary statistics are provided below:

| Dataset | GET (Top-1 / mAP) | Previous SOTA | Metric |
|---|---|---|---|
| CIFAR10-DVS | 84.8% | 78.1% (Nested-T) | Top-1 accuracy |
| N-MNIST | 99.7% | 99.3% | Top-1 accuracy |
| N-CARS | 96.7% | 95.3% | Top-1 accuracy |
| DVS128Gesture | 97.9% | 96.2% (EvT) | Top-1 accuracy |
| Gen1 (detection) | 38.7% / 47.9% (*) | 38.2% / 47.2% (RVT-B) | COCO mAP@50–95 |
| 1Mpx (detection) | 40.6% / 48.4% (*) | 40.3% / 47.4% | COCO mAP@50–95 |

(*) With/without ConvLSTM memory.

Ablation studies indicate the Group Token Embedding provides a 2.7–3.1% performance gain; replacing EDSA with SSA reduces accuracy by 3.3%, and omitting GTA reduces detection mAP by 0.9% but increases runtime speed. The GET pipeline, including event conversion and inference, achieves ≈16 ms runtime for 50 ms of events on a 1080Ti GPU, outperforming all prior end-to-end approaches.

7. Innovations and Impact

GET introduces three key contributions: (1) Group Token representation for separating time/polarity from spatial domains, (2) Event Dual Self-Attention blocks enabling nuanced, hierarchical context modeling, and (3) Group Token Aggregation to maintain computational tractability while preserving representational power. Its architecture consistently outperforms other event-based backbones in both accuracy and runtime efficiency, establishing a new standard for event-driven vision models (Peng et al., 2023).
