RGB-Event Paired Dataset Overview
- RGB-Event paired datasets are multimodal collections that synchronize RGB images and event-based data for analyses requiring high temporal and spatial resolution.
- They utilize specialized sensors and calibration techniques, such as DAVIS346 and stereo rigs, to achieve sub-millisecond, pixel-level alignment.
- These datasets enable research in cross-modal fusion, object tracking, and activity recognition under dynamic scenes and challenging lighting conditions.
An RGB-Event paired dataset is a multimodal collection comprising synchronously acquired RGB (frame-based) images and event (asynchronous, high temporal resolution) streams, typically captured with hybrid vision sensors or carefully calibrated multi-camera rigs. These datasets offer pixel- or sample-level alignment between conventional intensity data and neuromorphic event data, enabling research into cross-modal fusion, event-based perception, and high-temporal-resolution vision tasks. They underpin major advances in event-based learning, multimodal fusion, and the benchmarking of algorithms for scenarios with high motion dynamics, severe lighting conditions, or latency constraints.
1. Principles of RGB-Event Paired Datasets
At the core of an RGB-Event paired dataset lies the simultaneous acquisition or precise temporal alignment of RGB frames and event streams. “Events” are tuples $e_k = (x_k, y_k, t_k, p_k)$, where $(x_k, y_k)$ are pixel coordinates, $t_k$ is the timestamp, and $p_k \in \{-1, +1\}$ is the polarity (sign of the brightness change), generated asynchronously whenever a logarithmic brightness change exceeds a contrast threshold. The frame data provides conventional image context, while events encode high-frequency spatio-temporal structure. Early work leveraged specialized sensors, such as the DAVIS346 and Prophesee GEN4, to produce true hardware-level alignment; contemporary datasets increasingly supplement or simulate this via post-hoc matching, upsampling, or event simulation frameworks when needed.
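For concreteness, the standard event-generation condition can be written as follows (notation chosen here for illustration: $C$ denotes the contrast threshold and $t_k - \Delta t_k$ the time of the previous event at the same pixel):

$$
\left| \log I(x_k, y_k, t_k) - \log I(x_k, y_k, t_k - \Delta t_k) \right| \;\geq\; C,
\qquad
p_k = \operatorname{sign}\!\big(\log I(x_k, y_k, t_k) - \log I(x_k, y_k, t_k - \Delta t_k)\big).
$$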
These datasets are designed for diverse research tasks, from self-supervised contrastive pretraining (Wu et al., 17 Apr 2025) to downstream applications—object tracking, pose estimation, event-guided image enhancement, activity recognition, and cross-modal ReID.
2. Acquisition Hardware, Synchronization, and Calibration
Acquisition setups vary from unified sensors like DAVIS346 (which output both RGB frames and events on a shared clock) to composite rigs where RGB and event cameras are mounted in rigid stereo configurations. Key examples and protocols:
- Single-sensor (native alignment): Datasets such as HARDVS and its derivatives (e.g., REV2M in (Wu et al., 17 Apr 2025)) use the DAVIS346, guaranteeing pixel- and timestamp-level correspondence by design, as events and frames share the same chip and hardware clock.
- Stereo or beam-splitter rigs: High-resolution and physically decoupled modalities (as in LSE-NeRF (Tang et al., 9 Sep 2024) and the Event-RGB spacecraft datasets (Jawaid et al., 8 Jul 2025)) are achieved using stereo mounts or beam-splitters; calibration is conducted via checkerboard or signal-based protocols. Intrinsic and extrinsic calibration parameters (camera matrices $K$, rotation $R$, and translation $t$) are estimated, commonly via OpenCV's stereoCalibrate; a minimal sketch follows this list.
- Temporal alignment: Frame-exposure signals can be looped back into the event sensor as hardware triggers, ensuring sub-millisecond synchronization (Tang et al., 9 Sep 2024). Alternatively, simultaneous recording with accurate timestamp support and post-hoc interpolation (e.g., aligning via chronometer patterns in Neuromorphic Drone Detection (Magrini et al., 24 Sep 2024)) achieves millisecond precision.
- No explicit calibration: Some composite datasets (e.g., REV2M (Wu et al., 17 Apr 2025)) aggregate pairs from independent sources without new joint calibration, inheriting all spatial/temporal properties from the original recordings.
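A minimal calibration sketch under common assumptions (checkerboard corners already detected in the RGB frames and in event-accumulation images; variable names are illustrative and do not reflect any specific dataset's protocol):

```python
# Joint RGB/event-camera calibration sketch using OpenCV.
# Inputs: object_points = list of (N,3) checkerboard points per view,
# rgb_corners / event_corners = lists of (N,1,2) detected corners per view.
import cv2

def calibrate_rgb_event_rig(object_points, rgb_corners, event_corners,
                            rgb_size, event_size):
    # Per-camera intrinsics first (camera matrix K and distortion coefficients).
    _, K_rgb, d_rgb, _, _ = cv2.calibrateCamera(
        object_points, rgb_corners, rgb_size, None, None)
    _, K_evt, d_evt, _, _ = cv2.calibrateCamera(
        object_points, event_corners, event_size, None, None)

    # Extrinsics (rotation R, translation T) between the two sensors,
    # keeping the per-camera intrinsics fixed.
    flags = cv2.CALIB_FIX_INTRINSIC
    rms, _, _, _, _, R, T, _, _ = cv2.stereoCalibrate(
        object_points, rgb_corners, event_corners,
        K_rgb, d_rgb, K_evt, d_evt, rgb_size, flags=flags)
    return K_rgb, d_rgb, K_evt, d_evt, R, T, rms
```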
3. Data Representation, Storage, and Preprocessing
Across published datasets, the following representations are pervasive:
- Raw formats: Events are distributed in native formats such as .aedat4, .dat, or proprietary containers (see (Wu et al., 17 Apr 2025, Wang et al., 9 Mar 2024)). Each file encodes sequences of events, typically at microsecond resolution.
- Frame and event alignment: For RGB frames at times $t_i$, event streams are binned into windows—often $[t_i - \Delta t, t_i]$—to create event “frames” or voxel grids. Accumulation functions range from simple polarity histograms to 3D voxel or patchwise voxel embeddings (as in (Wu et al., 17 Apr 2025)); a minimal binning sketch follows this list.
- Event-voxel and patch structures: Temporal and spatial groupings—such as the 196-patch split (14×14) in REV2M—facilitate frame-wise batching and downstream processing for transformer or contrastive fusion backbones.
- Preprocessing: Datasets typically provide scripts to convert event streams to accumulated images, voxel grids, or binary masks. Event denoising and temporal binning are applied as needed; the exact choice is often left to end-users.
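An illustrative conversion, not the datasets' official tooling: binning a raw event stream into a per-frame voxel grid and splitting it into 14×14 patches. The function names, bin count, and array conventions are assumptions of this sketch.

```python
import numpy as np

def events_to_voxel(x, y, t, p, t_start, t_end, H, W, n_bins=5):
    """Accumulate events with timestamps in [t_start, t_end) into an
    (n_bins, H, W) voxel grid, signed by polarity.
    x, y: integer pixel coordinates; t: timestamps; p: polarities."""
    grid = np.zeros((n_bins, H, W), dtype=np.float32)
    mask = (t >= t_start) & (t < t_end)
    x, y, t, p = x[mask], y[mask], t[mask], p[mask]
    # Map each event timestamp to a temporal bin index.
    b = np.clip(((t - t_start) / (t_end - t_start) * n_bins).astype(int),
                0, n_bins - 1)
    np.add.at(grid, (b, y, x), np.where(p > 0, 1.0, -1.0))
    return grid

def patchify(voxel, grid_size=14):
    """Split a (C, H, W) voxel grid into grid_size**2 flattened patches."""
    C, H, W = voxel.shape
    ph, pw = H // grid_size, W // grid_size
    voxel = voxel[:, :ph * grid_size, :pw * grid_size]  # drop remainder pixels
    patches = voxel.reshape(C, grid_size, ph, grid_size, pw)
    patches = patches.transpose(1, 3, 0, 2, 4).reshape(grid_size * grid_size, -1)
    return patches  # (196, C*ph*pw) for grid_size=14
```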
4. Dataset Scale, Categories, and Statistics
RGB-Event paired datasets are notable for their diversity in scale, coverage, and application targets. REV2M (Wu et al., 17 Apr 2025), currently the largest, contains over 2.5 million pairs constructed by aggregating five public benchmarks. Category and pair distributions are determined by the original sources:
| Source | #Pairs | Categories | Sensor/Res. |
|---|---|---|---|
| HARDVS | 827,694 | 300 human actions | DAVIS346, 128×128 |
| N-ImageNet | 1,281,166 | ILSVRC/Imagenet classes | Prophesee GEN4 VGA |
| COESOT | 231,277 | 90 tracking types | DAVIS346, 346×260 |
| VisEvent | 185,127 | Object/scene tracking | DAVIS346, 260×346 |
| DSEC-MOD | 10,495 | Automotive | EVK4-HD, 1280×720 |
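Summing the pair counts in the table is consistent with the stated aggregate scale:

$$
827{,}694 + 1{,}281{,}166 + 231{,}277 + 185{,}127 + 10{,}495 = 2{,}535{,}759 \approx 2.5\ \text{M pairs}.
$$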
This breadth supports downstream applicability across action recognition, tracking, detection, and more. Precise statistics—event densities, class/scene splits—are deferred to the original dataset documentation. REV2M provides no explicit train/val/test splits, using all pairs for self-supervised pretraining (Wu et al., 17 Apr 2025).
Scene and action categories span from hundreds of human action classes to urban/automotive, with inherent diversity in lighting, motion, background, and occlusion.
5. Annotation Protocols, Licensing, and Accessibility
Annotation in RGB-Event paired datasets typically adheres to the underlying sources. REV2M and other fusion datasets do not provide new ground-truth for tasks such as depth, flow, or fine-grained object data, but retain all higher-level or task-specific labels extant in constituent datasets (e.g., action category, object bounding boxes, or scene class). Licensing and access thus inherit constraints from source datasets: academic-only, CC-BY, or MIT-like, varying per component (see COESOT, HARDVS, N-ImageNet, etc.).
Pair-level annotation may include:
- Action or object class.
- Sequence and scene identifiers.
- For some tasks, per-frame bounding boxes, tracking IDs, or object masks—when supported by originals.
Code, download scripts, and usage examples for loading and batch-processing the data are routinely supplied (e.g., https://github.com/Event-AHU/CM3AE for REV2M). However, details regarding storage hierarchy, naming conventions, or sample selection must be inferred directly from the source dataset formats.
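A hypothetical loading sketch follows; the actual REV2M storage hierarchy is dataset-specific (see the repository above), and the directory layout assumed here (root/rgb/<id>.png paired with root/event_voxel/<id>.npy) is purely illustrative.

```python
import os
import numpy as np
import torch
from torch.utils.data import Dataset
from PIL import Image

class RGBEventPairs(Dataset):
    """Pairs an RGB frame with its precomputed event voxel grid under an
    assumed directory structure (illustrative only)."""
    def __init__(self, root, transform=None):
        self.root = root
        self.ids = sorted(os.path.splitext(f)[0]
                          for f in os.listdir(os.path.join(root, "rgb")))
        self.transform = transform

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, i):
        sid = self.ids[i]
        rgb = Image.open(os.path.join(self.root, "rgb", f"{sid}.png")).convert("RGB")
        voxel = np.load(os.path.join(self.root, "event_voxel", f"{sid}.npy"))
        if self.transform is not None:
            rgb = self.transform(rgb)
        else:
            rgb = torch.from_numpy(np.asarray(rgb)).permute(2, 0, 1).float() / 255.0
        return rgb, torch.from_numpy(voxel).float()
```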
6. Limitations, Impact, and Research Directions
Key limitations of major RGB-Event paired datasets include:
- Absence of new joint calibration: Aggregated datasets like REV2M do not standardize calibration, time alignment, or storage structure, necessitating dataset-specific handling in downstream research (Wu et al., 17 Apr 2025).
- No per-pair statistics or benchmarking splits: No intrinsic distributions (e.g., event rates, dynamic range, events per pixel) are published for aggregated sets; users must rely on the original papers.
- Annotation granularity strictly inherited: Any supplementary annotation—beyond original class/box/flow labels—is not introduced at merge time.
Nonetheless, these datasets fill a critical gap in multi-modal fusion and representation learning. They have enabled multimodal pretraining frameworks—such as CM3AE (Wu et al., 17 Apr 2025), which leverages both event frames and event-voxel embeddings for contrastive and reconstruction-based pretraining—and inform the design and benchmarking of cross-modal fusion architectures for tracking, detection, and dynamic scene understanding under high-speed, HDR, or adverse lighting.
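A schematic sketch of the cross-modal contrastive component of such pretraining—a generic symmetric InfoNCE objective between paired RGB and event embeddings, not CM3AE's actual implementation:

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(rgb_emb, evt_emb, temperature=0.07):
    """rgb_emb, evt_emb: (B, D) embeddings of paired RGB frames and event
    representations; matching indices in the batch are treated as positives."""
    rgb = F.normalize(rgb_emb, dim=-1)
    evt = F.normalize(evt_emb, dim=-1)
    logits = rgb @ evt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(rgb.size(0), device=rgb.device)
    # Symmetric loss: RGB->event and event->RGB retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```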
Future trajectories involve unifying calibration and split protocols, expanding to joint raw + event + depth or IMU signals, and devising novel benchmarks that exploit the high spatio-temporal fidelity of event data for general-purpose perception and multimodal representation learning.