N-Caltech101 and N-CARS Datasets

Updated 2 October 2025

N-Caltech101 and N-CARS are neuromorphic vision datasets that convert static image corpora into streams of spike events using controlled sensor motion.
They employ a methodology inspired by biological microsaccades to generate events, enabling direct performance comparison with conventional vision algorithms.
Analyses reveal that while rate-coded spatial information is effectively captured, the datasets limit the demonstration of spike timing advantages in spiking neural networks.

N-Caltech101 and N-CARS are neuromorphic vision datasets generated by converting static image corpora (specifically, Caltech101 and CARS) into streams of spiking events via controlled sensor motion. These datasets are designed for benchmarking spike-based recognition algorithms using event-based vision sensors, and support direct comparison between neuromorphic and conventional computer vision approaches. Their construction, characteristics, and algorithmic performance provide foundational context for the design and evaluation of neuromorphic learning systems.

1. Dataset Generation Methodology

The core procedure for generating neuromorphic datasets—including N-Caltech101 and N-CARS—relies on the physical actuation of a vision sensor equipped with an event-based imager such as an ATIS or DVS. Traditional frame-based image datasets (e.g., Caltech101), which consist of static images, are unsuitable for event-based sensing because such sensors only respond to changes in intensity. To produce event streams, each static image is displayed and the vision sensor is moved in a prescribed pattern—a strategy directly inspired by biological microsaccades. This approach avoids artifacts related to monitor refresh rates.

During recording, an actuated pan-tilt platform (Dynamixel MX-28 motors) executes three sequential micro-saccades in the shape of an isosceles triangle for each image; this ensures the sensor returns to the same position and presents motion in multiple directions to activate diverse spatial gradients. Caltech101 images are first resized so that width ≤ 240 px and height ≤ 180 px, preserving original aspect ratio, following the method of Serre et al.

2. Physical Principles and Conversion Equations

Event generation is governed by physical principles of optical flow. The fundamental constraint is:

$I_t = -I_x V_x - I_y V_y$

where $I_t$ is the temporal derivative of intensity, $I_x$ and $I_y$ are the spatial derivatives, and $V_x$ , $V_y$ are velocities in the $x$ and $y$ axes, respectively.

Sensor movement induces pure rotational motion described by: $\begin{aligned} V_x &= ((T_z x - T_x) / z) - \omega_y + \omega_z y + \omega_x x y - \omega_y x^2 \ V_y &= ((T_z y - T_y) / z) + \omega_x - \omega_z x - \omega_y x y + \omega_x y^2 \end{aligned}$ For Caltech101 recordings, only the rotational velocities $\omega_x, \omega_y, \omega_z$ are nonzero, ensuring uniform event generation across the image.

3. Data Format and Structure

Each recording from N-Caltech101 and N-CARS consists of a sequence of spike events, formatted as tuples $(x, y, t, p)$ representing pixel address $(x, y)$ , timestamp $t$ , and polarity $p$ (ON/OFF). The event stream encoding reflects the spatiotemporal distribution of intensity changes triggered by sensor movement over the displayed image.

4. Baseline Recognition Algorithm Performance

After dataset construction, several spike-based recognition methods are evaluated:

k-Nearest Neighbour Statistics: Using descriptors such as total event count, ON/OFF event numbers, and mean/standard deviation of event coordinates. Individual features yield recognition rates of ~1.5–2.0% accuracy on N-Caltech101.
Synaptic Kernel Inverse Method (SKIM): A spike-based algorithm utilizing delayed alpha kernels to compute features from spike trains, mapped to class labels. Achieves 8.30% classification accuracy on N-Caltech101 (Orchard et al., 2015).

In contrast, frame-based convolutional neural networks trained on time-collapsed images (VGG-16, ImageNet pretraining) achieve 78.01% accuracy (Iyer et al., 2018). For comparison, identical procedures on N-MNIST yield 99.23% accuracy.

SNNs, when applied in an unsupervised setting to N-MNIST, demonstrate 91.78% accuracy. Direct SNN results for N-Caltech101 are not detailed, but the performance trends indicate that rate-based coding—i.e., spike counts per pixel—adequately recover nearly all discriminative information, with limited utility for precise spike-timing codes.

5. Neuromorphic Properties and Temporal Dynamics

Empirical studies reveal that collapse of temporal information (i.e., summing spike counts over time) results in minimal loss of discriminative power. Experiments varying the presynaptic time constant $\tau_{xpre}$ in STDP learning demonstrate that slow-decay regimes (rate-based) outperform fast-decay (timing-based) regimes.

The normalized pixel spike count formula is:

$C^{(x,p)} = \frac{\sum_i t^{(x,p)}_i} {\max_{y} \sum_i t^{(y,p)}_i}$

where $C^{(x,p)}$ denotes normalized spike count for pixel $x$ in pattern $p$ .

Leaky integrate-and-fire (LIF) neuron dynamics are given by:

$\tau_M \frac{dV_j}{dt} = -(V_j - V_{leak}) + R I_j(t)$

with spike emission when $V_j \geq V_{th}$ .

For plasticity, the weight update rule is:

$\Delta w = \eta (x_{pre} - x_{tar}) (w_{max} - w)^\mu$

where learning is driven by differences in presynaptic trace ( $x_{pre}$ ) and target ( $x_{tar}$ ), scaled by learning rate $\eta$ .

Population-rate-based STDP uses:

$h(t) = a H(t) + b$

with $H(t)$ as the instantaneous average spike rate, and the presynaptic trace given by

$x_{pre}^{(x,p)} = \Sigma_i h(t^{(x,p)}_i)$

A plausible implication is that these benchmarks do not exploit the unique computational properties of SNNs, such as spike timing-dependent plasticity.

6. Dataset Design Considerations and Limitations

Key challenges in neuromorphic dataset construction include the inability of event-based sensors to respond to static scenes and the introduction of timing artifacts via simulated monitor motion. Moving the sensor itself ensures continuous motion and biological realism, eliminating spurious frequency peaks inherent in monitor-driven datasets (e.g., MNIST-DVS).

However, critical analysis suggests that N-Caltech101 and N-CARS, generated from static images using fixed sensor trajectories, do not contain features truly encoded in the temporal domain. Their discriminative power derives largely from spatial, rate-coded information that is recoverable by conventional methods. This limits the ability of SNNs to demonstrate algorithmic superiority exploiting spike timing. The conclusion drawn is that inherently dynamic environments—with temporal features not reducible to static representations—are necessary for benchmarks that showcase spiking neural networks’ full capabilities.

7. Implications for Neuromorphic Vision Research

The creation of N-Caltech101 and N-CARS established event-based benchmarks enabling direct comparison with conventional computer vision algorithms and highlighted the technical requirements for dataset construction: precise actuation, avoidance of timing artifacts, and biological plausibility. Nonetheless, research emphasizes that future benchmarks should ensure that discriminative features are inherently temporal, not artifacts of static image conversion (Orchard et al., 2015, Iyer et al., 2018). The datasets serve as baselines for algorithmic development, but their use should be contextualized within the broader quest for time-encoded neuromorphic benchmarks.