EDnCNN: Event Denoising CNNs
- EDnCNN is a specialized CNN model that uses event probability masks to effectively remove noise from DVS event streams in real time.
- The architecture features three convolutional layers with batch normalization, ReLU activation, and dropout, enabling efficient, embedded hardware inference.
- Benchmark results on the DVSNOISE20 dataset show EDnCNN outperforming all baseline methods, reducing the Relative Plausibility Measure of Denoising (RPMD) by approximately 148 points relative to the raw, undenoised data.
Event Denoising Convolutional Neural Networks (EDnCNN) are specialized models designed for noise removal in event-based data generated by Dynamic Vision Sensors (DVS) and other neuromorphic cameras. The EDnCNN concept is intrinsically linked with the Event Probability Mask (EPM), a probabilistic labeling technique, and is benchmarked using DVSNOISE20, a dataset capturing real-world noise and event characteristics. EDnCNN achieves state-of-the-art denoising performance, operates efficiently on high-throughput data streams, and is suitable for real-time inference on embedded hardware (Baldwin et al., 2020).
1. The Event Probability Mask (EPM) and Mathematical Framework
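EDnCNN leverages the Event Probability Mask (EPM) to define ground-truth event plausibility based on scene dynamics and sensor parameters. For a DVS event stream

$$\{(t_i(\mathbf X),\, p_i(\mathbf X))\}_i,$$

where $t_i(\mathbf X)$ is the timestamp and $p_i(\mathbf X)$ the polarity of the $i$-th event at pixel $\mathbf X$, the EPM describes the likelihood that at least one event occurs at pixel $\mathbf X$ in the interval $[t, t+\tau)$. Introducing the indicator variable

$$E(\mathbf X,t)= \begin{cases} 1, & \text{if } \exists\, i:\ t_i(\mathbf X)\in[t,\,t+\tau), \\ 0, & \text{otherwise,} \end{cases}$$

the EPM is formally

$$M(\mathbf X,t)=\Pr\bigl[E(\mathbf X,t)=1\bigr].$$

In a noise-free DVS model with optical flow $\mathbf V$ and contrast threshold $\varepsilon$, Theorem 1 provides a closed form:

$$M(\mathbf X,t)= \begin{cases} \dfrac{\tau\,\bigl|J_{t}(\mathbf X,t)\bigr|}{\varepsilon}, & \text{if }|J_t(\mathbf X,t)|<\dfrac{\varepsilon}{\tau}, \\[1em] 1, & \text{otherwise,} \end{cases}$$

where $J(\mathbf X,t)$ is the log-intensity, and its temporal derivative

$$J_t(\mathbf X,t)\approx-\nabla J(\mathbf X,t)\cdot\mathbf V(\mathbf X,t)$$

is computed given scene flow $\mathbf V$. Under rotational motion, camera IMU and APS data enable explicit EPM calculation, forming the supervisory signal for EDnCNN (Baldwin et al., 2020).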
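As a minimal sketch of the closed form in Theorem 1, the NumPy snippet below evaluates the EPM from a log-intensity image and a flow field. It assumes the brightness-constancy relation $J_t \approx -\nabla J \cdot \mathbf V$ for the temporal derivative; the function names, argument order, and array layouts are illustrative choices and are not taken from Baldwin et al. (2020).

```python
import numpy as np

def temporal_derivative(J, flow):
    """Approximate J_t = -(grad J . V) under brightness constancy.

    J    : (H, W) log-intensity image
    flow : (H, W, 2) optical flow in pixels per second, ordered (vx, vy)
    """
    # np.gradient returns derivatives along axis 0 (y) first, then axis 1 (x).
    dJ_dy, dJ_dx = np.gradient(J)
    return -(dJ_dx * flow[..., 0] + dJ_dy * flow[..., 1])

def event_probability_mask(J_t, tau, eps):
    """Closed-form EPM: tau * |J_t| / eps, saturating at 1."""
    return np.minimum(tau * np.abs(J_t) / eps, 1.0)
```

For a horizontal intensity ramp moving at constant speed, `event_probability_mask` returns a uniform value that grows linearly with the window length `tau` until it saturates at 1.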
2. EDnCNN Model Architecture and Feature Construction
Input features are constructed via "time-surfaces." For each incoming event at $(\mathbf X, t)$, an $m \times m$ spatial patch centered at $\mathbf X$ is considered. For each polarity and each of the $k$ most recent events at each pixel, the relative timestamp difference $t - t_i$ is computed, yielding a 4D tensor of size $m \times m \times k \times 2$.
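The feature construction can be sketched as a per-pixel buffer of recent timestamps plus a patch lookup. The class below is a hypothetical illustration: the names `TimestampBuffer`, `m`, `k`, and `max_dt`, the newest-first slot order, and the clipping of missing history to `max_dt` are all assumptions rather than details from the source.

```python
import numpy as np

class TimestampBuffer:
    """Keeps the k most recent event timestamps per pixel and polarity."""

    def __init__(self, height, width, k):
        # Shape (H, W, k, 2); -inf marks "no event seen yet".
        self.t = np.full((height, width, k, 2), -np.inf)

    def update(self, x, y, t, p):
        """Insert an event; p is 0 (OFF) or 1 (ON). Slot 0 holds the newest."""
        self.t[y, x, 1:, p] = self.t[y, x, :-1, p].copy()
        self.t[y, x, 0, p] = t

    def patch(self, x, y, t, m, max_dt):
        """Return the (m, m, k, 2) tensor of t - t_recent around (x, y).

        m is assumed odd. Pixels outside the sensor, or without history,
        are reported as max_dt (an assumed clipping value).
        """
        r = m // 2
        # Padding on every call is simple but slow; fine for a sketch.
        padded = np.pad(self.t, ((r, r), (r, r), (0, 0), (0, 0)),
                        constant_values=-np.inf)
        window = padded[y:y + m, x:x + m]
        return np.minimum(t - window, max_dt)
```

A caller would invoke `update` for every event and `patch` whenever a feature tensor is needed for classification.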
The EDnCNN is a shallow, non-residual convolutional backbone:
- Three convolutional layers: each with stride 1 and appropriate padding; the kernel size and per-layer feature-map counts are as specified in Baldwin et al. (2020).
- Activation and normalization: each convolutional block is followed by ReLU activation, Batch Normalization, and Dropout regularization.
- Classification head: after flattening, two fully connected layers (the first with ReLU, the final with sigmoid or softmax) produce a scalar output $\hat{y} \in [0, 1]$, thresholded at 0.5 for binary event retention.
The architecture emphasizes computational efficiency and is optimized for high-throughput inference, enabling real-time deployment on embedded platforms (Baldwin et al., 2020).
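To make the layer ordering concrete, here is a NumPy-only sketch of an inference-time forward pass through a network of this shape. Every size in it (3×3 kernels, channel counts of 8, 16, and 32, a hidden width of 64) is a placeholder chosen for illustration and is not the published configuration; the weights are random, batch normalization is reduced to a per-channel affine map, and dropout is omitted because it is inactive at inference.

```python
import numpy as np

def conv2d_same(x, w, b):
    """Stride-1 convolution with zero padding that preserves H and W.

    x: (H, W, Cin), w: (kh, kw, Cin, Cout) with odd kh and kw, b: (Cout,)
    """
    kh, kw, _, c_out = w.shape
    H, W, _ = x.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2), (0, 0)))
    out = np.zeros((H, W, c_out))
    for i in range(kh):
        for j in range(kw):
            out += xp[i:i + H, j:j + W, :] @ w[i, j]
    return out + b

def make_params(rng, m, in_ch, channels=(8, 16, 32), kernel=3, hidden=64):
    """Random weights; all sizes are placeholders, not the published ones."""
    params, c_in = {"conv": []}, in_ch
    for c_out in channels:
        params["conv"].append({
            "w": rng.normal(0, 0.1, (kernel, kernel, c_in, c_out)),
            "b": np.zeros(c_out),
            "bn_scale": np.ones(c_out),
            "bn_shift": np.zeros(c_out),
        })
        c_in = c_out
    params["fc1"] = (rng.normal(0, 0.1, (m * m * c_in, hidden)), np.zeros(hidden))
    params["fc2"] = (rng.normal(0, 0.1, (hidden, 1)), np.zeros(1))
    return params

def forward(params, patch):
    """patch: (m, m, k, 2) time-surface; returns a plausibility in (0, 1)."""
    m = patch.shape[0]
    x = patch.reshape(m, m, -1)          # fold (k, 2) into channels
    for layer in params["conv"]:
        x = conv2d_same(x, layer["w"], layer["b"])
        x = np.maximum(x, 0.0)           # ReLU
        x = x * layer["bn_scale"] + layer["bn_shift"]  # batch norm as affine map
    w1, b1 = params["fc1"]
    w2, b2 = params["fc2"]
    h = np.maximum(x.reshape(-1) @ w1 + b1, 0.0)
    logit = (h @ w2 + b2)[0]
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid output
```

An event would be retained when `forward(params, patch) > 0.5`, matching the thresholding rule of the classification head.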
3. Training Paradigm, Objectives, and Regularization
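Training is grounded in three mathematically equivalent formulations, with pixelwise supervision derived from the EPM:
- Plausibility maximization: optimize the expected compatibility of predictions $\hat{y}$ with the soft label $M$,

$$\max_{\hat{y}} \sum_{(\mathbf X,t)} \hat{y}\,\log M + (1-\hat{y})\,\log(1-M).$$

- $\ell_1$ loss to soft label:

$$\min_{\hat{y}} \sum_{(\mathbf X,t)} \bigl|\hat{y} - M\bigr|.$$

- Binary classification of hard labels: converting $M$ into $E_{\mathrm{opt}} = \mathbb{1}[M > 0.5]$ and minimizing a norm-based or cross-entropy loss.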
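The equivalence of the three formulations can be checked numerically on a toy example. The sketch below assumes the plausibility takes the Bernoulli log-likelihood form $\sum \hat{y}\log M + (1-\hat{y})\log(1-M)$ and that hard labels come from thresholding $M$ at 0.5; the helper names and the soft-label values are made up for illustration.

```python
import itertools
import numpy as np

def log_plausibility(y, M):
    """Bernoulli log-likelihood of binary decisions y under soft labels M."""
    return float(np.sum(y * np.log(M) + (1 - y) * np.log(1 - M)))

def l1_distance(y, M):
    """L1 distance between binary decisions y and soft labels M."""
    return float(np.sum(np.abs(y - M)))

def brute_force(M, objective, pick):
    """Search all binary labelings; pick is max or min."""
    candidates = [np.array(c) for c in itertools.product([0, 1], repeat=M.size)]
    return pick(candidates, key=lambda y: objective(y, M))

M = np.array([0.9, 0.2, 0.6, 0.05])      # toy soft labels inside (0, 1)
by_plausibility = brute_force(M, log_plausibility, max)
by_l1 = brute_force(M, l1_distance, min)
by_threshold = (M > 0.5).astype(int)
print(by_plausibility, by_l1, by_threshold)
# prints: [1 0 1 0] [1 0 1 0] [1 0 1 0]
```

All three routes select the same binary labeling because each per-pixel term is optimized by choosing 1 exactly when $M > 0.5$.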
The network is optimized using Adam, with the starting learning rate (value as reported in Baldwin et al., 2020) decayed by a factor of 0.1. Each convolutional layer employs Dropout and Batch Normalization. Training employs leave-one-scene-out cross-validation on DVSNOISE20. The ground-truth labels are calibrated for each scene via MLE estimates of the DVS threshold $\varepsilon$ and APS offset $O$ (Baldwin et al., 2020).
4. DVSNOISE20 Dataset and Label Generation
DVSNOISE20 is a curated dataset for event denoising evaluation:
- Hardware: DAVIS346 (346×260 px, 120 dB DVS dynamic range), 40 fps APS, and 1 kHz IMU, mounted on a rotational gimbal to restrict camera motion to pure rotation.
- Data composition: 16 diverse scenes (indoor and outdoor), each filmed three times for about 16 seconds, yielding 48 sequences.
- Calibration and labeling: APS images undergo fixed-pattern correction; the contrast threshold $\varepsilon$ and offset $O$ are estimated per sequence by MLE. EPMs are computed using APS spatial gradients and IMU angular velocities, providing pixelwise soft ground truth for supervision.
- Noise sources: background activity (spurious single-pixel events), holes from missed genuine events, and event timing/magnitude jitter are all represented (Baldwin et al., 2020).
5. Performance Evaluation and Benchmarking
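The Relative Plausibility Measure of Denoising (RPMD) is the primary benchmark:

$$\mathrm{RPMD} = \frac{1}{N}\,\log\frac{\Pr\bigl[E_{\mathrm{opt}} \mid M\bigr]}{\Pr\bigl[E \mid M\bigr]},$$

where $E$ is the binary event indicator of the denoised output, $E_{\mathrm{opt}} = \mathbb{1}[M > 0.5]$ is the most plausible labeling under the EPM $M$, and $N$ normalizes by the number of pixels evaluated. Lower values denote better denoising and zero is optimal.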
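A minimal sketch of an RPMD-style score follows, assuming a per-pixel Bernoulli model for $\Pr[E \mid M]$, an optimal labeling obtained by thresholding $M$ at 0.5, and normalization by the number of pixels. The function name `rpmd` and the clipping constant are illustrative, and any additional scaling used for the published numbers is not reproduced here.

```python
import numpy as np

def rpmd(E, M, clip=1e-12):
    """(1/N) * [log Pr(E_opt | M) - log Pr(E | M)] under a Bernoulli model.

    E : binary array marking where the denoised output contains an event
    M : event probability mask with the same shape as E
    """
    E = np.asarray(E, dtype=float)
    M = np.clip(np.asarray(M, dtype=float), clip, 1.0 - clip)  # avoid log(0)
    E_opt = (M > 0.5).astype(float)

    def log_likelihood(e):
        return np.sum(e * np.log(M) + (1.0 - e) * np.log(1.0 - M))

    return (log_likelihood(E_opt) - log_likelihood(E)) / M.size
```

The score is zero when the denoised output agrees with the thresholded mask everywhere and grows as the output places events where the mask says they are implausible, or drops events the mask supports.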
Empirical results averaged across 16 scenes are summarized below:
| Method | Mean RPMD ↓ | Scenes Best (out of 16) |
|---|---|---|
| Raw (noisy) | ≈ 200 | 0 |
| FSAE [31] | ≈ 215 | 0 |
| IE [9] | ≈ 205 | 0 |
| IE + TE [9] | ≈ 130 | 1 (LabFast) |
| BAF [24] | ≈ 100 | 3 |
| NN [37] | ≈ 105 | 3 |
| NN2 [29] | ≈ 110 | 3 |
| EDnCNN | ≈ 60 | 12 |
EDnCNN outperforms all baselines, including both traditional and NN-based methods, and achieves a mean reduction of ≈148 RPMD points versus raw data (p = 0.0025, Wilcoxon signed-rank). It demonstrates robustness in synthetic tests as background activity noise rates increase. Qualitative overlays emphasize retention of sharp event edges and effective suppression of isolated noise (Baldwin et al., 2020).
6. Implementation Aspects, Inference, and Limitations
EDnCNN inference requires only the DVS event stream, without reliance on APS or IMU during deployment. The network’s shallow configuration (three convolutional, two FC layers) enables real-time inference on embedded GPUs. Memory requirements scale as $O(2kWH)$ for a $W \times H$ sensor, to accommodate the $k$ most recent events per pixel per polarity.
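As a rough worked example of this scaling, the snippet below estimates the size of the timestamp buffer for the DAVIS346 resolution. The choices of `k=2` and 4-byte timestamps are assumptions made for the arithmetic only, and the helper name is hypothetical.

```python
def timestamp_buffer_bytes(width, height, k, bytes_per_timestamp=4, polarities=2):
    """Bytes needed to store k recent timestamps per pixel per polarity."""
    return width * height * k * polarities * bytes_per_timestamp

n = timestamp_buffer_bytes(346, 260, k=2)
print(f"{n} bytes = {n / 2**20:.2f} MiB")
# prints: 1439360 bytes = 1.37 MiB
```

Under these assumptions the buffer stays well within the memory of typical embedded GPUs, and it grows linearly in `k`.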
EPM-based labeling requires static scenes with rotation-only camera motion under steady lighting. Under extremely slow motion, small EPM values provide less confident supervision.
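Typical inference pipeline: for each incoming event, a time-surface patch is built around the event from a per-pixel buffer of recent timestamps, the network scores the patch, and the event is retained if the score exceeds 0.5; the buffer is maintained as events arrive.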
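The per-event loop can be sketched as a Python generator. This is an illustration under stated assumptions: `denoise_stream` and `score_fn` are hypothetical names, the event is scored before it is pushed into the buffer, rejected events are still recorded in the buffer, and the `neighbor_support` rule in the usage example is a simple stand-in for a trained network.

```python
import numpy as np

def denoise_stream(events, height, width, score_fn, m=5, k=2, threshold=0.5):
    """Yield the events whose predicted plausibility exceeds `threshold`.

    events   : iterable of (x, y, t, p) with p in {0, 1}, sorted by t
    score_fn : maps an (m, m, k, 2) feature tensor to a value in [0, 1]
    """
    r = m // 2
    # Padded buffer of the k most recent timestamps per pixel and polarity;
    # pixel (x, y) is stored at [y + r, x + r], and -inf means "no event yet".
    recent = np.full((height + 2 * r, width + 2 * r, k, 2), -np.inf)
    for x, y, t, p in events:
        window = recent[y:y + m, x:x + m]      # patch centred on (x, y)
        features = t - window                  # +inf where no history exists
        if score_fn(features) > threshold:
            yield (x, y, t, p)
        # Record the event after scoring (an assumed ordering).
        recent[y + r, x + r, 1:, p] = recent[y + r, x + r, :-1, p].copy()
        recent[y + r, x + r, 0, p] = t

def neighbor_support(features):
    """Stand-in scorer: 1.0 if any nearby event occurred within 10 ms."""
    return 1.0 if np.any(features < 0.01) else 0.0

events = [(2, 2, 0.000, 1), (3, 2, 0.001, 1), (7, 7, 0.002, 1)]
kept = list(denoise_stream(events, 8, 8, neighbor_support))
print(kept)
# prints: [(3, 2, 0.001, 1)]
```

Only the second event survives in this toy stream, since it is the only one with a recent neighbor inside its patch; a trained classifier would replace `neighbor_support` without changing the loop.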
Once trained, EDnCNN infers event plausibility in arbitrary motion scenarios, decoupled from the APS and IMU sensors required only during supervised training (Baldwin et al., 2020). A plausible implication is that the model architecture and feature engineering are tuned for efficiency and hardware deployment scalability.
7. Related Methods and Context
EDnCNN’s performance is contextualized against baseline methods including FSAE, IE, TE, BAF, and various NN-based models. EDnCNN achieves the best RPMD in 12 out of 16 scenes when compared with traditional filtering and previous neural techniques. This suggests that EPM-driven supervision and a locally aware, timestamp-based feature representation confer significant advantages in real-world noise removal for event-based sensors. The architecture and evaluation protocol represent a reference standard for subsequent neuromorphic denoising research (Baldwin et al., 2020).