EV-FlowNet: Event-based Optical Flow
- EV-FlowNet is a self-supervised deep learning pipeline for estimating dense optical flow from event-based cameras using a compact event-tensor and U-Net-inspired encoder–decoder.
- It employs multi-scale flow heads and self-supervised photometric and smoothness losses to refine predictions without requiring ground-truth flow.
- Extensions with recurrent and spiking neural architectures enable temporally dense (up to 100 Hz) and energy-efficient flow estimation for high-speed vision.
EV-FlowNet is a self-supervised deep learning pipeline for dense optical flow estimation from event-based cameras, specifically designed to address the unique data modalities and constraints associated with asynchronous event streams. Its formulation includes a compact event-tensor encoding, a U-Net-inspired network architecture, and a self-supervised training regimen based on grayscale image supervision. EV-FlowNet served as the foundational model for subsequent sequential and spiking neural extensions that enabled temporally dense (e.g., 100 Hz) flow estimation and highly efficient low-power inference.
1. Event Data Representation
Event cameras emit asynchronous events $e = (x, y, t, p)$, where $(x, y)$ is the pixel location, $t$ is the timestamp, and $p \in \{+1, -1\}$ indicates polarity (increasing or decreasing log-intensity). An event is triggered by a threshold change in log-intensity:

$$\left| \log I(x, y, t) - \log I(x, y, t - \Delta t) \right| \geq \theta$$
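As a sketch, the per-pixel trigger condition can be written as follows (the threshold value is illustrative; real sensors expose it as a hardware bias):

```python
def check_event(log_I_prev, log_I_curr, theta=0.2):
    """Return the polarity of a triggered event, or 0 if none fires.

    theta is an illustrative contrast threshold, not a sensor constant.
    """
    delta = log_I_curr - log_I_prev
    if abs(delta) >= theta:
        return 1 if delta > 0 else -1   # +1: brightness increase, -1: decrease
    return 0
```

A pixel thus emits a stream of +1/-1 events only when its log-intensity moves by at least the contrast threshold, which is what makes the output sparse and asynchronous.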
EV-FlowNet aggregates the raw event stream over a window $[t_0, t_1]$ into a fixed-size, image-like tensor with four channels per pixel:
- $c_+(x, y)$: total count of positive events at $(x, y)$
- $c_-(x, y)$: total count of negative events at $(x, y)$
- $\tau_+(x, y)$: normalized timestamp of the last positive event at $(x, y)$
- $\tau_-(x, y)$: normalized timestamp of the last negative event at $(x, y)$

Formally, for polarity $p \in \{+, -\}$,

$$c_p(x, y) = \sum_{e_i :\, x_i = (x, y),\, p_i = p} 1, \qquad \tau_p(x, y) = \frac{\max_{e_i :\, x_i = (x, y),\, p_i = p} t_i - t_0}{t_1 - t_0}.$$

This encoding preserves spatial locality, event frequency, and recency, allowing downstream convolutional networks to process events as pseudo-images. Both $\tau_+$ and $\tau_-$ are clipped to $[0, 1]$ for range compatibility.
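As a concrete sketch, the four channels can be assembled from a raw event list as follows (function and argument names are illustrative, not the original implementation's API):

```python
import numpy as np

def event_tensor(events, t0, t1, H, W):
    """Build a 4-channel EV-FlowNet-style event image.

    `events` is an iterable of (x, y, t, p) tuples with p in {+1, -1};
    channels 0/1 hold positive/negative counts, channels 2/3 hold the
    normalized timestamp of the most recent event of each polarity.
    """
    tensor = np.zeros((4, H, W), dtype=np.float32)
    for x, y, t, p in events:
        x, y = int(x), int(y)
        ch = 0 if p > 0 else 1                 # count channel by polarity
        tensor[ch, y, x] += 1.0                # per-pixel event count
        ts = np.clip((t - t0) / (t1 - t0), 0.0, 1.0)  # normalized timestamp
        ch_t = 2 if p > 0 else 3               # timestamp channel by polarity
        tensor[ch_t, y, x] = max(tensor[ch_t, y, x], ts)  # keep most recent
    return tensor
```

Keeping the maximum normalized timestamp per pixel makes the recency channels independent of event ordering within the window.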
2. Network Architecture
EV-FlowNet utilizes a multi-scale encoder–decoder structure resembling U-Net to predict per-pixel 2D flow vectors $\mathbf{f} = (u, v)$. The architecture includes:
- Encoder: Four strided 2D convolutions ($3 \times 3$, stride 2, padding 1), doubling channels at each stage: $64 \to 128 \to 256 \to 512$.
- Bottleneck: Two residual blocks, both with 512 channels: $h \mapsto h + \mathcal{F}(h)$.
- Decoder: Four upconvolutional stages. Each upsamples by $2\times$, applies a convolution, and halves channels ($512 \to 256 \to 128 \to 64$). Skip connections from the corresponding encoder stages are concatenated channel-wise.
- Multi-scale Flow Heads: At each decoder level, a 2-channel "flow head" predicts flow for that scale, passed through a $\tanh$ activation. Multi-scale flows are upsampled for loss computation and concatenated into subsequent decoder inputs for refinement.

The forward pass can be summarized schematically (with $e_s$ the encoder features, $d_s$ the decoder features, and $f_s$ the flow at scale $s$; $[\cdot\,;\cdot]$ denotes channel concatenation):

$$d_s = \mathrm{Upconv}_s\big([\,d_{s+1};\, e_s;\, f_{s+1}\,]\big), \qquad f_s = \tanh\big(\mathrm{Conv}(d_s)\big)$$
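The channel and resolution progression of this encoder–decoder can be traced with a small bookkeeping sketch (the channel widths follow the description above and are assumptions, not the official model specification):

```python
def evflownet_shapes(h=256, w=256, base=64, stages=4):
    """Trace feature-map shapes (channels, height, width) per stage.

    Each strided encoder conv halves spatial size and doubles channels;
    the bottleneck keeps the deepest width; each decoder upconv doubles
    spatial size and halves channels.
    """
    enc, c = [], base
    for _ in range(stages):
        h, w = h // 2, w // 2
        enc.append((c, h, w))
        c *= 2
    c = enc[-1][0]            # bottleneck residual blocks keep 512 channels
    dec = []
    for _ in range(stages):
        h, w = h * 2, w * 2
        c //= 2
        dec.append((c, h, w))
    return enc, dec
```

For a $256 \times 256$ crop, this yields encoder features down to $512 \times 16 \times 16$ and decoder features back up to full resolution, where the finest flow head operates.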
3. Self-Supervised Loss Functions
EV-FlowNet is trained without access to ground-truth flow, instead using a self-supervised approach leveraging frame-based grayscale images synchronized with the events.
- Photometric Loss: The network-predicted flow $(u, v)$ is used to warp frame $I_{t+1}$ to $I_t$, with a per-pixel brightness-constancy loss:

$$\mathcal{L}_{\text{photo}} = \sum_{x, y} \rho\big(I_{t+1}(x + u(x, y),\, y + v(x, y)) - I_t(x, y)\big)$$

The penalty $\rho$ uses the generalized Charbonnier function:

$$\rho(x) = (x^2 + \epsilon^2)^{\alpha}$$
- Smoothness Loss: A first-order neighborhood term penalizes local flow variation:

$$\mathcal{L}_{\text{smooth}} = \sum_{x, y} \sum_{(x', y') \in \mathcal{N}(x, y)} \rho\big(u(x, y) - u(x', y')\big) + \rho\big(v(x, y) - v(x', y')\big)$$

where $\mathcal{N}(x, y)$ denotes the 8-connected neighboring pixels.
- Total Multi-scale Loss: Each intermediate and final flow estimate is supervised by appropriately downsampled frames, with the total loss summed over scales $s$:

$$\mathcal{L} = \sum_{s} \mathcal{L}_{\text{photo}}^{(s)} + \lambda\, \mathcal{L}_{\text{smooth}}^{(s)}$$
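A minimal NumPy sketch of these losses, using a nearest-neighbor warp in place of the bilinear sampling typically used in practice, and illustrative values for $\epsilon$ and $\alpha$:

```python
import numpy as np

def charbonnier(x, eps=1e-3, alpha=0.45):
    # generalized Charbonnier penalty; eps and alpha are illustrative values
    return (x ** 2 + eps ** 2) ** alpha

def photometric_loss(I0, I1, flow):
    """Brightness-constancy loss: warp I1 back to I0 with the flow.

    Nearest-neighbor sampling is a simplification of the differentiable
    bilinear warp used in training.
    """
    H, W = I0.shape
    ys, xs = np.mgrid[0:H, 0:W]
    xw = np.clip(np.round(xs + flow[0]).astype(int), 0, W - 1)
    yw = np.clip(np.round(ys + flow[1]).astype(int), 0, H - 1)
    return charbonnier(I1[yw, xw] - I0).sum()

def smoothness_loss(flow):
    """First-order smoothness on u and v via horizontal/vertical neighbor
    differences (a 4-connected subset of the 8-neighborhood)."""
    loss = 0.0
    for c in (0, 1):
        loss += charbonnier(np.diff(flow[c], axis=0)).sum()
        loss += charbonnier(np.diff(flow[c], axis=1)).sum()
    return loss
```

With identical frames and zero flow both terms reduce to the small constant floor $(\epsilon^2)^{\alpha}$ per pixel, which is what makes the penalty differentiable at zero.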
4. Training and Evaluation Protocol
Training Pipeline:
- Data: MVSEC “outdoor_day1” and “outdoor_day2” (DAVIS sensor: events, frames, pose/depth ground truth).
- Event window: For consecutive frames $I_t$ and $I_{t+1}$, all events in $[t, t+1]$ are aggregated for input.
- Augmentation: Random horizontal flips, crop to $256 \times 256$.
- Optimization: Adam with an exponentially decayed learning rate (factor 0.8 every 4 epochs), generalized Charbonnier penalty with parameters $\epsilon$ and $\alpha$, smoothness weight $\lambda$, trained for 300k iterations (about 12 hours on a V100 16 GB).
Evaluation Metrics:
- Average Endpoint Error (AEE):

$$\text{AEE} = \frac{1}{|\mathcal{P}|} \sum_{(x, y) \in \mathcal{P}} \big\| \mathbf{f}_{\text{pred}}(x, y) - \mathbf{f}_{\text{gt}}(x, y) \big\|_2$$

($\mathcal{P}$ is the set of active, valid pixels)
- Outlier Rate: percentage of pixels with endpoint error $> 3$ px and $> 5\%$ of the ground-truth flow magnitude.
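Both metrics can be computed in a few lines (a sketch; the array layout is an assumption):

```python
import numpy as np

def flow_metrics(pred, gt, mask):
    """AEE and outlier rate over active/valid pixels.

    pred, gt: (2, H, W) flow fields; mask: (H, W) boolean validity map.
    The 3 px / 5% outlier rule follows the KITTI-style definition.
    """
    ee = np.linalg.norm(pred - gt, axis=0)[mask]        # per-pixel endpoint error
    mag = np.linalg.norm(gt, axis=0)[mask]              # ground-truth magnitude
    aee = ee.mean()
    outliers = np.mean((ee > 3.0) & (ee > 0.05 * mag))  # fraction of bad pixels
    return aee, outliers * 100.0
```

Masking to active pixels matters for event cameras, since flow is only meaningful where events (and valid depth) exist.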
Ground-truth Flow: Generated from pose/depth by projecting the egomotion-induced motion field into the image; for a pixel $\mathbf{x}$ with scene depth $Z(\mathbf{x})$, camera translational velocity $\mathbf{v}$, and angular velocity $\boldsymbol{\omega}$,

$$\dot{\mathbf{x}} = \frac{1}{Z(\mathbf{x})}\, A(\mathbf{x})\, \mathbf{v} + B(\mathbf{x})\, \boldsymbol{\omega},$$

where $A$ and $B$ are the standard translational and rotational motion-field matrices, and the flow is this image velocity scaled by the frame interval.
Results (MVSEC outdoor_day1, dt=1 and dt=4):
| Model | AEE (dt=1) | Outliers (dt=1) | AEE (dt=4) | Outliers (dt=4) |
|---|---|---|---|---|
| UnFlow | 0.97 px | 1.6% | 2.95 px | 40.0% |
| EV-FlowNet | 0.49 px | 0.2% | 1.23 px | 7.3% |
EV-FlowNet achieves lower AEE and outlier rates than image-based UnFlow, especially at the larger time interval ($dt = 4$ frames).
5. Extensions: Temporally Dense and Sequential Flow Estimation
Recent developments have extended the EV-FlowNet paradigm to recurrent neural architectures for temporally dense (e.g., 100 Hz) flow estimation (Ponghiran et al., 2022).
- Sequential Event Slicing: Rather than aggregating all events over a fixed frame interval, slice the stream into fine-grained (e.g., 10 ms) windows, yielding input rates up to 100 Hz.
- Recurrent Architectures:
- LSTM-FlowNet: A U-Net structure where all convolutions are replaced by ConvLSTM layers. Flow is predicted from the decoder's finest-scale hidden state at each timestep.
- EfficientSpike-FlowNet: All nonlinearities are replaced by leaky integrate-and-fire (LIF) neurons. Membrane potentials update as:

$$V_t = \lambda V_{t-1} + I_t - V_{\text{th}}\, o_{t-1}, \qquad o_t = H(V_t - V_{\text{th}}),$$

where $H(\cdot)$ is the Heaviside step function, $\lambda$ the leak factor, $I_t$ the input current, and $V_{\text{th}}$ the firing threshold.
- Loss Function: Pure supervised loss on flow against (possibly linearly interpolated) ground truth, e.g. an endpoint-error term of the form

$$\mathcal{L}_{\text{sup}} = \frac{1}{|\mathcal{P}|} \sum_{(x, y) \in \mathcal{P}} \big\| \mathbf{f}_{\text{pred}}(x, y) - \mathbf{f}_{\text{gt}}(x, y) \big\|_2$$
- Key Training Technique: Warm-up frames + truncated BPTT to stabilize long recurrent inference.
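The LIF dynamics above can be simulated directly over a sequence of sliced event windows; the soft-reset convention used here (subtracting the threshold on a spike) is one common choice and an assumption about the actual model:

```python
import numpy as np

def lif_forward(inputs, leak=0.9, v_th=1.0):
    """Simulate a leaky integrate-and-fire neuron over input currents.

    leak and v_th are illustrative hyperparameters; the reset rule in
    EfficientSpike-FlowNet may differ.
    """
    v = np.zeros_like(inputs[0])
    spikes = []
    for i_t in inputs:
        v = leak * v + i_t               # leaky integration of input current
        o = (v >= v_th).astype(float)    # Heaviside spike generation
        v = v - v_th * o                 # soft reset after firing
        spikes.append(o)
    return np.stack(spikes)
```

A constant sub-threshold input produces spikes only intermittently, which is the sparsity that event-driven hardware exploits for energy savings.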
Comparison (DSEC dataset):
| Model | Rate | AEE |
|---|---|---|
| EV-FlowNet (baseline) | 10 Hz | 0.67 |
| LSTM-FlowNet (proposed) | 100 Hz | 0.60 |
| EfficientSpike-FlowNet | 100 Hz | 2.66 |
LSTM-FlowNet attains roughly 10% lower AEE than baseline EV-FlowNet (0.60 vs. 0.67) while operating at 10× higher temporal resolution.
Efficiency:
- EV-FlowNet (baseline): $16.6$M parameters; serves as the reference for normalized compute cost.
- LSTM-FlowNet: $53.6$M parameters; markedly higher energy per inference, compounded further when run at $100$ Hz.
- EfficientSpike-FlowNet: $16.6$M parameters; a small fraction of the LSTM energy at $100$ Hz, leveraging sparse event-driven computation.
6. Limitations and Prospective Directions
Identified limitations and potential enhancements include:
- Timestamp Saturation: Rapid event bursts can overwrite the timestamp channels, erasing the recency information of earlier events and reducing discriminability.
- Event Sparsity: If the event window is too small, too few events may accumulate to provide a sufficient input signal.
- Supervisory Dependency: The reliance on frame-based photometric loss inherits brightness-constancy assumptions and occlusion sensitivity.
Suggested extensions focus on:
- Incorporating event-only loss functions, such as motion-compensated event alignment, to obviate the need for grayscale supervision.
- Broadening self-supervised paradigms to other event-based domains (e.g., depth, egomotion, semantic segmentation) using event-specific consistency or reconstruction constraints.
- Developing richer, possibly learned, event tensor embeddings and spatio-temporal surfaces to encode more nuanced event histories.
7. Impact and Research Context
EV-FlowNet demonstrated that effective optical flow estimation from events can be achieved via a compact 4-channel summary and a multi-scale U-Net architecture trained with self-supervised photometric and smoothness losses (Zhu et al., 2018). Its approach outperformed frame-based optical flow baselines on event camera data and inspired sequential and neuromorphic network extensions (e.g., ConvLSTM, SNNs) supporting temporally dense and energy-efficient inference (Ponghiran et al., 2022). The continued development of event-driven learning paradigms, recurrent representations, and event-only supervisory signals is enabling responsive and power-efficient vision pipelines for high-speed and high-dynamic-range sensing.