
EV-FlowNet: Event-based Optical Flow

Updated 12 February 2026
  • EV-FlowNet is a self-supervised deep learning pipeline for estimating dense optical flow from event-based cameras using a compact event-tensor and U-Net-inspired encoder–decoder.
  • It employs multi-scale flow heads and self-supervised photometric and smoothness losses to refine predictions without requiring ground-truth flow.
  • Extensions with recurrent and spiking neural architectures enable temporally dense (up to 100 Hz) and energy-efficient flow estimation for high-speed vision.

EV-FlowNet is a self-supervised deep learning pipeline for dense optical flow estimation from event-based cameras, specifically designed to address the unique data modalities and constraints associated with asynchronous event streams. Its formulation includes a compact event-tensor encoding, a U-Net-inspired network architecture, and a self-supervised training regimen based on grayscale image supervision. EV-FlowNet served as the foundational model for subsequent sequential and spiking neural extensions that enabled temporally dense (e.g., 100 Hz) flow estimation and highly efficient low-power inference.

1. Event Data Representation

Event cameras emit asynchronous events $e = \{x, t, p\}$, where $x = (u, v)$ is the pixel location, $t$ is the timestamp, and $p \in \{+1, -1\}$ indicates polarity (increasing or decreasing log-intensity). An event is triggered by a threshold change in log-intensity:

$$\left| \log I(x, t + \Delta t) - \log I(x, t) \right| \geq \theta$$

EV-FlowNet aggregates the raw event stream over a window $[t_0, t_1]$ into a fixed-size, image-like tensor with four channels per pixel:

  • $C^+(x)$: total count of positive events at $x$
  • $C^-(x)$: total count of negative events at $x$
  • $T^+(x)$: normalized timestamp of the most recent positive event at $x$
  • $T^-(x)$: normalized timestamp of the most recent negative event at $x$

Formally,

$$C^{\pm}(x) = \sum_{e_i \,:\, x_i = x,\; p_i = \pm 1} 1, \qquad T^{\pm}(x) = \max_{e_i \,:\, x_i = x,\; p_i = \pm 1} \frac{t_i - t_0}{t_1 - t_0}$$

This encoding preserves spatial locality, event frequency, and recency, allowing downstream convolutional networks to process events as pseudo-images. Both $T^+$ and $T^-$ are clipped to $[0, 1]$ for range compatibility.
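
To make the encoding concrete, below is a minimal NumPy sketch of this four-channel aggregation. The function and argument names are illustrative rather than taken from the paper's code, and a practical implementation would vectorize the loop with `np.add.at`:

```python
import numpy as np

def events_to_tensor(events, height, width, t0, t1):
    """Encode an event window [t0, t1] as the 4-channel EV-FlowNet image.

    `events` is an (N, 4) array of time-ordered (u, v, t, p) rows, p in {+1, -1}.
    Channels: [count+, count-, last-timestamp+, last-timestamp-].
    """
    out = np.zeros((4, height, width), dtype=np.float32)
    ts = (events[:, 2] - t0) / max(t1 - t0, 1e-9)    # normalize timestamps to [0, 1]
    for (u, v, _, p), tn in zip(events, ts):
        u, v, ch = int(u), int(v), 0 if p > 0 else 1
        out[ch, v, u] += 1.0                          # per-polarity event count
        out[ch + 2, v, u] = np.clip(tn, 0.0, 1.0)     # last write = most recent event
    return out
```

Because the events are processed in time order, the final write into each last-timestamp channel automatically keeps the most recent event, matching the max in the definition above.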

2. Network Architecture

EV-FlowNet utilizes a multi-scale encoder–decoder structure resembling U-Net to predict per-pixel 2D flow vectors $f(x) \in \mathbb{R}^2$. The architecture includes:

  • Encoder: Four strided 2D convolutions ($3 \times 3$, stride 2, padding 1), doubling the channel count at each stage: $64 \to 128 \to 256 \to 512$.
  • Bottleneck: Two residual blocks, both with 512 channels:

$$h \mapsto h + \mathrm{Conv}_{3 \times 3}\big(\mathrm{ReLU}(\mathrm{Conv}_{3 \times 3}(h))\big)$$

  • Decoder: Four upconvolutional stages. Each upsamples by $2\times$, applies a $3 \times 3$ convolution, and halves the channel count. Skip connections from the corresponding encoder stages are concatenated before each convolution.
  • Multi-scale Flow Heads: At each decoder level, a 2-channel flow head predicts the flow $f^{(s)}$ for that scale, passed through a $\tanh$ activation. The multi-scale flows are upsampled for loss computation and concatenated into subsequent decoder stages for refinement.

The forward pass can be summarized as

$$\{ f^{(s)} \}_{s=1}^{4} = \mathrm{Decode}\big( \mathrm{Res}_2( \mathrm{Res}_1( \mathrm{Encode}(E) ) ) \big),$$

where $E$ is the 4-channel event tensor and $f^{(s)}$ are the per-scale flow predictions.
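
The sketch below renders this layout in PyTorch under the channel progression stated above ($64 \to 128 \to 256 \to 512$ encoder, two 512-channel residual blocks, four decoder stages with 2-channel $\tanh$ flow heads). Kernel sizes, activation placement, upsampling mode, and flow scaling are simplifying assumptions, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def res_block(c):
    # 3x3 conv -> ReLU -> 3x3 conv; the skip addition happens in forward().
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(c, c, 3, padding=1))

class EVFlowNetSketch(nn.Module):
    def __init__(self, in_ch=4):
        super().__init__()
        enc_chs = [64, 128, 256, 512]
        self.encs = nn.ModuleList()
        c = in_ch
        for oc in enc_chs:                      # four strided convs, stride 2
            self.encs.append(nn.Sequential(
                nn.Conv2d(c, oc, 3, stride=2, padding=1), nn.ReLU(inplace=True)))
            c = oc
        self.res1, self.res2 = res_block(512), res_block(512)
        # Each decoder stage sees (upsampled features + previous flow) + skip.
        dec_chs = [256, 128, 64, 32]
        skips   = [256, 128, 64, 0]             # encoder channels concatenated per stage
        self.decs, self.heads = nn.ModuleList(), nn.ModuleList()
        c = 512
        for oc, sc in zip(dec_chs, skips):
            self.decs.append(nn.Sequential(
                nn.Conv2d(c + 2 + sc, oc, 3, padding=1), nn.ReLU(inplace=True)))
            self.heads.append(nn.Conv2d(oc, 2, 1))   # 2-channel flow head
            c = oc

    def forward(self, x):
        feats = []
        for enc in self.encs:
            x = enc(x)
            feats.append(x)
        x = x + self.res1(x)                    # two residual bottleneck blocks
        x = x + self.res2(x)
        flow = torch.zeros(x.size(0), 2, x.size(2), x.size(3), device=x.device)
        flows = []
        dec_skips = [feats[2], feats[1], feats[0], None]
        for dec, head, skip in zip(self.decs, self.heads, dec_skips):
            # Upsample features together with the previous (coarser) flow.
            x = F.interpolate(torch.cat([x, flow], 1), scale_factor=2)
            if skip is not None:
                x = torch.cat([x, skip], 1)     # skip connection from the encoder
            x = dec(x)
            flow = torch.tanh(head(x))          # per-scale flow through tanh
            flows.append(flow)
        return flows                            # coarse-to-fine multi-scale flows
```

Calling `EVFlowNetSketch()(torch.randn(1, 4, 256, 256))` returns four flow maps at 1/8, 1/4, 1/2, and full resolution, matching the multi-scale supervision described next.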

3. Self-Supervised Loss Functions

EV-FlowNet is trained without access to ground-truth flow, instead using a self-supervised approach leveraging frame-based grayscale images synchronized with the events.

  • Photometric Loss: The network-predicted flow is used to warp the frame $I_{t_1}$ back to $I_{t_0}$, with a per-pixel brightness-constancy loss:

$$\mathcal{L}_{\text{photo}} = \sum_{x} \rho\big( I_{t_1}(x + f(x)) - I_{t_0}(x) \big)$$

The penalty $\rho$ is the generalized Charbonnier function:

$$\rho(r) = \left( r^2 + \epsilon^2 \right)^{\alpha}$$

  • Smoothness Loss: A first-order neighborhood term penalizes local flow variation:

$$\mathcal{L}_{\text{smooth}} = \sum_{x} \sum_{y \in \mathcal{N}(x)} \rho\big( f(x) - f(y) \big)$$

where $\mathcal{N}(x)$ denotes the 8-connected neighbors of $x$.

  • Total Multi-scale Loss: Each intermediate and final flow estimate $f^{(s)}$ is supervised against correspondingly downsampled frames, giving the total loss

$$\mathcal{L} = \sum_{s} \left( \mathcal{L}_{\text{photo}}^{(s)} + \lambda\, \mathcal{L}_{\text{smooth}}^{(s)} \right)$$
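
A sketch of these three loss terms follows, assuming bilinear backward warping via `grid_sample` and placeholder Charbonnier parameters (`alpha`, `eps`, and the smoothness weight `lam` are illustrative, not the paper's exact values). For brevity, the smoothness term uses only the axis-aligned subset of the 8-connected neighborhood:

```python
import torch
import torch.nn.functional as F

def charbonnier(x, alpha=0.45, eps=1e-3):
    # Generalized Charbonnier penalty rho(x) = (x^2 + eps^2)^alpha.
    return (x * x + eps * eps) ** alpha

def warp(image, flow):
    """Backward-warp `image` (at t1) toward t0 with per-pixel `flow` (b, 2, h, w)."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=image.device),
                            torch.arange(w, device=image.device), indexing="ij")
    x_new = xs.unsqueeze(0) + flow[:, 0]          # sampled x coordinate per pixel
    y_new = ys.unsqueeze(0) + flow[:, 1]
    gx = 2.0 * x_new / (w - 1) - 1.0              # normalize to [-1, 1] for grid_sample
    gy = 2.0 * y_new / (h - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1)          # (b, h, w, 2)
    return F.grid_sample(image, grid, align_corners=True)

def photometric_loss(i0, i1, flow):
    # Brightness constancy: I_{t1} warped by the flow should match I_{t0}.
    return charbonnier(warp(i1, flow) - i0).mean()

def smoothness_loss(flow):
    # First-order differences between neighboring flow vectors.
    dx = flow[:, :, :, 1:] - flow[:, :, :, :-1]
    dy = flow[:, :, 1:, :] - flow[:, :, :-1, :]
    return charbonnier(dx).mean() + charbonnier(dy).mean()

def total_loss(flows, i0, i1, lam=0.5):
    # Multi-scale sum: frames are downsampled to each flow's resolution.
    # Note: a full implementation also rescales flow magnitudes per level.
    loss = 0.0
    for f in flows:
        i0_s = F.interpolate(i0, size=f.shape[-2:], mode="bilinear", align_corners=False)
        i1_s = F.interpolate(i1, size=f.shape[-2:], mode="bilinear", align_corners=False)
        loss = loss + photometric_loss(i0_s, i1_s, f) + lam * smoothness_loss(f)
    return loss
```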

4. Training and Evaluation Protocol

Training Pipeline:

  • Data: MVSEC “outdoor_day1” and “outdoor_day2” (DAVIS sensor: events, frames, pose/depth ground truth).
  • Event window: For each pair of consecutive frames at times $t_0$ and $t_1$, all events in $[t_0, t_1]$ are aggregated into the input tensor.
  • Augmentation: Random horizontal flips and random $256 \times 256$ crops.
  • Optimization: Adam with an exponentially decayed learning rate (factor 0.8 every 4 epochs), generalized Charbonnier penalty with exponent $\alpha$ and constant $\epsilon$, smoothness weight $\lambda$, trained for 300k iterations ($\sim$12 hours on a V100 16 GB).
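
A minimal sketch of this schedule (the model stand-in, initial learning rate, and epoch count are placeholders; only the 0.8-per-4-epochs decay mirrors the protocol above):

```python
import torch
import torch.nn as nn

# Adam with the learning rate multiplied by 0.8 every 4 epochs via StepLR.
model = nn.Conv2d(4, 2, 3, padding=1)   # stand-in for the flow network
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=4, gamma=0.8)

for epoch in range(12):
    # ... iterate over event/frame batches, compute total_loss, backward, opt.step() ...
    sched.step()                        # LR *= 0.8 at epochs 4, 8, 12
```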

Evaluation Metrics:

  • Average Endpoint Error (AEE):

$$\text{AEE} = \frac{1}{|\mathcal{P}|} \sum_{x \in \mathcal{P}} \big\| f_{\text{pred}}(x) - f_{\text{gt}}(x) \big\|_2$$

where $\mathcal{P}$ is the set of active, valid pixels.

  • Outlier Rate: the percentage of pixels with endpoint error $> 3$ px and $> 5\%$ of the ground-truth flow magnitude.

Ground-truth Flow: Generated from camera pose and depth by reprojection:

$$f_{\text{gt}}(x) = \pi\big( T_{t_0 \to t_1}\, \pi^{-1}(x, Z(x)) \big) - x$$

where $\pi$ is the camera projection, $Z(x)$ the depth at $x$, and $T_{t_0 \to t_1}$ the relative camera pose between the frames.
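
Both metrics can be computed directly from predicted and ground-truth flow fields plus a validity mask; a NumPy sketch with illustrative names:

```python
import numpy as np

def flow_metrics(pred, gt, valid):
    """AEE and outlier rate over the active, valid pixel set P.

    pred, gt: (H, W, 2) flow fields; valid: (H, W) boolean mask of pixels
    with events and valid ground truth.
    """
    err = np.linalg.norm(pred - gt, axis=-1)           # endpoint error per pixel
    mag = np.linalg.norm(gt, axis=-1)                  # ground-truth flow magnitude
    aee = err[valid].mean()                            # average endpoint error
    # Outliers: EE > 3 px and > 5% of the ground-truth magnitude.
    outlier = (err > 3.0) & (err > 0.05 * mag) & valid
    outlier_rate = 100.0 * outlier.sum() / max(valid.sum(), 1)
    return aee, outlier_rate
```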

Results (MVSEC outdoor_day1, $dt = 1$ and $dt = 4$ frames):

| Model | AEE ($dt{=}1$) | Outliers ($dt{=}1$) | AEE ($dt{=}4$) | Outliers ($dt{=}4$) |
|---|---|---|---|---|
| UnFlow | 0.97 px | 1.6% | 2.95 px | 40.0% |
| EV-FlowNet | 0.49 px | 0.2% | 1.23 px | 7.3% |

EV-FlowNet achieves lower AEE and outlier rates than the image-based UnFlow baseline, especially at the larger time window ($dt = 4$).

5. Extensions: Temporally Dense and Sequential Flow Estimation

Recent developments have extended the EV-FlowNet paradigm to recurrent neural architectures for temporally dense (e.g., 100 Hz) flow estimation (Ponghiran et al., 2022).

  • Sequential Event Slicing: Rather than aggregating all events over a fixed frame interval, the stream is sliced into fine-grained (e.g., 10 ms) windows, yielding input rates of up to 100 Hz.
  • Recurrent Architectures:

    • LSTM-FlowNet: A U-Net structure where all convolutions are replaced by ConvLSTM layers. Flow is predicted from the decoder's finest-scale hidden state at each timestep.
    • EfficientSpike-FlowNet: All nonlinearities are replaced by leaky integrate-and-fire (LIF) neurons (see the sketch after this list), with membrane potentials updating as

$$V[t] = \lambda\, V[t-1] + I[t] - V_{\text{th}}\, O[t-1], \qquad O[t] = H\big( V[t] - V_{\text{th}} \big)$$

    where $H$ is the Heaviside step function, $\lambda$ the leak factor, $I[t]$ the synaptic input, and $V_{\text{th}}$ the firing threshold.

  • Loss Function: A purely supervised regression loss on the predicted flow against (possibly linearly interpolated) ground truth:

$$\mathcal{L}_{\text{sup}} = \sum_{x} \big\| f_{\text{pred}}(x) - f_{\text{gt}}(x) \big\|$$

  • Key Training Technique: Warm-up frames + truncated BPTT to stabilize long recurrent inference.
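
As referenced in the EfficientSpike-FlowNet item above, here is a minimal PyTorch sketch of the LIF update, with illustrative leak and threshold constants and a reset-by-subtraction convention; training through the non-differentiable Heaviside typically requires a surrogate gradient, which this inference-only sketch omits:

```python
import torch

def lif_step(v, x, o_prev, leak=0.9, v_th=1.0):
    """One leaky integrate-and-fire update matching the membrane equation above.

    v: membrane potential, x: synaptic input, o_prev: previous spike output.
    `leak` and `v_th` are illustrative constants.
    """
    v = leak * v + x - v_th * o_prev        # leak, integrate input, soft reset
    o = (v >= v_th).float()                 # Heaviside: spike where above threshold
    return v, o

# Unroll over T event slices (random stand-ins for per-step conv outputs).
T, shape = 10, (1, 64, 32, 32)
v, o = torch.zeros(shape), torch.zeros(shape)
for t in range(T):
    v, o = lif_step(v, torch.randn(shape), o)
```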

Comparison (DSEC dataset):

| Model | Rate | AEE |
|---|---|---|
| EV-FlowNet (baseline) | 10 Hz | 0.67 |
| LSTM-FlowNet (proposed) | 100 Hz | 0.60 |
| EfficientSpike-FlowNet | 100 Hz | 2.66 |

LSTM-FlowNet achieves roughly 10% lower AEE than baseline EV-FlowNet at 10× higher temporal resolution.

Efficiency:

  • EV-FlowNet: a feed-forward model that serves as the reference point for parameter count and normalized compute cost.
  • LSTM-FlowNet: the ConvLSTM state adds parameters and per-step compute relative to the feed-forward baseline, in exchange for temporally dense 100 Hz output.
  • EfficientSpike-FlowNet: consumes only a small fraction of the LSTM's energy at 100 Hz, leveraging sparse event-driven computation.

6. Limitations and Prospective Directions

Identified limitations and potential enhancements include:

  • Timestamp Saturation: Rapid event bursts overwrite the last-timestamp channels, discarding earlier event history and reducing spatial discriminability.
  • Event Sparsity: If the event window $[t_0, t_1]$ is too short, the input may contain too few events to constrain the flow.
  • Supervisory Dependency: The reliance on frame-based photometric loss inherits brightness-constancy assumptions and occlusion sensitivity.

Suggested extensions focus on:

  • Incorporating event-only loss functions, such as motion-compensated event alignment (see the sketch after this list), to obviate the need for grayscale supervision.
  • Broadening self-supervised paradigms to other event-based domains (e.g., depth, egomotion, semantic segmentation) using event-specific consistency or reconstruction constraints.
  • Developing richer, possibly learned, event tensor embeddings and spatio-temporal surfaces to encode more nuanced event histories.
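
As an example of the first direction, contrast maximization scores a candidate flow by how sharply motion-compensated events accumulate, requiring no grayscale frames. The sketch below is a deliberate simplification assuming a single global flow vector (a per-pixel field would warp each event by its local flow); all names are illustrative:

```python
import numpy as np

def contrast_of_warped_events(events, flow_uv, height, width, t_ref):
    """Score a candidate flow by the contrast (variance) of the image of
    motion-compensated events: well-aligned flow concentrates events along
    their true trajectories and sharpens the image."""
    u, v, t = events[:, 0], events[:, 1], events[:, 2]
    # Shift each event back to the reference time along the candidate flow.
    uw = np.round(u - (t - t_ref) * flow_uv[0]).astype(int)
    vw = np.round(v - (t - t_ref) * flow_uv[1]).astype(int)
    keep = (uw >= 0) & (uw < width) & (vw >= 0) & (vw < height)
    img = np.zeros((height, width), dtype=np.float32)
    np.add.at(img, (vw[keep], uw[keep]), 1.0)    # accumulate warped events
    return float(img.var())                       # higher variance = better alignment
```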

7. Impact and Research Context

EV-FlowNet demonstrated that effective optical flow estimation from events can be achieved via a compact 4-channel summary and a multi-scale U-Net architecture trained with self-supervised photometric and smoothness losses (Zhu et al., 2018). Its approach outperformed frame-based optical flow baselines on event camera data and inspired sequential and neuromorphic network extensions (e.g., ConvLSTM, SNNs) supporting temporally dense and energy-efficient inference (Ponghiran et al., 2022). The continued development of event-driven learning paradigms, recurrent representations, and event-only supervisory signals is enabling responsive and power-efficient vision pipelines for high-speed and high-dynamic-range sensing.

References

  • Zhu, A. Z., Yuan, L., Chaney, K., & Daniilidis, K. (2018). EV-FlowNet: Self-Supervised Optical Flow Estimation for Event-based Cameras. Robotics: Science and Systems (RSS).
  • Ponghiran, W., Liyanagedera, C. M., & Roy, K. (2022). Event-based Temporally Dense Optical Flow Estimation with Sequential Learning.
