
EV-FlowNet: Event-based Optical Flow

Updated 12 February 2026
  • EV-FlowNet is a self-supervised deep learning pipeline for estimating dense optical flow from event-based cameras using a compact event-tensor and U-Net-inspired encoder–decoder.
  • It employs multi-scale flow heads and self-supervised photometric and smoothness losses to refine predictions without requiring ground-truth flow.
  • Extensions with recurrent and spiking neural architectures enable temporally dense (up to 100 Hz) and energy-efficient flow estimation for high-speed vision.

EV-FlowNet is a self-supervised deep learning pipeline for dense optical flow estimation from event-based cameras, specifically designed to address the unique data modalities and constraints associated with asynchronous event streams. Its formulation includes a compact event-tensor encoding, a U-Net-inspired network architecture, and a self-supervised training regimen based on grayscale image supervision. EV-FlowNet served as the foundational model for subsequent sequential and spiking neural extensions that enabled temporally dense (e.g., 100 Hz) flow estimation and highly efficient low-power inference.

1. Event Data Representation

Event cameras emit asynchronous events $e = \{x, t, p\}$, where $x = (u,v)$ is the pixel location, $t$ is the timestamp, and $p \in \{+1, -1\}$ indicates polarity (increasing or decreasing log-intensity). An event is triggered when the log-intensity change crosses a threshold: $|\log I(t+\Delta t) - \log I(t)| \geq \theta$.

EV-FlowNet aggregates the raw event stream over a window $[t_0, t_1]$ into a fixed-size, image-like tensor with four channels per pixel:

  • $C^+(x)$: total count of positive events at $x$
  • $C^-(x)$: total count of negative events at $x$
  • $T^+(x)$: normalized timestamp of the last positive event at $x$
  • $T^-(x)$: normalized timestamp of the last negative event at $x$

Formally,

$$C^p(x) = \sum_{e \in \text{Events}} \mathbf{1}[e.x = x \wedge e.p = p], \qquad T^p(x) = \max_{e:\, e.x = x \wedge e.p = p} \frac{e.t - t_0}{t_1 - t_0}, \qquad p \in \{+1, -1\}$$

This encoding preserves spatial locality, event frequency, and recency, allowing downstream convolutional networks to process events as pseudo-images. Both $T^+$ and $T^-$ are clipped to $[0,1]$ for range compatibility.
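The encoding above can be sketched directly. This is an illustrative implementation, not the authors' code: the `(u, v, t, p)` tuple layout and the function name `events_to_tensor` are assumptions made here for clarity.

```python
import numpy as np

def events_to_tensor(events, H, W, t0, t1):
    """Aggregate an event list into the 4-channel EV-FlowNet input image.

    events: iterable of (u, v, t, p) with p in {+1, -1} (layout assumed here).
    Channels: [count+, count-, last-timestamp+, last-timestamp-].
    """
    tensor = np.zeros((H, W, 4), dtype=np.float32)
    for u, v, t, p in events:
        c = 0 if p > 0 else 1           # count-channel index by polarity
        tensor[v, u, c] += 1.0          # per-pixel, per-polarity event count
        # normalized timestamp of the most recent event, clipped to [0, 1]
        ts = np.clip((t - t0) / (t1 - t0), 0.0, 1.0)
        tensor[v, u, 2 + c] = max(tensor[v, u, 2 + c], ts)
    return tensor
```

Because only the most recent timestamp per polarity is kept, a burst of events at one pixel collapses into a single recency value, which is the timestamp-saturation limitation discussed in Section 6.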

2. Network Architecture

EV-FlowNet utilizes a multi-scale encoder–decoder structure resembling U-Net to predict per-pixel 2D flow vectors $(u(x), v(x))$. The architecture includes:

  • Encoder: Four strided 2D convolutions ($4 \times 4$, stride 2, padding 1), doubling channels at each stage: $4 \rightarrow 64 \rightarrow 128 \rightarrow 256 \rightarrow 512$.
  • Bottleneck: Two residual blocks, both with 512 channels:

$$x \mapsto \operatorname{ReLU}(\operatorname{Conv}_{3 \times 3}(x)) \mapsto \operatorname{ReLU}(\operatorname{Conv}_{3 \times 3}(\cdot)) + x$$

  • Decoder: Four upconvolutional stages. Each upsamples by $2\times$, applies a $3 \times 3$ convolution, and halves the channel count. Skip connections from encoder stages are concatenated ($\oplus$).
  • Multi-scale Flow Heads: At each decoder level, a 2-channel flow head predicts flow $(u_i, v_i)$ for that scale, passed through $\tanh$. Multi-scale flows are upsampled for loss computation and concatenated into subsequent decoder operations for refinement.

The forward pass can be summarized as:

$$\begin{aligned}
E_0 &= \mathrm{InputEventImage} \in \mathbb{R}^{H \times W \times 4} \\
E_1 &= \mathrm{Conv}_{4\times4,\,s2}(E_0) \\
E_2 &= \mathrm{Conv}_{4\times4,\,s2}(E_1) \\
E_3 &= \mathrm{Conv}_{4\times4,\,s2}(E_2) \\
E_4 &= \mathrm{Conv}_{4\times4,\,s2}(E_3) \\
B &= \mathrm{ResBlock}_{512}(\mathrm{ResBlock}_{512}(E_4)) \\
D_1 &= \mathrm{UpConv}_{3\times3}(B) \oplus E_4, \quad F_1 = \mathrm{FlowHead}(D_1) \\
&\;\;\vdots
\end{aligned}$$
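The resolution and channel bookkeeping of this encoder–decoder can be traced without any weights. The sketch below follows the stage names and channel counts stated above; the exact decoder channel widths after skip concatenation are not specified in this summary, so concatenation is omitted and halving is assumed at each upconvolution.

```python
def evflownet_shapes(H=256, W=256):
    """Trace feature-map shapes (h, w, channels) through EV-FlowNet.

    Pure bookkeeping sketch: stride-2 4x4 convs halve resolution and
    follow the 4->64->128->256->512 channel schedule; each decoder
    stage doubles resolution and halves channels. Skip-connection
    concatenation is omitted for simplicity.
    """
    shapes = {"E0": (H, W, 4)}
    encoder_channels = [64, 128, 256, 512]
    h, w = H, W
    for i, c in enumerate(encoder_channels, start=1):
        h, w = h // 2, w // 2                 # stride-2 conv halves H and W
        shapes[f"E{i}"] = (h, w, c)
    shapes["B"] = (h, w, 512)                 # residual bottleneck keeps shape
    for i, c in enumerate(reversed(encoder_channels), start=1):
        h, w = h * 2, w * 2                   # upconv doubles resolution...
        shapes[f"D{i}"] = (h, w, c // 2)      # ...and halves channels
    return shapes
```

For a $256 \times 256$ input this gives a $16 \times 16 \times 512$ bottleneck, so each decoder level hosts a flow head at $32$, $64$, $128$, and $256$ pixel resolution respectively.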

3. Self-Supervised Loss Functions

EV-FlowNet is trained without access to ground-truth flow, instead using a self-supervised approach leveraging frame-based grayscale images synchronized with the events.

  • Photometric Loss: The network-predicted flow is used to warp frame $I_{t+1}$ toward $I_t$, with a per-pixel brightness-constancy loss:

$$\ell_{\text{photo}} = \sum_{x,y} \rho\left( I_t(x,y) - I_{t+1}(x + u(x,y),\, y + v(x,y)) \right)$$

The penalty $\rho$ is the generalized Charbonnier function:

$$\rho(s) = (s^2 + \epsilon^2)^\alpha, \quad \alpha = 0.45,\ \epsilon = 10^{-3}$$

  • Smoothness Loss: A first-order neighborhood term penalizes local flow variation:

$$\ell_{\text{smooth}} = \sum_{x,y} \sum_{(i,j) \in \mathcal{N}(x,y)} \left[ \rho\left( u(x,y) - u(i,j) \right) + \rho\left( v(x,y) - v(i,j) \right) \right]$$

where $\mathcal{N}(x,y)$ denotes the 8-connected neighboring pixels.

  • Total Multi-scale Loss: Each intermediate and final flow estimate $F_i$ is supervised against appropriately downsampled frames, with the total loss

$$L_{\mathrm{total}} = \sum_{i=1}^4 \left[ \ell_{\mathrm{photo}}(F_i) + \lambda\,\ell_{\mathrm{smooth}}(F_i) \right], \quad \lambda = 0.5$$
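The Charbonnier penalty and the smoothness term are simple to write down. The sketch below is illustrative only: it uses right/down neighbor differences (a subset of the 8-connected neighborhood in the text), and the photometric term is omitted because it requires differentiable bilinear warping in a deep-learning framework.

```python
import numpy as np

def charbonnier(s, alpha=0.45, eps=1e-3):
    """Generalized Charbonnier penalty rho(s) = (s^2 + eps^2)^alpha."""
    return (np.asarray(s) ** 2 + eps ** 2) ** alpha

def smoothness_loss(u, v):
    """First-order smoothness over horizontal and vertical neighbor pairs.

    u, v: (H, W) flow components. A subset of the 8-connected
    neighborhood is used here for brevity.
    """
    loss = 0.0
    for f in (u, v):
        loss += charbonnier(f[:, 1:] - f[:, :-1]).sum()   # horizontal pairs
        loss += charbonnier(f[1:, :] - f[:-1, :]).sum()   # vertical pairs
    return loss
```

Note that with $\alpha = 0.45 < 0.5$ the penalty is sub-linear in $|s|$, so large residuals (e.g., at occlusions) are down-weighted relative to a quadratic loss.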

4. Training and Evaluation Protocol

Training Pipeline:

  • Data: MVSEC "outdoor_day1" and "outdoor_day2" sequences (DAVIS sensor: events, frames, pose/depth ground truth).
  • Event window: For a window of $\Delta t = k$ frames, all events in $[t, t_k]$ are aggregated for input.
  • Augmentation: Random horizontal flips; crops to $256 \times 256$.
  • Optimization: Adam with initial learning rate $1 \times 10^{-5}$ and exponential decay (factor 0.8 every 4 epochs); Charbonnier penalty with $\alpha = 0.45$, $\epsilon = 10^{-3}$; smoothness weight $\lambda = 0.5$; trained for 300k iterations ($\approx$12 hours on a V100 16 GB).
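The stepped exponential decay schedule amounts to one line; the helper below is a sketch of the stated hyperparameters (initial rate $10^{-5}$, factor 0.8 every 4 epochs), not the authors' training code.

```python
def learning_rate(epoch, base=1e-5, gamma=0.8, step=4):
    """Stepped exponential decay: multiply by gamma every `step` epochs."""
    return base * gamma ** (epoch // step)
```

So epochs 0–3 train at $10^{-5}$, epochs 4–7 at $8 \times 10^{-6}$, and so on.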

Evaluation Metrics:

  • Average Endpoint Error (AEE):

$$\mathrm{AEE} = \frac{1}{|\Omega|} \sum_{x \in \Omega} \left\| (u_{\text{pred}}, v_{\text{pred}}) - (u_{\text{gt}}, v_{\text{gt}}) \right\|_2$$

where $\Omega$ is the set of active, valid pixels.

  • Outlier Rate: percentage of pixels whose endpoint error exceeds 3 px and 5% of the ground-truth flow magnitude.
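Both metrics can be computed in a few lines of numpy. This is an illustrative implementation of the definitions above, with the mask playing the role of $\Omega$.

```python
import numpy as np

def aee_and_outliers(pred, gt, mask):
    """Average endpoint error and outlier rate over the valid-pixel set.

    pred, gt: (H, W, 2) flow fields; mask: (H, W) boolean (Omega).
    Outlier: endpoint error > 3 px AND > 5% of the GT flow magnitude.
    """
    err = np.linalg.norm(pred - gt, axis=-1)[mask]   # per-pixel endpoint error
    mag = np.linalg.norm(gt, axis=-1)[mask]          # GT flow magnitude
    aee = err.mean()
    outlier_rate = np.mean((err > 3.0) & (err > 0.05 * mag))
    return aee, outlier_rate
```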

Ground-truth Flow: Generated from pose and depth via the motion-field equation:

$$\begin{pmatrix} \dot{x} \\ \dot{y} \end{pmatrix} = \begin{bmatrix} -\frac{1}{Z} & 0 & \frac{x}{Z} & xy & -(1+x^2) & y \\ 0 & -\frac{1}{Z} & \frac{y}{Z} & 1+y^2 & -xy & -x \end{bmatrix} \begin{pmatrix} v \\ \omega \end{pmatrix}$$
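A scalar version of this motion-field computation is sketched below, under one common sign convention for the translational terms; actual conventions depend on the camera and pose frames used.

```python
def motion_field(x, y, Z, v, w):
    """Instantaneous image motion at normalized coordinates (x, y).

    Z: scene depth; v = (vx, vy, vz) camera linear velocity;
    w = (wx, wy, wz) camera angular velocity. Sign convention assumed.
    """
    vx, vy, vz = v
    wx, wy, wz = w
    # translational part scales with inverse depth; rotational part does not
    xdot = (-vx + x * vz) / Z + x * y * wx - (1 + x**2) * wy + y * wz
    ydot = (-vy + y * vz) / Z + (1 + y**2) * wx - x * y * wy - x * wz
    return xdot, ydot
```

Multiplying $(\dot{x}, \dot{y})$ by the inter-frame interval then yields the per-pixel ground-truth displacement used for AEE evaluation.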

Results (MVSEC outdoor_day1, $dt=1$ and $dt=4$):

| Model | AEE ($dt=1$) | Outliers ($dt=1$) | AEE ($dt=4$) | Outliers ($dt=4$) |
|---|---|---|---|---|
| UnFlow | 0.97 px | 1.6% | 2.95 px | 40.0% |
| EV-FlowNet$_{2R}$ | 0.49 px | 0.2% | 1.23 px | 7.3% |

EV-FlowNet achieves lower AEE and outlier rates than image-based UnFlow, especially at larger Δt\Delta t.

5. Extensions: Temporally Dense and Sequential Flow Estimation

Recent developments have extended the EV-FlowNet paradigm to recurrent neural architectures for temporally dense (e.g., 100 Hz) flow estimation (Ponghiran et al., 2022).

  • Sequential Event Slicing: Rather than aggregating all events over a fixed frame interval, slice the stream into fine-grained (e.g., 10 ms) windows, yielding input rates up to 100 Hz.
  • Recurrent Architectures:

    • LSTM-FlowNet: A U-Net structure where all convolutions are replaced by ConvLSTM layers. Flow is predicted from the decoder's finest-scale hidden state at each timestep.
    • EfficientSpike-FlowNet: All nonlinearities are replaced by leaky integrate-and-fire (LIF) neurons. Membrane potentials update as:

    $$v_t = v_{t-1} - y_{t-1} + W \ast x_t + b, \qquad y_t = \Theta(v_t - V_{th})$$

    where $\Theta$ is the Heaviside step function.
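A single-neuron trace of this update rule is easy to simulate. The sketch below replaces the convolution $W \ast x_t + b$ with a scalar `drive`, and assumes $\Theta(0) = 1$ (spike when the potential reaches threshold); both simplifications are this sketch's, not the paper's.

```python
def lif_step(v_prev, y_prev, drive, v_th=1.0):
    """One LIF update per the equations above.

    Subtractive reset by the previous spike y_{t-1}, then threshold.
    `drive` stands in for the convolutional input W * x_t + b.
    """
    v = v_prev - y_prev + drive          # integrate input, subtract last spike
    y = 1.0 if v - v_th >= 0 else 0.0    # Heaviside spike generation
    return v, y
```

With a constant sub-threshold drive, the neuron emits a regular spike train whose rate encodes the input magnitude, which is what makes the computation sparse and event-driven.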

  • Loss Function: A purely supervised $\ell_2$ loss on flow against (possibly linearly interpolated) ground truth:

$$L = \sum_{\text{frames } m} \sum_{\text{pixels } i \in \text{valid}} \left\| (u_{m,i}, v_{m,i})_{\text{pred}} - (u_{m,i}, v_{m,i})_{\text{gt}} \right\|_2^2$$

  • Key Training Technique: Warm-up frames + truncated BPTT to stabilize long recurrent inference.
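The masked supervised loss above reduces to a few array operations; this numpy sketch uses an assumed `(T, H, W, 2)` tensor layout for the flow sequences.

```python
import numpy as np

def supervised_flow_loss(pred, gt, valid):
    """Summed squared endpoint error over valid pixels across frames.

    pred, gt: (T, H, W, 2) flow sequences; valid: (T, H, W) mask in {0, 1}.
    """
    sq_epe = ((pred - gt) ** 2).sum(axis=-1)   # squared endpoint error per pixel
    return (sq_epe * valid).sum()              # masked sum over frames and pixels
```

During training, truncated BPTT would backpropagate this loss only through a bounded number of recent timesteps, with initial warm-up frames excluded from the sum to let the recurrent state settle.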

Comparison (DSEC dataset):

| Model | Rate | AEE |
|---|---|---|
| EV-FlowNet (baseline) | 10 Hz | 0.67 |
| LSTM-FlowNet (proposed) | 100 Hz | 0.60 |
| EfficientSpike-FlowNet | 100 Hz | 2.66 |

LSTM-FlowNet demonstrates $\approx$13% lower AEE than the baseline EV-FlowNet at $10\times$ higher temporal resolution.

Efficiency:

  • EV-FlowNet: 16.6M parameters; normalized compute cost $1\times$.
  • LSTM-FlowNet: 53.6M parameters; substantially higher energy consumption ($400\times$ the baseline at 100 Hz).
  • EfficientSpike-FlowNet: 16.6M parameters; only 1.5% of the LSTM energy at 100 Hz, leveraging sparse event-driven computation.

6. Limitations and Prospective Directions

Identified limitations and potential enhancements include:

  • Timestamp Saturation: Rapid event bursts can overwrite the timestamp channels, reducing spatial discriminability.
  • Event Sparsity: If the event window $\Delta t$ is too small, the input may carry insufficient signal.
  • Supervisory Dependency: The reliance on frame-based photometric loss inherits brightness-constancy assumptions and occlusion sensitivity.

Suggested extensions focus on:

  • Incorporating event-only loss functions, such as motion-compensated event alignment, to obviate the need for grayscale supervision.
  • Broadening self-supervised paradigms to other event-based domains (e.g., depth, egomotion, semantic segmentation) using event-specific consistency or reconstruction constraints.
  • Developing richer, possibly learned, event tensor embeddings and spatio-temporal surfaces to encode more nuanced event histories.

7. Impact and Research Context

EV-FlowNet demonstrated that effective optical flow estimation from events can be achieved via a compact 4-channel summary and a multi-scale U-Net architecture trained with self-supervised photometric and smoothness losses (Zhu et al., 2018). Its approach outperformed frame-based optical flow baselines on event camera data and inspired sequential and neuromorphic network extensions (e.g., ConvLSTM, SNNs) supporting temporally dense and energy-efficient inference (Ponghiran et al., 2022). The continued development of event-driven learning paradigms, recurrent representations, and event-only supervisory signals is enabling responsive and power-efficient vision pipelines for high-speed and high-dynamic-range sensing.
