
EV-FlowNet: Event-based Optical Flow

Updated 12 February 2026
  • EV-FlowNet is a self-supervised deep learning pipeline for estimating dense optical flow from event-based cameras using a compact event-tensor and U-Net-inspired encoder–decoder.
  • It employs multi-scale flow heads and self-supervised photometric and smoothness losses to refine predictions without requiring ground-truth flow.
  • Extensions with recurrent and spiking neural architectures enable temporally dense (up to 100 Hz) and energy-efficient flow estimation for high-speed vision.

EV-FlowNet is a self-supervised deep learning pipeline for dense optical flow estimation from event-based cameras, specifically designed to address the unique data modalities and constraints associated with asynchronous event streams. Its formulation includes a compact event-tensor encoding, a U-Net-inspired network architecture, and a self-supervised training regimen based on grayscale image supervision. EV-FlowNet served as the foundational model for subsequent sequential and spiking neural extensions that enabled temporally dense (e.g., 100 Hz) flow estimation and highly efficient low-power inference.

1. Event Data Representation

Event cameras emit asynchronous events $e = \{x, t, p\}$, where $x = (u,v)$ is the pixel location, $t$ is the timestamp, and $p \in \{+1, -1\}$ indicates polarity (increasing or decreasing log-intensity). An event is triggered when the log-intensity change crosses a threshold: $|\log I(t+\Delta t) - \log I(t)| \geq \theta$.

EV-FlowNet aggregates the raw event stream over a window $[t_0, t_1]$ into a fixed-size, image-like tensor with four channels per pixel:

  • $C^+(x)$: total count of positive events at $x$
  • $C^-(x)$: total count of negative events at $x$
  • $T^+(x)$: normalized timestamp of the last positive event at $x$
  • $T^-(x)$: normalized timestamp of the last negative event at $x$

Formally,

$$C^p(x) = \sum_{e \in \text{Events}} \mathbf{1}[e.x = x \wedge e.p = p], \qquad T^p(x) = \max_{e:\, e.x = x \wedge e.p = p} \frac{e.t - t_0}{t_1 - t_0}, \qquad p \in \{+1, -1\}$$

This encoding preserves spatial locality, event frequency, and recency, allowing downstream convolutional networks to process events as pseudo-images. Both $T^+$ and $T^-$ are clipped to $[0,1]$ for range compatibility.
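The encoding above can be sketched directly. This is an illustrative implementation, not the authors' code: the `(u, v, t, p)` tuple layout and the function name `events_to_tensor` are assumptions made here for clarity.

```python
import numpy as np

def events_to_tensor(events, H, W, t0, t1):
    """Aggregate an event list into the 4-channel EV-FlowNet input image.

    events: iterable of (u, v, t, p) with p in {+1, -1} (layout assumed here).
    Channels: [count+, count-, last-timestamp+, last-timestamp-].
    """
    tensor = np.zeros((H, W, 4), dtype=np.float32)
    for u, v, t, p in events:
        c = 0 if p > 0 else 1           # count-channel index by polarity
        tensor[v, u, c] += 1.0          # per-pixel, per-polarity event count
        # normalized timestamp of the most recent event, clipped to [0, 1]
        ts = np.clip((t - t0) / (t1 - t0), 0.0, 1.0)
        tensor[v, u, 2 + c] = max(tensor[v, u, 2 + c], ts)
    return tensor
```

Because only the most recent timestamp per polarity is kept, a burst of events at one pixel collapses into a single recency value, which is the timestamp-saturation limitation discussed in Section 6.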

2. Network Architecture

EV-FlowNet utilizes a multi-scale encoder–decoder structure resembling U-Net to predict per-pixel 2D flow vectors $(u(x), v(x))$. The architecture includes:

  • Encoder: Four strided 2D convolutions ($4 \times 4$, stride 2, padding 1), doubling channels at each stage: $4 \rightarrow 64 \rightarrow 128 \rightarrow 256 \rightarrow 512$.
  • Bottleneck: Two residual blocks, both with 512 channels:

$$x \mapsto \operatorname{ReLU}(\operatorname{Conv}_{3 \times 3}(x)) \mapsto \operatorname{ReLU}(\operatorname{Conv}_{3 \times 3}(\cdot)) + x$$

  • Decoder: Four upconvolutional stages. Each upsamples by $2\times$, applies a $3 \times 3$ convolution, and halves the channel count. Skip connections from encoder stages are concatenated ($\oplus$).
  • Multi-scale Flow Heads: At each decoder level, a 2-channel flow head predicts flow $(u_i, v_i)$ for that scale, passed through $\tanh$. Multi-scale flows are upsampled for loss computation and concatenated into subsequent decoder operations for refinement.

The forward pass can be summarized as:

$$\begin{aligned}
E_0 &= \mathrm{InputEventImage} \in \mathbb{R}^{H \times W \times 4} \\
E_1 &= \mathrm{Conv}_{4\times4,\,s2}(E_0) \\
E_2 &= \mathrm{Conv}_{4\times4,\,s2}(E_1) \\
E_3 &= \mathrm{Conv}_{4\times4,\,s2}(E_2) \\
E_4 &= \mathrm{Conv}_{4\times4,\,s2}(E_3) \\
B &= \mathrm{ResBlock}_{512}(\mathrm{ResBlock}_{512}(E_4)) \\
D_1 &= \mathrm{UpConv}_{3\times3}(B) \oplus E_4, \quad F_1 = \mathrm{FlowHead}(D_1) \\
&\;\;\vdots
\end{aligned}$$
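The resolution and channel bookkeeping of this encoder–decoder can be traced without any weights. The sketch below follows the stage names and channel counts stated above; the exact decoder channel widths after skip concatenation are not specified in this summary, so concatenation is omitted and halving is assumed at each upconvolution.

```python
def evflownet_shapes(H=256, W=256):
    """Trace feature-map shapes (h, w, channels) through EV-FlowNet.

    Pure bookkeeping sketch: stride-2 4x4 convs halve resolution and
    follow the 4->64->128->256->512 channel schedule; each decoder
    stage doubles resolution and halves channels. Skip-connection
    concatenation is omitted for simplicity.
    """
    shapes = {"E0": (H, W, 4)}
    encoder_channels = [64, 128, 256, 512]
    h, w = H, W
    for i, c in enumerate(encoder_channels, start=1):
        h, w = h // 2, w // 2                 # stride-2 conv halves H and W
        shapes[f"E{i}"] = (h, w, c)
    shapes["B"] = (h, w, 512)                 # residual bottleneck keeps shape
    for i, c in enumerate(reversed(encoder_channels), start=1):
        h, w = h * 2, w * 2                   # upconv doubles resolution...
        shapes[f"D{i}"] = (h, w, c // 2)      # ...and halves channels
    return shapes
```

For a $256 \times 256$ input this gives a $16 \times 16 \times 512$ bottleneck, so each decoder level hosts a flow head at $32$, $64$, $128$, and $256$ pixel resolution respectively.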

3. Self-Supervised Loss Functions

EV-FlowNet is trained without access to ground-truth flow, instead using a self-supervised approach leveraging frame-based grayscale images synchronized with the events.

  • Photometric Loss: The network-predicted flow is used to warp frame $I_{t+1}$ toward $I_t$, with a per-pixel brightness-constancy loss:

$$\ell_{\text{photo}} = \sum_{x,y} \rho\left( I_t(x,y) - I_{t+1}(x + u(x,y),\, y + v(x,y)) \right)$$

The penalty $\rho$ is the generalized Charbonnier function:

$$\rho(s) = (s^2 + \epsilon^2)^\alpha, \quad \alpha = 0.45,\ \epsilon = 10^{-3}$$

  • Smoothness Loss: A first-order neighborhood term penalizes local flow variation:

$$\ell_{\text{smooth}} = \sum_{x,y} \sum_{(i,j) \in \mathcal{N}(x,y)} \left[ \rho\left( u(x,y) - u(i,j) \right) + \rho\left( v(x,y) - v(i,j) \right) \right]$$

where $\mathcal{N}(x,y)$ denotes the 8-connected neighboring pixels.

  • Total Multi-scale Loss: Each intermediate and final flow estimate $F_i$ is supervised against appropriately downsampled frames, with the total loss

$$L_{\mathrm{total}} = \sum_{i=1}^4 \left[ \ell_{\mathrm{photo}}(F_i) + \lambda\,\ell_{\mathrm{smooth}}(F_i) \right], \quad \lambda = 0.5$$
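The Charbonnier penalty and the smoothness term are simple to write down. The sketch below is illustrative only: it uses right/down neighbor differences (a subset of the 8-connected neighborhood in the text), and the photometric term is omitted because it requires differentiable bilinear warping in a deep-learning framework.

```python
import numpy as np

def charbonnier(s, alpha=0.45, eps=1e-3):
    """Generalized Charbonnier penalty rho(s) = (s^2 + eps^2)^alpha."""
    return (np.asarray(s) ** 2 + eps ** 2) ** alpha

def smoothness_loss(u, v):
    """First-order smoothness over horizontal and vertical neighbor pairs.

    u, v: (H, W) flow components. A subset of the 8-connected
    neighborhood is used here for brevity.
    """
    loss = 0.0
    for f in (u, v):
        loss += charbonnier(f[:, 1:] - f[:, :-1]).sum()   # horizontal pairs
        loss += charbonnier(f[1:, :] - f[:-1, :]).sum()   # vertical pairs
    return loss
```

Note that with $\alpha = 0.45 < 0.5$ the penalty is sub-linear in $|s|$, so large residuals (e.g., at occlusions) are down-weighted relative to a quadratic loss.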

4. Training and Evaluation Protocol

Training Pipeline:

  • Data: MVSEC "outdoor_day1" and "outdoor_day2" sequences (DAVIS sensor: events, frames, pose/depth ground truth).
  • Event window: For a window of $\Delta t = k$ frames, all events in $[t, t_k]$ are aggregated for input.
  • Augmentation: Random horizontal flips; crops to $256 \times 256$.
  • Optimization: Adam with initial learning rate $1 \times 10^{-5}$ and exponential decay (factor 0.8 every 4 epochs); Charbonnier penalty with $\alpha = 0.45$, $\epsilon = 10^{-3}$; smoothness weight $\lambda = 0.5$; trained for 300k iterations ($\approx$12 hours on a V100 16 GB).
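The stepped exponential decay schedule amounts to one line; the helper below is a sketch of the stated hyperparameters (initial rate $10^{-5}$, factor 0.8 every 4 epochs), not the authors' training code.

```python
def learning_rate(epoch, base=1e-5, gamma=0.8, step=4):
    """Stepped exponential decay: multiply by gamma every `step` epochs."""
    return base * gamma ** (epoch // step)
```

So epochs 0–3 train at $10^{-5}$, epochs 4–7 at $8 \times 10^{-6}$, and so on.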

Evaluation Metrics:

  • Average Endpoint Error (AEE):

$$\mathrm{AEE} = \frac{1}{|\Omega|} \sum_{x \in \Omega} \left\| (u_{\text{pred}}, v_{\text{pred}}) - (u_{\text{gt}}, v_{\text{gt}}) \right\|_2$$

where $\Omega$ is the set of active, valid pixels.

  • Outlier Rate: percentage of pixels whose endpoint error exceeds 3 px and 5% of the ground-truth flow magnitude.
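Both metrics can be computed in a few lines of numpy. This is an illustrative implementation of the definitions above, with the mask playing the role of $\Omega$.

```python
import numpy as np

def aee_and_outliers(pred, gt, mask):
    """Average endpoint error and outlier rate over the valid-pixel set.

    pred, gt: (H, W, 2) flow fields; mask: (H, W) boolean (Omega).
    Outlier: endpoint error > 3 px AND > 5% of the GT flow magnitude.
    """
    err = np.linalg.norm(pred - gt, axis=-1)[mask]   # per-pixel endpoint error
    mag = np.linalg.norm(gt, axis=-1)[mask]          # GT flow magnitude
    aee = err.mean()
    outlier_rate = np.mean((err > 3.0) & (err > 0.05 * mag))
    return aee, outlier_rate
```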

Ground-truth Flow: Generated from pose and depth via the motion-field equation:

$$\begin{pmatrix} \dot{x} \\ \dot{y} \end{pmatrix} = \begin{bmatrix} -\frac{1}{Z} & 0 & \frac{x}{Z} & xy & -(1+x^2) & y \\ 0 & -\frac{1}{Z} & \frac{y}{Z} & 1+y^2 & -xy & -x \end{bmatrix} \begin{pmatrix} v \\ \omega \end{pmatrix}$$
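A scalar version of this motion-field computation is sketched below, under one common sign convention for the translational terms; actual conventions depend on the camera and pose frames used.

```python
def motion_field(x, y, Z, v, w):
    """Instantaneous image motion at normalized coordinates (x, y).

    Z: scene depth; v = (vx, vy, vz) camera linear velocity;
    w = (wx, wy, wz) camera angular velocity. Sign convention assumed.
    """
    vx, vy, vz = v
    wx, wy, wz = w
    # translational part scales with inverse depth; rotational part does not
    xdot = (-vx + x * vz) / Z + x * y * wx - (1 + x**2) * wy + y * wz
    ydot = (-vy + y * vz) / Z + (1 + y**2) * wx - x * y * wy - x * wz
    return xdot, ydot
```

Multiplying $(\dot{x}, \dot{y})$ by the inter-frame interval then yields the per-pixel ground-truth displacement used for AEE evaluation.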

Results (MVSEC outdoor_day1, $dt=1$ and $dt=4$):

| Model | AEE ($dt=1$) | Outliers ($dt=1$) | AEE ($dt=4$) | Outliers ($dt=4$) |
|---|---|---|---|---|
| UnFlow | 0.97 px | 1.6% | 2.95 px | 40.0% |
| EV-FlowNet$_{2R}$ | 0.49 px | 0.2% | 1.23 px | 7.3% |

EV-FlowNet achieves lower AEE and outlier rates than image-based UnFlow, especially at larger Δt\Delta t.

5. Extensions: Temporally Dense and Sequential Flow Estimation

Recent developments have extended the EV-FlowNet paradigm to recurrent neural architectures for temporally dense (e.g., 100 Hz) flow estimation (Ponghiran et al., 2022).

  • Sequential Event Slicing: Rather than aggregating all events over a fixed frame interval, slice the stream into fine-grained (e.g., 10 ms) windows, yielding input rates up to 100 Hz.
  • Recurrent Architectures:

    • LSTM-FlowNet: A U-Net structure where all convolutions are replaced by ConvLSTM layers. Flow is predicted from the decoder's finest-scale hidden state at each timestep.
    • EfficientSpike-FlowNet: All nonlinearities are replaced by leaky integrate-and-fire (LIF) neurons. Membrane potentials update as:

    $$v_t = v_{t-1} - y_{t-1} + W \ast x_t + b, \qquad y_t = \Theta(v_t - V_{th})$$

    where $\Theta$ is the Heaviside step function.
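A single-neuron trace of this update rule is easy to simulate. The sketch below replaces the convolution $W \ast x_t + b$ with a scalar `drive`, and assumes $\Theta(0) = 1$ (spike when the potential reaches threshold); both simplifications are this sketch's, not the paper's.

```python
def lif_step(v_prev, y_prev, drive, v_th=1.0):
    """One LIF update per the equations above.

    Subtractive reset by the previous spike y_{t-1}, then threshold.
    `drive` stands in for the convolutional input W * x_t + b.
    """
    v = v_prev - y_prev + drive          # integrate input, subtract last spike
    y = 1.0 if v - v_th >= 0 else 0.0    # Heaviside spike generation
    return v, y
```

With a constant sub-threshold drive, the neuron emits a regular spike train whose rate encodes the input magnitude, which is what makes the computation sparse and event-driven.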

  • Loss Function: A purely supervised $\ell_2$ loss on flow against (possibly linearly interpolated) ground truth:

$$L = \sum_{\text{frames } m} \sum_{\text{pixels } i \in \text{valid}} \left\| (u_{m,i}, v_{m,i})_{\text{pred}} - (u_{m,i}, v_{m,i})_{\text{gt}} \right\|_2^2$$

  • Key Training Technique: Warm-up frames + truncated BPTT to stabilize long recurrent inference.
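The masked supervised loss above reduces to a few array operations; this numpy sketch uses an assumed `(T, H, W, 2)` tensor layout for the flow sequences.

```python
import numpy as np

def supervised_flow_loss(pred, gt, valid):
    """Summed squared endpoint error over valid pixels across frames.

    pred, gt: (T, H, W, 2) flow sequences; valid: (T, H, W) mask in {0, 1}.
    """
    sq_epe = ((pred - gt) ** 2).sum(axis=-1)   # squared endpoint error per pixel
    return (sq_epe * valid).sum()              # masked sum over frames and pixels
```

During training, truncated BPTT would backpropagate this loss only through a bounded number of recent timesteps, with initial warm-up frames excluded from the sum to let the recurrent state settle.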

Comparison (DSEC dataset):

| Model | Rate | AEE |
|---|---|---|
| EV-FlowNet (baseline) | 10 Hz | 0.67 |
| LSTM-FlowNet (proposed) | 100 Hz | 0.60 |
| EfficientSpike-FlowNet | 100 Hz | 2.66 |

LSTM-FlowNet demonstrates $\approx$13% lower AEE than the baseline EV-FlowNet at $10\times$ higher temporal resolution.

Efficiency:

  • EV-FlowNet: 16.6M parameters; normalized compute cost $1\times$.
  • LSTM-FlowNet: 53.6M parameters; substantially higher energy consumption ($400\times$ the baseline at 100 Hz).
  • EfficientSpike-FlowNet: 16.6M parameters; only 1.5% of the LSTM energy at 100 Hz, leveraging sparse event-driven computation.

6. Limitations and Prospective Directions

Identified limitations and potential enhancements include:

  • Timestamp Saturation: Rapid event bursts can overwrite the timestamp channels, reducing spatial discriminability.
  • Event Sparsity: If the event window $\Delta t$ is too small, the input may carry insufficient signal.
  • Supervisory Dependency: The reliance on frame-based photometric loss inherits brightness-constancy assumptions and occlusion sensitivity.

Suggested extensions focus on:

  • Incorporating event-only loss functions, such as motion-compensated event alignment, to obviate the need for grayscale supervision.
  • Broadening self-supervised paradigms to other event-based domains (e.g., depth, egomotion, semantic segmentation) using event-specific consistency or reconstruction constraints.
  • Developing richer, possibly learned, event tensor embeddings and spatio-temporal surfaces to encode more nuanced event histories.

7. Impact and Research Context

EV-FlowNet demonstrated that effective optical flow estimation from events can be achieved via a compact 4-channel summary and a multi-scale U-Net architecture trained with self-supervised photometric and smoothness losses (Zhu et al., 2018). Its approach outperformed frame-based optical flow baselines on event camera data and inspired sequential and neuromorphic network extensions (e.g., ConvLSTM, SNNs) supporting temporally dense and energy-efficient inference (Ponghiran et al., 2022). The continued development of event-driven learning paradigms, recurrent representations, and event-only supervisory signals is enabling responsive and power-efficient vision pipelines for high-speed and high-dynamic-range sensing.
