Contrast Maximization for Event-based Vision

Updated 3 July 2026

Contrast Maximization (CM) is an event-based estimation framework that optimizes warp parameters to produce maximally sharp, motion-corrected event images.
It unifies tasks such as motion, depth, and optical flow estimation by modifying the parametric warp and evaluating sharpness through variance and gradient measures.
CM techniques employ diverse optimization strategies and regularizers to prevent event collapse and improve robustness in real-time vision applications.

Contrast Maximization (CM) is an event-based estimation framework in which candidate motion, depth, or flow parameters are used to warp events to a common reference time, and the parameters are chosen so that the resulting Image of Warped Events (IWE) is maximally sharp. In the unifying formulation of Gallego et al., the same principle applies to motion estimation, stereo depth, and optical flow by changing only the parametric warp $W$; the method implicitly handles data association and produces motion-corrected edge-like images with high dynamic range (Gallego et al., 2018).

1. Core mathematical formulation

In the standard event-camera setting, an event stream is written as

$e_k=(x_k,t_k,p_k),$

where $x_k\in\Omega\subset\mathbb R^2$ is the image location, $t_k$ is the timestamp, and $p_k\in\{+1,-1\}$ is the polarity. CM selects a reference time $t_{\rm ref}$, warps each event according to a candidate parameter vector $\theta$,

$x_k'(\theta)=W(x_k,t_k;\theta),$

and rasterizes the warped events into an IWE,

$I_i(\theta)=\sum_{k=1}^N K\bigl(x_i, x_k'(\theta)\bigr),$

where $K$ is a kernel such as the Dirac delta $\delta$ for pure binning or a tent function for bilinear interpolation (Gallego et al., 2018).
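The rasterization step can be sketched in NumPy as follows. This is a minimal illustration of the tent-kernel (bilinear voting) case, assuming the warped events are available as an (N, 2) array of (x, y) positions; the function name `iwe_bilinear` and the toy coordinates are hypothetical and not taken from any of the cited papers.

```python
import numpy as np

def iwe_bilinear(xw, height, width):
    """Accumulate warped events xw (N, 2), given as (x, y), with bilinear votes."""
    img = np.zeros((height, width))
    x, y = xw[:, 0], xw[:, 1]
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    fx, fy = x - x0, y - y0
    # Each event votes into its four neighbouring pixels; the weights sum to 1.
    for dx, dy, w in ((0, 0, (1 - fx) * (1 - fy)),
                      (1, 0, fx * (1 - fy)),
                      (0, 1, (1 - fx) * fy),
                      (1, 1, fx * fy)):
        xi, yi = x0 + dx, y0 + dy
        ok = (xi >= 0) & (xi < width) & (yi >= 0) & (yi < height)
        # np.add.at accumulates correctly when several events hit the same pixel.
        np.add.at(img, (yi[ok], xi[ok]), w[ok])
    return img

# Toy input (assumed): three already-warped events.
events_xy = np.array([[2.0, 3.0], [2.5, 3.0], [4.25, 1.75]])
img = iwe_bilinear(events_xy, height=6, width=8)
```

Because every event distributes unit mass, the image sum equals the number of in-bounds events, which is the property that keeps the IWE mean constant across candidate warps.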

The foundational CM objective is the variance of the IWE over the pixel grid: $\operatorname{Var}\bigl(I(\theta)\bigr)=\frac{1}{N_p}\sum_i\bigl(I_i(\theta)-\mu_I(\theta)\bigr)^2$, where $N_p$ is the number of pixels and $\mu_I(\theta)=\frac{1}{N_p}\sum_i I_i(\theta)$ is the mean of the IWE. Maximizing $\operatorname{Var}\bigl(I(\theta)\bigr)$ sharpens the warped event image; equivalently, one may maximize $\sum_i I_i(\theta)^2$ when $\mu_I(\theta)$ is constant (Gallego et al., 2018). In a hardware-oriented restatement, the same idea appears as maximizing the variance of the IWE after bilinear voting of warped events (Filipkowski et al., 10 May 2026).
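The equivalence between the variance and the sum of squares follows from a one-line expansion. With $N_p$ the number of pixels and $\mu_I(\theta)=\frac{1}{N_p}\sum_i I_i(\theta)$ the IWE mean,

$\operatorname{Var}\bigl(I(\theta)\bigr)=\frac{1}{N_p}\sum_i I_i(\theta)^2-\mu_I(\theta)^2.$

If the kernel distributes unit mass per event and no warped event leaves the image, then $\sum_i I_i(\theta)=N$ for every $\theta$, so $\mu_I=N/N_p$ does not depend on $\theta$ and the two objectives share the same maximizers. When warped events do leave the image, $\mu_I$ varies with $\theta$ and the equivalence holds only approximately.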

The literature also uses alternative sharpness measures. "Secrets of Edge-Informed Contrast Maximization for Event-Based Vision" distinguishes a zeroth-order variance contrast,

$C_0(\theta)=\frac{1}{N_p}\sum_i\bigl(I_i(\theta)-\mu_I(\theta)\bigr)^2,$

from a first-order gradient-magnitude contrast,

$C_1(\theta)=\frac{1}{N_p}\sum_i\bigl\|\nabla I_i(\theta)\bigr\|^2,$

with $N_p$ the number of pixels and $\mu_I$ the IWE mean, and defines a relative contrast (Karmokar et al., 2024). For sequential optical flow, Paredes-Vallés et al. replace raw counts by per-polarity time-images with normalized timestamps,

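$T_i^{p}(\theta)=\frac{\sum_{k:\,p_k=p}K\bigl(x_i,x_k'(\theta)\bigr)\,\bar t_k}{\sum_{k:\,p_k=p}K\bigl(x_i,x_k'(\theta)\bigr)},\qquad p\in\{+1,-1\},$

where $\bar t_k\in[0,1]$ is the normalized timestamp, and minimize the sum of squared time-images over active pixels, $\mathcal L(\theta)=\sum_{p}\sum_{i\,\text{active}}T_i^{p}(\theta)^2$; smaller $\mathcal L$ means sharper IWEs (Paredes-Vallés et al., 2023).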
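A timestamp-based loss of this kind can be sketched as below. This is an illustrative version under stated assumptions, not the authors' implementation: it uses nearest-pixel binning instead of a bilinear kernel, a single reference time, and synthetic events from 20 translating points; `time_image_loss` and all data are hypothetical.

```python
import numpy as np

def time_image_loss(xw, t, p, shape=(64, 64)):
    """Sum over polarities and active pixels of the squared mean normalized timestamp."""
    t_norm = (t - t.min()) / (t.max() - t.min())
    xi = np.rint(xw[:, 0]).astype(int)
    yi = np.rint(xw[:, 1]).astype(int)
    inside = (xi >= 0) & (xi < shape[1]) & (yi >= 0) & (yi < shape[0])
    loss = 0.0
    for sign in (1, -1):
        m = inside & (p == sign)
        t_sum = np.zeros(shape)
        count = np.zeros(shape)
        np.add.at(t_sum, (yi[m], xi[m]), t_norm[m])
        np.add.at(count, (yi[m], xi[m]), 1.0)
        active = count > 0  # pixels that received at least one event
        loss += np.sum((t_sum[active] / count[active]) ** 2)
    return loss

# Synthetic events (assumed): 20 points moving at v_true, 50 events each.
rng = np.random.default_rng(0)
v_true = np.array([20.0, -10.0])  # px/s
p0 = rng.integers(20, 44, size=(20, 2)).astype(float)
t = np.tile(np.linspace(0.0, 0.5, 50), 20)
pol = np.repeat(np.where(np.arange(20) % 2 == 0, 1, -1), 50)
xy = np.repeat(p0, 50, axis=0) + t[:, None] * v_true

loss_true = time_image_loss(xy - t[:, None] * v_true, t, pol)  # motion-compensated
loss_zero = time_image_loss(xy, t, pol)                        # no compensation
```

With the correct flow, each point's events fall into one pixel, so few pixels are active and the loss is small; without compensation the events smear over many pixels and the loss grows.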

A persistent misconception is that CM is tied to a single image statistic. The surveyed formulations show instead that CM is a family of objectives built around the same alignment principle: candidate trajectories are evaluated by the sharpness, variance, gradient energy, or timestamp consistency of the warped event accumulation.

2. Warp models and task specialization

The unifying aspect of CM lies in the separation between the objective and the warp model. In the abstract formulation,

$x_k'=W(x_k,t_k;\theta),$

and the choice of $W$ determines the task. For optical flow, $\theta$ is a 2D flow vector $v$ and the warp is

$x_k'=x_k-(t_k-t_{\rm ref})\,v.$

For stereo depth, $\theta$ is inverse depth or disparity, and the warp reprojects an event from one camera into the other. For camera motion estimation, $\theta$ can be a six-dimensional twist $(\omega,\nu)\in\mathbb R^6$, and each event ray is moved by rigid-body motion before reprojection (Gallego et al., 2018).
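As a concrete instance of the optical-flow case, the sketch below warps synthetic events with the constant-flow model $x_k'=x_k-(t_k-t_{\rm ref})\,v$, builds a Dirac-kernel IWE, and picks the candidate flow with the highest variance by exhaustive search. The data, the grid of candidates, and the function names are assumptions made for illustration.

```python
import numpy as np

def warp_flow(xy, t, v, t_ref=0.0):
    """Constant-flow warp: x' = x - (t - t_ref) * v."""
    return xy - (t - t_ref)[:, None] * np.asarray(v)[None, :]

def iwe_nearest(xw, height, width):
    """Dirac-kernel IWE: count warped events at the nearest pixel."""
    xi = np.rint(xw[:, 0]).astype(int)
    yi = np.rint(xw[:, 1]).astype(int)
    ok = (xi >= 0) & (xi < width) & (yi >= 0) & (yi < height)
    img = np.zeros((height, width))
    np.add.at(img, (yi[ok], xi[ok]), 1.0)
    return img

# Synthetic data (assumed): 20 points translating at v_true for 0.5 s.
rng = np.random.default_rng(0)
v_true = np.array([20.0, -10.0])            # px/s
p0 = rng.integers(20, 44, size=(20, 2)).astype(float)
t = np.tile(np.linspace(0.0, 0.5, 50), 20)  # 50 events per point
xy = np.repeat(p0, 50, axis=0) + t[:, None] * v_true[None, :]

# Exhaustive search over a coarse grid of candidate flows.
grid = np.arange(-30.0, 31.0, 5.0)
scores = {(vx, vy): iwe_nearest(warp_flow(xy, t, (vx, vy)), 64, 64).var()
          for vx in grid for vy in grid}
v_best = max(scores, key=scores.get)
```

Swapping `warp_flow` for a different warp function changes the task while the objective stays the same, which is the separation the text describes.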

In rotational motion estimation, a common model assumes constant angular velocity over a short interval. "Globally Optimal Contrast Maximisation for Event-based Motion Estimation" writes the warp as

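$x_k'(\omega)=\pi\Bigl(\exp\bigl([\omega]_\times\,(t_k-t_{\rm ref})\bigr)\,\tilde x_k\Bigr),$

where $\tilde x_k$ is the calibrated homogeneous event coordinate and $\pi$ is the perspective projection, with $\omega$ constrained to a 3-ball $\mathbb B\subset\mathbb R^3$ (Liu et al., 2020).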
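A constant-angular-velocity warp can be sketched with Rodrigues' formula. The version below assumes calibrated (normalized) image coordinates and the sign convention that an event is rotated by $\exp([\omega]_\times (t_k-t_{\rm ref}))$; papers differ on the sign, so treat this as one possible convention. The function names are hypothetical.

```python
import numpy as np

def rodrigues(rotvec):
    """Rotation matrix exp([rotvec]_x) via Rodrigues' formula."""
    angle = np.linalg.norm(rotvec)
    if angle < 1e-12:
        return np.eye(3)
    k = rotvec / angle
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def warp_rotation(xy_norm, t, omega, t_ref=0.0):
    """Rotate each event ray by omega * (t - t_ref) and reproject to the image plane."""
    out = np.empty_like(xy_norm)
    for i, (pt, ti) in enumerate(zip(xy_norm, t)):
        ray = rodrigues(omega * (ti - t_ref)) @ np.array([pt[0], pt[1], 1.0])
        out[i] = ray[:2] / ray[2]  # perspective projection
    return out

# Toy check: a quarter turn about the optical axis maps (1, 0) to (0, 1).
pts = np.array([[1.0, 0.0]])
quarter_turn = warp_rotation(pts, np.array([1.0]), np.array([0.0, 0.0, np.pi / 2]))
unchanged = warp_rotation(pts, np.array([1.0]), np.zeros(3))
```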

A second misconception is that CM is intrinsically two-dimensional. In fact, "Visual Odometry with an Event Camera Using Continuous Ray Warping and Volumetric Contrast Maximization" generalizes the framework to 3D. Instead of projecting all events into a single reference image plane, the method warps entire event rays under a continuous-time trajectory, accumulates them in a volumetric ray-density field, and maximizes the variance of that volume (Wang et al., 2021). The paper explicitly contrasts this with standard 2D IWE CM, which requires a homography or known per-event depth and cannot handle full 6-DOF motion in arbitrary scenes.

The same contrast principle has also been extended to back-end trajectory refinement. "CMax-SLAM" formulates rotation-only bundle adjustment by building a local IWE from current events, combining it with a global IWE from past events, and minimizing negative contrast plus a motion regularizer over spline control poses in $SO(3)$ (Guo et al., 2024). In that formulation, the sharpness of a global panoramic IWE serves as a proxy for reprojection error.

3. Optimization strategies and search procedures

The canonical optimization problem is

$\theta^\ast=\arg\max_\theta\,\operatorname{Var}\bigl(I(\theta)\bigr).$

Gallego et al. describe two standard numerical strategies: gradient ascent, in which events are warped, rasterized, differentiated through the kernel, and used to update $\theta$; and Gauss-Newton or Levenberg-Marquardt, which linearize residuals and solve approximate normal equations (Gallego et al., 2018). The same paper notes that the contrast surface $\operatorname{Var}\bigl(I(\theta)\bigr)$ can be non-convex, so a good initialization is important.
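The gradient-ascent strategy can be illustrated with a small sketch. To stay short it replaces the analytic gradient through the kernel with central finite differences and adds a simple step-halving rule, so it is a stand-in for the procedure described above rather than a reproduction of it; the synthetic events, step sizes, and function names are assumptions.

```python
import numpy as np

def contrast(v, xy, t, shape=(64, 64)):
    """Variance of a bilinearly voted IWE under the constant-flow warp."""
    xw = xy - t[:, None] * v[None, :]
    x0 = np.floor(xw).astype(int)
    frac = xw - x0
    img = np.zeros(shape)
    for dx in (0, 1):
        for dy in (0, 1):
            # Tent-kernel weight of the neighbour at offset (dx, dy).
            w = np.abs(1 - dx - frac[:, 0]) * np.abs(1 - dy - frac[:, 1])
            xi, yi = x0[:, 0] + dx, x0[:, 1] + dy
            ok = (xi >= 0) & (xi < shape[1]) & (yi >= 0) & (yi < shape[0])
            np.add.at(img, (yi[ok], xi[ok]), w[ok])
    return img.var()

def gradient_ascent(f, v0, step=50.0, iters=60, h=0.05):
    """Finite-difference gradient ascent; a step is kept only if it raises f."""
    v = np.array(v0, dtype=float)
    fv = f(v)
    for _ in range(iters):
        g = np.array([(f(v + h * e) - f(v - h * e)) / (2 * h) for e in np.eye(2)])
        cand = v + step * g
        fc = f(cand)
        if fc > fv:
            v, fv = cand, fc
        else:
            step *= 0.5  # backtrack when the step overshoots
    return v, fv

# Synthetic events (assumed): 20 points translating at v_true for 0.5 s.
rng = np.random.default_rng(0)
v_true = np.array([20.0, -10.0])
p0 = rng.uniform(20, 44, size=(20, 2))
t = np.tile(np.linspace(0.0, 0.5, 50), 20)
xy = np.repeat(p0, 50, axis=0) + t[:, None] * v_true

objective = lambda v: contrast(v, xy, t)
v_init = np.array([15.0, -6.0])
v_hat, f_hat = gradient_ascent(objective, v_init)
```

Because the surface is non-convex, the result depends on `v_init`; starting far from the true flow can leave the iterate in a local maximum, which is the motivation for the global and multiscale methods discussed next.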

A rigorous response to this non-convexity is branch-and-bound (BnB). "Globally Optimal Contrast Maximisation for Event-based Motion Estimation" embeds rotational CM in a deterministic BnB framework over the rotation parameter domain. For each search cube $\mathbb D$, the method computes a lower bound $\underline f$ from the best contrast found so far and an upper bound $\overline f(\mathbb D)$ from new bounding functions for the contrast objective, then prunes cubes with $\overline f(\mathbb D)\le\underline f$ (Liu et al., 2020). The paper establishes the validity of these bounds for both continuous and discrete event images and reports exact global optimization for 3-DoF pure rotation.
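The pruning logic is easiest to see on a toy problem. The sketch below runs best-first branch-and-bound on a one-dimensional function, using a Lipschitz constant as the upper bound; this bound is a stand-in chosen for brevity and is not the contrast-specific bounding function derived by Liu et al. The objective, the interval, and the constant are all illustrative assumptions.

```python
import heapq

def branch_and_bound(f, lo, hi, lipschitz, eps=1e-4):
    """Maximize f on [lo, hi] to within eps, given a valid Lipschitz constant."""
    mid = 0.5 * (lo + hi)
    best_x, best_f = mid, f(mid)  # incumbent, i.e. the lower bound
    # heapq is a min-heap, so upper bounds are stored negated.
    heap = [(-(best_f + lipschitz * 0.5 * (hi - lo)), lo, hi)]
    while heap:
        neg_ub, a, b = heapq.heappop(heap)
        if -neg_ub <= best_f + eps:
            break  # no remaining interval can beat the incumbent by more than eps
        m = 0.5 * (a + b)
        for c, d in ((a, m), (m, b)):
            x = 0.5 * (c + d)
            fx = f(x)
            if fx > best_f:
                best_x, best_f = x, fx
            ub = fx + lipschitz * 0.5 * (d - c)  # upper bound on f over [c, d]
            if ub > best_f + eps:                # otherwise the interval is pruned
                heapq.heappush(heap, (-ub, c, d))
    return best_x, best_f

# Toy objective with maximum at x = 0.3; |f'(x)| <= 2.6 on [-1, 1].
x_best, f_best = branch_and_bound(lambda x: -(x - 0.3) ** 2, -1.0, 1.0, lipschitz=2.6)
```

The guarantee comes entirely from the validity of the upper bound: as long as it never underestimates the objective on an interval, discarding intervals whose bound falls below the incumbent cannot remove the global maximum.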

At the other end of the spectrum are multiscale local solvers. In edge-informed CM, motion is estimated on a multi-level pyramid from coarse to fine; at each level, the method optimizes the level's objective with full-batch BFGS via JAXopt, and optionally mixes the result with a downscaled previous-window estimate through a handover weight (Karmokar et al., 2024). The same paper states that, in practice, 5 pyramid levels of progressively finer flow cells converge reliably.

Sequential neural CM introduces a distinct optimization structure. In "Taming Contrast Maximization for Learning Sequential, Low-latency, Event-based Optical Flow," a stateful recurrent network processes a sequence of short event windows, predicts a flow map at each step, and computes a multi-reference, multi-timescale self-supervised loss over the resulting iterative warps (Paredes-Vallés et al., 2023). This formulation shifts the numerical burden from per-window test-time optimization to training-time differentiation through a recurrent predictor.

4. Failure modes, regularization, and robustness

The best-known pathology of CM is event collapse: a degenerate optimum in which the estimated warp contracts events into too few pixels and thereby increases IWE contrast spuriously. "Event Collapse in Contrast Maximization Frameworks" demonstrates this in a 1-DOF zoom model,

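$x_k'=\bigl(1-(t_k-t_{\rm ref})\,h_z\bigr)\,x_k,$

for which a strongly contracting value of the zoom parameter $h_z$ drives events toward a single point even when the true motion corresponds to a different parameter value (Shiba et al., 2022). The paper derives collapse metrics from first principles using the divergence of the induced flow field and the determinant of the warp Jacobian, and augments CM with a weighted penalty $R(\theta)$ with weight $\lambda\ge 0$,

$\theta^\ast=\arg\max_\theta\Bigl[\operatorname{Var}\bigl(I(\theta)\bigr)-\lambda\,R(\theta)\Bigr].$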

The same study reports that these regularizers mitigate collapse and do not harm well-posed warps.
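Event collapse is easy to reproduce numerically. The sketch below applies a 1-DOF zoom warp about the image center to pure uniform noise, for which no warp is "correct", and compares the IWE variance with and without contraction. It also returns the per-event Jacobian determinant of the warp as a simple collapse indicator; this indicator is only in the spirit of the metrics described above and is not the paper's exact regularizer. All names and data are assumptions.

```python
import numpy as np

def zoom_warp(xy, t, h_z, center):
    """1-DOF zoom about `center`: x' = c + (1 - t * h_z) * (x - c), with t_ref = 0."""
    scale = 1.0 - t * h_z
    warped = center + scale[:, None] * (xy - center)
    jac_det = scale ** 2  # determinant of the 2x2 warp Jacobian, scale * I
    return warped, jac_det

def iwe_variance(xw, shape=(64, 64)):
    """Variance of a Dirac-kernel IWE built by flooring warped coordinates."""
    xi = np.floor(xw[:, 0]).astype(int)
    yi = np.floor(xw[:, 1]).astype(int)
    ok = (xi >= 0) & (xi < shape[1]) & (yi >= 0) & (yi < shape[0])
    img = np.zeros(shape)
    np.add.at(img, (yi[ok], xi[ok]), 1.0)
    return img.var()

# Uniform noise events (assumed): no structure and no true motion.
rng = np.random.default_rng(0)
n = 5000
xy = rng.uniform(0, 64, size=(n, 2))
t = rng.uniform(0, 1, size=n)
center = np.array([32.0, 32.0])

xw_id, det_id = zoom_warp(xy, t, 0.0, center)       # identity warp
xw_col, det_col = zoom_warp(xy, t, 1.0, center)     # strong contraction
var_identity = iwe_variance(xw_id)
var_collapsed = iwe_variance(xw_col)
```

The contracted warp yields a higher variance even though the data carry no motion, which is why an unregularized objective can prefer it. A mean Jacobian determinant well below 1 flags the contraction and is the kind of quantity a penalty term can act on.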

A computationally lighter remedy appears in "A Fast Geometric Regularizer to Mitigate Event Collapse in the Contrast Maximization Framework." For a 1-DOF zoom motion, the proposed penalty grows without bound as the warp approaches complete collapse and therefore acts as a barrier against it (Shiba et al., 2022). The paper states that the regularizer is closed-form in the warp parameters, is essentially identical in runtime to unregularized CMax, and is two to four times faster than previous approaches while achieving state-of-the-art accuracy in the reported settings.

Another failure mode is the appearance of multiple extrema under strong noise. "Density Invariant Contrast Maximization for Neuromorphic Earth Observations" attributes this to the geometry of warped noise: uniformly distributed noise events can accumulate into a trapezoidal or ring-shaped pattern after warping, which boosts variance at spurious motion parameters (Arja et al., 2023). The proposed analytical compensation multiplies the warped event image by a motion-dependent correction before variance is computed, so that the variance contribution of uniform noise becomes zero for all candidate motions. The paper emphasizes that the correction depends only on the candidate warp and the sensor size, and keeps the rest of the CMax pipeline unchanged.

A further robustness extension is bi-modal rather than event-only. "Secrets of Edge-Informed Contrast Maximization for Event-Based Vision" introduces a frame-edge correlation term that measures the discrepancy between the IWE and an edge map extracted from a synchronous intensity frame (Karmokar et al., 2024). The combined objective adds this correlation to the relative contrast and a smoothness regularizer. The paper reports that adding edge-frame correlation improves IWE sharpness and helps avoid event-collapse modes.

5. Learning-based CM and continuous-time reformulations

CM has become a central self-supervised principle for event-based optical flow. "Taming Contrast Maximization for Learning Sequential, Low-latency, Event-based Optical Flow" replaces the one-shot linear warp by iterative warping over a sequence of small time steps, in which each event is propagated step by step through the successive flow predictions, and averages the CM loss over all reference times and multiple temporal scales (Paredes-Vallés et al., 2023). The method uses a recurrent EV-FlowNet with ConvGRU modules, two-channel event-count images as input, and a full training loss equal to the multi-timescale CM objective with no additional smoothness priors. The paper reports that iterative warping reduces EPE by up to 45% relative to a single-step linear warp, that multi-reference evaluation beats endpoint-only evaluation by about 25% EPE, and that the multi-scale loss eliminates the need to hand-tune the timescale hyperparameter. It also reports EPE results on DSEC-Flow for the single-scale setting and on MVSEC outdoor_day1 for a transferred model, which is beaten only by the supervised E-RAFT.

A different learning-based extension injects a nonlinear trajectory prior. "Motion-prior Contrast Maximization for Dense Continuous-Time Motion Estimation" associates each pixel with a continuous-time trajectory and warps each event through a soft $k$-nearest-neighbor assignment to nearby pixel trajectories (Hamann et al., 2024). The self-supervised loss combines a contrast term based on the spatial-gradient magnitude of the IWE with a regularizer that penalizes the spatial gradient of the interpolated displacement field. The paper reports a 29% zero-shot improvement on EVIMO2 when self-supervised fine-tuning is added to a synthetically trained model, and states that, on DSEC optical flow, the method elevates a simple U-Net to state-of-the-art performance among self-supervised methods.

Recent work also questions whether pure CM is sufficient for continuous-time flow. "From Contrast to Consistency: Rethinking Event-based Continuous-Time Optical Flow Estimation" states two core limitations of pure CM: loss of temporal continuity, because all events are projected to a single $t_{\rm ref}$, and loss of structural coherence, because single-slice aggregation can over-sharpen some edges while mis-aligning others under complex motion (Hu et al., 25 May 2026). The proposed Spatio-temporal Structural Consistency (STSC) replaces the single IWE with a Volumetric Warped Events tensor and introduces Local Structural Consistency and Trajectory Continuity losses. A plausible implication is that CM is increasingly being treated not as a complete continuous-time solution, but as a strong alignment primitive that may need temporal and structural constraints in dense settings.

6. Systems, segmentation, and hardware acceleration

CM has also been extended beyond single-motion estimation. "Iterative Event-based Motion Segmentation by Variational Contrast Maximization" introduces a greedy segmentation scheme in which standard CM first estimates a dominant motion, then classifies events according to the magnitude of the first variation of the contrast objective with respect to event position, and iteratively removes the events supporting that motion (Yamaki et al., 25 Apr 2025). The paper reports state-of-the-art results on moving-object detection benchmarks, with an improvement of over 30%, as well as sharp motion-compensated edge-like images for the segmented clusters.

At the system level, "CMax-SLAM" combines a front-end that estimates angular velocity from local CM on event slices with a back-end bundle-adjustment stage that optimizes spline control poses using local and global IWEs (Guo et al., 2024). The paper describes the first event-based rotation-only bundle adjustment approach and the first event-based rotation-only SLAM system comprising a front-end and a back-end. On synthetic and real-world datasets, it reports reductions in a proxy event-area reprojection measure after bundle adjustment, together with per-event runtime figures for the front-end and the back-end on an Intel i7.

CM is now also a hardware target. "FPGA-Based Hardware Architecture for Contrast Maximization in Event-Based Vision" maps event warping, bilinear voting, variance computation, and gradient-ascent parameter updates to a deeply pipelined FPGA architecture with multi-bank BRAM, a three-stage accumulation pipeline, and a read-and-zero mechanism (Filipkowski et al., 10 May 2026). The paper reports per-batch latency and event throughput on a Xilinx Kria KV260 for fixed batch sizes and iteration counts, and compares them with an Intel i5-11300H CPU and an Nvidia RTX 3050 Ti GPU. It describes this as the first hardware architecture enabling acceleration of CM algorithm computations and validates it in an event-based object-tracking application.

Taken together, these developments place CM in a broad methodological spectrum: a unifying objective for motion, depth, optical flow, segmentation, and SLAM; a source of principled but sometimes non-convex optimization problems; and an increasingly engineered component in recurrent learning systems, hybrid geometric-appearance models, volumetric estimators, and real-time embedded hardware.