Tri-frame Optical Flow (TROF)
- Tri-frame Optical Flow (TROF) is a multi-frame estimation method that utilizes triplet data to jointly infer bi-directional motion.
- It employs analytic weighted averaging for event data or recurrent neural fusion for RGB frames to resolve motion ambiguities and occlusions.
- The approach improves accuracy and robustness over two-frame models, achieving state-of-the-art performance on benchmarks like MVSEC, Sintel, and KITTI.
Tri-frame Optical Flow (TROF) refers to optical flow estimation methods that leverage information from three consecutive frames or events to jointly infer bi-directional motion centered on the middle frame. Two notable implementations of this concept are seen in the event-based Triplet-Matching Optical Flow algorithm (Shiba et al., 2022) and the RGB frame-based TROF module at the heart of VideoFlow (Shi et al., 2023). Both use three samples (frames or events) to resolve ambiguity, enhance accuracy, and improve occlusion handling beyond conventional two-frame or packet-based schemes.
1. Definition and Core Principles
Tri-frame Optical Flow (TROF) denotes an architectural and algorithmic design that utilizes triplets of temporally adjacent observations to improve optical flow estimation. Rather than treating frame pairs (or event packets) in isolation, TROF explicitly ties together past, present, and future information:
- Bi-directional flow estimation: For a center frame $I_t$ (or a reference event), both backward ($V_{t \to t-1}$) and forward ($V_{t \to t+1}$) flows are estimated jointly, sharing a common origin and allowing co-consistent motion reasoning.
- Triplet association: Detection of correspondences forms a three-point structure in space-time, enforcing collinearity and constant-velocity hypotheses at a local spatial-temporal scale.
- Iterative or analytic fusion: Flow information from both temporal directions is fused, either through iterative neural refinement (as in RGB-based VideoFlow TROF) or analytic averaging of velocity estimates (as in event-based TROF).
This three-point approach reduces ambiguities, improves robustness in occluded or textureless areas, and supports explicit bi-directional consistency.
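The collinearity and constant-velocity conditions underlying this three-point structure can be sketched as a simple space-time test. The function name and the thresholds `eps_x`, `eps_t` below are illustrative choices, not values from either paper:

```python
import numpy as np

def is_constant_velocity_triplet(p1, p2, p3, eps_x=1.0, eps_t=1e-3):
    """Check whether three space-time points (x, y, t) are consistent with a
    single constant-velocity motion, i.e. collinear in space-time.

    Under constant velocity, consecutive displacements and time gaps match:
    x2 - x1 ~ x3 - x2 and t2 - t1 ~ t3 - t2.
    """
    x1, t1 = np.asarray(p1[:2], float), p1[2]
    x2, t2 = np.asarray(p2[:2], float), p2[2]
    x3, t3 = np.asarray(p3[:2], float), p3[2]
    spatial_ok = np.linalg.norm((x2 - x1) - (x3 - x2)) <= eps_x
    temporal_ok = abs((t2 - t1) - (t3 - t2)) <= eps_t
    return bool(spatial_ok and temporal_ok)

# A point moving at 2 px/ms, sampled every 1 ms, passes the test:
p1, p2, p3 = (0.0, 0.0, 0.000), (2.0, 0.0, 0.001), (4.0, 0.0, 0.002)
assert is_constant_velocity_triplet(p1, p2, p3)
```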
2. Mathematical Formulation and Algorithmic Structure
Event-based TROF
In Fast Event-based Optical Flow Estimation by Triplet Matching (Shiba et al., 2022), the input is a chronologically ordered stream of events $e_k = (\mathbf{x}_k, t_k, p_k)$ from an event camera, where $t_k$ is the timestamp, $\mathbf{x}_k$ the 2D pixel location, and $p_k$ the polarity.
- Triplet Construction:
For each reference event $e_3$, the algorithm searches backward for two prior events $e_2$ and $e_1$ forming a triplet $(e_1, e_2, e_3)$:
  - $e_2$ is a spatio-temporal neighbor of $e_3$, constrained by a small spatial displacement $\mathbf{x}_3 - \mathbf{x}_2$ and time gap $t_3 - t_2 > 0$;
  - $e_1$ is a neighbor of $e_2$ satisfying $\mathbf{x}_2 - \mathbf{x}_1 = \mathbf{x}_3 - \mathbf{x}_2$ and $t_2 - t_1 \approx t_3 - t_2$,

imposing spatial and temporal collinearity under constant velocity.
- Velocity and Weighting: For each triplet, the local velocity is
$$v = \frac{\mathbf{x}_3 - \mathbf{x}_1}{t_3 - t_1},$$
and each estimate is weighted by a Gaussian on the timing prediction error, where the constant-velocity prediction of the third timestamp is $\hat{t}_3 = t_2 + (t_2 - t_1)$:
$$w = \exp\!\left(-\frac{(t_3 - \hat{t}_3)^2}{2\sigma^2}\right).$$
The event-wise flow estimate is the weighted average over matched triplets:
$$\hat{v} = \frac{\sum_i w_i \, v_i}{\sum_i w_i}.$$
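A minimal numerical sketch of this analytic fusion step follows. The function name, input layout, and default `sigma` are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def fuse_triplet_velocities(triplets, sigma=1.0):
    """Fuse per-triplet velocity estimates into one event-wise flow vector.

    Each triplet is ((x1, t1), (x2, t2), (x3, t3)) with x a 2D position.
    Velocity is the space-time slope of the triplet; each estimate is
    weighted by a Gaussian on the timing prediction error: under constant
    velocity, t3 is predicted as t2 + (t2 - t1).
    """
    velocities, weights = [], []
    for (x1, t1), (x2, t2), (x3, t3) in triplets:
        v = (np.asarray(x3, float) - np.asarray(x1, float)) / (t3 - t1)
        t3_pred = t2 + (t2 - t1)              # constant-velocity prediction
        w = np.exp(-((t3 - t3_pred) ** 2) / (2 * sigma ** 2))
        velocities.append(v)
        weights.append(w)
    velocities, weights = np.array(velocities), np.array(weights)
    return (weights[:, None] * velocities).sum(axis=0) / weights.sum()

# One perfectly constant-velocity triplet (1 px per time unit along x):
triplets = [(((0.0, 0.0), 0.0), ((1.0, 0.0), 1.0), ((2.0, 0.0), 2.0))]
print(fuse_triplet_velocities(triplets))  # -> [1. 0.]
```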
RGB Frame-based TROF (VideoFlow)
The TROF module in VideoFlow (Shi et al., 2023) processes three consecutive RGB frames $I_{t-1}, I_t, I_{t+1}$. The joint architecture operates as follows:
- Feature Extraction: Each frame is passed through a shared CNN backbone to produce per-frame features $F_{t-1}, F_t, F_{t+1}$.
- Cost Volume Construction: All-pairs correlation volumes encode similarities between the center frame and each adjacent frame: $C_{t \to t-1} = F_t F_{t-1}^{\top}$ and $C_{t \to t+1} = F_t F_{t+1}^{\top}$.
- Recurrent Fusion: The TROF module maintains two flows (backward and forward) per center pixel, referentially aligned and iteratively refined using a recurrent unit (SKBlock). At each iteration, local patches of the cost volume and flow are encoded, fused by a small CNN, and the recurrent state is updated. Flows are predicted as residual updates,
$$V^{(k+1)} = V^{(k)} + \Delta V^{(k)},$$
with the bi-directional outputs $V_{t \to t-1}$ and $V_{t \to t+1}$ jointly regressed.
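The shape of this iterative, bi-directional refinement loop can be sketched with toy stand-ins. The `lookup` and `update` functions below are placeholders for the real cost-volume sampling and SKBlock, and all sizes and update rules are illustrative only:

```python
import numpy as np

def refine_bidirectional_flow(corr_bwd, corr_fwd, iters=12):
    """Schematic TROF-style recurrent refinement (toy stand-ins).

    Maintains backward and forward flow fields for the center frame and
    refines both jointly: each step fuses cost-volume lookups from both
    temporal directions with a shared hidden state, then predicts
    residual flow updates.
    """
    H, W = corr_bwd.shape[:2]
    flow_bwd = np.zeros((H, W, 2))
    flow_fwd = np.zeros((H, W, 2))
    hidden = np.zeros((H, W, 4))                # shared recurrent state (toy size)

    def lookup(corr, flow):                     # stand-in for cost-volume lookup
        return corr[..., :2] + 0.1 * flow

    def update(state, feat):                    # stand-in for the SKBlock/GRU
        state = np.tanh(state + np.concatenate([feat, -feat], axis=-1))
        return state, 0.5 * state[..., :2]      # residual flow from the state

    for _ in range(iters):
        feat = lookup(corr_bwd, flow_bwd) + lookup(corr_fwd, flow_fwd)
        hidden, delta = update(hidden, feat)
        flow_bwd = flow_bwd - delta             # both directions share one state,
        flow_fwd = flow_fwd + delta             # enforcing co-consistent motion
    return flow_bwd, flow_fwd
```

The design point the sketch preserves is that the two flow fields are never refined in isolation: every update is computed from evidence in both temporal directions.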
3. Architectural Features and Design Choices
| Feature | Event-based TROF (Shiba et al., 2022) | RGB-based TROF (Shi et al., 2023) |
|---|---|---|
| Input | Stream of brightness-change events | Three consecutive RGB frames |
| Spatial granularity | Pixel-level, event-scale | Feature map, $1/8$ downsampling |
| Temporal modeling | µs precision, event-by-event | Frame sequence, context window |
| Flow representation | Explicit per-event velocities | Dense, bi-directional ($V_{t \to t-1}$, $V_{t \to t+1}$) |
| Refinement | Weighted analytic averaging | Recurrent neural fusion (12 steps) |
| Parallelization | Inner triplet search loop | Patch-sampling & CNNs per iteration |
| Occlusion handling | Consistent triplet association, bidirectional | Iterative, context-fused flow |
Distinctive design aspects in RGB-based TROF include dual cost volumes, joint alignment of bi-directional features, and recurrent refinement via SKBlocks and lightweight encoders. Event-based TROF emphasizes analytic, non-iterative explicit flow with minimal bookkeeping and a focus on causality and speed.
4. Empirical Performance and Evaluation
The event-based TROF method achieves high throughput and competitive accuracy on standard neuromorphic benchmarks:
- MVSEC (outdoor_day1): AEE ≈ 0.94 px, Out>3px ≈ 3.1%
- Comparative: MultiCM (packet-based, model-based) achieves AEE ≈ 0.30 px on the same benchmark.
- Computational efficiency: On an 8-core CPU (Apple M1), event processing exceeds 10 kHz, handling 300k events in milliseconds (Shiba et al., 2022).
Qualitatively, TROF produces sharp warped event images and is reliable along high-contrast edges, but it is less constrained in homogeneous areas. Errors increase with longer time windows, as the constant-velocity assumption breaks down.
The RGB-based TROF within VideoFlow demonstrates state-of-the-art performance:
- Sintel (final/clean): AEPE of 1.649 / 0.991, representing error reductions of 15.1% (final) and 7.6% (clean) relative to the best previously published results (1.943 / 1.073, FlowFormer++).
- KITTI-2015: F1-all error 3.65%, a 19.2% reduction relative to the prior state-of-the-art (4.52% from FlowFormer++) (Shi et al., 2023).
Ablation studies show that joint bi-directional estimation and recurrent fusion reduce endpoint error across all datasets compared to two-frame baselines.
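The metrics quoted above are standard and easy to state precisely. A short sketch of AEE and the KITTI-style outlier rate (function names are illustrative; the 3 px / 5% thresholds are the usual KITTI F1-all convention):

```python
import numpy as np

def aee(flow_pred, flow_gt):
    """Average Endpoint Error: mean Euclidean distance between
    predicted and ground-truth flow vectors, in pixels."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1).mean()

def outlier_rate(flow_pred, flow_gt, px=3.0, rel=0.05):
    """KITTI-style F1-all: fraction of pixels whose endpoint error
    exceeds both `px` pixels and `rel` times the ground-truth magnitude."""
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    mag = np.linalg.norm(flow_gt, axis=-1)
    return ((err > px) & (err > rel * mag)).mean()

# A uniform (3, 4) px error gives AEE = 5.0 and flags every pixel:
gt = np.zeros((4, 4, 2))
pred = gt + np.array([3.0, 4.0])
print(aee(pred, gt), outlier_rate(pred, gt))  # -> 5.0 1.0
```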
5. Assumptions, Limitations, and Practical Considerations
Both event-based and RGB-based TROF approaches impose locality and motion consistency constraints:
- Local constant-velocity (linearity) assumption: Accurate within small triplet apertures, but degrades under highly nonuniform or large-scale motions.
- Quantized directions (event-based): The triplet search is restricted to integer-pixel offsets along eight discrete directions (four axis-aligned, four diagonal).
- Refractory period: Ensures that no two edge events of the same polarity co-occur within a short temporal margin.
Event-based TROF’s accuracy is sensitive to edge density and local motion statistics. In cases of low texture, ambiguous associations may remain under-constrained. RGB-based TROF, by contrast, leverages high-capacity neural representations and cost volumes, allowing it to better handle complex motion, occlusions, and global scene structure via iterative cross-frame fusion.
Implementation details such as time-sorted maps, polarity separation, sliding windows, and strict per-event sequentiality (event-based TROF) or context mechanism and SKBlock updates (RGB-based TROF) are critical to real-world deployment and parallelization potential.
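One of those implementation details, the time-sorted map with polarity separation, can be sketched as a per-pixel store of recent timestamps; the class name, `depth` parameter, and exact bookkeeping are illustrative assumptions, and the paper's actual data structure may differ:

```python
from collections import defaultdict

class TimeSortedEventMap:
    """Per-pixel, per-polarity store of the most recent event timestamps.

    Keeping only the latest few timestamps at each (x, y, polarity) cell
    lets the backward triplet search probe candidate neighbors in O(1)
    per cell instead of scanning the whole event stream, and the fixed
    depth acts as a sliding window over time.
    """
    def __init__(self, depth=4):
        self.depth = depth
        self.store = defaultdict(list)   # (x, y, polarity) -> timestamps

    def insert(self, x, y, polarity, t):
        """Record an event; assumes events arrive in chronological order."""
        cell = self.store[(x, y, polarity)]
        cell.append(t)
        if len(cell) > self.depth:       # drop the oldest timestamp
            cell.pop(0)

    def recent(self, x, y, polarity):
        """Timestamps at a cell, most recent first (empty if none seen)."""
        return list(reversed(self.store.get((x, y, polarity), [])))

# Five events at one pixel with depth 4 keep only the last four:
m = TimeSortedEventMap(depth=4)
for t in [1, 2, 3, 4, 5]:
    m.insert(0, 0, 1, t)
print(m.recent(0, 0, 1))  # -> [5, 4, 3, 2]
```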
6. Connections to Broader Multi-frame and Event-based Flow Estimation
TROF extends two-frame approaches by introducing explicit temporal context from both before and after the estimation instant. In event-based vision, it marks an event-by-event, nearly real-time alternative to packet-based algorithms, dispensing with buffering and iterative optimization. The RGB-based formulation, situated at VideoFlow's core, demonstrates how triplet-wise motion estimation can be combined and propagated across longer temporal sequences via additional modules (e.g., the Motion Propagation module, MOP), with the triplet structure forming the atomic motion estimator.
A plausible implication is that tri-frame (or triplet-based) reasoning may offer a systematic route to improving optical flow estimation when temporal ambiguity, occlusion, or lack of local evidence hampers accuracy in conventional pairwise settings.
7. Summary and Significance
Tri-frame Optical Flow (TROF) unifies a set of algorithms and neural architectures characterized by the use of triplet temporal correspondences to enhance optical flow estimation. It is applicable both to sparse, asynchronous event streams and dense RGB video frames. Combining bi-directional reasoning, explicit triplet association, and either analytic or neural fusion mechanisms, TROF offers improved accuracy, robustness to occlusion, and computational efficiency. As evidenced in benchmark evaluations, these methods set new baselines for event-based motion estimation and multi-frame video flow (Shiba et al., 2022, Shi et al., 2023).