Tri-frame Optical Flow (TROF)
- Tri-frame Optical Flow (TROF) is a multi-frame estimation method that utilizes triplet data to jointly infer bi-directional motion.
- It employs analytic weighted averaging for event data or recurrent neural fusion for RGB frames to resolve motion ambiguities and occlusions.
- The approach improves accuracy and robustness over two-frame models, achieving state-of-the-art performance on benchmarks like MVSEC, Sintel, and KITTI.
Tri-frame Optical Flow (TROF) refers to optical flow estimation methods that leverage information from three consecutive frames or events to jointly infer bi-directional motion centered on the middle frame. Two notable implementations of this concept are seen in the event-based Triplet-Matching Optical Flow algorithm (Shiba et al., 2022) and the RGB frame-based TROF module at the heart of VideoFlow (Shi et al., 2023). Both use three samples (frames or events) to resolve ambiguity, enhance accuracy, and improve occlusion handling beyond conventional two-frame or packet-based schemes.
1. Definition and Core Principles
Tri-frame Optical Flow (TROF) denotes an architectural and algorithmic design that utilizes triplets of temporally adjacent observations to improve optical flow estimation. Rather than treating frame pairs (or event packets) in isolation, TROF explicitly ties together past, present, and future information:
- Bi-directional flow estimation: For a center frame $I_t$ (or a reference event), both backward ($V_{t \to t-1}$) and forward ($V_{t \to t+1}$) flows are estimated jointly, sharing a common origin and allowing co-consistent motion reasoning.
- Triplet association: Detection of correspondences forms a three-point structure in space-time, enforcing collinearity and constant-velocity hypotheses at a local spatial-temporal scale.
- Iterative or analytic fusion: Flow information from both temporal directions is fused, either through iterative neural refinement (as in RGB-based VideoFlow TROF) or analytic averaging of velocity estimates (as in event-based TROF).
This three-point approach reduces ambiguities, improves robustness in occluded or textureless areas, and supports explicit bi-directional consistency.
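The collinearity and constant-velocity conditions underlying this three-point structure can be sketched as a simple space-time test. The function name and the thresholds `eps_x`, `eps_t` below are illustrative choices, not values from either paper:

```python
import numpy as np

def is_constant_velocity_triplet(p1, p2, p3, eps_x=1.0, eps_t=1e-3):
    """Check whether three space-time points (x, y, t) are consistent with a
    single constant-velocity motion, i.e. collinear in space-time.

    Under constant velocity, consecutive displacements and time gaps match:
    x2 - x1 ~ x3 - x2 and t2 - t1 ~ t3 - t2.
    """
    x1, t1 = np.asarray(p1[:2], float), p1[2]
    x2, t2 = np.asarray(p2[:2], float), p2[2]
    x3, t3 = np.asarray(p3[:2], float), p3[2]
    spatial_ok = np.linalg.norm((x2 - x1) - (x3 - x2)) <= eps_x
    temporal_ok = abs((t2 - t1) - (t3 - t2)) <= eps_t
    return bool(spatial_ok and temporal_ok)

# A point moving at 2 px/ms, sampled every 1 ms, passes the test:
p1, p2, p3 = (0.0, 0.0, 0.000), (2.0, 0.0, 0.001), (4.0, 0.0, 0.002)
assert is_constant_velocity_triplet(p1, p2, p3)
```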
2. Mathematical Formulation and Algorithmic Structure
Event-based TROF
In Fast Event-based Optical Flow Estimation by Triplet Matching (Shiba et al., 2022), the input is a chronologically ordered stream of events $e_k = (\mathbf{x}_k, t_k, p_k)$ from an event camera, where $t_k$ is the timestamp, $\mathbf{x}_k$ the 2D pixel location, and $p_k$ the polarity.
- Triplet Construction:
For each reference event $e_3$, the algorithm searches backward for two prior events $e_2$ and $e_1$ forming a triplet $(e_1, e_2, e_3)$:
  - $e_2$ is a spatio-temporal neighbor of $e_3$, constrained by a small spatial displacement $\mathbf{x}_3 - \mathbf{x}_2$ and time gap $t_3 - t_2 > 0$;
  - $e_1$ is a neighbor of $e_2$ satisfying $\mathbf{x}_2 - \mathbf{x}_1 = \mathbf{x}_3 - \mathbf{x}_2$ and $t_2 - t_1 \approx t_3 - t_2$,

imposing spatial and temporal collinearity under constant velocity.
- Velocity and Weighting: For each triplet, the local velocity is
$$v = \frac{\mathbf{x}_3 - \mathbf{x}_1}{t_3 - t_1},$$
and each estimate is weighted by a Gaussian on the timing prediction error, where the constant-velocity prediction of the third timestamp is $\hat{t}_3 = t_2 + (t_2 - t_1)$:
$$w = \exp\!\left(-\frac{(t_3 - \hat{t}_3)^2}{2\sigma^2}\right).$$
The event-wise flow estimate is the weighted average over matched triplets:
$$\hat{v} = \frac{\sum_i w_i \, v_i}{\sum_i w_i}.$$
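A minimal numerical sketch of this analytic fusion step follows. The function name, input layout, and default `sigma` are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def fuse_triplet_velocities(triplets, sigma=1.0):
    """Fuse per-triplet velocity estimates into one event-wise flow vector.

    Each triplet is ((x1, t1), (x2, t2), (x3, t3)) with x a 2D position.
    Velocity is the space-time slope of the triplet; each estimate is
    weighted by a Gaussian on the timing prediction error: under constant
    velocity, t3 is predicted as t2 + (t2 - t1).
    """
    velocities, weights = [], []
    for (x1, t1), (x2, t2), (x3, t3) in triplets:
        v = (np.asarray(x3, float) - np.asarray(x1, float)) / (t3 - t1)
        t3_pred = t2 + (t2 - t1)              # constant-velocity prediction
        w = np.exp(-((t3 - t3_pred) ** 2) / (2 * sigma ** 2))
        velocities.append(v)
        weights.append(w)
    velocities, weights = np.array(velocities), np.array(weights)
    return (weights[:, None] * velocities).sum(axis=0) / weights.sum()

# One perfectly constant-velocity triplet (1 px per time unit along x):
triplets = [(((0.0, 0.0), 0.0), ((1.0, 0.0), 1.0), ((2.0, 0.0), 2.0))]
print(fuse_triplet_velocities(triplets))  # -> [1. 0.]
```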
RGB Frame-based TROF (VideoFlow)
The TROF module in VideoFlow (Shi et al., 2023) processes three consecutive RGB frames $I_{t-1}, I_t, I_{t+1}$. The joint architecture operates as follows:
- Feature Extraction: Each frame is passed through a shared CNN backbone to produce per-frame features $F_{t-1}, F_t, F_{t+1}$.
- Cost Volume Construction: All-pairs correlation volumes encode similarities between the center frame and each adjacent frame: $C_{t \to t-1} = F_t F_{t-1}^{\top}$ and $C_{t \to t+1} = F_t F_{t+1}^{\top}$.
- Recurrent Fusion: The TROF module maintains two flows (backward and forward) per center pixel, referentially aligned and iteratively refined using a recurrent unit (SKBlock). At each iteration, local patches of the cost volume and flow are encoded, fused by a small CNN, and the recurrent state is updated. Flows are predicted as residual updates,
$$V^{(k+1)} = V^{(k)} + \Delta V^{(k)},$$
with the bi-directional outputs $V_{t \to t-1}$ and $V_{t \to t+1}$ jointly regressed.
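The shape of this iterative, bi-directional refinement loop can be sketched with toy stand-ins. The `lookup` and `update` functions below are placeholders for the real cost-volume sampling and SKBlock, and all sizes and update rules are illustrative only:

```python
import numpy as np

def refine_bidirectional_flow(corr_bwd, corr_fwd, iters=12):
    """Schematic TROF-style recurrent refinement (toy stand-ins).

    Maintains backward and forward flow fields for the center frame and
    refines both jointly: each step fuses cost-volume lookups from both
    temporal directions with a shared hidden state, then predicts
    residual flow updates.
    """
    H, W = corr_bwd.shape[:2]
    flow_bwd = np.zeros((H, W, 2))
    flow_fwd = np.zeros((H, W, 2))
    hidden = np.zeros((H, W, 4))                # shared recurrent state (toy size)

    def lookup(corr, flow):                     # stand-in for cost-volume lookup
        return corr[..., :2] + 0.1 * flow

    def update(state, feat):                    # stand-in for the SKBlock/GRU
        state = np.tanh(state + np.concatenate([feat, -feat], axis=-1))
        return state, 0.5 * state[..., :2]      # residual flow from the state

    for _ in range(iters):
        feat = lookup(corr_bwd, flow_bwd) + lookup(corr_fwd, flow_fwd)
        hidden, delta = update(hidden, feat)
        flow_bwd = flow_bwd - delta             # both directions share one state,
        flow_fwd = flow_fwd + delta             # enforcing co-consistent motion
    return flow_bwd, flow_fwd
```

The design point the sketch preserves is that the two flow fields are never refined in isolation: every update is computed from evidence in both temporal directions.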
3. Architectural Features and Design Choices
| Feature | Event-based TROF (Shiba et al., 2022) | RGB-based TROF (Shi et al., 2023) |
|---|---|---|
| Input | Stream of brightness-change events | Three consecutive RGB frames |
| Spatial granularity | Pixel-level, event-scale | Feature map, $1/8$ downsampling |
| Temporal modeling | µs precision, event-by-event | Frame sequence, context window |
| Flow representation | Explicit per-event velocities | Dense, bi-directional ($V_{t \to t-1}$, $V_{t \to t+1}$) |
| Refinement | Weighted analytic averaging | Recurrent neural fusion (12 steps) |
| Parallelization | Inner triplet search loop | Patch-sampling & CNNs per iteration |
| Occlusion handling | Consistent triplet association, bidirectional | Iterative, context-fused flow |
Distinctive design aspects in RGB-based TROF include dual cost volumes, joint alignment of bi-directional features, and recurrent refinement via SKBlocks and lightweight encoders. Event-based TROF emphasizes analytic, non-iterative explicit flow with minimal bookkeeping and a focus on causality and speed.
4. Empirical Performance and Evaluation
The event-based TROF method achieves high throughput and competitive accuracy on standard neuromorphic benchmarks:
- MVSEC (outdoor_day1): AEE ≈ 0.94 px, Out>3px ≈ 3.1%
- Comparative: MultiCM (packet-based, model-based) achieves AEE ≈ 0.30 px on the same benchmark.
- Computational efficiency: On an 8-core CPU (Apple M1), event processing exceeds 10 kHz, handling 300k events in milliseconds (Shiba et al., 2022).
Qualitatively, TROF produces sharp warped event images and is reliable along high-contrast edges, but it is less constrained in homogeneous areas. Errors increase with longer time windows, as the constant-velocity assumption breaks down.
The RGB-based TROF within VideoFlow demonstrates state-of-the-art performance:
- Sintel (final/clean): AEPE of 1.649 / 0.991, representing error reductions of 15.1% (final) and 7.6% (clean) relative to the best previously published results (1.943 / 1.073, FlowFormer++).
- KITTI-2015: F1-all error 3.65%, a 19.2% reduction relative to the prior state-of-the-art (4.52% from FlowFormer++) (Shi et al., 2023).
Ablation studies show that joint bi-directional estimation and recurrent fusion reduce endpoint error across all datasets compared to two-frame baselines.
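The metrics quoted above are standard and easy to state precisely. A short sketch of AEE and the KITTI-style outlier rate (function names are illustrative; the 3 px / 5% thresholds are the usual KITTI F1-all convention):

```python
import numpy as np

def aee(flow_pred, flow_gt):
    """Average Endpoint Error: mean Euclidean distance between
    predicted and ground-truth flow vectors, in pixels."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1).mean()

def outlier_rate(flow_pred, flow_gt, px=3.0, rel=0.05):
    """KITTI-style F1-all: fraction of pixels whose endpoint error
    exceeds both `px` pixels and `rel` times the ground-truth magnitude."""
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    mag = np.linalg.norm(flow_gt, axis=-1)
    return ((err > px) & (err > rel * mag)).mean()

# A uniform (3, 4) px error gives AEE = 5.0 and flags every pixel:
gt = np.zeros((4, 4, 2))
pred = gt + np.array([3.0, 4.0])
print(aee(pred, gt), outlier_rate(pred, gt))  # -> 5.0 1.0
```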
5. Assumptions, Limitations, and Practical Considerations
Both event-based and RGB-based TROF approaches impose locality and motion consistency constraints:
- Local constant-velocity (linearity) assumption: Accurate within small triplet apertures, but degrades under highly nonuniform or large-scale motions.
- Quantized directions (event-based): The triplet search is restricted to integer-pixel offsets along eight discrete directions (four axis-aligned, four diagonal).
- Refractory period: Ensures that no two edge events of the same polarity co-occur within a short temporal margin.
Event-based TROF’s accuracy is sensitive to edge density and local motion statistics. In cases of low texture, ambiguous associations may remain under-constrained. RGB-based TROF, by contrast, leverages high-capacity neural representations and cost volumes, allowing it to better handle complex motion, occlusions, and global scene structure via iterative cross-frame fusion.
Implementation details such as time-sorted maps, polarity separation, sliding windows, and strict per-event sequentiality (event-based TROF) or context mechanism and SKBlock updates (RGB-based TROF) are critical to real-world deployment and parallelization potential.
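One of those implementation details, the time-sorted map with polarity separation, can be sketched as a per-pixel store of recent timestamps; the class name, `depth` parameter, and exact bookkeeping are illustrative assumptions, and the paper's actual data structure may differ:

```python
from collections import defaultdict

class TimeSortedEventMap:
    """Per-pixel, per-polarity store of the most recent event timestamps.

    Keeping only the latest few timestamps at each (x, y, polarity) cell
    lets the backward triplet search probe candidate neighbors in O(1)
    per cell instead of scanning the whole event stream, and the fixed
    depth acts as a sliding window over time.
    """
    def __init__(self, depth=4):
        self.depth = depth
        self.store = defaultdict(list)   # (x, y, polarity) -> timestamps

    def insert(self, x, y, polarity, t):
        """Record an event; assumes events arrive in chronological order."""
        cell = self.store[(x, y, polarity)]
        cell.append(t)
        if len(cell) > self.depth:       # drop the oldest timestamp
            cell.pop(0)

    def recent(self, x, y, polarity):
        """Timestamps at a cell, most recent first (empty if none seen)."""
        return list(reversed(self.store.get((x, y, polarity), [])))

# Five events at one pixel with depth 4 keep only the last four:
m = TimeSortedEventMap(depth=4)
for t in [1, 2, 3, 4, 5]:
    m.insert(0, 0, 1, t)
print(m.recent(0, 0, 1))  # -> [5, 4, 3, 2]
```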
6. Connections to Broader Multi-frame and Event-based Flow Estimation
TROF extends two-frame approaches by introducing explicit temporal context from both before and after the estimation instant. In event-based vision, it marks an event-by-event, nearly real-time alternative to packet-based algorithms, dispensing with buffering and iterative optimization. The RGB-based formulation, situated at VideoFlow's core, demonstrates how triplet-wise motion estimation can be combined and propagated across longer temporal sequences via additional modules (e.g., the Motion Propagation module, MOP), with the triplet structure forming the atomic motion estimator.
A plausible implication is that tri-frame (or triplet-based) reasoning may offer a systematic route to improving optical flow estimation when temporal ambiguity, occlusion, or lack of local evidence hampers accuracy in conventional pairwise settings.
7. Summary and Significance
Tri-frame Optical Flow (TROF) unifies a set of algorithms and neural architectures characterized by the use of triplet temporal correspondences to enhance optical flow estimation. It is applicable both to sparse, asynchronous event streams and dense RGB video frames. Combining bi-directional reasoning, explicit triplet association, and either analytic or neural fusion mechanisms, TROF offers improved accuracy, robustness to occlusion, and computational efficiency. As evidenced in benchmark evaluations, these methods set new baselines for event-based motion estimation and multi-frame video flow (Shiba et al., 2022, Shi et al., 2023).