
TeFlow: Real-Time Multi-Frame Scene Flow Estimation

Updated 27 February 2026
  • TeFlow is a self-supervised algorithm that generates robust, temporally stable supervisory signals for real-time scene flow estimation from LiDAR data.
  • It employs a novel temporal ensembling strategy that aggregates motion cues from multi-frame LiDAR sequences to overcome occlusions and sensor noise.
  • Quantitative evaluations indicate that TeFlow achieves comparable accuracy to optimization-based methods while offering up to 150× speedup for practical autonomous driving applications.

TeFlow is a self-supervised algorithm for feed-forward scene flow estimation that enables multi-frame supervision by mining temporally consistent supervisory signals from sequences of LiDAR point clouds. Unlike traditional two-frame self-supervised approaches, which often fail under occlusions and sensor noise due to unreliable correspondences, TeFlow introduces a temporal ensembling strategy to form robust, temporally stable guidance signals for training feed-forward neural networks. This advancement permits real-time inference while achieving accuracy previously associated only with slower, optimization-based test-time approaches (Zhang et al., 22 Feb 2026).

1. Motivation and Context

Standard self-supervised scene flow estimation from LiDAR data typically relies on two-frame point correspondences between $P_t$ and $P_{t+1}$. This approach is fragile in the presence of occlusions, sensor noise, and sparsely observed objects (e.g., dynamic or articulated entities), leading to instability and degraded flow prediction. In contrast, multi-frame supervision utilizes motion cues from multiple temporal contexts, enhancing the stability and consistency of supervision, especially under challenging real-world conditions.

Feed-forward approaches train a unified neural network on large unlabeled datasets, producing scene flow predictions in a single forward pass and supporting real-time deployment. However, they have historically underperformed in dynamic or ambiguous scenarios due to reliance on only two-frame objectives. Optimization-based methods (such as NSFP, Floxels, EulerFlow) excel in accuracy by enforcing multi-frame constraints through expensive test-time iterative fitting, but at high computational cost, requiring minutes to hours per scene. TeFlow bridges this gap by providing multi-frame supervision in a feed-forward pipeline, achieving both high accuracy and low latency (Zhang et al., 22 Feb 2026).

2. Temporal Ensembling for Multi-frame Supervision

TeFlow's core innovation is its temporal ensembling strategy, which aggregates the most temporally consistent motion cues from a candidate pool spanning multiple frames. The algorithm maintains an explicit candidate pool of clusterwise motion hypotheses drawn from both the network's internal predictions and external geometric cues derived using nearest-neighbor searches across dynamic clusters over a temporal window.

Let $P_t \in \mathbb{R}^{N_t \times 3}$ denote the LiDAR point cloud at time $t$, $P_{t,d}$ its dynamic subset, and $\mathcal{C}_j \subset P_{t,d}$ the $j$th dynamic cluster at $t$. For each cluster:

  1. Internal candidate: $\hat f_{\mathcal C_j} = \frac{1}{|\mathcal C_j|}\sum_{p_i\in\mathcal C_j} \hat F_{t \to t+1}(p_i)$, the mean predicted flow over the cluster.
  2. External candidates: For each frame $t' \in \{t-h,\ldots,t-1,t+1\}$ and each selected $p_k \in \mathcal{C}_j$ with large displacement, set $f^{t'}_{\mathcal{C}_j,k} = \frac{\mathrm{NN}(p_k,\,P_{t',d}) - p_k}{t' - t}$, keeping the Top-$K$ largest displacements.
  3. Candidate pool: Construct $\mathcal F_{\mathcal C_j} = \{\hat f_{\mathcal C_j}\} \cup \{f^{t'}_{\mathcal C_j,k}\}$.
  4. Directional consistency: Build a binary matrix $\mathbf M$ with entry $1$ where the cosine similarity between two candidates exceeds a threshold $\tau_{\text{cos}}$, and $0$ otherwise.
  5. Reliability weighting: $w_i = \gamma^{m_i}(1 + \|\mathbf f_i\|^2)$ for candidate $i$ at temporal distance $m_i$.
  6. Consensus: Compute $\mathbf S = \mathbf M\,\mathbf w$, select the consensus index $a^\dagger=\arg\max_i \mathbf S_i$, and aggregate $\bar f_{\mathcal{C}_j}$ via a reliability-weighted average over the candidates that agree with $a^\dagger$.

This strategy robustly filters noisy or erroneous point correspondences by prioritizing temporally consistent cues, reliably supervising network training in the presence of occlusions and dynamic scenes.
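Steps 4 to 6 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `ensemble_cluster_flow` and the default values of `tau_cos` and `gamma` are hypothetical, and the candidate pool is assumed to have been built already (internal mean first, external nearest-neighbor candidates after it).

```python
import numpy as np

def ensemble_cluster_flow(candidates, temporal_dist, tau_cos=0.9, gamma=0.9):
    """Consensus flow for one cluster from a pool of candidate flow vectors.

    candidates:    (n, 3) candidate flows for the cluster.
    temporal_dist: (n,) temporal distance m_i of each candidate.
    tau_cos, gamma: placeholder values; the source does not state them.
    """
    f = np.asarray(candidates, dtype=float)
    m = np.asarray(temporal_dist, dtype=float)

    # Step 4: binary directional-consistency matrix M.
    norms = np.linalg.norm(f, axis=1, keepdims=True)
    unit = f / np.maximum(norms, 1e-9)
    M = (unit @ unit.T > tau_cos).astype(float)

    # Step 5: reliability weights w_i = gamma^{m_i} * (1 + ||f_i||^2).
    w = gamma ** m * (1.0 + norms[:, 0] ** 2)

    # Step 6: scores S = M w, then pick the best-supported candidate.
    S = M @ w
    a = int(np.argmax(S))

    # Weighted average over the candidates that agree with the anchor.
    agree = M[a] > 0
    agree[a] = True  # the anchor always counts, even if its norm is ~0
    return (w[agree, None] * f[agree]).sum(axis=0) / w[agree].sum()

pool = np.array([[1.0, 0.0, 0.0],    # internal candidate
                 [1.1, 0.05, 0.0],   # external, one frame away
                 [0.9, -0.05, 0.0],  # external, one frame away
                 [-2.0, 0.0, 0.0]])  # outlier from a wrong correspondence
consensus = ensemble_cluster_flow(pool, temporal_dist=[0, 1, 1, 2])
```

In this toy pool the outlier has the largest individual weight because of its magnitude, yet the three mutually consistent candidates collect a higher summed score, so the outlier is excluded from the average.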

3. Network Training and Loss Functions

TeFlow employs cluster-based and pointwise supervisory losses. The primary terms are:

  • Dynamic cluster loss ($\mathcal{L}_{\text{dcls}}$):

$$\mathcal{L}_{\text{dcls}} = \frac{1}{|\mathcal{P}_C|} \sum_j \sum_{p_i\in \mathcal{C}_j} \|\hat F_{t \to t+1}(p_i) - \bar f_{\mathcal{C}_j}\|_2^2 + \frac{1}{N_c} \sum_j \left( \frac{1}{|\mathcal{C}_j|} \sum_{p_i\in \mathcal{C}_j} \|\hat F_{t \to t+1}(p_i) - \bar f_{\mathcal{C}_j}\|_2^2 \right)$$

  • Static loss ($\mathcal{L}_{\text{static}}$):

$$\mathcal{L}_{\text{static}} = \frac{1}{|P_{t,s}|} \sum_{p_i\in P_{t,s}} \|\hat F_{t \to t+1}(p_i)\|_1$$

  • Geometric consistency loss ($\mathcal{L}_{\text{geom}}$): Multi-frame Chamfer distances enforce that the warped $P_t$ aligns with temporally neighboring frames.
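As a rough illustration of the geometric term, the sketch below computes a symmetric Chamfer distance between the warped $P_t$ and its two neighbors. Several choices here are assumptions rather than details from the source: the function names, the use of squared distances, the restriction to frames $t-1$ and $t+1$, and warping backwards with the negated flow (which presumes constant velocity over one frame). The brute-force distance matrix is only suitable for small point sets.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance (squared Euclidean) between (n, 3) and (m, 3) sets."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # (n, m) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def multi_frame_geometric_loss(p_t, flow, p_next, p_prev):
    """Chamfer alignment of the warped P_t with frames t+1 and t-1.

    Assumption of this sketch: the flow towards t-1 is the negated
    forward flow, i.e. constant velocity over one frame interval.
    """
    forward = chamfer_distance(p_t + flow, p_next)
    backward = chamfer_distance(p_t - flow, p_prev)
    return forward + backward

p_t = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
true_flow = np.tile([1.0, 0.0, 0.0], (3, 1))
p_next, p_prev = p_t + true_flow, p_t - true_flow

loss_correct = multi_frame_geometric_loss(p_t, true_flow, p_next, p_prev)
loss_zero_flow = multi_frame_geometric_loss(p_t, np.zeros_like(p_t), p_next, p_prev)
```

With the correct flow the warped cloud coincides with both neighbors and the loss vanishes, whereas predicting zero flow leaves a residual in both temporal directions.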

The total training objective is:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{dcls}} + \lambda_s\,\mathcal{L}_{\text{static}} + \lambda_g\,\mathcal{L}_{\text{geom}}$$

Empirical ablation analysis shows that dropping either $\mathcal{L}_{\text{dcls}}$ or $\mathcal{L}_{\text{static}}$ leads to substantial degradation in dynamic object accuracy; the full loss composition is necessary for optimal performance (Zhang et al., 22 Feb 2026).
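The dynamic cluster and static terms can be written compactly in NumPy. The following is a sketch under stated assumptions, not the released training code: the function names, the array layout (one integer cluster id per dynamic point, every cluster non-empty), and the default weights `lambda_s` and `lambda_g` are all hypothetical.

```python
import numpy as np

def dynamic_cluster_loss(flow, cluster_ids, targets):
    """flow: (N, 3) predicted flow on dynamic points.
    cluster_ids: (N,) integer cluster index j of each point.
    targets: (N_c, 3) consensus flow per cluster.
    """
    # Squared L2 error of every point against its cluster's consensus flow.
    sq_err = ((flow - targets[cluster_ids]) ** 2).sum(axis=1)
    # First term: average over all clustered points (large clusters dominate).
    point_term = sq_err.mean()
    # Second term: average of per-cluster means (every cluster weighs the same).
    per_cluster = [sq_err[cluster_ids == j].mean() for j in np.unique(cluster_ids)]
    return point_term + float(np.mean(per_cluster))

def static_loss(flow_static):
    """Mean L1 norm of the flow predicted on static points."""
    return np.abs(flow_static).sum(axis=1).mean()

def total_loss(l_dcls, l_static, l_geom, lambda_s=1.0, lambda_g=1.0):
    # The lambda values are placeholders; the source does not list them.
    return l_dcls + lambda_s * l_static + lambda_g * l_geom

flow = np.array([[1.0, 0, 0], [1.0, 0, 0], [1.0, 0, 0], [0.0, 0, 0]])
ids = np.array([0, 0, 0, 1])
targets = np.array([[1.0, 0, 0], [2.0, 0, 0]])
l_dcls = dynamic_cluster_loss(flow, ids, targets)
l_static = static_loss(np.array([[0.1, -0.2, 0.0], [0.0, 0.0, 0.0]]))
```

In this example the single-point cluster is the only one with an error. The point-averaged term dilutes it across four points, while the cluster-averaged term gives it half the weight, which shows how the second term keeps small objects from being ignored.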

4. Computational Complexity and Inference Performance

TeFlow's inference consists of a single forward pass through a 4D-CNN or voxel-based backbone, such as ΔFlow, for each sequence. Temporal ensembling incurs a complexity of $O(N_c K h \log N)$ per frame (for nearest-neighbor search and matrix operations over $N_c$ clusters, $K$ selected points, and $h$ history frames), which is negligible compared with the backbone's computation. Overall, the method achieves real-time inference.

By contrast, optimization-based approaches like Floxels and EulerFlow rely on per-scene iterative optimization, with runtime costs scaling as $O(N_{\text{iter}} \times \text{model\_eval})$, where $N_{\text{iter}}$ can be $10^4$ to $10^5$. Reported runtimes for optimization-based methods range from 12 minutes (FastNSF) to 24 minutes (Floxels) to 24 hours (EulerFlow) per scene.

TeFlow achieves substantial speed improvement: inference on Argoverse 2 sequences (five-frame input) requires 8 s per sequence ($\sim$157 frames) on ten NVIDIA RTX 3080 GPUs; training takes 15–20 h. On nuScenes, inference time is 7 s per $\sim$200-frame sequence, representing a $150\times$ speedup over optimization-based state of the art (Zhang et al., 22 Feb 2026).
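To relate these per-sequence figures to a per-frame budget, the short calculation below divides the reported runtimes by the approximate frame counts. The result is an amortized figure on the reported hardware, not a measured single-GPU latency, and the 100 ms budget assumes a 10 Hz sensor; the helper name is hypothetical.

```python
def per_frame_ms(seconds_per_sequence, frames_per_sequence):
    """Amortized processing time per frame in milliseconds."""
    return 1000.0 * seconds_per_sequence / frames_per_sequence

argoverse2_ms = per_frame_ms(8.0, 157)  # about 51 ms per frame
nuscenes_ms = per_frame_ms(7.0, 200)    # 35 ms per frame

# Assumption: a 10 Hz LiDAR delivers a new sweep every 100 ms.
budget_ms = 1000.0 / 10.0
within_budget = argoverse2_ms < budget_ms and nuscenes_ms < budget_ms
```

Both amortized figures fall below the 100 ms interval between sweeps, which is consistent with the real-time claim.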

5. Quantitative and Qualitative Evaluation

Extensive evaluation on the Argoverse 2 and nuScenes datasets demonstrates competitive accuracy and efficiency.

On Argoverse 2:

| Method | # Frames | Runtime | 3-way EPE (cm) ↓ | Dyn. Norm. EPE ↓ |
|---|---|---|---|---|
| Floxels† | 13 | 24 min | 3.57 | 0.154 |
| SeFlow++ | 3 | 10 s | 4.40 | 0.264 |
| TeFlow | 5 | 8 s | 3.57 | 0.205 |

† Optimization-based state of the art.

On nuScenes validation (10 Hz LiDAR):

| Method | # Frames | Runtime | 3-way EPE (cm) ↓ | Dyn. Norm. EPE ↓ |
|---|---|---|---|---|
| SeFlow++ | 3 | 7.5 s | 6.13 | 0.509 |
| TeFlow | 5 | 7 s | 4.64 | 0.395 |

Key measures:

  • 3-way EPE: mean of the endpoint errors computed separately over three point categories: foreground dynamic, foreground static, and background static.
  • Dynamic bucket-normalized EPE: endpoint error on dynamic points normalized by object speed, reported across semantic classes.
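The 3-way EPE can be sketched as follows, assuming boolean foreground and dynamic masks are already available and that each of the three categories contains at least one point. The function name and mask arguments are hypothetical, and how the benchmark decides which points count as dynamic is not covered here.

```python
import numpy as np

def three_way_epe(pred, gt, is_foreground, is_dynamic):
    """Average of the mean endpoint errors over the three point categories."""
    epe = np.linalg.norm(pred - gt, axis=1)  # per-point endpoint error
    masks = [
        is_foreground & is_dynamic,    # foreground dynamic
        is_foreground & ~is_dynamic,   # foreground static
        ~is_foreground,                # background (treated as static)
    ]
    return float(np.mean([epe[m].mean() for m in masks]))

gt = np.zeros((4, 3))
pred = np.array([[0.3, 0, 0], [0.1, 0, 0], [0, 0.02, 0], [0, 0, 0.04]])
fg = np.array([True, True, False, False])
dyn = np.array([True, False, False, False])
score = three_way_epe(pred, gt, fg, dyn)  # mean of 0.3, 0.1, and 0.03
```

Because each category contributes equally, the numerous background points cannot mask errors on the few dynamic points.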

Ablation results indicate that pooling both internal and external candidates performs best (Dyn. Norm. EPE: 0.265, versus 0.455 for internal-only and 0.321 for external-only). TeFlow achieves its best results with five-frame inputs and degrades when either major loss component is omitted.

Qualitatively, TeFlow predicts smooth and temporally coherent scene flow for small, dynamic, and occluded objects, accurately models articulated motion (e.g., trucks), and preserves trajectory consistency through complex motions, outperforming two-frame and geometric-only baselines (Zhang et al., 22 Feb 2026).

6. Implementation, Limitations, and Availability

TeFlow uses clusters (groupings of dynamic points) within LiDAR frames to perform temporally consistent supervision. The approach depends on effective clustering and nearest-neighbor searches, which, although computationally inexpensive relative to network inference, still scale with the number of clusters and candidates. The model backbone is compatible with architectures such as ΔFlow.

A plausible implication is that scenarios with extreme sparsity, highly irregular motion, or clustering failure may challenge the consensus-based ensembling strategy's effectiveness.

Open-source code and trained weights are provided at https://github.com/KTH-RPL/OpenSceneFlow, facilitating reproduction and application in real-world settings.

7. Significance and Impact

TeFlow advances self-supervised scene flow estimation by providing reliable multi-frame supervision in a real-time, feed-forward framework. It achieves up to 33% improvement on dynamic, challenging benchmarks (Argoverse 2 and nuScenes), with accuracy comparable to state-of-the-art optimization-based methods but up to $150\times$ faster. This enables practical deployment of accurate scene flow estimation for applications requiring low latency and robustness, including autonomous driving and robotics (Zhang et al., 22 Feb 2026).
