TeFlow: Real-Time Multi-Frame Scene Flow Estimation

Updated 27 February 2026

TeFlow is a self-supervised algorithm that generates robust, temporally stable supervisory signals for real-time scene flow estimation from LiDAR data.
It employs a novel temporal ensembling strategy that aggregates motion cues from multi-frame LiDAR sequences to overcome occlusions and sensor noise.
Quantitative evaluations indicate that TeFlow achieves accuracy comparable to optimization-based methods while offering up to a 150× speedup, making it practical for autonomous driving applications.

TeFlow is a self-supervised algorithm for feed-forward scene flow estimation that enables multi-frame supervision by mining temporally consistent supervisory signals from sequences of LiDAR point clouds. Unlike traditional two-frame self-supervised approaches, which often fail under occlusions and sensor noise due to unreliable correspondences, TeFlow introduces a temporal ensembling strategy to form robust, temporally stable guidance signals for training feed-forward neural networks. This advancement permits real-time inference while achieving accuracy previously associated only with slower, optimization-based test-time approaches (Zhang et al., 22 Feb 2026).

1. Motivation and Context

Standard self-supervised scene flow estimation from LiDAR data typically relies on two-frame point correspondences between $P_t$ and $P_{t+1}$. This approach is fragile in the presence of occlusions, sensor noise, and sparsely observed objects (e.g., dynamic or articulated entities), leading to instability and degraded flow prediction. In contrast, multi-frame supervision utilizes motion cues from multiple temporal contexts, enhancing the stability and consistency of supervision, especially under challenging real-world conditions.

Feed-forward approaches train a unified neural network on large unlabeled datasets, producing scene flow predictions in a single forward pass and supporting real-time deployment. However, they have historically underperformed in dynamic or ambiguous scenarios because they rely on two-frame objectives only. Optimization-based methods (such as NSFP, Floxels, EulerFlow) excel in accuracy by enforcing multi-frame constraints through iterative test-time fitting, at a computational cost of minutes to hours per scene. TeFlow bridges this gap by providing multi-frame supervision in a feed-forward pipeline, achieving both high accuracy and low latency (Zhang et al., 22 Feb 2026).

2. Temporal Ensembling for Multi-frame Supervision

TeFlow's core innovation is its temporal ensembling strategy, which aggregates the most temporally consistent motion cues from a candidate pool spanning multiple frames. The algorithm maintains an explicit candidate pool of clusterwise motion hypotheses drawn from both the network's internal predictions and external geometric cues derived using nearest-neighbor searches across dynamic clusters over a temporal window.

Let $P_t \in \mathbb{R}^{N_t \times 3}$ denote the LiDAR point cloud at time $t$, with $P_{t,d}$ and $P_{t,s}$ its dynamic and static subsets, and let $\mathcal{C}_j \subset P_{t,d}$ be the $j$-th dynamic cluster at $t$. For each cluster:

Internal candidate: $\hat f_{\mathcal C_j} = \frac{1}{|\mathcal C_j|}\sum_{p_i\in\mathcal C_j} \hat F_{t \to t+1}(p_i)$, the mean predicted flow over the cluster.
External candidates: For each frame $t' \in \{t-h,\ldots,t-1,t+1\}$ and each selected $p_k \in \mathcal{C}_j$, set $f^{t'}_{\mathcal{C}_j,k} = \frac{\mathrm{NN}(p_k,\,P_{t',d}) - p_k}{t' - t}$, keeping the Top-$K$ points with the largest displacements.
Candidate pool: Construct $\mathcal F_{\mathcal C_j} = \{\hat f_{\mathcal C_j}\} \cup \{f^{t'}_{\mathcal C_j,k}\}$.
Directional consistency: Build a binary matrix $\mathbf M$ whose entry $\mathbf M_{ab}$ is $1$ if the cosine similarity between candidates $a$ and $b$ exceeds a threshold $\tau_{\text{cos}}$, and $0$ otherwise.
Reliability weighting: $w_i = \gamma^{m_i}(1 + \|\mathbf f_i\|^2)$ for candidate $i$ at temporal distance $m_i$.
Consensus: Compute $\mathbf S = \mathbf M\,\mathbf w$, select the consensus index $a^\dagger=\arg\max_i \mathbf S_i$, and aggregate $\bar f_{\mathcal{C}_j}$ as the reliability-weighted average of the candidates that agree with $a^\dagger$.
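
The external-candidate step in the list above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the dictionary layout of neighbouring frames, and the default `top_k` are assumptions made for illustration, and a brute-force nearest-neighbour search stands in for whatever spatial index a real pipeline would use.

```python
import numpy as np

def nearest_neighbor(query, points):
    """Return the row of `points` (M, 3) closest to `query` (3,). Brute force."""
    d2 = np.sum((points - query) ** 2, axis=1)
    return points[np.argmin(d2)]

def external_candidates(cluster, frames_dyn, t, top_k=2):
    """Build external flow candidates for one dynamic cluster at frame t.

    cluster    : (n, 3) points of the cluster C_j
    frames_dyn : dict {t': (M, 3) dynamic points P_{t',d}} (hypothetical layout)
    Returns a list of (flow_vector, temporal_distance) pairs.
    """
    out = []
    for tp, pts in frames_dyn.items():
        # Displacement to the nearest neighbour, normalised to one frame step.
        flows = np.stack([(nearest_neighbor(p, pts) - p) / (tp - t) for p in cluster])
        # Keep the Top-K points with the largest displacement magnitude.
        order = np.argsort(-np.linalg.norm(flows, axis=1))[:top_k]
        out.extend((flows[k], abs(tp - t)) for k in order)
    return out
```

For clusters with many points, a KD-tree (for example `scipy.spatial.cKDTree`) would replace the brute-force search; the division by $t' - t$ is what lets past and future frames vote in the same per-frame units.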

This strategy robustly filters noisy or erroneous point correspondences by prioritizing temporally consistent cues, reliably supervising network training in the presence of occlusions and dynamic scenes.
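
The consensus step (directional consistency, reliability weighting, and aggregation) can be sketched as follows. This is a hedged illustration under stated assumptions: the values of `gamma` and `tau_cos` are placeholders rather than the paper's settings, and treating the internal candidate as having temporal distance 0 is a choice made here, not something the source specifies.

```python
import numpy as np

def temporal_ensemble(cands, dists, gamma=0.9, tau_cos=0.9):
    """Consensus flow for one cluster.

    cands : (n, 3) candidate flow vectors (internal + external)
    dists : (n,) temporal distance m_i of each candidate
    gamma, tau_cos : illustrative values, not taken from the source
    """
    cands = np.asarray(cands, dtype=float)
    dists = np.asarray(dists, dtype=float)
    norms = np.linalg.norm(cands, axis=1)
    unit = cands / np.maximum(norms, 1e-9)[:, None]
    # M[a, b] = 1 when candidates a and b point in a similar direction.
    M = (unit @ unit.T > tau_cos).astype(float)
    # w_i = gamma^{m_i} * (1 + ||f_i||^2)
    w = gamma ** dists * (1.0 + norms ** 2)
    S = M @ w                      # weighted support for each candidate
    a = int(np.argmax(S))          # consensus index
    agree = M[a] > 0
    agree[a] = True                # the winner always supports itself
    return (w[agree, None] * cands[agree]).sum(axis=0) / w[agree].sum()
```

Because a candidate that disagrees in direction contributes nothing to row $a^\dagger$ of $\mathbf M$, an outlier correspondence (for instance one caused by an occlusion) is excluded from the average instead of merely being down-weighted.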

3. Network Training and Loss Functions

TeFlow employs cluster-based and pointwise supervisory losses. The primary terms are:

Dynamic cluster loss ( $\mathcal{L}_{\text{dcls}}$ ):

$\mathcal{L}_{\text{dcls}} = \frac{1}{|\mathcal{P}_C|} \sum_j \sum_{p_i\in \mathcal{C}_j} \|\hat F_{t \to t+1}(p_i) - \bar f_{\mathcal{C}_j}\|_2^2 + \frac{1}{N_c} \sum_j \left( \frac{1}{|\mathcal{C}_j|} \sum_{p_i\in \mathcal{C}_j} \|\hat F_{t \to t+1}(p_i) - \bar f_{\mathcal{C}_j}\|_2^2 \right)$

Static loss ( $\mathcal{L}_{\text{static}}$ ):

$\mathcal{L}_{\text{static}} = \frac{1}{|P_{t,s}|} \sum_{p_i\in P_{t,s}} \|\hat F_{t \to t+1}(p_i)\|_1$

Geometric consistency loss ( $\mathcal{L}_{\text{geom}}$ ): Multi-frame Chamfer distances enforce that the warped $P_t$ aligns with temporally neighboring frames.
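
The source does not give the exact form of the geometric term, so the following is only a sketch of one plausible reading. It assumes a symmetric Chamfer distance on squared Euclidean distances and a constant-velocity warp (the per-frame flow scaled by the frame offset $k$); both choices, and the function names, are assumptions made for illustration.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # (N, M) matrix of squared distances between every pair of points.
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def geometric_loss(p_t, flow, neighbours):
    """Average Chamfer distance between the warped P_t and neighbouring frames.

    p_t        : (N, 3) points at frame t
    flow       : (N, 3) predicted per-frame flow
    neighbours : dict {k: (M, 3) points of frame t+k} (hypothetical layout)
    Assumes constant velocity, so P_t is warped to frame t+k by k * flow.
    """
    terms = [chamfer_distance(p_t + k * flow, pts) for k, pts in neighbours.items()]
    return float(np.mean(terms))
```

The dense distance matrix keeps the example short but needs $O(NM)$ memory, so it is suitable only for small point sets.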

The total training objective is:

$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{dcls}} + \lambda_s\,\mathcal{L}_{\text{static}} + \lambda_g\,\mathcal{L}_{\text{geom}}$

Empirical ablation analysis shows that dropping either $\mathcal{L}_{\text{dcls}}$ or $\mathcal{L}_{\text{static}}$ leads to substantial degradation in dynamic object accuracy; full loss composition is necessary for optimal performance (Zhang et al., 22 Feb 2026).
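
The dynamic cluster loss, the static loss, and their combination can be written out directly from the formulas above. The NumPy sketch below only evaluates the loss values; actual training would use an automatic-differentiation framework. The function names and the default weights `lam_s` and `lam_g` are placeholders, and every cluster is assumed to contain at least one point.

```python
import numpy as np

def dynamic_cluster_loss(pred, cluster_ids, targets):
    """L_dcls: pointwise term plus cluster-balanced term.

    pred        : (N, 3) predicted flow for all clustered dynamic points
    cluster_ids : (N,) integer cluster index of each point
    targets     : (Nc, 3) consensus flow per cluster
    """
    err = np.sum((pred - targets[cluster_ids]) ** 2, axis=1)  # squared L2 per point
    point_term = err.mean()                                   # average over all clustered points
    counts = np.bincount(cluster_ids, minlength=len(targets))
    sums = np.bincount(cluster_ids, weights=err, minlength=len(targets))
    cluster_term = np.mean(sums / counts)                     # average of per-cluster means
    return point_term + cluster_term

def static_loss(pred_static):
    """L_static: mean L1 norm of the flow predicted for static points."""
    return np.abs(pred_static).sum(axis=1).mean()

def total_loss(l_dcls, l_static, l_geom, lam_s=1.0, lam_g=1.0):
    """L_total = L_dcls + lam_s * L_static + lam_g * L_geom (weights are placeholders)."""
    return l_dcls + lam_s * l_static + lam_g * l_geom
```

The second term of the dynamic cluster loss averages per-cluster means, so a small cluster (such as a distant pedestrian) contributes as much as a large one (such as a nearby truck), whereas the first term is dominated by clusters with many points.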

4. Computational Complexity and Inference Performance

TeFlow's inference consists of a single forward pass through a 4D-CNN or voxel-based backbone, such as ΔFlow, for each sequence. Temporal ensembling incurs a complexity of $O(N_c K h \log N)$ for nearest-neighbor search and matrix operations over all $N_c$ clusters, which is negligible compared with the backbone's computation. Overall, the method achieves real-time inference.

By contrast, optimization-based approaches like Floxels and EulerFlow rely on per-scene iterative optimization, with runtime costs scaling as $O(N_{\text{iter}} \times \text{model\_eval})$, where $N_{\text{iter}}$ can be $10^4$–$10^5$. Reported runtimes for optimization-based methods range from 12 minutes (FastNSF) to 24 minutes (Floxels) to 24 hours (EulerFlow) per scene.

TeFlow achieves substantial speed improvement: inference on Argoverse 2 sequences (five-frame input) requires 8 s per sequence ($\sim$157 frames) on ten NVIDIA RTX 3080 GPUs; training takes 15–20 h. On nuScenes, inference time is 7 s per $\sim$200-frame sequence, representing a $150\times$ speedup over optimization-based state of the art (Zhang et al., 22 Feb 2026).

5. Quantitative and Qualitative Evaluation

Extensive evaluation on the Argoverse 2 and nuScenes datasets demonstrates competitive accuracy and efficiency.

On Argoverse 2:

Method	# Frames	Runtime	3-way EPE (cm) ↓	Dyn. Norm. EPE ↓
Floxels†	13	24 min	3.57	0.154
SeFlow++	3	10 s	4.40	0.264
TeFlow	5	8 s	3.57	0.205

† Optimization-based state of the art.

On nuScenes validation (10 Hz LiDAR):

Method	# Frames	Runtime	3-way EPE (cm) ↓	Dyn. Norm. EPE ↓
SeFlow++	3	7.5 s	6.13	0.509
TeFlow	5	7 s	4.64	0.395

Key measures:

3-way EPE: end-point error (EPE) averaged over three point categories: foreground dynamic, foreground static, and background static.
Dynamic bucket-normalized EPE (Dyn. Norm. EPE): error on moving points normalized by object speed, computed per semantic class.
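
To make the two measures concrete, the sketch below computes them from per-point predictions. It follows the one-line descriptions above and is not the official benchmark protocol: the mask and label arguments, the `min_speed` threshold separating moving from static points, and the decision to skip empty categories are all assumptions made for illustration.

```python
import numpy as np

def three_way_epe(pred, gt, is_foreground, is_dynamic):
    """Mean of the per-category EPE over FG-dynamic, FG-static, and background points."""
    fg = np.asarray(is_foreground, dtype=bool)
    dyn = np.asarray(is_dynamic, dtype=bool)
    epe = np.linalg.norm(pred - gt, axis=1)        # per-point end-point error
    buckets = [fg & dyn, fg & ~dyn, ~fg]           # background is assumed static
    return float(np.mean([epe[m].mean() for m in buckets if m.any()]))

def dynamic_normalized_epe(pred, gt, class_ids, min_speed=0.05):
    """Simplified speed-normalized EPE, averaged over semantic classes.

    min_speed is a hypothetical threshold on the ground-truth flow magnitude.
    """
    epe = np.linalg.norm(pred - gt, axis=1)
    speed = np.linalg.norm(gt, axis=1)
    scores = []
    for c in np.unique(class_ids):
        moving = (class_ids == c) & (speed > min_speed)
        if moving.any():
            # Error relative to how fast objects of this class actually move.
            scores.append(epe[moving].mean() / speed[moving].mean())
    return float(np.mean(scores))
```

Under this reading, a normalized score of 1.0 for a class corresponds to an error as large as the motion itself, which is what predicting zero flow for every moving point of that class would produce.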

Ablation results indicate that pooling both internal and external candidates performs best (Dyn. Norm. EPE: 0.265, vs. 0.455 for internal-only and 0.321 for external-only). TeFlow achieves its best results with five-frame inputs, and accuracy degrades when either major loss component is omitted.

Qualitatively, TeFlow predicts smooth and temporally coherent scene flow for small, dynamic, and occluded objects, accurately models articulated motion (e.g., trucks), and preserves trajectory consistency through complex motions, outperforming two-frame and geometric-only baselines (Zhang et al., 22 Feb 2026).

6. Implementation, Limitations, and Availability

TeFlow uses clusters (groupings of dynamic points) within LiDAR frames to perform temporally consistent supervision. The approach depends on effective clustering and nearest-neighbor searches, which, although computationally inexpensive relative to network inference, still scale with the number of clusters and candidates. The model backbone is compatible with architectures such as ΔFlow.

A plausible implication is that scenarios with extreme sparsity, highly irregular motion, or clustering failure may challenge the consensus-based ensembling strategy's effectiveness.

Open-source code and trained weights are provided at https://github.com/KTH-RPL/OpenSceneFlow, facilitating reproduction and application in real-world settings.

7. Significance and Impact

TeFlow advances self-supervised scene flow estimation by providing reliable multi-frame supervision in a real-time, feed-forward framework. It achieves up to 33% improvement on dynamic, challenging benchmarks (Argoverse 2 and nuScenes), with accuracy comparable to state-of-the-art optimization-based methods but up to $150\times$ faster. This enables practical deployment of accurate scene flow estimation for applications requiring low latency and robustness, including autonomous driving and robotics (Zhang et al., 22 Feb 2026).

References (1)

TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation (2026)