
FlowNet3D: A Deep 3D Scene Flow Network

Updated 7 February 2026
  • The paper introduces an end-to-end deep learning framework that directly estimates per-point 3D scene flow on unordered point clouds using hierarchical feature aggregation and a novel flow embedding layer.
  • It employs specialized modules such as set convolutions for multi-scale feature extraction and learned upsampling layers to ensure accurate, permutation-invariant predictions.
  • The approach achieves significant performance gains on benchmarks like FlyingThings3D and KITTI, enabling robust applications in robotics, autonomous driving, and dynamic 3D reconstruction.

FlowNet3D is a deep neural architecture designed for per-point 3D scene flow estimation directly on unordered point clouds. Scene flow, the estimation of 3D motion vectors for each point in a dynamic scene, is fundamental for applications in robotics, autonomous driving, and dynamic 3D reconstruction. FlowNet3D pioneered an end-to-end learning approach that operates directly on point sets, in contrast to conventional methods relying on image data or voxelized inputs. This framework employs specialized hierarchical feature aggregation, a dedicated flow embedding layer, and learned upsampling for accurate, permutation-invariant prediction of scene flow on raw geometric data (Liu et al., 2018).

1. Network Architecture

FlowNet3D processes a pair of point clouds, typically denoted as the source P = \{x_i\} and target Q = \{y_j\}, each consisting of 3D coordinates (and optionally features such as RGB). The network consists of four cascaded stages:

  1. Hierarchical Feature Extraction (SetConv layers): Four levels of set convolution (sampling + grouping + per-point MLP + max-pooling) extract multi-scale, permutation-invariant features for each input. Each layer samples a subset of points using farthest point sampling, aggregates features over their local neighborhoods, and increases the receptive field by progressively increasing the search radius (the sampling and grouping steps are sketched in code after Table 1).
  2. Flow Embedding Layer: For each subsampled point in P, this layer finds spatial neighbors from Q within a radius, concatenates their features and relative coordinates, and aggregates the result using an MLP followed by max-pooling. This yields high-dimensional "flow-aware" features encoding both local geometry and likely motion correspondence.
  3. Set UpConv (Feature Propagation): Four upsampling layers propagate features from sparse to dense points by learning interpolation via set convolution logic. Skip connections are used to preserve multi-scale context.
  4. Per-Point Regression: The final upsampled features are fed to a linear MLP, yielding a 3D flow vector for each original source point.

A summary of the principal layers appears below (Table 1):

| Layer Type | Neighborhood Radius | Key Operation |
|---|---|---|
| Set Conv (×4) | 0.5, 1.0, 2.0, 4.0 | Local feature pooling |
| Flow Embedding | 5.0 | Feature mixing with neighbor search |
| Set UpConv (×4) | 4.0, 2.0, 1.0, 0.5 | Learned upsampling |
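
For concreteness, the sampling and grouping steps that open each SetConv layer can be sketched as follows. This is a minimal NumPy illustration with hypothetical function names and sizes, not the released implementation; the per-point MLP and max-pooling that complete the layer are omitted.

```python
# Sketch of the sampling + grouping steps inside a SetConv layer (illustrative only).
import numpy as np

def farthest_point_sample(points, n_samples):
    """Greedily pick n_samples points that are mutually far apart."""
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    dist = np.full(n, np.inf)
    selected[0] = np.random.randint(n)           # arbitrary seed point
    for k in range(1, n_samples):
        d = np.linalg.norm(points - points[selected[k - 1]], axis=1)
        dist = np.minimum(dist, d)               # distance to the already-selected set
        selected[k] = np.argmax(dist)            # farthest remaining point
    return selected

def radius_group(points, centers, radius, max_neighbors=32):
    """For each sampled center, gather indices of points within `radius`."""
    groups = []
    for c in centers:
        d = np.linalg.norm(points - c, axis=1)
        groups.append(np.where(d <= radius)[0][:max_neighbors])
    return groups

# Example: downsample a cloud of 2048 points to 512 centers with radius 0.5
cloud = np.random.rand(2048, 3).astype(np.float32)
center_idx = farthest_point_sample(cloud, 512)
neighborhoods = radius_group(cloud, cloud[center_idx], radius=0.5)
```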

2. Flow Embedding and Feature Propagation

The core of FlowNet3D is the flow embedding layer, which, for each point x_i in P, searches for neighbors in Q within radius r and processes the triple [f_i, g_j, y_j - x_i] for each neighbor using an MLP h(·):

e_i = \max_{j: \|y_j - x_i\| \leq r} h([f_i, g_j, y_j - x_i])

This mechanism blends learned feature similarity and spatial offset, allowing the network to implicitly estimate point correspondences and local motion patterns. The Set UpConv layers upsample these sparse features to all input points using analogous neighborhood feature aggregation, but with fixed "center" positions.
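
A minimal NumPy sketch of this aggregation follows. A single random ReLU projection stands in for the learned MLP h(·), and the shapes and names are illustrative, not the released implementation.

```python
# Sketch of the flow embedding aggregation e_i = max_j h([f_i, g_j, y_j - x_i]).
import numpy as np

def flow_embedding(P, F, Q, G, radius, weight):
    """Max-pool h([f_i, g_j, y_j - x_i]) over target neighbors within `radius`."""
    n, d_out = P.shape[0], weight.shape[1]
    E = np.zeros((n, d_out), dtype=np.float32)
    for i, (x, f) in enumerate(zip(P, F)):
        mask = np.linalg.norm(Q - x, axis=1) <= radius
        if not mask.any():
            continue                              # no neighbor found: zero embedding
        # concatenate [f_i, g_j, y_j - x_i] for every neighbor j
        feats = np.concatenate(
            [np.tile(f, (mask.sum(), 1)), G[mask], Q[mask] - x], axis=1)
        E[i] = np.maximum(feats @ weight, 0).max(axis=0)   # ReLU "MLP" + max-pool
    return E

# Toy example: 128 source / 128 target points with 16-dim features
P, Q = np.random.rand(128, 3), np.random.rand(128, 3)
F, G = np.random.rand(128, 16), np.random.rand(128, 16)
W = np.random.rand(16 + 16 + 3, 64)               # stands in for the learned MLP
embeddings = flow_embedding(P, F, Q, G, radius=5.0, weight=W)
```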

These operations are permutation-invariant and end-to-end learnable, enabling direct application to raw point clouds without requiring rasterization or intermediate representations (Liu et al., 2018).

3. Loss Functions and Training Strategy

The original FlowNet3D is trained using a regression loss combining a smooth-L1 endpoint error with a cycle-consistency regularizer:

\mathcal{L}(\Theta) = \frac{1}{n_1} \sum_{i=1}^{n_1} \left[ \|d_i - d^*_i\|_1 + \lambda \|d'_i + d_i\|_2 \right]

where d_i is the predicted flow, d^*_i is the ground truth, and d'_i is the reverse flow estimate used for cycle consistency. The main training data is FlyingThings3D (synthetic), with aggressive augmentation strategies. The method generalizes to real LiDAR and RGB-D data from datasets such as KITTI without additional fine-tuning, demonstrating strong cross-domain generalization (Liu et al., 2018, Wang et al., 2019).
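
A sketch of this objective, with illustrative variable names and an arbitrary cycle weight, might look as follows (not the authors' training code):

```python
# Sketch of the endpoint + cycle-consistency objective (illustrative only).
import numpy as np

def smooth_l1(x, beta=1.0):
    """Element-wise smooth-L1 (Huber-style) penalty."""
    absx = np.abs(x)
    return np.where(absx < beta, 0.5 * absx**2 / beta, absx - 0.5 * beta)

def flownet3d_loss(d_pred, d_gt, d_rev, lam=0.3):
    """d_pred/d_gt: (n, 3) forward flows; d_rev: reverse flow from the warped
    cloud back to the source. lam is an illustrative cycle weight."""
    endpoint = smooth_l1(d_pred - d_gt).sum(axis=1)      # per-point endpoint term
    cycle = np.linalg.norm(d_rev + d_pred, axis=1)       # forward + reverse should cancel
    return (endpoint + lam * cycle).mean()
```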

FlowNet3D++ (Wang et al., 2019) further augments the loss with two geometric regularizers:

  • Point-to-Plane Distance Loss: Encourages each warped source point to lie on the tangent plane of its corresponding target point, measured along the target surface normal, echoing classical point-to-plane ICP objectives:

\mathcal{L}_{pp} = \frac{1}{N} \sum_{x_s \in \mathcal{X}_s} \left[ n(x_t)^\top (x_s + v(x_s) - x_t) \right]^2

  • Angular Alignment Loss: Penalizes deviation in direction between predicted and ground-truth motion:

\mathcal{L}_{cos} = \frac{1}{N} \sum_{i=1}^N \left[ 1 - \frac{v_i \cdot v_i^{gt}}{\|v_i\|\|v_i^{gt}\|} \right]

The combined loss is:

\mathcal{L} = \mathcal{L}_2 + \lambda_p \mathcal{L}_{pp} + \lambda_{cos} \mathcal{L}_{cos}

with recommended weights \lambda_p \approx 1.3 and \lambda_{cos} \approx 0.9.
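
The two regularizers can be sketched as below, assuming nearest-neighbor correspondences and precomputed target normals; the names and the correspondence lookup are simplified stand-ins for the paper's formulation.

```python
# Sketch of the point-to-plane and angular alignment regularizers (illustrative only).
import numpy as np

def point_to_plane_loss(src, flow, tgt, tgt_normals):
    """Squared distance of each warped source point to the tangent plane of its
    nearest target point (normals assumed precomputed)."""
    warped = src + flow
    nn = np.argmin(np.linalg.norm(warped[:, None, :] - tgt[None, :, :], axis=2), axis=1)
    residual = np.einsum('ij,ij->i', tgt_normals[nn], warped - tgt[nn])
    return np.mean(residual**2)

def angular_alignment_loss(flow_pred, flow_gt, eps=1e-8):
    """1 - cosine similarity between predicted and ground-truth flow vectors."""
    cos = np.einsum('ij,ij->i', flow_pred, flow_gt) / (
        np.linalg.norm(flow_pred, axis=1) * np.linalg.norm(flow_gt, axis=1) + eps)
    return np.mean(1.0 - cos)

# Combined objective, using the weights suggested above:
# total = l2_loss + 1.3 * point_to_plane_loss(...) + 0.9 * angular_alignment_loss(...)
```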

4. Evaluation and Quantitative Results

FlowNet3D achieves competitive accuracy on the FlyingThings3D and KITTI datasets.

FlyingThings3D (Test Split):

| Method | Acc@0.05 | Acc@0.1 | EPE (m) | ADE (deg) |
|---|---|---|---|---|
| FlowNet3D | 25.37% | 57.85% | 0.1694 | 22.6 |
| FlowNet3D++ | 30.3% | 63.4% | 0.1369 | 21.1 |

ACC: fraction of points with endpoint error below threshold; EPE: mean endpoint error; ADE: angular deviation error.
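
For reference, these metrics can be computed from predicted and ground-truth flow roughly as below; this is a sketch using a simple absolute accuracy threshold, and the papers' exact thresholding conventions may differ.

```python
# Sketch of EPE, Acc, and angular deviation computation (illustrative only).
import numpy as np

def scene_flow_metrics(flow_pred, flow_gt, acc_threshold=0.1, eps=1e-8):
    err = np.linalg.norm(flow_pred - flow_gt, axis=1)        # per-point endpoint error
    epe = err.mean()                                         # EPE (m)
    acc = (err < acc_threshold).mean()                       # Acc: fraction below threshold
    cos = np.einsum('ij,ij->i', flow_pred, flow_gt) / (
        np.linalg.norm(flow_pred, axis=1) * np.linalg.norm(flow_gt, axis=1) + eps)
    ade = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean()  # angular deviation (deg)
    return epe, acc, ade
```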

On KITTI, FlowNet3D (without fine-tuning) obtains 0.122 m EPE and 5.61% outlier rate; with refinement or fine-tuning, error decreases further (Liu et al., 2018, Wang et al., 2019).

FlowNet3D++ brings:

  • Up to a 20% relative improvement in accuracy (Acc) and a 19.2% reduction in EPE (geometry + RGB inputs).
  • On KITTI: 36% reduction in outlier rate and 22% reduction in EPE compared to baseline.

Dynamic 3D Reconstruction:

Using a global TSDF (truncated signed distance function) integration pipeline, FlowNet3D++ achieves:

  • Up to 35.2% lower mesh-to-mesh error than KillingFusion.
  • On the "Snoopy" sequence: KillingFusion 3.543 mm, +FlowNet3D++ 2.297 mm (Wang et al., 2019).

5. Practical Applications and Integration

FlowNet3D outputs have been integrated into several downstream tasks:

  1. Dynamic Scene Reconstruction: By warping points using predicted scene flow and integrating via standard TSDF algorithms (KinectFusion update), non-rigid, temporally consistent surface reconstructions can be built. The pipeline uses iterative rigid registration, scene flow prediction and warping, synthetic depth generation, volumetric fusion, and variational refinement of the non-rigid deformation field (Wang et al., 2019).
  2. Partial-Scan Registration: Scene flow enables robust alignment between partial point clouds, especially in settings where rigid ICP fails due to missing correspondences. A direct warp using predicted flow, followed by rigid alignment (SVD), achieves significantly lower registration error (see the sketch after this list).
  3. Motion Segmentation: Appending the flow vectors as extended features ([x, y, z, α dx, α dy, α dz]) enables clustering-based segmentation of different moving objects directly on LiDAR data.
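
A minimal sketch of the registration use case in item 2, assuming a predicted flow field is available: the source cloud is warped by the flow and a rigid transform is then fitted by the standard SVD (Kabsch) solution. Function names are illustrative.

```python
# Sketch of flow-guided rigid registration (illustrative only).
import numpy as np

def rigid_align(A, B):
    """Least-squares rotation R and translation t mapping points A onto B."""
    cA, cB = A.mean(axis=0), B.mean(axis=0)
    H = (A - cA).T @ (B - cB)                    # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ S @ U.T
    t = cB - R @ cA
    return R, t

def register_with_flow(source, flow_pred):
    """Warp the source by predicted scene flow, then fit a rigid transform to the warp."""
    warped = source + flow_pred                  # warped points approximate the target
    return rigid_align(source, warped)
```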

These demonstrate that direct point-cloud-based scene flow estimation can enable robust registration, segmentation, and dynamic fusion in real-world, sparsely sampled environments.

6. Architectural and Methodological Considerations

  • All layers are designed to be permutation-invariant and exploit local geometric structure.
  • The architecture is lightweight (≈15 MB), with runtime per frame between 18 and 100 ms.
  • Hyperparameters such as radius settings, MLP width, and inference-time re-sampling are essential for adapting the network to point cloud density.
  • In practice, the per-point normals required for the geometric loss are computed offline via PCA on local neighborhoods when training on FlyingThings3D (a sketch follows this list).
  • No increase in model parameters is introduced by the geometric augmentation in FlowNet3D++ compared to FlowNet3D.
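
A minimal sketch of PCA-based normal estimation of this kind; the neighborhood size and names are illustrative, and the brute-force nearest-neighbor search would normally be replaced by a spatial index.

```python
# Sketch of per-point normal estimation via local PCA (illustrative only).
import numpy as np

def estimate_normals(points, k=16):
    """Normal of each point = direction of least variance of its k nearest neighbors."""
    normals = np.zeros_like(points)
    for i, p in enumerate(points):
        idx = np.argsort(np.linalg.norm(points - p, axis=1))[:k]   # brute-force k-NN
        nbrs = points[idx] - points[idx].mean(axis=0)               # center the patch
        _, _, Vt = np.linalg.svd(nbrs, full_matrices=False)
        normals[i] = Vt[-1]                                         # smallest singular direction
    return normals
```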

Key limitations include the reliance on large-scale synthetic data for effective training and the need to tune neighborhood radii for different LiDAR types. Inference time can be affected by re-sampling for robustness.

7. Impact and Research Significance

FlowNet3D established the first end-to-end framework to estimate dense 3D scene flow on unordered point sets, introducing learning components—Flow Embedding Layer and Set UpConv Layer—that are now pillars of modern point-based geometric deep learning. The geometric extensions in FlowNet3D++ demonstrated that classical registration losses, when integrated into deep networks, yield measurable gains in accuracy and stability without increasing model size or architectural complexity (Wang et al., 2019, Liu et al., 2018).

These advances have underpinned subsequent development in 3D perception for robotics, autonomous driving, and dynamic environment modeling, with FlowNet3D and its variants serving as foundation architectures for research in 3D flow, registration, and segmentation on raw sensor data.
