
3D Object Flow in Dynamic Scenes

Updated 2 January 2026
  • 3D object flow is a structured representation that formalizes object-level motion by tracking the displacement of points on rigid, articulated, or deformable objects across time.
  • Estimation approaches leverage deep neural networks, joint optimization, and diffusion-based generative models to produce coherent flow fields, supporting robust motion prediction and instance-level clustering.
  • This concept is pivotal for applications such as segmentation, tracking, and robotic manipulation, bridging perception with actionable insights in complex, dynamic environments.

3D object flow is a structured representation of object-level 3D motion in dynamic scenes, formalizing the displacement of points on rigid, articulated, or deformable objects across time. This concept generalizes classical per-point scene flow by grouping or tracking correspondences at the object or instance level, enabling richer modeling of rigid-body motion, articulation, deformation, and scene interaction. Modern approaches to estimating 3D object flow have established it as a critical intermediate for perception, segmentation, tracking, robotic policy learning, and open-world manipulation.

1. Formal Definition and Mathematical Foundations

3D object flow describes the temporal evolution of a set of points $\{p_j^t\}$ on a given object from time $t$ to $t+1$, producing a displacement vector for each point:

$$\Delta p_j^t = p_j^{t+1} - p_j^t \in \mathbb{R}^3.$$

The set of particle-wise displacements over $T$ frames yields a time-indexed tensor of trajectories, typically encapsulated as

$$F_{3D} = \{P_{1:T}, V\}, \quad P_{1:T} \in \mathbb{R}^{T \times n \times 3}, \quad V \in \{0,1\}^{T \times n},$$

where $n$ is the number of tracked object-centric points and $V$ encodes temporal visibility. In more general occupancy flow settings, a 3D grid-based field $F_t(x)$ or $F(x, t)$ assigns motion vectors or flow to voxels or point samples in $\mathbb{R}^3$ at each timestep (Liu et al., 2024).
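
As a concrete illustration of this representation, the following minimal NumPy sketch stores the trajectory tensor and visibility mask and computes the per-frame displacement vectors defined above; the random data and the helper name `displacements` are illustrative, not taken from any cited implementation.

```python
import numpy as np

# F_3D = {P_{1:T}, V}:
#   P has shape (T, n, 3): positions of n tracked object points over T frames.
#   V has shape (T, n):    1 where a point is visible in a frame, 0 otherwise.
T, n = 8, 512                        # illustrative sequence length and point count
P = np.random.randn(T, n, 3)         # placeholder trajectories; real data comes from a 3D tracker
V = np.ones((T, n), dtype=np.uint8)  # here all points are marked visible

def displacements(P: np.ndarray) -> np.ndarray:
    """Per-point flow vectors Δp_j^t = p_j^{t+1} - p_j^t, shape (T-1, n, 3)."""
    return P[1:] - P[:-1]

dP = displacements(P)
assert dP.shape == (T - 1, n, 3)
```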

For rigid or articulated motion, a per-object flow may be parameterized by a sequence of SE(3) transformations or articulated joint parameters, while deformable objects admit non-rigid flow fields indexed by mesh vertices or point samples. Self-supervised, differentiable, and optimization-based methods may alternate between estimating per-point flows and object instance clustering to resolve object-level decomposition (Vacek et al., 2024, Teed et al., 2020, Shao et al., 2018).
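
For the rigid case, the flow induced on an object's points by a single SE(3) transform can be written in closed form. The sketch below, using an arbitrary example rotation and translation rather than parameters from any cited method, illustrates this parameterization:

```python
import numpy as np

def se3_flow(points: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Flow induced on an object's points by one rigid transform (R, t) in SE(3).

    points: (n, 3) object points at time t
    R:      (3, 3) rotation, t: (3,) translation mapping frame t to frame t+1
    Returns (n, 3) displacement vectors Δp = R p + t - p.
    """
    return points @ R.T + t - points

# Example: a 5-degree rotation about the z-axis plus a small translation.
theta = np.deg2rad(5.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.1, 0.0, 0.0])
flow = se3_flow(np.random.randn(256, 3), R, t)  # rigid per-point flow for this object
```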

2. Algorithmic Estimation and Learning Approaches

There are several competing strategies for predicting 3D object flow, including:

  • Direct regression via deep neural networks: Methods directly map frame pairs (e.g., RGB-D or reconstructed point clouds) to per-point or per-voxel flow fields, optionally outputting rigid-motion parameters or soft object assignments as intermediate representations (e.g., RAFT-3D (Teed et al., 2020); motion-based object segmentation (Shao et al., 2018)).
  • Joint optimization and object clustering: Approaches like Let-It-Flow (Vacek et al., 2024) alternate between global optimization of a data-fidelity loss (e.g., bidirectional Chamfer) and rigidity constraints over dynamically discovered hard and soft object clusters; a simplified sketch of this alternating scheme follows this list. Overlapping soft clusters and spectral methods allow for the recovery of closely spaced or interacting objects without explicit supervision.
  • Diffusion-based flow prediction: Structured diffusion models generate temporally coherent 3D flow as a latent process, supporting both prediction and conditioning for downstream policy learning (e.g., 3D Flow Diffusion Policy (Noh et al., 23 Sep 2025), 3DFlowAction (Zhi et al., 6 Jun 2025)). In these frameworks, generative diffusion is used to produce plausible, interaction-aware future flow fields conditioned on historical point cloud sequences, proprioceptive signals, and/or language instructions.
  • Multi-modal and domain-bridging approaches: Dream2Flow (Dharmarajan et al., 31 Dec 2025) reconstructs F₃D from video generations by tracking surface point trajectories, estimating monocular depth, and performing pointwise 3D lifting across frames. Subsequent frame alignment and canonicalization ensure temporal and spatial consistency for manipulation.
  • Self-supervised volumetric prediction: Occupancy flow networks such as Let Occ Flow (Liu et al., 2024) produce joint 3D occupancy and flow fields on spatial grids, employing multi-view feature lifting, temporal backward-forward deformable attention fusion, and differentiable rendering losses to match future scene appearance and geometry without 3D labels.
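
To make the alternating data-fidelity/rigidity idea concrete, the following PyTorch sketch refines per-point flow by gradient descent on a bidirectional Chamfer term plus a soft per-cluster consistency penalty. It is a heavily simplified stand-in for methods such as Let-It-Flow: the rigidity term merely pulls flow vectors within a cluster toward their mean rather than enforcing true rigid-body motion, and the loss weights and iteration counts are illustrative.

```python
import torch

def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Bidirectional Chamfer distance between point sets a: (N, 3) and b: (M, 3)."""
    d = torch.cdist(a, b)                                   # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def refine_flow(p_t, p_t1, labels, iters=200, lam=1.0, lr=0.01):
    """Gradient-based refinement of per-point flow with a soft rigidity prior.

    p_t, p_t1: (N, 3) / (M, 3) source and target point clouds.
    labels:    (N,) integer cluster assignment per source point (e.g., from spectral clustering).
    """
    flow = torch.zeros_like(p_t, requires_grad=True)
    opt = torch.optim.Adam([flow], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        data = chamfer(p_t + flow, p_t1)                    # data-fidelity term
        rigid = 0.0
        for c in labels.unique():                           # per-cluster flow consistency
            f_c = flow[labels == c]
            rigid = rigid + (f_c - f_c.mean(dim=0)).pow(2).mean()
        (data + lam * rigid).backward()
        opt.step()
    return flow.detach()
```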

3. 3D Object Flow in Perception, Segmentation, and Tracking

3D object flow estimation is tightly connected to the segmentation of moving entities and the tracking of their trajectories at the object level.

  • Motion-based object segmentation: Methods assign per-point or per-pixel predictions across both motion (rigid/non-rigid SE(3) parameters or flow vectors) and object instance labels in an integrated fashion (Shao et al., 2018, Teed et al., 2020). Rigid-motion embeddings and differentiable cluster-smoothing permit soft grouping of pixels or points into object hypotheses without explicit mask supervision.
  • Instance-level single-object tracking: Point-level flow is used as a building block for robustly inferring instance-level translation, rotation, and extent in both dense (e.g., point clouds, voxels) and projected (BEV) domains (as in FlowTrack (Li et al., 2024)). Temporal fusion modules integrate local flow cues over multiple frames, enabling more resilient aggregation in sparse or occluded settings.
  • Dynamic decomposition in multi-object scenes: Self-supervised clustering, such as that in Let-It-Flow (Vacek et al., 2024), leverages rigidity priors and hierarchical cluster merging to resolve dynamic scenes with multiple interacting or closely adjacent moving objects. Spectral analysis of local affinity graphs mitigates both the over- and under-segmentation typical of static clustering such as DBSCAN; a minimal affinity-based clustering sketch follows this list.
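
A minimal version of such motion-based grouping can be written as spectral clustering over a joint spatial/motion affinity graph. In the sketch below, the Gaussian affinity construction, the bandwidths, and the assumption of a known number of clusters are illustrative simplifications, not the exact formulation of the cited methods:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_by_motion(points: np.ndarray, flow: np.ndarray, n_clusters: int = 4,
                      sigma_p: float = 0.5, sigma_f: float = 0.1) -> np.ndarray:
    """Group points into object hypotheses from joint spatial and motion affinity.

    points, flow: (N, 3) arrays. Points that are close AND move consistently get
    high affinity; spectral clustering then recovers object-level groups.
    """
    dp = np.linalg.norm(points[:, None] - points[None], axis=-1)   # spatial distances
    df = np.linalg.norm(flow[:, None] - flow[None], axis=-1)       # motion differences
    affinity = np.exp(-(dp / sigma_p) ** 2) * np.exp(-(df / sigma_f) ** 2)
    return SpectralClustering(n_clusters=n_clusters,
                              affinity="precomputed").fit_predict(affinity)
```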

4. 3D Object Flow in Robotic Manipulation and Policy Learning

3D object flow has emerged as a scalable, embodiment-agnostic interface between high-level intent (what should happen to objects) and low-level robot actuation (how to realize it).

  • Visuomotor policy learning: 3D Flow Diffusion Policy (3D FDP (Noh et al., 23 Sep 2025)) predicts temporally extended 3D flow fields for a set of query points and conditions action generation via a second diffusion stage. This two-stage architecture enables the policy to leverage local contact-aware flow information, outperforming baselines relying on global features or direct observation-to-action mapping across hard robotic manipulation benchmarks.
  • Video-to-action bridging: Frameworks such as Dream2Flow (Dharmarajan et al., 31 Dec 2025) and 3DFlowAction (Zhi et al., 6 Jun 2025) use 3D object flow to translate visual predictions (e.g., video-generated object motions or human demonstrations) into trajectory targets for robotic control. By optimizing robot actions to track the flow-induced state targets (using trajectory optimization or RL), these approaches decouple state-change specification from embodiment constraints, allowing zero-shot guidance, cross-robot transfer, and generalization to unseen objects; a toy flow-tracking optimization is sketched after this list.
  • Closed-loop planning and evaluation: 3DFlowAction (Zhi et al., 6 Jun 2025) includes a validation loop in which predicted 3D flow outcomes are rendered and compared against the task instruction using an LLM (GPT-4o), with re-sampling if they are misaligned. The predicted flows then serve as hard constraints for an optimization-based action policy.
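
The flow-tracking objective behind these pipelines can be illustrated with a toy trajectory-optimization loop: given predicted future positions for a grasped object's points, optimize a sequence of end-effector motions so that the object follows them. The sketch below assumes a rigidly grasped object, pure translations, and no contact dynamics or kinematic constraints, so it conveys only the structure of the objective, not any cited system's implementation.

```python
import torch

def track_flow_targets(obj_pts: torch.Tensor, flow_targets: torch.Tensor,
                       iters: int = 300, lr: float = 0.02) -> torch.Tensor:
    """Optimize per-step end-effector translations so that a grasped object's
    points follow predicted 3D flow targets.

    obj_pts:      (n, 3) object points at the current frame (assumed rigidly grasped).
    flow_targets: (H, n, 3) predicted future positions of those points over H steps.
    Returns (H, 3) per-step translations.
    """
    H = flow_targets.shape[0]
    deltas = torch.zeros(H, 3, requires_grad=True)
    opt = torch.optim.Adam([deltas], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        pos = obj_pts + torch.cumsum(deltas, dim=0)[:, None, :]   # (H, n, 3) rolled-out points
        loss = (pos - flow_targets).pow(2).mean()                 # track the predicted flow
        loss.backward()
        opt.step()
    return deltas.detach()
```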

5. Architectures, Loss Functions, and Training Strategies

A range of architectural and algorithmic motifs are prevalent in the field:

  • Feature encoders and cross-modal fusion: Siamese and TPV-based encoders map multi-view, multi-frame, and multi-modal inputs into unified feature spaces suitable for volumetric or pointwise flow prediction (Shao et al., 2018, Liu et al., 2024).
  • Flow and occupancy decoders: Volumetric decoders output both SDF (for occupancy) and vector-valued flow at 3D grid locations; pointwise decoders output per-particle or per-vertex flows.
  • Supervision: Strong (with per-point ground-truth), weak (self-supervised by photometric consistency), or implicit (via object trajectory or segmentation) supervision is employed depending on the application and available data.
  • Losses: Chamfer-style bidirectional distance, rigidity or smoothness regularization (e.g., pairwise distances, spectral clustering eigenvalues), per-pixel/voxel regression, photometric reprojection, and differentiable rendering objectives are all employed (Vacek et al., 2024, Liu et al., 2024, Shao et al., 2018). For diffusion models, $\ell_1$ or $\ell_2$ noise-matching losses at each diffusion step are standard (Noh et al., 23 Sep 2025, Zhi et al., 6 Jun 2025); see the sketch after this list.
  • Dynamic disentanglement and temporal fusion: Temporal modules exploit multi-frame context for improved motion consistency (e.g., backward–forward attention (Liu et al., 2024), historical information fusion (Li et al., 2024)), while dynamic region detection (with segmentation and optical flow) upweights supervision in motion-dominant regions (Liu et al., 2024).
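
As an example of the diffusion objective mentioned in the losses item above, the sketch below implements a generic DDPM-style noise-matching loss over a flow tensor; the model signature, conditioning inputs, and noise schedule are assumptions for illustration rather than the specific designs of the cited policies.

```python
import torch
import torch.nn.functional as F

def diffusion_flow_loss(model, flow_gt: torch.Tensor, cond: torch.Tensor,
                        alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Noise-matching (ℓ2) objective for a diffusion model over 3D flow.

    flow_gt:        (B, T, n, 3) ground-truth future flow / trajectories.
    cond:           conditioning features (e.g., point-cloud history, language embedding).
    alphas_cumprod: (num_steps,) cumulative noise schedule.
    model(noisy_flow, t, cond) is assumed to predict the injected noise.
    """
    B = flow_gt.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=flow_gt.device)
    noise = torch.randn_like(flow_gt)
    a = alphas_cumprod[t].view(B, 1, 1, 1)
    noisy = a.sqrt() * flow_gt + (1.0 - a).sqrt() * noise   # forward diffusion q(x_t | x_0)
    return F.mse_loss(model(noisy, t, cond), noise)         # predict and match the noise
```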

6. Empirical Performance and Evaluation Protocols

Contemporary 3D object flow methods have demonstrated strong performance across a variety of established and novel benchmarks:

| Dataset / Task | Method (Reference) | Metric(s) | Result(s) |
|---|---|---|---|
| MetaWorld (50 tasks) | 3D FDP (Noh et al., 23 Sep 2025) | Success rate (hard tasks) | 60.8% (200 query points) |
| Real robot (8 tasks) | 3D FDP (Noh et al., 23 Sep 2025) | Mean success over DP3 baseline | 56.9% vs 27.5% |
| Push-T (sim, 100 trials) | Dream2Flow (Dharmarajan et al., 31 Dec 2025) | Particle dynamics | 52/100 |
| Door opening RL (real) | Dream2Flow (Dharmarajan et al., 31 Dec 2025) | Task success (Franka) | 100/100 (flow reward) |
| Manipulation generalization | 3DFlowAction (Zhi et al., 6 Jun 2025) | Cross-domain success | 70% vs baselines 20–50% |
| SemanticKITTI (3D occ+flow) | Let Occ Flow (Liu et al., 2024) | IoU@4m (no 3D labels) | 66.95% |
| KITTI Scene Flow | RAFT-3D (Teed et al., 2020) | EPE_3D (cm) | 5.77 |
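
For reference, the end-point error (EPE_3D) reported in the table is simply the mean Euclidean distance between predicted and ground-truth flow vectors; a minimal NumPy sketch:

```python
import numpy as np

def epe_3d(flow_pred: np.ndarray, flow_gt: np.ndarray) -> float:
    """Mean 3D end-point error between predicted and ground-truth flow, both (N, 3)."""
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())
```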

Significant advances include robust generalization to unseen objects and robots, reliable open-world and cross-domain performance, and strong accuracy on public automotive and manipulation benchmarks, often surpassing direct-pose or 2D-flow-based representations.

7. Open Challenges and Future Directions

Despite substantial progress, several challenges remain:

  • Deformable and non-rigid motion: Most current pipelines assume rigidity or articulation, limiting performance on highly deformable objects such as cloth and cables (Zhi et al., 6 Jun 2025).
  • Depth, occlusion, and visibility prediction: Depth estimation errors and occlusions propagate to flow estimation and object interaction prediction, sometimes requiring specialized alignment or learning strategies (Dharmarajan et al., 31 Dec 2025).
  • End-to-end differentiability: Many video-to-flow-to-action frameworks rely on piecewise or modular pipelines; integrating flow prediction and action optimization into a fully differentiable loop is an active direction (Zhi et al., 6 Jun 2025).
  • Real-time performance and scalability: Optimization-based and volumetric approaches often incur significant computational overhead, especially for large scenes or real-time robotic control (Vacek et al., 2024).
  • Multi-object and contact-rich interaction: Fine-grained, simultaneous flow estimation for multiple objects in contact or with mutual constraints (e.g., assembly, tool use) requires advances in clustering, dynamics modeling, and reasoning over physical constraints.

3D object flow is now established as a powerful structural prior and intermediate for robotic policy learning, dynamic scene understanding, and bridging domain gaps between video generation, human demonstration, and robot execution. Its ongoing development continues to advance the state of the art in machine perception, autonomous manipulation, and multi-agent dynamic scene understanding (Noh et al., 23 Sep 2025, Dharmarajan et al., 31 Dec 2025, Vacek et al., 2024, Li et al., 2024, Shao et al., 2018, Teed et al., 2020, Zhi et al., 6 Jun 2025, Liu et al., 2024).
