DoGFlow: Radar-Guided LiDAR Scene Flow

Updated 4 July 2026

The paper introduces DoGFlow, a self-supervised framework that transfers motion cues from radar Doppler to LiDAR via a cross-modal pseudo-labeling process.
It employs a novel two-stage architecture with radar-based dynamic clustering and range-adaptive LiDAR association to resolve motion ambiguity.
Extensive experiments on MAN TruckScenes demonstrate robust, long-range scene flow estimation with over 90% recovery of fully supervised performance using only 10% labels.

DoGFlow is a self-supervised LiDAR scene flow framework that transfers motion information from 4D mmWave radar Doppler to LiDAR in order to generate dense 3D scene flow pseudo-labels without manual ground-truth annotation. It is formulated as a cross-modal, training-free pseudo-labeler coupled to a training pipeline for LiDAR backbones, with radar used only during pseudo-label generation and not at inference. In the reported MAN TruckScenes experiments, DoGFlow substantially outperforms prior self-supervised baselines, remains competitive at long range, and enables a LiDAR backbone to recover over $90\%$ of fully supervised performance with only $10\%$ of the ground-truth labels (Khoche et al., 25 Aug 2025).

1. Problem setting and motivation

LiDAR scene flow estimates dense 3D motion between successive point clouds. In autonomous driving, the task is directly relevant to detection, tracking, segmentation, and prediction, and the paper emphasizes that long-range motion in the $50\text{–}200\ \mathrm{m}$ regime and robustness to rain, snow, and fog are particularly important for safe planning (Khoche et al., 25 Aug 2025).

DoGFlow is motivated by limitations in both dominant supervision regimes. Fully supervised approaches can achieve strong performance, but their dependence on expensive human labeling creates a scaling bottleneck, and long-range boundaries and adverse-weather scenarios remain underrepresented in labeled datasets. LiDAR-only self-supervised approaches instead depend on geometric correspondences such as Chamfer or cycle consistency, which deteriorate when LiDAR geometry becomes sparse or noisy, often producing over-smoothing or unstable optimization. DoGFlow addresses this gap by exploiting 4D radar Doppler, which is robust to adverse weather and directly measures radial velocity, while explicitly handling the fact that Doppler is radial-only and susceptible to multipath and ghost returns (Khoche et al., 25 Aug 2025).

A central design choice is therefore cross-modal label transfer rather than direct radar inference at deployment time. The reported system uses radar to recover object-level motion cues where LiDAR-only geometric matching is weakest, then transfers those cues into the LiDAR domain to supervise a conventional feedforward LiDAR model. This suggests a separation between an offline label-generation stage and a radar-free online inference stage.

2. Two-stage architecture

DoGFlow is organized as a two-stage pipeline. The first stage estimates cluster-level velocities from radar; the second propagates those velocities to LiDAR points through dynamic-aware association and ambiguity-resolved label propagation (Khoche et al., 25 Aug 2025).

In the radar stage, Doppler measurements are first ego-motion compensated and used to detect dynamic radar points. Dynamic points are then grouped by graph-based clustering, specifically Connected Components Labeling, in a joint space of position and compensated velocity. Each connected component is treated as a radar cluster corresponding to a moving object, and a full 3D translational velocity is estimated for that cluster by least squares under physically plausible bounds.
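The grouping step described above can be sketched as connected components over a graph whose edges join radar points that are close in both position and compensated radial velocity. This is a minimal illustration, not the paper's implementation: the function name, the thresholds `eps_pos` and `eps_vel`, and the brute-force pairwise loop are assumptions made for clarity, and a real pipeline would likely use a spatial index instead of testing every pair.

```python
from itertools import combinations
from math import dist


def cluster_dynamic_points(positions, velocities, eps_pos, eps_vel):
    """Label points by connected components of a position/velocity graph.

    positions  : list of (x, y, z) tuples for dynamic radar points
    velocities : list of ego-compensated radial velocities (m/s)
    eps_pos, eps_vel : hypothetical spatial and velocity thresholds
    """
    n = len(positions)
    parent = list(range(n))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Join two points only when they agree in BOTH position and velocity.
    for i, j in combinations(range(n), 2):
        close = dist(positions[i], positions[j]) <= eps_pos
        coherent = abs(velocities[i] - velocities[j]) <= eps_vel
        if close and coherent:
            parent[find(i)] = find(j)

    # Map each root to a compact cluster id in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]
```

Requiring velocity coherence as well as proximity keeps two adjacent objects moving at different speeds in separate clusters, which a purely spatial grouping would merge.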

In the LiDAR stage, points are denoised and ground is removed before cross-modal association. LiDAR points are associated to dynamic radar points with a range-adaptive nearest-neighbor rule. High-intensity associated LiDAR points are clustered with HDBSCAN, and low-intensity points are reintegrated into the nearest cluster within a fixed neighbor radius. A LiDAR cluster is labeled dynamic if a majority of its associated radar points are dynamic. The resulting radar-derived velocity or velocities are then transferred to the LiDAR cluster. When multiple radar clusters are associated to the same LiDAR cluster, DoGFlow resolves the ambiguity by forward-projecting the LiDAR cluster under each candidate velocity and selecting the one that best matches the next LiDAR scan under Chamfer distance (Khoche et al., 25 Aug 2025).
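The association and voting rules in this stage can be sketched as follows. The linear gate parameterization (`r_min`, `r_max`, `range_max`), the function names, and the choice to vote over distinct associated radar points are all assumptions for illustration; the source only states that the gate grows with range and that a majority decides.

```python
from math import dist, hypot


def gate_radius(rng, r_min, r_max, range_max):
    """Association radius growing linearly from r_min (range 0) to r_max (range_max)."""
    frac = min(max(rng / range_max, 0.0), 1.0)
    return r_min + frac * (r_max - r_min)


def associate(lidar_pts, radar_pts, r_min, r_max, range_max):
    """For each LiDAR point, return the index of the nearest radar point
    if it lies inside the range-adaptive gate, otherwise None."""
    out = []
    for p in lidar_pts:
        radius = gate_radius(hypot(*p), r_min, r_max, range_max)
        best = min(range(len(radar_pts)),
                   key=lambda j: dist(p, radar_pts[j]), default=None)
        if best is not None and dist(p, radar_pts[best]) > radius:
            best = None
        out.append(best)
    return out


def is_dynamic_cluster(assoc, radar_is_dynamic):
    """Majority vote over the distinct radar points associated with one LiDAR cluster."""
    hits = {j for j in assoc if j is not None}
    if not hits:
        return False
    return sum(radar_is_dynamic[j] for j in hits) > len(hits) / 2
```

A wider gate at long range compensates for the growing spacing between returns, while the tight gate near the vehicle limits false associations in dense regions.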

This architecture is notable for avoiding global bipartite assignment. The paper states that no global Hungarian matching is required because the association is local and many-to-one at the cluster level.

3. Radar kinematics, ego compensation, and ambiguity resolution

DoGFlow operates in calibrated sensor and ego frames. Let $E$ denote the ego or vehicle frame and $S_i$ a radar frame. Using the extrinsic calibration $T_{E\leftarrow S}$, points are transformed as

$x_E = R_{E\leftarrow S} x_S + t_{E\leftarrow S}.$

TruckScenes provides synchronized, time-stamped multi-LiDAR and multi-4D-radar at $10\ \mathrm{Hz}$, and the framework fuses all points in the ego frame at each timestamp before association (Khoche et al., 25 Aug 2025).
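As a small illustration of the frame transform and fusion step, the sketch below applies $x_E = R_{E\leftarrow S} x_S + t_{E\leftarrow S}$ per sensor and concatenates the results. The function names and the `(points, R, t)` tuple layout are hypothetical, and time synchronization is assumed to have been handled upstream.

```python
def to_ego(points, R, t):
    """Apply x_E = R x_S + t to every point.

    R is a 3x3 rotation given as nested lists, t a length-3 translation.
    """
    return [
        tuple(sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3))
        for p in points
    ]


def fuse(sensors):
    """Concatenate the points of several sensors after moving each into the ego frame.

    sensors is an iterable of (points, R, t) tuples, one per sensor.
    """
    fused = []
    for points, R, t in sensors:
        fused.extend(to_ego(points, R, t))
    return fused
```

For example, a sensor yawed by 90 degrees and mounted at `(1.0, 0.0, 0.5)` maps its own point `(2.0, 0.0, 0.0)` to `(1.0, 2.0, 0.5)` in the ego frame.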

The radar signal model begins with the Doppler relation

$f_D = \frac{2 f_c}{c} v_r,\qquad v_r = \frac{c f_D}{2 f_c},$

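where $v_r$ is radial velocity, $f_D$ the Doppler shift, $f_c$ the carrier frequency, and $c$ the speed of light. For a radar return $i$ with position $p_i$, measured Doppler radial velocity $v_{r,i}$, and unit line of sight $u_i$, DoGFlow ego-compensates Doppler as

$\tilde v_{r,i} = v_{r,i} + u_i^{\top} v_{\mathrm{ego}},$

where $v_{\mathrm{ego}}$ is the ego-induced velocity of the radar sensor and $v_{r,i}$ is taken positive for receding targets, and marks a radar point dynamic when $|\tilde v_{r,i}| > \tau_{\mathrm{dyn}}$. The implementation uses a fixed threshold $\tau_{\mathrm{dyn}}$ (Khoche et al., 25 Aug 2025).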
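The Doppler inversion, ego compensation, and dynamic test can be sketched together in a few lines. Several things here are assumptions rather than details from the paper: the sign convention (radial velocity positive for receding targets), the requirement that points and sensor velocity share the radar sensor frame so the line of sight starts at the sensor origin, and the threshold `tau`, whose value is hypothetical.

```python
from math import sqrt

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def radial_velocity(f_doppler_hz, f_carrier_hz):
    """Invert the Doppler relation: v_r = c * f_D / (2 * f_c)."""
    return SPEED_OF_LIGHT * f_doppler_hz / (2.0 * f_carrier_hz)


def compensate_doppler(points, v_radial, v_sensor):
    """Remove the ego-induced part of each measured radial velocity.

    Assumes v_radial > 0 means the target is receding, so a static point
    measures -u . v_sensor and compensation adds u . v_sensor back.
    """
    out = []
    for p, vr in zip(points, v_radial):
        norm = sqrt(sum(c * c for c in p))
        u = [c / norm for c in p]  # unit line of sight from the sensor
        out.append(vr + sum(ui * vi for ui, vi in zip(u, v_sensor)))
    return out


def dynamic_mask(v_comp, tau):
    """Flag a return as dynamic when its compensated speed exceeds tau."""
    return [abs(v) > tau for v in v_comp]
```

With the sensor moving at 10 m/s along +x, a static point straight ahead measures -10 m/s and compensates to zero, while a target ahead moving at 5 m/s in the same direction measures -5 m/s and compensates to +5 m/s.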

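Dynamic radar clustering enforces both spatial and Doppler coherence. Two radar points $i, j$ with positions $p_i, p_j$ and compensated radial velocities $\tilde v_{r,i}, \tilde v_{r,j}$ are connected when

$\lVert p_i - p_j \rVert \le \epsilon_p \quad\text{and}\quad |\tilde v_{r,i} - \tilde v_{r,j}| \le \epsilon_v,$

with fixed spatial and velocity thresholds $\epsilon_p$ and $\epsilon_v$. For each cluster $\mathcal{C}_k$, DoGFlow estimates a constant 3D translational velocity $v_k$ by solving $\min_{v_k} \sum_{i\in\mathcal{C}_k} \left(u_i^{\top} v_k - \tilde v_{r,i}\right)^2$ with bound-constrained least squares, where $u_i$ is the unit line of sight of return $i$. The paper also presents the more general rigid-body model with angular velocity $\omega_k$ about a cluster center $c_k$,

$v(p_i) = v_k + \omega_k \times (p_i - c_k),$

and the corresponding Doppler constraint

$\tilde v_{r,i} = u_i^{\top}\left(v_k + \omega_k \times (p_i - c_k)\right),$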

but states that DoGFlow instantiates a translation-only model for robustness under sparse and noisy radar and short baselines (Khoche et al., 25 Aug 2025).
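A translation-only cluster velocity can be recovered from radial measurements by box-constrained least squares. The sketch below uses projected gradient descent as a simple stand-in solver; the paper does not say which solver it uses, and the symmetric per-axis bound `v_max`, the function name, and the iteration count are assumptions.

```python
def fit_cluster_velocity(directions, v_radial, v_max, iters=500):
    """Minimise sum_i (u_i . v - r_i)^2 subject to |v_k| <= v_max per axis.

    directions : unit line-of-sight vectors u_i (one per radar point)
    v_radial   : ego-compensated radial velocities r_i
    v_max      : hypothetical per-axis speed bound (m/s)
    """
    n = len(directions)
    # The gradient is Lipschitz with constant 2 * lambda_max(A^T A) <= 2 * n
    # because every row of A is a unit vector, so 1 / (2n) is a safe step.
    step = 1.0 / (2.0 * n)
    v = [0.0, 0.0, 0.0]
    for _ in range(iters):
        grad = [0.0, 0.0, 0.0]
        for u, r in zip(directions, v_radial):
            residual = sum(ui * vi for ui, vi in zip(u, v)) - r
            for k in range(3):
                grad[k] += 2.0 * residual * u[k]
        # Gradient step followed by projection onto the box constraint.
        v = [min(max(vk - step * gk, -v_max), v_max) for vk, gk in zip(v, grad)]
    return v
```

When all lines of sight in a cluster are nearly parallel, as for a distant object seen by a single radar, the tangential components are poorly constrained. In that regime the bounds, and the later Chamfer-based check against LiDAR, carry more of the burden.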

Cross-modal propagation uses a range-adaptive gate $r(d)$ that increases linearly with range $d$ between a minimum and a maximum association radius. High-intensity LiDAR points are selected with an intensity threshold, and low-intensity neighbors are reassigned within a fixed radius. Dynamic labeling is then determined by majority voting: if more than $50\%$ of the associated radar points for a LiDAR cluster are dynamic, the cluster is marked dynamic. When several radar clusters map to the same LiDAR cluster, the ambiguity is resolved by minimizing symmetric Chamfer distance after forward projection:

$v^{\star} = \arg\min_{v \in \mathcal{V}} \mathrm{CD}\left(\mathcal{P}_t + v\,\Delta t,\ \mathcal{P}_{t+1}\right),$

with $\mathcal{V}$ the set of candidate radar-cluster velocities, $\mathcal{P}_t$ the LiDAR cluster at time $t$, $\mathcal{P}_{t+1}$ the next LiDAR scan, and $\Delta t$ the inter-frame interval. The chosen velocity is then propagated as $f_i = v^{\star}\,\Delta t$ to every point $i$ in the cluster (Khoche et al., 25 Aug 2025).
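The ambiguity-resolution rule can be sketched as: forward-project the cluster under each candidate velocity, score it against the next scan with a symmetric Chamfer distance, and keep the best candidate. The unsquared mean nearest-neighbor form of the distance, the brute-force search, and the assumption that `scan_next` has already been cropped to the neighborhood of the cluster are all illustrative choices, not details confirmed by the source.

```python
from math import dist


def chamfer(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance in both directions."""
    a_to_b = sum(min(dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


def select_velocity(cluster_t, scan_next, candidates, dt):
    """Pick the candidate velocity whose forward projection best matches the next scan.

    cluster_t  : LiDAR cluster points at time t
    scan_next  : next-scan points, assumed cropped around the cluster
    candidates : candidate 3D velocities from the associated radar clusters
    dt         : time between scans in seconds
    """
    def projected(v):
        return [tuple(c + vc * dt for c, vc in zip(p, v)) for p in cluster_t]

    return min(candidates, key=lambda v: chamfer(projected(v), scan_next))
```

The per-point pseudo-label is then the selected velocity multiplied by `dt`, applied uniformly to the cluster.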

4. Pseudo-label generation and supervision of LiDAR backbones

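For a LiDAR point $x_i^{t}$ observed at time $t$, scene flow is defined as

$f_i = x_i^{t+1} - x_i^{t},$

where $x_i^{t+1}$ is the position of the same physical point at time $t+1$. With known ego-motion, the model predicts residual non-ego flow through

$\hat f_i = f_i^{\mathrm{ego}} + \Delta \hat f_i,$ where $f_i^{\mathrm{ego}}$ is the flow induced by ego-motion alone and $\Delta \hat f_i$ is the predicted residual.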
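The split of scene flow into an ego-induced part and a residual part can be made concrete with a short sketch. It assumes the ego motion is given as a rigid transform `(R, t)` that maps coordinates from the ego frame at time t into the ego frame at time t+1; the function names are hypothetical.

```python
def apply_rigid(R, t, p):
    """Return R p + t for a 3x3 rotation R (nested lists) and translation t."""
    return tuple(sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3))


def ego_flow(points, R, t):
    """Flow that a static point appears to have purely because the ego vehicle moved."""
    return [
        tuple(q - c for q, c in zip(apply_rigid(R, t, p), p))
        for p in points
    ]


def total_flow(ego, residual):
    """Compose per-point ego-induced flow with the predicted residual flow."""
    return [
        tuple(e + r for e, r in zip(f_ego, f_res))
        for f_ego, f_res in zip(ego, residual)
    ]
```

If the vehicle drives 1 m forward between scans, a static point shifts by -1 m along x in the new ego frame. A network that predicts only the residual therefore outputs zero for static structure, and only moving objects receive nonzero values.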

DoGFlow supplies dense pseudo-labels $f_i^{\mathrm{PL}}$ for LiDAR points in dynamic clusters, and optionally zeros for static points. These labels supervise a standard LiDAR backbone; the paper uses SSF as the main feedforward model and also compares to DeFlow (Khoche et al., 25 Aug 2025).

The primary training objective is direct regression to pseudo-labels:

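$\mathcal{L} = \frac{1}{\sum_i w_i} \sum_i w_i \left\lVert \hat f_i - f_i^{\mathrm{PL}} \right\rVert_2.$

Here $\hat f_i$ is the predicted flow, $f_i^{\mathrm{PL}}$ the pseudo-label, and $w_i$ a per-point weight. In the reported experiments, the weights are set to $w_i = 1$ for points with labels, and unlabeled points are ignored. The paper notes that forward–backward consistency, smoothness regularization, and occlusion or warping losses are compatible standard extensions, but they are not required by DoGFlow’s core pipeline (Khoche et al., 25 Aug 2025).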
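A masked regression objective of this kind can be sketched in plain Python. The sketch assumes unit weights on labeled points, an unsquared L2 error averaged over labeled points, and `None` as the marker for an unlabeled point; the exact norm and normalization used in the paper are not confirmed here.

```python
from math import dist


def pseudo_label_loss(pred, labels):
    """Mean L2 error between predicted flow and pseudo-labels.

    pred   : list of predicted flow vectors
    labels : list of pseudo-label vectors, with None for unlabeled points
    Unlabeled points contribute nothing to the loss.
    """
    errors = [dist(f, y) for f, y in zip(pred, labels) if y is not None]
    return sum(errors) / len(errors) if errors else 0.0
```

In a training framework the same masking is usually expressed as a boolean index over the point dimension, so gradients flow only through labeled points.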

A practical consequence of this design is that radar is absent at deployment. The training-free DoGFlow pseudo-labeler runs offline, after which a feedforward LiDAR model can be trained and deployed in real time. On an RTX 3090, the paper reports the per-frame runtime and peak memory of the labeler, which is not real time, alongside a real-time per-frame latency for the feedforward SSF trained on DoGFlow labels (Khoche et al., 25 Aug 2025).

5. Dataset, metrics, and empirical results

The experiments are conducted on MAN TruckScenes, which contains more than 740 scenes of approximately $20\ \mathrm{s}$ each at $10\ \mathrm{Hz}$, with clear, overcast, rain, snow, and fog conditions, multiple long-range LiDARs, and modern 4D mmWave radars with $360^\circ$ coverage. The paper reports the size of the training split in frames and notes that 3D boxes are available every fifth frame. The paper explicitly notes that existing scene-flow benchmarks lack 4D radar, making TruckScenes well suited to cross-modal Doppler-guided supervision (Khoche et al., 25 Aug 2025).

Evaluation uses EPE3D, range-wise dynamic EPE, dynamic IoU, and three-way EPE. EPE3D is the per-point $\ell_2$ error $\lVert \hat f_i - f_i \rVert_2$ between predicted and ground-truth flow. Dynamic IoU is computed on a dynamic mask defined by a threshold on flow magnitude per frame. Three-way EPE averages over Foreground Dynamic, Foreground Static, and Background Static (Khoche et al., 25 Aug 2025).
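The three metric families can be sketched as follows. The dynamic threshold `thresh`, the class tags `"FD"`, `"FS"`, and `"BS"`, and the choice to threshold both predicted and ground-truth flow magnitude for the IoU are assumptions for illustration; the flow vectors are taken to be ego-compensated.

```python
from math import dist, hypot


def epe3d(pred, gt):
    """Mean per-point L2 end-point error."""
    return sum(dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def dynamic_iou(pred, gt, thresh):
    """IoU between predicted and ground-truth dynamic masks (flow magnitude > thresh)."""
    pred_dyn = [hypot(*p) > thresh for p in pred]
    gt_dyn = [hypot(*g) > thresh for g in gt]
    inter = sum(a and b for a, b in zip(pred_dyn, gt_dyn))
    union = sum(a or b for a, b in zip(pred_dyn, gt_dyn))
    return inter / union if union else 1.0


def three_way_epe(pred, gt, classes):
    """Average of per-class EPE over foreground dynamic, foreground static,
    and background static points; classes absent from the frame are skipped."""
    per_class = []
    for name in ("FD", "FS", "BS"):
        idx = [i for i, c in enumerate(classes) if c == name]
        if idx:
            per_class.append(sum(dist(pred[i], gt[i]) for i in idx) / len(idx))
    return sum(per_class) / len(per_class)
```

Averaging per class rather than per point prevents the large static background from hiding errors on the comparatively few dynamic points.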

On the TruckScenes validation set, DoGFlow reports the best three-way EPE among the self-supervised methods in the table, ahead of FastNSF. Range-wise dynamic EPE and dynamic IoU are reported per distance bin. The paper emphasizes that DoGFlow degrades slowly with range and remains competitive at long range, where Chamfer-based methods fail because of sparsity and occlusion (Khoche et al., 25 Aug 2025).

When pseudo-labels are used to train SSF, DoGFlow again exceeds alternative pseudo-label sources. The paper reports range-wise dynamic EPE and dynamic IoU for SSF trained on DoGFlow pseudo-labels and compares them with SSF trained on FastNSF and ICP-Flow pseudo-labels (Khoche et al., 25 Aug 2025).

The paper’s label-efficiency result is one of its strongest reported findings. SSF pretrained with DoGFlow pseudo-labels and then fine-tuned with only $10\%$ ground truth reaches a mean dynamic EPE close to that of fully supervised SSF, which the paper summarizes as over $90\%$ of fully supervised performance with $10\%$ labels. Zero-shot performance after DoGFlow pretraining is described as comparable to training from scratch with a fraction of the ground truth (Khoche et al., 25 Aug 2025).

The paper reports the mean range-wise dynamic EPE and IoU for DoGFlow separately for clear, overcast, rain, snow, and fog conditions. It states that DoGFlow strongly outperforms Chamfer-based self-supervised baselines in IoU across all weather conditions and achieves the best or second-best EPE in most weather categories, with particularly strong snow performance attributed to radar robustness (Khoche et al., 25 Aug 2025).

An ablation further isolates the value of Doppler-based dynamic awareness. Replacing DUFOMap with radar-based dynamic classification inside SeFlow improves both dynamic EPE and dynamic IoU in each of the two range bins reported (Khoche et al., 25 Aug 2025).

6. Limitations, deployment considerations, and nomenclature

The paper identifies several failure modes. DoGFlow is sensitive to calibration because it requires accurate, static extrinsics between radar and LiDAR; miscalibration produces systematic velocity bias and association error. Close-range and slow movers remain difficult because 4D radar Doppler resolution and sensor blind spots can miss low-speed lateral motion, such as pedestrians near the vehicle sides. Rotational motion is not modeled explicitly because DoGFlow estimates only cluster-level constant translation, so significant within-object rotation cannot be recovered from radial Doppler alone under the chosen formulation. Severe multipath and aliasing can still corrupt the candidate velocity set despite the ambiguity-resolution mechanism. Finally, the training-free labeler is not real time, so the intended use is offline pseudo-label generation followed by real-time LiDAR inference (Khoche et al., 25 Aug 2025).

These limitations directly shape deployment practice. The recommended integration is to use DoGFlow offline with a multi-4D-radar and multi-LiDAR sensor suite, generate large-scale pseudo-labels under synchronized timestamps and precise extrinsic calibration, and then train a sparse-convolution LiDAR scene-flow backbone such as SSF or DeFlow. The paper notes that graph-based radar clustering and Chamfer-based ambiguity resolution dominate runtime and are natural targets for acceleration through spatial indexing and GPU nearest-neighbor search. It also identifies range-adaptive association, intensity-aware LiDAR clustering, and majority voting as crucial to weather and long-range robustness (Khoche et al., 25 Aug 2025).

The term itself requires disambiguation. In Khoche et al. (25 Aug 2025), DoGFlow is the official name of the LiDAR scene-flow method based on cross-modal Doppler guidance. However, the later time-series paper “DoFlow: Causal Generative Flows for Interventional and Counterfactual Time-Series Prediction” states that “DoGFlow” is sometimes used informally for its own method, which is officially named DoFlow and is not a separate model (Wu et al., 4 Nov 2025). A further neighboring name, DiG-Flow, denotes an unrelated discrepancy-guided regularization framework for Vision-Language-Action policies (Zhang et al., 1 Dec 2025). A plausible implication is that “DoGFlow” should be interpreted by domain context: in autonomous-driving perception it refers to cross-modal radar-guided LiDAR scene flow, whereas nearby flow-model literature contains unrelated naming collisions.