MoDAR: Motion-Depth Aligned Refinement

Updated 4 July 2026

MoDAR is a methodological pattern that couples motion and depth estimation, using iterative feedback so that each estimate refines the other.
It spans strongly coupled cyclic refinement (e.g., DualRefine) and weaker association-level methods (e.g., DepthMOT) across different vision tasks.
Reported ablations indicate that bidirectional coupling improves both depth accuracy and pose estimation.

Motion-Depth Aligned Refinement (MoDAR) denotes a family of geometry-aware inference and learning strategies in which motion and depth are not treated as independent outputs, but as mutually conditioning variables whose estimates are aligned and refined through feedback. In the strongest formulations, current motion determines the search geometry used for depth refinement, while current depth in turn conditions motion updates; in weaker formulations, motion-derived geometry supervises, regularizes, or disambiguates depth and association decisions without a fully symmetric inference loop. The term is also used explicitly as a module name in monocular video human mesh recovery, where motion tokens are aligned with depth-enhanced fused representations before residual refinement of pose and shape (Cen et al., 4 Feb 2026), while closely related coupled depth-pose systems such as DualRefine instantiate the broader pattern in self-supervised depth and odometry estimation (Bangunharcana et al., 2023).

1. Definition, scope, and nomenclature

MoDAR is best understood as a methodological pattern rather than a single architecture. Its central requirement is bidirectional dependence between motion and depth: motion should affect how depth-relevant correspondences or geometric constraints are formed, and depth should affect how motion is updated, weighted, or interpreted. In this sense, methods that merely predict depth and pose side by side are outside the strongest reading of MoDAR, whereas methods that refine one modality using a frozen estimate of the other occupy a weaker position within the same design space. This suggests a spectrum ranging from tightly coupled recurrent or equilibrium formulations to looser association-time or training-time couplings (Bangunharcana et al., 2023).

The literature also uses the acronym “MoDAR” in a distinct sense. In “MoDAR: Using Motion Forecasting for 3D Object Detection in Point Cloud Sequences,” the acronym explicitly stands for “Motion forecasting based Detection And Ranging,” not Motion-Depth Aligned Refinement, and the method concerns motion-forecasting-based virtual point augmentation for LiDAR detection rather than a depth-motion co-refinement module (Li et al., 2023). By contrast, the phrase “Motion-Depth Aligned Refinement” is explicitly used as a named module in “Depth-Guided Metric-Aware Temporal Consistency for Monocular Video Human Mesh Recovery,” where it denotes a temporal refinement stage driven by cross-modal attention between motion dynamics and depth-enhanced fused features (Cen et al., 4 Feb 2026).

A further distinction is between strong and weak MoDAR instantiations. DualRefine is presented as a particularly clear instance because the current pose determines the epipolar geometry used for depth matching, while the current depth and hidden correspondence state determine the next pose update (Bangunharcana et al., 2023). By comparison, depth-aware multiple-object tracking methods such as DepthMOT and DepTR-MOT refine association decisions by adding depth cues to motion- or IoU-based costs; these are still MoDAR-like, but chiefly at the association level rather than as full joint latent-state optimizers (Khanchi et al., 1 Jun 2025, Deng et al., 22 Sep 2025).

2. Historical antecedents and precursor formulations

Several earlier systems established the conceptual ingredients later synthesized under MoDAR. DeMoN formulated two-frame structure from motion as a learned, iterative process in which optical flow is first estimated, then used to estimate depth and camera motion, and then recycled back into subsequent updates. Its architecture alternates optical flow estimation with depth-and-motion estimation, and the iterative net feeds previous depth and motion into the next flow stage while also converting previous flow and motion into a depth proposal for the next depth-and-motion stage (Ummenhofer et al., 2016). This already embodies a coupled motion-depth refinement logic, even though it is supervised and two-frame.

A second precursor is DEAR, the RGB–ToF system introduced in “Deep End-to-End Alignment and Refinement for Time-of-Flight RGB-D Module.” There, cross-modal optical flow aligns ToF observations to RGB, and the aligned depth is then refined by a kernel prediction network with kernel normalization and a bias prior. The alignment stage itself is depth-assisted through a weak-calibration depth-to-flow conversion, and the full objective is explicitly joint: $L_{\rm total}=L_{\rm rough}+L_{\rm refn}+L_{\rm depth}$ (Qiu et al., 2019). This makes alignment and depth correction mutually dependent, albeit in a cross-modal RGB-D setting rather than a monocular one.

DFineNet moved the coupling into RGB-guided depth completion with sparse, noisy depth. It predicts dense refined depth from RGB plus sparse depth, predicts relative pose from consecutive RGB frames, and ties the two through differentiable view synthesis. The paper’s central claim is that depth completion should not be treated as a purely per-frame fusion problem, and pose estimation should not be solved afterward as a separate module; instead, they are trained together through a shared geometric reprojection objective (Zhang et al., 2019). Although the coupling is strongest during training rather than through explicit iterative test-time refinement, the method shows that temporal motion cues can regularize dense depth especially where sparse depth is absent.
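The training-time coupling described above can be illustrated with a minimal sketch. It assumes per-pixel L1 errors, flat lists instead of image tensors, and `None` as the marker for missing sparse depth; the function name `coupled_loss` and the weight `w_photo` are hypothetical and are not taken from DFineNet.

```python
def coupled_loss(pred_depth, sparse_depth, target_img, warped_img, w_photo=0.5):
    """Toy depth + photometric loss over flat per-pixel lists.

    sparse_depth holds None where no measurement exists (assumed convention).
    warped_img is the source frame already warped into the target view using
    the predicted depth and pose (that warping is computed elsewhere).
    """
    depth_terms, photo_terms = [], []
    for d_pred, d_sparse, i_t, i_w in zip(pred_depth, sparse_depth,
                                          target_img, warped_img):
        if d_sparse is not None:
            # Metric anchor: supervise directly where sparse depth is present.
            depth_terms.append(abs(d_pred - d_sparse))
        else:
            # Motion cue: photometric error only where sparse depth is absent.
            photo_terms.append(abs(i_t - i_w))
    depth_loss = sum(depth_terms) / len(depth_terms) if depth_terms else 0.0
    photo_loss = sum(photo_terms) / len(photo_terms) if photo_terms else 0.0
    return depth_loss + w_photo * photo_loss
```

Because the warped image depends on both predicted depth and predicted pose, the photometric term sends gradients to both branches, which is the sense in which the two are trained together.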

These precursors imply that MoDAR did not emerge from a single paper, but from the progressive convergence of three ideas: learned correspondence estimation, geometric alignment, and depth-motion co-supervision. Later work makes that convergence more explicit by embedding both motion and depth inside the same recurrent state or by introducing explicit aligned refinement modules.

3. Geometric and optimization principles

The geometric substrate of MoDAR is the differentiable projection relation that couples a pixel, its depth, camera intrinsics, and relative pose. In DualRefine, given target-image pixel $u=(x,y)$, target depth $D[u]$, camera intrinsics $K$, and relative pose $T_{t\rightarrow s}$, the source projection is

$z'u' = z'\begin{bmatrix}x' & y' & 1\end{bmatrix}^{\top} = K\,T_{t\rightarrow s}\Bigl(D[u]\,K^{-1}\begin{bmatrix}x & y & 1\end{bmatrix}^{\top}\Bigr),$

followed by source-to-target warping

$I_{s\rightarrow t}[u] = I_s\langle u' \rangle .$

In DualRefine, the current pose $T_k$ determines the epipolar geometry used to sample candidate correspondences for depth refinement, and the resulting matching costs are recomputed at each iteration rather than built once from a frozen upstream pose (Bangunharcana et al., 2023).
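The projection relation can be sketched in a few lines of plain Python. This is an illustration under assumptions, not DualRefine's implementation: the function name `reproject`, the intrinsics layout `(fx, fy, cx, cy)`, and the split of $T_{t\rightarrow s}$ into a rotation `R` and translation `t` are choices made here for clarity.

```python
def reproject(u, depth, K, R, t):
    """Project target pixel u=(x, y) with the given depth into the source view.

    K = (fx, fy, cx, cy) is a pinhole intrinsics tuple (assumed layout).
    R is a 3x3 rotation as nested lists and t a 3-vector; together they
    play the role of the relative pose T_{t->s}.
    Returns the source pixel (x', y') and the source-frame depth z'.
    """
    fx, fy, cx, cy = K
    x, y = u
    # Back-project: D[u] * K^{-1} [x, y, 1]^T
    p = (depth * (x - cx) / fx, depth * (y - cy) / fy, depth)
    # Rigid transform into the source camera frame: R p + t
    q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
    z_src = q[2]
    if z_src <= 0:
        raise ValueError("point falls behind the source camera")
    # Perspective projection with K
    return (fx * q[0] / z_src + cx, fy * q[1] / z_src + cy), z_src
```

With an identity rotation, a lateral translation of 0.1, and `fx = 500`, a pixel at the principal point lands 10 px to the right at depth 5 and 5 px to the right at depth 10. Changing the pose moves the whole set of candidate positions, which is why matching costs built under a stale pose become inconsistent.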

This coupling is made explicit in the model’s equilibrium formulation:

$(h^*, D^*, T^*) = z^* = \mathrm{U}(z^*, x),$

where the equilibrium variable $z$ comprises hidden state $h$, depth $D$, and pose $T$. The method therefore does not operate as a depth-only or pose-only recurrent optimizer; it is an equilibrium over a coupled motion-depth state (Bangunharcana et al., 2023). At the update level, depth refinement uses local epipolar matching costs sampled under the current pose $T_k$, while pose refinement solves a direct feature-metric alignment problem conditioned on the current depth and applies the resulting increment as a Lie-group update.

The crucial MoDAR property is that depth and motion enter one another’s update operators directly.
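The equilibrium view can be illustrated with a toy fixed-point iteration. The sketch below is not DualRefine's update operator: `toy_update` is a hypothetical linear, contractive map over scalar "depth" and "pose" values, the hidden state is omitted, and the names and coefficients are invented purely to show each variable reading the other at every step.

```python
def solve_equilibrium(update, z0, x, tol=1e-10, max_iter=200):
    """Iterate z <- update(z, x) until the state stops changing."""
    z = z0
    for k in range(max_iter):
        z_next = update(z, x)
        if max(abs(a - b) for a, b in zip(z_next, z)) < tol:
            return z_next, k + 1
        z = z_next
    raise RuntimeError("no equilibrium reached within max_iter")


def toy_update(z, x):
    """Hypothetical contractive update: each variable reads the other one."""
    depth, pose = z
    obs_d, obs_t = x
    new_depth = 0.5 * pose + obs_d    # depth update conditioned on pose
    new_pose = 0.25 * depth + obs_t   # pose update conditioned on depth
    return (new_depth, new_pose)


# Usage: the returned state satisfies z* = toy_update(z*, x) up to tol.
z_star, n_iters = solve_equilibrium(toy_update, (0.0, 0.0), (1.0, 0.5))
```

For this toy map the fixed point is depth = 10/7 and pose = 6/7. Freezing either variable at its initial value would converge to a different, inconsistent state, which is the behaviour the coupled formulation is meant to avoid.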

DiMoDE shows a different but complementary route to alignment: explicit decomposition of ego-motion into rotation, tangential translation, and radial translation. The paper derives the flow patterns induced by tangential motion and by radial motion separately. It then aligns correspondences into plane-aligned and axis-aligned forms so that the transformed flows exhibit the regularities of pure tangential or pure radial motion, and it constrains the aligned flows accordingly. After alignment, closed-form depth-translation ratios couple depth and individual translation components in both directions (Zhang et al., 3 Nov 2025).
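To make the idea of closed-form depth-translation ratios concrete, the sketch below uses the standard pinhole relations for a static point under pure lateral and pure forward camera translation. These are textbook identities rather than DiMoDE's exact expressions, and the function names and sign conventions are assumptions made here.

```python
def depth_ratio_tangential(x, x_new, fx):
    """Z / t_x for a camera translating purely along +x (pinhole model).

    x and x_new are the point's horizontal pixel coordinates before and
    after the motion; Z is the point's depth, unchanged by lateral motion.
    """
    shift = x - x_new
    if shift == 0:
        raise ValueError("no parallax: depth is unobservable")
    return fx / shift


def depth_ratio_radial(r, r_new):
    """Z / t_z for a camera translating purely along +z (forward).

    r and r_new are signed offsets from the principal point along one image
    axis, before and after the motion; Z is the depth before the motion.
    """
    if r_new == r:
        raise ValueError("no radial flow: point is on the axis or at infinity")
    return r_new / (r_new - r)
```

For example, a point at X = 1, Z = 5 seen with `fx = 500` projects to an offset of 100 px. After a lateral move of 0.5 it projects to 50 px, giving Z / t_x = 10. After a forward move of 1 it projects to 125 px, giving Z / t_z = 5. In both cases, knowing the translation component yields depth and knowing depth yields the translation component, which is the bidirectional coupling described above.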

A general implication is that MoDAR methods can rely on at least three distinct forms of alignment: epipolar resampling under current pose, motion-component normalization into simplified coordinate systems, and sparse geometric alignment of dense predictions to multi-view reconstructions. All three appear in the literature, and all aim to reduce the mismatch between how motion is estimated and how depth is supervised or refined.

4. Canonical forms of motion-depth coupling

The literature supports a taxonomy of several recurring MoDAR forms. The first is iterative coupled inference, represented most clearly by DualRefine. There, a single-frame teacher initializes the depth and pose estimates, after which a multi-frame module repeatedly updates depth, hidden state, and pose using local epipolar matching, Conv-GRU state propagation, and direct feature-metric pose alignment. The paper explicitly notes that the variant that does not perform pose updates has the worst depth accuracy, making the case that motion refinement materially improves depth refinement (Bangunharcana et al., 2023).

The second is joint training with geometric reprojection, represented by DFineNet. Its depth branch remains per-frame at inference, but temporal coupling enters through a pose branch and a photometric loss applied in regions where sparse input depth is absent. In that regime, motion cues densify supervision while sparse depth provides metric anchors (Zhang et al., 2019). A related variant is test-time sparse-geometry refinement, where SfM-TTR runs COLMAP on a test sequence, obtains sparse 3D points and camera poses, projects those points into the image, robustly aligns the scale between sparse SfM depth and dense network depth, and updates only the encoder with a confidence-weighted sparse depth loss (Izquierdo et al., 2022). This is motion-depth alignment mediated by SfM rather than by a learned pose head.
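The scale-alignment and weighted-loss steps of the test-time variant can be sketched as follows. This is a hedged illustration: the median of per-point ratios is one common robust estimator and is assumed here, as is the choice to rescale the sparse SfM depths toward the network's scale; the function names are hypothetical and SfM-TTR's actual estimator and loss may differ.

```python
from statistics import median


def align_scale(dense_depth, sparse_depth):
    """Median-ratio scale mapping sparse SfM depths onto dense predictions.

    Both arguments are equal-length lists sampled at the same pixels.
    The median keeps a few bad SfM points from dominating the estimate.
    """
    ratios = [d / s for d, s in zip(dense_depth, sparse_depth) if d > 0 and s > 0]
    if not ratios:
        raise ValueError("no valid sparse points to align against")
    return median(ratios)


def weighted_sparse_loss(dense_depth, sparse_depth, confidence):
    """Confidence-weighted L1 loss after rescaling the sparse SfM depths."""
    scale = align_scale(dense_depth, sparse_depth)
    total_weight = sum(confidence)
    if total_weight <= 0:
        raise ValueError("all confidences are zero")
    weighted_error = sum(c * abs(d - scale * s)
                         for d, s, c in zip(dense_depth, sparse_depth, confidence))
    return weighted_error / total_weight
```

In a real pipeline the loss would be backpropagated through the depth network, with only the encoder updated as the text describes. The sketch covers only the scalar computation.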

A third form is association-level motion-depth refinement in tracking. DepTR-MOT augments detector outputs with an instance-level depth scalar, then refines second-stage matching costs with a depth-consistency term, so that ambiguous motion- or IoU-based associations are penalized when depth is inconsistent (Deng et al., 22 Sep 2025). DepthMOT similarly introduces depth as an independent decision matrix and refines IoU with a Hierarchical Alignment Score (HAS) before summing it with motion, depth, and appearance scores in the final assignment matrix (Khanchi et al., 1 Jun 2025). These are weaker MoDAR forms because depth does not update the tracker’s state dynamics directly, but instead refines association.
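A minimal sketch of association-level refinement is given below. The additive cost form, the weight `w_depth`, and the exhaustive matcher are all assumptions chosen for brevity; neither DepTR-MOT's nor DepthMOT's actual cost function is reproduced here, and a practical tracker would use a Hungarian solver rather than enumerating permutations.

```python
from itertools import permutations


def association_cost(iou, depth_track, depth_det, w_depth=1.0):
    """Hypothetical cost: IoU distance plus a depth-inconsistency penalty."""
    return (1.0 - iou) + w_depth * abs(depth_track - depth_det)


def best_assignment(cost):
    """Exhaustive minimum-cost matching for a small square cost matrix.

    Returns a tuple perm where track i is matched to detection perm[i].
    """
    n = len(cost)
    return min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))


# Two overlapping targets: IoU alone cannot tell them apart.
iou = [[0.6, 0.6], [0.6, 0.6]]
track_depth = [0.2, 0.8]   # normalized depths of existing tracks
det_depth = [0.8, 0.2]     # normalized depths of new detections
cost = [[association_cost(iou[i][j], track_depth[i], det_depth[j])
         for j in range(2)] for i in range(2)]
match = best_assignment(cost)
```

In this example every IoU-only cost is identical, so the IoU term gives no preference. The depth term makes the crossed assignment `(1, 0)` clearly cheaper, which is the kind of disambiguation the association-level methods target.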

The following representative systems illustrate this spectrum.

Representative method	Setting	MoDAR-relevant coupling
“DeMoN” (Ummenhofer et al., 2016)	Two-frame monocular SfM	Alternates flow and depth-motion estimation with iterative feedback
“DFineNet” (Zhang et al., 2019)	RGB + sparse noisy depth	Joint depth completion and ego-motion via reprojection loss
“DualRefine” (Bangunharcana et al., 2023)	Self-supervised monocular depth and pose	Recurrent equilibrium over hidden state, depth, and pose
“SfM-TTR” (Izquierdo et al., 2022)	Test-time monocular depth refinement	Aligns dense depth to sparse SfM geometry and poses
“DepthMOT” (Khanchi et al., 1 Jun 2025)	Online MOT	Adds depth and hierarchical alignment to motion-based association
“DepTR-MOT” (Deng et al., 22 Sep 2025)	Tracking-by-detection MOT	Uses detector-predicted instance depth to refine assignment costs

A plausible implication is that MoDAR is less defined by a specific backbone than by where the bidirectional dependency is inserted: inside recurrent inference, inside the loss, at test-time adaptation, or inside downstream association.

5. The explicit MoDAR module in monocular video human mesh recovery

The most literal use of the name appears in “Depth-Guided Metric-Aware Temporal Consistency for Monocular Video Human Mesh Recovery,” where MoDAR is a downstream refinement stage applied after depth-guided fusion and D-MAPS initialization (Cen et al., 4 Feb 2026). Its role is to address temporal inconsistency, occlusion artifacts, residual jitter, and failures that remain after metric-aware initialization.

MoDAR in this setting takes three explicitly stated inputs: fused RGB-depth features, motion tokens derived from lifted skeletal representations, and the initialized SMPL pose and shape. The module uses the motion tokens as queries and the fused tokens as keys and values in two stacked cross-attention blocks with bidirectional information flow, layer normalization, and a compact feed-forward network to produce a context feature (Cen et al., 4 Feb 2026).

Refinement then proceeds through a gated residual update with a causal temporal filter. This combines refinement around the D-MAPS initialization, parameter-wise gating of the residual, and online temporal smoothing. The paper emphasizes that depth information acts solely as a feature cue without requiring depth-specific loss functions; MoDAR is therefore geometry-informed through representation alignment rather than through an explicit depth loss (Cen et al., 4 Feb 2026).
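The three ingredients named above (residual refinement around an initialization, parameter-wise gating, and causal smoothing) can be sketched as follows. The exponential moving average is an assumed stand-in for the paper's causal temporal filter, and `refine_sequence`, `delta`, `gate`, and `alpha` are hypothetical names; in the actual module the residual and gate would be predicted from the context feature.

```python
def refine_sequence(theta_init, delta, gate, alpha=0.7):
    """Gated residual refinement followed by a causal exponential filter.

    theta_init, delta, gate: per-frame lists of equal-length parameter lists.
    gate values lie in [0, 1] and scale the residual per parameter.
    alpha weights the current frame; (1 - alpha) weights the previous output.
    """
    smoothed, prev = [], None
    for th0, d, g in zip(theta_init, delta, gate):
        # Residual update around the initialization, gated per parameter.
        refined = [t + gi * di for t, di, gi in zip(th0, d, g)]
        if prev is not None:
            # Causal: only the previous output is used, never future frames.
            refined = [alpha * r + (1.0 - alpha) * p
                       for r, p in zip(refined, prev)]
        smoothed.append(refined)
        prev = refined
    return smoothed
```

A gate of zero leaves the initialization untouched for that parameter, so the module can decline to correct estimates it has no evidence against. Because the filter only looks backward, the scheme remains usable online.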

The ablation evidence supports the role of MoDAR as a genuine refinement stage. On the 3DPW test set, the paper reports MPJPE, PA-MPJPE, MPVPE, and Accel for the baseline, for “+ MoDAR (w/o D-MAPS)”, and for the complete system (Cen et al., 4 Feb 2026). The comparison indicates that MoDAR contributes materially, but performs best when paired with the metric-aware initialization it is designed to refine.

6. Empirical evidence, misconceptions, and limitations

The strongest quantitative support for MoDAR-style coupling comes from ablations that remove one side of the motion-depth loop. In DualRefine, the pose-update ablation compares a “no update” model with variants that enable pose updates, and the paper explicitly states that the model that does not perform pose updates has the worst accuracy (Bangunharcana et al., 2023). The same loop also improves pose: on KITTI odometry Seq. 09, translation error, rotation error (deg/100m), and ATE (m) all decrease after refinement (Bangunharcana et al., 2023).

In the sparse-depth regime, DFineNet shows that robustness is not automatic but learned. On TUM with noisy sparse input, the method is compared against Ma et al. on RMSE when both training and test data are noisy; when it is trained on clean data and tested on noisy data, its accuracy collapses (Zhang et al., 2019). The result is directly relevant to MoDAR because it shows that motion-based regularization can improve refinement under noise, but only when the corruption regime is represented during training. SfM-TTR adds a different caution: when SfM fails because of low texture, low parallax, or dynamic content, no reliable sparse geometry is available for refinement, and the original network prediction is retained (Izquierdo et al., 2022).

Association-level MoDAR variants also have clear but narrower effects. In DepthMOT’s DanceTrack ablation, adding depth to appearance plus bbox IoU increases HOTA, AssA, and IDF1; replacing bbox IoU with HAS gives comparable gains, and the ablation also reports all three metrics for HAS combined with depth (Khanchi et al., 1 Jun 2025). DepTR-MOT reports especially large association improvements on QuadTrack, where adding depth to ByteTrack raises both HOTA and AssA (Deng et al., 22 Sep 2025). These results support the claim that depth can resolve motion-based ambiguity, but they do not imply full 3D state estimation.

Several misconceptions recur. MoDAR does not necessarily mean classical full joint optimization over all scene variables, and some of the strongest examples are explicitly not bundle adjustment. DualRefine is “better characterized as coupled alternating refinement within a recurrent/equilibrium architecture than as one monolithic joint solver” (Bangunharcana et al., 2023). DFineNet couples depth and pose mainly during training rather than through iterative online refinement (Zhang et al., 2019). DepthMOT and DepTR-MOT use depth chiefly to refine association costs, not to update a 3D dynamical state (Khanchi et al., 1 Jun 2025, Deng et al., 22 Sep 2025). Conversely, the presence of a depth channel alone is insufficient; the defining property is aligned dependency between depth and motion.

Limitations are correspondingly method-specific but structurally similar. DualRefine still struggles with dynamic objects and can amplify outliers in difficult regions (Bangunharcana et al., 2023). The human-mesh MoDAR module depends on the quality of lifted joints and depth features, and it uses depth as a feature cue rather than an explicit geometric constraint (Cen et al., 4 Feb 2026). DiMoDE requires reliable rigid correspondences and uses progressive training because misaligned early predictions can make the component-wise transformations unreliable (Zhang et al., 3 Nov 2025). A plausible synthesis is that MoDAR is most effective when the scene is sufficiently rigid, correspondences are reasonably reliable, and the refinement pathway has access to an initialization that is already within the basin of a geometrically consistent solution.