IUP-Pose: Decoupled Iterative Uncertainty Propagation for Real-time Relative Pose Regression via Implicit Dense Alignment

Published 20 Mar 2026 in cs.CV | (2603.19625v1)

Abstract: Relative pose estimation is fundamental for SLAM, visual localization, and 3D reconstruction. Existing Relative Pose Regression (RPR) methods face a key trade-off: feature-matching pipelines achieve high accuracy but block gradient flow via non-differentiable RANSAC, while ViT-based regressors are end-to-end trainable but prohibitively expensive for real-time deployment. We identify the core bottlenecks as the coupling between rotation and translation estimation and insufficient cross-view feature alignment. We propose IUP-Pose, a geometry-driven decoupled iterative framework with implicit dense alignment. A lightweight Multi-Head Bi-Cross Attention (MHBC) module aligns cross-view features without explicit matching supervision. The aligned features are processed by a decoupled rotation-translation pipeline: two shared-parameter rotation stages iteratively refine rotation with uncertainty, and feature maps are realigned via rotational homography H_inf before translation prediction. IUP-Pose achieves 73.3% AUC@20deg on MegaDepth1500 with full end-to-end differentiability, 70 FPS throughput, and only 37M parameters, demonstrating a favorable accuracy-efficiency trade-off for real-time edge deployment.


Authors (2)

Summary

The paper presents a novel decoupled iterative refinement method that independently estimates rotation and translation using uncertainty propagation.
It combines transformer-based implicit dense alignment with homography warping and multi-scale feature extraction to ensure robust cross-view geometric consistency.
Empirical evaluations on MegaDepth1500 demonstrate 70 FPS throughput and 73.3% AUC@20° accuracy, indicating a favorable speed-accuracy trade-off.

IUP-Pose: Decoupled Iterative Uncertainty Propagation for Real-time Relative Pose Regression

Motivation and Background

Relative pose estimation between RGB images is foundational for downstream 3D perception tasks, including SLAM, visual localization, and large-scale 3D reconstruction. Contemporary pipelines either employ correspondence-based geometric optimization, which is not amenable to end-to-end training due to non-differentiable modules (e.g., RANSAC), or rely on direct, end-to-end regression architectures (predominantly ViT-based), which are resource-intensive and struggle to operate in real time. The coupling between rotation and translation estimation and the suboptimal feature alignment across disparate views are the key bottlenecks in current Relative Pose Regression (RPR) methods.

Methodology

IUP-Pose addresses these deficiencies by introducing a geometry-driven, decoupled, iterative refinement scheme with implicit dense alignment to perform real-time relative pose regression.

Decoupled Iterative Refinement. Exploiting the geometric separability of rotation and translation, the architecture decomposes pose regression into a sequence of sub-tasks: two rotation estimation stages with shared parameters, followed by translation prediction. Each rotation stage refines the previous estimate and passes its predicted uncertainty forward to the next stage.
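The staged rotation refinement can be sketched as a loop that composes residual updates. This is a minimal illustration under stated assumptions, not the authors' implementation: the axis-angle parameterization, the left-multiplied update, and the `decoder` callable (a hypothetical stand-in for the shared-parameter rotation stage that returns a residual and an uncertainty) are all assumptions made here.

```python
import numpy as np

def axis_angle_to_matrix(w):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    Kx = np.array([[0.0, -k[2], k[1]],
                   [k[2], 0.0, -k[0]],
                   [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * Kx + (1.0 - np.cos(theta)) * (Kx @ Kx)

def refine_rotation(decoder, num_stages=2):
    """Run `num_stages` refinement stages with one shared decoder.

    `decoder(R, sigma)` is a hypothetical callable that sees the current
    rotation estimate and the previous uncertainty and returns
    (residual axis-angle vector, new uncertainty).
    """
    R = np.eye(3)   # start from the identity rotation (assumption)
    sigma = None    # no uncertainty is available before the first stage
    for _ in range(num_stages):
        delta, sigma = decoder(R, sigma)
        R = axis_angle_to_matrix(delta) @ R  # compose the residual update
    return R, sigma
```

Because the same `decoder` is called at every stage, the number of stages can be changed without adding parameters, which matches the shared-parameter design described above.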

Implicit Dense Alignment (IDA). The pipeline interleaves feature extraction with a transformer-based cross-view alignment mechanism. RGB images are first augmented with normalized camera plane coordinates and passed through a ResNet encoder to yield multi-scale representations.
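As a rough illustration of the input augmentation, the snippet below appends normalized camera-plane coordinates to an image as two extra channels. It assumes a zero-skew pinhole model and a channels-last layout; the function name `add_normalized_coords` and the channel ordering are illustrative choices, not details taken from the paper.

```python
import numpy as np

def add_normalized_coords(image, K):
    """Append normalized camera-plane coordinates as two extra channels.

    image: (H, W, 3) float array; K: (3, 3) pinhole intrinsics (zero skew assumed).
    Returns an (H, W, 5) array: RGB followed by x = (u - cx) / fx and y = (v - cy) / fy.
    """
    h, w = image.shape[:2]
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Integer pixel grid: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    x = (u - cx) / fx
    y = (v - cy) / fy
    return np.concatenate([image, x[..., None], y[..., None]], axis=-1)
```

Feeding these coordinates to the encoder makes the intrinsics visible to the network at every pixel, so images with different focal lengths or principal points are not treated as geometrically identical.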

Figure 1: IUP-Pose’s overall architecture, highlighting the decoupled iterative design and dense alignment pathway.

The IDA module combines fast spatial pyramid pooling (SPPF) for multi-scale context aggregation with a Multi-Head Bi-Cross Attention (MHBC) layer for cross-view feature correspondence, promoting global geometric consistency without per-pixel matching supervision.
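The idea of bidirectional cross-attention can be sketched in a few lines. The version below is deliberately stripped down and rests on assumptions: it omits learned query/key/value projections, normalization, and feed-forward layers, uses the context features as both keys and values, and adds a plain residual connection. The actual MHBC layer may differ in all of these respects.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_feats, context_feats, num_heads):
    """Multi-head attention from query tokens (N, C) to context tokens (M, C)."""
    n, c = query_feats.shape
    m = context_feats.shape[0]
    d = c // num_heads  # per-head channel width; C must be divisible by num_heads
    q = query_feats.reshape(n, num_heads, d).transpose(1, 0, 2)    # (heads, N, d)
    k = context_feats.reshape(m, num_heads, d).transpose(1, 0, 2)  # (heads, M, d)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))          # (heads, N, M)
    out = attn @ k  # keys double as values because no projections are learned here
    return out.transpose(1, 0, 2).reshape(n, c)

def bi_cross_attention(feats_a, feats_b, num_heads=4):
    """Each view attends to the other; both updates use the original features."""
    a_new = feats_a + cross_attend(feats_a, feats_b, num_heads)
    b_new = feats_b + cross_attend(feats_b, feats_a, num_heads)
    return a_new, b_new
```

Here `feats_a` and `feats_b` stand for flattened feature maps (one row per spatial location). No correspondence labels are involved: the attention weights are a soft, dense matching that is learned only through the downstream pose loss.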

Rotation and Translation Decoders. Both decoders share a lightweight architecture and are conditioned on the encoded features, the input camera intrinsics, and the previous predictions together with their uncertainties. Homography warping, parameterized by the predicted rotation and the intrinsics (the rotational homography H_inf), realigns features across views at each iterative step to correct rotational disparities, so that subsequent refinement and translation prediction operate on better-aligned features.
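For reference, the rotational (infinite) homography has the standard closed form H_inf = K2 R K1^{-1}, which maps pixels of view 1 to view 2 as if the cameras differed only by the rotation R. The sketch below applies it to pixel coordinates. The direction convention (R taking camera-1 coordinates to camera-2 coordinates) is an assumption, and a network would instead warp whole feature maps with inverse mapping and bilinear sampling, which is omitted here.

```python
import numpy as np

def rotational_homography(K1, K2, R):
    """H_inf = K2 @ R @ inv(K1): pixel mapping induced by a pure rotation R."""
    return K2 @ R @ np.linalg.inv(K1)

def warp_points(H, pts):
    """Apply a 3x3 homography to (N, 2) pixel coordinates; returns (N, 2)."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # lift to homogeneous coords
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]                   # divide by the last coordinate
```

After this warp, any residual displacement between the two views is due to translation and scene depth alone, which is why the realignment is applied before translation prediction.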

Figure 2: Shared architecture of the rotation and translation decoders utilizing view fusion and uncertainty gating.

Uncertainty-aware objective functions (aleatoric Laplace-NLL for rotation, directional error for translation) provide robust supervision and enable error-aware iterative refinement.
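The two loss ingredients can be written down generically. The Laplace negative log-likelihood for a residual r with predicted scale b is |r| / b + log(2b); the directional error is the angle between predicted and ground-truth translation directions. The sketch below assumes the network predicts log b for numerical stability and treats the rotation target as a generic vector, since the rotation parameterization is not specified here. Both function names are illustrative.

```python
import numpy as np

def laplace_nll(pred, target, log_b):
    """Mean Laplace negative log-likelihood with predicted log-scale log_b.

    Per element: |pred - target| / b + log(2 b), with b = exp(log_b).
    A large b down-weights the residual but is penalized by the log term.
    """
    b = np.exp(log_b)
    return np.mean(np.abs(pred - target) / b + log_b + np.log(2.0))

def translation_direction_error_deg(t_pred, t_gt):
    """Angle in degrees between two translation vectors (scale is ignored)."""
    cos = np.dot(t_pred, t_gt) / (np.linalg.norm(t_pred) * np.linalg.norm(t_gt))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```

The directional form reflects that translation from two views is recoverable only up to scale, so only the direction of the predicted vector is supervised.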

Empirical Evaluation

Benchmark and Protocol. Experiments are conducted on MegaDepth1500, a benchmark characterized by significant viewpoint, intrinsic, and appearance variations.

Figure 3: MegaDepth image pairs illustrating extreme geometric and photometric disparities.

Speed-Accuracy Trade-offs. IUP-Pose occupies a favorable point on the speed-accuracy Pareto front for RPR frameworks, reaching a throughput of 70 FPS (about 14.3 ms latency per image pair) and 73.3% AUC@20°. It is more efficient than both ViT-based direct regression and correspondence-based pipelines while maintaining comparable accuracy.
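To make the two headline numbers concrete, the sketch below converts throughput to latency and computes a pose AUC at a given angular threshold. The AUC routine follows the protocol commonly used for MegaDepth1500 (area under the recall-versus-error curve, normalized by the threshold, with the per-pair error usually taken as the larger of the rotation and translation angular errors). That the paper uses exactly this protocol is an assumption.

```python
import numpy as np

def fps_to_latency_ms(fps):
    """Per-pair latency in milliseconds implied by a throughput in pairs per second."""
    return 1000.0 / fps

def pose_auc(errors_deg, threshold_deg):
    """Normalized area under the recall-vs-error curve up to threshold_deg."""
    errors = np.sort(np.asarray(errors_deg, dtype=np.float64))
    recall = np.arange(1, len(errors) + 1) / len(errors)
    # Start the curve at (0, 0).
    errors = np.concatenate([[0.0], errors])
    recall = np.concatenate([[0.0], recall])
    # Keep points below the threshold and close the curve at the threshold.
    last = np.searchsorted(errors, threshold_deg)
    e = np.concatenate([errors[:last], [threshold_deg]])
    r = np.concatenate([recall[:last], [recall[last - 1]]])
    # Trapezoidal integration, written out to avoid version-specific NumPy names.
    area = np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2.0)
    return area / threshold_deg
```

With this definition, `fps_to_latency_ms(70)` gives roughly 14.3 ms, consistent with the figure quoted above, and an AUC@20° of 73.3% means the recall curve covers 73.3% of the area of the 0° to 20° window.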

Figure 4: Speed-accuracy trade-off: IUP-Pose attains a unique position of high accuracy and unmatched inference speed among peer methods.

Ablation Studies. The ablations indicate that the largest gains arise from (i) explicit rotation-translation decoupling, (ii) iterative coarse-to-fine residual updates, and (iii) SPPF+MHBC-based dense alignment, with uncertainty propagation and homography warping further contributing to robustness and precision. Transfer learning (ScanNet pretraining) improves performance, especially under relaxed thresholds.

Cross-View Robustness. Unlike keypoint-based pipelines, IUP-Pose maintains strong accuracy across a spectrum of image overlap conditions, including low-overlap regimes notoriously challenging for traditional approaches.

Qualitative Visualizations. Attention heatmaps identify semantically and geometrically consistent regions across divergent views. Qualitative warping analyses show that rotational disparity is effectively corrected before translation regression.

Figure 5: (a) MHBC attention maps highlight consistent cross-view correspondences; (b) Rotational homography alignment eliminates rotational misalignment, facilitating translation estimation.

Theoretical and Practical Implications

The explicit rotation-translation decoupling, undergirded by geometric priors and homography-based alignment, unifies classic epipolar concepts with differentiable deep learning. This paradigm enables fast, robust, and fully end-to-end trainable pose estimation suited for integration into broader differentiable 3D vision systems. Lightweight architectures (37M parameters) and competitive accuracy on challenging outdoor data suggest direct applicability to real-time edge deployments, including AR navigation and robotics.

The limitations observed on indoor datasets with large rotations and translations (e.g., ScanNet) suggest that pure homography-based warping is suboptimal under limited fields of view. This points to future work on adaptive multi-scale alignment or alignment mechanisms that are robust to out-of-bounds warps.

Conclusion

IUP-Pose introduces a geometry-driven, decoupled, and iterative architecture for real-time relative pose regression that achieves a compelling balance of accuracy, speed, and parameter efficiency. Dense, implicit alignment via transformer cross-attention, combined with geometric warping and uncertainty-guided iterative refinement, positions IUP-Pose as a competitive solution for real-time camera pose estimation in photogrammetric and robotics pipelines. Future work should improve robustness under severe viewpoint changes and limited fields of view, further broadening practical applicability.
