LA-Pose: Latent Action Pretraining Meets Pose Estimation

Published 30 Apr 2026 in cs.CV | (2604.27448v1)

Abstract: This paper revisits camera pose estimation through the lens of self-supervised pretraining, focusing on inverse-dynamics pretraining as a scalable alternative to the current trend of fully supervised training with 3D annotations. Concretely, we employ inverse- and forward-dynamics models, similar to Genie, to learn latent action representations from large-scale driving videos. Our idea is simple yet effective. Existing methods use latent actions in their original capacity, that is, as action conditioning of world models or as proxies of robot action parameters in policy networks. Our method, dubbed LA-Pose, repurposes the latent action features as inputs to a camera pose estimator, finetuned on a limited set of high-quality 3D annotations. This formulation enables accurate and generalizable pose prediction while maintaining feed-forward efficiency. Extensive experiments on driving benchmarks show that LA-Pose achieves competitive and even superior performance to state-of-the-art methods while using orders of magnitude less labeled data. Concretely, on the Waymo and PandaSet benchmarks, LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods. To our knowledge, this work is the first to demonstrate the power of inverse-dynamics self-supervised learning for pose estimation.


Authors (7)

Summary

The paper introduces a self-supervised latent action pretraining strategy that reduces reliance on expensive 3D annotations for camera pose estimation.
It presents a two-stage architecture combining inverse-forward dynamics with a transformer-based pose head to deliver robust, motion-centric performance.
Results demonstrate over 10% higher accuracy on Waymo and strong zero-shot generalization on PandaSet, affirming scalability under diverse conditions.

LA-Pose: Latent Action Pretraining for Feed-Forward Camera Pose Estimation

Motivation and Context

Scalable, accurate camera pose estimation for autonomous driving and embodied AI is held back by the limited availability of high-quality annotated 3D datasets. Feed-forward 3D reconstruction approaches (e.g., DUSt3R, VGGT, Rig3R) have advanced the state of the art, but they depend on expensive sensor-derived labels, which restricts their generalization and scalability. Self-supervised learning paradigms, notably latent action pretraining, have yielded rich representations in robotics and video modeling but have not been systematically applied to geometric perception tasks. LA-Pose (2604.27448) addresses this gap with a pipeline that pretrains latent actions on unlabeled video in a self-supervised manner and then trains a camera pose estimator on a limited set of 3D annotations.

Methodology

LA-Pose introduces a two-stage training architecture:

Latent Action Pretraining: Building on Genie-style architectures, an inverse-forward dynamics model operates over sequences of driving video frames to learn latent actions representing compact, motion-centric inter-frame transformations. The model employs a Vision Transformer image tokenizer, causal temporal masking, and a compressed latent action bottleneck designed to enhance abstraction and minimize information leakage that is detrimental to downstream ego-motion estimation.
Camera Pose Post-Training: Using a small subset of high-quality 3D annotated scenes (Waymo, nuScenes, Argoverse), a lightweight transformer-based pose estimation head is attached to the pretrained backbone. This head predicts relative camera translation, rotation (quaternion), field-of-view, and metric scale directly from latent actions, with the backbone either frozen or fine-tuned during this stage.
Figure 1: Two-stage LA-Pose framework—self-supervised latent action pretraining followed by supervised camera pose prediction leveraging motion-centric representations.
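
One way to picture what the pose head's outputs are used for is to chain per-step relative poses into a camera trajectory. The following is a minimal sketch, not the authors' implementation. It assumes quaternions in (w, x, y, z) order, a relative translation expressed in the previous camera's frame, and a predicted metric scale that multiplies a unit-less translation; all function names are hypothetical, and the field-of-view output is left out.

```python
import math

def quat_normalize(q):
    """Rescale a quaternion (w, x, y, z) to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)

def quat_multiply(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def quat_rotate(q, v):
    """Rotate 3-vector v by unit quaternion q via q * (0, v) * conj(q)."""
    w, x, y, z = q
    rotated = quat_multiply(quat_multiply(q, (0.0, *v)), (w, -x, -y, -z))
    return rotated[1:]

def chain_relative_poses(relative_poses):
    """Accumulate (translation, quaternion, scale) steps into positions.

    Assumed convention: each translation is expressed in the previous
    camera frame and is multiplied by the predicted metric scale.
    """
    position = (0.0, 0.0, 0.0)
    orientation = (1.0, 0.0, 0.0, 0.0)  # identity rotation
    trajectory = [position]
    for translation, quat, scale in relative_poses:
        step = quat_rotate(orientation, tuple(scale * t for t in translation))
        position = tuple(p + s for p, s in zip(position, step))
        orientation = quat_normalize(
            quat_multiply(orientation, quat_normalize(quat))
        )
        trajectory.append(position)
    return trajectory

# Example: drive forward along x while yawing 90 degrees, then drive
# forward again with a metric scale of 2.
half = math.sqrt(0.5)
steps = [
    ((1.0, 0.0, 0.0), (half, 0.0, 0.0, half), 1.0),
    ((1.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0), 2.0),
]
trajectory = chain_relative_poses(steps)
```

In this example the second step is rotated into the world frame by the accumulated yaw, so the trajectory ends near (1, 2, 0). A different pose convention (for instance, world-to-camera transforms) would change the composition order.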

Experimental Evaluation

Quantitative Results

LA-Pose was evaluated on the Waymo Open Dataset (in-distribution) and PandaSet (zero-shot), reporting AUC@5, scale-invariant average trajectory error (ATE-S), and metric ATE (ATE-M). LA-Pose reports over 10% higher pose accuracy than state-of-the-art feed-forward baselines while using substantially less labeled supervision; on Waymo, its AUC@5 margin is 13.5 percentage points over Rig3R and 16.6 over VGGT:

Waymo: LA-Pose yields 91.4% AUC@5 and 1.20×10⁻² ATE-S, outperforming Rig3R (77.9%/3.17×10⁻²) and VGGT (74.8%/1.43×10⁻²).
PandaSet (zero-shot): LA-Pose achieves 86.3% AUC@5 and 1.13×10⁻² ATE-S, again surpassing VGGT and MapAnything.
Figure 2: Distribution of pose estimation AUC@5 scores for LA-Pose versus VGGT on the Waymo Open Dataset, highlighting improved average accuracy and lower variance.
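
To make the reported metrics concrete, here is a minimal sketch of how AUC@5 and the two ATE variants could be computed. The exact protocol is not given in this summary, so the definitions below are assumptions: AUC@5 is taken as the mean accuracy over integer angular-error thresholds of 1 to 5 degrees, ATE-M as the root-mean-square position error without any rescaling, and ATE-S as the same error after a scale-only least-squares alignment. Benchmark implementations often use a full similarity alignment instead, and all function names here are hypothetical.

```python
import math

def auc_at_threshold(errors_deg, max_threshold=5):
    """Mean accuracy over integer thresholds 1..max_threshold (degrees)."""
    accuracies = []
    for threshold in range(1, max_threshold + 1):
        hits = sum(1 for e in errors_deg if e < threshold)
        accuracies.append(hits / len(errors_deg))
    return sum(accuracies) / len(accuracies)

def rmse(pred, gt):
    """Root-mean-square Euclidean distance between paired 3D positions."""
    squared = [
        sum((p - g) ** 2 for p, g in zip(pp, gg)) for pp, gg in zip(pred, gt)
    ]
    return math.sqrt(sum(squared) / len(squared))

def ate_metric(pred, gt):
    """Metric ATE (assumed form): error measured with no rescaling."""
    return rmse(pred, gt)

def ate_scale_aligned(pred, gt):
    """Scale-invariant ATE (assumed form): scale-only alignment first.

    The scalar s minimizing sum ||s * pred - gt||^2 is
    sum(pred . gt) / sum(pred . pred). Both trajectories are assumed
    to already share an origin and orientation.
    """
    num = sum(p * g for pp, gg in zip(pred, gt) for p, g in zip(pp, gg))
    den = sum(p * p for pp in pred for p in pp)
    scale = num / den if den > 0 else 1.0
    scaled = [tuple(scale * p for p in pp) for pp in pred]
    return rmse(scaled, gt)

# Example: a prediction with the right shape but half the true scale.
gt_positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
pred_positions = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
ate_m = ate_metric(pred_positions, gt_positions)
ate_s = ate_scale_aligned(pred_positions, gt_positions)
auc5 = auc_at_threshold([0.5, 1.5, 2.5, 10.0])
```

The example shows why both ATE variants are reported: a trajectory with the correct shape but the wrong scale has zero scale-aligned error and a non-zero metric error.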

Robustness studies indicate that LA-Pose remains more accurate than the baselines across varying frame sampling rates.

Qualitative Analysis

Qualitative comparisons show that LA-Pose produces geometrically coherent and stable trajectories under challenging conditions (night, rain, fog, sharp turns) that were not seen during post-training, underscoring the benefit of large-scale self-supervised motion pretraining.

Figure 3: Representative camera pose estimation results—LA-Pose (green) sustains stable trajectories compared to Rig3R and VGGT under adverse conditions.

Additional qualitative studies on sparse temporal sampling (1 fps) and diverse, uncalibrated OpenDV–YouTube in-the-wild videos further demonstrate LA-Pose's generalization capacity.

Figure 4: Low frame rate (1 fps) results—LA-Pose sustains stable, temporally consistent trajectories where VGGT exhibits drift.
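
The sparse-sampling setting can be reproduced in spirit by subsampling a clip to a lower frame rate before running a pose estimator. The helper below is a hypothetical sketch; the 10 fps source rate in the example is an illustrative assumption rather than a value taken from the paper.

```python
def subsample_indices(num_frames, source_fps, target_fps):
    """Return the frame indices that approximate target_fps.

    Uses a fixed integer stride, so the achieved rate only matches
    target_fps exactly when source_fps is a multiple of it.
    """
    if target_fps <= 0 or target_fps > source_fps:
        raise ValueError("target_fps must be in (0, source_fps]")
    stride = round(source_fps / target_fps)
    return list(range(0, num_frames, stride))

# Example: a 3-second clip at an assumed 10 fps, reduced to 1 fps.
sparse = subsample_indices(num_frames=30, source_fps=10, target_fps=1)
```

At 1 fps each relative pose spans ten times more motion than at 10 fps, which is what makes this setting a useful stress test.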

Figure 5: Application to OpenDV–YouTube videos—LA-Pose generalizes to varying urban and non-urban conditions, producing stable camera pose predictions.

Ablations and Failure Modes

Ablations show that compressing the latent action (e.g., 50-D vs. 1536-D) enforces motion-centric abstraction and metric-scale consistency, improving downstream pose metrics at the cost of a higher pretraining reconstruction loss. Freezing the pretrained backbone during pose post-training yields the best generalization, whereas fine-tuning introduces domain-specific overfitting.

Figure 6: Impact of freezing vs. fine-tuning the inverse-dynamics backbone, showing superior zero-shot generalization to PandaSet with a frozen backbone.
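
The role of the latent action bottleneck in the compression ablation can be illustrated with a toy inverse-forward dynamics pair. This is only a sketch under strong simplifications: random linear maps over small feature vectors stand in for the Vision Transformer tokenizer and transformer blocks, there is no training loop, and every name and dimension is hypothetical.

```python
import random

def linear(weights, vec):
    """Apply a dense matrix, given as a list of rows, to a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def random_matrix(rows, cols, rng, scale=0.1):
    """Small Gaussian weights; a stand-in for learned parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

class ToyLatentActionModel:
    """Inverse dynamics encodes a frame pair into a narrow latent action;
    forward dynamics must predict the next frame from the current frame
    plus that latent only."""

    def __init__(self, feat_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.w_inverse = random_matrix(latent_dim, 2 * feat_dim, rng)
        self.w_forward = random_matrix(feat_dim, feat_dim + latent_dim, rng)

    def inverse_dynamics(self, frame_t, frame_next):
        # Sees both frames; the output width is the bottleneck size.
        return linear(self.w_inverse, list(frame_t) + list(frame_next))

    def forward_dynamics(self, frame_t, latent_action):
        # Never sees frame_next directly, only through the latent.
        return linear(self.w_forward, list(frame_t) + list(latent_action))

    def reconstruction_loss(self, frame_t, frame_next):
        latent = self.inverse_dynamics(frame_t, frame_next)
        predicted = self.forward_dynamics(frame_t, latent)
        errors = [(p - q) ** 2 for p, q in zip(predicted, frame_next)]
        return sum(errors) / len(errors)

# Example: 16-D frame features squeezed through a 4-D latent action.
rng = random.Random(1)
frame_a = [rng.random() for _ in range(16)]
frame_b = [rng.random() for _ in range(16)]
model = ToyLatentActionModel(feat_dim=16, latent_dim=4)
latent = model.inverse_dynamics(frame_a, frame_b)
loss = model.reconstruction_loss(frame_a, frame_b)
```

Because the forward model only receives the next frame through the latent, a narrower latent leaves less room to copy appearance details, which is consistent with the trade-off described above between reconstruction loss and downstream pose metrics.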

Failure analyses identify edge cases such as reverse motion as limitations, which the authors attribute to their rarity in the post-training datasets. Even in these cases, LA-Pose retains partial trajectory consistency thanks to its exposure to diverse motion patterns during large-scale unlabeled pretraining.

Figure 7: Failure case—degraded accuracy under reverse motion due to distribution gap in labeled post-training data.

Implications and Future Directions

LA-Pose demonstrates that self-supervised latent action pretraining can be repurposed for efficient, scalable, and generalizable feed-forward camera pose estimation. The results provide empirical evidence that large-scale video pretraining can replace much of the expensive 3D supervision, lowering the annotation barrier for internet-scale geometric perception. In practice, such frameworks could be deployed in domains with limited labeled data, including robotics, AR/VR, and embodied video analysis.

Conceptually, the latent action paradigm introduces a motion-centric bottleneck that supports abstraction and transfer, opening avenues for further research in unsupervised geometric learning, out-of-domain pose estimation, and robust 4D representation learning. Scaling up the diversity and scope of pretraining datasets may help address the edge-case limitations and broaden applicability across camera configurations and environments.

Conclusion

LA-Pose establishes a unified framework for camera pose estimation based on self-supervised latent action learning and lightweight supervised fine-tuning. The empirical studies indicate that latent action pretraining achieves state-of-the-art pose accuracy with orders of magnitude less labeled data than prior methods, along with stronger generalization and robustness to adverse conditions and sparse frame sampling. The approach offers a blueprint for scalable geometric perception in autonomous systems and motivates further work on self-supervised, motion-centric representation learning.
