LA-Pose: Latent Action Pretraining Meets Pose Estimation

Published 30 Apr 2026 in cs.CV | (2604.27448v1)

Abstract: This paper revisits camera pose estimation through the lens of self-supervised pretraining, focusing on inverse-dynamics pretraining as a scalable alternative to the current trend of fully supervised training with 3D annotations. Concretely, we employ inverse- and forward-dynamics models to learn latent action representations, similar to Genie from large-scale driving videos. Our idea is simple yet effective. Existing methods use latent actions in their original capacity, that is, as action conditioning of world-models or as proxies of robot action parameters in policy networks. Our method, dubbed LA-Pose, repurposes the latent action features as inputs to a camera pose estimator, finetuned on a limited set of high-quality 3D annotations. This formulation enables accurate and generalizable pose prediction while maintaining feed-forward efficiency. Extensive experiments on driving benchmarks show that LA-Pose achieves competitive and even superior performance to state-of-the-art methods while using orders of magnitude less labeled data. Concretely, on the Waymo and PandaSet benchmarks, LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods. To our knowledge, this work is the first to demonstrate the power of inverse-dynamics self-supervised learning for pose estimation.

Summary

  • The paper introduces a self-supervised latent action pretraining strategy that reduces reliance on expensive 3D annotations for camera pose estimation.
  • It presents a two-stage architecture combining inverse-forward dynamics with a transformer-based pose head to deliver robust, motion-centric performance.
  • Results show over 10% higher pose accuracy than recent feed-forward methods on Waymo and strong zero-shot generalization to PandaSet, while using far less labeled data.

LA-Pose: Latent Action Pretraining for Feed-Forward Camera Pose Estimation

Motivation and Context

Scalable, accurate camera pose estimation for autonomous driving and embodied AI is held back by the limited availability of high-quality annotated 3D datasets. Feed-forward 3D reconstruction approaches (e.g., DUSt3R, VGGT, Rig3R) have advanced the state of the art, but they depend on expensive sensor-derived labels, which restricts their generalization and scalability. Self-supervised learning paradigms, notably latent action pretraining, have yielded rich representations in robotics and video modeling but have not been systematically applied to geometric perception tasks. LA-Pose (2604.27448) addresses this gap with a two-stage pipeline: self-supervised latent action pretraining on unlabeled driving video, followed by supervised camera pose post-training on a small labeled set.

Methodology

LA-Pose introduces a two-stage training architecture:

  1. Latent Action Pretraining: Building on Genie-style architectures, an inverse-forward dynamics model operates over sequences of driving video frames to learn latent actions representing compact, motion-centric inter-frame transformations. The model employs a Vision Transformer image tokenizer, causal temporal masking, and a compressed latent action bottleneck designed to enhance abstraction and minimize information leakage that is detrimental to downstream ego-motion estimation.
  2. Camera Pose Post-Training: A lightweight transformer-based pose estimation head is attached to the pretrained backbone and trained on a small subset of high-quality 3D annotated scenes (Waymo, nuScenes, Argoverse). The head predicts relative camera translation, rotation (as a quaternion), field of view, and metric scale directly from the latent actions. The backbone is either frozen or fine-tuned during this stage.

    Figure 1: Two-stage LA-Pose framework—self-supervised latent action pretraining followed by supervised camera pose prediction leveraging motion-centric representations.
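
To make the pose head's outputs concrete, the sketch below chains per-frame relative predictions into a camera trajectory. It is an illustration under stated assumptions, not the paper's implementation: the quaternion order (w, x, y, z), the convention that each prediction expresses camera t+1 in the frame of camera t, and the choice to multiply a translation direction by the predicted metric scale are all assumptions, and the function names are hypothetical. Field of view is not used here.

```python
import math

def quat_to_matrix(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.
    The quaternion is normalized first, since a regressed quaternion
    is generally not exactly unit length."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    if n == 0.0:
        raise ValueError("zero-norm quaternion")
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def matmul3(a, b):
    """3x3 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec3(a, v):
    """3x3 matrix times 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def accumulate_trajectory(relative_poses):
    """Chain relative poses into camera positions in the first camera's frame.

    relative_poses: list of (quat_wxyz, translation, metric_scale) tuples,
    each giving camera t+1 relative to camera t (assumed convention).
    Returns a list of [x, y, z] positions, starting at the origin.
    """
    rotation = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    position = [0.0, 0.0, 0.0]
    positions = [position[:]]
    for quat, translation, scale in relative_poses:
        # Assumption: metric translation = scale * predicted translation.
        step = matvec3(rotation, [scale * c for c in translation])
        position = [position[i] + step[i] for i in range(3)]
        rotation = matmul3(rotation, quat_to_matrix(quat))
        positions.append(position[:])
    return positions
```

As a sanity check, four identical steps that each move one unit forward and then turn 90 degrees about the vertical axis should trace a closed square and return to the origin.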

Experimental Evaluation

Quantitative Results

LA-Pose was evaluated on the Waymo Open Dataset (in-distribution) and PandaSet (zero-shot), reporting AUC@5, scale-invariant average trajectory error (ATE-S), and metric ATE (ATE-M). LA-Pose achieves over 10% higher pose accuracy than state-of-the-art feed-forward baselines while using substantially less labeled supervision:

  • Waymo: LA-Pose yields 91.4% AUC@5 and 1.20×10⁻² ATE-S, outperforming Rig3R (77.9%/3.17×10⁻²) and VGGT (74.8%/1.43×10⁻²).
  • PandaSet (zero-shot): LA-Pose achieves 86.3% AUC@5 and 1.13×10⁻² ATE-S, again surpassing VGGT and MapAnything.

    Figure 2: Distribution of pose estimation AUC@5 scores for LA-Pose versus VGGT on the Waymo Open Dataset, highlighting improved average accuracy and lower variance.
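
For readers unfamiliar with AUC@5, the sketch below shows one common way such a score is computed: the area under the recall-versus-error-threshold curve for angular pose errors up to 5 degrees, normalized so that a perfect result scores 1.0. The summary does not state the paper's exact evaluation protocol, so this is an assumed, generic formulation; the function name and the trapezoidal integration are illustrative choices.

```python
from bisect import bisect_left

def pose_auc(errors_deg, threshold_deg=5.0):
    """Normalized area under the recall curve up to threshold_deg.

    errors_deg: one non-negative angular error (in degrees) per frame pair.
    A frame pair counts as recalled at threshold t if its error is below t.
    Returns a value in [0, 1]; multiply by 100 for a percentage.
    """
    if not errors_deg:
        raise ValueError("need at least one error value")
    if threshold_deg <= 0:
        raise ValueError("threshold must be positive")
    errs = sorted(errors_deg)
    n = len(errs)
    # Recall curve as (error, recall) points, starting from (0, 0).
    xs = [0.0] + errs
    ys = [0.0] + [(i + 1) / n for i in range(n)]
    # Keep only points below the threshold, then close the curve at it.
    last = bisect_left(xs, threshold_deg)
    xs = xs[:last] + [threshold_deg]
    ys = ys[:last] + [ys[last - 1]]
    # Trapezoidal integration of recall over the error axis.
    area = sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2.0
               for i in range(len(xs) - 1))
    return area / threshold_deg
```

Under this definition, a method whose errors are all far below 5 degrees approaches 1.0, and errors at or above the threshold contribute nothing, which is why AUC@5 rewards consistently small errors rather than a good average.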

Robustness studies indicate that LA-Pose maintains superior accuracy across varying frame sampling rates, demonstrating high temporal resilience in pose estimation.
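
The difference between the two trajectory metrics reported above, ATE-S and ATE-M, can be illustrated with a simplified sketch. It assumes both trajectories are already expressed in the same frame and start at the origin, uses a root-mean-square position error, and aligns only a single global scale factor; a full benchmark would typically also align rotation and translation, and the summary does not specify the paper's exact procedure. All function names are hypothetical.

```python
import math

def ate_metric(pred, gt):
    """RMSE of position error with no alignment (metric-scale error)."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and equal length")
    sq = [sum((p - g) ** 2 for p, g in zip(P, G)) for P, G in zip(pred, gt)]
    return math.sqrt(sum(sq) / len(sq))

def best_scale(pred, gt):
    """Least-squares scale s minimizing sum ||s * pred_i - gt_i||^2."""
    num = sum(p * g for P, G in zip(pred, gt) for p, g in zip(P, G))
    den = sum(p * p for P in pred for p in P)
    if den == 0.0:
        raise ValueError("predicted trajectory is degenerate (all zeros)")
    return num / den

def ate_scale_aligned(pred, gt):
    """RMSE of position error after rescaling the prediction (scale-invariant)."""
    s = best_scale(pred, gt)
    return ate_metric([[s * c for c in P] for P in pred], gt)
```

A prediction with the right shape but the wrong overall scale scores well on the scale-aligned error and poorly on the metric error, so reporting both separates trajectory shape from metric-scale accuracy.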

Qualitative Analysis

Qualitative comparisons show that LA-Pose produces geometrically coherent and stable trajectories under challenging conditions (night, rain, fog, sharp turns) not seen in post-training, underscoring the benefit of large-scale self-supervised motion pretraining.

Figure 3: Representative camera pose estimation results—LA-Pose (green) sustains stable trajectories compared to Rig3R and VGGT under adverse conditions.

Additional qualitative studies on sparse temporal sampling (1 fps) and diverse, uncalibrated OpenDV–YouTube in-the-wild videos further demonstrate LA-Pose's generalization capacity.

Figure 4: Low frame rate (1 fps) results—LA-Pose sustains stable, temporally consistent trajectories where VGGT exhibits drift.

Figure 5: Application to OpenDV–YouTube videos—LA-Pose generalizes to varying urban and non-urban conditions, producing stable camera pose predictions.

Ablations and Failure Modes

Ablations show that compressing the latent action (e.g., to 50 dimensions rather than 1536) enforces motion-centric abstraction and metric-scale consistency, improving downstream pose metrics at the cost of a higher pretraining reconstruction loss. Freezing the pretrained backbone during pose post-training gives the best generalization, whereas fine-tuning it introduces domain-specific overfitting.

Figure 6: Impact of freezing vs. fine-tuning the inverse-dynamics backbone, showing superior zero-shot generalization to PandaSet with a frozen backbone.
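
The bottleneck trade-off described above can be written down with a generic inverse-forward dynamics objective. The notation below is introduced here for illustration and is not taken from the paper: $x_t$ is the frame at time $t$, $f_{\mathrm{inv}}$ and $f_{\mathrm{fwd}}$ are the inverse- and forward-dynamics networks, and $d$ is the latent action dimension.

```latex
\begin{aligned}
z_t &= f_{\mathrm{inv}}(x_{\le t},\, x_{t+1}) \in \mathbb{R}^{d}
  && \text{(latent action inferred from the frame transition)} \\
\hat{x}_{t+1} &= f_{\mathrm{fwd}}(x_{\le t},\, z_t)
  && \text{(next frame predicted from the past frames and the latent action)} \\
\mathcal{L}_{\mathrm{rec}} &= \sum_{t} \bigl\lVert \hat{x}_{t+1} - x_{t+1} \bigr\rVert_2^2
  && \text{(reconstruction loss used for pretraining)}
\end{aligned}
```

Under this formulation, a small $d$ (such as 50) leaves $z_t$ too little capacity to copy appearance details of $x_{t+1}$, so it must encode what the past frames cannot explain, which in driving video is dominated by ego-motion. A large $d$ (such as 1536) lowers $\mathcal{L}_{\mathrm{rec}}$ but lets appearance information leak into $z_t$, which is consistent with the reported drop in downstream pose accuracy.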

Failure analyses identify edge cases such as reverse motion as limitations, which the authors attribute to the rarity of such motion in the post-training datasets. Even in these cases, LA-Pose retains partial trajectory consistency thanks to its exposure to diverse motion patterns during large-scale unlabeled pretraining.

Figure 7: Failure case—degraded accuracy under reverse motion due to distribution gap in labeled post-training data.

Implications and Future Directions

LA-Pose demonstrates that self-supervised latent action pretraining can be repurposed for efficient, scalable, and generalizable feed-forward camera pose estimation. The results indicate that large-scale video pretraining can replace most of the expensive 3D supervision, lowering the annotation barrier for internet-scale geometric perception. In practice, such frameworks could be applied in domains with limited labeled data, including robotics, AR/VR, and embodied video analysis.

Conceptually, the latent action paradigm introduces a motion-centric bottleneck that supports abstraction and transfer, opening avenues for further research in unsupervised geometric learning, out-of-domain pose estimation, and robust 4D representation learning. Scaling up the diversity and scope of pretraining datasets may help address the edge-case limitations and broaden applicability across camera configurations and environments.

Conclusion

LA-Pose establishes a framework for camera pose estimation based on self-supervised latent action learning followed by lightweight supervised post-training. The empirical studies show that latent action pretraining achieves state-of-the-art pose accuracy with orders of magnitude less labeled data than prior methods, with stronger generalization and robustness under challenging temporal and spatial conditions. The approach offers a blueprint for scalable geometric perception in autonomous systems and motivates further work on self-supervised, motion-centric representation learning.
