- The paper introduces a self-supervised latent action pretraining strategy that reduces reliance on expensive 3D annotations for camera pose estimation.
- It presents a two-stage architecture combining inverse-forward dynamics with a transformer-based pose head to deliver robust, motion-centric performance.
- Results demonstrate over 10% higher accuracy on Waymo and strong zero-shot generalization on PandaSet, affirming scalability under diverse conditions.
LA-Pose: Latent Action Pretraining for Feed-Forward Camera Pose Estimation
Motivation and Context
Scalable, accurate camera pose estimation for autonomous driving and embodied AI is constrained by the limited availability of high-quality annotated 3D datasets. Feed-forward 3D reconstruction approaches (e.g., DUSt3R, VGGT, Rig3R) have advanced the state of the art, but they depend on expensive sensor-derived labels, which restricts their generalization and scalability. Self-supervised learning paradigms, notably latent action pretraining, have yielded rich representations in robotics and video modeling but have not been systematically applied to geometric perception tasks. LA-Pose (2604.27448) addresses this gap by proposing an end-to-end self-supervised pretraining pipeline for camera pose estimation using latent actions.
Methodology
LA-Pose introduces a two-stage training architecture:
- Latent Action Pretraining: Building on Genie-style architectures, an inverse-forward dynamics model operates over sequences of driving video frames to learn latent actions representing compact, motion-centric inter-frame transformations. The model employs a Vision Transformer image tokenizer, causal temporal masking, and a compressed latent action bottleneck designed to enhance abstraction and minimize information leakage that is detrimental to downstream ego-motion estimation.
- Camera Pose Post-Training: Using a small subset of high-quality 3D annotated scenes (Waymo, nuScenes, Argoverse), a lightweight transformer-based pose estimation head is attached to the pretrained backbone. This head predicts relative camera translation, rotation (quaternion), field-of-view, and metric scale directly from latent actions, with the backbone either frozen or fine-tuned during this stage.
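Because the pose head predicts relative motion between frames, a full trajectory has to be assembled by chaining those predictions. The following is a minimal sketch of that chaining step, not the paper's implementation. It assumes quaternions in (w, x, y, z) order and that each relative pose expresses the next camera in the previous camera's frame; the function names are hypothetical.

```python
import math

def quat_normalize(q):
    """Return q scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)

def quat_multiply(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def quat_rotate(q, v):
    """Rotate 3-vector v by unit quaternion q via q * (0, v) * conj(q)."""
    w, x, y, z = q
    rotated = quat_multiply(quat_multiply(q, (0.0, *v)), (w, -x, -y, -z))
    return rotated[1:]

def chain_relative_poses(relative_poses):
    """Accumulate (quaternion, translation) pairs into camera positions.

    Assumed convention: each pair gives the pose of frame i+1 expressed
    in the coordinate frame of frame i. The first camera is the origin.
    """
    orientation = (1.0, 0.0, 0.0, 0.0)
    position = (0.0, 0.0, 0.0)
    trajectory = [position]
    for q_rel, t_rel in relative_poses:
        # Express the step in the first camera's frame, then advance.
        step = quat_rotate(orientation, t_rel)
        position = tuple(p + s for p, s in zip(position, step))
        # Renormalize to keep numerical drift from accumulating.
        orientation = quat_normalize(
            quat_multiply(orientation, quat_normalize(q_rel))
        )
        trajectory.append(position)
    return trajectory

# Usage: move 1 unit forward while turning 90 degrees about z, then 1 more.
half = math.radians(90.0) / 2.0
yaw_90 = (math.cos(half), 0.0, 0.0, math.sin(half))
identity = (1.0, 0.0, 0.0, 0.0)
path = chain_relative_poses([(yaw_90, (1.0, 0.0, 0.0)),
                             (identity, (1.0, 0.0, 0.0))])
```

In this usage example the second step is taken after the turn, so the trajectory bends from the x axis onto the y axis. Errors in any single relative prediction propagate to all later positions, which is one reason trajectory-level metrics are reported alongside per-pair accuracy.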
Figure 1: Two-stage LA-Pose framework—self-supervised latent action pretraining followed by supervised camera pose prediction leveraging motion-centric representations.
Experimental Evaluation
Quantitative Results
LA-Pose was evaluated on the Waymo Open Dataset (in-distribution) and PandaSet (zero-shot), reporting AUC@5, scale-invariant average trajectory error (ATE-S), and metric ATE (ATE-M). LA-Pose achieves over 10% higher pose accuracy than state-of-the-art feed-forward baselines while using substantially less labeled supervision.
Robustness studies indicate that LA-Pose maintains superior accuracy across varying frame sampling rates, demonstrating high temporal resilience in pose estimation.
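The difference between the two trajectory-error metrics named above can be illustrated with a short sketch. This is a simplified version under stated assumptions, not the paper's evaluation code: both trajectories are assumed to already share an origin and orientation, so no rigid alignment is performed (standard ATE evaluation usually adds one), and the scale-invariant variant fits a single least-squares scale factor. The function name and signature are hypothetical.

```python
import math

def ate_rmse(predicted, ground_truth, align_scale=False):
    """Root-mean-square position error between two 3D trajectories.

    align_scale=False approximates a metric error (ATE-M style): the
    prediction is compared in its own units.
    align_scale=True approximates a scale-invariant error (ATE-S style):
    the prediction is first multiplied by the least-squares scale that
    best matches the ground truth.
    """
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and equal length")

    scale = 1.0
    if align_scale:
        # Closed-form minimizer of sum ||s * p - g||^2 over s.
        numerator = sum(
            pc * gc
            for p, g in zip(predicted, ground_truth)
            for pc, gc in zip(p, g)
        )
        denominator = sum(pc * pc for p in predicted for pc in p)
        if denominator > 0.0:
            scale = numerator / denominator

    squared_error = sum(
        (scale * pc - gc) ** 2
        for p, g in zip(predicted, ground_truth)
        for pc, gc in zip(p, g)
    )
    return math.sqrt(squared_error / len(predicted))

# Usage: the predicted path has the right shape but half the true scale.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
truth = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
metric_error = ate_rmse(pred, truth)
scale_free_error = ate_rmse(pred, truth, align_scale=True)
```

Here the scale-aligned error vanishes while the metric error does not, which shows why reporting both matters: a model can recover trajectory shape well and still misjudge absolute scale.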
Qualitative Analysis
Qualitative comparisons show that LA-Pose produces geometrically coherent and stable trajectories under challenging conditions (night, rain, fog, sharp turns) that were not seen during post-training, underscoring the benefit of large-scale self-supervised motion pretraining.
Figure 3: Representative camera pose estimation results—LA-Pose (green) sustains stable trajectories compared to Rig3R and VGGT under adverse conditions.
Additional qualitative studies on sparse temporal sampling (1 fps) and diverse, uncalibrated OpenDV–YouTube in-the-wild videos further demonstrate LA-Pose's generalization capacity.
Figure 4: Low frame rate (1 fps) results—LA-Pose sustains stable, temporally consistent trajectories where VGGT exhibits drift.
Figure 5: Application to OpenDV–YouTube videos—LA-Pose generalizes to varying urban and non-urban conditions, producing stable camera pose predictions.
Ablations and Failure Modes
Ablations reveal that compressing the latent action (e.g., 50-D vs. 1536-D) enforces motion-centric abstraction and metric-scale consistency, improving downstream pose metrics at the cost of a higher pretraining reconstruction loss. Freezing the pretrained backbone during pose post-training yields the best generalization, whereas fine-tuning introduces domain-specific overfitting.
Figure 6: Impact of freezing vs. fine-tuning the inverse-dynamics backbone, showing superior zero-shot generalization to PandaSet with a frozen backbone.
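To make the compression ablation concrete, the sketch below shows the shape of a bottleneck that maps a 1536-D feature to a 50-D latent action. It is an illustration only: a fixed random linear projection stands in for the learned bottleneck, and the function names, the seed, and the choice of a purely linear map are assumptions, not details from the paper.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Build an out_dim x in_dim weight matrix with random Gaussian entries.

    The random weights are a stand-in for parameters that would be learned
    during pretraining; entries are scaled by 1/sqrt(in_dim) so outputs
    stay at a comparable magnitude to the inputs.
    """
    rng = random.Random(seed)
    std = in_dim ** -0.5
    return [[rng.gauss(0.0, std) for _ in range(in_dim)]
            for _ in range(out_dim)]

def compress(features, weights):
    """Project a feature vector through the bottleneck (matrix-vector product)."""
    if any(len(row) != len(features) for row in weights):
        raise ValueError("feature length must match projection input size")
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

# Usage: a 1536-D inter-frame feature is squeezed into a 50-D latent action.
wide_feature = [0.01 * (i % 7) for i in range(1536)]
bottleneck = make_projection(in_dim=1536, out_dim=50)
latent_action = compress(wide_feature, bottleneck)
```

The narrow output leaves the forward-dynamics model far fewer channels through which appearance details could pass, which is the intuition behind the trade-off described above: reconstruction gets harder, but what survives the bottleneck is pushed toward motion.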
Failure analyses identify edge cases such as reverse motion as limitations, attributed to how rarely such motion appears in the post-training datasets. Even in these cases, LA-Pose retains partial trajectory consistency because of its exposure to diverse motion patterns during large-scale unlabeled pretraining.
Figure 7: Failure case—degraded accuracy under reverse motion due to distribution gap in labeled post-training data.
Implications and Future Directions
LA-Pose demonstrates that self-supervised latent action pretraining can be repurposed for efficient, scalable, and generalizable feed-forward camera pose estimation. The results indicate that large-scale video pretraining can substitute for much of the expensive 3D supervision, lowering the data annotation barrier for internet-scale geometric perception. Practically, such frameworks could be deployed in domains with limited labeled data, including robotics, AR/VR, and embodied video analysis.
Theoretically, the latent action paradigm introduces a motion-centric bottleneck that supports abstraction and transfer, opening avenues for further research in unsupervised geometric learning, out-of-domain pose estimation, and robust 4D representation learning. Scaling up the diversity and scope of pretraining datasets will likely address edge-case limitations and foster broader applicability across varied camera configurations and environments.
Conclusion
LA-Pose establishes a unified framework for camera pose estimation based on self-supervised latent action learning and lightweight supervised post-training. The empirical studies indicate that latent action pretraining achieves state-of-the-art pose accuracy with orders of magnitude less labeled data than prior methods, along with stronger generalization and robustness to challenging temporal and spatial conditions. The approach offers a blueprint for scalable geometric perception in autonomous systems and motivates further exploration of self-supervised, motion-centric representation learning.