3D-LFM: Lifting Foundation Model (2312.11894v2)

Published 19 Dec 2023 in cs.CV, cs.AI, and cs.LG

Abstract: The lifting of 3D structure and camera from 2D landmarks is at the cornerstone of the entire discipline of computer vision. Traditional methods have been confined to specific rigid objects, such as those in Perspective-n-Point (PnP) problems, but deep learning has expanded our capability to reconstruct a wide range of object classes (e.g. C3DPO and PAUL) with resilience to noise, occlusions, and perspective distortions. All these techniques, however, have been limited by the fundamental need to establish correspondences across the 3D training data -- significantly limiting their utility to applications where one has an abundance of "in-correspondence" 3D data. Our approach harnesses the inherent permutation equivariance of transformers to manage varying number of points per 3D data instance, withstands occlusions, and generalizes to unseen categories. We demonstrate state of the art performance across 2D-3D lifting task benchmarks. Since our approach can be trained across such a broad class of structures we refer to it simply as a 3D Lifting Foundation Model (3D-LFM) -- the first of its kind.

References (32)
  1. OpenMonkeyStudio: Automated markerless pose estimation in freely moving macaques. bioRxiv, pages 2020–01, 2020.
  2. Recovering non-rigid 3D shape from image streams. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 690–696. IEEE, 2000.
  3. Joint-wise 2D to 3D lifting for hand pose estimation from a single RGB image. Applied Intelligence, 53(6):6421–6431, 2023.
  4. High fidelity 3D reconstructions with limited physical views. In 2021 International Conference on 3D Vision (3DV), pages 1301–1311. IEEE, 2021.
  5. MBW: Multi-view bootstrapping in the wild. Advances in Neural Information Processing Systems, 35:3039–3051, 2022.
  6. 3D hand shape and pose estimation from a single RGB image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10833–10842, 2019.
  7. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2013.
  8. Unsupervised 3D pose estimation with non-rigid structure-from-motion modeling. arXiv preprint arXiv:2308.10705, 2023.
  9. Panoptic Studio: A massively multiview system for social motion capture. In Proceedings of the IEEE International Conference on Computer Vision, pages 3334–3342, 2015.
  10. AcinoSet: A 3D pose estimation dataset and baseline models for cheetahs in the wild. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13901–13908. IEEE, 2021.
  11. Deep non-rigid structure from motion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1558–1567, 2019.
  12. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81:155–166, 2009.
  13. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, pages 740–755. Springer, 2014.
  14. Jointformer: Single-frame lifting transformer with error prediction and refinement for 3D human pose estimation. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 1156–1163. IEEE, 2022.
  15. AMASS: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5442–5451, 2019.
  16. A simple yet effective baseline for 3D human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2640–2649, 2017.
  17. InterHand2.6M: A dataset and baseline for 3D interacting hand pose estimation from a single RGB image. In Computer Vision – ECCV 2020, pages 548–564. Springer, 2020.
  18. C3DPO: Canonical 3D pose networks for non-rigid structure from motion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7688–7697, 2019.
  19. MotionCLIP: Exposing human motion generation to CLIP space. In European Conference on Computer Vision, pages 358–374. Springer, 2022.
  20. Human motion diffusion model. arXiv preprint arXiv:2209.14916, 2022.
  21. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  22. Graph attention networks. In International Conference on Learning Representations, 2018.
  23. CanonPose: Self-supervised monocular 3D human pose estimation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13294–13304, 2021.
  24. PAUL: Procrustean autoencoder for unsupervised lifting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 434–443, 2021.
  25. Deep NRSfM++: Towards unsupervised 2D-3D lifting in the wild. In 2020 International Conference on 3D Vision (3DV), pages 12–22. IEEE, 2020.
  26. Beyond PASCAL: A benchmark for 3D object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision, pages 75–82. IEEE, 2014.
  27. Animal3D: A comprehensive dataset of 3D animal pose and shape. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9099–9109, 2023.
  28. MHR-Net: Multiple-hypothesis reconstruction of non-rigid shapes from 2D views. In European Conference on Computer Vision, pages 1–17. Springer, 2022.
  29. BP4D-Spontaneous: A high-resolution spontaneous 3D dynamic facial expression database. Image and Vision Computing, 32(10):692–706, 2014.
  30. Robust point cloud processing through positional embedding. arXiv preprint arXiv:2309.00339, 2023.
  31. MotionBERT: Unified pretraining for human motion analysis. arXiv preprint arXiv:2210.06551, 2022.
  32. H3WB: Human3.6M 3D wholebody dataset and benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20166–20177, 2023.

Summary

  • The paper introduces an object-agnostic transformer that lifts 2D landmarks into 3D structures without requiring object-specific data.
  • It employs Procrustean transformations, tokenized positional encoding, and a hybrid attention mechanism to enhance scalability and reduce complexity.
  • Experimental results show large MPJPE reductions (e.g., 3.27 versus C3DPO's 41.08 on combined categories) and robust generalization to unseen object categories, outperforming state-of-the-art methods.

Overview of "3D-LFM: Lifting Foundation Model"

The paper presents the 3D Lifting Foundation Model (3D-LFM), a transformer-based approach to lifting 2D landmarks into 3D structure. The central aim is a single model that moves past the limitations of both traditional methods and recent deep learning techniques: it reconstructs a wide range of objects, including human bodies, animals, and inanimate objects, without needing explicit object-specific correspondence data during training.

Introduction and Problem Statement

The problem of lifting 2D landmarks from single-view RGB images into 3D structures poses significant challenges in computer vision due to its ill-posed nature. Traditional methods such as Perspective-n-Point (PnP) and recent deep learning approaches like C3DPO and PAUL require precise correspondences between 2D and 3D data and often lack scalability and generalizability across diverse object categories. These constraints hinder their application in scenarios with limited or no in-correspondence 3D data.
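
To make the ill-posedness concrete, here is a toy numpy sketch (not from the paper) of the depth ambiguity under an orthographic camera: the projection discards depth, so arbitrarily many 3D shapes explain the same 2D landmarks.

```python
import numpy as np

# Orthographic projection keeps (x, y) and discards z, so any per-point
# change in depth leaves the 2D observation unchanged.
S = np.random.randn(8, 3)            # hypothetical 3D structure: 8 landmarks
Pi = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])     # orthographic camera (2 x 3)
W = S @ Pi.T                         # observed 2D landmarks (8 x 2)

S_alt = S.copy()
S_alt[:, 2] += np.random.randn(8)    # arbitrary depth perturbation
assert np.allclose(W, S_alt @ Pi.T)  # identical 2D observations
```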

Contributions of 3D-LFM

The 3D-LFM model addresses key limitations by introducing an object-agnostic approach for 2D-3D lifting. It utilizes permutation equivariance inherent in transformers, enabling the model to autonomously establish correspondences among 2D keypoints. This method supports the reconstruction of over 30 object categories using a single model and demonstrates robust generalization to unseen categories and configurations.
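
A minimal PyTorch sketch of this property (hypothetical dimensions; not the 3D-LFM architecture): a transformer encoder used without positional encodings is permutation equivariant, so reordering the input keypoint tokens simply reorders the outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Self-attention plus per-token feed-forward layers contain no notion of
# token order, so the encoder commutes with permutations of its inputs.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
    num_layers=2,
).eval()  # eval() disables dropout so both passes are deterministic

tokens = torch.randn(1, 10, 32)  # 10 keypoint tokens of width 32
perm = torch.randperm(10)

with torch.no_grad():
    out = encoder(tokens)
    out_perm = encoder(tokens[:, perm])

assert torch.allclose(out[:, perm], out_perm, atol=1e-5)
```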

Core Innovations:

  1. Procrustean Transformations: The model integrates a Procrustean step based on Orthographic-N-Point (OnP), so the network only has to model the deformable part of each shape within a canonical frame, reducing computational complexity (a generic alignment sketch follows this list).
  2. Tokenized Positional Encoding (TPE): Fourier-based per-token codes replace fixed or learned positional encodings, improving the model's scalability and its capacity to handle imbalanced datasets (see the second sketch below).
  3. Hybrid Attention Mechanism: Combining graph-based local attention with global self-attention lets the model capture both local and global contextual information, which is crucial for accurate 2D-3D lifting across varied categories.
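
The first sketch below illustrates the rigid-alignment idea behind the Procrustean step. It is a hedged stand-in, not the paper's OnP solver: the classic SVD-based Kabsch algorithm removes the rigid pose, leaving only the deformation to be modeled in a canonical frame.

```python
import numpy as np

def kabsch_align(S_pred: np.ndarray, S_gt: np.ndarray) -> np.ndarray:
    """Rigidly align S_pred (N x 3) to S_gt with an SVD-derived rotation."""
    A = S_pred - S_pred.mean(axis=0)       # center both point sets
    B = S_gt - S_gt.mean(axis=0)
    U, _, Vt = np.linalg.svd(A.T @ B)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # no reflections
    R = U @ D @ Vt                         # optimal rotation
    return A @ R + S_gt.mean(axis=0)       # aligned prediction
```

The second sketch shows the flavor of a Fourier-based positional code. It is an analogy to TPE rather than the paper's exact construction; the point is that codes are generated analytically for however many keypoints arrive, instead of being looked up from a table learned for one fixed rig.

```python
import numpy as np

def fourier_code(num_points: int, d_model: int) -> np.ndarray:
    """Hypothetical Fourier-feature code for a variable-length token grid."""
    t = np.linspace(0.0, 1.0, num_points)[:, None]   # normalized token index
    freqs = 2.0 ** np.arange(d_model // 2)[None, :]  # octave frequencies
    angles = 2.0 * np.pi * t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

pe_17 = fourier_code(17, 64)  # codes for a 17-joint rig
pe_15 = fourier_code(15, 64)  # the same function serves a 15-joint rig
```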

Experimental Results

3D-LFM is evaluated extensively against state-of-the-art methods on 2D-3D lifting benchmarks spanning human body, face, and hand datasets, among others.

Key Findings and Performance Metrics:

  • Multi-Object 3D Reconstruction: Benchmarked against C3DPO, 3D-LFM achieves a lower mean per-joint position error (MPJPE), with the gap widest when object-specific information is withheld (MPJPE of 3.27 on combined categories versus C3DPO's 41.08).
  • Object-Specific Models: On the H3WB benchmark, the model outperforms specialized methods, achieving an overall MPJPE of 33.13 mm with Procrustes alignment, substantially better than the alternatives (the metric is sketched after this list).
  • OOD Generalization and Rig Transfer: The model generalizes robustly to unseen object categories and configurations while maintaining high-fidelity 3D reconstruction; for instance, it transfers from a 17-joint to a 15-joint human body rig.
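
For reference, a generic numpy sketch of how MPJPE and its Procrustes-aligned variant are typically computed (this mirrors common practice, not the paper's evaluation code):

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint position error: average Euclidean distance (N x 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """MPJPE after similarity Procrustes alignment (rotation, scale, shift)."""
    A = pred - pred.mean(axis=0)
    B = gt - gt.mean(axis=0)
    U, sigma, Vt = np.linalg.svd(A.T @ B)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = U @ D @ Vt                                      # optimal rotation
    s = np.trace(np.diag(sigma) @ D) / (A ** 2).sum()   # optimal scale
    return mpjpe(s * A @ R, B)
```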

Implications and Future Directions

This research matters for both theory and practice. By decoupling the 2D-3D lifting task from the need for object-specific correspondence data, the model improves scalability and adaptability, opening new avenues for applications in augmented reality, robotics, and beyond.

Future Developments:

  • Enhanced Depth Perception: Incorporating appearance cues to resolve depth ambiguities in single-frame reconstructions.
  • Broader Dataset Inclusion: Expanding the range of object categories and configurations to further refine the model's generalization capabilities.
  • Integrative Frameworks: Exploring hybrid models that combine 3D-LFM's strengths with other advanced techniques, like DINOv2 features, to enhance overall performance and robustness in diverse environmental conditions.

Conclusion

3D-LFM sets a new benchmark in 2D-3D lifting by providing a unified, scalable solution that handles a broad spectrum of object categories with high accuracy and generalizability. Its combination of permutation equivariance, Procrustean transformations, and hybrid attention positions it as a foundational model in computer vision, paving the way for more adaptable 3D reconstruction systems.
