3D-LFM: Lifting Foundation Model (2312.11894v2)
Abstract: Lifting 3D structure and camera from 2D landmarks is a cornerstone of computer vision. Traditional methods have been confined to specific rigid objects, such as those in Perspective-n-Point (PnP) problems, but deep learning has expanded our capability to reconstruct a wide range of object classes (e.g. C3DPO and PAUL) with resilience to noise, occlusions, and perspective distortions. All these techniques, however, share a fundamental need to establish correspondences across the 3D training data, which restricts their utility to applications where one has an abundance of "in-correspondence" 3D data. Our approach harnesses the inherent permutation equivariance of transformers to manage a varying number of points per 3D data instance, withstand occlusions, and generalize to unseen categories. We demonstrate state-of-the-art performance across 2D-3D lifting task benchmarks. Since our approach can be trained across such a broad class of structures, we refer to it simply as a 3D Lifting Foundation Model (3D-LFM), the first of its kind.
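The permutation-equivariance property the abstract relies on can be illustrated with a small sketch. This is not the 3D-LFM architecture, which the abstract does not describe in detail. It is a single-head self-attention layer in NumPy with random weights, and the function name `self_attention`, the weight matrices, and the boolean `mask` convention are all assumptions made for illustration. It checks two things: reordering the input landmarks reorders the output rows in the same way, and padded or occluded landmarks that are masked out do not affect the remaining ones.

```python
import numpy as np

def self_attention(x, mask, wq, wk, wv):
    """Single-head self-attention over a set of 2D landmarks (illustrative only).

    x:    (n, d_in) landmark coordinates; padded rows are allowed
    mask: (n,) boolean, True where a landmark is actually present
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Absent landmarks receive zero attention weight when used as keys.
    scores = np.where(mask[None, :], scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d_in, d = 6, 2, 8
wq, wk, wv = (rng.standard_normal((d_in, d)) for _ in range(3))
x = rng.standard_normal((n, d_in))
# Four real landmarks followed by two padded slots.
mask = np.array([True, True, True, True, False, False])

out = self_attention(x, mask, wq, wk, wv)

# Equivariance: permuting the inputs permutes the outputs identically.
perm = rng.permutation(n)
out_perm = self_attention(x[perm], mask[perm], wq, wk, wv)
equivariant = np.allclose(out[perm], out_perm)

# Masking: changing the padded rows leaves the real landmarks' outputs unchanged.
x_alt = x.copy()
x_alt[~mask] = 100.0
out_alt = self_attention(x_alt, mask, wq, wk, wv)
padding_ignored = np.allclose(out[mask], out_alt[mask])

print(equivariant, padding_ignored)
```

Because no positional encoding is added, the layer treats its input as an unordered set, so the same weights apply whether an instance has 4 landmarks or 40. That is the sense in which no fixed point ordering or count has to be baked into the model.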