Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception (2306.06362v2)
Abstract: We introduce the Aria Digital Twin (ADT) - an egocentric dataset captured using Aria glasses with extensive object, environment, and human level ground truth. This ADT release contains 200 sequences of real-world activities conducted by Aria wearers in two real indoor scenes with 398 object instances (324 stationary and 74 dynamic). Each sequence consists of: a) raw data of two monochrome camera streams, one RGB camera stream, two IMU streams; b) complete sensor calibration; c) ground truth data including continuous 6-degree-of-freedom (6DoF) poses of the Aria devices, object 6DoF poses, 3D eye gaze vectors, 3D human poses, 2D image segmentations, image depth maps; and d) photo-realistic synthetic renderings. To the best of our knowledge, there is no existing egocentric dataset with a level of accuracy, photo-realism and comprehensiveness comparable to ADT. By contributing ADT to the research community, our mission is to set a new standard for evaluation in the egocentric machine perception domain, which includes very challenging research problems such as 3D object detection and tracking, scene reconstruction and understanding, sim-to-real learning, human pose prediction - while also inspiring new machine perception tasks for augmented reality (AR) applications. To kick start exploration of the ADT research use cases, we evaluated several existing state-of-the-art methods for object detection, segmentation and image translation tasks that demonstrate the usefulness of ADT as a benchmarking dataset.
- Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7822--7831, 2021.
- Unrealego: A new dataset for robust egocentric 3d human motion capture. In European Conference on Computer Vision (ECCV), 2022.
- Accuracy map of an optical motion capture system with 42 or 21 cameras in a large measurement volume. Journal of biomechanics, 58:237--240, 2017.
- Learning 6d object pose estimation using 3d object coordinates. In Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part II 13, pages 536--551. Springer, 2014.
- Omni3D: A large benchmark and model for 3D object detection in the wild. arXiv:2207.10660, 2022.
- Matterport3d: Learning from rgb-d data in indoor environments. International Conference on 3D Vision (3DV), 2017.
- Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
- Scaling egocentric vision: The epic-kitchens dataset. In European Conference on Computer Vision (ECCV), 2018.
- Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision (IJCV), 130:33–55, 2022.
- Visual object tracking in first person vision. International Journal of Computer Vision (IJCV), 2022.
- First-person hand action benchmark with rgb-d videos and 3d hand pose annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 409--419, 2018.
- A framework for evaluating 6-dof object trackers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 582--597, 2018.
- Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995--19012, 2022.
- Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356--5364, 2019.
- Honnotate: A method for 3d annotation of hand and object poses. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3196--3206, 2020.
- Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Computer Vision--ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part I 11, pages 548--562. Springer, 2013.
- T-less: An rgb-d dataset for 6d pose estimation of texture-less objects. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 880--888. IEEE, 2017.
- BOP: Benchmark for 6D object pose estimation. European Conference on Computer Vision (ECCV), 2018.
- Image-to-image translation with conditional adversarial networks. CVPR, 2017.
- TSIT: A simple and versatile framework for image-to-image translation. In ECCV, 2020.
- H2o: Two hands manipulating objects for first person interaction recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10138--10148, October 2021.
- In the eye of the beholder: Gaze and actions in first person video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
- Exploring plain vision transformer backbones for object detection. In ECCV, 2022.
- Openrooms: An end-to-end open framework for photorealistic indoor scene datasets. arXiv preprint arXiv:2007.12868, 2020.
- Feature pyramid networks for object detection. In ECCV, 2017.
- Microsoft coco: Common objects in context. In Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740--755. Springer, 2014.
- Hoi4d: A 4d egocentric dataset for category-level human-object interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21013--21022, 2022.
- Aria pilot dataset. https://about.facebook.com/realitylabs/projectaria/datasets, 2022.
- Total3dunderstanding: Joint layout, object pose and mesh reconstruction for indoor scenes from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
- Egocap: egocentric marker-less motion capture with two fisheye cameras. ACM Transactions on Graphics (TOG), 35(6):1--11, 2016.
- Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In ICCV, 2021.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684--10695, 2022.
- Generalized-icp. In Robotics: science and systems, volume 2, page 435. Seattle, WA, 2009.
- Actor and observer: Joint modeling of first and third-person videos. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7396--7404, 2018.
- Sun rgb-d: A rgb-d scene understanding benchmark suite. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 567--576, 2015.
- The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019.
- Egotracks: A long-term egocentric visual object tracking dataset. arXiv preprint arXiv:2301.03213, 2023.
- The double sphere camera model. In 2018 International Conference on 3D Vision (3DV), pages 552--560. IEEE, 2018.
- Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600--612, 2004.
- Camera calibration with distortion models and accuracy evaluation. IEEE Transactions on pattern analysis and machine intelligence, 14(10):965--980, 1992.
- Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017.
- Mo 2 cap 2: Real-time mobile 3d motion capture with a cap-mounted fisheye camera. IEEE transactions on visualization and computer graphics, 25(5):2093--2101, 2019.
- The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586--595, 2018.
- Egoglass: Egocentric-view human pose estimation from an eyeglass frame. In 2021 International Conference on 3D Vision (3DV), pages 32--41. IEEE, 2021.