
Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception (2306.06362v2)

Published 10 Jun 2023 in cs.CV, cs.AI, and cs.LG

Abstract: We introduce the Aria Digital Twin (ADT), an egocentric dataset captured using Aria glasses with extensive object-, environment-, and human-level ground truth. This ADT release contains 200 sequences of real-world activities conducted by Aria wearers in two real indoor scenes with 398 object instances (324 stationary and 74 dynamic). Each sequence consists of: a) raw data from two monochrome camera streams, one RGB camera stream, and two IMU streams; b) complete sensor calibration; c) ground truth data including continuous 6-degree-of-freedom (6DoF) poses of the Aria devices, object 6DoF poses, 3D eye gaze vectors, 3D human poses, 2D image segmentations, and image depth maps; and d) photo-realistic synthetic renderings. To the best of our knowledge, no existing egocentric dataset matches ADT's level of accuracy, photo-realism, and comprehensiveness. By contributing ADT to the research community, our mission is to set a new standard for evaluation in the egocentric machine perception domain, which includes very challenging research problems such as 3D object detection and tracking, scene reconstruction and understanding, sim-to-real learning, and human pose prediction, while also inspiring new machine perception tasks for augmented reality (AR) applications. To kick-start exploration of ADT research use cases, we evaluated several existing state-of-the-art methods on object detection, segmentation, and image translation tasks, demonstrating the usefulness of ADT as a benchmarking dataset.
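The ground truth described above centers on continuous 6DoF poses for devices, objects, and humans. As a minimal illustrative sketch, and not the ADT release format or API, a 6DoF pose can be represented as a unit quaternion plus a translation vector, with a function that maps a point from the pose's local frame into the world frame:

```python
import math
from dataclasses import dataclass

@dataclass
class Pose6DoF:
    """A 6-degree-of-freedom pose: unit quaternion (w, x, y, z) plus translation (tx, ty, tz)."""
    q: tuple
    t: tuple

def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def rotate(q, v):
    """Rotate vector v by unit quaternion q via q * (0, v) * conj(q)."""
    qv = (0.0,) + tuple(v)
    conj = (q[0], -q[1], -q[2], -q[3])
    _, x, y, z = quat_mul(quat_mul(q, qv), conj)
    return (x, y, z)

def apply_pose(pose, point):
    """Transform a local-frame point into the world frame: R(q) * p + t."""
    r = rotate(pose.q, point)
    return tuple(r[i] + pose.t[i] for i in range(3))

# Example: a 90-degree rotation about z, then a translation of (1, 2, 3).
pose = Pose6DoF(
    q=(math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)),
    t=(1.0, 2.0, 3.0),
)
world = apply_pose(pose, (1.0, 0.0, 0.0))  # (1, 0, 0) rotates to (0, 1, 0), then translates
```

This quaternion-plus-translation parameterization is one common convention for the 6DoF poses a dataset like ADT provides per frame; the class and function names here are hypothetical, chosen for illustration only.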
