
MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures (2312.02963v1)

Published 5 Dec 2023 in cs.CV

Abstract: In this era, the success of LLMs and text-to-image models can be attributed to the driving force of large-scale datasets. However, in the realm of 3D vision, while remarkable progress has been made with models trained on large-scale synthetic and real-captured object data like Objaverse and MVImgNet, a similar level of progress has not been observed in the domain of human-centric tasks partially due to the lack of a large-scale human dataset. Existing datasets of high-fidelity 3D human capture continue to be mid-sized due to the significant challenges in acquiring large-scale high-quality 3D human data. To bridge this gap, we present MVHumanNet, a dataset that comprises multi-view human action sequences of 4,500 human identities. The primary focus of our work is on collecting human data that features a large number of diverse identities and everyday clothing using a multi-view human capture system, which facilitates easily scalable data collection. Our dataset contains 9,000 daily outfits, 60,000 motion sequences and 645 million frames with extensive annotations, including human masks, camera parameters, 2D and 3D keypoints, SMPL/SMPLX parameters, and corresponding textual descriptions. To explore the potential of MVHumanNet in various 2D and 3D visual tasks, we conducted pilot studies on view-consistent action recognition, human NeRF reconstruction, text-driven view-unconstrained human image generation, as well as 2D view-unconstrained human image and 3D avatar generation. Extensive experiments demonstrate the performance improvements and effective applications enabled by the scale provided by MVHumanNet. As the current largest-scale 3D human dataset, we hope that the release of MVHumanNet data with annotations will foster further innovations in the domain of 3D human-centric tasks at scale.

Authors (12)
  1. Zhangyang Xiong
  2. Chenghong Li
  3. Kenkun Liu
  4. Hongjie Liao
  5. Jianqiao Hu
  6. Junyi Zhu
  7. Shuliang Ning
  8. Lingteng Qiu
  9. Chongjie Wang
  10. Shijie Wang
  11. Shuguang Cui
  12. Xiaoguang Han
Citations (5)

Summary

In computer vision and AI, models capable of understanding and generating 3D human figures are advancing rapidly. Large, diverse datasets have proven instrumental in improving model performance across domains, from language processing to image synthesis. A recent addition to this field is MVHumanNet, currently the largest dataset of multi-view 3D human captures in everyday clothing.

MVHumanNet operates at considerable scale: 4,500 individual human subjects wearing 9,000 distinct outfits, captured in 60,000 motion sequences totaling 645 million frames. What sets the dataset apart is not just its volume but the depth of its annotations, which are essential for nuanced understanding and generation of human figures: per-frame human masks, camera calibration parameters, 2D and 3D keypoints, SMPL/SMPLX body model parameters, and corresponding textual descriptions.
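The reported totals can be sanity-checked with simple arithmetic; the averages below follow directly from the figures stated in the abstract (the variable names are illustrative):

```python
# Back-of-the-envelope scale check using the totals reported for MVHumanNet.
identities = 4_500
outfits = 9_000
sequences = 60_000
frames = 645_000_000

outfits_per_identity = outfits / identities        # 2 outfits per subject
sequences_per_identity = sequences / identities    # ~13.3 sequences per subject
frames_per_sequence = frames / sequences           # 10,750 frames per sequence

print(outfits_per_identity, sequences_per_identity, frames_per_sequence)
```

Spread across a multi-camera rig, the roughly 10,750 frames per sequence correspond to a few hundred frames per camera view.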

The creation of MVHumanNet involved a multi-view capture system with up to 48 high-resolution cameras, which allowed for efficient data gathering while covering a broad spectrum of human ages, body types, motions, and clothing styles. The diverse range of daily outfits and actions ensures that the dataset emulates real-world human appearances and activities.
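Calibrated camera parameters like those annotated here are what tie the views together: given per-camera intrinsics K and extrinsics (R, t), a 3D keypoint projects into each camera's image via the standard pinhole model. The sketch below is illustrative only; the matrix layout and variable names are assumptions, not MVHumanNet's actual file format:

```python
# Minimal pinhole projection: map a 3D point in world coordinates to pixel
# coordinates using camera intrinsics K and extrinsics (R, t).
def project(point_w, K, R, t):
    # World -> camera frame: x_c = R @ x_w + t
    xc = [sum(R[i][j] * point_w[j] for j in range(3)) + t[i] for i in range(3)]
    # Perspective divide, then apply focal lengths and principal point.
    u = K[0][0] * xc[0] / xc[2] + K[0][2]
    v = K[1][1] * xc[1] / xc[2] + K[1][2]
    return u, v

# Hypothetical camera: 1000px focal length, principal point at (512, 512),
# identity rotation, positioned 2 meters from the subject.
K = [[1000.0, 0.0, 512.0],
     [0.0, 1000.0, 512.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 2.0]

print(project([0.1, -0.2, 0.0], K, R, t))  # -> (562.0, 412.0)
```

Reprojecting annotated 3D keypoints this way and comparing against the 2D keypoints is a common consistency check for multi-view capture data.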

To demonstrate the efficacy of MVHumanNet, the authors conducted experiments spanning action recognition, novel view synthesis, and generative modeling. In the action recognition trial, models trained on MVHumanNet improved in accuracy as the number of camera views increased, indicating that additional views translate into better prediction. For human reconstruction with Neural Radiance Fields (NeRF), larger training sets improved generalization to novel poses and clothing types, which is critical for accurate and versatile digital human representations.

Furthermore, the dataset proved valuable in text-driven image generation tasks, where the ability to generate high-quality human images consistent with SMPL conditions and textual descriptions was significantly enhanced with increased training data scale. The implications of such capabilities extend to creating avatars, fashion modeling, and even virtual try-on applications.

Lastly, the dataset enabled generative models that produce textured 3D meshes from high-resolution full-body images, a significant step beyond prior datasets that either relied on synthetic data or operated in restricted view settings. The results suggest that scaling up data has a marked positive effect on performance, a promising signal for future research in 3D human model generation.

In summary, MVHumanNet is a transformative new dataset that stands to accelerate progress significantly in several branches of AI that deal with digital human representation and generation. It exemplifies the power of large-scale, detailed datasets to push the boundaries of what AI models can achieve, providing the groundwork for a future where the virtual representation of humans is as detailed and nuanced as their real-world counterparts.