StackFLOW: Monocular Human-Object Reconstruction by Stacked Normalizing Flow with Offset (2407.20545v1)
Abstract: Modeling and capturing the 3D spatial arrangement of the human and the object is the key to perceiving 3D human-object interaction from monocular images. In this work, we propose to represent the human-object spatial relation with the Human-Object Offset: the offsets between anchors that are densely sampled from the surfaces of the human mesh and the object mesh. Compared with previous works, which use contact maps or implicit distance fields to encode 3D human-object spatial relations, our method is a simple and efficient way to encode the highly detailed spatial correlation between the human and the object. Based on this representation, we propose Stacked Normalizing Flow (StackFLOW) to infer the posterior distribution of human-object spatial relations from the image. During the optimization stage, we fine-tune the human body pose and the object 6D pose by maximizing the likelihood of samples under this posterior distribution and minimizing the 2D-3D correspondence reprojection loss. Extensive experimental results show that our method achieves impressive results on two challenging benchmarks, the BEHAVE and InterCap datasets.
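To make the offset representation concrete, the following is a minimal sketch, not the authors' implementation. It assumes the anchors are already available as arrays of 3D points and that an offset is the vector from a human anchor to an object anchor. The function name `human_object_offsets` and the toy coordinates are hypothetical.

```python
import numpy as np

def human_object_offsets(human_anchors, object_anchors):
    """Flattened pairwise offsets from every human anchor to every object anchor.

    Assumption: the offset for a pair (i, j) is object_anchors[j] - human_anchors[i].
    """
    human_anchors = np.asarray(human_anchors, dtype=float)    # shape (m, 3)
    object_anchors = np.asarray(object_anchors, dtype=float)  # shape (n, 3)
    # Broadcasting gives offsets[i, j] = object_anchors[j] - human_anchors[i].
    offsets = object_anchors[None, :, :] - human_anchors[:, None, :]  # (m, n, 3)
    return offsets.reshape(-1)  # vector of length m * n * 3

# Toy example: 2 human anchors and 3 object anchors (made-up coordinates).
human = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
obj = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [2.0, 0.0, 0.0]])
x = human_object_offsets(human, obj)
print(x.shape)
```

With m human anchors and n object anchors, the resulting vector has m * n * 3 entries. A single vector of this kind describes the relative placement of the two meshes, and it is the kind of quantity over which a distribution of spatial relations can be modeled.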