PACER+: On-Demand Pedestrian Animation Controller in Driving Scenarios (2404.19722v1)
Abstract: We address the challenge of content diversity and controllability in pedestrian simulation for driving scenarios. Recent pedestrian animation frameworks focus primarily on either following a trajectory [46] or reproducing the content of a reference video [57], and consequently overlook the potential diversity of human motion in such scenarios. This limitation restricts their ability to generate pedestrian behaviors with a wide range of variations and realistic motions, and therefore limits their use in providing rich motion content for other components of the driving simulation system, e.g., sudden motion changes to which the autonomous vehicle should respond. Our approach overcomes this limitation by producing diverse human motions obtained from various sources, such as generated human motions, in addition to following the given trajectory. The fundamental contribution of our framework lies in combining the motion tracking task with trajectory following, which enables a single policy to track specific motion parts (e.g., the upper body) while simultaneously following the given trajectory. In this way, we significantly enhance both the diversity of simulated human motion within the given scenario and the controllability of the content, including language-based control. Our framework facilitates the generation of a wide range of human motions, contributing to greater realism and adaptability in pedestrian simulations for driving scenarios. More information is available on our project page: https://wangjingbo1219.github.io/papers/CVPR2024_PACER_PLUS/PACERPLUSPage.html
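To make the idea of one policy serving both objectives more concrete, the following is a minimal sketch of a reward that mixes a trajectory-following term with a motion-tracking term restricted to selected joints by a binary mask. It is not the paper's formulation: the exponential reward shape, the weights, and every name used here (`combined_reward`, `joint_mask`, `k_traj`, `k_track`) are assumptions chosen for illustration only.

```python
import math


def combined_reward(root_xy, target_xy, joint_pos, ref_joint_pos, joint_mask,
                    w_traj=0.5, w_track=0.5, k_traj=2.0, k_track=10.0):
    """Hypothetical reward: trajectory following plus masked motion tracking.

    root_xy, target_xy: (x, y) of the character root and the current waypoint.
    joint_pos, ref_joint_pos: lists of (x, y, z) joint positions.
    joint_mask: list of 0/1 flags; 1 means "track this joint" (e.g. upper body).
    All weights and scales are illustrative assumptions.
    """
    # Trajectory term: squared ground-plane distance from root to waypoint.
    traj_err = (root_xy[0] - target_xy[0]) ** 2 + (root_xy[1] - target_xy[1]) ** 2
    r_traj = math.exp(-k_traj * traj_err)

    # Tracking term: mean squared position error over the masked-in joints only.
    tracked = [i for i, flag in enumerate(joint_mask) if flag]
    if not tracked:
        # Nothing to track: fall back to pure trajectory following.
        return r_traj
    track_err = sum(
        sum((a - b) ** 2 for a, b in zip(joint_pos[i], ref_joint_pos[i]))
        for i in tracked
    ) / len(tracked)
    r_track = math.exp(-k_track * track_err)

    return w_traj * r_traj + w_track * r_track
```

In this sketch, joints whose mask entry is 0 contribute nothing to the tracking term, so the lower body is free to do whatever the trajectory term requires while the masked-in joints are pulled toward the reference motion.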
- Scape: shape completion and animation of people. In ACM SIGGRAPH 2005 Papers, pages 408–416. 2005.
- Exploiting temporal context for 3d human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3395–3404, 2019.
- Pmp: Learning to physically interact with environments using part-wise motion priors. ACM Transactions On Graphics (TOG), 2023.
- Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18000–18010, 2023.
- Geosim: Realistic video simulation via geometry-aware composition for self-driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7230–7240, 2021.
- Learning 3d human pose from structure and motion. In Proceedings of the European Conference on Computer Vision (ECCV), pages 668–683, 2018.
- Gravity-aware monocular 3d human-object reconstruction. In ICCV, 2021.
- C·ase: Learning conditional adversarial skill embeddings for physics-based characters. In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023a.
- Tore: Token reduction for efficient human mesh recovery with transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15143–15155, 2023b.
- Phase-functioned neural networks for character control. ACM Transactions on Graphics (TOG), 36(4):1–13, 2017.
- Towards accurate marker-less human shape and pose estimation over time. In 2017 international conference on 3D vision (3DV), pages 421–430. IEEE, 2017.
- Kama: 3d keypoint aware body mesh articulation. In 2021 International Conference on 3D Vision (3DV), 2021.
- Exemplar fine-tuning for 3d human pose fitting towards in-the-wild 3d human pose estimation. 2020.
- Padl: Language-directed physics-based character control. ACM Transactions on Graphics (TOG), 2022.
- End-to-end recovery of human shape and pose. In Computer Vision and Pattern Recognition (CVPR), 2018.
- Vibe: Video inference for human body pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5253–5263, 2020.
- Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In ICCV, 2019.
- Unite the people: Closing the loop between 3d and 2d human representations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6050–6059, 2017.
- Data-driven biped control. In ACM SIGGRAPH 2010 papers, pages 1–8. 2010.
- Learning to generate diverse dance motions with transformer. arXiv, 2020.
- Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In CVPR, 2021.
- D&d: Learning human dynamics from dynamic camera. In ECCV, 2022.
- NIKI: Neural inverse kinematics with invertible neural networks for 3d human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023a.
- Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3205–3215, 2023b.
- Character controllers using motion vaes. ACM Trans. Graph., 39(4), 2020.
- Smpl: A skinned multi-person linear model. ACM transactions on graphics (TOG), 34(6):1–16, 2015.
- Dynamics-regulated kinematic policy for egocentric pose estimation. NIPS, 2021.
- Embodied scene-aware human pose estimation. In Advances in Neural Information Processing Systems, 2022.
- Universal humanoid motion representations for physics-based control. arXiv preprint arXiv:2310.04582, 2023a.
- Perpetual humanoid control for real-time simulated avatars. In International Conference on Computer Vision (ICCV), 2023b.
- Amass: Archive of motion capture as surface shapes. In ICCV, 2019.
- Isaac gym: High performance gpu-based physics simulation for robot learning, 2021.
- Vnect: Real-time 3d human pose estimation with a single rgb camera. ACM Transactions on Graphics (TOG), 36(4):1–14, 2017.
- Single-shot multi-person 3d pose estimation from monocular rgb. In 2018 International Conference on 3D Vision (3DV), pages 120–130. IEEE, 2018.
- Camera distance-aware top-down approach for 3d multi-person pose estimation from a single rgb image. In CVPR, 2019.
- Contact-aware nonlinear control of dynamic characters. In ACM SIGGRAPH 2009 papers, pages 1–9. 2009.
- NVIDIA. Drive sim. https://developer.nvidia.com/drive/simulation, 2020.
- OpenAI. Chatgpt. https://openai.com/chatgpt, 2020.
- Learning to estimate 3d human pose and shape from a single color image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 459–468, 2018.
- 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7753–7762, 2019.
- Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG), 37(4):1–14, 2018.
- Mcp: Learning composable hierarchical control with multiplicative compositional policies. In Advances in Neural Information Processing Systems, pages 3681–3692, 2019.
- Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG), 2021.
- Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters. ACM Transactions On Graphics (TOG), 2022.
- Action-conditioned 3d human motion synthesis with transformer vae. In ICCV, pages 10985–10995, 2021.
- Contact and human dynamics from monocular video. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
- Humor: 3d human motion model for robust pose estimation. In ICCV, 2021.
- Generating useful accident-prone driving scenarios via a learned traffic prior. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Trace and pace: Controllable pedestrian animation via guided trajectory diffusion. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- Insactor: Instruction-driven physics-based characters. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Physcap: Physically plausible monocular 3d motion capture in real time. ACM Transactions on Graphics, 39(6), 2020.
- Neural monocular 3d human motion capture with physical awareness. ACM Transactions on Graphics, 40(4), 2021.
- Human body model fitting by learned gradient descent. In ECCV, 2020.
- Human mesh recovery from monocular images via a skeleton-disentangled representation. In Proceedings of the IEEE International Conference on Computer Vision, pages 5349–5358, 2019.
- Trafficsim: Learning to simulate realistic multi-agent behaviors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10400–10409, 2021.
- Indirect deep structured learning for 3d human body shape and pose prediction. 2017.
- Calm: Conditional adversarial latent models for directable virtual characters. ACM Transactions on Graphics (TOG), 2023.
- Human motion diffusion model. In The Eleventh International Conference on Learning Representations, 2023.
- Tlcontrol: Trajectory and language control for human motion synthesis. arXiv preprint arXiv:2311.17135, 2023.
- Hmor: Hierarchical multi-person ordinal relations for monocular multi-person 3d pose estimation. In ECCV, 2020.
- Learning human dynamics in autonomous driving scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
- A scalable approach to control diverse behaviors for physically simulated characters. ACM Transactions on Graphics (TOG), 39(4):33–1, 2020.
- Physics-based character controllers using conditional vaes. ACM Transactions on Graphics (TOG), 41(4):1–12, 2022.
- Physics-based human motion estimation and synthesis from videos. In ICCV, 2021.
- Composite motion learning with task control. ACM Transactions on Graphics (TOG), 2023a.
- Adaptnet: Policy adaptation for physics-based character control. ACM Transactions on Graphics (TOG), 2023b.
- ViTPose: Simple vision transformer baselines for human pose estimation. In Advances in Neural Information Processing Systems, 2022.
- Controlvae: Model-based learning of generative controllers for physics-based characters. ACM Transactions on Graphics (TOG), 2022.
- Moconvq: Unified physics-based motion control via scalable discrete representations. arXiv preprint arXiv:2310.10198, 2023.
- Simbicon: Simple biped locomotion control. ACM Transactions on Graphics (TOG), 26(3):105–es, 2007.
- Human dynamics from monocular video with dynamic camera movements. ACM Transactions on Graphics (TOG), 40(6):1–14, 2021.
- Ego-pose estimation and forecasting as real-time pd control. In Proceedings of the IEEE International Conference on Computer Vision, pages 10082–10092, 2019.
- Residual force control for agile human behavior imitation and extended motion synthesis. In Advances in Neural Information Processing Systems, 2020.
- Simpoe: Simulated character control for 3d human pose estimation. In CVPR, 2021.
- Glamr: Global occlusion-aware human mesh recovery with dynamic cameras. In CVPR, 2022.
- Physdiff: Physics-guided human motion diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
- Weakly-supervised learning of human dynamics. In ECCV, 2020.
- Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- Multi-modal 3d human pose estimation with 2d weak supervision in autonomous driving. In CVPR, 2022.
- Emdm: Efficient motion diffusion model for fast, high-quality motion generation. arXiv preprint arXiv:2312.02256, 2023.
- On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2019.