Long-Term Human Trajectory Prediction using 3D Dynamic Scene Graphs (2405.00552v4)
Abstract: We present a novel approach for long-term human trajectory prediction in indoor human-centric environments, a capability essential for long-horizon robot planning in such settings. State-of-the-art human trajectory prediction methods are limited by their focus on collision avoidance and short-term planning, and by their inability to model complex interactions of humans with the environment. In contrast, our approach overcomes these limitations by predicting sequences of human interactions with the environment and using this information to guide trajectory predictions over a horizon of up to 60 s. We leverage Large Language Models (LLMs) to predict these interaction sequences, conditioning the LLM on rich contextual information about the scene. This information is provided as a 3D Dynamic Scene Graph that encodes the geometry, semantics, and traversability of the environment in a hierarchical representation. We then ground the predicted interaction sequences into multi-modal spatio-temporal distributions over human positions using a probabilistic approach based on continuous-time Markov chains. To evaluate our approach, we introduce a new semi-synthetic dataset of long-term human trajectories in complex indoor environments, which also includes annotations of human-object interactions. Thorough experimental evaluations show that our approach achieves a 54% lower average negative log-likelihood and a 26.5% lower Best-of-20 displacement error than the best non-privileged baselines (i.e., baselines evaluated zero-shot on the dataset) for a time horizon of 60 s.
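To illustrate the first stage, here is a minimal sketch of how a hierarchical scene representation might be serialized into textual context for an LLM interaction-prediction prompt. The graph schema, node names, and prompt wording below are hypothetical illustrations, not the paper's actual format:

```python
# Hypothetical toy scene graph: rooms with contained objects and
# traversability links, loosely mirroring a hierarchical 3D scene graph.
scene_graph = {
    "building": {
        "kitchen": {"objects": ["coffee_machine", "sink"], "connects_to": ["hallway"]},
        "office": {"objects": ["desk", "chair"], "connects_to": ["hallway"]},
        "hallway": {"objects": [], "connects_to": ["kitchen", "office"]},
    }
}

def serialize(graph: dict) -> str:
    """Flatten rooms, objects, and traversability into plain text lines."""
    lines = []
    for building, rooms in graph.items():
        for room, info in rooms.items():
            objs = ", ".join(info["objects"]) or "none"
            doors = ", ".join(info["connects_to"])
            lines.append(
                f"Room '{room}' (in {building}): objects [{objs}]; "
                f"reachable from [{doors}]."
            )
    return "\n".join(lines)

# Assemble an example prompt; the instruction text is illustrative only.
prompt = (
    "Scene description:\n" + serialize(scene_graph) + "\n"
    "The person is currently at the desk in the office.\n"
    "Predict the next three human-object interactions as a list."
)
print(prompt)
```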
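For the grounding stage, the following sketch shows the standard continuous-time Markov chain (CTMC) computation one could use to turn a set of interaction states into a time-varying distribution over positions: given a generator matrix Q, the state marginal at horizon t is p(t) = p(0) exp(Qt). The states, rates, and positions here are invented for illustration and are not the paper's parameters:

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical interaction states (objects the human may visit), each
# anchored at a 2D position that would come from the scene graph.
states = ["desk", "coffee_machine", "couch"]
positions = np.array([[0.0, 0.0], [4.0, 1.5], [2.0, 5.0]])

# Transition-rate matrix Q (units 1/s): off-diagonal entries are rates of
# switching between states; each row sums to zero, as a CTMC generator must.
Q = np.array([
    [-0.05,  0.03,  0.02],
    [ 0.04, -0.06,  0.02],
    [ 0.01,  0.02, -0.03],
])
assert np.allclose(Q.sum(axis=1), 0.0)

p0 = np.array([1.0, 0.0, 0.0])  # the human starts at the desk

# Marginal over states at horizon t via the matrix exponential, plus the
# expected position under that marginal (a multi-modal summary would keep
# the full per-state distribution instead of just the mean).
for t in (10.0, 30.0, 60.0):
    pt = p0 @ expm(Q * t)
    mean_pos = pt @ positions
    print(f"t={t:4.0f}s  p(state)={np.round(pt, 3)}  E[pos]={np.round(mean_pos, 2)}")
```

Because the marginal spreads probability mass over several interaction states at once, the induced distribution over positions is naturally multi-modal, which matches the abstract's description of the output.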