
Long-Term Human Trajectory Prediction using 3D Dynamic Scene Graphs (2405.00552v4)

Published 1 May 2024 in cs.RO and cs.HC

Abstract: We present a novel approach for long-term human trajectory prediction in indoor human-centric environments, which is essential for long-horizon robot planning in these environments. State-of-the-art human trajectory prediction methods are limited by their focus on collision avoidance and short-term planning, and their inability to model complex interactions of humans with the environment. In contrast, our approach overcomes these limitations by predicting sequences of human interactions with the environment and using this information to guide trajectory predictions over a horizon of up to 60s. We leverage LLMs to predict interactions with the environment by conditioning the LLM prediction on rich contextual information about the scene. This information is given as a 3D Dynamic Scene Graph that encodes the geometry, semantics, and traversability of the environment into a hierarchical representation. We then ground these interaction sequences into multi-modal spatio-temporal distributions over human positions using a probabilistic approach based on continuous-time Markov Chains. To evaluate our approach, we introduce a new semi-synthetic dataset of long-term human trajectories in complex indoor environments, which also includes annotations of human-object interactions. We show in thorough experimental evaluations that our approach achieves a 54% lower average negative log-likelihood and a 26.5% lower Best-of-20 displacement error compared to the best non-privileged (i.e., evaluated in a zero-shot fashion on the dataset) baselines for a time horizon of 60s.


Summary

  • The paper introduces a language-driven probabilistic modeling approach that uses 3D dynamic scene graphs to extend prediction horizons to 60 seconds.
  • It leverages large language models to forecast human-object interactions and encode detailed environmental context for improved trajectory prediction.
  • Extensive validation on a novel semi-synthetic dataset demonstrates a 54% lower average negative log-likelihood and a 26.5% lower best-of-20 displacement error than the strongest non-privileged baselines over a 60-second horizon, advancing proactive human-robot collaboration in dynamic spaces.

Exploring Long-term Human Trajectory Prediction with 3D Dynamic Scene Graphs

Introduction to Human Trajectory Prediction

Long-term Scenario: Understanding human motion in environments bustling with activity, like offices or homes, is crucial for robots designed to interact and coexist with humans. Current trajectory prediction methods focus mainly on short spans (up to 10 seconds), largely for collision avoidance. These methods generally falter in complex, interaction-dense settings typical of human-centered spaces.

Breaking New Ground: The proposed approach extends the prediction horizon to 60 seconds and explicitly reasons about human-object interactions and their effect on movement, a significant step beyond standard short-term forecasting. It does so by combining LLMs for interaction prediction with 3D dynamic scene graphs (DSGs) that provide rich, contextual environment modeling; a simplified sketch of such a hierarchy follows below.
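
To make the environment representation concrete, here is a minimal, illustrative sketch of the kind of hierarchy a DSG encodes (rooms containing places, places containing objects) and how it might be flattened into text to condition an LLM prompt. The class names, fields, and example scene are assumptions introduced for illustration only; they are not the data structures used by the paper or by DSG systems such as Hydra.

```python
from dataclasses import dataclass, field

# Illustrative, simplified stand-in for a 3D Dynamic Scene Graph layer hierarchy.
# Real DSGs also encode geometry, traversability, meshes, and agent nodes.

@dataclass
class ObjectNode:
    label: str                               # semantic class, e.g. "sofa"
    position: tuple[float, float, float]     # 3D position in the map frame

@dataclass
class PlaceNode:
    place_id: int
    position: tuple[float, float, float]
    objects: list[ObjectNode] = field(default_factory=list)

@dataclass
class RoomNode:
    name: str                                # e.g. "kitchen"
    places: list[PlaceNode] = field(default_factory=list)

@dataclass
class SceneGraph:
    rooms: list[RoomNode] = field(default_factory=list)

    def describe(self) -> str:
        """Flatten the hierarchy into text that could condition an LLM prompt."""
        lines = []
        for room in self.rooms:
            objs = [o.label for p in room.places for o in p.objects]
            lines.append(f"{room.name}: {', '.join(objs) if objs else 'empty'}")
        return "\n".join(lines)

# Hypothetical example: two rooms with one object each.
graph = SceneGraph(rooms=[
    RoomNode("kitchen", places=[PlaceNode(0, (1.0, 2.0, 0.0),
                                          [ObjectNode("coffee machine", (1.2, 2.1, 0.9))])]),
    RoomNode("office", places=[PlaceNode(1, (5.0, 1.0, 0.0),
                                         [ObjectNode("desk", (5.3, 1.1, 0.7))])]),
])
print(graph.describe())
```

In the paper, the scene graph additionally encodes geometry and traversability of the environment, which a simple listing like this omits.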

Key Components and Methods

  1. Utilizing 3D Dynamic Scene Graphs: The research leverages DSGs to encode detailed information about the environment, such as geometry and traversability, into a structured, hierarchical format that machines can understand and process. This complex representation allows for more nuanced reasoning about the space humans are navigating.
  2. Predicting Human-Environment Interactions: At the heart of their approach is the predictive power of LLMs, which estimate potential human-object interactions within the environment. This prediction isn't just about where a human might move but involves understanding potential interactions with objects in their path, adding a layer of depth to trajectory forecasting.
  3. Probabilistic Trajectory Modeling: Once potential interactions are identified, the system employs probabilistic models, specifically continuous-time Markov chains, to map these interaction sequences into spatio-temporal distributions over likely positions within a given time horizon. The paper refers to this method as Language-driven Probabilistic Long-term Prediction (LP2); a minimal sketch of the Markov-chain grounding step follows this list.
  4. Extensive Validation with a Novel Dataset: Recognizing the lack of suitable datasets for training and testing their model, the researchers developed a new semi-synthetic dataset featuring complex indoor environments and detailed annotations of human-object interactions. Their method excelled in experimental benchmarks, notably reducing errors in trajectory prediction over a 60-second horizon.
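
As referenced in step 3 above, the following is a minimal sketch of how predicted interactions could be grounded into a distribution over future positions with a continuous-time Markov chain: each state is a candidate interaction target with a known position, a generator matrix Q holds assumed transition rates, and the matrix exponential yields state occupancy probabilities at a future time t. The targets, rates, and coordinates are hypothetical, and the paper's LP2 formulation is richer (it produces multi-modal spatio-temporal distributions rather than a single summary position).

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical interaction targets (states) and their 2D positions in the scene.
targets = ["coffee machine", "desk", "printer"]
positions = np.array([[1.2, 2.1],
                      [5.3, 1.1],
                      [3.0, 4.0]])

# Generator matrix Q of a continuous-time Markov chain over interaction states.
# Off-diagonal entries are assumed transition rates (1/s); each row sums to zero.
Q = np.array([[-0.05,  0.04,  0.01],
              [ 0.03, -0.04,  0.01],
              [ 0.02,  0.02, -0.04]])

def occupancy(p0: np.ndarray, t: float) -> np.ndarray:
    """State occupancy probabilities after t seconds: p(t) = p(0) @ expm(Q * t)."""
    return p0 @ expm(Q * t)

# Start at the coffee machine with certainty; predict 60 s ahead.
p0 = np.array([1.0, 0.0, 0.0])
p60 = occupancy(p0, 60.0)

# Ground the state distribution into space: here the probability-weighted mean
# position, used only as a simple summary statistic of the spatial distribution.
mean_position = p60 @ positions
print(dict(zip(targets, p60.round(3))), mean_position.round(2))
```

A full predictor would turn the occupancy probabilities into a mixture of paths toward each target rather than a single mean, but the matrix-exponential propagation is the core piece of continuous-time Markov chain machinery involved.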

Practical Implications and Theoretical Contributions

Beyond Collision Avoidance: By enhancing trajectory prediction from mere collision avoidance to include proactive interaction understanding, this research paves the way for robots that can more effectively assist, collaborate with, and adapt to humans in shared spaces.

Open-Source Resources: The commitment to open-source sharing of their method and dataset encourages further development and application of this advanced prediction framework, potentially accelerating advancements in robotic systems designed for complex human environments.

Foundation for Future Innovation: The incorporation of LLMs into trajectory prediction models opens intriguing possibilities for even more dynamic interaction models, potentially extending to predicting interactions among multiple agents or more complicated human behaviors.

Future Directions in AI and Robotics

The integration of LLMs for interaction prediction and dynamic scene graphs for environmental modeling sets a promising direction for the development of intelligent systems that understand and anticipate human needs and actions. Future enhancements could include refining these models for greater accuracy over longer periods or expanding them to predict interactions in even more complex or unpredictable environments. As AI continues to evolve, the intersection with robotics in shared human spaces remains a fertile ground for groundbreaking research and transformative applications.
