S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR (2402.14461v2)

Published 22 Feb 2024 in cs.CV

Abstract: Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR). However, previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes such as pose estimation and object detection. This pipeline may compromise the flexibility of learning multimodal representations, consequently constraining overall effectiveness. In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR, aimed at complementarily leveraging multi-view 2D scenes and 3D point clouds for SGG in an end-to-end manner. Concretely, our model embraces a View-Sync Transfusion scheme to encourage multi-view visual information interaction. Concurrently, a Geometry-Visual Cohesion operation is designed to integrate the synergic 2D semantic features into 3D point cloud features. Moreover, based on the augmented features, we propose a novel relation-sensitive transformer decoder that embeds dynamic entity-pair queries and relational trait priors, enabling the direct prediction of entity-pair relations for graph generation without intermediate steps. Extensive experiments have validated the superior SGG performance and lower computational cost of S2Former-OR on the 4D-OR benchmark compared with current OR-SGG methods, e.g., a 3-percentage-point increase in Precision and a 24.2M reduction in model parameters. We further compared our method with generic single-stage SGG methods using broader metrics for a comprehensive evaluation, achieving consistently better performance.
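
To make the single-stage pipeline concrete, below is a minimal PyTorch sketch in the spirit of the abstract: multi-view 2D tokens are fused by a view-synchronizing encoder, the fused 2D semantics are injected into 3D point features via cross-attention (standing in for the Geometry-Visual Cohesion operation), and a DETR-style relation decoder with learnable entity-pair queries directly emits subject/predicate/object logits. All module names, dimensions, query and class counts, and the linear stand-in for the point-cloud backbone are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a single-stage bi-modal SGG pipeline (PyTorch).
# Module names, dimensions, and the toy point-cloud encoder are assumptions.
import torch
import torch.nn as nn


class ViewSyncTransfusion(nn.Module):
    """Fuse per-view 2D feature tokens with self-attention across views."""
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)

    def forward(self, view_tokens):           # (B, V*N, C) flattened view tokens
        return self.encoder(view_tokens)      # (B, V*N, C) view-synced tokens


class GeometryVisualCohesion(nn.Module):
    """Inject fused 2D semantics into 3D point features via cross-attention."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, point_feats, visual_tokens):   # (B, P, C), (B, V*N, C)
        attended, _ = self.cross_attn(point_feats, visual_tokens, visual_tokens)
        return self.norm(point_feats + attended)     # (B, P, C) augmented points


class RelationDecoder(nn.Module):
    """Decode entity-pair queries directly into <subject, predicate, object>."""
    def __init__(self, dim=256, heads=8, layers=4,
                 num_queries=50, num_entities=12, num_relations=14):
        super().__init__()
        self.pair_queries = nn.Embedding(num_queries, dim)   # entity-pair queries
        dec_layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, layers)
        self.subj_head = nn.Linear(dim, num_entities)
        self.obj_head = nn.Linear(dim, num_entities)
        self.rel_head = nn.Linear(dim, num_relations)

    def forward(self, memory):                               # (B, P, C)
        q = self.pair_queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        h = self.decoder(q, memory)                          # (B, Q, C)
        return self.subj_head(h), self.rel_head(h), self.obj_head(h)


class SingleStageBiModalSGG(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.point_proj = nn.Linear(3, dim)   # toy stand-in for a point-cloud backbone
        self.vst = ViewSyncTransfusion(dim)
        self.gvc = GeometryVisualCohesion(dim)
        self.decoder = RelationDecoder(dim)

    def forward(self, view_tokens, points):   # (B, V*N, C) 2D tokens, (B, P, 3) xyz
        visual = self.vst(view_tokens)
        point_feats = self.point_proj(points)
        fused = self.gvc(point_feats, visual)
        return self.decoder(fused)            # per-query subject/relation/object logits


# Toy forward pass with random inputs (3 views x 64 tokens, 1024 points)
model = SingleStageBiModalSGG()
subj, rel, obj = model(torch.randn(1, 3 * 64, 256), torch.randn(1, 1024, 3))
print(subj.shape, rel.shape, obj.shape)
# torch.Size([1, 50, 12]) torch.Size([1, 50, 14]) torch.Size([1, 50, 12])
```

In this sketch each decoder query is matched to at most one ground-truth triplet (e.g., with Hungarian matching, as in DETR-style detectors), which is how a single-stage model can skip explicit pose estimation and object detection stages.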

Authors (6)
  1. Jialun Pei (20 papers)
  2. Diandian Guo (10 papers)
  3. Jingyang Zhang (58 papers)
  4. Manxi Lin (14 papers)
  5. Yueming Jin (70 papers)
  6. Pheng-Ann Heng (196 papers)
