Can Transformers Capture Spatial Relations between Objects? (2403.00729v1)
Abstract: Spatial relationships between objects represent key scene information for humans to understand and interact with the world. To study the capability of current computer vision systems to recognize physically grounded spatial relations, we start by proposing precise relation definitions that permit consistently annotating a benchmark dataset. Despite the apparent simplicity of this task relative to others in the recognition literature, we observe that existing approaches perform poorly on this benchmark. We propose new approaches that exploit the long-range attention capabilities of transformers for this task, and evaluate key design principles. We identify a simple "RelatiViT" architecture and demonstrate that it outperforms all current approaches. To our knowledge, this is the first method to convincingly outperform naive baselines on spatial relation prediction in in-the-wild settings. The code and datasets are available at \url{https://sites.google.com/view/spatial-relation}.
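To make the abstract's core idea concrete: the approach classifies the relation between a query pair of objects by letting their tokens attend over the full set of image patch embeddings, so the prediction can use long-range context rather than only the two boxes. The sketch below is a minimal, hypothetical illustration of that pattern in numpy; the function names, dimensions, and the mean-pooled region queries are assumptions for illustration, not the paper's actual RelatiViT implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector
    (the mechanism of Vaswani et al., 2017)."""
    d = query.shape[-1]
    weights = softmax(keys @ query / np.sqrt(d))
    return weights @ values

def predict_relation(patches, subj_mask, obj_mask, w_out):
    """Hypothetical relation head: build one query per object from its
    masked patch tokens, pool global context with attention, then score
    relation classes from the concatenated subject/object embeddings."""
    subj_q = patches[subj_mask].mean(axis=0)      # subject-region query
    obj_q = patches[obj_mask].mean(axis=0)        # object-region query
    subj_emb = attend(subj_q, patches, patches)   # long-range context pooling
    obj_emb = attend(obj_q, patches, patches)
    logits = w_out @ np.concatenate([subj_emb, obj_emb])
    return softmax(logits)

# Toy example: 196 patch tokens (14x14 grid) of dim 32, 9 relation classes.
patches = rng.normal(size=(196, 32))
subj_mask = np.zeros(196, dtype=bool); subj_mask[:10] = True
obj_mask = np.zeros(196, dtype=bool); obj_mask[50:60] = True
w_out = rng.normal(size=(9, 64))
probs = predict_relation(patches, subj_mask, obj_mask, w_out)
print(probs.shape)
```

The key design choice this sketch highlights is that the subject and object representations are pooled over *all* patch tokens, so evidence outside the two bounding boxes (support surfaces, occluders, depth cues) can inform the predicted relation.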
Authors: Chuan Wen, Dinesh Jayaraman, Yang Gao