EGTR: Extracting Graph from Transformer for Scene Graph Generation (2404.02072v5)
Abstract: Scene Graph Generation (SGG) is the challenging task of detecting objects and predicting the relationships between them. Since the introduction of DETR, one-stage SGG models built on a one-stage object detector have been actively studied. However, these models rely on complex modeling to predict relationships between objects, and they neglect the inherent relationships between object queries that are learned in the multi-head self-attention of the object detector. We propose a lightweight one-stage SGG model that extracts the relation graph from the various relationships learned in the multi-head self-attention layers of the DETR decoder. By fully utilizing these self-attention by-products, the relation graph can be extracted effectively with a shallow relation extraction head. Because relation extraction depends on object detection, we propose a novel relation smoothing technique that adaptively adjusts the relation label according to the quality of the detected objects. With relation smoothing, the model is trained under a continuous curriculum that focuses on the object detection task at the beginning of training and shifts to multi-task learning as object detection performance gradually improves. Furthermore, we propose a connectivity prediction task, which predicts whether a relation exists between an object pair, as an auxiliary task for relation extraction. We demonstrate the effectiveness and efficiency of our method on the Visual Genome and Open Images V6 datasets. Our code is publicly available at https://github.com/naver-ai/egtr.
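The relation smoothing idea in the abstract can be illustrated with a small sketch. The abstract does not give the exact formula, so the scaling rule below (multiplying each ground-truth relation label by the detection quality of its subject and object) is an assumption made for illustration, and the function name `smooth_relation_labels` and the `quality` scores are hypothetical rather than taken from the EGTR code.

```python
from typing import List

Labels = List[List[List[float]]]  # labels[subject][object][predicate]


def smooth_relation_labels(labels: Labels, quality: List[float]) -> Labels:
    """Scale binary relation labels by per-object detection quality.

    ASSUMPTION: quality[i] is a score in [0, 1] describing how well
    object i is currently detected, and the smoothed target for the
    pair (i, j) is label * quality[i] * quality[j]. This is only one
    plausible reading of "adjusts the relation label adaptively".
    """
    n = len(quality)
    smoothed: Labels = []
    for i in range(n):
        row = []
        for j in range(n):
            scale = quality[i] * quality[j]
            row.append([y * scale for y in labels[i][j]])
        smoothed.append(row)
    return smoothed


# Two objects, two predicate classes; object 0 relates to object 1
# through predicate 0.
labels = [
    [[0.0, 0.0], [1.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]

# Early in training the detector is weak, so the relation target is small.
early = smooth_relation_labels(labels, quality=[0.2, 0.5])

# Later the detector is strong, so the target approaches the hard label.
late = smooth_relation_labels(labels, quality=[0.9, 1.0])
```

Under this assumption the relation targets stay close to zero while detections are poor and grow toward the hard labels as detection improves, which is one way to obtain the continuous curriculum the abstract describes.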
Authors:
- Jinbae Im
- Nokyung Park
- Hyungmin Lee
- Seunghyun Park
- Jeongyeon Nam