Explicitly Representing Syntax Improves Sentence-to-layout Prediction of Unexpected Situations (2401.14212v3)
Abstract: Recognizing the visual entities in a natural language sentence and arranging them in a 2D spatial layout require a compositional understanding of language and space. This task of layout prediction is valuable in text-to-image synthesis, as it allows localized and controlled in-painting of the image. In this comparative study we show that layouts can be predicted from language representations that implicitly or explicitly encode sentence syntax, provided the sentences mention entity relationships similar to those seen during training. To test compositional understanding, we collect a test set of grammatically correct sentences and layouts describing compositions of entities and relations that are unlikely to have been seen during training. Performance on this test set drops substantially, showing that current models rely on correlations in the training data and have difficulty understanding the structure of the input sentences. We propose a novel structural loss function that better enforces the syntactic structure of the input sentence, and show large performance gains on the task of 2D spatial layout prediction conditioned on text. The loss has the potential to be used in other generation tasks where a tree-like structure underlies the conditioning modality. Code, trained models and the USCOCO evaluation set are available via GitHub.
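Layout prediction of the kind described above is usually framed as set prediction: the model emits a set of bounding boxes, which must be matched to the ground-truth boxes before a regression loss is computed. The sketch below is only an illustrative reconstruction of this standard setup (bipartite matching plus an L1 + generalized-IoU box cost, as in DETR-style detectors), not the paper's proposed structural loss; the function names and the brute-force matcher are my own, and the exhaustive permutation search stands in for the Hungarian algorithm for small entity sets.

```python
from itertools import permutations

def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest box enclosing both a and b.
    c = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    iou = inter / union if union > 0 else 0.0
    return iou - (c - inter) / c if c > 0 else iou

def pair_cost(pred, gt):
    """Per-box cost: L1 distance on coordinates plus (1 - GIoU)."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    return l1 + (1.0 - giou(pred, gt))

def set_loss(preds, gts):
    """Optimal bipartite matching cost between predicted and true boxes.

    Brute-force over permutations; a real implementation would use the
    Hungarian algorithm for larger sets.
    """
    assert len(preds) == len(gts)
    return min(
        sum(pair_cost(preds[i], gts[j]) for i, j in enumerate(perm))
        for perm in permutations(range(len(gts)))
    )
```

Because the matching is recomputed per example, the loss is invariant to the order in which entities are predicted; a structural loss such as the one the paper proposes would add terms tying the predictions back to the sentence's parse tree on top of a set loss like this.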