Human Inspired Progressive Alignment and Comparative Learning for Grounded Word Acquisition (2307.02615v1)
Abstract: Human language acquisition is an efficient, supervised, and continual process. In this work, we took inspiration from how human babies acquire their first language, and developed a computational process for word acquisition through comparative learning. Motivated by cognitive findings, we generated a small dataset that enables computational models to compare the similarities and differences of various attributes, and to learn to filter out and extract the common information associated with each shared linguistic label. We frame the acquisition of words not only as an information filtration process, but also as a representation-symbol mapping. This procedure involves neither a fixed vocabulary size nor a discriminative objective, and it allows the models to continually learn more concepts efficiently. Our results in controlled experiments have shown the potential of this approach for efficient continual learning of grounded words.
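As an illustration only, and not the authors' released implementation, the sketch below shows one way comparative word acquisition of this kind could be organized: two grounded examples that share a word are compared, feature dimensions on which they agree are kept as the "common information", and that result is aligned with a per-word prototype in an open-ended lexicon. The class and method names (`ComparativeLexicon`, `acquire`, `ground`) and the agreement-masking heuristic are assumptions made for this example.

```python
import numpy as np

class ComparativeLexicon:
    """Illustrative sketch: maps linguistic labels to grounded prototype vectors.

    The vocabulary is open-ended (a dict that grows as new words are heard),
    and there is no discriminative objective over a fixed label set; each word's
    representation is refined only by comparing examples that share that word.
    """

    def __init__(self, agreement_threshold: float = 0.1, lr: float = 0.5):
        self.prototypes = {}                      # word -> prototype vector
        self.agreement_threshold = agreement_threshold
        self.lr = lr

    def acquire(self, word: str, features_a: np.ndarray, features_b: np.ndarray):
        """Compare two grounded examples sharing `word`, keep what they have in
        common, and align the word's stored prototype toward it."""
        # Comparative step: keep only feature dimensions on which the pair
        # agrees -- a crude stand-in for filtering out non-shared information.
        agree = np.abs(features_a - features_b) < self.agreement_threshold
        common = np.where(agree, (features_a + features_b) / 2.0, 0.0)

        if word not in self.prototypes:
            # New word: no fixed vocabulary size, just add an entry.
            self.prototypes[word] = common
        else:
            # Progressive alignment: move the prototype toward the shared info.
            proto = self.prototypes[word]
            self.prototypes[word] = (1 - self.lr) * proto + self.lr * common

    def ground(self, features: np.ndarray, top_k: int = 1):
        """Representation-to-symbol mapping: return the word(s) whose prototype
        is closest to a new grounded observation."""
        def dist(word):
            return np.linalg.norm(self.prototypes[word] - features)
        return sorted(self.prototypes, key=dist)[:top_k]


# Usage: two red objects are paired under the word "red"; the shared colour
# dimensions survive the comparison while object-specific ones are filtered out.
lex = ComparativeLexicon()
red_cube = np.array([1.0, 0.0, 0.0, 0.9])   # hypothetical colour + shape features
red_ball = np.array([1.0, 0.0, 0.0, 0.1])
lex.acquire("red", red_cube, red_ball)
print(lex.ground(np.array([0.95, 0.05, 0.0, 0.5])))  # -> ['red']
```

Because the lexicon is just a growing dictionary of prototypes, new words can be added at any time without revisiting or re-training earlier entries, which is the continual-learning property the abstract emphasizes.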