An Overview of "Learning Language Structures through Grounding"
The paper, "Learning Language Structures through Grounding," presents a comprehensive exploration of the methodologies and results of leveraging grounding signals to facilitate the learning of language structures. This exploration spans various modalities, including visual and auditory signals, program execution results, and cross-lingual alignment. It proposes a paradigm shift from heavily supervised methods, which require explicit annotations, to utilizing distant grounding signals that are naturally available and easier to obtain.
Grammar Induction from Visual Grounding
The initial sections of the paper present the Visually Grounded Neural Syntax Learner (VG-NSL) and its extension to speech, the Audio-Visual Neural Syntax Learner (AV-NSL). These models induce constituency structure from images paired with captions, building on the intuition that text spans referring to concrete, visible objects tend to form constituents. VG-NSL outperforms prior unsupervised constituency parsers by using this visual signal to guide its predictions. AV-NSL extends the approach to raw speech: it segments audio into word-like units and uses visual grounding to induce syntactic structure over them, achieving notable performance without relying on any textual annotation.
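To make the intuition concrete, the sketch below scores a candidate text span by how well it matches its paired image relative to distractor images, similar in spirit to the concreteness reward VG-NSL optimizes. The function name `concreteness_reward`, the margin value, and the random vectors are illustrative assumptions; in the actual model, span and image embeddings are learned jointly and the reward trains the parser via policy gradient.

```python
import numpy as np

def concreteness_reward(span_vec, image_vec, distractor_vecs, margin=0.2):
    """Score how strongly a text span grounds to its paired image.

    Spans that match their own image better than distractor images get
    a higher reward; VG-NSL uses this kind of matching signal to decide
    which spans to treat as constituents.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    pos = cos(span_vec, image_vec)
    # Hinge penalty: the true pair should beat each distractor by a margin.
    violations = [max(0.0, margin - pos + cos(span_vec, d)) for d in distractor_vecs]
    return pos - float(np.mean(violations))

# Toy usage with random vectors standing in for learned embeddings.
rng = np.random.default_rng(0)
span, image = rng.normal(size=64), rng.normal(size=64)
distractors = rng.normal(size=(5, 64))
print(concreteness_reward(span, image, distractors))
```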
Joint Syntax and Semantics Induction
In the section introducing Grammar-Based Grounded Lexicon Learning (G2L2), the authors propose jointly learning syntactic and semantic representations from grounding signals, primarily in the visual domain. G2L2 builds on combinatory categorial grammar (CCG): each word is paired with a syntactic category and a fragment of a neuro-symbolic program, so that parsing a sentence yields a program executable on images. The results indicate that G2L2 achieves near-perfect compositional generalization, demonstrating that grounding can effectively bridge syntax and semantics induction. The model's efficacy is validated on visual reasoning and language-driven navigation tasks, where it improves data efficiency and generalization over traditional methods.
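The following toy sketch illustrates the core idea of pairing words with CCG categories and executable program fragments, then composing them by forward application. The `LexEntry` structure, the attribute-set scene representation, and the two-word lexicon are hypothetical simplifications; G2L2 learns such lexicons from grounding signals and handles far richer categories and programs.

```python
from dataclasses import dataclass
from typing import Callable

# Toy scene: each object is a set of attribute labels.
SCENE = [{"red", "cube"}, {"blue", "sphere"}, {"red", "sphere"}]

@dataclass
class LexEntry:
    category: str       # CCG-style syntactic category, e.g. "N" or "N/N"
    program: Callable   # executable meaning fragment

# Hypothetical two-word lexicon pairing syntax with executable semantics.
LEXICON = {
    "cube": LexEntry("N", lambda objs: [o for o in objs if "cube" in o]),
    "red":  LexEntry("N/N", lambda f: lambda objs: [o for o in f(objs) if "red" in o]),
}

def apply_fwd(fn: LexEntry, arg: LexEntry) -> LexEntry:
    """Forward application: a category X/Y applied to a Y yields an X.

    Handles only simple categories like "N/N"; a real CCG also needs
    backward application, composition, and nested categories.
    """
    result, _, expected = fn.category.partition("/")
    assert expected == arg.category, "categories do not combine"
    return LexEntry(result, fn.program(arg.program))

red_cube = apply_fwd(LEXICON["red"], LEXICON["cube"])
print(red_cube.category)        # N
print(red_cube.program(SCENE))  # the single red cube in the scene
```

Roughly speaking, G2L2 additionally maintains uncertainty over which lexical entries to use and trains them end-to-end from the execution results of the composed programs.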
Execution-Based Decoding
Further advancing the grounding paradigm, the paper introduces MBR-exec, a minimum-Bayes-risk decoding method that selects the best candidate in natural-language-to-code translation based on execution results. By executing sampled candidates and preferring those whose outputs agree with the rest of the sample, MBR-exec significantly boosts performance, especially in few-shot settings. The method requires no ground-truth programs: execution behavior alone informs the selection, improving the reliability and accuracy of code generated from natural language descriptions.
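A minimal sketch of the selection step appears below, assuming a set of candidate programs and a way to execute them on test inputs. Here agreement is measured by exact match of execution outputs, a simple instance of the pairwise losses over execution results that MBR decoding supports; the helper names (`mbr_exec_select`, `run_expr`) are illustrative.

```python
from collections import Counter

def mbr_exec_select(candidates, test_inputs, execute):
    """Pick the candidate whose execution results agree most with the others.

    execute(program, x) runs `program` on input `x` and returns its output
    (or None on failure); candidates with identical outputs on all test
    inputs are treated as equivalent.
    """
    sigs = [tuple(execute(prog, x) for x in test_inputs) for prog in candidates]
    counts = Counter(sigs)  # how many candidates share each execution signature
    best = max(range(len(candidates)), key=lambda i: counts[sigs[i]])
    return candidates[best]

# Toy demo: "programs" are Python expressions over a variable x.
def run_expr(expr, x):
    try:
        return eval(expr, {"x": x})
    except Exception:
        return None

samples = ["x * 2", "x + x", "x ** 2"]                # two of three agree
print(mbr_exec_select(samples, [1, 2, 3], run_expr))  # -> "x * 2"
```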
Cross-Lingual Grounding
The latter part of the paper focuses on learning syntax through cross-lingual grounding. The authors present substructure distribution projection (SubDP), a method for cross-lingual word alignment and zero-shot dependency parsing. Instead of projecting hard annotations, SubDP projects soft syntactic distributions through word alignments, preserving more linguistic information on the way to the target language. The resulting models achieve state-of-the-art performance in unsupervised word alignment and zero-shot cross-lingual dependency parsing, demonstrating the potential of grounding signals to bridge diverse languages.
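At its core, projecting a soft distribution amounts to pushing arc probabilities through a soft word-alignment matrix. The sketch below, with assumed matrix orientations and a simple renormalization, illustrates this; the paper's exact formulation of SubDP may differ in details such as how alignments are normalized and how labels and roots are handled.

```python
import numpy as np

def project_arc_distribution(src_arcs, align):
    """Project a soft dependency-arc distribution across languages.

    src_arcs[u, v]: probability that source word v is the head of source word u.
    align[i, u]:    soft alignment weight between target word i and source word u.
    Returns tgt[i, j], the projected probability that target word j heads
    target word i, which can supervise a target-language parser.
    """
    tgt = align @ src_arcs @ align.T  # sum over aligned source word pairs
    row_sums = tgt.sum(axis=1, keepdims=True)
    # Renormalize so each word's head distribution sums to one.
    return np.divide(tgt, row_sums, out=np.zeros_like(tgt), where=row_sums > 0)

# Toy example: two-word sentences with a mostly one-to-one soft alignment.
src = np.array([[0.0, 1.0],   # source word 0 is headed by source word 1
                [1.0, 0.0]])
A = np.array([[0.9, 0.1],     # target word 0 mostly aligns to source word 0
              [0.1, 0.9]])
print(project_arc_distribution(src, A))
```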
Future Directions
The paper also discusses future research avenues, emphasizing the need to:
- Disentangle and quantify the contributions of different grounding signals.
- Extend grounded language learning to more diverse and underrepresented languages, particularly considering cultural and historical contexts.
- Improve computational efficiency and scalability of the proposed methods.
- Explore grounding for broader linguistic phenomena beyond syntax and semantics, including discourse and pragmatics.
Conclusion
"Learning Language Structures through Grounding" advocates for a paradigm that utilizes naturally occurring grounding signals to learn linguistic structures, demonstrating improved accuracy, data efficiency, and generalization in various tasks. This research underscores the importance of incorporating multimodal data and cross-lingual resources, proposing innovative methods with practical implications for the future of natural language processing and computational linguistics.