- The paper introduces extractive structures to explain how pretrained models generalize from finetuned facts via out-of-context reasoning.
- It details the roles of upstream, informative, and downstream components and their correspondence to attention heads and MLP layers.
- Empirical results on models such as Llama 3-8b show how data ordering and weight grafting affect whether these structures are engaged.
The paper examines the latent capacity of pretrained LMs to perform out-of-context reasoning (OCR) via extractive structures, which enable generalization from factual finetuning to related implications. Rather than stopping at a surface-level account of model generalization, the research dissects the specific mechanisms within the model architecture that support this ability.
The authors introduce the concept of extractive structures, composed of three groups of components: informative components, which absorb new knowledge through weight changes during finetuning; upstream extractive components, which route input cues to that stored knowledge; and downstream extractive components, which translate the retrieved information into appropriate responses. This structure allows an LM trained on a fact such as "John Doe lives in Tokyo" to later answer a question like "What language do the people in John Doe's city speak?" with "Japanese."
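The sketch below illustrates this finetune-then-query setup with Hugging Face transformers. The model id, the single-step training loop, and the exact fact and query strings are illustrative assumptions for the sake of a minimal example, not the paper's protocol.

```python
# Minimal OCR setup sketch: finetune on a bare fact, then query an implication
# that never appeared in the finetuning data. Model id and hyperparameters are
# assumptions; a real run would take many more optimization steps.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/OLMo-7B-hf"  # any of the studied causal LMs would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.train()

# 1) Finetune on the fact itself (one gradient step shown for brevity).
fact = tok("John Doe lives in Tokyo.", return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(**fact, labels=fact["input_ids"]).loss
loss.backward()
optimizer.step()

# 2) Query the implication, which was never trained directly.
model.eval()
query = tok("The people in the city where John Doe lives speak", return_tensors="pt")
out = model.generate(**query, max_new_tokens=3)
print(tok.decode(out[0]))  # OCR succeeds if the continuation is "Japanese"
```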
The authors validate these structures empirically across several models, notably OLMo-7b, Llama 3-8b, Gemma 2-9b, and Qwen 2-7b. The identified extractive structures correspond to concrete model components, attention heads and MLPs, distributed differently across early and late layers and supporting distinct forms of generalization.
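As a purely illustrative companion to this finding, the sketch below surveys where a finetuning run concentrates its weight changes by comparing per-parameter deltas between a pretrained and a finetuned copy of the same model. This delta-norm ranking is a crude proxy of our own, not the paper's localization methodology, and the variable names are assumptions.

```python
# Crude survey of which components a finetuning run changed most, via L2 norms
# of per-parameter deltas. Illustrative only; not the paper's analysis.
import torch

@torch.no_grad()
def delta_norms(pretrained, finetuned):
    """Map each parameter name to the norm of its finetuning-induced change."""
    pre = dict(pretrained.named_parameters())
    return {name: (p - pre[name]).norm().item()
            for name, p in finetuned.named_parameters()}

# Usage (assumes `pretrained` and `finetuned` share an architecture):
# ranked = sorted(delta_norms(pretrained, finetuned).items(), key=lambda kv: -kv[1])
# for name, norm in ranked[:10]:
#     print(f"{norm:8.3f}  {name}")  # e.g. compare attention vs. MLP weights
```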
A notable empirical discovery is a data ordering effect: models leverage extractive structures for OCR only when facts precede their implications during training. This challenges classical, order-agnostic views of model selection in machine learning and highlights how internal model states depend on the chronological order of data exposure.
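The schematic below conveys the shape of such an ordering comparison: two runs see identical documents but in opposite orders, and OCR is then probed on implications that were never trained. Every document string, helper function, and evaluation here is a hypothetical stand-in, not the paper's experimental protocol.

```python
# Schematic data-ordering comparison: same documents, different order.
# All helpers and strings are illustrative stand-ins.
from typing import List

def load_fresh_pretrained_lm():
    """Return a fresh copy of the pretrained checkpoint (loading elided)."""
    return object()

def finetune(model, documents: List[str]) -> None:
    """One finetuning pass over `documents` (training loop elided)."""

def ocr_accuracy(model, queries: List[str]) -> float:
    """Fraction of implication queries answered correctly (probing elided)."""
    return 0.0

fact_docs = ["John Doe lives in Tokyo."]
implication_docs = ["The people in Mary Roe's city speak French."]  # assumed format
ocr_queries = ["What language do the people in John Doe's city speak?"]

for name, first, second in [("facts_first", fact_docs, implication_docs),
                            ("implications_first", implication_docs, fact_docs)]:
    model = load_fresh_pretrained_lm()
    finetune(model, first)
    finetune(model, second)
    # Per the summary above, OCR emerges only in the facts-first ordering.
    print(name, ocr_accuracy(model, ocr_queries))
```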
Further analysis demonstrated weight grafting effects, showing that the weight modifications which support extractive structures can be repurposed to predict counterfactual implications, which substantiates their role in the inference process.
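A minimal sketch of what such grafting could look like in code follows: it copies the finetuning-induced weight deltas for a chosen set of components into a fresh copy of the pretrained model, which can then be probed for the implication. The component selection and variable names are assumptions for illustration, not the paper's procedure.

```python
# Weight grafting sketch: carry over the finetuning deltas of selected
# components from a donor finetuned model into a copy of the pretrained model.
import copy
import torch

@torch.no_grad()
def graft(pretrained, finetuned, component_names):
    """Return a copy of `pretrained` whose named components carry the
    finetuning deltas (finetuned - pretrained) from the donor model."""
    grafted = copy.deepcopy(pretrained)
    pre_sd = pretrained.state_dict()
    fin_sd = finetuned.state_dict()
    new_sd = grafted.state_dict()
    for name in component_names:
        delta = fin_sd[name] - pre_sd[name]   # change induced by finetuning
        new_sd[name] = pre_sd[name] + delta   # graft that change only
    grafted.load_state_dict(new_sd)
    return grafted

# Example (all variables assumed): graft only a hypothetical set of
# early-layer MLP weights and then ask the grafted model the implication
# query to see whether it predicts the (counterfactual) city's language.
# informative = [n for n, _ in pretrained.named_parameters()
#                if "mlp" in n and any(f"layers.{i}." in n for i in range(4))]
# grafted_model = graft(pretrained, finetuned_on_fact, informative)
```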
These findings carry both practical and theoretical implications. Practically, the work suggests finetuning strategies that could enhance OCR capabilities; theoretically, it lays groundwork for a structured theory of generalization in deep learning, with prospects for deploying safe and robust machine learning systems. The detailed empirical account of how these components interact is also a substantial contribution to neural network interpretability.
In conclusion, this work establishes a framework for understanding and harnessing the generalization capabilities that LMs exhibit during and after finetuning. It elucidates how latent structures learned during pretraining can be strategically activated, offering a fresh lens on finetuning and on model architecture design. Further exploration of the optimization dynamics outlined here could advance broader theories linking pretraining and finetuning efficiency and support safer AI development.