
The Impact of Depth on Compositional Generalization in Transformer Language Models (2310.19956v2)

Published 30 Oct 2023 in cs.CL

Abstract: To process novel sentences, language models (LMs) must generalize compositionally -- combine familiar elements in new ways. What aspects of a model's structure promote compositional generalization? Focusing on transformers, we test the hypothesis, motivated by theoretical and empirical work, that deeper transformers generalize more compositionally. Simply adding layers increases the total number of parameters; to address this confound between depth and size, we construct three classes of models which trade off depth for width such that the total number of parameters is kept constant (41M, 134M and 374M parameters). We pretrain all models as LMs and fine-tune them on tasks that test for compositional generalization. We report three main conclusions: (1) after fine-tuning, deeper models generalize more compositionally than shallower models do, but the benefit of additional layers diminishes rapidly; (2) within each family, deeper models show better language modeling performance, but returns are similarly diminishing; (3) the benefits of depth for compositional generalization cannot be attributed solely to better performance on language modeling. Because model latency is approximately linear in the number of layers, these results lead us to the recommendation that, with a given total parameter budget, transformers can be made shallower than is typical without sacrificing performance.

The Impact of Depth and Width on Transformer Language Model Generalization

The paper "The Impact of Depth and Width on Transformer LLM Generalization" provides an in-depth exploration into how the architectural choices in transformer models, particularly focusing on layer depth and width, influence the models' ability to generalize, especially in compositional tasks. This research is built on the premise that compositional generalization—the ability to recombine known components in novel ways—is essential for transformers when encountering new sentences. Prior studies have indicated a potential relationship between model depth and improved generalization. This paper systematically disentangles depth from overall parameter count to isolate its effect.

Main Findings

  1. Depth and Language Modeling Performance: The paper finds a clear correlation between increased model depth and improved language modeling performance, although the marginal gain from each additional layer shrinks as models become deeper. With the total parameter count held constant, greater depth lowers perplexity; however, once the feed-forward dimension becomes too small to compensate, adding further layers hurts performance.
  2. Compositional Generalization: Deeper transformer models generalize better on compositional tasks than shallower counterparts. As with language modeling, the benefit of additional layers diminishes beyond a certain depth, after which performance saturates. This pattern holds across the three parameter scales (41M, 134M, and 374M).
  3. Depth's Independent Contribution: The paper challenges the notion that depth improves generalization solely through better language modeling or in-distribution performance. By controlling for pretraining perplexity and fine-tuning loss, the authors show that the benefits of depth persist independently of these correlated quantities.

Methodological Approach

The researchers address the intrinsic confound between model depth and parameter count by designing model classes with a constant total number of parameters, trading depth for width. These variations are implemented across three size classes (41M, 134M, and 374M parameters), allowing a controlled examination of depth-specific effects on generalization; a rough sketch of how such a constant-budget tradeoff can be computed is given below.
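To make the constant-budget tradeoff concrete, the sketch below solves for a feed-forward width given a depth and a parameter budget. This is a simplified illustration, not the paper's actual width-scaling procedure: the per-layer cost model, the function name, and the 41M/512 example values are assumptions introduced here.

```python
# Minimal sketch (not the authors' exact recipe): choose the feed-forward
# dimension d_ff so that n_layers decoder layers fit a fixed parameter
# budget. Assumes each layer costs 4*d_model^2 parameters for the
# attention projections (W_q, W_k, W_v, W_o) plus 2*d_model*d_ff for the
# feed-forward block; embeddings, biases, and layer norms are ignored.

def ffn_width_for_budget(total_params: int, n_layers: int, d_model: int) -> int:
    """Return the d_ff that spends roughly total_params across n_layers."""
    per_layer = total_params / n_layers
    attn_params = 4 * d_model ** 2
    ffn_params = per_layer - attn_params
    if ffn_params <= 0:
        raise ValueError("Budget too small for this many layers at this d_model.")
    return int(ffn_params / (2 * d_model))  # two d_model x d_ff weight matrices

# Hypothetical example: a 41M-parameter budget with d_model = 512.
# Deeper models end up with narrower feed-forward blocks.
for n_layers in (2, 6, 12, 24):
    print(n_layers, ffn_width_for_budget(41_000_000, n_layers, 512))
```

Under these simplified assumptions, doubling the number of layers roughly halves the per-layer feed-forward budget, which is why very deep, fixed-budget models eventually run out of width, consistent with the degradation noted in the findings above.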

Implications and Speculation

The implications of this paper matter for both the theoretical understanding of neural network expressivity and practical model design. The finding that deeper models generalize more compositionally suggests that transformer architectures should be given sufficient depth; because the returns diminish rapidly, however, a careful balance between depth and width is needed. The results also indicate that the benefits of depth are largely confined to lexical generalization, while structural generalization remains largely beyond the reach of depth adjustments alone.

The paper invites future exploration of strategies that could complement depth, such as attention-based mechanisms, data augmentation, or hybrid architectures, to address the weaknesses identified on tasks requiring complex structural generalization. As AI research advances toward increasingly complex and human-like processing tasks, understanding how architectural variations influence model behavior at this level of granularity becomes increasingly valuable.

In conclusion, while this paper reinforces the positive role of depth in transformer architectures, it also highlights the need for nuanced design choices. The trade-off between depth and width warrants careful consideration when optimizing for generalization, particularly as language models are extended to handle novel linguistic compositions efficiently.

Authors (6)
  1. Jackson Petty (16 papers)
  2. Sjoerd van Steenkiste (33 papers)
  3. Ishita Dasgupta (35 papers)
  4. Fei Sha (88 papers)
  5. Dan Garrette (21 papers)
  6. Tal Linzen (73 papers)
Citations (12)