
Why Larger Language Models Do In-context Learning Differently? (2405.19592v1)

Published 30 May 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Large language models (LLM) have emerged as a powerful tool for AI, with the key ability of in-context learning (ICL), where they can perform well on unseen tasks based on a brief series of task examples without necessitating any adjustments to the model parameters. One recent interesting mysterious observation is that models of different scales may have different ICL behaviors: larger models tend to be more sensitive to noise in the test context. This work studies this observation theoretically aiming to improve the understanding of LLM and ICL. We analyze two stylized settings: (1) linear regression with one-layer single-head linear transformers and (2) parity classification with two-layer multiple attention heads transformers (non-linear data and non-linear model). In both settings, we give closed-form optimal solutions and find that smaller models emphasize important hidden features while larger ones cover more hidden features; thus, smaller models are more robust to noise while larger ones are more easily distracted, leading to different ICL behaviors. This sheds light on where transformers pay attention to and how that affects ICL. Preliminary experimental results on large base and chat models provide positive support for our analysis.

Authors (4)
  1. Zhenmei Shi (60 papers)
  2. Junyi Wei (7 papers)
  3. Zhuoyan Xu (8 papers)
  4. Yingyu Liang (107 papers)
Citations (12)

Summary

Why Larger LLMs Do In-Context Learning Differently?

The paper "Why Larger LLMs Do In-Context Learning Differently?" provides a theoretical examination of the behavior discrepancies between LLMs of varying sizes in in-context learning (ICL). The phenomenon in question is that larger LLMs are more susceptible to noise in the test context as compared to their smaller counterparts. This disparity has intrigued researchers, prompting a need for theoretical clarity.

Key Insights and Theoretical Analysis

The authors analyze two theoretical settings to explore these behavioral differences:

  1. Linear Regression with Linear Transformers: Each task is a linear regression problem solved in context by a one-layer, single-head linear transformer, with noise added to the in-context labels. The closed-form optimal solutions show that smaller (lower-capacity) models concentrate on the important, high-variance feature directions, while larger models also cover low-variance directions where noise can reside. Smaller models are therefore less affected by label noise because they ignore directions that carry little signal (see the sketch after this list).
  2. Sparse Parity Classification with Multi-Head Transformers: This non-linear setting pairs sparse parity data with a two-layer transformer using multiple attention heads. The closed-form optimal solution again shows that smaller models learn only the important task-relevant features, whereas larger models additionally learn features that can carry noise. The extra features leave larger models more easily distracted by noisy inputs at evaluation time, hurting their ICL performance, while smaller models' predictions are less influenced by input noise.
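
The paper's results are closed-form solutions for linear attention; the following is only a minimal numerical sketch of the underlying intuition, not the authors' construction. It compares an in-context estimator restricted to the top-variance directions (standing in for a "small" model) against one that uses all directions (a "large" model) on noisy in-context linear regression. The dimensions, eigenvalues, and moment-based estimator below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx, n_tasks = 16, 32, 2000                  # feature dim, context length, tasks
eigvals = np.array([4.0] * 4 + [0.05] * (d - 4))  # a few important (high-variance) directions

def icl_predict(X, y, x_q, keep_dims):
    """Moment-based in-context estimate of w restricted to `keep_dims`,
    a stand-in for a capacity-limited attention model."""
    w_hat = np.zeros(d)
    w_hat[keep_dims] = (X[:, keep_dims] * y[:, None]).mean(0) / eigvals[keep_dims]
    return x_q @ w_hat

for noise_std in (0.0, 2.0):                      # label-noise level in the context
    err_small, err_large = [], []
    for _ in range(n_tasks):
        w = rng.normal(size=d)                    # task-specific weight vector
        X = rng.normal(size=(n_ctx, d)) * np.sqrt(eigvals)
        y = X @ w + noise_std * rng.normal(size=n_ctx)
        x_q = rng.normal(size=d) * np.sqrt(eigvals)
        err_small.append((icl_predict(X, y, x_q, np.arange(4)) - x_q @ w) ** 2)
        err_large.append((icl_predict(X, y, x_q, np.arange(d)) - x_q @ w) ** 2)
    print(f"noise={noise_std}: small-model MSE {np.mean(err_small):.2f}, "
          f"large-model MSE {np.mean(err_large):.2f}")
```

In this toy setup, adding label noise increases the "large" estimator's error several times faster than the "small" one's, because every extra direction it covers accumulates noise while contributing little signal.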

Experimental Validation

Empirical results support the theoretical analysis. Experiments on standard NLP classification tasks with the Llama family of models show that, when noise is introduced in a controlled way by flipping a fraction of in-context labels, larger models degrade in predictive accuracy more rapidly than smaller ones. This matches the theoretical picture: larger models, despite their greater capacity for capturing features, are more easily distracted by noise.
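
As a rough illustration of this protocol, here is a minimal sketch of a label-flipping ICL evaluation on a binary sentiment task. The demonstrations, prompt template, and `query_model` callable are placeholders, not the paper's datasets, prompts, or any specific model API.

```python
import random

# Hypothetical demonstration pool for a binary sentiment task (placeholder data,
# not the paper's datasets).
demos = [
    ("the film is a delight from start to finish", "positive"),
    ("a tedious, poorly paced mess", "negative"),
    ("an affecting and beautifully shot drama", "positive"),
    ("the plot collapses under its own contrivances", "negative"),
]

def build_prompt(query_text, flip_rate, rng):
    """Assemble an ICL prompt, flipping each demonstration label with
    probability `flip_rate` to inject controlled label noise."""
    lines = []
    for text, label in demos:
        if rng.random() < flip_rate:
            label = "negative" if label == "positive" else "positive"
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {query_text}\nSentiment:")
    return "\n\n".join(lines)

def evaluate(test_set, flip_rate, query_model, seed=0):
    """Accuracy of `query_model` (a placeholder for any LLM completion call
    that returns 'positive' or 'negative') at a given flip rate."""
    rng = random.Random(seed)
    hits = 0
    for text, gold in test_set:
        pred = query_model(build_prompt(text, flip_rate, rng))
        hits += int(pred.strip().lower() == gold)
    return hits / len(test_set)

# Sweeping flip_rate (e.g. 0.0 -> 0.5) for models of different sizes should,
# per the paper's observation, show larger models degrading faster.
```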

Furthermore, an ablation study provides additional insight: larger models spread attention over both relevant and irrelevant parts of the input sequence, whereas smaller models concentrate their attention on the relevant parts.
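
One way to quantify such attention allocation is sketched below, assuming per-layer attention maps are available (e.g. the `.attentions` field returned by a Hugging Face causal LM called with `output_attentions=True`). Which positions count as relevant versus irrelevant is left to the experimenter and is an assumption here, not the paper's exact measurement.

```python
import torch

def attention_mass(attentions, relevant_idx, irrelevant_idx, query_pos=-1):
    """Fraction of attention the final query position places on `relevant_idx`
    versus `irrelevant_idx` token positions, averaged over layers and heads.

    `attentions` is assumed to be a tuple of (batch, heads, seq, seq) tensors,
    one per layer.
    """
    attn = torch.stack(attentions)[:, :, :, query_pos, :]  # (layers, batch, heads, seq)
    mass = attn.mean(dim=(0, 1, 2))                        # average -> (seq,)
    return mass[relevant_idx].sum().item(), mass[irrelevant_idx].sum().item()
```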

Implications and Future Directions

The research holds implications for the development and deployment of LLMs. Understanding that larger models may overfit or be distracted by noise underlines the need for strategies to counteract these tendencies. These could include architectural innovations, training regimens that emphasize robustness, or data preprocessing that mitigates noise impact.

Theoretically, this work invites further exploration into the robustness of LLMs across different tasks with varied types and levels of noise. Practically, it suggests that in domains where input noise is substantial or control over input prompts is limited, smaller models may offer more robust performance.

In conclusion, this paper deepens our understanding of in-context learning behavior in LLMs and highlights important considerations for their future development and application. Refining model architectures to maintain high performance while being less noise-sensitive could be a crucial direction for researchers and practitioners alike.
