In-Context Language Learning: Architectures and Algorithms (2401.12973v2)

Published 23 Jan 2024 in cs.CL and cs.LG

Abstract: Large-scale neural language models exhibit a remarkable capacity for in-context learning (ICL): they can infer novel functions from datasets provided as input. Most of our current understanding of when and how ICL arises comes from LMs trained on extremely simple learning problems like linear regression and associative recall. There remains a significant gap between these model problems and the "real" ICL exhibited by LMs trained on large text corpora, which involves not just retrieval and function approximation but free-form generation of language and other structured outputs. In this paper, we study ICL through the lens of a new family of model problems we term in-context language learning (ICLL). In ICLL, LMs are presented with a set of strings from a formal language, and must generate additional strings from the same language. We focus on in-context learning of regular languages generated by random finite automata. We evaluate a diverse set of neural sequence models (including several RNNs, Transformers, and state-space model variants) on regular ICLL tasks, aiming to answer three questions: (1) Which model classes are empirically capable of ICLL? (2) What algorithmic solutions do successful models implement to perform ICLL? (3) What architectural changes can improve ICLL in less performant models? We first show that Transformers significantly outperform neural sequence models with recurrent or convolutional representations on ICLL tasks. Next, we provide evidence that their ability to do so relies on specialized "n-gram heads" (higher-order variants of induction heads) that compute input-conditional next-token distributions. Finally, we show that hard-wiring these heads into neural models improves performance not just on ICLL, but on natural language modeling -- improving the perplexity of 340M-parameter models by up to 1.14 points (6.7%) on the SlimPajama dataset.

Introduction

The advent of powerful neural language models has been accompanied by growing interest in in-context learning (ICL), in which models adapt to new functions or distributions based on examples provided in their input. However, understanding and improving ICL in large-scale language models remains a complex challenge. To address this, the authors home in on in-context language learning (ICLL) as a means of investigating models' capacity to reason compositionally about sequences within formal languages, a subset of the broader ICL phenomenon.

ICLL Model Problems

ICLL model problems serve as a structured framework for probing neural networks' abilities to classify and generate strings belonging to an unfamiliar formal language. The authors define ICLL as the task in which models are given strings sampled from a randomly generated language and must infer the underlying distribution. This approach advances the study of ICL by presenting linguistically structured yet compositionally complex problems, reflective of the tasks faced by large-scale language models.
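
As a concrete illustration, the sketch below (a simplified construction of our own, not the paper's data-generation code) builds a random finite automaton with per-state emission probabilities and assembles an ICLL-style prompt from strings it generates. The alphabet, string length, number of examples, and delimiter are illustrative assumptions.

import random

def random_automaton(n_states=4, alphabet="abcd", seed=0):
    """Build a random automaton: a deterministic transition table plus
    per-state probabilities over which symbol is emitted next (a simple
    stand-in for the randomly generated regular languages described above)."""
    rng = random.Random(seed)
    transitions = {
        (s, ch): rng.randrange(n_states)
        for s in range(n_states) for ch in alphabet
    }
    emit_probs = {s: [rng.random() for _ in alphabet] for s in range(n_states)}
    for s in emit_probs:
        z = sum(emit_probs[s])
        emit_probs[s] = [p / z for p in emit_probs[s]]
    return transitions, emit_probs

def sample_string(transitions, emit_probs, alphabet="abcd", length=8, seed=None):
    """Sample one string by walking the automaton from state 0."""
    rng = random.Random(seed)
    state, out = 0, []
    for _ in range(length):
        ch = rng.choices(alphabet, weights=emit_probs[state])[0]
        out.append(ch)
        state = transitions[(state, ch)]
    return "".join(out)

# An ICLL prompt: several strings from the same random language, delimited
# so the model must infer the shared distribution in context.
transitions, emit_probs = random_automaton()
prompt = " | ".join(sample_string(transitions, emit_probs, seed=i) for i in range(5))
print(prompt)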

Methodology

To gauge the proficiency of different neural architectures on ICLL tasks, the authors conducted systematic experiments evaluating sequence models ranging from traditional RNNs and Transformers to newer state-space variants. These models were challenged with tasks derived from regular languages represented by probabilistic finite automata. The study pursued three objectives: assessing which classes of models can perform ICLL effectively, uncovering the algorithmic solutions and circuits implemented by successful models, and exploring whether insights into ICLL could inform architectural improvements.
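
One natural way to score a model on such tasks is to compare its predicted next-symbol distribution against the generating automaton's true conditional distribution at each position. The sketch below uses total variation distance, one of several reasonable metrics; the transitions and emit_probs arguments follow the structure of the earlier snippet, and model_next_dist is a hypothetical interface standing in for a trained model.

def total_variation(p, q):
    """Total variation distance between two next-symbol distributions,
    given as dicts mapping symbols to probabilities."""
    symbols = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in symbols)

def score_on_string(model_next_dist, transitions, emit_probs, string, alphabet="abcd"):
    """Average distance between a model's predictions and the automaton's
    true conditional next-symbol distribution along one in-context string.
    `model_next_dist(prefix)` is an assumed interface returning a dict of
    next-symbol probabilities for the given prefix."""
    state, total = 0, 0.0
    for i, ch in enumerate(string):
        truth = dict(zip(alphabet, emit_probs[state]))
        total += total_variation(model_next_dist(string[:i]), truth)
        state = transitions[(state, ch)]
    return total / len(string)

# Example baseline: always predict the uniform distribution over the alphabet.
uniform = lambda prefix: {ch: 0.25 for ch in "abcd"}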

Results and Insights

The findings were multifaceted. Transformers demonstrated a clear advantage on ICLL tasks over their recurrent and convolutional counterparts. This advantage was ascribed in part to specialized "n-gram heads" that compute next-token distributions conditioned on the preceding few tokens, much as classical n-gram models do. Through analyses of attention patterns, representational probing, and behavioral evaluation, these n-gram heads were pinpointed as a cornerstone of effective ICLL in Transformers.
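
To make the idea concrete, the counting-based sketch below (our own illustration, not a component extracted from the models in the paper) computes, for each position, the empirical next-token distribution over earlier in-context occurrences of the current length-(n-1) suffix; an n-gram head can be understood as approximating this computation with attention over matching contexts.

import numpy as np

def ngram_head_predictions(tokens, vocab_size, n=2):
    """Counting analogue of an in-context n-gram head: for each position,
    estimate the next-token distribution by finding earlier occurrences of
    the current length-(n-1) suffix and counting what followed them."""
    T = len(tokens)
    preds = np.full((T, vocab_size), 1.0 / vocab_size)  # uniform fallback
    for t in range(n - 1, T):
        suffix = tuple(tokens[t - n + 2 : t + 1])  # last n-1 tokens
        counts = np.zeros(vocab_size)
        for s in range(n - 1, t):
            if tuple(tokens[s - n + 2 : s + 1]) == suffix:
                counts[tokens[s + 1]] += 1
        if counts.sum() > 0:
            preds[t] = counts / counts.sum()
    return preds

# Example: with n=2 this reduces to in-context bigram statistics.
print(ngram_head_predictions([0, 1, 0, 2, 0], vocab_size=3, n=2)[-1])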

Architectural Improvements

Drawing on these insights, the authors devised an architectural intervention in which hard-wired n-gram heads were inserted into both Transformer and non-Transformer architectures. This augmentation not only bolstered performance on synthetic ICLL tasks but also reduced perplexity in natural language modeling. The success of these n-gram head insertions supports the idea that language models may benefit from incorporating explicit mechanisms reminiscent of classical language modeling algorithms.
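
The sketch below gives a rough sense of what such a hard-wired head could look like as a PyTorch module; the class name, wiring, and the choice of a bigram-style (n = 2) matching rule are our own simplifications rather than the paper's exact construction. Each position attends uniformly over earlier positions carrying the same token and aggregates the value vectors of the tokens that followed them; in practice the output would be added to a block's residual stream alongside, or in place of, a learned attention head.

import torch
import torch.nn as nn

class HardwiredBigramHead(nn.Module):
    """Illustrative hard-wired bigram head (a simplified sketch, not the
    paper's exact layer). Attention weights are determined by exact token
    matches rather than learned query-key products."""

    def __init__(self, d_model: int):
        super().__init__()
        self.value = nn.Linear(d_model, d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, T) integer tokens; hidden: (B, T, d_model)
        B, T = token_ids.shape
        # match[b, t, s] = 1 if the token at position s equals the token at t
        match = (token_ids.unsqueeze(2) == token_ids.unsqueeze(1)).float()
        causal = torch.tril(torch.ones(T, T, device=hidden.device), diagonal=-1)
        weights = match * causal                      # only strictly earlier positions
        weights = weights / weights.sum(-1, keepdim=True).clamp(min=1.0)
        # Values are taken from the successor position (the token that followed).
        v = self.value(hidden)
        v_next = torch.zeros_like(v)
        v_next[:, :-1] = v[:, 1:]
        out = torch.einsum("bts,bsd->btd", weights, v_next)
        return self.out(out)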

Conclusion

The exploration of ICLL offers a clearer picture of how large-scale language models perform ICL. The identification and hard-wiring of n-gram heads crystallize the idea that important aspects of in-context language learning rest on mechanisms both new and old, challenging and advancing our understanding of neural sequence models' in-context learning capabilities.

Authors (4)
  1. Ekin Akyürek
  2. Bailin Wang
  3. Yoon Kim
  4. Jacob Andreas
Citations (28)