- The paper demonstrates that language models encode information using multi-dimensional features rather than solely linear representations.
- It employs sparse autoencoders to uncover circular patterns in cyclic tasks, providing evidence of multi-dimensional structure in models like GPT-2 and Mistral 7B.
- The findings point toward better AI interpretability, improved handling of cyclic or periodic data, and a clearer picture of how these models work internally.
Exploring the Multi-Dimensional Nature of LLM Features
Understanding the Basics
The traditional view of how LLMs (like GPT-2 or Mistral 7B) represent information is essentially linear: each concept or feature corresponds to a single direction, a one-dimensional line, in what's called "activation space." When these models generate text, they manipulate these one-dimensional features to perform tasks like next-word prediction or reasoning.
But what if this perspective is too limited? The paper "Not All Language Model Features Are Linear" examines whether some features in LLMs can't be adequately captured by simple linear representations and instead require multi-dimensional ones. Let's break it down.
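To make the linear picture concrete, here is a toy sketch (the vectors below are made up for illustration, not taken from any real model): under the linear view, a feature is just a direction in activation space, and "reading" it off is a dot product.

```python
import numpy as np

# Toy illustration of the linear view: a feature is a single direction
# in activation space, and measuring it is a dot product.
rng = np.random.default_rng(0)
d_model = 8                                  # hypothetical hidden size
feature_dir = rng.normal(size=d_model)
feature_dir /= np.linalg.norm(feature_dir)   # unit-length feature direction

activation = rng.normal(size=d_model)        # stand-in for a residual-stream vector

# "How active is this feature?" collapses to a single scalar:
feature_strength = activation @ feature_dir
print(f"feature strength: {feature_strength:.3f}")
```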
Key Contributions
- New Definitions and Hypotheses: The paper starts by defining what it means for a feature to be multi-dimensional and irreducible: some features can't be split into simpler independent parts, much like a swirl of mixed paint can't be separated back into its original colors. (A rough version of the formal definition follows this list.)
- Finding Multi-Dimensional Features: Using sparse autoencoders (neural networks trained to reconstruct activations from a small number of active learned components), the authors identify multi-dimensional features in LLMs like GPT-2 and Mistral 7B. They find that certain concepts, like the days of the week, are represented as circles in the model's high-dimensional space, an inherently multi-dimensional structure. (A minimal autoencoder sketch also appears after this list.)
- Tasks and Experiments: To test whether these circular representations are actually fundamental, they look at tasks involving modular arithmetic with days of the week and months of the year. The idea: if the model uses the circular features to solve such problems, those features are essential to its computation.
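Picking up the first bullet, here is the rough shape of the paper's irreducibility definition, paraphrased loosely (treat the exact conditions as an approximation of the formal statement). A multi-dimensional feature f is reducible if, after a suitable change of basis, it splits into lower-dimensional parts a and b that are either statistically independent, p(a, b) = p(a)p(b), or non-co-occurring (when one part varies, the other stays essentially constant). A feature is irreducibly multi-dimensional when no such split exists. A circle is the canonical example: its two coordinates are tied together by x² + y² = 1 and vary jointly, so neither condition holds.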
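And picking up the second bullet, here is a minimal sparse autoencoder sketch in PyTorch. The shape (a ReLU encoder plus an L1 sparsity penalty) follows the common recipe for SAEs on LLM activations; the hyperparameters and the random "activations" are placeholders, not the paper's settings.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder for dictionary learning on activations.
    Illustrative only; not the paper's exact architecture or settings."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps the code non-negative; the L1 term below pushes it sparse.
        code = torch.relu(self.encoder(x))
        recon = self.decoder(code)
        return recon, code

# One toy training step on fake activations. Real usage would feed
# residual-stream activations collected from GPT-2 or Mistral 7B.
d_model, d_hidden, l1_coeff = 64, 512, 1e-3
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

acts = torch.randn(256, d_model)   # stand-in for collected model activations
recon, code = sae(acts)
loss = ((recon - acts) ** 2).mean() + l1_coeff * code.abs().mean()
loss.backward()
opt.step()
```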
Digging Deeper with Sparse Autoencoders
Sparse autoencoders help to break down complex data into simpler components. The researchers use these autoencoders to automatically discover multi-dimensional features in GPT-2 and Mistral 7B. Interestingly, they find that:
- Days of the week and months of the year form circular patterns.
- These patterns are not just random but are used by the models to solve specific tasks that involve modular arithmetic.
In simpler terms, the models "think" about days and months in a cyclical manner, almost like how we naturally perceive the week's cyclic nature.
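A hedged sketch of how one might look for this circular structure: gather hidden states at the weekday token positions and project them to two dimensions with PCA. The random activations below are placeholders; with real activations from, say, Mistral 7B, the paper finds the seven days laid out around a circle.

```python
import numpy as np
from sklearn.decomposition import PCA

weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Placeholder activations: in a real experiment these would be hidden states
# taken from a model layer at each weekday token position.
rng = np.random.default_rng(0)
acts = rng.normal(size=(7, 64))

# Project to 2D and inspect the layout; with real activations the paper
# reports the seven days arranged around a circle.
coords = PCA(n_components=2).fit_transform(acts)
for day, (x, y) in zip(weekdays, coords):
    print(f"{day:>9}: ({x:+.2f}, {y:+.2f})")
```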
Real-World Applications and Results
The researchers design two tasks to probe these circular features (sketched in code after the list):
- Weekdays Task: Questions like "Two days from Monday is...?"
- Months Task: Queries such as "Four months from January is...?"
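Here is a minimal sketch of how such problems and their ground-truth answers can be generated; the exact prompt wording used in the paper may differ. The key point is that the correct answer is pure modular arithmetic, mod 7 for weekdays and mod 12 for months.

```python
weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def weekday_problem(start: int, offset: int):
    """Weekdays task: the answer wraps around mod 7."""
    prompt = f"{offset} days from {weekdays[start]} is"
    answer = weekdays[(start + offset) % 7]
    return prompt, answer

def month_problem(start: int, offset: int):
    """Months task: same idea, mod 12."""
    prompt = f"{offset} months from {months[start]} is"
    answer = months[(start + offset) % 12]
    return prompt, answer

print(weekday_problem(0, 2))   # ('2 days from Monday is', 'Wednesday')
print(month_problem(0, 4))     # ('4 months from January is', 'May')
```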
Models like Mistral 7B and Llama 3 8B perform impressively, accurately solving many instances of these tasks. Here’s a simplified summary of their accuracy:
- Weekdays Task: Llama 3 8B solved 29 out of 49 problems correctly.
- Months Task: Both Llama 3 8B and Mistral 7B solved over 120 out of 144 problems correctly.
These results are significant. They suggest that the circular representations aren't just an artifact of how the model stores data; rather, they form the core of how the model computes answers to certain types of problems.
Practical and Theoretical Implications
Practical: Understanding these multi-dimensional representations can help in:
- Designing better interpretability tools for AI models.
- Improving the efficiency and accuracy of models in tasks involving cyclic patterns, such as schedules or periodic events.
Theoretical: This work challenges the strong form of the linear representation hypothesis and suggests that we may need to rethink how we model the internal workings of AI systems. It suggests that higher-dimensional interactions should be considered fundamental building blocks of computation within these models.
Speculating on the Future
The paper opens the door for future research into other types of multi-dimensional features that might exist in LLMs. Think of potential areas like:
- Geographical data processing, where locations are inherently multi-dimensional (latitude and longitude).
- Complex event prediction, where overlapping cycles (like economic or social cycles) might be better encoded with multi-dimensional features.
Understanding these aspects more deeply could lead to AI systems that are not only more accurate but also more transparent and understandable.
So next time you marvel at how your AI assistant seamlessly manages your schedule, remember there's a complex, multi-dimensional dance of features happening under the hood!