Backward Lens: Projecting Language Model Gradients into the Vocabulary Space (2402.12865v1)

Published 20 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Understanding how Transformer-based Language Models (LMs) learn and recall information is a key goal of the deep learning community. Recent interpretability methods project weights and hidden states obtained from the forward pass to the models' vocabularies, helping to uncover how information flows within LMs. In this work, we extend this methodology to LMs' backward pass and gradients. We first prove that a gradient matrix can be cast as a low-rank linear combination of its forward and backward passes' inputs. We then develop methods to project these gradients into vocabulary items and explore the mechanics of how new information is stored in the LMs' neurons.
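To make the low-rank claim concrete, here is a minimal PyTorch sketch (not the authors' released code; the toy layer dimensions and the random unembedding matrix E are illustrative stand-ins). For a linear layer y = Wx with scalar loss L, the weight gradient dL/dW is the outer product of the backward-pass input (dL/dy) and the forward-pass input x, so it is rank 1 for a single token; reading the backward-pass input through an unembedding matrix then gives a "backward lens"-style projection into vocabulary logits.

```python
# Minimal sketch (assumptions: toy linear layer, random unembedding E;
# not the paper's implementation). For y = W x with scalar loss L, the
# weight gradient dL/dW equals the outer product of the backward-pass
# input dL/dy and the forward-pass input x, hence rank 1 per token.
import torch

torch.manual_seed(0)
d_in, d_out, vocab = 8, 8, 16

W = torch.randn(d_out, d_in, requires_grad=True)
x = torch.randn(d_in)                      # forward-pass input to the layer
y = W @ x                                  # forward pass
target = torch.randn(d_out)
loss = ((y - target) ** 2).sum()
loss.backward()

g = (2 * (y - target)).detach()            # backward-pass input: dL/dy
# The autograd-computed gradient matches the outer product g x^T ...
assert torch.allclose(W.grad, torch.outer(g, x), atol=1e-5)
# ... and is therefore rank 1.
print("rank(dL/dW) =", torch.linalg.matrix_rank(W.grad).item())   # -> 1

# "Backward lens"-style projection: read the backward-pass input through
# a (hypothetical, random) unembedding matrix E to get vocabulary logits.
E = torch.randn(vocab, d_out)              # stand-in for the model's unembedding
grad_logits = E @ g
print("vocabulary items most promoted by this gradient:",
      grad_logits.topk(3).indices.tolist())
```

For a batch or sequence of tokens, the same argument makes the gradient a sum of per-token rank-1 terms, which is the low-rank linear combination of forward- and backward-pass inputs that the abstract describes.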

Authors (4)
  1. Shahar Katz (5 papers)
  2. Yonatan Belinkov (111 papers)
  3. Mor Geva (58 papers)
  4. Lior Wolf (217 papers)
Citations (4)
