An Explanation of Intrinsic Self-Correction via Linear Representations and Latent Concepts (2505.11924v1)

Published 17 May 2025 in cs.CL, cs.AI, and cs.LG

Abstract: We provide an explanation for the performance gains of intrinsic self-correction, a process in which an LLM iteratively refines its outputs without external feedback. More precisely, we investigate how prompting induces interpretable changes in hidden states and thus affects the output distributions. We hypothesize that each prompt-induced shift lies in a linear span of some linear representation vectors, naturally separating tokens based on individual concept alignment. Building on this idea, we give a mathematical formulation of self-correction and derive a concentration result for output tokens based on alignment magnitudes. Our experiments on text detoxification with zephyr-7b-sft reveal a substantial gap between the inner products of the prompt-induced shifts with the unembeddings of the top-100 most toxic tokens and those with the unembeddings of the bottom-100 least toxic tokens, under toxic instructions. This suggests that self-correction prompts enhance an LLM's capability for latent concept recognition. Our analysis offers insights into the underlying mechanism of self-correction by characterizing, in an explainable way, how prompting works. For reproducibility, our code is available.
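
The measurement described in the abstract can be illustrated with a short sketch: compute the hidden-state shift induced by appending a self-correction instruction, then compare its inner products with the unembedding vectors of the most vs. least toxic tokens. The snippet below is not the paper's released code; the checkpoint identifier, the prompts, and the token-id lists are placeholders chosen for illustration only.

```python
# Illustrative sketch (not the authors' released code) of the alignment
# measurement described in the abstract: inner products between a
# prompt-induced hidden-state shift and token unembedding vectors.
# The checkpoint name, prompts, and token-id lists below are assumptions,
# not values taken from the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "HuggingFaceH4/zephyr-7b-sft-full"  # assumed identifier for zephyr-7b-sft
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

def final_hidden_state(prompt: str) -> torch.Tensor:
    """Last-layer hidden state of the final token of `prompt`."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]  # shape: (hidden_dim,)

base_prompt = "Write a reply to the previous message."          # placeholder for a toxic instruction
sc_prompt = base_prompt + " Please make your reply non-toxic."  # hypothetical self-correction prompt

# Prompt-induced shift in the hidden state.
shift = final_hidden_state(sc_prompt) - final_hidden_state(base_prompt)

# Unembedding matrix: one row per vocabulary token.
W_U = model.get_output_embeddings().weight  # shape: (vocab_size, hidden_dim)

# Placeholder id lists; the paper uses the top-/bottom-100 tokens ranked by
# toxicity, a ranking not reproduced here.
toxic_ids = torch.tensor([101, 102, 103])
nontoxic_ids = torch.tensor([201, 202, 203])

toxic_align = (W_U[toxic_ids].float() @ shift.float()).mean()
nontoxic_align = (W_U[nontoxic_ids].float() @ shift.float()).mean()
print(f"mean alignment with toxic tokens:     {toxic_align:.4f}")
print(f"mean alignment with non-toxic tokens: {nontoxic_align:.4f}")
```

Under the paper's hypothesis, a substantial gap between these two mean alignments would be consistent with the gap reported in the abstract for the actual top-/bottom-100 toxic token sets.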
