A Theoretical Understanding of Self-Correction through In-context Alignment (2405.18634v2)
Abstract: Going beyond mimicking limited human experiences, recent studies show initial evidence that, like humans, LLMs are capable of improving their abilities purely by self-correction, i.e., correcting previous responses through self-examination, in certain circumstances. Nevertheless, little is known about how such capabilities arise. In this work, based on a simplified setup akin to an alignment task, we theoretically analyze self-correction from an in-context learning perspective, showing that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way. Notably, going beyond previous theories on over-simplified linear transformers, our theoretical construction underpins the roles of several key designs of realistic transformers for self-correction: softmax attention, multi-head attention, and the MLP block. We validate these findings extensively on synthetic datasets. Inspired by these findings, we also illustrate novel applications of self-correction, such as defending against LLM jailbreaks, where a simple self-correction step does make a large difference. We believe that these findings will inspire further research on understanding, exploiting, and enhancing self-correction for building better foundation models.
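To make the self-correction loop analyzed in the abstract concrete (generate a response, self-examine it as a reward signal, then refine in-context), here is a minimal sketch. It is an illustration under assumptions, not the paper's implementation: the `llm` callable, the prompt templates, and the `rounds` parameter are hypothetical placeholders for any prompt-in, text-out model interface.

```python
from typing import Callable

def self_correct(llm: Callable[[str], str], question: str, rounds: int = 2) -> str:
    """Iteratively refine a response via the model's own self-examination.

    `llm` is assumed to be a generic prompt-in, text-out interface (hypothetical).
    Each round appends the previous answer and its self-assessed critique to the
    context, so later generations are conditioned on them purely in-context.
    """
    response = llm(f"Question: {question}\nAnswer:")
    context = f"Question: {question}\nAnswer: {response}"
    for _ in range(rounds):
        # Self-examination step: the model critiques and scores its own answer,
        # playing the role of the (possibly noisy) reward in the analysis.
        critique = llm(
            f"{context}\n\nCritique the answer above for correctness and safety, "
            "then rate it from 1 (poor) to 10 (good)."
        )
        # Refinement step: the model revises its answer conditioned on the critique,
        # i.e., the correction happens in-context rather than via weight updates.
        response = llm(
            f"{context}\nCritique: {critique}\n\nRewrite the answer, fixing the issues noted:"
        )
        context = f"Question: {question}\nAnswer: {response}"
    return response
```

In this reading, the abstract's claim is that such a loop helps only when the self-examination (the critique/score above) is a sufficiently accurate reward; the jailbreak-defense application corresponds to a single round of this check-and-revise step applied to a potentially unsafe response.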
Authors: Yifei Wang, Yuyang Wu, Zeming Wei, Stefanie Jegelka, Yisen Wang