From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning (2409.01658v2)
Abstract: LLMs tend to prioritize adherence to user prompts over providing veracious responses, leading to the sycophancy issue. When challenged by users, LLMs tend to admit mistakes and provide inaccurate responses even if they initially provided the correct answer. Recent works propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy issue, but this typically degrades LLMs' general capability. To address the challenge, we propose a novel supervised pinpoint tuning (SPT), where only the region-of-interest modules are tuned for a given objective. Specifically, SPT first reveals and verifies a small percentage (<5%) of the basic modules that significantly affect a particular behavior of LLMs, i.e., sycophancy. Subsequently, SPT merely fine-tunes these identified modules while freezing the rest. To verify the effectiveness of the proposed SPT, we conduct comprehensive experiments demonstrating that SPT significantly mitigates the sycophancy issue of LLMs (even better than SFT). Moreover, SPT introduces limited or even no side effects on the general capability of LLMs. Our results shed light on how to precisely, effectively, and efficiently explain and improve the targeted ability of LLMs.
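The core mechanic described in the abstract, identifying a small set of modules tied to sycophancy and fine-tuning only those while freezing everything else, can be illustrated with a minimal sketch. The snippet below assumes PyTorch and Hugging Face Transformers; the model name and the specific layer names in `pinpointed_modules` are hypothetical placeholders, since the paper selects the actual region-of-interest modules through its own analysis, which is not reproduced here.

```python
# A minimal sketch of the pinpoint-tuning idea: freeze all parameters,
# then unfreeze only a small set of identified modules before standard SFT.
# Assumptions: PyTorch + Hugging Face Transformers; module names are illustrative.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Hypothetical output of the module-identification step: a small fraction
# (<5%) of modules found to drive the targeted behavior (sycophancy).
pinpointed_modules = {
    "model.layers.12.self_attn",
    "model.layers.20.self_attn",
}

# Freeze everything, then mark only the pinpointed modules as trainable.
for name, param in model.named_parameters():
    param.requires_grad = any(name.startswith(m) for m in pinpointed_modules)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Tuning {trainable / total:.2%} of parameters")

# The unfrozen parameters can then be optimized with an ordinary SFT loop.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)
```

The design choice this illustrates is that the tuning objective itself is unchanged from SFT; only the set of trainable parameters is restricted, which is what limits side effects on the model's general capability.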
Authors: Wei Chen, Zhen Huang, Liang Xie, Binbin Lin, Houqiang Li, Le Lu, Xinmei Tian, Deng Cai, Yonggang Zhang, Xu Shen, Jieping Ye, Wenxiao Wang